# Lecture 04 – part I

## Pandas Basics

[Pandas](https://pandas.pydata.org/) is one of the most popular Python libraries for data manipulation and analysis. Pandas has two primary data structures: `Series` and `DataFrame`. Series are similar to Python lists or NumPy vectors: they are one-dimensional. They are flexible in that they can contain mixed types! A Series object also has an index, which is printed alongside the values when the object is converted to a string. Pandas Series are the main building blocks of Pandas DataFrames.

In [1]:
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings("ignore")

In [2]:
ps = pd.Series(["a", 2, np.pi, 36])
print(ps)

0           a
1           2
2    3.141593
3          36
dtype: object
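To make the point about mixed types concrete, here is a small sketch comparing a mixed Series with a purely numeric one. The variable names `mixed` and `numbers` are made up for this illustration.

```python
import pandas as pd

# A Series holding a string, an integer and a float falls back to the
# generic "object" dtype.
mixed = pd.Series(["a", 2, 3.14])

# A Series holding only integers gets a numeric dtype instead.
numbers = pd.Series([1, 2, 3])

print(mixed.dtype)
print(numbers.dtype)
```

Numeric dtypes are usually faster and support vectorised arithmetic, so mixed-type Series are best kept for cases where they are really needed.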


In [3]:
# values only
print(ps.values)

['a' 2 3.141592653589793 36]


In [4]:
# indices only
print(ps.index)

RangeIndex(start=0, stop=4, step=1)


We can use list-style slicing to access the data...

In [5]:
print(ps[1:3])

1           2
2    3.141593
dtype: object


There are two other syntax options for data access:
- `.loc[]` provides access using the _index values_ (labels)
- `.iloc[]` provides access using the _index positions_ (integers starting at 0)

In [6]:
ps = pd.Series(
    data=[
        "mozzarella caprese",
        "Wiener Schnitzel",
        "Schwartwalder Kirschtorte",
        "lemonade",
        "whiskey",
    ],
    index=["appetizer", "main course", "dessert", "beverage", "alcohol"],
)

In [7]:
ps

appetizer             mozzarella caprese
main course             Wiener Schnitzel
dessert        Schwartwalder Kirschtorte
beverage                        lemonade
alcohol                          whiskey
dtype: object

In [8]:
ps.loc[["appetizer", "dessert", "beverage"]]

appetizer           mozzarella caprese
dessert      Schwartwalder Kirschtorte
beverage                      lemonade
dtype: object

In [9]:
ps.iloc[1:3]

main course             Wiener Schnitzel
dessert        Schwartwalder Kirschtorte
dtype: object
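One difference between the two accessors is easy to miss: a positional slice with `.iloc[]` excludes the stop position (like a Python list), while a label slice with `.loc[]` includes the stop label. The sketch below uses a shortened, hypothetical menu Series to show this.

```python
import pandas as pd

# Hypothetical menu Series with string labels as the index
menu = pd.Series(
    data=["soup", "steak", "cake", "tea"],
    index=["appetizer", "main course", "dessert", "beverage"],
)

# Positional slice: the stop position 3 is excluded, so 2 items are returned
by_position = menu.iloc[1:3]

# Label slice: the stop label "beverage" is included, so 3 items are returned
by_label = menu.loc["main course":"beverage"]

print(list(by_position.index))
print(list(by_label.index))
```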

DataFrames are 'two-dimensional, size-mutable, potentially heterogeneous tabular data'. Each DataFrame is essentially a collection of Pandas Series that share the same index.
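A quick way to see the relationship between the two structures is to pull a single column or a single row out of a DataFrame: both come back as a Series. The toy data below is invented for this sketch.

```python
import pandas as pd

# Hypothetical toy data, not taken from the lecture
df = pd.DataFrame(
    {"population": [100, 200], "country": ["A", "B"]},
    index=["city_x", "city_y"],
)

column = df["population"]  # one column, selected by name
row = df.loc["city_x"]     # one row, selected by index label

print(type(column).__name__)
print(type(row).__name__)
print(column.dtype, row.dtype)
```

The column keeps its numeric dtype, whereas the row mixes a number and a string and therefore ends up with the `object` dtype.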

There are multiple ways to create Pandas DataFrames.

In [10]:
dc_city_pop = {
    'Tokyo': 37339804,
    'Delhi': 31181376,
    'Shanghai': 27795702,
    'Sao Paulo': 22237472,
    'Mexico City': 21918936,
    'Dhaka': 21741090,
    'Cairo': 21322750,
    'Beijing': 20896820,
    'Mumbai': 20667656,
    'Osaka': 19110616
}

Note: population numbers are from [2018](https://www.archdaily.com/906605/the-20-largest-cities-in-the-world-of-2018).

In [11]:
ps_city_pop = pd.Series(dc_city_pop)
ps_city_pop

Tokyo          37339804
Delhi          31181376
Shanghai       27795702
Sao Paulo      22237472
Mexico City    21918936
Dhaka          21741090
Cairo          21322750
Beijing        20896820
Mumbai         20667656
Osaka          19110616
dtype: int64

In [12]:
print(ps_city_pop.index)
print(ps_city_pop.values)

Index(['Tokyo', 'Delhi', 'Shanghai', 'Sao Paulo', 'Mexico City', 'Dhaka',
       'Cairo', 'Beijing', 'Mumbai', 'Osaka'],
      dtype='object')
[37339804 31181376 27795702 22237472 21918936 21741090 21322750 20896820
 20667656 19110616]


In [13]:
dc_city_countries = {
    "Tokyo": "Japan",
    "Delhi": "India",
    "Shanghai": "China",
    "Sao Paulo": "Brazil",
    "Mexico City": "Mexico",
    "Dhaka": "Bangladesh",
    "Cairo": "Egypt",
    "Beijing": "China",
    "Mumbai": "India",
    "Osaka": "Japan",
}

In [14]:
ps_city_countries = pd.Series(dc_city_countries)
ps_city_countries

Tokyo               Japan
Delhi               India
Shanghai            China
Sao Paulo          Brazil
Mexico City        Mexico
Dhaka          Bangladesh
Cairo               Egypt
Beijing             China
Mumbai              India
Osaka               Japan
dtype: object

In [15]:
print(ps_city_countries.index)
print(ps_city_countries.values)

Index(['Tokyo', 'Delhi', 'Shanghai', 'Sao Paulo', 'Mexico City', 'Dhaka',
       'Cairo', 'Beijing', 'Mumbai', 'Osaka'],
      dtype='object')
['Japan' 'India' 'China' 'Brazil' 'Mexico' 'Bangladesh' 'Egypt' 'China'
 'India' 'Japan']


In [16]:
df_cities = pd.concat([ps_city_pop, ps_city_countries], axis=1)
df_cities

| | 0 | 1 |
|---|---|---|
| Tokyo | 37339804 | Japan |
| Delhi | 31181376 | India |
| Shanghai | 27795702 | China |
| Sao Paulo | 22237472 | Brazil |
| Mexico City | 21918936 | Mexico |
| Dhaka | 21741090 | Bangladesh |
| Cairo | 21322750 | Egypt |
| Beijing | 20896820 | China |
| Mumbai | 20667656 | India |
| Osaka | 19110616 | Japan |


In [17]:
df_cities.columns = ['population', 'country']
df_cities

| | population | country |
|---|---|---|
| Tokyo | 37339804 | Japan |
| Delhi | 31181376 | India |
| Shanghai | 27795702 | China |
| Sao Paulo | 22237472 | Brazil |
| Mexico City | 21918936 | Mexico |
| Dhaka | 21741090 | Bangladesh |
| Cairo | 21322750 | Egypt |
| Beijing | 20896820 | China |
| Mumbai | 20667656 | India |
| Osaka | 19110616 | Japan |
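As a possible alternative to assigning `columns` afterwards, the column names can be passed to `pd.concat` through its `keys` argument. The sketch below uses shortened copies of the two Series; the names `pop`, `countries` and `df` are chosen only for this example.

```python
import pandas as pd

# Shortened, illustrative versions of the population and country Series
pop = pd.Series({"Tokyo": 37339804, "Delhi": 31181376})
countries = pd.Series({"Tokyo": "Japan", "Delhi": "India"})

# With axis=1, keys= supplies the column labels of the resulting DataFrame
df = pd.concat([pop, countries], axis=1, keys=["population", "country"])
print(df)
```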


Slicing based on index and/or column:
- using `.iloc[]` based on position

In [None]:
# some rows
df_cities.iloc[2:5]

In [None]:
# some rows and a single column (both selected by position)
df_cities.iloc[2:5, 1]

- using `.loc[]` based on the index value and/or column name

In [None]:
# list of cities (note the double square brackets)
df_cities.loc[["Shanghai", "Dhaka", "Osaka"]]

In [None]:
# list of cities + a column
df_cities.loc[["Shanghai", "Dhaka", "Osaka"], "country"]

In [None]:
# a range of cities from the index
df_cities.loc["Tokyo":"Sao Paulo"]

- slicing/filtering based on cell values

In [None]:
df_cities[df_cities.population > 30_000_000]  # underscores in numeric literals work as thousands separators for readability

In [None]:
df_cities[df_cities.country.isin(["Japan", "India", "Brazil"])]

In [None]:
df_cities[
    ~df_cities.country.isin(["Japan", "India", "Brazil"])
]  # tilde (~) for 'not in'
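Filters can also be combined. The following sketch, on a shortened illustrative copy of the cities table, joins two conditions with `&` and then expresses the same filter with `DataFrame.query`. The threshold and the variable names are assumptions made for this example.

```python
import pandas as pd

# Shortened, illustrative copy of the cities table
df = pd.DataFrame(
    {
        "population": [37339804, 31181376, 27795702, 22237472],
        "country": ["Japan", "India", "China", "Brazil"],
    },
    index=["Tokyo", "Delhi", "Shanghai", "Sao Paulo"],
)

# Each condition needs its own parentheses; & means "and", | means "or"
big_not_japan = df[(df.population > 25_000_000) & (df.country != "Japan")]

# The same filter written as a query string
big_not_japan_q = df.query("population > 25000000 and country != 'Japan'")

print(big_not_japan)
```

Python's `and` / `or` keywords do not work on boolean Series, which is why the element-wise operators `&` and `|` are used in the first form.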

Move the index into a regular column:

In [None]:
df_cities.reset_index(drop=False, inplace=True)
df_cities

You can rename certain columns using a dictionary with the _old name_ as key and the _new name_ as value.

In [None]:
df_cities.rename({"index": "city"}, axis="columns", inplace=True)
df_cities
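Both steps can also be written without `inplace=True`, so that the original DataFrame stays untouched and a new one is returned. This is only one possible variant; it names the index with `rename_axis` before moving it into a column. The data and the name `tidy` are illustrative.

```python
import pandas as pd

# Shortened, illustrative copy of the cities table with cities as the index
df = pd.DataFrame(
    {"population": [37339804, 31181376]},
    index=["Tokyo", "Delhi"],
)

# rename_axis names the index "city"; reset_index then turns it into a column
tidy = df.rename_axis("city").reset_index()

print(tidy.columns.tolist())
print(df.columns.tolist())  # the original DataFrame is unchanged
```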

We can also create a dataframe from a list of lists.

In [None]:
data = []
data.append(["Tokyo", 37339804, "Japan"])
data.append(["Delhi", 31181376, "India"])
data.append(["Shanghai", 27795702, "China"])
data.append(["Sao Paulo", 22237472, "Brazil"])
data.append(["Mexico City", 21918936, "Mexico"])
data.append(["Dhaka", 21741090, "Bangladesh"])
data.append(["Cairo", 21322750, "Egypt"])
data.append(["Beijing", 20896820, "China"])
data.append(["Mumbai", 20667656, "India"])
data.append(["Osaka", 19110616, "Japan"])

In [None]:
data

In [None]:
df_cities_ = pd.DataFrame(data=data, columns=["city", "population", "country"])
df_cities_
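Yet another option is a list of dictionaries, where each dictionary describes one row and the keys become the column names. A minimal sketch with two of the rows (the names `records` and `df_from_records` are made up for the example):

```python
import pandas as pd

# One dictionary per row; the keys are used as column names
records = [
    {"city": "Tokyo", "population": 37339804, "country": "Japan"},
    {"city": "Delhi", "population": 31181376, "country": "India"},
]

df_from_records = pd.DataFrame(records)
print(df_from_records)
```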

Reorder the columns:

In [None]:
df_cities_ = df_cities_[["city", "country", "population"]]
df_cities_

Metadata on the DataFrame columns

In [None]:
df_cities.info()

In [None]:
df_cities.shape  # a (rows, columns) tuple

In [None]:
df_cities.shape[0]  # row count
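A few related ways of inspecting size and types, sketched on a tiny illustrative DataFrame: the shape tuple can be unpacked directly, `len()` gives the row count, and `dtypes` lists the type of each column.

```python
import pandas as pd

# Tiny illustrative DataFrame
df = pd.DataFrame(
    {"city": ["Tokyo", "Delhi"], "population": [37339804, 31181376]}
)

n_rows, n_cols = df.shape  # unpack the (rows, columns) tuple
print(n_rows, n_cols)
print(len(df))             # number of rows, same as df.shape[0]
print(df.dtypes)           # one dtype per column
```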