# AICE1006 - Data Analytics

## Lecture 3 - Pandas Introduction


**Zhiwu Huang**  <br/>
Lecturer (Assistant Professor) <br/>
Vision, Learning and Control (VLC) Research Group <br/>
School of Electronics and Computer Science (ECS) <br/>
University of Southampton<br/>

*Office Hour: Wed 2PM-3PM, Please book in advance.* <br/>
``Zhiwu.Huang@soton.ac.uk``

<br/>
<br/>
<!-- <br/> -->

Credit: Marco Forgione, Researcher, USI-SUPSI


<!-- The workhorse of numerical mathematics and machine learning in Python -->

# Pandas

<!-- ## Marco Forgione -->




Working with **tabular** and structured data


# Pandas in a nutshell

Pandas combines the high-performance **array computations** of numpy with the flexible
data manipulation of **spreadsheets** (like Excel) and databases.

It provides:

* A 1D labeled array object: Series
* A 2D tabular object: **DataFrame**
* Functions and methods to operate on Series and DataFrames


## Pandas Series

A Series may be seen as a generalized 1D numpy array  **with an index**:

In [1]:
import pandas as pd # import pandas with the shorthand convention name pd.
import numpy as np

In [2]:
ser = pd.Series(data=np.random.randn(5), index = ['a', 'b', 'c', 'd', 'e'])
ser

a    1.328487
b   -0.013700
c   -0.057315
d    1.614995
e    2.092239
dtype: float64

In [3]:
ser.values # ser.values is a numpy array!

array([ 1.3284867 , -0.01370042, -0.05731541,  1.61499486,  2.09223925])

In [4]:
ser.index # ser.index is a pandas Index object

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [5]:
type(ser), type(ser.values), type(ser.index)

(pandas.core.series.Series, numpy.ndarray, pandas.core.indexes.base.Index)

Values are accessed with square-bracket notation, using the index labels as keys:

In [6]:
ser['a'] # dictionary-like access

1.3284867016058077
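Label-based access goes beyond a single key. The sketch below uses a hypothetical Series `ser_demo` with fixed values (instead of random ones, so that results are reproducible) to show access by a single label, by a list of labels, and by a label slice:

```python
import pandas as pd

# Hypothetical example Series with fixed values and string labels
ser_demo = pd.Series(data=[1.0, 2.0, 3.0, 4.0, 5.0],
                     index=['a', 'b', 'c', 'd', 'e'])

first = ser_demo['a']            # single label -> scalar value
subset = ser_demo[['a', 'c']]    # list of labels -> a smaller Series
label_slice = ser_demo['b':'d']  # label slice: BOTH endpoints are included
has_b = 'b' in ser_demo          # membership is tested on the index, like a dict

print(first)
print(subset)
print(label_slice)
print(has_b)
```

Note that, unlike slicing a list or a numpy array by position, a slice by label includes its end point.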

## Pandas Series

A Series also looks like a **specialized** dictionary. In fact, it can be constructed from a dictionary:

In [7]:
population_dict = {'Bellinzona': 17_744,
                   'Lugano': 62_615,
                   'Mendrisio': 11_554,
                   'Stabio': 4_510,
                   'Lausanne': 140_000, 
                   'Bern': 133_115}
population_dict # a dictionary

{'Bellinzona': 17744,
 'Lugano': 62615,
 'Mendrisio': 11554,
 'Stabio': 4510,
 'Lausanne': 140000,
 'Bern': 133115}

In [8]:
population_ser = pd.Series(population_dict)
population_ser # a pandas Series

Bellinzona     17744
Lugano         62615
Mendrisio      11554
Stabio          4510
Lausanne      140000
Bern          133115
dtype: int64

* A Series is like a dictionary with a fixed **type** for its values (int64 in the example above).
* This is a trade-off: less flexibility than a dictionary, but better performance.
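As a rough illustration of what the fixed value type buys us, the sketch below compares a Series with a plain dictionary holding the same data. The names `population_demo` and `population_dict_demo` are made up for this example:

```python
import pandas as pd

population_dict_demo = {'Lugano': 62_615, 'Mendrisio': 11_554, 'Stabio': 4_510}
population_demo = pd.Series(population_dict_demo)

# Series: one vectorized expression acts on all values at once
population_thousands = population_demo / 1000
total = population_demo.sum()

# Dictionary: the same operations need explicit Python-level iteration
thousands_dict = {city: pop / 1000 for city, pop in population_dict_demo.items()}
total_dict = sum(population_dict_demo.values())

print(population_thousands)
print(total, total_dict)
```

On the other hand, a dictionary can hold values of any type side by side, while a Series stores all its values with one common dtype.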

## Pandas DataFrame

It is like a 2D **numpy array** (matrix), with an index both for **columns** and for **rows**. <br/>
In fact, it can be constructed by specifying a 2D numpy array, column names, and row names:

In [9]:
population_mat = np.array([[54_615, 9_554, 3_510],
                           [57_615, 10_554, 4_510],
                           [60_615, 11_554, 4_510],
                           [62_615, 10_111, 5_510]])
df_population = pd.DataFrame(data=population_mat, 
                         columns=['Lugano', 'Mendrisio', 'Stabio'], # column names
                         index=[2000, 2005, 2010, 2015])  # row names
df_population

,Lugano,Mendrisio,Stabio
2000,54615,9554,3510
2005,57615,10554,4510
2010,60615,11554,4510
2015,62615,10111,5510


In [10]:
df_population.values # this is a 2D numpy array

array([[54615,  9554,  3510],
       [57615, 10554,  4510],
       [60615, 11554,  4510],
       [62615, 10111,  5510]])

In [11]:
df_population.columns # column names is an Index object

Index(['Lugano', 'Mendrisio', 'Stabio'], dtype='object')

In [12]:
df_population.index # row names is also an Index object. In this case, with integer (int64) labels

Index([2000, 2005, 2010, 2015], dtype='int64')

In [13]:
df_population['Lugano'] # typical use case: dict-like column access (returns a pd.Series)

2000    54615
2005    57615
2010    60615
2015    62615
Name: Lugano, dtype: int64

In [14]:
type(df_population['Lugano'])

pandas.core.series.Series

In [15]:
# df_population.loc[2000] # Access by row name is also possible (see later)

### How to construct a DataFrame?

Sometimes a DataFrame is built directly with the constructor ``pd.DataFrame``:

* From a 2D numpy array, a list of column names, and a list of row names:

In [16]:
x = np.random.randn(4, 3) # a random 4x3 matrix
columns = ['A', 'B', 'C'] # optional, but almost always given
row_names = ['a', 'b', 'c', 'd'] # optional, and often omitted
df_x = pd.DataFrame(data=x, columns=columns, index=row_names) # a dataframe with random values
df_x

,A,B,C
a,-0.801712,-0.529326,-0.174762
b,0.281065,-0.75039,-0.844179
c,0.589612,-0.181168,-0.360268
d,-0.591618,0.158722,1.207003


* From a dictionary. Each key of the dictionary becomes a column name, each value (1D numpy array or list) becomes a column:

In [17]:
time = np.arange(4) # array([0, 1, 2, 3])
time_str  = ["zero", "one", "two", "three"]
val  = time**2/2 # array([0. , 0.5, 2. , 4.5])
dict_data = {"time": time, "time_str": time_str, "val": val}
df_data = pd.DataFrame(dict_data)
df_data

,time,time_str,val
0,0,zero,0.0
1,1,one,0.5
2,2,two,2.0
3,3,three,4.5


NOTE: If we do not specify row names, pandas defines a *default* integer index 0, 1, 2, ... for the rows. In practice, it is rather common to have column names only.
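A minimal sketch of the default index, using a small made-up DataFrame `df_default`. It also shows how an existing column can be promoted to row names with `set_index` and moved back with `reset_index`:

```python
import pandas as pd

# No index argument: pandas creates the default integer index 0, 1, 2
df_default = pd.DataFrame({"time": [0, 1, 2], "val": [0.0, 0.5, 2.0]})
print(df_default.index)

# Use the "time" column as row names; set_index returns a new DataFrame
df_by_time = df_default.set_index("time")

# reset_index moves the row names back to a column and restores the default index
df_restored = df_by_time.reset_index()
print(df_restored)
```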


NOTE: The columns of ``df_data`` have different data types:

In [18]:
df_data.dtypes # in this case, the columns have different data types!

time          int64
time_str     object
val         float64
dtype: object

### How to construct a DataFrame?



More often, a DataFrame is read from an **external source**, e.g.:

* csv file -> pd.read_csv()
* excel file -> pd.read_excel()
* ...

In [19]:
df_cities = pd.read_csv('worldcities.csv') # from https://simplemaps.com/data/world-cities
df_cities.head(3) # returns the first 3 rows (cities) of the dataframe

,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
0,Tokyo,Tokyo,35.685,139.7514,Japan,JP,JPN,Tōkyō,primary,35676000.0,1392685764
1,New York,New York,40.6943,-73.9249,United States,US,USA,New York,,19354922.0,1840034016
2,Mexico City,Mexico City,19.4424,-99.131,Mexico,MX,MEX,Ciudad de México,primary,19028000.0,1484247881


In [20]:
df_cities.size # total number of elements (rows x columns): a large table!

170423

Reader methods have several optional arguments. For instance, ``pd.read_csv`` has the options:
 * sep : separator for the fields. Default=","
 * header: line number of the header (column names)
 * ...

It is hard to remember all the options by heart. Look them up in the documentation when you need them!

In [21]:
#pd.read_csv?
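To see the ``sep`` and ``header`` options at work without an external file, the sketch below reads a small made-up text from memory through `io.StringIO`. The content of `raw_text` is purely illustrative:

```python
import io
import pandas as pd

# Hypothetical file content: one free-text line, then a ';'-separated table
raw_text = """exported by some tool
city;country;population
Lugano;Switzerland;62615
Stabio;Switzerland;4510
"""

# sep=';'  -> fields are separated by semicolons instead of commas
# header=1 -> column names are on line 1 (counting from 0); line 0 is skipped
df_demo = pd.read_csv(io.StringIO(raw_text), sep=';', header=1)
print(df_demo)
```

``pd.read_csv`` accepts a file path or a file-like object, so the same call works on a real file by passing its name instead of the `StringIO` object.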

### Indexing and Selection

The DataFrame provides an intuitive *dictionary-like* access to its columns (with square brackets).


* If a *single column name* is given, the result is a Series:

In [22]:
df_cities.head(4) # returns the first 4 rows of the dataframe

,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
0,Tokyo,Tokyo,35.685,139.7514,Japan,JP,JPN,Tōkyō,primary,35676000.0,1392685764
1,New York,New York,40.6943,-73.9249,United States,US,USA,New York,,19354922.0,1840034016
2,Mexico City,Mexico City,19.4424,-99.131,Mexico,MX,MEX,Ciudad de México,primary,19028000.0,1484247881
3,Mumbai,Mumbai,19.017,72.857,India,IN,IND,Mahārāshtra,admin,18978000.0,1356226629


In [23]:
ser_city = df_cities['city'] # dict-like access, returns a pd.Series
ser_city.head(4)

0          Tokyo
1       New York
2    Mexico City
3         Mumbai
Name: city, dtype: object

* If a *list of column names* is given, the result is a DataFrame:

In [24]:
df_geo = df_cities[['city', 'lat', 'lng']] # returns a DataFrame
df_geo.head(3)

,city,lat,lng
0,Tokyo,35.685,139.7514
1,New York,40.6943,-73.9249
2,Mexico City,19.4424,-99.131


In [25]:
df_city = df_cities[['city']] # still returns a DataFrame. We passed a list of just one column name!
df_city.head(3)

,city
0,Tokyo
1,New York
2,Mexico City
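The difference between the two forms can be checked directly on the type and shape of the result. A small sketch on a made-up DataFrame `df_small`:

```python
import pandas as pd

df_small = pd.DataFrame({'city': ['Tokyo', 'Mumbai'],
                         'lat': [35.685, 19.017],
                         'lng': [139.7514, 72.857]})

as_series = df_small['city']    # single column name -> Series (1D)
as_frame = df_small[['city']]   # list with one column name -> DataFrame (2D)

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```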


### Loc indexer

The ``.loc`` indexer enables access to a DataFrame by row and column names.

In [26]:
df_cities = df_cities.set_index('city_ascii') # Set the city_ascii column as row index
# df_cities.set_index("city_ascii", inplace=True) # Alternative: modifies df_cities in place instead of returning a new DataFrame
df_cities.head(4)

city_ascii,city,lat,lng,country,iso2,iso3,admin_name,capital,population,id
Tokyo,Tokyo,35.685,139.7514,Japan,JP,JPN,Tōkyō,primary,35676000.0,1392685764
New York,New York,40.6943,-73.9249,United States,US,USA,New York,,19354922.0,1840034016
Mexico City,Mexico City,19.4424,-99.131,Mexico,MX,MEX,Ciudad de México,primary,19028000.0,1484247881
Mumbai,Mumbai,19.017,72.857,India,IN,IND,Mahārāshtra,admin,18978000.0,1356226629


In [27]:
# now we can access a row by name!
df_cities.loc['Lausanne'] # this is like df_cities.loc['Lausanne', :] 

city             Lausanne
lat               46.5304
lng                  6.65
country       Switzerland
iso2                   CH
iso3                  CHE
admin_name           Vaud
capital             admin
population       265702.0
id             1756055099
Name: Lausanne, dtype: object

In [28]:
# we can also access an entry by row and column names:
df_cities.loc['Lausanne', 'country'] 

'Switzerland'

In [29]:
df_cities.loc[:, 'country'] # this is like df_cities['country']!

city_ascii
Tokyo                  Japan
New York       United States
Mexico City           Mexico
Mumbai                 India
Sao Paulo             Brazil
                   ...      
Timmiarmiut        Greenland
Cheremoshna          Ukraine
Ambarchik             Russia
Nordvik               Russia
Ennadai               Canada
Name: country, Length: 15493, dtype: object
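``.loc`` also accepts lists of names and slices of names. The sketch below uses a small made-up DataFrame `df_loc_demo` with a few rows in the style of the cities table:

```python
import pandas as pd

df_loc_demo = pd.DataFrame(
    {'country': ['Japan', 'India', 'Brazil'],
     'lat': [35.685, 19.017, -23.5587],
     'lng': [139.7514, 72.857, -46.625]},
    index=['Tokyo', 'Mumbai', 'Sao Paulo'])

# Lists of row names and column names -> a smaller DataFrame
rows_subset = df_loc_demo.loc[['Tokyo', 'Sao Paulo'], ['lat', 'lng']]

# Slices of names: both the start and the end name are included
label_range = df_loc_demo.loc['Tokyo':'Mumbai', 'country':'lat']

print(rows_subset)
print(label_range)
```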

### Iloc indexer

The ``.iloc`` indexer refers to an **implicit index** 0,1,2,... for rows and columns, corresponding to the order in which the elements appear in the DataFrame:

In [30]:
df_cities.head(5) 

city_ascii,city,lat,lng,country,iso2,iso3,admin_name,capital,population,id
Tokyo,Tokyo,35.685,139.7514,Japan,JP,JPN,Tōkyō,primary,35676000.0,1392685764
New York,New York,40.6943,-73.9249,United States,US,USA,New York,,19354922.0,1840034016
Mexico City,Mexico City,19.4424,-99.131,Mexico,MX,MEX,Ciudad de México,primary,19028000.0,1484247881
Mumbai,Mumbai,19.017,72.857,India,IN,IND,Mahārāshtra,admin,18978000.0,1356226629
Sao Paulo,São Paulo,-23.5587,-46.625,Brazil,BR,BRA,São Paulo,admin,18845000.0,1076532519


In [31]:
df_cities.iloc[2, 2] # it works like the square bracket notation of a numpy array.

-99.131

In [32]:
# iloc can be used together with slicing
df_cities.iloc[0:5, :] # same as df_cities.head(5)

city_ascii,city,lat,lng,country,iso2,iso3,admin_name,capital,population,id
Tokyo,Tokyo,35.685,139.7514,Japan,JP,JPN,Tōkyō,primary,35676000.0,1392685764
New York,New York,40.6943,-73.9249,United States,US,USA,New York,,19354922.0,1840034016
Mexico City,Mexico City,19.4424,-99.131,Mexico,MX,MEX,Ciudad de México,primary,19028000.0,1484247881
Mumbai,Mumbai,19.017,72.857,India,IN,IND,Mahārāshtra,admin,18978000.0,1356226629
Sao Paulo,São Paulo,-23.5587,-46.625,Brazil,BR,BRA,São Paulo,admin,18845000.0,1076532519
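The distinction between ``.loc`` and ``.iloc`` matters most when the row names are themselves integers. A sketch on a made-up DataFrame `df_years` indexed by year:

```python
import pandas as pd

df_years = pd.DataFrame({'Lugano': [54_615, 57_615, 60_615, 62_615]},
                        index=[2000, 2005, 2010, 2015])

by_label = df_years.loc[2005, 'Lugano']   # row NAMED 2005
by_position = df_years.iloc[1, 0]         # row at POSITION 1, column at position 0

pos_slice = df_years.iloc[0:2]            # positions 0 and 1 (end excluded)
label_slice = df_years.loc[2000:2010]     # names 2000, 2005, 2010 (end included)

print(by_label, by_position)
print(len(pos_slice), len(label_slice))
```

Here `df_years.iloc[2005]` would raise an `IndexError`, since there is no row at position 2005.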


### Boolean indexing

Boolean indexing refers to selecting **rows** according to logical conditions on column values:

In [33]:
# simple condition
df_cities_CH = df_cities[df_cities['country'] == 'Switzerland'] 
df_cities_CH.head(4)

city_ascii,city,lat,lng,country,iso2,iso3,admin_name,capital,population,id
Geneva,Geneva,46.21,6.14,Switzerland,CH,CHE,Genève,admin,1240000.0,1756810813
Zurich,Zürich,47.38,8.55,Switzerland,CH,CHE,Zürich,admin,1108000.0,1756539143
Bern,Bern,46.9167,7.467,Switzerland,CH,CHE,Bern,primary,275329.0,1756374318
Basel,Basel,47.5804,7.59,Switzerland,CH,CHE,Basel-Stadt,admin,830000.0,1756731313


The logical operations *and*, *or*, and *not* are implemented with the operators ``&``, ``|``, and ``~``, respectively. Each condition must be enclosed in parentheses:

In [34]:
# two conditions in 'and'
df_big_cities_CH = df_cities[(df_cities['country'] == 'Switzerland') & (df_cities['population'] >= 1000000)] # looks for cities in Switzerland with population >= 1000000
df_big_cities_CH

city_ascii,city,lat,lng,country,iso2,iso3,admin_name,capital,population,id
Geneva,Geneva,46.21,6.14,Switzerland,CH,CHE,Genève,admin,1240000.0,1756810813
Zurich,Zürich,47.38,8.55,Switzerland,CH,CHE,Zürich,admin,1108000.0,1756539143


In [35]:
# two conditions in 'or'
df_cities_CH_GE = df_cities[(df_cities['country'] == 'Switzerland') | (df_cities['country'] == 'Germany')] # looks for cities in Switzerland or in Germany
df_cities_CH_GE.sample(4)

city_ascii,city,lat,lng,country,iso2,iso3,admin_name,capital,population,id
Schwyz,Schwyz,47.02,8.648,Switzerland,CH,CHE,Schwyz,admin,14177.0,1756270644
Solothurn,Solothurn,47.212,7.537,Switzerland,CH,CHE,Solothurn,,14853.0,1756021237
Stralsund,Stralsund,54.3004,13.1,Germany,DE,DEU,Mecklenburg-Western Pomerania,minor,61368.0,1276640152
Jena,Jena,50.9304,11.58,Germany,DE,DEU,Thuringia,minor,104712.0,1276659978


The ``.isin`` method comes in handy for a common task: checking whether a column value equals any of several possible values.

In [36]:
df_cities_CH_GE = df_cities[df_cities['country'].isin(["Switzerland", "Germany"])] # looks for cities in Switzerland or in Germany
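The *not* operator ``~`` inverts a boolean mask. The sketch below applies it on a small made-up DataFrame `df_bool_demo`, both to an ``.isin`` mask and inside a combined condition:

```python
import pandas as pd

df_bool_demo = pd.DataFrame(
    {'country': ['Switzerland', 'Germany', 'Japan', 'Switzerland'],
     'population': [1_240_000, 104_712, 35_676_000, 14_177]},
    index=['Geneva', 'Jena', 'Tokyo', 'Schwyz'])

# A boolean mask is itself a Series of True/False values, one per row
in_ch_or_de = df_bool_demo['country'].isin(['Switzerland', 'Germany'])

# ~ inverts the mask: cities that are NOT in Switzerland or Germany
df_elsewhere = df_bool_demo[~in_ch_or_de]

# Conditions combined with & and ~: Swiss cities that are NOT above one million
df_small_ch = df_bool_demo[(df_bool_demo['country'] == 'Switzerland')
                           & ~(df_bool_demo['population'] >= 1_000_000)]

print(df_elsewhere)
print(df_small_ch)
```

The parentheses around each comparison are needed because ``&``, ``|``, and ``~`` bind more tightly than ``==`` and ``>=`` in Python.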

### Hierarchical Index

Pandas supports a hierarchical index (or multi-index) both for rows and for columns.

In [37]:
# some fake data for illustration
data = np.round(np.random.rand(4, 6), 1)
data[:, 1::2] *= 4
data[:, ::2] *= 20
data[:, ::2] += 50
data[:, 1::2] += 34

# create the DataFrame
df_bio = pd.DataFrame(data, columns=[["Bob", "Bob", "Guido", "Guido", "Sue", "Sue"], ["HR", "BT", "HR", "BT", "HR", "BT"]],
                          index=[[2019, 2019, 2020, 2020], [1, 2, 1, 2]])
df_bio

,,Bob,Bob,Guido,Guido,Sue,Sue
,,HR,BT,HR,BT,HR,BT
2019,1,56.0,35.6,60.0,38.0,66.0,35.2
2019,2,68.0,37.6,62.0,36.0,54.0,37.6
2020,1,58.0,35.6,68.0,36.0,56.0,37.2
2020,2,50.0,36.8,64.0,36.0,64.0,36.4


The DataFrame above represents measurements:
* for three subjects: Bob, Guido, Sue
* of two types: Heart Rate (HR) and Body Temperature (BT)
* over two years: 2019, 2020
* with two visits per year: 1, 2



We are representing data in 4 dimensions!

### Hierarchical Index

Tuples are used to access hierarchically indexed columns and rows:

In [38]:
df_bio

,,Bob,Bob,Guido,Guido,Sue,Sue
,,HR,BT,HR,BT,HR,BT
2019,1,56.0,35.6,60.0,38.0,66.0,35.2
2019,2,68.0,37.6,62.0,36.0,54.0,37.6
2020,1,58.0,35.6,68.0,36.0,56.0,37.2
2020,2,50.0,36.8,64.0,36.0,64.0,36.4


In [39]:
df_bio["Sue"] # all data for Sue

,,HR,BT
2019,1,66.0,35.2
2019,2,54.0,37.6
2020,1,56.0,37.2
2020,2,64.0,36.4


In [40]:
df_bio[("Sue", "HR")] # HR for Sue

2019  1    66.0
      2    54.0
2020  1    56.0
      2    64.0
Name: (Sue, HR), dtype: float64

In [41]:
df_bio.loc[2019] # All data in 2019

,Bob,Bob,Guido,Guido,Sue,Sue
,HR,BT,HR,BT,HR,BT
1,56.0,35.6,60.0,38.0,66.0,35.2
2,68.0,37.6,62.0,36.0,54.0,37.6


In [42]:
df_bio.loc[(2019, 1)] # All data for 2019, visit 1

Bob    HR    56.0
       BT    35.6
Guido  HR    60.0
       BT    38.0
Sue    HR    66.0
       BT    35.2
Name: (2019, 1), dtype: float64

In [43]:
df_bio.loc[(2019, 1), "Sue"]  # All data for 2019, visit 1, subject Sue

HR    66.0
BT    35.2
Name: (2019, 1), dtype: float64
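Tuples select from the outer level inwards. To select on an *inner* level (for instance, the heart rate of every subject), one option is the ``.xs`` (cross-section) method. The sketch below rebuilds a two-subject version of the table with `pd.MultiIndex.from_product`; the level names `subject`, `type`, `year`, and `visit` are assumptions introduced here for illustration:

```python
import numpy as np
import pandas as pd

# Hierarchical column and row indexes with named levels (names are illustrative)
columns = pd.MultiIndex.from_product([["Bob", "Sue"], ["HR", "BT"]],
                                     names=["subject", "type"])
index = pd.MultiIndex.from_product([[2019, 2020], [1, 2]],
                                   names=["year", "visit"])
values = np.array([[56.0, 35.6, 66.0, 35.2],
                   [68.0, 37.6, 54.0, 37.6],
                   [58.0, 35.6, 56.0, 37.2],
                   [50.0, 36.8, 64.0, 36.4]])
df_bio_demo = pd.DataFrame(values, index=index, columns=columns)

# Cross-section on the inner column level: heart rate of every subject
df_hr = df_bio_demo.xs("HR", axis=1, level="type")

# Cross-section on the inner row level: visit 1 of every year
df_first_visit = df_bio_demo.xs(1, axis=0, level="visit")

# A single entry: row tuple and column tuple together
single = df_bio_demo.loc[(2020, 2), ("Sue", "BT")]

print(df_hr)
print(df_first_visit)
print(single)
```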