# Session 4. Data Assembly

## By now, you should be able to load data into pandas and produce some basic visualizations. In this session we focus on assembling data from several tables, a common data cleaning and curation task.

## This session will cover:

1. Concatenating data
2. Merging data sets

# 1. Let's load some libraries

In [1]:
import pandas as pd

# 2. Let's load some datasets

In [2]:
## The air_quality_no2_long.csv data set provides NO2 values for the measurement stations FR04014, BETR801 and London Westminster in Paris, Antwerp and London, respectively.
## parse_dates=True only tries to parse the index, so we name the date column explicitly.
air_quality_no2 = pd.read_csv('https://raw.githubusercontent.com/thousandoaks/BEMM458/master/data/air_quality_no2_long.csv', parse_dates=['date.utc'])

In [3]:
air_quality_no2.head()

Unnamed: 0,city,country,date.utc,location,parameter,value,unit
0,Paris,FR,2019-06-21 00:00:00+00:00,FR04014,no2,20.0,µg/m³
1,Paris,FR,2019-06-20 23:00:00+00:00,FR04014,no2,21.8,µg/m³
2,Paris,FR,2019-06-20 22:00:00+00:00,FR04014,no2,26.5,µg/m³
3,Paris,FR,2019-06-20 21:00:00+00:00,FR04014,no2,24.9,µg/m³
4,Paris,FR,2019-06-20 20:00:00+00:00,FR04014,no2,21.4,µg/m³
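
It is worth checking how the `date.utc` column was loaded, because `parse_dates=True` only asks pandas to try parsing the index, while a list of column names converts those columns. The sketch below illustrates the difference on a two-row inline sample copied from the output above (it does not download the course file, and it assumes the CSV has no separate index column); the names `sample`, `as_text` and `as_dates` are made up for the illustration.

```python
import io

import pandas as pd

# Two rows copied from the output above, used as a stand-in for the real file.
sample = (
    "city,country,date.utc,location,parameter,value,unit\n"
    "Paris,FR,2019-06-21 00:00:00+00:00,FR04014,no2,20.0,µg/m³\n"
    "Paris,FR,2019-06-20 23:00:00+00:00,FR04014,no2,21.8,µg/m³\n"
)

# parse_dates=True targets the index, so the date column stays as text.
as_text = pd.read_csv(io.StringIO(sample), parse_dates=True)

# Naming the column converts it to timezone-aware datetimes.
as_dates = pd.read_csv(io.StringIO(sample), parse_dates=["date.utc"])

print(as_text["date.utc"].dtype)
print(as_dates["date.utc"].dtype)

# Datetime columns expose the .dt accessor, e.g. to extract the hour.
print(as_dates["date.utc"].dt.hour.tolist())
```

Having real datetimes matters later, for instance when plotting measurements over time.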


In [4]:
## The air_quality_pm25_long.csv data set provides PM2.5 values for the measurement stations FR04014, BETR801 and London Westminster in Paris, Antwerp and London, respectively.

## parse_dates=True only tries to parse the index, so we name the date column explicitly.
air_quality_pm25 = pd.read_csv('https://raw.githubusercontent.com/thousandoaks/BEMM458/master/data/air_quality_pm25_long.csv', parse_dates=['date.utc'])

In [5]:
air_quality_pm25.head()

Unnamed: 0,city,country,date.utc,location,parameter,value,unit
0,Antwerpen,BE,2019-06-18 06:00:00+00:00,BETR801,pm25,18.0,µg/m³
1,Antwerpen,BE,2019-06-17 08:00:00+00:00,BETR801,pm25,6.5,µg/m³
2,Antwerpen,BE,2019-06-17 07:00:00+00:00,BETR801,pm25,18.5,µg/m³
3,Antwerpen,BE,2019-06-17 06:00:00+00:00,BETR801,pm25,16.0,µg/m³
4,Antwerpen,BE,2019-06-17 05:00:00+00:00,BETR801,pm25,7.5,µg/m³


In [6]:
## The air quality measurement station coordinates are stored in a data file air_quality_stations.csv

stations_coord = pd.read_csv("https://raw.githubusercontent.com/thousandoaks/BEMM458/master/data/air_quality_stations.csv")

In [7]:
stations_coord.head()

Unnamed: 0,location,coordinates.latitude,coordinates.longitude
0,BELAL01,51.23619,4.38522
1,BELHB23,51.1703,4.341
2,BELLD01,51.10998,5.00486
3,BELLD02,51.12038,5.02155
4,BELR833,51.32766,4.36226


In [8]:
## The air quality parameters metadata are stored in a data file air_quality_parameters.csv

air_quality_parameters = pd.read_csv("https://raw.githubusercontent.com/thousandoaks/BEMM458/master/data/air_quality_parameters.csv")

In [9]:
air_quality_parameters

Unnamed: 0,id,description,name
0,bc,Black Carbon,BC
1,co,Carbon Monoxide,CO
2,no2,Nitrogen Dioxide,NO2
3,o3,Ozone,O3
4,pm10,Particulate matter less than 10 micrometers in...,PM10
5,pm25,Particulate matter less than 2.5 micrometers i...,PM2.5
6,so2,Sulfur Dioxide,SO2


# 3. Concatenation

### The following figure illustrates the operation we want to perform:
<img src="https://pandas.pydata.org/pandas-docs/stable/_images/08_concat_row1.svg">



## We want to combine the NO2 and PM2.5 measurements, two tables with a similar structure, into a single table

In [10]:
## axis=0 (the default) concatenates by rows, stacking the tables on top of each other.
air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0)

In [11]:
air_quality

Unnamed: 0,city,country,date.utc,location,parameter,value,unit
0,Antwerpen,BE,2019-06-18 06:00:00+00:00,BETR801,pm25,18.0,µg/m³
1,Antwerpen,BE,2019-06-17 08:00:00+00:00,BETR801,pm25,6.5,µg/m³
2,Antwerpen,BE,2019-06-17 07:00:00+00:00,BETR801,pm25,18.5,µg/m³
3,Antwerpen,BE,2019-06-17 06:00:00+00:00,BETR801,pm25,16.0,µg/m³
4,Antwerpen,BE,2019-06-17 05:00:00+00:00,BETR801,pm25,7.5,µg/m³
...,...,...,...,...,...,...,...
2063,London,GB,2019-05-07 06:00:00+00:00,London Westminster,no2,26.0,µg/m³
2064,London,GB,2019-05-07 04:00:00+00:00,London Westminster,no2,16.0,µg/m³
2065,London,GB,2019-05-07 03:00:00+00:00,London Westminster,no2,19.0,µg/m³
2066,London,GB,2019-05-07 02:00:00+00:00,London Westminster,no2,19.0,µg/m³
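
Note that the row labels in the output above restart from the second table: `pd.concat` keeps each table's original index unless told otherwise. The following sketch, built on two tiny made-up tables (`pm25` and `no2` are illustrative stand-ins, not the course data), shows two common ways to handle this: `ignore_index=True` to renumber the rows, and `keys=` to record which table each row came from.

```python
import pandas as pd

# Tiny stand-ins for the two measurement tables (values are illustrative).
pm25 = pd.DataFrame(
    {"city": ["Antwerpen", "London"], "parameter": ["pm25", "pm25"], "value": [18.0, 7.0]}
)
no2 = pd.DataFrame(
    {"city": ["Paris", "London"], "parameter": ["no2", "no2"], "value": [20.0, 26.0]}
)

# Default behaviour: the original row labels are kept, so they repeat.
stacked = pd.concat([pm25, no2], axis=0)
print(stacked.index.tolist())

# ignore_index=True discards the old labels and renumbers from 0.
renumbered = pd.concat([pm25, no2], axis=0, ignore_index=True)
print(renumbered.index.tolist())

# keys= adds an outer index level naming the source table of each row.
labelled = pd.concat([pm25, no2], axis=0, keys=["PM25", "NO2"])
print(labelled.loc["NO2"])
```

Which option is preferable depends on whether the original row labels carry any meaning; for a default 0, 1, 2, ... index they usually do not.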


# 4. Join

### The following figure illustrates the operation we want to perform:
<img src="https://pandas.pydata.org/pandas-docs/stable/_images/08_merge_left.svg">



## We want to add the station coordinates, provided by the stations metadata table, to the corresponding rows in the measurements table.

In [12]:
air_quality_geolocated = pd.merge(air_quality, stations_coord, how='left', on='location')

In [13]:
air_quality_geolocated

Unnamed: 0,city,country,date.utc,location,parameter,value,unit,coordinates.latitude,coordinates.longitude
0,Antwerpen,BE,2019-06-18 06:00:00+00:00,BETR801,pm25,18.0,µg/m³,51.20966,4.43182
1,Antwerpen,BE,2019-06-17 08:00:00+00:00,BETR801,pm25,6.5,µg/m³,51.20966,4.43182
2,Antwerpen,BE,2019-06-17 07:00:00+00:00,BETR801,pm25,18.5,µg/m³,51.20966,4.43182
3,Antwerpen,BE,2019-06-17 06:00:00+00:00,BETR801,pm25,16.0,µg/m³,51.20966,4.43182
4,Antwerpen,BE,2019-06-17 05:00:00+00:00,BETR801,pm25,7.5,µg/m³,51.20966,4.43182
...,...,...,...,...,...,...,...,...,...
4177,London,GB,2019-05-07 06:00:00+00:00,London Westminster,no2,26.0,µg/m³,51.49467,-0.13193
4178,London,GB,2019-05-07 04:00:00+00:00,London Westminster,no2,16.0,µg/m³,51.49467,-0.13193
4179,London,GB,2019-05-07 03:00:00+00:00,London Westminster,no2,19.0,µg/m³,51.49467,-0.13193
4180,London,GB,2019-05-07 02:00:00+00:00,London Westminster,no2,19.0,µg/m³,51.49467,-0.13193
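
A left merge keeps every row of the left table and fills in `NaN` where the right table has no match, so it is useful to check that nothing went missing. The sketch below uses small made-up tables: the two station coordinates are taken from the output above, while the station `XX00000` and the measurement values are hypothetical. It relies on two optional `pd.merge` arguments: `validate`, which raises an error if the key is duplicated in the stations table, and `indicator`, which adds a `_merge` column describing where each row matched.

```python
import pandas as pd

# Measurements, including one hypothetical station with no known coordinates.
measurements = pd.DataFrame(
    {
        "location": ["BETR801", "London Westminster", "XX00000", "BETR801"],
        "value": [18.0, 26.0, 5.0, 6.5],
    }
)

# Station metadata: one row per station.
stations = pd.DataFrame(
    {
        "location": ["BETR801", "London Westminster"],
        "coordinates.latitude": [51.20966, 51.49467],
        "coordinates.longitude": [4.43182, -0.13193],
    }
)

merged = pd.merge(
    measurements,
    stations,
    how="left",
    on="location",
    validate="many_to_one",  # many measurements per station, one row per station
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
)

# A left merge never drops rows from the left table.
print(len(measurements), len(merged))

# Rows whose station was not found in the metadata table.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched["location"].tolist())
```

If the stations table contained the same `location` twice, `validate="many_to_one"` would raise a `MergeError` instead of silently duplicating measurement rows.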


## We want to add each parameter's full description and name, provided by the parameters metadata table, to the measurements table

In [14]:
air_quality_complete = pd.merge(air_quality_geolocated, air_quality_parameters, how='left', left_on='parameter', right_on='id')

In [15]:
air_quality_complete

Unnamed: 0,city,country,date.utc,location,parameter,value,unit,coordinates.latitude,coordinates.longitude,id,description,name
0,Antwerpen,BE,2019-06-18 06:00:00+00:00,BETR801,pm25,18.0,µg/m³,51.20966,4.43182,pm25,Particulate matter less than 2.5 micrometers i...,PM2.5
1,Antwerpen,BE,2019-06-17 08:00:00+00:00,BETR801,pm25,6.5,µg/m³,51.20966,4.43182,pm25,Particulate matter less than 2.5 micrometers i...,PM2.5
2,Antwerpen,BE,2019-06-17 07:00:00+00:00,BETR801,pm25,18.5,µg/m³,51.20966,4.43182,pm25,Particulate matter less than 2.5 micrometers i...,PM2.5
3,Antwerpen,BE,2019-06-17 06:00:00+00:00,BETR801,pm25,16.0,µg/m³,51.20966,4.43182,pm25,Particulate matter less than 2.5 micrometers i...,PM2.5
4,Antwerpen,BE,2019-06-17 05:00:00+00:00,BETR801,pm25,7.5,µg/m³,51.20966,4.43182,pm25,Particulate matter less than 2.5 micrometers i...,PM2.5
...,...,...,...,...,...,...,...,...,...,...,...,...
4177,London,GB,2019-05-07 06:00:00+00:00,London Westminster,no2,26.0,µg/m³,51.49467,-0.13193,no2,Nitrogen Dioxide,NO2
4178,London,GB,2019-05-07 04:00:00+00:00,London Westminster,no2,16.0,µg/m³,51.49467,-0.13193,no2,Nitrogen Dioxide,NO2
4179,London,GB,2019-05-07 03:00:00+00:00,London Westminster,no2,19.0,µg/m³,51.49467,-0.13193,no2,Nitrogen Dioxide,NO2
4180,London,GB,2019-05-07 02:00:00+00:00,London Westminster,no2,19.0,µg/m³,51.49467,-0.13193,no2,Nitrogen Dioxide,NO2
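
When the key columns have different names, as with `parameter` and `id` here, the merged table keeps both, and they hold the same values on every matched row. The following sketch shows this on a small made-up example (the measurement values are illustrative; the `no2` and `o3` metadata rows are copied from the parameters table above) and drops the redundant `id` column afterwards. Dropping it is one possible tidy-up, not something the session requires.

```python
import pandas as pd

# Illustrative measurements keyed by a "parameter" column.
measurements = pd.DataFrame(
    {"parameter": ["no2", "o3", "no2"], "value": [20.0, 41.0, 26.0]}
)

# Parameter metadata keyed by an "id" column.
parameters = pd.DataFrame(
    {
        "id": ["no2", "o3"],
        "description": ["Nitrogen Dioxide", "Ozone"],
        "name": ["NO2", "O3"],
    }
)

# left_on / right_on name the key column in each table.
complete = pd.merge(
    measurements, parameters, how="left", left_on="parameter", right_on="id"
)
print(complete.columns.tolist())

# "id" duplicates "parameter" on matched rows, so it can be dropped.
tidy = complete.drop(columns="id")
print(tidy)
```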


# 5. Challenge yourself!

## 5.1. Can you plot the evolution of air quality across time and city?

### Hint: refer to the pandas documentation

https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/09_timeseries.html