In [0]:
import numpy as np
import pandas as pd

# Combine Data

When two or more datasets need to be combined for analysis, we can use Pandas' `concat` and `merge` functions.

Let's explore these functions using several long-form air quality datasets.

In [0]:
# Load dataset, parsing the date.utc column as timestamps
air_quality_no2 = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/air_quality_no2_long.csv",
                              parse_dates=["date.utc"])
# Select columns
air_quality_no2 = air_quality_no2[["date.utc", "location",
                                   "parameter", "value"]]

air_quality_no2.head()

Unnamed: 0,date.utc,location,parameter,value
0,2019-06-21 00:00:00+00:00,FR04014,no2,20.0
1,2019-06-20 23:00:00+00:00,FR04014,no2,21.8
2,2019-06-20 22:00:00+00:00,FR04014,no2,26.5
3,2019-06-20 21:00:00+00:00,FR04014,no2,24.9
4,2019-06-20 20:00:00+00:00,FR04014,no2,21.4


In [0]:
# Load dataset, parsing the date.utc column as timestamps
air_quality_pm25 = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/air_quality_pm25_long.csv",
                               parse_dates=["date.utc"])
# Select columns
air_quality_pm25 = air_quality_pm25[["date.utc", "location",
                                     "parameter", "value"]]

air_quality_pm25.head()

Unnamed: 0,date.utc,location,parameter,value
0,2019-06-18 06:00:00+00:00,BETR801,pm25,18.0
1,2019-06-17 08:00:00+00:00,BETR801,pm25,6.5
2,2019-06-17 07:00:00+00:00,BETR801,pm25,18.5
3,2019-06-17 06:00:00+00:00,BETR801,pm25,16.0
4,2019-06-17 05:00:00+00:00,BETR801,pm25,7.5


In [0]:
# Load station coordinate dataset
stations_coord = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/air_quality_stations.csv")

stations_coord.head()

Unnamed: 0,location,coordinates.latitude,coordinates.longitude
0,BELAL01,51.23619,4.38522
1,BELHB23,51.1703,4.341
2,BELLD01,51.10998,5.00486
3,BELLD02,51.12038,5.02155
4,BELR833,51.32766,4.36226


In [0]:
# Load parameter dataset
air_quality_parameters = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/air_quality_parameters.csv")

air_quality_parameters

Unnamed: 0,id,description,name
0,bc,Black Carbon,BC
1,co,Carbon Monoxide,CO
2,no2,Nitrogen Dioxide,NO2
3,o3,Ozone,O3
4,pm10,Particulate matter less than 10 micrometers in...,PM10
5,pm25,Particulate matter less than 2.5 micrometers i...,PM2.5
6,so2,Sulfur Dioxide,SO2


# Concatenate DataFrames

![Concatenate](https://drive.google.com/uc?id=1Led65VvuJgnGDMKlHoAh7QRADz0Nw96X)

We can use `concat()` to combine our two air quality datasets into a single DataFrame.

In [0]:
air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0).sort_index()

air_quality.head()

Unnamed: 0,date.utc,location,parameter,value
0,2019-06-18 06:00:00+00:00,BETR801,pm25,18.0
0,2019-06-21 00:00:00+00:00,FR04014,no2,20.0
1,2019-06-20 23:00:00+00:00,FR04014,no2,21.8
1,2019-06-17 08:00:00+00:00,BETR801,pm25,6.5
2,2019-06-17 07:00:00+00:00,BETR801,pm25,18.5


We chose to concatenate over `axis=0`, so the rows were combined. Notice also that the original index labels are retained, which is why each label appears twice after `sort_index()`.

We can verify that all rows were combined by looking at the shape of the original dataframes versus the new dataframe:

In [0]:
air_quality_pm25.shape

(1110, 4)

In [0]:
air_quality_no2.shape

(2068, 4)

In [0]:
air_quality.shape

(3178, 4)
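
The same check can be written as an assertion. The sketch below uses two small hypothetical frames (`pm25` and `no2` are toy stand-ins, not the real datasets) to show that the row counts add up while the original index labels repeat.

```python
import pandas as pd

# Hypothetical toy frames standing in for the pm25 and no2 tables
pm25 = pd.DataFrame({"location": ["BETR801", "BETR801"],
                     "parameter": ["pm25", "pm25"],
                     "value": [18.0, 6.5]})
no2 = pd.DataFrame({"location": ["FR04014", "FR04014", "FR04014"],
                    "parameter": ["no2", "no2", "no2"],
                    "value": [20.0, 21.8, 26.5]})

combined = pd.concat([pm25, no2], axis=0)

# Row counts add up; the number of columns is unchanged
assert combined.shape == (len(pm25) + len(no2), pm25.shape[1])

# The original row labels are kept, so labels 0 and 1 each appear twice
has_duplicate_labels = combined.index.has_duplicates
```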

If we want an index without duplicated labels, we can rebuild it with `reset_index()`.

In [0]:
# Reset the index, retaining the old index
air_quality.reset_index()

Unnamed: 0,index,date.utc,location,parameter,value
0,0,2019-06-18 06:00:00+00:00,BETR801,pm25,18.0
1,0,2019-06-21 00:00:00+00:00,FR04014,no2,20.0
2,1,2019-06-20 23:00:00+00:00,FR04014,no2,21.8
3,1,2019-06-17 08:00:00+00:00,BETR801,pm25,6.5
4,2,2019-06-17 07:00:00+00:00,BETR801,pm25,18.5
...,...,...,...,...,...
3173,2063,2019-05-07 06:00:00+00:00,London Westminster,no2,26.0
3174,2064,2019-05-07 04:00:00+00:00,London Westminster,no2,16.0
3175,2065,2019-05-07 03:00:00+00:00,London Westminster,no2,19.0
3176,2066,2019-05-07 02:00:00+00:00,London Westminster,no2,19.0


Note that the original indices are still contained in the dataframe, in a new **index** column. If we don't need to keep them, we can drop them by setting `drop=True`.

In [0]:
# Reset the index, dropping the old index
air_quality = air_quality.reset_index(drop=True)

air_quality.head()

Unnamed: 0,date.utc,location,parameter,value
0,2019-06-18 06:00:00+00:00,BETR801,pm25,18.0
1,2019-06-21 00:00:00+00:00,FR04014,no2,20.0
2,2019-06-20 23:00:00+00:00,FR04014,no2,21.8
3,2019-06-17 08:00:00+00:00,BETR801,pm25,6.5
4,2019-06-17 07:00:00+00:00,BETR801,pm25,18.5
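
If the old labels are not needed at all, `concat` can also renumber the rows itself through its `ignore_index` parameter. The following is a minimal sketch on hypothetical toy frames (`first` and `second` are illustrative names). Unlike the cells above it does not call `sort_index()`, so the rows stay in concatenation order rather than being interleaved.

```python
import pandas as pd

# Hypothetical toy frames, each with its own default 0..n-1 index
first = pd.DataFrame({"value": [18.0, 6.5]})
second = pd.DataFrame({"value": [20.0, 21.8]})

# ignore_index=True discards the incoming labels and numbers the rows 0..n-1
renumbered = pd.concat([first, second], axis=0, ignore_index=True)

# Equivalent two-step version: concatenate, then drop the old index
same_as_reset = pd.concat([first, second], axis=0).reset_index(drop=True)
```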


# Merge

![Merge Left](https://drive.google.com/uc?id=1-u_RoscsfLWuRWDe8YYxrdvvv3TLgUQ7)

`merge()` enables us to combine two dataframes with a common identifier.

We already combined our two measurement datasets into the new **air_quality** dataframe, but we have more dataframes to combine.

Let's start with the station coordinate dataframe and remind ourselves what the data looks like:

In [0]:
air_quality.head()

Unnamed: 0,date.utc,location,parameter,value
0,2019-06-18 06:00:00+00:00,BETR801,pm25,18.0
1,2019-06-21 00:00:00+00:00,FR04014,no2,20.0
2,2019-06-20 23:00:00+00:00,FR04014,no2,21.8
3,2019-06-17 08:00:00+00:00,BETR801,pm25,6.5
4,2019-06-17 07:00:00+00:00,BETR801,pm25,18.5


In [0]:
stations_coord.head()

Unnamed: 0,location,coordinates.latitude,coordinates.longitude
0,BELAL01,51.23619,4.38522
1,BELHB23,51.1703,4.341
2,BELLD01,51.10998,5.00486
3,BELLD02,51.12038,5.02155
4,BELR833,51.32766,4.36226


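We would like to have both the measurements and station coordinates in the same table. Because both tables contain a **location** column, we combine our data on that key. Setting `how='left'` keeps every row of **air_quality**, even if a location has no entry in **stations_coord**.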

In [0]:
air_quality = pd.merge(air_quality, stations_coord,
                       how='left', on='location')

air_quality.head()

Unnamed: 0,date.utc,location,parameter,value,coordinates.latitude,coordinates.longitude
0,2019-06-18 06:00:00+00:00,BETR801,pm25,18.0,51.20966,4.43182
1,2019-06-21 00:00:00+00:00,FR04014,no2,20.0,48.83724,2.3939
2,2019-06-21 00:00:00+00:00,FR04014,no2,20.0,48.83722,2.3939
3,2019-06-20 23:00:00+00:00,FR04014,no2,21.8,48.83724,2.3939
4,2019-06-20 23:00:00+00:00,FR04014,no2,21.8,48.83722,2.3939


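Notice that the result is our **air_quality** dataframe with two additional columns. Some measurements now appear twice (rows 1 and 2 above, for example) because **stations_coord** lists the location FR04014 with two slightly different coordinate pairs.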
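
A left merge returns one row per match, so it can add rows when a key occurs more than once in the right-hand table. The sketch below reproduces this on hypothetical toy tables (`measurements` and `stations` are illustrative names) and uses the `validate` parameter of `merge` to detect the repeated key.

```python
import pandas as pd

# Hypothetical toy tables; FR04014 is listed twice in `stations`
measurements = pd.DataFrame({"location": ["FR04014", "BETR801"],
                             "value": [20.0, 18.0]})
stations = pd.DataFrame({
    "location": ["FR04014", "FR04014", "BETR801"],
    "coordinates.latitude": [48.83724, 48.83722, 51.20966],
})

merged = pd.merge(measurements, stations, how="left", on="location")
# FR04014 matches two station rows, so 2 measurements become 3 rows
rows_before, rows_after = len(measurements), len(merged)

# validate="many_to_one" asks pandas to check that keys are unique
# in the right table and to raise MergeError if they are not
try:
    pd.merge(measurements, stations, how="left", on="location",
             validate="many_to_one")
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True
```

Whether the repeated station should be deduplicated before merging depends on the analysis, so the sketch only detects the situation.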

Now let's combine the last dataset:

In [0]:
air_quality_parameters

Unnamed: 0,id,description,name
0,bc,Black Carbon,BC
1,co,Carbon Monoxide,CO
2,no2,Nitrogen Dioxide,NO2
3,o3,Ozone,O3
4,pm10,Particulate matter less than 10 micrometers in...,PM10
5,pm25,Particulate matter less than 2.5 micrometers i...,PM2.5
6,so2,Sulfur Dioxide,SO2


The **air_quality_parameters** dataframe contains a description and name for each parameter id, but it has no column names in common with our **air_quality** dataframe.

However, the **parameter** column in the **air_quality** dataframe holds the same values as the **id** column in the **air_quality_parameters** dataframe.

Using this information, we can still merge the two dataframes by setting the additional parameters `left_on` and `right_on`.

In [0]:
air_quality = pd.merge(air_quality, air_quality_parameters,
                       how='left', left_on='parameter', right_on='id')

air_quality.head()

Unnamed: 0,date.utc,location,parameter,value,id_x,description_x,name_x,id_y,description_y,name_y
0,2019-06-18 06:00:00+00:00,BETR801,pm25,18.0,pm25,Particulate matter less than 2.5 micrometers i...,PM2.5,pm25,Particulate matter less than 2.5 micrometers i...,PM2.5
1,2019-06-21 00:00:00+00:00,FR04014,no2,20.0,no2,Nitrogen Dioxide,NO2,no2,Nitrogen Dioxide,NO2
2,2019-06-20 23:00:00+00:00,FR04014,no2,21.8,no2,Nitrogen Dioxide,NO2,no2,Nitrogen Dioxide,NO2
3,2019-06-17 08:00:00+00:00,BETR801,pm25,6.5,pm25,Particulate matter less than 2.5 micrometers i...,PM2.5,pm25,Particulate matter less than 2.5 micrometers i...,PM2.5
4,2019-06-17 07:00:00+00:00,BETR801,pm25,18.5,pm25,Particulate matter less than 2.5 micrometers i...,PM2.5,pm25,Particulate matter less than 2.5 micrometers i...,PM2.5


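We have now combined all four of our original datasets into a single Pandas DataFrame! Note that pandas adds the `_x` and `_y` suffixes only when both sides of a merge share a column name, so suffixed columns like those in the output above usually mean the merge cell was run a second time on the already merged dataframe. A single run yields plain `id`, `description`, and `name` columns.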
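
When the key columns have different names, both of them survive the merge. Here is a small sketch on hypothetical toy tables (`measurements` and `parameters` are illustrative names) that drops the redundant `id` column, then re-runs the merge to show where suffixed column names come from.

```python
import pandas as pd

# Hypothetical toy tables with differently named key columns
measurements = pd.DataFrame({"parameter": ["pm25", "no2"],
                             "value": [18.0, 20.0]})
parameters = pd.DataFrame({"id": ["no2", "pm25"],
                           "name": ["NO2", "PM2.5"]})

merged = pd.merge(measurements, parameters,
                  how="left", left_on="parameter", right_on="id")

# `id` now duplicates `parameter`, so it can be dropped
tidy = merged.drop(columns="id")

# Merging the already merged table again makes `id` and `name` collide;
# pandas disambiguates them with the default suffixes _x and _y
rerun = pd.merge(merged, parameters,
                 how="left", left_on="parameter", right_on="id")
```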

# Summary

- Multiple tables can be concatenated both row-wise and column-wise using the `concat` function.
- For database-like merging/joining of tables, use the `merge` function.
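
The examples in this notebook only concatenate row-wise. For completeness, here is a minimal sketch of column-wise concatenation with `axis=1`, using hypothetical toy frames (`values` and `flags` are illustrative names). Rows are aligned on their index labels, and labels missing from one frame are filled with `NaN`.

```python
import pandas as pd

# Hypothetical toy frames; `flags` has no row for index label 1
values = pd.DataFrame({"value": [18.0, 6.5, 18.5]})
flags = pd.DataFrame({"above_limit": [True, False]}, index=[0, 2])

# axis=1 places the frames side by side, aligning rows on the index
side_by_side = pd.concat([values, flags], axis=1)

# Label 1 exists only in `values`, so its `above_limit` entry is missing
missing_flag = pd.isna(side_by_side.loc[1, "above_limit"])
```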