In [None]:
import numpy as np
import pandas as pd

# Combine Data

When two datasets need to be combined for analysis, we can use pandas' `concat` and `merge` functions.

Let's explore these functions using several long-format air quality datasets.

In [None]:
# Load dataset
air_quality_no2 = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/air_quality_no2_long.csv",
                              parse_dates=True)
# Select columns
air_quality_no2 = air_quality_no2[["date.utc", "location",
                                   "parameter", "value"]]

air_quality_no2.head()                            

In [None]:
# Load dataset
air_quality_pm25 = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/air_quality_pm25_long.csv",
                               parse_dates=True)
# Select columns
air_quality_pm25 = air_quality_pm25[["date.utc", "location",
                                     "parameter", "value"]]

air_quality_pm25.head()

In [None]:
# Load station coordinate dataset
stations_coord = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/air_quality_stations.csv")

stations_coord.head()

In [None]:
# Load parameter dataset
air_quality_parameters = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/air_quality_parameters.csv")

air_quality_parameters

# Concatenate Dataframes

![Concatenate](https://drive.google.com/uc?id=1Led65VvuJgnGDMKlHoAh7QRADz0Nw96X)

We can use `concat()` to combine our two air quality datasets into a single DataFrame.

In [None]:
air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0).sort_index()

air_quality.head()

We chose to concatenate along `axis=0`, so the rows were combined. Notice that the original index labels are retained, which means the same label can now appear more than once.
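To see this behaviour in isolation, here is a minimal sketch using two tiny, made-up frames (the names `pm25` and `no2` and all values are hypothetical stand-ins, not the real datasets). It shows that row-wise concatenation keeps each frame's own index labels, that the row counts add up, and that `ignore_index=True` is one way to get a fresh index directly.

```python
import pandas as pd

# Hypothetical toy data standing in for the PM2.5 and NO2 tables
pm25 = pd.DataFrame({"location": ["A", "B"],
                     "parameter": ["pm25", "pm25"],
                     "value": [18.0, 6.5]})
no2 = pd.DataFrame({"location": ["A", "B", "C"],
                    "parameter": ["no2", "no2", "no2"],
                    "value": [20.0, 21.8, 26.5]})

# Stack the rows; each frame keeps its own 0-based index labels
stacked = pd.concat([pm25, no2], axis=0)
print(stacked.index.tolist())                 # [0, 1, 0, 1, 2]

# The number of rows is the sum of the inputs' rows
print(len(stacked) == len(pm25) + len(no2))   # True

# ignore_index=True discards the old labels and numbers the rows 0..n-1
renumbered = pd.concat([pm25, no2], axis=0, ignore_index=True)
print(renumbered.index.tolist())              # [0, 1, 2, 3, 4]
```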

We can verify that all rows were combined by looking at the shape of the original dataframes versus the new dataframe:

In [None]:
air_quality_pm25.shape

In [None]:
air_quality_no2.shape

In [None]:
air_quality.shape

If we want the index labels to be unique again, we can renumber the rows with `reset_index()`.

In [None]:
# Reset the index, retaining the old index
air_quality.reset_index()

Note that the original indices are still contained in the dataframe, now as a regular column named `index`. If we don't need to keep them, we can drop them by setting `drop=True`.
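The difference between the two calls can be sketched on a small, made-up frame (the frame `df` and its values are hypothetical). Note that `reset_index()` returns a new dataframe and leaves the original untouched unless we assign the result.

```python
import pandas as pd

# Hypothetical frame with duplicated index labels, like the result of a row-wise concat
df = pd.DataFrame({"value": [18.0, 20.0, 6.5, 21.8]}, index=[0, 0, 1, 1])

kept = df.reset_index()               # old labels move into a new "index" column
dropped = df.reset_index(drop=True)   # old labels are discarded

print(kept.columns.tolist())      # ['index', 'value']
print(dropped.columns.tolist())   # ['value']
print(dropped.index.tolist())     # [0, 1, 2, 3]

# The original frame is unchanged because the result was not assigned back to df
print(df.index.tolist())          # [0, 0, 1, 1]
```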

In [None]:
# Reset the index, dropping the old index
air_quality = air_quality.reset_index(drop=True)

air_quality.head()

# Merge

![Merge Left](https://drive.google.com/uc?id=1-u_RoscsfLWuRWDe8YYxrdvvv3TLgUQ7)

`merge()` enables us to combine two dataframes with a common identifier.

We already combined our two measurement datasets into the new **air_quality** dataframe, but we have more dataframes to combine.

Let's start with the station coordinate dataframe and remind ourselves what the data looks like:

In [None]:
air_quality.head()

In [None]:
stations_coord.head()

We would like to have both the measurements and station coordinates in the same table. Because both tables contain a **location** column, we combine our data on that key. 

In [None]:
air_quality = pd.merge(air_quality, stations_coord,
                       how='left', on='location')

air_quality.head()

Notice that we have the same **air_quality** dataframe, now with two additional columns taken from **stations_coord**.
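What `how='left'` means is easiest to see on a small example. The sketch below uses made-up measurements and station coordinates (all names and values are hypothetical): every row of the left table is kept, and a location with no match in the right table gets missing values. The `validate="many_to_one"` argument is an optional guard, added here as an illustration, that raises an error if a location appears more than once in the right table.

```python
import pandas as pd

# Hypothetical measurements; location "X" has no entry in the stations table
measurements = pd.DataFrame({"location": ["A", "B", "X"],
                             "value": [18.0, 6.5, 3.2]})

# Hypothetical station coordinates; location "C" has no measurements
stations = pd.DataFrame({"location": ["A", "B", "C"],
                         "coordinates.latitude": [51.2, 48.8, 51.5],
                         "coordinates.longitude": [4.4, 2.3, -0.1]})

# Left merge: keep every measurement row and look up its coordinates
merged = pd.merge(measurements, stations,
                  how="left", on="location",
                  validate="many_to_one")

print(merged.shape)                                     # (3, 4)
print(merged["location"].tolist())                      # ['A', 'B', 'X']
print(merged["coordinates.latitude"].isna().tolist())   # [False, False, True]
```

Station "C" does not appear in the result because a left merge only keeps keys from the left table.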

Now let's combine the last dataset:

In [None]:
air_quality_parameters

The **air_quality_parameters** dataframe contains a description and name for each parameter id, but it doesn't have any column names in common with our **air_quality** dataframe.

However, the values in the **parameter** column of the **air_quality** dataframe match the values in the **id** column of the **air_quality_parameters** dataframe.

Using this information, we can still merge the two dataframes by naming the key column of each table with the additional parameters `left_on` and `right_on`.

In [None]:
air_quality = pd.merge(air_quality, air_quality_parameters,
                       how='left', left_on='parameter', right_on='id')

air_quality.head()

We have now combined all four of our original datasets into a single pandas DataFrame!
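One detail of merging with `left_on` and `right_on` is that both key columns end up in the result. The following sketch, on small made-up tables (names and values are hypothetical), shows this and drops the redundant key afterwards.

```python
import pandas as pd

# Hypothetical measurements keyed by "parameter"
measurements = pd.DataFrame({"parameter": ["pm25", "no2"],
                             "value": [18.0, 20.0]})

# Hypothetical lookup table keyed by "id"
parameters = pd.DataFrame({"id": ["no2", "pm25"],
                           "name": ["NO2", "PM2.5"]})

# The key columns have different names, so we name each one explicitly
merged = pd.merge(measurements, parameters,
                  how="left", left_on="parameter", right_on="id")

print(merged.columns.tolist())   # ['parameter', 'value', 'id', 'name']
print(merged["name"].tolist())   # ['PM2.5', 'NO2']

# "id" duplicates "parameter", so it can be dropped if it is not needed
merged = merged.drop(columns="id")
print(merged.columns.tolist())   # ['parameter', 'value', 'name']
```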

# Summary

- Multiple tables can be concatenated both column-wise and row-wise using the `concat` function.
- For database-like merging/joining of tables, use the `merge` function.
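The examples in this section only concatenated row-wise. As a minimal sketch of the column-wise case, the snippet below uses two made-up frames indexed by a hypothetical station label and passes `axis=1`. With the default settings, rows are aligned on their index labels and labels missing from one frame are filled with `NaN`.

```python
import pandas as pd

# Hypothetical per-station values, indexed by station label
pm25 = pd.DataFrame({"pm25": [18.0, 6.5]}, index=["A", "B"])
no2 = pd.DataFrame({"no2": [20.0, 26.5]}, index=["A", "C"])

# Column-wise concatenation aligns rows on the index labels
wide = pd.concat([pm25, no2], axis=1)

print(wide.shape)                  # (3, 2)
print(wide.columns.tolist())       # ['pm25', 'no2']
print(wide.isna().sum().tolist())  # [1, 1]
```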