<img src=../images/gdd-logo.png width=300px align=right>

# Combining Datasets

In this notebook we'll look at ways to combine datasets: concatenating, merging and joining.

- [Concatenating datasets](#c)
    - <mark>[Exercise: Concatenating](#e0) </mark>
- [Merging datasets](#mj)
    - <mark>[Assignment](#e1) </mark>

<a id='c'></a>
## Concatenating datasets

This time, let's imagine you didn't receive the `chickweight` dataset in its entirety.

Instead you receive **four separate datasets**, one for each diet.

In [None]:
import pandas as pd

In [None]:
diet_1 = pd.read_csv('../data/diet_1.csv')
diet_2 = pd.read_csv('../data/diet_2.csv')
diet_3 = pd.read_csv('../data/diet_3.csv')
diet_4 = pd.read_csv('../data/diet_4.csv')

As you can see below, the DataFrame `diet_1` only contains rows where `diet` is equal to 1.

In [None]:
diet_1['diet'].unique()

To recreate the `chickweight` dataset, you need to **vertically stack** these datasets on top of one another.

The `pd.concat()` function can be used to do this.

In [None]:
chickweight_concat = pd.concat([diet_1, diet_2, diet_3, diet_4])
chickweight_concat['diet'].unique()
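
To see what vertical stacking does on a small scale, here is a minimal sketch. The frames `toy_diet_1` and `toy_diet_2` and their values are made up for illustration and are not taken from the real diet files.

```python
import pandas as pd

# Hypothetical toy data standing in for two of the diet files.
toy_diet_1 = pd.DataFrame({"weight": [42, 51], "diet": [1, 1]})
toy_diet_2 = pd.DataFrame({"weight": [40, 49], "diet": [2, 2]})

# Stack the rows of the second frame underneath the first.
toy_concat = pd.concat([toy_diet_1, toy_diet_2])

print(toy_concat.shape)           # rows add up, columns stay the same
print(toy_concat.index.tolist())  # each frame keeps its own index labels
```

Note that the index labels of the inputs are carried over unchanged, so the same label can appear more than once in the result.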

<a id='e0'></a>
## <mark> Exercise: Concatenating </mark>

1. The above recreates our original dataset, but the index is not the same as when you read the full data in. Alter the code so that the resulting index is the default `0 - 578`.

<details>
<summary><font style="color:blue;font-weight:bold">SHOW HINT</font></summary>
  
There are two ways to do this:
    
1. Use the method `.reset_index()` on the resulting DataFrame.
2. Use a parameter of the [pd.concat](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) function.
    
</details>

2. Concatenate `diet_1` and `diet_2` together horizontally. Why do we get the missing `NaN` values?

<details>
<summary><font style="color:blue;font-weight:bold">SHOW HINT</font></summary>
  
You will need to use the `axis=` parameter to specify you want to concatenate across the columns (horizontally).
    
</details>

**Answers**: Uncomment and run the following to see an answer.

In [None]:
# %load ../answers/Combining_Datasets/ex-concat-1.py

In [None]:
# %load ../answers/Combining_Datasets/ex-concat-2.py

### Adding extra information 

Concatenation is the act of putting two datasets together, either on top of one another or side-by-side. 

Next you will see how to merge in extra data, for example some new data about the names of the diets:

In [None]:
diet_names = pd.read_csv('../data/diet_names.csv')
diet_names

You may want to add this information to the original data:

In [None]:
chickweight = (
    pd.read_csv('../data/chickweight.csv') 
      .rename(str.lower, axis='columns')
)

If you do this with the `pd.concat()` function, you will notice that the output is not what you might expect.

In [None]:
pd.concat([chickweight, diet_names], axis=1)

<mark>**Question:** Why do you get the above dataframe as a result?</mark>

<details>
    <summary><font color=blue>Show answer</font></summary>
  
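The concatenation aligns on the indexes of the two input DataFrames. The index of `diet_names` only goes up to `3`, whereas the index of `chickweight` goes up to `578`.
    
When concatenating, pandas matches rows with the same index value to one another. Index values that exist in only one DataFrame get `NaN` in the columns coming from the other.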
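
The same behaviour can be reproduced with a minimal sketch. The frames `left` and `right` below are hypothetical toy data, not the real datasets.

```python
import pandas as pd

# Hypothetical toy frames: one with five rows, one with two.
left = pd.DataFrame({"weight": [42, 51, 59, 64, 76]})
right = pd.DataFrame({"name": ["diet A", "diet B"]})

# axis=1 places the frames side by side, pairing rows by index label.
side_by_side = pd.concat([left, right], axis=1)

# Index labels 2, 3 and 4 exist only in `left`, so `name` is missing there.
print(side_by_side)
```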
    
</details>

Instead, for each value of `diet`, you want to merge on the corresponding diet name.

This is where the **merge** function comes in.

----
<img src="../images/06_Combining_Datasets/join.png" width="200" height="240" align="right"/>

<a id='mj'></a>
## Merging DataFrames

Merging DataFrames is the process of combining two DataFrames into a single one by matching rows on the values of one or more shared key columns.

Merging can be used to add variables to a DataFrame using information that exists in another DataFrame.

For example, to **merge** on the diet names, you can use the `pd.merge()` function specifying:
- The names of the DataFrames: `chickweight`, `diet_names`
- On which column to perform the merge from the **left** and **right** dataframes*

*Note: The left DataFrame is the first argument.

In [None]:
pd.merge(
    chickweight,
    diet_names, 
    left_on="diet",
    right_on='diet_id'
)

Note how there are two diet columns that specify the diet ID numbers. That is because the names of the columns are different. Look what happens if you rename the column before the merge:

In [None]:
pd.merge(
    chickweight.rename(columns={'diet': 'diet_id'}),
    diet_names, 
    left_on='diet_id',
    right_on='diet_id'
)

Note that it is not necessary to specify which column you merge on from the left and right, since the column names are now the same.

Instead you can use the keyword `on=` and use the column named `diet_id`. 

In [None]:
pd.merge(
    chickweight.rename(columns={'diet': 'diet_id'}),
    diet_names, 
    on='diet_id'
)
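
The difference between the two styles is easy to check on a small example. This is a sketch on hypothetical toy data: the frames `toy_chicks` and `toy_names`, their values and the `name` column are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data; `name` is an assumed column, not taken from diet_names.csv.
toy_chicks = pd.DataFrame({"weight": [42, 40, 51], "diet": [1, 2, 1]})
toy_names = pd.DataFrame({"diet_id": [1, 2], "name": ["diet A", "diet B"]})

# Differently named keys: both key columns are kept in the result.
two_keys = pd.merge(toy_chicks, toy_names, left_on="diet", right_on="diet_id")
print(two_keys.columns.tolist())

# A shared key name: the key column appears only once.
one_key = pd.merge(
    toy_chicks.rename(columns={"diet": "diet_id"}),
    toy_names,
    on="diet_id",
)
print(one_key.columns.tolist())
```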

A good thing to consider is what happens when you have two columns that have the same name, which **are not** the column on which to perform the merge. 

To demonstrate, imagine the `diet_names` DataFrame also had a column called `weight`, which specifies the amount of feed you receive in a bag of that chicken feed brand.

In [None]:
diet_names = (
    diet_names
    .assign(weight = ['1400g', '1500g', '1350g', '1450g'])
)

diet_names

What happens now that there are two columns with the same name?

In [None]:
pd.merge(
    chickweight.rename(columns={'diet': 'diet_id'}),
    diet_names, 
    on='diet_id'
).head()

Notice that the clashing columns have been given a suffix (`_x` and `_y` by default). You can control this with the keyword `suffixes=`, whose argument is a list of the suffixes to use, in the same order as the DataFrames you passed in.

In [None]:
pd.merge(
    chickweight.rename(columns={'diet': 'diet_id'}),
    diet_names, 
    on='diet_id',
    suffixes=['_chick', '_feed']
).head()
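
As a minimal sketch of how suffixes are assigned, the example below uses two hypothetical toy frames (`toy_chicks` and `toy_feed`, with made-up values) that share a non-key column called `weight`.

```python
import pandas as pd

# Hypothetical toy data with a clashing `weight` column on both sides.
toy_chicks = pd.DataFrame({"diet_id": [1, 2], "weight": [42, 40]})
toy_feed = pd.DataFrame({"diet_id": [1, 2], "weight": ["1400g", "1500g"]})

# Without `suffixes=`, pandas appends `_x` (left) and `_y` (right).
default = pd.merge(toy_chicks, toy_feed, on="diet_id")
print(default.columns.tolist())

# Custom suffixes, given in left-then-right order.
custom = pd.merge(toy_chicks, toy_feed, on="diet_id", suffixes=("_chick", "_feed"))
print(custom.columns.tolist())
```

The key column `diet_id` is not suffixed, because it is shared by design rather than clashing.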

<a id='e1'></a>
## <mark> Assignment: Find the fattest chicken per diet</mark>

1. Use the `.groupby()` method to find the maximum chick weight per diet.
2. Merge this information into the original `chickweight` dataset.
3. Find the fattest chicken per diet by identifying the rows where the value in the `weight` column is equal to the value in your new column.


In [None]:
# %load ../answers/Combining_Datasets/ex-fattest-chick-merge.py