<div style="color:#006666; padding:0px 10px; border-radius:5px; font-size:18px;"><h1 style='margin:10px 5px'>Combining Data</h1>
</div>

© Copyright Machine Learning Plus

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>1. Joining DataFrames</h2>
</div>

__When to use__

1. When you have information in two different dataframes that needs to be joined into one.
2. When you want to add a new column that is aggregated from the existing dataset.

__Example__

1. You have product data in two or more files: one file contains the pricing and product category information, and another contains the monthly sales. You want to join the pricing information to the sales data.
2. In the Titanic dataset, you want to create a new column containing the average fare for each passenger class.
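
The first scenario can be sketched with two tiny, made-up tables. The table names and the columns `product_id`, `price`, `category`, `month`, and `units_sold` are assumptions for illustration only and do not come from any dataset used in this notebook.

```python
import pandas as pd

# Hypothetical pricing table: one row per product.
pricing = pd.DataFrame({
    "product_id": [101, 102, 103],
    "price": [9.99, 24.50, 4.75],
    "category": ["toys", "kitchen", "stationery"],
})

# Hypothetical monthly sales table: possibly several rows per product.
sales = pd.DataFrame({
    "product_id": [101, 101, 102, 103],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "units_sold": [30, 25, 12, 80],
})

# Attach price and category to every sales row via the shared key.
sales_priced = sales.merge(pricing, on="product_id", how="left")
print(sales_priced)
```

A left join is used here so that every sales row is kept even if a product were missing from the pricing table.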

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('Datasets/Titanic.csv')
df.head()

Let's compute the average fare per class using `groupby`, then merge the result back into the original dataframe.

In [None]:
# Mean fare per passenger class, rounded to 2 decimals (named aggregation).
df_avgfare = df.groupby('Pclass').agg(Avg_Fare=('Fare', 'mean')).round(2)
df_avgfare

In [None]:
pd.merge(df, df_avgfare, on="Pclass")

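__Same logic using the `transform()` method__

In [None]:
# Pass the aggregation by name; passing np.mean raises a FutureWarning in recent pandas versions.
df.groupby('Pclass')['Fare'].transform('mean')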
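
As a minimal sketch of why `transform()` can replace the groupby-then-merge step: it returns a Series with the same index as the original dataframe, so the result can be assigned directly as a new column. The small frame below is made up and only mimics the `Pclass` and `Fare` columns of the Titanic data.

```python
import pandas as pd

# Made-up stand-in for the Titanic data (values are illustrative only).
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0, 10.0],
})

# transform() broadcasts each group's mean back to that group's rows,
# so the output has one value per original row.
toy["Avg_Fare"] = toy.groupby("Pclass")["Fare"].transform("mean").round(2)
print(toy)
```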

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>2. Types of Joins</h2>
</div>

__When to use__

You have two or more datasets that share a common column, and you want to merge them into one.

The merging can happen in multiple ways:

1. Left Join  (all rows in the left df are retained)
2. Right Join (all rows in the right df are retained)
3. Inner Join (only rows with keys present in both dfs are retained)
4. Outer Join (all rows from both datasets are retained)
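
The four join types above can be compared side by side on two tiny, made-up tables. The frame names `left` and `right` and the columns `key`, `x`, and `y` are hypothetical; the sketch only records which key values survive each join.

```python
import pandas as pd

# Two made-up tables sharing the key column "key".
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# Collect the keys that survive each join type.
keys_by_join = {
    how: sorted(left.merge(right, on="key", how=how)["key"])
    for how in ["left", "right", "inner", "outer"]
}
for how, keys in keys_by_join.items():
    print(f"{how:>5}: {keys}")
```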

In [None]:
import pandas as pd
import numpy as np

In [None]:
df1 = pd.read_csv("Datasets/table1.csv")
df2 = pd.read_csv("Datasets/table2.csv")

print(df1.shape)
print(df1.head(), "\n")

print(df2.shape)
print(df2.head(), "\n")

In [None]:
df3 = df2.rename(columns={"carname": "car"})
df3

__Left Join__

In [None]:
df_left = pd.merge(df1, df2, on='carname', how='left')
print(df_left.shape)
df_left

__When column names are not the same__

In [None]:
pd.merge(df1, df3, left_on='carname', right_on="car", how='left')
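
Note that with `left_on`/`right_on`, pandas keeps both key columns in the result. The sketch below uses two made-up frames (the names `cars_a` and `cars_b` and their values are illustrative assumptions) and shows one way to drop the redundant key afterwards.

```python
import pandas as pd

# Made-up frames whose key columns have different names.
cars_a = pd.DataFrame({"carname": ["Fiat 128", "Volvo 142E"], "mpg": [32.4, 21.4]})
cars_b = pd.DataFrame({"car": ["Fiat 128", "Honda Civic"], "wt": [2.2, 1.615]})

merged = pd.merge(cars_a, cars_b, left_on="carname", right_on="car", how="left")
print(merged.columns.tolist())  # both 'carname' and 'car' are present

# In a left join the right-hand key adds no information, so drop it.
merged = merged.drop(columns="car")
print(merged)
```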

__Right Join__

In [None]:
df_right = pd.merge(df1, df2, on='carname', how='right')
print(df_right.shape)
df_right

__Inner Join (default)__

In [None]:
df_inner = pd.merge(df1, df2, on='carname', how='inner')
print(df_inner.shape)
df_inner

__Outer Join (Full Join)__

In [None]:
df_outer = pd.merge(df1, df2, on='carname', how='outer')
print(df_outer.shape)
df_outer

__Alternate Syntax__

In [None]:
df1.merge(df2, how='inner', on='carname')

### Challenge

Get the rows that are NOT common between `df1` and `df2`.

```python
import pandas as pd
import numpy as np
df1 = pd.read_csv("Datasets/table1.csv")
df2 = pd.read_csv("Datasets/table2.csv")
```

Code Link: https://git.io/JsBMo

In [None]:
# Solution
import pandas as pd
import numpy as np
df1 = pd.read_csv("Datasets/table1.csv")
df2 = pd.read_csv("Datasets/table2.csv")

In [None]:
df_inner = df1.merge(df2, on='carname', how='inner')
print(df_inner, '\n')

df_outer = df1.merge(df2, on='carname', how='outer')
print(df_outer)

In [None]:
# Exclude common rows.
df_outer.loc[~df_outer.carname.isin(df_inner.carname), :]
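
An alternative sketch of the same idea uses the `indicator=True` option of `merge`, which adds a `_merge` column marking each row as `left_only`, `right_only`, or `both`. The two small frames below (`t1`, `t2`) are made-up stand-ins for `df1` and `df2`.

```python
import pandas as pd

# Made-up stand-ins for df1 and df2.
t1 = pd.DataFrame({"carname": ["Fiat 128", "Volvo 142E", "Honda Civic"],
                   "mpg": [32.4, 21.4, 30.4]})
t2 = pd.DataFrame({"carname": ["Honda Civic", "Camaro Z28"],
                   "wt": [1.615, 3.84]})

# Outer join with a helper column recording where each row came from.
outer = t1.merge(t2, on="carname", how="outer", indicator=True)

# Keep rows found in only one table, then drop the helper column.
not_common = outer.loc[outer["_merge"] != "both"].drop(columns="_merge")
print(not_common)
```

This avoids computing a separate inner join, and the `_merge` column also tells you which table each unmatched row came from if you keep it.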

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>3. Concatenating DataFrames Row-wise and Column-wise</h2>
</div>

__When to use__

You want to concatenate two datasets either row-wise or column-wise.

1. The datasets have the same rows but different columns -> append column-wise
2. The datasets have the same columns but different rows -> append row-wise

In [None]:
import pandas as pd
import numpy as np

In [None]:
# df2 and df3 have the same columns but different rows; one column has a different name (carname vs. car).
# df2 and df4 have the same rows but different columns.

df2 = pd.read_csv("Datasets/table2.csv")
df3 = pd.read_csv("Datasets/table3.csv")
df4 = pd.read_csv("Datasets/table4.csv")

print(df2.shape)
print(df2.head(), "\n")

print(df3.shape)
print(df3.head(), "\n")

print(df4.shape)
print(df4.head())

(13, 4)
            carname     wt   qsec  vs
0    Toyota Corolla  1.835  19.90   1
1     Toyota Corona  2.465  20.01   1
2  Dodge Challenger  3.520  16.87   0
3       AMC Javelin  3.435  17.30   0
4        Camaro Z28  3.840  15.41   0 

(19, 4)
                 car     wt   qsec  vs
0          Mazda RX4  2.620  16.46   0
1      Mazda RX4 Wag  2.875  17.02   0
2         Datsun 710  2.320  18.61   1
3     Hornet 4 Drive  3.215  19.44   1
4  Hornet Sportabout  3.440  17.02   0 

(13, 4)
            carname   mpg  cyl   disp
0    Toyota Corolla  33.9    4   71.1
1     Toyota Corona  21.5    4  120.1
2  Dodge Challenger  15.5    8  318.0
3       AMC Javelin  15.2    8  304.0
4        Camaro Z28  13.3    8  350.0


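Although the columns appear in the same order, `pd.concat` aligns them by column name, not by position. Because `df2` has `carname` and `df3` has `car`, the two end up as separate columns.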
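
A minimal sketch of name-based alignment, using made-up one-row frames (`top` and `bottom` are hypothetical names): columns in a different order are still matched correctly, while a differently named column is treated as new and padded with missing values.

```python
import pandas as pd

# Made-up frames: same column names, different column order.
top = pd.DataFrame({"carname": ["Fiat 128"], "wt": [2.2]})
bottom = pd.DataFrame({"wt": [1.615], "carname": ["Honda Civic"]})

# Row-wise concatenation is the default (axis=0).
stacked = pd.concat([top, bottom])
print(stacked)  # values land under the matching names despite the order

# A differently named column is treated as a new column and padded with NaN.
mismatched = pd.concat([top, bottom.rename(columns={"carname": "car"})])
print(mismatched)
```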

In [None]:
pd.concat([df2, df3], axis='rows')

            carname     wt   qsec  vs  car
0    Toyota Corolla  1.835  19.90   1  NaN
1     Toyota Corona  2.465  20.01   1  NaN
2  Dodge Challenger  3.520  16.87   0  NaN
3       AMC Javelin  3.435  17.30   0  NaN
4        Camaro Z28  3.840  15.41   0  NaN
5  Pontiac Firebird  3.845  17.05   0  NaN
6         Fiat X1-9  1.935  18.90   1  NaN
7     Porsche 914-2  2.140  16.70   0  NaN
8      Lotus Europa  1.513  16.90   1  NaN
9    Ford Pantera L  3.170  14.50   0  NaN


Rename the columns to align them correctly.

In [None]:
pd.concat([df2, df3.rename(columns={"car": "carname"})], axis='rows')

            carname     wt   qsec  vs
0    Toyota Corolla  1.835  19.90   1
1     Toyota Corona  2.465  20.01   1
2  Dodge Challenger  3.520  16.87   0
3       AMC Javelin  3.435  17.30   0
4        Camaro Z28  3.840  15.41   0
5  Pontiac Firebird  3.845  17.05   0
6         Fiat X1-9  1.935  18.90   1
7     Porsche 914-2  2.140  16.70   0
8      Lotus Europa  1.513  16.90   1
9    Ford Pantera L  3.170  14.50   0
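
One thing worth noting about row-wise concatenation: each frame keeps its own index labels, so the combined index contains duplicates unless it is reset. A small sketch with made-up frames (`first` and `second` are hypothetical names), using the `ignore_index` parameter of `pd.concat`:

```python
import pandas as pd

# Made-up frames, each with its own default index starting at 0.
first = pd.DataFrame({"carname": ["Fiat 128", "Honda Civic"], "wt": [2.2, 1.615]})
second = pd.DataFrame({"carname": ["Camaro Z28"], "wt": [3.84]})

kept = pd.concat([first, second])                      # original labels are kept
fresh = pd.concat([first, second], ignore_index=True)  # labels are renumbered

print(kept.index.tolist())
print(fresh.index.tolist())
```

Duplicate index labels matter later, because `.loc[0]` on `kept` would return two rows instead of one.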


Append the datasets along the columns.

In [None]:
pd.concat([df2, df4.drop(columns='carname')], axis='columns')

            carname     wt   qsec  vs   mpg  cyl   disp
0    Toyota Corolla  1.835  19.90   1  33.9    4   71.1
1     Toyota Corona  2.465  20.01   1  21.5    4  120.1
2  Dodge Challenger  3.520  16.87   0  15.5    8  318.0
3       AMC Javelin  3.435  17.30   0  15.2    8  304.0
4        Camaro Z28  3.840  15.41   0  13.3    8  350.0
5  Pontiac Firebird  3.845  17.05   0  19.2    8  400.0
6         Fiat X1-9  1.935  18.90   1  27.3    4   79.0
7     Porsche 914-2  2.140  16.70   0  26.0    4  120.3
8      Lotus Europa  1.513  16.90   1  30.4    4   95.1
9    Ford Pantera L  3.170  14.50   0  15.8    8  351.0
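
Column-wise concatenation aligns rows by index label, not by the `carname` values, so it only gives the right answer when both frames list the cars in the same order. Where that is not guaranteed, merging on the key is a safer way to get the same result. The sketch below uses made-up frames (`specs` and `economy` are hypothetical names) to show the difference.

```python
import pandas as pd

specs = pd.DataFrame({"carname": ["Fiat 128", "Honda Civic"], "wt": [2.2, 1.615]})
# Same cars, but listed in the opposite order.
economy = pd.DataFrame({"carname": ["Honda Civic", "Fiat 128"], "mpg": [30.4, 32.4]})

# Index-based pairing: index 0 of specs (Fiat) is paired with
# index 0 of economy (which holds Honda's mpg).
side_by_side = pd.concat([specs, economy.drop(columns="carname")], axis="columns")

# Key-based pairing: each car gets its own mpg regardless of row order.
by_key = specs.merge(economy, on="carname", how="left")

print(side_by_side)
print(by_key)
```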
