# PANDAS MERGE

`pandas.merge(df1, df2)` is equivalent to `df1.merge(df2)`.

Data Source: https://towardsdatascience.com/all-the-pandas-merge-you-should-know-for-combining-datasets-526b9ecaf184

In [39]:
import pandas as pd

customer=pd.DataFrame({
    'id':[1,2,3,4,5,6,7,8,9],
    'name':['Olivia','Aditya','Cory','Isabell','Dominic','Tyler','Samuel','Daniel','Jeremy'],
    'age':[20,25,15,10,30,65,35,18,23],
    'Product_ID':[101,0,106,0,103,104,0,0,107],
    'Purchased_Product':['Watch','NA','Oil','NA','Shoes','Smartphone','NA','NA','Laptop'],
    'City':['Mumbai','Delhi','Bangalore','Chennai','Chennai','Delhi','Kolkata','Delhi','Mumbai']
})

product=pd.DataFrame({
    'Product_ID':[101,102,103,104,105,106,107],
    'Product_name':['Watch','Bag', 'Shoes', 'Smartphone', 'Books', 'Oil', 'Laptop'],
    'Category': ['Fashion', 'Fashion', 'Fashion', 'Electronics', 'Study', 'Grocery', 'Electronics'],
    'Price': [299.0,1350.5,2999.0,14999.0,145.0,110.0,79999.0],
    'Seller_City':['Delhi','Mumbai','Chennai','Kolkata','Delhi','Chennai', 'Bangalore']
})
display(customer, product)

Unnamed: 0,id,name,age,Product_ID,Purchased_Product,City
0,1,Olivia,20,101,Watch,Mumbai
1,2,Aditya,25,0,,Delhi
2,3,Cory,15,106,Oil,Bangalore
3,4,Isabell,10,0,,Chennai
4,5,Dominic,30,103,Shoes,Chennai
5,6,Tyler,65,104,Smartphone,Delhi
6,7,Samuel,35,0,,Kolkata
7,8,Daniel,18,0,,Delhi
8,9,Jeremy,23,107,Laptop,Mumbai


Unnamed: 0,Product_ID,Product_name,Category,Price,Seller_City
0,101,Watch,Fashion,299.0,Delhi
1,102,Bag,Fashion,1350.5,Mumbai
2,103,Shoes,Fashion,2999.0,Chennai
3,104,Smartphone,Electronics,14999.0,Kolkata
4,105,Books,Study,145.0,Delhi
5,106,Oil,Grocery,110.0,Chennai
6,107,Laptop,Electronics,79999.0,Bangalore


**INNER JOIN**

Returns only the rows whose key values appear in both DataFrames.

An inner join keeps a row only when its value in the join column(s) has a match in the other DataFrame. This is similar to the intersection of two sets.
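
The set-intersection analogy can be checked directly. The following is a minimal sketch on two small hypothetical frames (`left`, `right`, and their columns are illustrative names, not part of this notebook's data): the keys that survive an inner join should be exactly the keys present in both frames.

```python
import pandas as pd

# Hypothetical toy frames, used only for illustration
left = pd.DataFrame({'key': [1, 2, 3], 'left_val': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'right_val': ['x', 'y', 'z']})

# how='inner' is the default, so it is not passed explicitly
inner = pd.merge(left, right, on='key')

# Keys shared by both frames, computed with plain Python sets
common_keys = set(left['key']) & set(right['key'])

print(inner)
print(set(inner['key']) == common_keys)
```

Keys 1 and 4 each appear in only one frame, so they are dropped from the result.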

In [2]:
# merge() performs an inner join by default.
# It takes two DataFrames and the column name(s) to join on.

# Showing all the products sold online and who purchased them
merged_Df = pd.merge(product, customer, on= ['Product_ID'])
# Equivalent to product.merge(customer, on= ['Product_ID'])
display(merged_Df)

print('-------------')

# Showing Product and Buyer from the same location
# This could be done with the 'on' parameter alone if both DataFrames used the same name for the city column
merged_Df = pd.merge(product, customer, left_on= ['Product_ID','Seller_City'], right_on= ['Product_ID','City'])  
display(merged_Df)

Unnamed: 0,Product_ID,Product_name,Category,Price,Seller_City,id,name,age,Purchased_Product,City
0,101,Watch,Fashion,299.0,Delhi,1,Olivia,20,Watch,Mumbai
1,103,Shoes,Fashion,2999.0,Chennai,5,Dominic,30,Shoes,Chennai
2,104,Smartphone,Electronics,14999.0,Kolkata,6,Tyler,65,Smartphone,Delhi
3,106,Oil,Grocery,110.0,Chennai,3,Cory,15,Oil,Bangalore
4,107,Laptop,Electronics,79999.0,Bangalore,9,Jeremy,23,Laptop,Mumbai


-------------


Unnamed: 0,Product_ID,Product_name,Category,Price,Seller_City,id,name,age,Purchased_Product,City
0,103,Shoes,Fashion,2999.0,Chennai,5,Dominic,30,Shoes,Chennai


In [3]:
# In case of different column names
merged_Df = pd.merge(product, customer, left_on= 'Product_name', right_on= 'Purchased_Product')
display(merged_Df)

Unnamed: 0,Product_ID_x,Product_name,Category,Price,Seller_City,id,name,age,Product_ID_y,Purchased_Product,City
0,101,Watch,Fashion,299.0,Delhi,1,Olivia,20,101,Watch,Mumbai
1,103,Shoes,Fashion,2999.0,Chennai,5,Dominic,30,103,Shoes,Chennai
2,104,Smartphone,Electronics,14999.0,Kolkata,6,Tyler,65,104,Smartphone,Delhi
3,106,Oil,Grocery,110.0,Chennai,3,Cory,15,106,Oil,Bangalore
4,107,Laptop,Electronics,79999.0,Bangalore,9,Jeremy,23,107,Laptop,Mumbai


In [4]:
product.merge(customer, how='inner', left_on ='Product_name', right_on ='Purchased_Product', indicator= True)

Unnamed: 0,Product_ID_x,Product_name,Category,Price,Seller_City,id,name,age,Product_ID_y,Purchased_Product,City,_merge
0,101,Watch,Fashion,299.0,Delhi,1,Olivia,20,101,Watch,Mumbai,both
1,103,Shoes,Fashion,2999.0,Chennai,5,Dominic,30,103,Shoes,Chennai,both
2,104,Smartphone,Electronics,14999.0,Kolkata,6,Tyler,65,104,Smartphone,Delhi,both
3,106,Oil,Grocery,110.0,Chennai,3,Cory,15,106,Oil,Bangalore,both
4,107,Laptop,Electronics,79999.0,Bangalore,9,Jeremy,23,107,Laptop,Mumbai,both


**FULL JOIN**

A full join, also known as a full outer join, returns every row from both DataFrames: rows whose keys match are combined, and unmatched rows from either side are kept as well.

When a row has no match in the other DataFrame, the columns coming from that other DataFrame are filled with NaN.
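
As a small illustration of the NaN filling described above, here is a sketch on two hypothetical frames (`left`, `right`, `qty`, and `price` are made-up names for this example). It also shows a side effect that is visible in this notebook's outputs: an integer column that receives NaN is converted to float.

```python
import pandas as pd

# Hypothetical toy frames, used only for illustration
left = pd.DataFrame({'key': [1, 2], 'qty': [10, 20]})
right = pd.DataFrame({'key': [2, 3], 'price': [5, 7]})

# indicator=True adds a '_merge' column telling where each row came from
outer = left.merge(right, on='key', how='outer', indicator=True)
print(outer)

# 'qty' and 'price' started as integers but now hold NaN, so they become float
print(outer.dtypes)
```

Key 1 exists only on the left and key 3 only on the right, so each of those rows has one NaN.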

In [40]:
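# Note: indicator='True' is a string, so the indicator column is named 'True'; indicator=True would name it '_merge'
product.merge(customer, how='outer', left_on ='Product_ID', right_on ='Product_ID', indicator= 'True', validate= '1:m')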

Unnamed: 0,Product_ID,Product_name,Category,Price,Seller_City,id,name,age,Purchased_Product,City,True
0,101,Watch,Fashion,299.0,Delhi,1.0,Olivia,20.0,Watch,Mumbai,both
1,102,Bag,Fashion,1350.5,Mumbai,,,,,,left_only
2,103,Shoes,Fashion,2999.0,Chennai,5.0,Dominic,30.0,Shoes,Chennai,both
3,104,Smartphone,Electronics,14999.0,Kolkata,6.0,Tyler,65.0,Smartphone,Delhi,both
4,105,Books,Study,145.0,Delhi,,,,,,left_only
5,106,Oil,Grocery,110.0,Chennai,3.0,Cory,15.0,Oil,Bangalore,both
6,107,Laptop,Electronics,79999.0,Bangalore,9.0,Jeremy,23.0,Laptop,Mumbai,both
7,0,,,,,2.0,Aditya,25.0,,Delhi,right_only
8,0,,,,,4.0,Isabell,10.0,,Chennai,right_only
9,0,,,,,7.0,Samuel,35.0,,Kolkata,right_only
10,0,,,,,8.0,Daniel,18.0,,Delhi,right_only


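**LEFT JOIN**

Returns every row of the left DataFrame, combined with the matching rows of the right DataFrame. Left rows without a match get NaN in the right DataFrame's columns.

**RIGHT JOIN**

Returns every row of the right DataFrame, combined with the matching rows of the left DataFrame. Right rows without a match get NaN in the left DataFrame's columns.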
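
The difference between the two is easiest to see when each side has a key the other lacks. The sketch below uses two hypothetical frames (`orders`, `stock`, and their columns are illustrative names, not part of this notebook's data) and the same merge call with only `how` changed.

```python
import pandas as pd

# Hypothetical toy frames: key 1 exists only in orders, key 3 only in stock
orders = pd.DataFrame({'key': [1, 2], 'product': ['pen', 'ink']})
stock = pd.DataFrame({'key': [2, 3], 'units': [4, 9]})

# Left join: all rows of orders are kept; key 1 has no stock row, so 'units' is NaN
left_join = orders.merge(stock, on='key', how='left', indicator=True)

# Right join: all rows of stock are kept; key 3 has no order row, so 'product' is NaN
right_join = orders.merge(stock, on='key', how='right', indicator=True)

print(left_join)
print(right_join)
```

In both results the `_merge` column marks which rows were matched and which came from one side only.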

In [6]:
df_customer = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'name': ['Tom', 'Jenny', 'James', 'Dan'],
})
df_info = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'age': [31, 20, 40, 70],
    'sex': ['F', 'M', 'M', 'F']
})
display(df_customer, df_info)

Unnamed: 0,id,name
0,1,Tom
1,1,Jenny
2,2,James
3,2,Dan


Unnamed: 0,id,age,sex
0,1,31,F
1,1,20,M
2,2,40,M
3,2,70,F


In [7]:
# Collapse the names that share an id into one comma-separated string per id
temp_df = df_customer.groupby('id')['name'].apply(','.join).reset_index(drop= True)
temp_df

0    Tom,Jenny
1    James,Dan
Name: name, dtype: object

In [8]:
# Left join: each id appears twice in both DataFrames, so every id yields 2 x 2 = 4 rows
pd.merge(df_customer, df_info, how='left', on='id', sort= True, indicator= True)

Unnamed: 0,id,name,age,sex,_merge
0,1,Tom,31,F,both
1,1,Tom,20,M,both
2,1,Jenny,31,F,both
3,1,Jenny,20,M,both
4,2,James,40,M,both
5,2,James,70,F,both
6,2,Dan,40,M,both
7,2,Dan,70,F,both


In [29]:
# Right join
pd.merge(df_customer, df_info, how='right', on='id', sort= True, indicator= True, validate= 'm:m')

Unnamed: 0,id,name,age,sex,_merge
0,1,Tom,31,F,both
1,1,Jenny,31,F,both
2,1,Tom,20,M,both
3,1,Jenny,20,M,both
4,2,James,40,M,both
5,2,Dan,40,M,both
6,2,James,70,F,both
7,2,Dan,70,F,both


**1.** The `on`, `left_on`, and `right_on` arguments also accept a list of column names, as in the second inner join example.

**2.** The `validate` parameter of `merge()` checks the relationship between the join keys and raises a `MergeError` if the check fails.
> 1. `'one_to_one'` (`'1:1'`): checks that the keys are unique in both DataFrames.
> 2. `'many_to_one'` (`'m:1'`): checks that the keys are unique in the right DataFrame.
> 3. `'one_to_many'` (`'1:m'`): checks that the keys are unique in the left DataFrame.
> 4. `'many_to_many'` (`'m:m'`): allowed, but performs no check.
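
To see `validate` in action, here is a minimal sketch on two hypothetical frames (`unique_keys` and `repeated_keys` are illustrative names). The same merge passes with `'1:m'` and is rejected with `'1:1'`, because the right frame repeats an id.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical toy frames: ids are unique on the left, repeated on the right
unique_keys = pd.DataFrame({'id': [1, 2], 'name': ['Tom', 'James']})
repeated_keys = pd.DataFrame({'id': [1, 1, 2], 'age': [31, 20, 40]})

# Keys are unique on the left side, so the '1:m' check passes
ok = unique_keys.merge(repeated_keys, on='id', validate='1:m')
print(ok)

# '1:1' requires unique keys on both sides, so pandas raises MergeError
try:
    unique_keys.merge(repeated_keys, on='id', validate='1:1')
    failed = False
except MergeError as err:
    failed = True
    print('validate rejected the merge:', err)
```

Running the check at merge time catches unexpected duplicate keys before they silently multiply rows in the result.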