## Merging Data in Pandas

Data is often scattered across multiple files or tables, and you may need to combine it into a single DataFrame. For example, if you have one table per day and need to analyse a month's data, you must combine about 30 files so that all the data is available together.

## Concatenate vs Merge
Concatenate: use when both DataFrames have the same structure (the same columns) and you just need to append rows.
Merge: use when you have two different DataFrames and want to combine their rows based on common columns or indices.

### 1. Concatenate

In [36]:
# A dict usually maps each key to a single value, {key_a: value_a, key_b: value_b},
# but a value can also be a list. Here each list becomes one column of the DataFrame.

import pandas as pd
df_product = pd.DataFrame({'product_code':[101,102,103,104],
                           'product_desc':['laptop','pc','ipad','watch']})
df_product

Unnamed: 0,product_code,product_desc
0,101,laptop
1,102,pc
2,103,ipad
3,104,watch


In [37]:
df_product1 = pd.DataFrame({'product_code':[105,106,107],
                           'product_desc':['speaker','ups','camera']})

In [38]:
df_product1

Unnamed: 0,product_code,product_desc
0,105,speaker
1,106,ups
2,107,camera


In [39]:
# need to create a master list of all these products
df_master = pd.concat([df_product,df_product1])

In [40]:
df_master

Unnamed: 0,product_code,product_desc
0,101,laptop
1,102,pc
2,103,ipad
3,104,watch
0,105,speaker
1,106,ups
2,107,camera


In [41]:
# After concatenation the index above repeats (0-3, then 0-2) because each DataFrame keeps its own index.
# Pass ignore_index=True to discard the original indices and assign a new 0..n-1 index.
df_master1 = pd.concat([df_product,df_product1],ignore_index=True)
df_master1

Unnamed: 0,product_code,product_desc
0,101,laptop
1,102,pc
2,103,ipad
3,104,watch
4,105,speaker
5,106,ups
6,107,camera
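The same pattern extends to the "one table per day" case from the introduction, since `pd.concat` accepts a list of any length. The block below is a minimal sketch: the daily tables are built in memory with made-up values, and the names `daily_frames`, `df_month`, `date` and `units_sold` are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical daily tables: one small DataFrame per day, all with the same columns.
daily_frames = []
for day in range(1, 4):
    daily_frames.append(pd.DataFrame({
        'date': [f'2024-01-{day:02d}'] * 2,
        'product_code': [101, 102],
        'units_sold': [day, day * 2],   # made-up values
    }))

# Stack all daily tables into one DataFrame with a fresh 0..n-1 index.
df_month = pd.concat(daily_frames, ignore_index=True)
print(df_month.shape)  # (6, 3): 3 days x 2 rows, 3 columns
```

In practice each element of the list would typically come from reading one file, but the concatenation step stays the same.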


### 2. Merging
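Merging combines DataFrames on common columns or indices. The join types are:

1. Inner Join - keeps only the entries whose key is present in both DataFrames.
2. Left Outer Join - keeps all entries from the left DataFrame, plus matching data from the right one.
3. Right Outer Join - keeps all entries from the right DataFrame, plus matching data from the left one.
4. Full Outer Join - keeps all entries from both DataFrames.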
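To make the difference between the four join types concrete, here is a small sketch that merges the same two toy DataFrames with each `how` value and counts the resulting rows. The frames `left` and `right` and their columns are hypothetical and chosen only so that the keys partly overlap.

```python
import pandas as pd

# Toy frames (hypothetical): keys 2 and 3 are common, 1 is left-only, 4 is right-only.
left = pd.DataFrame({'key': [1, 2, 3], 'left_val': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'right_val': ['x', 'y', 'z']})

# Merge once per join type and record how many rows each one returns.
row_counts = {how: len(left.merge(right, how=how, on='key'))
              for how in ['inner', 'left', 'right', 'outer']}
print(row_counts)  # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}
```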

In [42]:
df_invoice = pd.DataFrame({'product_code':[101,101,104,107],
                           'invoice_amount':[600,900,899,700]})
df_invoice

Unnamed: 0,product_code,invoice_amount
0,101,600
1,101,900
2,104,899
3,107,700


### Inner join

In [43]:
# We also want the product description for each invoiced product.
# The description is available in another DataFrame, df_product.
# If the key column has a different name in each DataFrame (but the same content), use the left_on and right_on parameters instead of on.

df_invoice.merge(right=df_product,how='inner',on='product_code')
# The DataFrame that merge() is called on is treated as the left DataFrame.
# The right= argument specifies the second DataFrame to merge with.

Unnamed: 0,product_code,invoice_amount,product_desc
0,101,600,laptop
1,101,900,laptop
2,104,899,watch


Note: product_code 107 is present in df_invoice but missing from the result above because we used an inner join, which keeps only the product codes present in both tables.
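The `left_on` and `right_on` parameters mentioned in the cell comments can be sketched as follows. This assumes a hypothetical product table, `df_product_alt`, whose key column is named `code` rather than `product_code`; that frame is not part of the original notebook.

```python
import pandas as pd

df_invoice = pd.DataFrame({'product_code': [101, 101, 104, 107],
                           'invoice_amount': [600, 900, 899, 700]})

# Hypothetical product table: same content as df_product, but the key column is named 'code'.
df_product_alt = pd.DataFrame({'code': [101, 102, 103, 104],
                               'product_desc': ['laptop', 'pc', 'ipad', 'watch']})

# Name the key column of each side explicitly instead of using on=.
merged = df_invoice.merge(right=df_product_alt, how='inner',
                          left_on='product_code', right_on='code')

# Both key columns are kept in the result, so drop the redundant one.
merged = merged.drop(columns='code')
print(merged)
```

The result should match the inner join above: two rows for product 101 and one for product 104.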

### Left Join

In [44]:
df_invoice.merge(right=df_product,how='left',on='product_code')
# all entries from the left table, plus matching data from the right table

Unnamed: 0,product_code,invoice_amount,product_desc
0,101,600,laptop
1,101,900,laptop
2,104,899,watch
3,107,700,


### Right Join

In [46]:
df_invoice.merge(right=df_product,how='right',on='product_code')
# all entries from the right table, plus matching data from the left table

Unnamed: 0,product_code,invoice_amount,product_desc
0,101,600.0,laptop
1,101,900.0,laptop
2,102,,pc
3,103,,ipad
4,104,899.0,watch


### Full Outer Join

In [48]:
df_invoice.merge(right=df_product,how='outer',on='product_code')

Unnamed: 0,product_code,invoice_amount,product_desc
0,101,600.0,laptop
1,101,900.0,laptop
2,104,899.0,watch
3,107,700.0,
4,102,,pc
5,103,,ipad
