# Working with DataFrames Part 2

## Merge DataFrames

In [1]:
# For more info on merge parameters check out:
import webbrowser
url = 'http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.merge.html'
webbrowser.open(url)
# Merging on the index is covered later in this notebook.

True

In [2]:
# Basic Imports
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [3]:
# Let's make a df
df1 = DataFrame({'key':['X','Z','Y','Z','X','X'],'dataset_1': np.arange(6)})

#Show
df1

Unnamed: 0,key,dataset_1
0,X,0
1,Z,1
2,Y,2
3,Z,3
4,X,4
5,X,5


In [5]:
# Let's make another DataFrame
# Note: each key appears only once in this one

df2 = DataFrame({'key':['Q','Y','Z'],'dataset_2':[1,2,3]})

#Show
df2

Unnamed: 0,key,dataset_2
0,Q,1
1,Y,2
2,Z,3


## Merge the two DataFrames

### A "many-to-one" case.
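**Note:** In a many-to-one merge, the key values may repeat in one DataFrame (here df1) but must be unique in the other (here df2). If keys repeat in BOTH DataFrames, the merge becomes many-to-many, which is covered further down.

When no `on` argument is given, `pd.merge` uses the column names that the two DataFrames have in common as the merge keys.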

In [6]:
# Merge 
pd.merge(df1, df2)

Unnamed: 0,key,dataset_1,dataset_2
0,Z,1,3
1,Z,3,3
2,Y,2,2


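**Note:** The result contains no 'X' rows. By default `pd.merge` performs an inner join, so only keys present in both DataFrames are kept: 'X' (only in df1) and 'Q' (only in df2) are dropped.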
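To make the defaults visible, here is a minimal sketch that spells them out. The variable names `common_cols`, `default_merge`, `explicit_merge` and `dropped_keys` are only illustrative and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'key': ['X', 'Z', 'Y', 'Z', 'X', 'X'], 'dataset_1': np.arange(6)})
df2 = pd.DataFrame({'key': ['Q', 'Y', 'Z'], 'dataset_2': [1, 2, 3]})

# Column names present in both frames; these are used when `on` is omitted
common_cols = df1.columns.intersection(df2.columns).tolist()

# The bare call and the fully spelled-out call
default_merge = pd.merge(df1, df2)
explicit_merge = pd.merge(df1, df2, on=common_cols, how='inner')

# Keys that appear in only one of the two frames are dropped by the inner join
dropped_keys = set(df1['key']) ^ set(df2['key'])
```

Both calls should give the same three rows, and `dropped_keys` holds the keys that did not survive the merge.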

We could also have specified which column to merge on.

In [7]:
# merge and specify the columns

pd.merge(df1, df2, on='key')

Unnamed: 0,key,dataset_1,dataset_2
0,Z,1,3
1,Z,3,3
2,Y,2,2
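If you want pandas to check the many-to-one assumption for you, `pd.merge` accepts a `validate` argument. The sketch below assumes a pandas version that supports `validate` and exposes `MergeError` in `pandas.errors`; the names `checked` and `one_to_one_ok` are illustrative.

```python
import numpy as np
import pandas as pd
from pandas.errors import MergeError

df1 = pd.DataFrame({'key': ['X', 'Z', 'Y', 'Z', 'X', 'X'], 'dataset_1': np.arange(6)})
df2 = pd.DataFrame({'key': ['Q', 'Y', 'Z'], 'dataset_2': [1, 2, 3]})

# Passes: the keys in the right frame (df2) are unique
checked = pd.merge(df1, df2, on='key', validate='many_to_one')

# Fails: the left frame (df1) repeats 'X' and 'Z', so the merge is not one-to-one
try:
    pd.merge(df1, df2, on='key', validate='one_to_one')
    one_to_one_ok = True
except MergeError:
    one_to_one_ok = False
```

This is handy as a guard when a lookup table is supposed to have unique keys.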


We can choose which DataFrame's keys to keep with the `how` argument. In this case we keep the left (df1) keys.

In [10]:
# Left merge: keep every key from df1

pd.merge(df1,df2, on='key', how='left')

Unnamed: 0,key,dataset_1,dataset_2
0,X,0,
1,Z,1,3.0
2,Y,2,2.0
3,Z,3,3.0
4,X,4,
5,X,5,
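Notice that `dataset_2` now shows 3.0 instead of 3. Rows without a match are filled with NaN, and that turns the integer column into floats. A small sketch of how to inspect this follows; filling the gaps with 0 is just an assumption for illustration, not something the data calls for.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'key': ['X', 'Z', 'Y', 'Z', 'X', 'X'], 'dataset_1': np.arange(6)})
df2 = pd.DataFrame({'key': ['Q', 'Y', 'Z'], 'dataset_2': [1, 2, 3]})

left = pd.merge(df1, df2, on='key', how='left')

# Number of df1 rows that found no partner in df2 (the 'X' rows)
missing = int(left['dataset_2'].isna().sum())

# The column is float because NaN cannot be stored in a plain integer column
dtype_name = str(left['dataset_2'].dtype)

# Hypothetical cleanup: replace NaN with 0 and go back to integers
filled = left['dataset_2'].fillna(0).astype(int)
```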


Now we will keep the right (df2) keys instead.

In [11]:
# Choosing the right (df2) keys

pd.merge(df1,df2,on='key', how='right')

Unnamed: 0,key,dataset_1,dataset_2
0,Z,1.0,3
1,Z,3.0,3
2,Y,2.0,2
3,Q,,1


Choosing `how='outer'` keeps the union of the keys.

**outer:** Use union of keys from both frames, similar to a SQL full outer join. Rows without a match on one side get NaN in that side's columns.

In [13]:
# The "outer" method 

pd.merge(df1,df2, on='key', how='outer')

Unnamed: 0,key,dataset_1,dataset_2
0,X,0.0,
1,X,4.0,
2,X,5.0,
3,Z,1.0,3.0
4,Z,3.0,3.0
5,Y,2.0,2.0
6,Q,,1.0
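To see where each row of an outer merge came from, `pd.merge` can add a `_merge` column via `indicator=True`. A minimal sketch follows (the name `origin_counts` is illustrative); the row order of the merged frame may differ between pandas versions, so only the counts are checked here.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'key': ['X', 'Z', 'Y', 'Z', 'X', 'X'], 'dataset_1': np.arange(6)})
df2 = pd.DataFrame({'key': ['Q', 'Y', 'Z'], 'dataset_2': [1, 2, 3]})

# indicator=True adds a '_merge' column with 'left_only', 'right_only' or 'both'
outer = pd.merge(df1, df2, on='key', how='outer', indicator=True)

# Count how many rows came from each side
origin_counts = outer['_merge'].astype(str).value_counts().to_dict()
```

Here the three 'X' rows are `left_only`, the 'Q' row is `right_only`, and the 'Y' and 'Z' rows are `both`.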


### A many-to-many merge case

**Note** that in this case **BOTH** DataFrames contain **more than one instance** of some keys.

In [11]:
# Create two DataFrames with repeated keys
df3 = DataFrame({'key': ['X', 'X', 'X', 'Y', 'Z', 'Z'],
                 'dataset_3': range(6)})

df4 = DataFrame({'key': ['Y', 'Y', 'X', 'X', 'Z'],
                 'dataset_4': range(5)})

A helper function to display DataFrames side by side:

In [17]:
from IPython.display import display_html

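# Render each DataFrame as HTML and show the tables inline, next to each other
def display_side_by_side(*args):
    html_str = ''
    for df in args:
        html_str += df.to_html()
    # Only patch the opening <table tags, not the closing ones or cell text
    display_html(html_str.replace('<table', '<table style="display:inline"'), raw=True)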

In [18]:
display_side_by_side(df3, df4)

Unnamed: 0,key,dataset_3
0,X,0
1,X,1
2,X,2
3,Y,3
4,Z,4
5,Z,5

Unnamed: 0,key,dataset_4
0,Y,0
1,Y,1
2,X,2
3,X,3
4,Z,4


Now the many-to-many merge:

In [17]:
#Show the merge
pd.merge(df3, df4)

Unnamed: 0,key,dataset_3,dataset_4
0,X,0,2
1,X,0,3
2,X,1,2
3,X,1,3
4,X,2,2
5,X,2,3
6,Y,3,0
7,Y,3,1
8,Z,4,4
9,Z,5,4


So what happened? A many-to-many merge results in the **product of the matching rows** for each key. Because there were 3 'X' rows in df3 and 2 'X' rows in df4, the result has 6 'X' rows **(3 × 2 = 6)!** Note how each of df3's 'X' values (0, 1, 2) is paired with each of df4's 'X' values (2, 3).
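The row counts can be checked by hand: count each key in both DataFrames and multiply. This is only a sketch; `left_counts`, `right_counts`, `expected_per_key` and `actual_per_key` are illustrative names.

```python
import pandas as pd

df3 = pd.DataFrame({'key': ['X', 'X', 'X', 'Y', 'Z', 'Z'], 'dataset_3': range(6)})
df4 = pd.DataFrame({'key': ['Y', 'Y', 'X', 'X', 'Z'], 'dataset_4': range(5)})

# How often each key occurs in each frame
left_counts = df3['key'].value_counts()
right_counts = df4['key'].value_counts()

# Multiply the counts per key (pandas aligns the two Series on the key)
expected_per_key = (left_counts * right_counts).dropna().astype(int)

# Compare with what the merge actually produced
merged = pd.merge(df3, df4)
actual_per_key = merged['key'].value_counts()
```

For this data that is 3 × 2 rows for 'X', 1 × 2 for 'Y' and 2 × 1 for 'Z', which adds up to the 10 rows shown above.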

## Merge on index

Create two new DataFrames. The second one carries its keys in the index instead of a column.

In [19]:
# Let's create two DataFrames
df_left = DataFrame({'key': ['X','Y','Z','X','Y'], 'data': range(5)})

df_right = DataFrame({'group_data': [10, 20]}, index=['X', 'Y'])

In [20]:
# Show both DataFrames
display_side_by_side(df_left, df_right)

Unnamed: 0,key,data
0,X,0
1,Y,1
2,Z,2
3,X,3
4,Y,4

Unnamed: 0,group_data
X,10
Y,20


### Merge 

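We will use the `key` column of the left DataFrame and the index of the right DataFrame. The result is again an inner join, so the 'Z' row is dropped, and the rows keep the index labels of the left DataFrame.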

In [21]:
# merge on key and index

pd.merge(df_left, df_right, left_on='key', right_index=True)

Unnamed: 0,key,data,group_data
0,X,0,10
3,X,3,10
1,Y,1,20
4,Y,4,20
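The same result can also be written with `DataFrame.join`, which matches a column of the calling frame against the index of the other frame. A minimal sketch, with `merged` and `joined` as illustrative names; note that `join` defaults to `how='left'`, so `how='inner'` is passed to match the merge above.

```python
import pandas as pd

df_left = pd.DataFrame({'key': ['X', 'Y', 'Z', 'X', 'Y'], 'data': range(5)})
df_right = pd.DataFrame({'group_data': [10, 20]}, index=['X', 'Y'])

# Column 'key' on the left, index on the right
merged = pd.merge(df_left, df_right, left_on='key', right_index=True)

# Equivalent call using join; how='inner' drops the unmatched 'Z' row
joined = df_left.join(df_right, on='key', how='inner')
```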


#### We can also get the union of keys by passing how='outer'

**outer:** Use union of keys from both frames, similar to a SQL full outer join

In [22]:
# Union using outer
pd.merge(df_left, df_right, left_on='key', right_index=True, how='outer')

Unnamed: 0,key,data,group_data
0,X,0,10.0
3,X,3,10.0
1,Y,1,20.0
4,Y,4,20.0
2,Z,2,


# Let's exercise!