### Categories of Joins:
    The pd.merge() function implements several types of joins: one-to-one, many-to-one, and many-to-many.
    All three are accessed through an identical call to the pd.merge() interface;
    the type of join performed depends on the form of the input data.
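
    Because the join type is inferred from the data, it can help to state the expected type explicitly. The sketch below uses the validate argument of pd.merge() for that purpose; the frame names (employees, supervisors) are illustrative and not part of the notebook.

```python
import pandas as pd
from pandas.errors import MergeError

# Illustrative frames: 'group' repeats on the left and is unique on the right.
employees = pd.DataFrame({'Employee': ['Bob', 'Jack', 'Lisa', 'Sue'],
                          'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
supervisors = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                            'supervisor': ['Carly', 'Guido', 'Steve']})

# The data really is many-to-one on 'group', so this check passes.
merged = pd.merge(employees, supervisors, validate='many_to_one')

# Claiming one-to-one is wrong here (duplicate keys on the left),
# so pandas raises MergeError instead of silently merging.
try:
    pd.merge(employees, supervisors, validate='one_to_one')
    one_to_one_ok = True
except MergeError:
    one_to_one_ok = False

print(merged.shape)      # (4, 3)
print(one_to_one_ok)     # False
```

    The call itself is the same in both cases; only the validate value changes, which is why it works as a cheap assertion about the shape of the inputs.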

### One-to-one joins
    Perhaps the simplest type of merge is the one-to-one join, which is in many ways similar to column-wise concatenation.
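
    To make the comparison with column-wise concatenation concrete, here is a small sketch that assumes two frames listing the same employees in different orders. The names groups and hires are illustrative. pd.concat pairs rows by index label, while pd.merge pairs them by the shared key column.

```python
import pandas as pd

groups = pd.DataFrame({'Employee': ['Bob', 'Jack', 'Lisa', 'Sue'],
                       'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
hires = pd.DataFrame({'Employee': ['Lisa', 'Bob', 'Jack', 'Sue'],
                      'hire_date': [2004, 2008, 2012, 2014]})

# concat pairs rows by index label (0, 1, 2, 3) and ignores the Employee values,
# so Bob's row ends up next to Lisa's hire date.
side_by_side = pd.concat([groups, hires], axis=1)

# merge pairs rows by the shared 'Employee' key, whatever their order.
merged = pd.merge(groups, hires)

print(side_by_side)
print(merged)
```

    The two results agree only when both inputs are already aligned row for row, which is the sense in which a one-to-one join resembles concatenation.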

In [1]:
import numpy as np
import pandas as pd

In [10]:
# Creating employee DataFrame with group
df1 = pd.DataFrame({'Employee' : ['Bob', 'Jack', 'Lisa', 'Sue'], 'group' : ['Accounting', 'Engineering', 'Engineering', 'HR']})
df1

Unnamed: 0,Employee,group
0,Bob,Accounting
1,Jack,Engineering
2,Lisa,Engineering
3,Sue,HR


In [6]:
# Creating employee DataFrame with hire_date
df2 = pd.DataFrame({'Employee' : ['Lisa', 'Bob', 'Jack', 'Sue'], 'hire_date' : [2004, 2008, 2012, 2014]})
df2

Unnamed: 0,Employee,hire_date
0,Lisa,2004
1,Bob,2008
2,Jack,2012
3,Sue,2014


In [11]:
# Combine the two DataFrames (one-to-one join); pd.merge() uses the shared 'Employee' column as the key
df3 = pd.merge(df1, df2)
df3

Unnamed: 0,Employee,group,hire_date
0,Bob,Accounting,2008
1,Jack,Engineering,2012
2,Lisa,Engineering,2004
3,Sue,HR,2014


### Many-to-one joins
    Many-to-one joins are joins in which one of the two key columns contains duplicate entries.
    In the many-to-one case, the resulting DataFrame preserves those duplicate entries: values from the side with unique keys are repeated for every matching row on the other side.

In [8]:
df4 = pd.DataFrame({'group' : ['Accounting', 'Engineering', 'HR'], 'supervisor' : ['Carly', 'Guido', 'Steve']})
df4

Unnamed: 0,group,supervisor
0,Accounting,Carly
1,Engineering,Guido
2,HR,Steve


In [12]:
# Combine the two DataFrames (many-to-one join); each supervisor is repeated for every employee in the group
df5 = pd.merge(df3, df4)
df5

Unnamed: 0,Employee,group,hire_date,supervisor
0,Bob,Accounting,2008,Carly
1,Jack,Engineering,2012,Guido
2,Lisa,Engineering,2004,Guido
3,Sue,HR,2014,Steve


### Many-to-many joins
    Many-to-many joins are a bit confusing conceptually, but are nevertheless well defined.
    If the key columns in both the left and right DataFrames contain duplicates, the result is a many-to-many merge: every left row is paired with every right row that has the same key.
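
    One way to reason about a many-to-many merge is to count rows: for each key, the output contains (rows on the left) x (rows on the right). The sketch below checks that arithmetic on illustrative frames named left and right, which are assumptions and not part of the notebook.

```python
import pandas as pd

left = pd.DataFrame({'group': ['Accounting', 'Engineering', 'Engineering', 'HR'],
                     'Employee': ['Bob', 'Jack', 'Lisa', 'Sue']})
right = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering',
                                'Engineering', 'HR', 'HR'],
                      'skills': ['math', 'spread-sheets', 'coding',
                                 'linux', 'spread-sheets', 'organization']})

# Rows per key on each side; the per-key product gives the output rows for that key.
left_counts = left['group'].value_counts()
right_counts = right['group'].value_counts()
expected_rows = int((left_counts * right_counts).sum())

merged = pd.merge(left, right, on='group')

print(expected_rows, len(merged))  # 8 8
```

    The product grows quickly when both sides have many duplicates, so an unexpectedly large result is often a sign of an unintended many-to-many key.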

In [14]:
df6 = pd.DataFrame({'group' : ['Accounting', 'Accounting', 'Engineering', 'Engineering', 'HR', 'HR'], 'skills' : ['math', 'spread-sheets', 'coding', 'linux', 'spread-sheets', 'organization']})
df6

Unnamed: 0,group,skills
0,Accounting,math
1,Accounting,spread-sheets
2,Engineering,coding
3,Engineering,linux
4,HR,spread-sheets
5,HR,organization


In [16]:
# Combine the two DataFrames (many-to-many join)
df7 = pd.merge(df5, df6)
df7

Unnamed: 0,Employee,group,hire_date,supervisor,skills
0,Bob,Accounting,2008,Carly,math
1,Bob,Accounting,2008,Carly,spread-sheets
2,Jack,Engineering,2012,Guido,coding
3,Jack,Engineering,2012,Guido,linux
4,Lisa,Engineering,2004,Guido,coding
5,Lisa,Engineering,2004,Guido,linux
6,Sue,HR,2014,Steve,spread-sheets
7,Sue,HR,2014,Steve,organization


### Specification of the Merge Key:
    We’ve already seen the default behavior of pd.merge(): it looks for one or more matching column names between the two inputs and uses them as the key.
    Often, however, the column names will not match so nicely, and pd.merge() provides a variety of options for handling this.

### The on keyword:
    Most simply, you can explicitly specify the name of the key column using the on keyword, which takes a column name or a list of column names.
    This option works only if both the left and right DataFrames have the specified column name.
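
    The on keyword also accepts a list, in which case rows match only when every listed column agrees. The following sketch uses made-up sales and targets frames (hypothetical names and data) and also shows what happens when one side lacks the named column.

```python
import pandas as pd

# Hypothetical data: both frames share the columns 'region' and 'year'.
sales = pd.DataFrame({'region': ['East', 'East', 'West'],
                      'year': [2020, 2021, 2020],
                      'sales': [10, 12, 7]})
targets = pd.DataFrame({'region': ['East', 'East', 'West'],
                        'year': [2020, 2021, 2021],
                        'target': [9, 11, 8]})

# Rows match only when BOTH region and year agree, so ('West', 2020) is dropped.
merged = pd.merge(sales, targets, on=['region', 'year'])
print(merged)

# If the named column is missing from one input, pandas raises KeyError.
renamed = targets.rename(columns={'region': 'area'})
try:
    pd.merge(sales, renamed, on='region')
    missing_key_error = False
except KeyError:
    missing_key_error = True

print(missing_key_error)  # True
```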

In [19]:
pd.merge(df1, df2, on = 'Employee')

Unnamed: 0,Employee,group,hire_date
0,Bob,Accounting,2008
1,Jack,Engineering,2012
2,Lisa,Engineering,2004
3,Sue,HR,2014


### The left_on and right_on keywords:
    1. At times you may wish to merge two datasets whose key columns have different names.
    2. For example, we may have a dataset in which the employee name is labeled “Name” rather than “Employee”.
    3. In this case, we can use the left_on and right_on keywords to specify the two column names.
    4. The result has a redundant column that we can drop if desired, for example by using the drop() method of DataFrames.
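
    An alternative to left_on and right_on, sketched here as one possible approach rather than the notebook's method, is to rename the mismatched column first so that a plain on merge works and no redundant column appears. The frame names groups and salaries are illustrative.

```python
import pandas as pd

groups = pd.DataFrame({'Employee': ['Bob', 'Jack', 'Lisa', 'Sue'],
                       'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
salaries = pd.DataFrame({'Name': ['Bob', 'Jack', 'Lisa', 'Sue'],
                         'Salary': [70000, 80000, 120000, 90000]})

# rename() returns a new frame, so 'salaries' itself is left unchanged.
merged = pd.merge(groups, salaries.rename(columns={'Name': 'Employee'}), on='Employee')

print(list(merged.columns))  # ['Employee', 'group', 'Salary']
```

    Renaming keeps the result free of a duplicate key column, while left_on and right_on avoid touching the inputs at all; which is preferable depends on whether the original column names matter later on.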

In [20]:
df8 = pd.DataFrame({'Name' : ['Bob', 'Jack', 'Lisa', 'Sue'], 'Salary' : [70000, 80000, 120000, 90000]})
df8

Unnamed: 0,Name,Salary
0,Bob,70000
1,Jack,80000
2,Lisa,120000
3,Sue,90000


In [24]:
pd.merge(df1, df8, left_on = 'Employee', right_on = 'Name').drop('Name', axis = 1)

Unnamed: 0,Employee,group,Salary
0,Bob,Accounting,70000
1,Jack,Engineering,80000
2,Lisa,Engineering,120000
3,Sue,HR,90000


### The left_index and right_index keywords:
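    Sometimes, rather than merging on a column, you would instead like to merge on an index. The left_index and right_index flags tell pd.merge() to use the index of the corresponding input as the key.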

In [25]:
df1a = df1.set_index('Employee')
df1a

Employee,group
Bob,Accounting
Jack,Engineering
Lisa,Engineering
Sue,HR


In [27]:
df2a = df2.set_index('Employee')
df2a

Employee,hire_date
Lisa,2004
Bob,2008
Jack,2012
Sue,2014


In [31]:
pd.merge(df1a, df2a, left_index=True, right_index=True)

Employee,group,hire_date
Bob,Accounting,2008
Jack,Engineering,2012
Lisa,Engineering,2004
Sue,HR,2014
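
    Two related variations are sketched below under the assumption of the same kind of employee data: DataFrame.join(), which merges on the indices by default, and a mixed merge that uses the index on one side and a column on the other. The frame names groups, hires, and salaries are illustrative.

```python
import pandas as pd

groups = pd.DataFrame({'Employee': ['Bob', 'Jack', 'Lisa', 'Sue'],
                       'group': ['Accounting', 'Engineering', 'Engineering', 'HR']}
                      ).set_index('Employee')
hires = pd.DataFrame({'Employee': ['Lisa', 'Bob', 'Jack', 'Sue'],
                      'hire_date': [2004, 2008, 2012, 2014]}
                     ).set_index('Employee')
salaries = pd.DataFrame({'Name': ['Bob', 'Jack', 'Lisa', 'Sue'],
                         'Salary': [70000, 80000, 120000, 90000]})

# DataFrame.join() aligns on the indices by default (a left join on the caller).
joined = groups.join(hires)

# Mixed case: the index on the left is matched against the 'Name' column on the right.
mixed = pd.merge(groups, salaries, left_index=True, right_on='Name')

print(joined)
print(mixed)
```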


### Specifying Set Arithmetic for Joins:
    In all the preceding examples we have glossed over one important consideration in performing a join: the type of set arithmetic used in the join.
    This comes up when a value appears in one key column but not the other.

In [32]:
df9 = pd.DataFrame({'Name' : ['Peter', 'Paul', 'Mary'], 'Food' : ['Fish', 'Beans', 'Bread']})
df9

Unnamed: 0,Name,Food
0,Peter,Fish
1,Paul,Beans
2,Mary,Bread


In [33]:
df10 = pd.DataFrame({'Name' : ['Mary', 'Joseph'], 'Drink' : ['Wine', 'Beer']})
df10

Unnamed: 0,Name,Drink
0,Mary,Wine
1,Joseph,Beer


In [35]:
# Merging two DataFrames
# Equivalent to how = 'inner' (the default): only keys present in both inputs are kept
pd.merge(df9, df10)

Unnamed: 0,Name,Food,Drink
0,Mary,Bread,Wine


In [36]:
# Merging two DataFrames with how = 'inner'
pd.merge(df9, df10, how = 'inner')

Unnamed: 0,Name,Food,Drink
0,Mary,Bread,Wine


In [37]:
# An outer join returns a join over the union of the input keys and fills in all missing values with NAs.
pd.merge(df9, df10, how = 'outer')

Unnamed: 0,Name,Food,Drink
0,Peter,Fish,
1,Paul,Beans,
2,Mary,Bread,Wine
3,Joseph,,Beer


In [38]:
# The left join and right join return joins over the left entries and right entries, respectively
print("Merge DataFrames using how = 'left':\n",pd.merge(df9, df10, how = 'left'))
print("\nMerge DataFrames using how = 'right':\n", pd.merge(df9, df10, how = 'right'))

Merge DataFrames using how = 'left':
     Name   Food Drink
0  Peter   Fish   NaN
1   Paul  Beans   NaN
2   Mary  Bread  Wine

Merge DataFrames using how = 'right':
      Name   Food Drink
0    Mary  Bread  Wine
1  Joseph    NaN  Beer
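
    To see which input each row of a join came from, pd.merge() accepts indicator=True, which adds a '_merge' column. The sketch below applies it to the same kind of food and drink data; the variable names food, drink, and origin are illustrative.

```python
import pandas as pd

food = pd.DataFrame({'Name': ['Peter', 'Paul', 'Mary'],
                     'Food': ['Fish', 'Beans', 'Bread']})
drink = pd.DataFrame({'Name': ['Mary', 'Joseph'],
                      'Drink': ['Wine', 'Beer']})

# '_merge' is 'left_only', 'right_only', or 'both' for each output row.
outer = pd.merge(food, drink, how='outer', indicator=True)

# Map each name to where its key was found (row order may vary by pandas version).
origin = outer.set_index('Name')['_merge'].astype(str).to_dict()
print(origin)

# Keeping only the 'both' rows reproduces the inner join's keys.
inner_names = set(outer.loc[outer['_merge'] == 'both', 'Name'])
print(inner_names)  # {'Mary'}
```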


### Overlapping Column Names: The suffixes Keyword
    Finally, you may end up in a case where your two input DataFrames have conflicting column names. The suffixes keyword controls the labels appended to the conflicting columns in the output.

In [39]:
df11 = pd.DataFrame({'Name' : ['Bob', 'Jack', 'Lisa', 'Sue'], 'Rank' : [1, 2, 3, 4]})
df11

Unnamed: 0,Name,Rank
0,Bob,1
1,Jack,2
2,Lisa,3
3,Sue,4


In [40]:
df12 = pd.DataFrame({'Name' : ['Bob', 'Jack', 'Lisa', 'Sue'], 'Rank' : [3, 1, 4, 2]})
df12

Unnamed: 0,Name,Rank
0,Bob,3
1,Jack,1
2,Lisa,4
3,Sue,2


In [42]:
pd.merge(df11, df12, on = 'Name', suffixes = ['_L', '_R'])

Unnamed: 0,Name,Rank_L,Rank_R
0,Bob,1,3
1,Jack,2,1
2,Lisa,3,4
3,Sue,4,2
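
    For comparison, the sketch below shows what happens without the suffixes keyword, and why on = 'Name' matters for this data. The frame names rank_a and rank_b are illustrative.

```python
import pandas as pd

rank_a = pd.DataFrame({'Name': ['Bob', 'Jack', 'Lisa', 'Sue'], 'Rank': [1, 2, 3, 4]})
rank_b = pd.DataFrame({'Name': ['Bob', 'Jack', 'Lisa', 'Sue'], 'Rank': [3, 1, 4, 2]})

# Without suffixes, pandas disambiguates the clashing columns with '_x' and '_y'.
default = pd.merge(rank_a, rank_b, on='Name')
print(list(default.columns))  # ['Name', 'Rank_x', 'Rank_y']

# Without on='Name', BOTH shared columns become the key; no (Name, Rank)
# pair matches across the two frames here, so the inner join is empty.
implicit = pd.merge(rank_a, rank_b)
print(len(implicit))  # 0
```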
