## Content

- **Joins in merge**
    - One-to-One
    - Many-to-One
    - Many-to-Many

- **Concatenation**
    - `df.append()`

In [None]:
import pandas as pd
import numpy as np

## Joins in Merge



### 1. One-to-One join
  - Each key value appears exactly once in both DataFrames
  - The result is similar to column-wise concatenation

In [None]:
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
print(df1)
print(df2)

  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR
  employee  hire_date
0     Lisa       2004
1      Bob       2008
2     Jake       2012
3      Sue       2014


In [None]:
df3 = pd.merge(df1, df2)
df3

  employee        group  hire_date
0      Bob   Accounting       2008
1     Jake  Engineering       2012
2     Lisa  Engineering       2004
3      Sue           HR       2014



  - The `pd.merge()` function recognizes that each DataFrame has an `employee` column and automatically joins on it as the key.
  
  - The result of the merge is a new DataFrame that combines the information from the two inputs.
  
  - The rows of the two inputs do not need to be in the same order. <br> Here the `employee` column is ordered differently in `df1` and `df2`, and `pd.merge()` matches rows by key value rather than by position.
  
  - Keep in mind that the merge generally discards the index of both inputs, except in the special case of merging on the index.
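The points above can be checked directly. The following is a minimal sketch that rebuilds the two frames from this section. Passing `on='employee'` is just the explicit form of what `pd.merge()` infers here, and the string index labels on `df1` are hypothetical, added only to make the discarded index visible.

```python
import pandas as pd

# Same data as above; the 'a'..'d' index labels are an illustrative assumption
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']},
                   index=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

implicit = pd.merge(df1, df2)                 # key column inferred
explicit = pd.merge(df1, df2, on='employee')  # key column named explicitly

# Both calls give the same frame
same_result = implicit.equals(explicit)

# The 'a'..'d' labels are gone; the result has a fresh integer index
result_index = list(explicit.index)

# Rows are matched by key value, not by position
bob_hire_date = explicit.loc[explicit['employee'] == 'Bob', 'hire_date'].iloc[0]

print(same_result, result_index, bob_hire_date)
```

Naming the key with `on=` is often worth the extra typing when the inputs share more than one column name, since the inferred key would then include all of them.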

### 2. Many-to-one join

  - One of the two key columns contains duplicate entries.
  - In the result, the matching row from the other DataFrame is repeated for every duplicate key, so the duplicate entries are preserved.


In [None]:
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})

In [None]:
df3

  employee        group  hire_date
0      Bob   Accounting       2008
1     Jake  Engineering       2012
2     Lisa  Engineering       2004
3      Sue           HR       2014


In [None]:
df4

         group supervisor
0   Accounting      Carly
1  Engineering      Guido
2           HR      Steve


In [None]:
pd.merge(df3, df4)

  employee        group  hire_date supervisor
0      Bob   Accounting       2008      Carly
1     Jake  Engineering       2012      Guido
2     Lisa  Engineering       2004      Guido
3      Sue           HR       2014      Steve
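If the many-to-one shape is something you rely on, `pd.merge()` can check it for you through its `validate` parameter. The sketch below rebuilds `df3` and `df4` by hand so it runs on its own; the variable names `merged` and `one_to_one_ok` are illustrative.

```python
import pandas as pd

df3 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR'],
                    'hire_date': [2008, 2012, 2004, 2014]})
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})

# Passes: 'group' repeats on the left but is unique on the right
merged = pd.merge(df3, df4, on='group', validate='many_to_one')

# Asking for a one-to-one merge fails, because 'Engineering'
# appears twice in the left key column
try:
    pd.merge(df3, df4, on='group', validate='one_to_one')
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False

print(merged)
print(one_to_one_ok)
```

A failed check raises `pandas.errors.MergeError` instead of silently producing extra rows, which can make unexpected duplicates in a key column easier to catch.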


### 3. Many-to-many join

  - If the key column in both the left and right DataFrames contains duplicates, then the result is a many-to-many merge.
  - Let's look at an example to understand this better.

In [None]:
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})


In [None]:
df5

         group        skills
0   Accounting          math
1   Accounting  spreadsheets
2  Engineering        coding
3  Engineering         linux
4           HR  spreadsheets
5           HR  organization


In [None]:
df1

  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR


In [None]:
pd.merge(df1, df5)

  employee        group        skills
0      Bob   Accounting          math
1      Bob   Accounting  spreadsheets
2     Jake  Engineering        coding
3     Jake  Engineering         linux
4     Lisa  Engineering        coding
5     Lisa  Engineering         linux
6      Sue           HR  spreadsheets
7      Sue           HR  organization
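The eight rows in this result are not arbitrary: for each key, the merge pairs every matching left row with every matching right row. A small sketch, assuming the same `df1` and `df5` as above, that predicts the row count from the per-key counts (`expected_rows` is an illustrative name):

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})

merged = pd.merge(df1, df5, on='group')

# How often each key occurs on each side
left_counts = df1['group'].value_counts()
right_counts = df5['group'].value_counts()

# Each key contributes (left occurrences) * (right occurrences) rows:
# Accounting 1*2, Engineering 2*2, HR 1*2
expected_rows = int((left_counts * right_counts).sum())

print(expected_rows, len(merged))
```

Because the row count grows with the product of the duplicate counts, a many-to-many merge on a key with many repeats on both sides can produce a much larger frame than either input.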


***

## Concatenation using df.append()

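There is also a shorter way to append one DataFrame to another: the `append()` method.

Concatenation takes place only along axis = 0, that is, the rows of the second DataFrame are stacked below the rows of the first. Columns missing from one of the DataFrames are filled with NaN.

Note: `DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0, so the cells below run only on older versions. On current versions, `pd.concat([df1, df2])` gives the same result.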

In [None]:
df1

  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR


In [None]:
df2

  employee  hire_date
0     Lisa       2004
1      Bob       2008
2     Jake       2012
3      Sue       2014


In [None]:
# Requires pandas < 2.0: DataFrame.append() was removed in pandas 2.0
df1.append(df2, ignore_index=False)

  employee        group  hire_date
0      Bob   Accounting        NaN
1     Jake  Engineering        NaN
2     Lisa  Engineering        NaN
3      Sue           HR        NaN
0     Lisa          NaN     2004.0
1      Bob          NaN     2008.0
2     Jake          NaN     2012.0
3      Sue          NaN     2014.0


In [None]:
df1

  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR


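#### How is it different from pd.concat()?
  - The `append()` method does not modify the original object
  - It creates a new one with the combined data
  - It only works along axis = 0, whereas `pd.concat()` also accepts `axis=1`
  - It is called on one DataFrame/Series at a time, whereas `pd.concat()` takes a whole list of objects in a single call
  - Because every call copies the data into a new object, appending repeatedly (for example, inside a loop) is not very efficient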
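For comparison, here is a minimal sketch of the same operations written with `pd.concat()`, which also runs on pandas 2.0 and later. It assumes the `df1` and `df2` defined earlier; the result names (`stacked`, `renumbered`, `three_frames`, `side_by_side`) are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

# Row-wise stacking; the original index labels 0..3 appear twice
stacked = pd.concat([df1, df2])

# ignore_index=True renumbers the rows 0..7
renumbered = pd.concat([df1, df2], ignore_index=True)

# Any number of objects can be combined in a single call
three_frames = pd.concat([df1, df2, df1], ignore_index=True)

# axis=1 places the frames side by side, aligned on the index
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked)
print(side_by_side.shape)
```

When many pieces need to be combined, collecting them in a list and calling `pd.concat()` once avoids copying the accumulated data on every step.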
