# Combining DataFrames

Data in two or more DataFrames can be combined in a variety of ways. The two main ones are:
1. Concatenation
2. Merging/joining

## Concatenation

In this operation, the rows or columns of one DataFrame are appended to those of another. This operation is also called stacking.

Pandas provides the `concat` function to perform this operation. Its use is demonstrated below.

In [1]:
import pandas as pd
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                    index=[0, 1, 2, 3]) 

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                    index=[4, 5, 6, 7]) 

In [2]:
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [3]:
df2

Unnamed: 0,A,B,C,D
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


Now we concatenate the rows of `df2` below the rows of `df1` using the `concat` function.

In [4]:
pd.concat([df1, df2])

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


Note that `concat` returns a new, combined DataFrame. The original DataFrames remain unchanged.

In [5]:
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


To retain information about the source DataFrames, supply the `keys` argument. Each key labels the rows that came from the corresponding DataFrame.

In [6]:
pd.concat([df1, df2], keys = ['df1', 'df2'])

Unnamed: 0,Unnamed: 1,A,B,C,D
df1,0,A0,B0,C0,D0
df1,1,A1,B1,C1,D1
df1,2,A2,B2,C2,D2
df1,3,A3,B3,C3,D3
df2,4,A4,B4,C4,D4
df2,5,A5,B5,C5,D5
df2,6,A6,B6,C6,D6
df2,7,A7,B7,C7,D7
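
The keys end up as the outer level of a two-level index, so the rows of one source can be pulled back out by key. The following is a minimal sketch of that idea; the frames `left` and `right` and their key labels are hypothetical stand-ins, not the DataFrames used above.

```python
import pandas as pd

# Hypothetical small frames, used only for this illustration.
left = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
right = pd.DataFrame({'A': ['A2', 'A3']}, index=[2, 3])

combined = pd.concat([left, right], keys=['left', 'right'])

# The keys form the outer level of a hierarchical (two-level) index.
print(combined.index.nlevels)

# Selecting by key returns the rows that came from that source,
# indexed by their original labels.
from_right = combined.loc['right']
print(from_right)
```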


The information about the source DataFrames can also be retained by supplying a `dict` of DataFrames to the `concat` function. The dictionary keys are then used as the keys.

In [7]:
pd.concat({'Data 1':df1, 'Data 2':df2})

Unnamed: 0,Unnamed: 1,A,B,C,D
Data 1,0,A0,B0,C0,D0
Data 1,1,A1,B1,C1,D1
Data 1,2,A2,B2,C2,D2
Data 1,3,A3,B3,C3,D3
Data 2,4,A4,B4,C4,D4
Data 2,5,A5,B5,C5,D5
Data 2,6,A6,B6,C6,D6
Data 2,7,A7,B7,C7,D7


We can also combine more than two DataFrames.

In [8]:
df3 = pd.DataFrame({'A': ['A8', 'A9'],
                    'B': ['B8', 'B9'],
                    'C': ['C8', 'C9'],
                    'D': ['D8', 'D9']},
                    index=[8, 9])
dfDict = {'DF1':df1, 'DF2':df2, 'DF3': df3}
pd.concat(dfDict)

Unnamed: 0,Unnamed: 1,A,B,C,D
DF1,0,A0,B0,C0,D0
DF1,1,A1,B1,C1,D1
DF1,2,A2,B2,C2,D2
DF1,3,A3,B3,C3,D3
DF2,4,A4,B4,C4,D4
DF2,5,A5,B5,C5,D5
DF2,6,A6,B6,C6,D6
DF2,7,A7,B7,C7,D7
DF3,8,A8,B8,C8,D8
DF3,9,A9,B9,C9,D9


If `keys` is supplied along with a dictionary, only the DataFrames corresponding to the supplied keys are concatenated.

In [9]:
pd.concat(dfDict, keys = ['DF1', 'DF3'])

Unnamed: 0,Unnamed: 1,A,B,C,D
DF1,0,A0,B0,C0,D0
DF1,1,A1,B1,C1,D1
DF1,2,A2,B2,C2,D2
DF1,3,A3,B3,C3,D3
DF3,8,A8,B8,C8,D8
DF3,9,A9,B9,C9,D9


To concatenate along columns, use the `axis` argument.

In [10]:
df4 = pd.DataFrame({'E': ['E0', 'E1', 'E2', 'E3'],
                    'F': ['F0', 'F1', 'F2', 'F3']},
                     index=[0, 1, 2, 3]) 
pd.concat([df1, df4], axis = 1)

Unnamed: 0,A,B,C,D,E,F
0,A0,B0,C0,D0,E0,F0
1,A1,B1,C1,D1,E1,F1
2,A2,B2,C2,D2,E2,F2
3,A3,B3,C3,D3,E3,F3


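When concatenating along one axis, the DataFrames are aligned on the index of the other axis. Labels that are missing from one of the DataFrames produce missing values in the result.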

In [11]:
df4.index = [0, 1, 3, 4]
pd.concat([df1, df4], axis = 1)

Unnamed: 0,A,B,C,D,E,F
0,A0,B0,C0,D0,E0,F0
1,A1,B1,C1,D1,E1,F1
2,A2,B2,C2,D2,,
3,A3,B3,C3,D3,E2,F2
4,,,,,E3,F3


Also see the following example, where we concatenate along rows.

In [12]:
df5 = pd.DataFrame({'A': ['A8', 'A9'],
                    'C': ['C8', 'C9'],
                    'E': ['D8', 'D9']},
                    index=[8, 9])
pd.concat([df1, df5])

Unnamed: 0,A,B,C,D,E
0,A0,B0,C0,D0,
1,A1,B1,C1,D1,
2,A2,B2,C2,D2,
3,A3,B3,C3,D3,
8,A8,,C8,,D8
9,A9,,C9,,D9


Note that, by default, the union of the indexes along the other axis is taken while concatenating.

To take the intersection instead, supply the `join` argument as `'inner'`:

In [13]:
pd.concat([df1, df4], axis = 1, join = 'inner')

Unnamed: 0,A,B,C,D,E,F
0,A0,B0,C0,D0,E0,F0
1,A1,B1,C1,D1,E1,F1
3,A3,B3,C3,D3,E2,F2


When a Series is involved in concatenation, it is transformed into a DataFrame whose column name is the name of the Series.

In [14]:
x = pd.Series(['x0', 'x1', 'x2', 'x3'], name = 'X')
pd.concat([df1, x], axis = 1)

Unnamed: 0,A,B,C,D,X
0,A0,B0,C0,D0,x0
1,A1,B1,C1,D1,x1
2,A2,B2,C2,D2,x2
3,A3,B3,C3,D3,x3


## Merging DataFrames (joins)

Pandas provides a `merge` function/method to carry out database-style join operations.

In [15]:
df5 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A4'],
                    'E': ['E0', 'E1', 'E2', 'E4']},
                    index=[0, 1, 2, 4]) 
df5

Unnamed: 0,A,E
0,A0,E0
1,A1,E1
2,A2,E2
4,A4,E4


In [16]:
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


### Inner join
The merge operation performed below is analogous to 

SQL: `select * from df1 natural join df5`

This operation performs a merge (join) using all columns common to the two DataFrames as the composite join key.

In [17]:
df1.merge(df5)

Unnamed: 0,A,B,C,D,E
0,A0,B0,C0,D0,E0
1,A1,B1,C1,D1,E1
2,A2,B2,C2,D2,E2
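
In the example above only column `A` is common, so the composite key has a single column. The sketch below uses two hypothetical frames, `first` and `second`, that share two columns, to show that a row must match on both to appear in the result.

```python
import pandas as pd

# Hypothetical frames sharing the columns A and B.
first = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                      'B': ['B0', 'B1', 'B2'],
                      'C': ['C0', 'C1', 'C2']})
second = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                       'B': ['B0', 'B1', 'B9'],
                       'E': ['E0', 'E1', 'E2']})

# No `on` argument: both common columns, A and B, form the join key.
natural = first.merge(second)

# The same join with the key spelled out explicitly.
explicit = first.merge(second, on=['A', 'B'])

# The A2 rows agree on A but not on B, so they are dropped.
print(natural)
```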


Instead of using all common columns as the join key, we can explicitly specify the join key with the `on` argument. This join is analogous to 

SQL: `select * from df1 join df5 using (A)`

In [18]:
df6 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A4'],
                    'B': ['B0', 'B1', 'B4', 'B5']},
                    index=[0, 1, 2, 4])
df1.merge(df6, on = 'A')

Unnamed: 0,A,B_x,C,D,B_y
0,A0,B0,C0,D0,B0
1,A1,B1,C1,D1,B1
2,A2,B2,C2,D2,B4


Observe that the `B` column, which is present in both DataFrames, has been distinguished in the result by the default suffixes `_x` and `_y`.

The non-key common columns can instead be distinguished using specified suffixes, as shown below.

In [19]:
df1.merge(df6, on = 'A', suffixes = ('1', '6'))

Unnamed: 0,A,B1,C,D,B6
0,A0,B0,C0,D0,B0
1,A1,B1,C1,D1,B1
2,A2,B2,C2,D2,B4


Now suppose the two DataFrames do not have any common columns. In that case, the join keys can be specified using the arguments `left_on` and `right_on`.

SQL: `select * from df1 join df7 on (A = E)`

In [20]:
df7 = pd.DataFrame({'E': ['A0', 'A1', 'A2', 'A4'],
                    'F': ['F0', 'F1', 'F2', 'F4']},
                    index=[0, 1, 2, 4])

df1.merge(df7, left_on = 'A', right_on = 'E')

Unnamed: 0,A,B,C,D,E,F
0,A0,B0,C0,D0,A0,F0
1,A1,B1,C1,D1,A1,F1
2,A2,B2,C2,D2,A2,F2


Note that, by default, the `merge` function performs an inner join. Also note that `merge` is designed to perform only equi-joins. **`merge` cannot perform a general "theta join"**.
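
One common workaround, sketched here under assumptions rather than offered as a built-in feature, is to emulate a theta join with a cross join followed by a filter. The frames `items` and `tiers` and their columns are hypothetical, and `how='cross'` requires pandas 1.2 or later.

```python
import pandas as pd

# Hypothetical frames for illustration only.
items = pd.DataFrame({'item': ['pen', 'book', 'lamp'],
                      'price': [2, 12, 30]})
tiers = pd.DataFrame({'tier': ['low', 'high'],
                      'threshold': [5, 20]})

# Step 1: a cross join pairs every row of `items` with every row of `tiers`.
pairs = items.merge(tiers, how='cross')

# Step 2: keep only the pairs that satisfy the non-equality condition,
# here price > threshold.
theta = pairs[pairs['price'] > pairs['threshold']].reset_index(drop=True)
print(theta)
```

Because the cross join materializes every pair of rows before filtering, this approach is only practical when the product of the two row counts is manageable.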

### Outer Join

To perform an outer join, we need to supply the `how` argument. The possible values of the `how` argument include `inner` (default), `left`, `right`, and `outer`.

#### Left Join
The merge performed below is analogous to the following SQL join.

SQL: `select * from df1 natural left join df5`

In [21]:
df1.merge(df5, how = 'left')

Unnamed: 0,A,B,C,D,E
0,A0,B0,C0,D0,E0
1,A1,B1,C1,D1,E1
2,A2,B2,C2,D2,E2
3,A3,B3,C3,D3,


The next merge is analogous to the following SQL join.

SQL: `select * from df1 left join df6 using (A)`

In [22]:
df1.merge(df6, how = 'left', on = 'A', suffixes = ('1', '6'))

Unnamed: 0,A,B1,C,D,B6
0,A0,B0,C0,D0,B0
1,A1,B1,C1,D1,B1
2,A2,B2,C2,D2,B4
3,A3,B3,C3,D3,


The next merge is analogous to the following SQL join.

SQL: `select * from df1 left join df7 on (A = E)`

In [23]:
df1.merge(df7, how = 'left', left_on = 'A', right_on = 'E')

Unnamed: 0,A,B,C,D,E,F
0,A0,B0,C0,D0,A0,F0
1,A1,B1,C1,D1,A1,F1
2,A2,B2,C2,D2,A2,F2
3,A3,B3,C3,D3,,


#### Right Join
The merge performed below is analogous to the following SQL join.

SQL: `select * from df1 natural right join df5`

In [24]:
df1.merge(df5, how = 'right')

Unnamed: 0,A,B,C,D,E
0,A0,B0,C0,D0,E0
1,A1,B1,C1,D1,E1
2,A2,B2,C2,D2,E2
3,A4,,,,E4


The other two variants of right join can be performed in a similar way.
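
For completeness, here is a sketch of those two variants. To keep the block self-contained it rebuilds shortened versions of `df1`, `df6`, and `df7` (columns `C` and `D` of `df1` are left out, and the default index is used), so the output has fewer columns than it would with the full DataFrames.

```python
import pandas as pd

# Shortened copies of the frames used in this section.
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})
df6 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A4'],
                    'B': ['B0', 'B1', 'B4', 'B5']})
df7 = pd.DataFrame({'E': ['A0', 'A1', 'A2', 'A4'],
                    'F': ['F0', 'F1', 'F2', 'F4']})

# SQL: select * from df1 right join df6 using (A)
right_using = df1.merge(df6, how='right', on='A', suffixes=('1', '6'))
print(right_using)

# SQL: select * from df1 right join df7 on (A = E)
right_on = df1.merge(df7, how='right', left_on='A', right_on='E')
print(right_on)
```

In both results every row of the right DataFrame is kept, and the columns coming from `df1` are missing for the key `A4`, which has no match on the left.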

#### Full outer Join

The next merge is analogous to the following SQL join.

SQL: `select * from df1 natural full outer join df5`

In [25]:
df1.merge(df5, how = 'outer')

Unnamed: 0,A,B,C,D,E
0,A0,B0,C0,D0,E0
1,A1,B1,C1,D1,E1
2,A2,B2,C2,D2,E2
3,A3,B3,C3,D3,
4,A4,,,,E4
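
In an outer join it is often useful to know which side each row came from. The `indicator` argument of `merge` records this in an extra `_merge` column. The sketch below uses two hypothetical frames, `left` and `right`, rather than the DataFrames above.

```python
import pandas as pd

# Hypothetical frames for illustration only.
left = pd.DataFrame({'A': ['A0', 'A1', 'A3'], 'B': ['B0', 'B1', 'B3']})
right = pd.DataFrame({'A': ['A0', 'A1', 'A4'], 'E': ['E0', 'E1', 'E4']})

# indicator=True appends a `_merge` column whose values are
# 'both', 'left_only', or 'right_only'.
full = left.merge(right, how='outer', indicator=True)
print(full)

# Map each key to the side(s) it was found on.
origin = dict(zip(full['A'], full['_merge'].astype(str)))
print(origin)
```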


It is important to note that the `merge` function ignores the indexes of the DataFrames when column(s) of the DataFrames are used as join keys. The result receives a new integer index, as the following example shows.

In [37]:
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [42]:
df4.index = [4, 5, 6, 7]
df4.columns = ['A', 'E']
df4.A = ['E0','A1', 'E2','A3']
df4

Unnamed: 0,A,E
4,E0,F0
5,A1,F1
6,E2,F2
7,A3,F3


In [43]:
df1.merge(df4)

Unnamed: 0,A,B,C,D,E
0,A1,B1,C1,D1,F1
1,A3,B3,C3,D3,F3


### Using index as a join key

Instead of using a DataFrame column as the join key, it is also possible to use the index of the DataFrames as the join key, as shown below.

In [26]:
pd.merge(df1, df5, left_index = True, right_index = True)

Unnamed: 0,A_x,B,C,D,A_y,E
0,A0,B0,C0,D0,A0,E0
1,A1,B1,C1,D1,A1,E1
2,A2,B2,C2,D2,A2,E2


Observe that, since the indexes are used as keys, the common column `A` is now treated as a non-key column. An alternative way to use the indexes as the join key is provided by the `join` function/method.

In [27]:
df1.join(df5, how = 'inner', lsuffix = '1')

Unnamed: 0,A1,B,C,D,A,E
0,A0,B0,C0,D0,A0,E0
1,A1,B1,C1,D1,A1,E1
2,A2,B2,C2,D2,A2,E2


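Observe that we have supplied the `how` and `lsuffix` arguments. The `how = 'inner'` argument is needed because, by default, `join` performs a left outer join. The `lsuffix` argument is needed because `join` does not automatically distinguish non-key columns with the same name: a suffix must be given for at least one side.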
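
The sketch below illustrates the point about overlapping column names, using two hypothetical frames `left` and `right`. Calling `join` without a suffix raises a `ValueError`, while supplying `rsuffix` (or `lsuffix`) resolves the clash.

```python
import pandas as pd

# Hypothetical frames that share the non-key column A.
left = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
right = pd.DataFrame({'A': ['A0', 'A1'], 'E': ['E0', 'E1']})

try:
    left.join(right)            # column A overlaps and no suffix is given
    overlap_error = None
except ValueError as err:
    overlap_error = err
    print('join failed:', err)

# Supplying a suffix for one side is enough to resolve the clash.
joined = left.join(right, rsuffix='_right')
print(joined.columns.tolist())
```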

### Merging with duplicate values in join key

When a join key contains duplicate values, the `merge` function performs a one-to-many, many-to-one, or many-to-many join, depending on which side holds the duplicates.

In [28]:
df5.loc[1, 'A'] = 'A2'
df5

Unnamed: 0,A,E
0,A0,E0
1,A2,E1
2,A2,E2
4,A4,E4


A one-to-many join is performed by the following statement.

In [29]:
df1.merge(df5)

Unnamed: 0,A,B,C,D,E
0,A0,B0,C0,D0,E0
1,A2,B2,C2,D2,E1
2,A2,B2,C2,D2,E2
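
Duplicate keys can silently multiply rows, so it can help to state the expected relationship up front. The sketch below, built on hypothetical frames `left`, `right`, and `left_dup`, uses the `validate` argument of `merge` to check the relationship and then shows how a many-to-many join pairs every matching row on one side with every matching row on the other.

```python
import pandas as pd

# Hypothetical frames: keys are unique in `left`, repeated in `right`.
left = pd.DataFrame({'A': ['A0', 'A2'], 'B': ['B0', 'B2']})
right = pd.DataFrame({'A': ['A0', 'A2', 'A2'], 'E': ['E0', 'E1', 'E2']})

# The data really is one-to-many, so this check passes.
one_to_many = left.merge(right, validate='one_to_many')

# Asking for a one-to-one merge on the same data raises MergeError.
try:
    left.merge(right, validate='one_to_one')
    failed = False
except pd.errors.MergeError:
    failed = True

# Duplicates on both sides give a many-to-many join:
# 2 left rows x 2 right rows with key A2 produce 4 result rows.
left_dup = pd.DataFrame({'A': ['A2', 'A2'], 'B': ['B2', 'B2b']})
many_to_many = left_dup.merge(right)
print(many_to_many)
```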
