# Combine data

To combine data we can use the methods `concat`, `merge` and `join`. Often all three can be applied to get the same end result; which one you choose depends on your data and your preference.

In [1]:
import pandas as pd
import numpy as np

In [2]:
pd.concat?

In [3]:
pd.merge?

In [4]:
pd.DataFrame.join?
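
To illustrate the point that the three methods can reach the same end result, here is a minimal sketch. The frames `a` and `b` and their column names are hypothetical and not part of the notebook; the sketch keeps only the index labels that both frames share.

```python
import pandas as pd

# Hypothetical example frames, indexed by a shared label
a = pd.DataFrame({"x": [1, 2, 3]}, index=["A", "B", "C"])
b = pd.DataFrame({"y": [10, 20, 30]}, index=["A", "B", "D"])

# Three ways to keep only the labels present in both frames (A and B)
via_concat = pd.concat([a, b], axis=1, join="inner")
via_merge = pd.merge(a, b, how="inner", left_index=True, right_index=True)
via_join = a.join(b, how="inner")

# True when all three results hold the same data
print(via_concat.equals(via_merge) and via_merge.equals(via_join))
```

Which spelling reads best is mostly a matter of taste: `concat` suits stacking many objects at once, while `merge` and `join` give more control over the matching keys.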

## Concat 

With `concat` we can stack pandas objects (DataFrames or Series) along an axis. This is especially handy when the objects share the same layout. By default `concat` works with `axis=0`, i.e. row-wise concatenation: it glues the rows of one DataFrame onto the end of another.

In [2]:
df_01 = pd.DataFrame(np.random.randn(3, 5))
df_01

Unnamed: 0,0,1,2,3,4
0,0.113879,1.824252,-1.007082,-0.411709,-0.129588
1,0.571164,-1.167565,-1.462957,-0.57323,-0.974223
2,0.136346,-0.870894,-1.320389,1.000776,2.22767


In [3]:
df_02 = pd.DataFrame(np.random.randn(3,5))
df_02

Unnamed: 0,0,1,2,3,4
0,-0.656365,0.9986,-2.124309,0.574141,-1.108821
1,1.082656,0.051674,-1.077717,1.066916,1.290093
2,0.460137,0.015859,0.216606,1.164776,0.050458


In [4]:
df_03 = pd.concat([df_01, df_02])
df_03

Unnamed: 0,0,1,2,3,4
0,0.113879,1.824252,-1.007082,-0.411709,-0.129588
1,0.571164,-1.167565,-1.462957,-0.57323,-0.974223
2,0.136346,-0.870894,-1.320389,1.000776,2.22767
0,-0.656365,0.9986,-2.124309,0.574141,-1.108821
1,1.082656,0.051674,-1.077717,1.066916,1.290093
2,0.460137,0.015859,0.216606,1.164776,0.050458
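
Note that the row labels 0, 1, 2 appear twice in the result above, because `concat` keeps each frame's original index. If that is not wanted, one option is the `ignore_index` argument. The sketch below uses small hypothetical frames (`top`, `bottom`) rather than the random data from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical frames with the same shape and default integer index
top = pd.DataFrame(np.zeros((3, 2)))
bottom = pd.DataFrame(np.ones((3, 2)))

# Default: the original row labels are kept, so they repeat
stacked = pd.concat([top, bottom])

# ignore_index=True discards the old labels and numbers the rows afresh
renumbered = pd.concat([top, bottom], ignore_index=True)

print(list(stacked.index))
print(list(renumbered.index))
```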


If you pass `axis=1`, it glues the DataFrames together in the column direction.

In [5]:
df_03 = pd.concat([df_01, df_02], axis=1)
df_03

Unnamed: 0,0,1,2,3,4,0,1,2,3,4
0,0.113879,1.824252,-1.007082,-0.411709,-0.129588,-0.656365,0.9986,-2.124309,0.574141,-1.108821
1,0.571164,-1.167565,-1.462957,-0.57323,-0.974223,1.082656,0.051674,-1.077717,1.066916,1.290093
2,0.136346,-0.870894,-1.320389,1.000776,2.22767,0.460137,0.015859,0.216606,1.164776,0.050458


If the DataFrames do not share the same labels along the other axis, `concat` fills the gaps with NaN.

In [6]:
A = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]}).set_index('key')
B = pd.DataFrame({'key': ['A', 'B', 'X', 'Y'], 'value': [3, 4, 5, 7]}).set_index('key')

In [7]:
A

Unnamed: 0_level_0,value
key,Unnamed: 1_level_1
A,1
B,2
C,3


In [8]:
B

Unnamed: 0_level_0,value
key,Unnamed: 1_level_1
A,3
B,4
X,5
Y,7
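
The notebook defines `A` and `B` but does not show them being combined. A minimal sketch of the column-wise concatenation, assuming that is the intended next step, could look like this. The name `combined` is an assumption.

```python
import pandas as pd

A = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]}).set_index('key')
B = pd.DataFrame({'key': ['A', 'B', 'X', 'Y'], 'value': [3, 4, 5, 7]}).set_index('key')

# Column-wise concat aligns on the index; labels missing on one side become NaN
combined = pd.concat([A, B], axis=1)
print(combined)

# Count the gaps: C is missing from B, X and Y are missing from A
print(combined.isna().to_numpy().sum())
```

Because NaN is a floating-point value, the integer columns are converted to floats in the result.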


## Merge

You can also perform a SQL-style join using the `pd.merge` function:

In [9]:
pd.merge?

In [10]:
left = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_value': [1, 2, 3]})
left 

Unnamed: 0,key,left_value
0,A,1
1,B,2
2,C,3


In [11]:
right = pd.DataFrame({'key': ['A', 'B', 'D'], 'right_value': [3, 4, '51,3']})
right

Unnamed: 0,key,right_value
0,A,3
1,B,4
2,D,513


In [12]:
pd.merge(left, right, how='inner', left_on=['key'], right_on=['key'])

Unnamed: 0,key,left_value,right_value
0,A,1,3
1,B,2,4
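
When the key column has the same name in both frames, as it does here, the `on` argument is a shorter spelling of `left_on`/`right_on`. The sketch below assumes plain integer values in `right_value` for simplicity, which differs from the notebook's data.

```python
import pandas as pd

left = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_value': [1, 2, 3]})
# Assumption: simplified numeric values instead of the notebook's mixed column
right = pd.DataFrame({'key': ['A', 'B', 'D'], 'right_value': [3, 4, 5]})

# Verbose form, as used in the notebook
verbose = pd.merge(left, right, how='inner', left_on=['key'], right_on=['key'])

# Shorter form for a key column that has the same name on both sides
short = pd.merge(left, right, how='inner', on='key')

print(verbose.equals(short))
```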


In [13]:
pd.merge(left, right, how='outer', left_on=['key'], right_on=['key'])

Unnamed: 0,key,left_value,right_value
0,A,1.0,3.0
1,B,2.0,4.0
2,C,3.0,
3,D,,513.0


In [14]:
pd.merge(left, right, how='right', left_on=['key'], right_on=['key'])

Unnamed: 0,key,left_value,right_value
0,A,1.0,3
1,B,2.0,4
2,D,,513


In [15]:
pd.merge(left, right, how='left', left_on=['key'], right_on=['key'])

Unnamed: 0,key,left_value,right_value
0,A,1,3.0
1,B,2,4.0
2,C,3,
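
To see where each row of a merge comes from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column with the values `left_only`, `right_only` or `both`. This is a small sketch on an outer merge, again assuming simplified integer values in `right_value`. The names `merged` and `origin` are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_value': [1, 2, 3]})
# Assumption: simplified numeric values instead of the notebook's mixed column
right = pd.DataFrame({'key': ['A', 'B', 'D'], 'right_value': [3, 4, 5]})

# The outer merge keeps every key; _merge records which side supplied it
merged = pd.merge(left, right, how='outer', on='key', indicator=True)

# Map each key to its origin for a quick overview
origin = {key: str(side) for key, side in zip(merged['key'], merged['_merge'])}
print(origin)
```

Rows marked `both` are the ones an inner merge would keep; a left merge adds the `left_only` rows and a right merge adds the `right_only` rows.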


In [16]:
left = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_value': [1, 2, 3], 'other_key': ['X','Y','Z']})
left

Unnamed: 0,key,left_value,other_key
0,A,1,X
1,B,2,Y
2,C,3,Z


In [17]:
right = pd.DataFrame({'key': ['A', 'B', 'D'], 'right_value': [3,'53,2', 5], 'some_key': ['W','Y', 'Z']})
right

Unnamed: 0,key,right_value,some_key
0,A,3,W
1,B,532,Y
2,D,5,Z


In [18]:
pd.merge(left, right, how='inner', left_on=['key', 'other_key'], right_on=['key', 'some_key'])

Unnamed: 0,key,left_value,other_key,right_value,some_key
0,B,2,Y,532,Y


In [19]:
df_03 = pd.merge(left, right, how='left', left_on=['key'], right_on=['key'])
df_03


Unnamed: 0,key,left_value,other_key,right_value,some_key
0,A,1,X,3.0,W
1,B,2,Y,532.0,Y
2,C,3,Z,,


## Join
A pandas DataFrame also has a `join` method for merging on the index. However, overlapping column names are not allowed unless you supply a suffix to tell them apart.
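
The restriction on overlapping columns can be demonstrated with a short sketch. The frames here are hypothetical: both carry a column named `value`. The variable `overlap_rejected` is only a helper for the demonstration.

```python
import pandas as pd

# Hypothetical frames that share the column name 'value'
left = pd.DataFrame({'value': [1, 2, 3]}, index=['A', 'B', 'C'])
right = pd.DataFrame({'value': [3, 4, 5]}, index=['A', 'B', 'D'])

# Without suffixes, join refuses to combine frames with overlapping columns
try:
    left.join(right)
    overlap_rejected = False
except ValueError:
    overlap_rejected = True

# With lsuffix/rsuffix the overlapping names are made unique
joined = left.join(right, lsuffix='_left', rsuffix='_right')

print(overlap_rejected)
print(list(joined.columns))
```

Renaming one of the columns beforehand is another way around the restriction.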

In [20]:
left

Unnamed: 0,key,left_value,other_key
0,A,1,X
1,B,2,Y
2,C,3,Z


In [21]:
right

Unnamed: 0,key,right_value,some_key
0,A,3,W
1,B,532,Y
2,D,5,Z


In [22]:
left.set_index('key').join(right.set_index('key'), how='outer')

Unnamed: 0_level_0,left_value,other_key,right_value,some_key
key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
A,1.0,X,3.0,W
B,2.0,Y,532.0,Y
C,3.0,Z,,
D,,,5.0,Z


In [23]:
right = right.rename(columns={'key': 'name'})
right

Unnamed: 0,name,right_value,some_key
0,A,3,W
1,B,532,Y
2,D,5,Z


In [24]:
df_04 = left.join(right, how='outer')
df_04

Unnamed: 0,key,left_value,other_key,name,right_value,some_key
0,A,1,X,A,3,W
1,B,2,Y,B,532,Y
2,C,3,Z,D,5,Z


With the `on=` argument you can match a column of the calling DataFrame against the index of the other DataFrame. For example:

In [25]:
left1 = pd.DataFrame({'key': ['a','b','a','a','b','c'], 'value': range(6)})
left1

Unnamed: 0,key,value
0,a,0
1,b,1
2,a,2
3,a,3
4,b,4
5,c,5


In [26]:
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
right1

Unnamed: 0,group_val
a,3.5
b,7.0


In [27]:
df_05 = left1.join(right1, on='key')
df_05

Unnamed: 0,key,value,group_val
0,a,0,3.5
1,b,1,7.0
2,a,2,3.5
3,a,3,3.5
4,b,4,7.0
5,c,5,
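
The same lookup can also be written with `pd.merge` by combining `left_on` with `right_index=True`. The sketch below rebuilds the two frames from the cells above and compares both spellings. The names `via_join` and `via_merge` are assumptions.

```python
import pandas as pd

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

# join: match the 'key' column of left1 against the index of right1
via_join = left1.join(right1, on='key')

# merge: the same match, spelled out explicitly (join defaults to how='left')
via_merge = pd.merge(left1, right1, how='left', left_on='key', right_index=True)

print(via_join.equals(via_merge))
```

The key `c` has no entry in `right1`, so its `group_val` is NaN in both results.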


Merging on indexes with `merge` is also possible, using `left_index=True` and `right_index=True`:

In [28]:
left1 = left1.set_index('key')
left1

Unnamed: 0_level_0,value
key,Unnamed: 1_level_1
a,0
b,1
a,2
a,3
b,4
c,5


In [29]:
df_06 = pd.merge(left1, right1, how='outer', left_index=True, right_index=True)
df_06

Unnamed: 0,value,group_val
a,0,3.5
a,2,3.5
a,3,3.5
b,1,7.0
b,4,7.0
c,5,
