![alt text](pandas.png "Title")

In [1]:
import pandas as pd
import random

# Dataframe merges

pandas offers the tools you'd expect for combining data stored in dataframes.

## Test data

In [2]:
# let's create a dm (demographics) dataframe...
data = {
    'subjid': [10010, 10011, 10012],
    'gender': ['M', 'F', 'F'],
    'age':    [20, 25, 23],
}

dm = pd.DataFrame(data, columns=['subjid','age', 'gender'])
dm

,subjid,age,gender
0,10010,20,M
1,10011,25,F
2,10012,23,F


In [3]:
# and a vs (vital signs) dataframe...
patients = [10010, 10011, 10013]
visits = [1, 2, 3]
param = ['heart rate', 'systolic blood pressure']

data = {'subjid': sorted(patients * len(visits)) * len(param),
        'visit' : visits * len(param) * len(patients),
        'param' : sorted(param * len(visits) * len(patients)),
        'result': [random.randint(50, 150)  for _ in range(len(visits) * len(patients))] +
                  [random.randint(100, 180) for _ in range(len(visits) * len(patients))]
}

vs = pd.DataFrame(data, columns=['subjid', 'visit', 'param', 'result']).sort_values(['subjid','visit', 'param'])
vs.head(8)

,subjid,visit,param,result
0,10010,1,heart rate,138
9,10010,1,systolic blood pressure,147
1,10010,2,heart rate,68
10,10010,2,systolic blood pressure,110
2,10010,3,heart rate,138
11,10010,3,systolic blood pressure,111
3,10011,1,heart rate,113
12,10011,1,systolic blood pressure,177


## Many-to-one joins

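pd.merge() performs the join and offers the usual options (key columns, join type, suffixes for overlapping columns). It returns a new dataframe; the inputs are not modified in place. Here vs holds several rows per subjid and dm exactly one, hence a many-to-one join.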

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
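
A minimal sketch of how the many-to-one assumption could be checked at merge time, using the validate argument of pd.merge(). The toy dataframes and the duplicated dm row (dm_dup) are hypothetical and only serve the illustration:

```python
import pandas as pd

# Toy data mirroring the notebook: several vs rows per subject, one dm row per subject
vs = pd.DataFrame({'subjid': [10010, 10010, 10011], 'result': [138, 147, 113]})
dm = pd.DataFrame({'subjid': [10010, 10011], 'age': [20, 25]})

# validate='many_to_one' checks that the merge keys are unique in the right dataframe
merged = pd.merge(vs, dm, on='subjid', validate='many_to_one')

# A duplicated subject in dm (hypothetical data error) breaks the assumption
dm_dup = pd.concat([dm, dm.iloc[[0]]], ignore_index=True)
try:
    pd.merge(vs, dm_dup, on='subjid', validate='many_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
```

Without validate, the duplicated key would silently multiply the matching vs rows, so the check is a cheap safeguard when dm is expected to hold one row per subject.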

In [4]:
# We need to pass two dataframes:
df = pd.merge(vs, dm)

# We can pass the key column(s) to join on. If not passed, the columns common to both dataframes are used.
# The default merge is an 'inner' join: only the key values present in both dataframes are kept.
# oh, and no need to sort datasets beforehand ;-)

# we could be explicit:
pd.merge(vs, dm, on='subjid', how='inner')

,subjid,visit,param,result,age,gender
0,10010,1,heart rate,138,20,M
1,10010,1,systolic blood pressure,147,20,M
2,10010,2,heart rate,68,20,M
3,10010,2,systolic blood pressure,110,20,M
4,10010,3,heart rate,138,20,M
5,10010,3,systolic blood pressure,111,20,M
6,10011,1,heart rate,113,25,F
7,10011,1,systolic blood pressure,177,25,F
8,10011,2,heart rate,147,25,F
9,10011,2,systolic blood pressure,176,25,F
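
To illustrate how the default key is chosen, here is a small sketch on made-up data (the values are hypothetical). It compares the implicit call, the explicit call and the equivalent DataFrame.merge method form:

```python
import pandas as pd

vs = pd.DataFrame({'subjid': [10010, 10010, 10013],
                   'visit':  [1, 2, 1],
                   'result': [138, 68, 78]})
dm = pd.DataFrame({'subjid': [10010, 10012], 'age': [20, 23]})

implicit = pd.merge(vs, dm)                            # key inferred from the common columns
explicit = pd.merge(vs, dm, on='subjid', how='inner')  # same thing, spelled out
method = vs.merge(dm, on='subjid', how='inner')        # DataFrame method form

# The inferred key is the intersection of the two sets of column names
common = vs.columns.intersection(dm.columns).tolist()
```

Being explicit about on and how is usually safer: if both dataframes later gain another column with the same name, the implicit form would start joining on it as well.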


In [5]:
# We just need the 'age' to be merged onto VS, with a left join (keep all observations in VS)
key = ['subjid']
pd.merge(vs, dm[key + ['age']], on=key, how='left')

# the following is similar, only the column order would differ
# pd.merge(dm[['subjid','age']], vs, on='subjid', how='right')

# In both cases, we get missing values where a key has no match (subject 10013 is in vs but not in dm).

,subjid,visit,param,result,age
0,10010,1,heart rate,138,20.0
1,10010,1,systolic blood pressure,147,20.0
2,10010,2,heart rate,68,20.0
3,10010,2,systolic blood pressure,110,20.0
4,10010,3,heart rate,138,20.0
5,10010,3,systolic blood pressure,111,20.0
6,10011,1,heart rate,113,25.0
7,10011,1,systolic blood pressure,177,25.0
8,10011,2,heart rate,147,25.0
9,10011,2,systolic blood pressure,176,25.0
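
The age column is displayed as 20.0 rather than 20 because a left join with unmatched keys introduces NaN, which an int64 column cannot hold. A short sketch on made-up data (values are hypothetical) showing how the unmatched subjects could be listed, and one possible way to get integers back with the nullable Int64 dtype:

```python
import pandas as pd

vs = pd.DataFrame({'subjid': [10010, 10011, 10013], 'result': [138, 113, 78]})
dm = pd.DataFrame({'subjid': [10010, 10011, 10012], 'age': [20, 25, 23]})

left = pd.merge(vs, dm, on='subjid', how='left')

# vs subjects with no matching dm record end up with a missing age
unmatched = left.loc[left['age'].isna(), 'subjid'].unique().tolist()

# NaN cannot be stored in an int64 column, so age is upcast to float64
age_dtype = str(left['age'].dtype)

# Optional: the nullable integer dtype keeps whole numbers and missing values together
age_nullable = left['age'].astype('Int64')
```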


In [6]:
# outer join: keep the keys found in either dataframe
pd.merge(vs, dm, on='subjid', how='outer').tail(5) # show the last 5 records

,subjid,visit,param,result,age,gender
14,10013,2.0,heart rate,78.0,,
15,10013,2.0,systolic blood pressure,142.0,,
16,10013,3.0,heart rate,60.0,,
17,10013,3.0,systolic blood pressure,178.0,,
18,10012,,,,23.0,F
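
After an outer join it is often useful to know which side each row came from. A minimal sketch, on made-up data, using the indicator argument of pd.merge(), which adds a _merge column:

```python
import pandas as pd

vs = pd.DataFrame({'subjid': [10010, 10013], 'result': [138, 78]})
dm = pd.DataFrame({'subjid': [10010, 10012], 'age': [20, 23]})

outer = pd.merge(vs, dm, on='subjid', how='outer', indicator=True)

# _merge takes the values 'both', 'left_only' or 'right_only'
origin = dict(zip(outer['subjid'], outer['_merge'].astype(str)))

# Subjects present on one side only
only_in_vs = outer.loc[outer['_merge'] == 'left_only', 'subjid'].tolist()
only_in_dm = outer.loc[outer['_merge'] == 'right_only', 'subjid'].tolist()
```

Filtering on _merge gives the equivalent of an anti-join, which is handy for reconciling two datasets before analysis.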


In [7]:
# quick summary statistics after a merge
pd.merge(vs, dm[key + ['gender']], on=key, how='left').groupby(['gender','param']).mean()

gender,param,subjid,visit,result
F,heart rate,10011.0,2.0,124.0
F,systolic blood pressure,10011.0,2.0,152.0
M,heart rate,10010.0,2.0,114.666667
M,systolic blood pressure,10010.0,2.0,122.666667
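
The summary above also averages subjid and visit, which carries little meaning. A possible variant, sketched on made-up values, selects the result column before aggregating:

```python
import pandas as pd

vs = pd.DataFrame({
    'subjid': [10010, 10010, 10011, 10011, 10013],
    'param':  ['heart rate'] * 5,
    'result': [100, 120, 90, 110, 80],
})
dm = pd.DataFrame({'subjid': [10010, 10011], 'gender': ['M', 'F']})

merged = pd.merge(vs, dm, on='subjid', how='left')

# Select the column of interest before aggregating
mean_result = merged.groupby(['gender', 'param'])['result'].mean()
```

Note that groupby drops rows whose group key is missing by default, so subject 10013 (no gender after the left join) does not contribute to any group.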


In [8]:
# Notes:

# 1) Overlapping columns (apart from the key) are renamed with a suffix ('_x' and '_y' by default, customizable via suffixes=)
# 2) You can merge on the index as well (left_index/right_index)
# 3) You can pass a list of key variables
# 4) If the key variables have different names, no need to rename beforehand: use left_on/right_on.
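
The four notes can be sketched as follows. All dataframes, column names (such as patient_id) and suffixes here are made up for illustration. For overlapping non-key columns, pandas appends suffixes ('_x' and '_y' unless suffixes is given):

```python
import pandas as pd

# 1) Overlapping non-key columns are disambiguated with suffixes
a = pd.DataFrame({'subjid': [1, 2], 'result': [10, 20]})
b = pd.DataFrame({'subjid': [1, 2], 'result': [11, 21]})
with_suffixes = pd.merge(a, b, on='subjid', suffixes=('_base', '_post'))

# 2) Merge a column on the left against the index on the right
dm_indexed = pd.DataFrame({'age': [20, 25]},
                          index=pd.Index([1, 2], name='subjid'))
on_index = pd.merge(a, dm_indexed, left_on='subjid', right_index=True)

# 3) A list of key variables
hr = pd.DataFrame({'subjid': [1, 1, 2], 'visit': [1, 2, 1], 'hr': [60, 62, 70]})
sbp = pd.DataFrame({'subjid': [1, 1, 2], 'visit': [1, 2, 1], 'sbp': [120, 118, 130]})
multi_key = pd.merge(hr, sbp, on=['subjid', 'visit'])

# 4) Keys named differently on each side: both key columns are kept
dm_alt = pd.DataFrame({'patient_id': [1, 2], 'gender': ['M', 'F']})
diff_names = pd.merge(a, dm_alt, left_on='subjid', right_on='patient_id')
```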

__________________________________________________
Nicolas Dupuis, Methodology and Innovation (IDAR C&SP), 2020+