# Data merging

One of the most time-consuming but necessary tasks in cheminformatics is merging information from different sources.  So long as your chemicals have a common identifier (refer to the lecture on chemical representation for what a unique identifier means), datasets can be merged with standard join operations.  

In pandas, this is done through the merge function ([refer to the documentation](https://pandas.pydata.org/docs/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging) for full functionality).  

There are several ways to merge data, borrowed from the table joins of the SQL database language.  They are conceptualized in the following graphic.  

<img src='join_examples.png'>

In this notebook we show the four most common ways to merge two datasets in tabular format:

1) Inner Join  
2) Full (Outer) Join  
3) Left Join  
4) Right Join  
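
As a quick illustration of how the four join types differ, here is a minimal sketch on two toy tables. The dataframes `left` and `right` and their `id`, `mw`, and `smiles` columns are made up for this example and are not the notebook's data.

```python
import pandas as pd

# Hypothetical toy tables: "B" and "C" appear in both, "A" only on the left, "D" only on the right
left = pd.DataFrame({"id": ["A", "B", "C"], "mw": [100, 200, 300]})
right = pd.DataFrame({"id": ["B", "C", "D"], "smiles": ["CC", "CCO", "CCN"]})

# Count the rows produced by each join type
row_counts = {
    how: len(pd.merge(left, right, on="id", how=how))
    for how in ["inner", "outer", "left", "right"]
}
print(row_counts)  # {'inner': 2, 'outer': 4, 'left': 3, 'right': 3}
```

The inner join keeps only the two shared identifiers, the outer join keeps all four, and the left and right joins each keep the three identifiers of their own table.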

Reading Excel tables with pandas requires the `openpyxl` package to be installed.   

We first need to load the two datasets as different dataframes.

In [2]:
import pandas as pd

ds_one = pd.read_excel('Fake Data.xlsx', sheet_name='Dataset One')
ds_two = pd.read_excel('Fake Data.xlsx', sheet_name='Dataset Two')

In [6]:
print(ds_one.shape)

ds_one.head()

(19, 3)


Unnamed: 0,Chemical Identifier,Molecular Weight,Log P
0,Compound 1,257,3
1,Compound 2,437,-1
2,Compound 3,432,-3
3,Compound 4,318,-1
4,Compound 5,298,-1


In [7]:
print(ds_two.shape)

ds_two.head()

(19, 3)


Unnamed: 0,Chemical Identifier,Chemical SMILES,Chemical InChI
0,Compound 1,O=C2NC(=O)C(C1=CC=CC=C1)(C(=O)N2)CC,
1,Compound 4,C=1C=C(C=CC=1C(C2=CC=C(C=C2)Cl)C(Cl)(Cl)Cl)Cl,
2,Compound 5,O=C(O)C1=C(C=CC(=C1Cl)Cl)Cl,
3,Compound 11,O=[N+]([O-])C=1C=CC(O)=C(C=1)[N+](=O)[O-],
4,Compound 12,O=C(OCC)N,


Next, we can use the `pd.merge` function, which will join the datasets on a column of unique identifiers.  The merge function takes the first (left) and second (right) dataframes as required arguments.  The other important arguments are `on`, which names the identifier column shared by the two datasets, and `how`, which sets the type of join. 
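
Because the join relies on the identifier being unique, it can be worth checking this before merging. The sketch below uses toy dataframes (hypothetical values, not the notebook's data) and the `validate` argument of `pd.merge`, which raises a `MergeError` when the keys are not unique in the way requested.

```python
import pandas as pd

# Hypothetical toy data: "Compound 1" is listed twice in the right table
left = pd.DataFrame({
    "Chemical Identifier": ["Compound 1", "Compound 2"],
    "Molecular Weight": [257, 437],
})
right = pd.DataFrame({
    "Chemical Identifier": ["Compound 1", "Compound 1"],
    "Chemical SMILES": ["CCO", "OCC"],
})

# Direct check for repeated identifiers
has_duplicates = right["Chemical Identifier"].duplicated().any()
print(has_duplicates)

# Ask pandas to enforce a one-to-one relationship between the keys
try:
    pd.merge(left, right, on="Chemical Identifier", how="inner", validate="one_to_one")
    merge_failed = False
except pd.errors.MergeError:
    merge_failed = True
print(merge_failed)
```

Without `validate`, the same merge would succeed silently and return one row per matching pair, so a duplicated identifier would duplicate the left-hand data.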

### Inner join

The first join is the inner join. The inner join finds all rows whose identifier appears in both dataframes, i.e., identifiers missing from either dataframe are excluded.  

In [5]:
inner_join = pd.merge(ds_one, ds_two, on='Chemical Identifier', how='inner')

print(inner_join.shape)
inner_join.head()

(12, 5)


Unnamed: 0,Chemical Identifier,Molecular Weight,Log P,Chemical SMILES,Chemical InChI
0,Compound 1,257,3,O=C2NC(=O)C(C1=CC=CC=C1)(C(=O)N2)CC,
1,Compound 4,318,-1,C=1C=C(C=CC=1C(C2=CC=C(C=C2)Cl)C(Cl)(Cl)Cl)Cl,
2,Compound 5,298,-1,O=C(O)C1=C(C=CC(=C1Cl)Cl)Cl,
3,Compound 11,440,-3,O=[N+]([O-])C=1C=CC(O)=C(C=1)[N+](=O)[O-],
4,Compound 12,224,3,O=C(OCC)N,


### Outer Join

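An outer join will match all common elements on the column and also include all the remaining rows from both the left and the right datasets.  Values that are missing from one dataset are filled with `NaN`, which is why the integer columns are displayed as floats in the output. 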

In [9]:
outer_join = pd.merge(ds_one, ds_two, on='Chemical Identifier', how='outer')

print(outer_join.shape)
outer_join.head()

(26, 5)


Unnamed: 0,Chemical Identifier,Molecular Weight,Log P,Chemical SMILES,Chemical InChI
0,Compound 1,257.0,3.0,O=C2NC(=O)C(C1=CC=CC=C1)(C(=O)N2)CC,
1,Compound 2,437.0,-1.0,,
2,Compound 3,432.0,-3.0,,
3,Compound 4,318.0,-1.0,C=1C=C(C=CC=1C(C2=CC=C(C=C2)Cl)C(Cl)(Cl)Cl)Cl,
4,Compound 5,298.0,-1.0,O=C(O)C1=C(C=CC(=C1Cl)Cl)Cl,
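
After an outer join it is often useful to know which dataset each row came from. One way to do this, sketched here on hypothetical toy tables, is the `indicator=True` argument of `pd.merge`, which adds a `_merge` column with the values `left_only`, `right_only`, or `both`.

```python
import pandas as pd

# Hypothetical toy tables, not the notebook's data
left = pd.DataFrame({"id": ["A", "B", "C"], "mw": [100, 200, 300]})
right = pd.DataFrame({"id": ["B", "C", "D"], "smiles": ["CC", "CCO", "CCN"]})

# indicator=True records the origin of every row in a "_merge" column
merged = pd.merge(left, right, on="id", how="outer", indicator=True)

# Tally how many rows are shared and how many are unique to each side
counts = merged["_merge"].value_counts().to_dict()
print(counts)
```

Here two rows are marked `both`, one `left_only`, and one `right_only`.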


### Left Join

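A left join keeps every row from the left dataset and adds the matching columns from the right.  Rows in the left dataset without a match in the right get missing values in the right-hand columns.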

In [11]:
left_join = pd.merge(ds_one, ds_two, on='Chemical Identifier', how='left')

print(left_join.shape)
left_join.head()

(19, 5)


Unnamed: 0,Chemical Identifier,Molecular Weight,Log P,Chemical SMILES,Chemical InChI
0,Compound 1,257,3,O=C2NC(=O)C(C1=CC=CC=C1)(C(=O)N2)CC,
1,Compound 2,437,-1,,
2,Compound 3,432,-3,,
3,Compound 4,318,-1,C=1C=C(C=CC=1C(C2=CC=C(C=C2)Cl)C(Cl)(Cl)Cl)Cl,
4,Compound 5,298,-1,O=C(O)C1=C(C=CC(=C1Cl)Cl)Cl,
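
A common follow-up to a left join is to list the rows that found no match in the right dataset. The sketch below does this with `isna()` on a right-hand column; the toy tables and column names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy tables, not the notebook's data
left = pd.DataFrame({"id": ["A", "B", "C"], "mw": [100, 200, 300]})
right = pd.DataFrame({"id": ["B", "C", "D"], "smiles": ["CC", "CCO", "CCN"]})

left_join = pd.merge(left, right, on="id", how="left")

# Rows from the left table with no structure in the right table
missing = left_join[left_join["smiles"].isna()]
print(missing["id"].tolist())  # ['A']
```

This check assumes the right-hand column has no missing values of its own; otherwise a matched row with an empty value would also be listed.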


### Right Join

A right join does the same as the left join, except that it keeps every row from the right dataset and adds the matching columns from the left.  

In [12]:
right_join = pd.merge(ds_one, ds_two, on='Chemical Identifier', how='right')

print(right_join.shape)
right_join.head()

(19, 5)


Unnamed: 0,Chemical Identifier,Molecular Weight,Log P,Chemical SMILES,Chemical InChI
0,Compound 1,257.0,3.0,O=C2NC(=O)C(C1=CC=CC=C1)(C(=O)N2)CC,
1,Compound 4,318.0,-1.0,C=1C=C(C=CC=1C(C2=CC=C(C=C2)Cl)C(Cl)(Cl)Cl)Cl,
2,Compound 5,298.0,-1.0,O=C(O)C1=C(C=CC(=C1Cl)Cl)Cl,
3,Compound 11,440.0,-3.0,O=[N+]([O-])C=1C=CC(O)=C(C=1)[N+](=O)[O-],
4,Compound 12,224.0,3.0,O=C(OCC)N,
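
A right join can also be viewed as a left join with the two dataframes swapped. The following sketch, on hypothetical toy tables, compares the two results after aligning the column order; for these one-to-one keys they should contain the same rows.

```python
import pandas as pd

# Hypothetical toy tables, not the notebook's data
left = pd.DataFrame({"id": ["A", "B", "C"], "mw": [100, 200, 300]})
right = pd.DataFrame({"id": ["B", "C", "D"], "smiles": ["CC", "CCO", "CCN"]})

right_join = pd.merge(left, right, on="id", how="right")

# Same operation expressed as a left join with the arguments swapped
swapped = pd.merge(right, left, on="id", how="left")

# Only the column order differs, so reorder before comparing
same = swapped[right_join.columns].equals(right_join)
print(same)
```

Which form to use is mostly a matter of readability: many people write every join as a left join and choose the argument order accordingly.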
