# Data Preprocessing

1. [Loading the Datasets](#firstbullet)
2. [Exploratory Data Analysis](#secondbullet)
3. [Splitting the Dataset](#thirdbullet)
4. [Canonicalizing the Dataset](#fourthbullet)
5. [Removing the Atom Maps](#fifthbullet)
6. [Tokenizing the Dataset](#sixthbullet)
7. [Preparing the DataFrames](#seventhbullet)
8. [Shuffling and Saving](#eighthbullet)

### 1. Loading the Datasets <a class="anchor" id="firstbullet"></a>

In [1]:
pip install rdkit

Note: you may need to restart the kernel to use updated packages.


In [2]:
!pip install OpenNMT-py==2.2.0



In [3]:
import rdkit
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from rdkit import Chem
import os
import yaml
import torch  # deep learning framework
from rdkit import RDLogger  # optional: used below to silence RDKit warnings
RDLogger.DisableLog('rdApp.*')

In [4]:
train = pd.read_csv(r"C:\Users\sayye\OneDrive\Documents\New folder (2)\USPTO 50k dataset\uspto50k_train.csv")
test = pd.read_csv(r"C:\Users\sayye\OneDrive\Documents\New folder (2)\USPTO 50k dataset\uspto50k_test.csv")
val = pd.read_csv(r"C:\Users\sayye\OneDrive\Documents\New folder (2)\USPTO 50k dataset\uspto50k_val.csv")

#a glimpse of the datasets
print(train.head(),'\n\n', val.head(), '\n\n', test.head())

                id  class                      reactants>reagents>production
0       US05849732      6  [NH:1]([CH2:2][CH2:3][CH2:4][CH2:5][C@@H:6]([C...
1  US20120114765A1      2  [C:1](=[O:2])([c:3]1[cH:4][c:5]([N+:6](=[O:7])...
2     US08003648B2      1  [CH3:44][CH2:45][NH:46][CH2:47][CH3:48].[CH:1]...
3     US09045475B2      1  [C:1]([CH2:2][F:3])([CH2:4][F:5])=[O:65].[CH3:...
4     US08188098B2      2  [C:1](=[O:2])([O:3][CH:4]1[CH2:5][CH2:6][CH2:7... 

                 id  class                      reactants>reagents>production
0     US08329716B2      5  [C:1](=[O:2])([C:3]([F:4])([F:5])[F:6])[O:27][...
1       US06051718      5  [CH3:1][C:2]([CH3:3])([CH3:4])[O:5][C:6](=[O:7...
2     US07504410B2      5  [C:1](=[O:2])([C:3]([F:4])([F:5])[F:6])[O:19][...
3       US04960769      5  [CH3:1][C:2]([CH3:3])([CH3:4])[O:5][C:6](=[O:7...
4  US20110092505A1      5  [CH3:1][C:2]([CH3:3])([CH3:4])[O:5][C:6](=[O:7... 

                 id  class                      reactants>reagents>prod

### 2. Exploratory Data Analysis <a class="anchor" id="secondbullet"></a>

In [5]:
print(f'Shape of the train dataset:\t{train.shape}\nShape of the validation dataset:\t{val.shape}\nShape of the test dataset:\t{test.shape}')

Shape of the train dataset:	(40008, 3)
Shape of the validation dataset:	(5001, 3)
Shape of the test dataset:	(5007, 3)


In [6]:
print(f'Info of the train dataset:\t{train.describe()}\nInfo of the validation dataset:\t{val.describe()}\nInfo of the test dataset:\t{test.describe()}')

Info of the train dataset:	              class
count  40008.000000
mean       3.397570
std        2.504438
min        1.000000
25%        1.000000
50%        2.000000
75%        6.000000
max       10.000000
Info of the validation dataset:	             class
count  5001.000000
mean      3.396921
std       2.504121
min       1.000000
25%       1.000000
50%       2.000000
75%       6.000000
max      10.000000
Info of the test dataset:	             class
count  5007.000000
mean      3.399441
std       2.505572
min       1.000000
25%       1.000000
50%       2.000000
75%       6.000000
max      10.000000


In [7]:
print(f'Columns in the train dataset:\t{train.columns}\nColumns in the validation dataset:\t{val.columns}\nColumns in the test dataset:\t{test.columns}')

Columns in the train dataset:	Index(['id', 'class', 'reactants>reagents>production'], dtype='object')
Columns in the validation dataset:	Index(['id', 'class', 'reactants>reagents>production'], dtype='object')
Columns in the test dataset:	Index(['id', 'class', 'reactants>reagents>production'], dtype='object')


In [8]:
train['source'] = 'train'
val['source'] = 'val'
test['source'] = 'test'

df_all = pd.concat([train, val, test], ignore_index=True)
df_all

Unnamed: 0,id,class,reactants>reagents>production,source
0,US05849732,6,[NH:1]([CH2:2][CH2:3][CH2:4][CH2:5][C@@H:6]([C...,train
1,US20120114765A1,2,[C:1](=[O:2])([c:3]1[cH:4][c:5]([N+:6](=[O:7])...,train
2,US08003648B2,1,[CH3:44][CH2:45][NH:46][CH2:47][CH3:48].[CH:1]...,train
3,US09045475B2,1,[C:1]([CH2:2][F:3])([CH2:4][F:5])=[O:65].[CH3:...,train
4,US08188098B2,2,[C:1](=[O:2])([O:3][CH:4]1[CH2:5][CH2:6][CH2:7...,train
...,...,...,...,...
50011,US20050019696A1,2,[C:1]([C:2](=[CH2:3])[CH3:4])(=[O:5])[Cl:19].[...,test
50012,US20030139425A1,1,[CH2:1]([c:2]1[cH:3][cH:4][c:5]([F:6])[cH:7][c...,test
50013,US05411980,1,[CH3:7][CH2:8][CH2:9][CH2:10][c:11]1[n:12][nH:...,test
50014,US04426381,6,[O:1]([C:2](=[O:3])[c:4]1[c:5]2[n:6]([c:7]3[cH...,test


These public datasets are already curated: the `class` counts reported above match the row counts of each split, so that column has no missing values. The exploration here is limited to what the project requires.
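One way to back the no-missing-values observation with an explicit check is to count empty fields per column. The sketch below uses only the standard library `csv` module on a small made-up table; the rows and the helper name `count_missing` are illustrative assumptions, not part of the dataset or of this notebook. With pandas, the same check is commonly written as `train.isna().sum()`.

```python
import csv
import io

def count_missing(csv_text):
    """Return {column: number of empty cells} for CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            value = row[name]
            # Short rows yield None; whitespace-only cells also count as missing
            if value is None or value.strip() == "":
                missing[name] += 1
    return missing

# Made-up rows for illustration; the second row has an empty class field.
sample = "id,class,reaction\nr1,5,CCO>>CC=O\nr2,,CC(=O)O.CO>>CC(=O)OC\n"
print(count_missing(sample))
```

A result of all zeros for a split would confirm that no column needs imputation or row dropping before the later steps.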

### 3. Splitting the Dataset <a class="anchor" id="thirdbullet"></a>

As shown above, each dataset has three columns: the reaction id, the reaction class (an integer code for the reaction type, e.g., elimination or substitution), and the overall reaction. For further processing, it is better to split each reaction into its components: reactants, reagents, and products.

In [9]:
train[['reactants', 'reagents', 'products']] = train['reactants>reagents>production'].str.split('>', expand = True)
test[['reactants', 'reagents', 'products']] = test['reactants>reagents>production'].str.split('>', expand = True)
val[['reactants', 'reagents', 'products']] = val['reactants>reagents>production'].str.split('>', expand = True)

#checking the columns
print(train.columns, '\n', val.columns, '\n', test.columns)

Index(['id', 'class', 'reactants>reagents>production', 'source', 'reactants',
       'reagents', 'products'],
      dtype='object') 
 Index(['id', 'class', 'reactants>reagents>production', 'source', 'reactants',
       'reagents', 'products'],
      dtype='object') 
 Index(['id', 'class', 'reactants>reagents>production', 'source', 'reactants',
       'reagents', 'products'],
      dtype='object')
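The split relies on every reaction string having exactly two `>` separators. As a minimal sketch of that assumption, the helper below (the name `split_reaction` and the example reaction are hypothetical, chosen for illustration rather than taken from the dataset) splits one reaction string and raises if the format is unexpected.

```python
def split_reaction(rxn_smiles):
    """Split 'reactants>reagents>products' into its three parts."""
    parts = rxn_smiles.split('>')
    if len(parts) != 3:
        raise ValueError(
            f"expected 3 '>'-separated fields, got {len(parts)}: {rxn_smiles!r}"
        )
    reactants, reagents, products = parts
    return reactants, reagents, products

# Illustrative reaction with an empty reagents field (note the '>>')
example = "CCO.CC(=O)O>>CC(=O)OCC"
reactants, reagents, products = split_reaction(example)
print(reactants.split('.'))  # individual reactant molecules
print(repr(reagents))        # empty string when no reagents are listed
print(products)
```

An empty middle field is normal for reactions written with `>>`, so an empty `reagents` column after the split does not by itself indicate a parsing problem.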


### 4. Canonicalizing the Dataset <a class="anchor" id="fourthbullet"></a>

In [10]:
def canonicalize(smiles):  # returns '' if RDKit cannot parse the SMILES
    mol = Chem.MolFromSmiles(smiles)
    if mol:
        return Chem.MolToSmiles(mol)
    else:
        return ''

train['reactants'] = train['reactants'].apply(canonicalize)
train['reagents'] = train['reagents'].apply(canonicalize)
train['products'] = train['products'].apply(canonicalize)

test['reactants'] = test['reactants'].apply(canonicalize)
test['reagents'] = test['reagents'].apply(canonicalize)
test['products'] = test['products'].apply(canonicalize)

val['reactants'] = val['reactants'].apply(canonicalize)
val['reagents'] = val['reagents'].apply(canonicalize)
val['products'] = val['products'].apply(canonicalize)

### 5. Removing the Atom Maps <a class="anchor" id="fifthbullet"></a>

In [11]:
def remove_atommapping(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol:
        for atom in mol.GetAtoms():
            atom.SetAtomMapNum(0)
        return Chem.MolToSmiles(mol)
    else:
        return ''

train['reactants'] = train['reactants'].apply(remove_atommapping)
train['reagents'] = train['reagents'].apply(remove_atommapping)
train['products'] = train['products'].apply(remove_atommapping)

test['reactants'] = test['reactants'].apply(remove_atommapping)
test['reagents'] = test['reagents'].apply(remove_atommapping)
test['products'] = test['products'].apply(remove_atommapping)

val['reactants'] = val['reactants'].apply(remove_atommapping)
val['reagents'] = val['reagents'].apply(remove_atommapping)
val['products'] = val['products'].apply(remove_atommapping)

### 6. Tokenizing the Dataset <a class="anchor" id="sixthbullet"></a>

To be able to train a language model, we need to split the strings into tokens.

In [12]:
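# Regex pattern for SMILES tokens (defined here but not used by tokenize below)
REGEX_TOKENIZER =  r"(\%\([0-9]{3}\)|\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\||\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"

def tokenize(smiles):
    # character-level tokenization: every character becomes its own token
    return ' '.join(smiles)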

train['token_reactants'] = train['reactants'].apply(tokenize)
train['token_reagents'] = train['reagents'].apply(tokenize)
train['token_products'] = train['products'].apply(tokenize)

test['token_reactants'] = test['reactants'].apply(tokenize)
test['token_reagents'] = test['reagents'].apply(tokenize)
test['token_products'] = test['products'].apply(tokenize)

val['token_reactants'] = val['reactants'].apply(tokenize)
val['token_reagents'] = val['reagents'].apply(tokenize)
val['token_products'] = val['products'].apply(tokenize)
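The `tokenize` function in the cell above separates every character, so multi-character units such as `Cl` or `[C@H]` end up as several tokens (visible as `C l` and `[ C @ H ]` in the outputs further down). The `REGEX_TOKENIZER` pattern from the same cell can instead keep such units together. The following is a hedged sketch of that alternative, not the tokenization used for the outputs in this notebook; the function name `regex_tokenize` and the round-trip check are assumptions added for illustration.

```python
import re

# Same pattern as REGEX_TOKENIZER in the cell above, compiled once
SMILES_TOKEN_PATTERN = re.compile(
    r"(\%\([0-9]{3}\)|\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\||\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"
)

def regex_tokenize(smiles):
    """Split a SMILES string into space-separated chemical tokens."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Round-trip check: fail loudly if the pattern skipped any character
    if ''.join(tokens) != smiles:
        raise ValueError(f"could not tokenize every character of {smiles!r}")
    return ' '.join(tokens)

print(regex_tokenize("CC(=O)Cl"))         # C C ( = O ) Cl
print(regex_tokenize("C[C@H](N)C(=O)O"))  # C [C@H] ( N ) C ( = O ) O
```

Switching tokenizers changes the vocabulary and sequence lengths seen by the model, so the same function would need to be applied consistently to all three splits.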

### 7. Preparing the DataFrames <a class="anchor" id="seventhbullet"></a>

In [13]:
train_df = pd.DataFrame({'Id': train['id'], 
                         'Class': train['class'], 
                         'Tokenized Reactants': train['token_reactants'],
                         'Tokenized Products': train['token_products'],
                        'Overall Reaction': train['reactants>reagents>production']})
print(f"The training set contains {train_df.shape[0]} reactions.")
train_df.head()

The training set contains 40008 reactions.


Unnamed: 0,Id,Class,Tokenized Reactants,Tokenized Products,Overall Reaction
0,US05849732,6,C O C ( = O ) [ C @ H ] ( C C C C N C ( = O ) ...,C O C ( = O ) [ C @ H ] ( C C C C N ) N C ( = ...,[NH:1]([CH2:2][CH2:3][CH2:4][CH2:5][C@@H:6]([C...
1,US20120114765A1,2,N c 1 c c c c 2 c n c c c 1 2 . O = C ( O ) c ...,O = C ( N c 1 c c c c 2 c n c c c 1 2 ) c 1 c ...,[C:1](=[O:2])([c:3]1[cH:4][c:5]([N+:6](=[O:7])...
2,US08003648B2,1,C C N C C . C c 1 n c ( - c 2 c c c ( C = O ) ...,C C N ( C C ) C c 1 c c c ( - c 2 n c ( C ) c ...,[CH3:44][CH2:45][NH:46][CH2:47][CH3:48].[CH:1]...
3,US09045475B2,1,C C 1 ( C ) C C C ( C N 2 C C N ( c 3 c c c ( ...,C C 1 ( C ) C C C ( C N 2 C C N ( c 3 c c c ( ...,[C:1]([CH2:2][F:3])([CH2:4][F:5])=[O:65].[CH3:...
4,US08188098B2,2,C C O c 1 c c c ( O c 2 n c n c 3 c 2 c n n 3 ...,C C O c 1 c c c ( O c 2 n c n c 3 c 2 c n n 3 ...,[C:1](=[O:2])([O:3][CH:4]1[CH2:5][CH2:6][CH2:7...


In [14]:
test_df = pd.DataFrame({'Id': test['id'], 
                        'Class': test['class'],
                        'Tokenized Reactants': test['token_reactants'],
                        'Tokenized Products': test['token_products'],
                       'Overall Reaction': test['reactants>reagents>production']})
print(f"The test set contains {test_df.shape[0]} reactions.")
test_df.head()

The test set contains 5007 reactions.


Unnamed: 0,Id,Class,Tokenized Reactants,Tokenized Products,Overall Reaction
0,US07928231B2,5,C C ( = O ) c 1 c c c 2 [ n H ] c c c 2 c 1 . ...,C C ( = O ) c 1 c c c 2 c ( c c n 2 C ( = O ) ...,[CH3:1][C:2]([CH3:3])([CH3:4])[O:5][C:6](=[O:7...
1,US20090192322A1,5,C C ( C ) ( C ) O C ( = O ) O C ( = O ) O C ( ...,C c 1 c c c ( S ( = O ) ( = O ) O [ C @ @ H ] ...,[CH3:1][C:2]([CH3:3])([CH3:4])[O:5][C:6](=[O:7...
2,US20080146614A1,10,C C O C ( = O ) c 1 n n ( - c 2 c c c ( C l ) ...,C C O C ( = O ) c 1 n n ( - c 2 c c c ( C l ) ...,[Br:1][N:35]1[C:30](=[O:29])[CH2:31][CH2:32][C...
3,US20120207729A1,5,C C ( C ) ( C ) O C ( = O ) O C ( = O ) O C ( ...,C C ( C ) ( C ) O C ( = O ) N c 1 n c 2 c ( - ...,[CH3:1][C:2]([CH3:3])([CH3:4])[O:5][C:6](=[O:7...
4,US20070003539A1,5,N C c 1 c c c c c 1 S ( = O ) ( = O ) C 1 C C ...,O = C ( N C c 1 c c c c c 1 S ( = O ) ( = O ) ...,[C:1](=[O:2])([C:3]([F:4])([F:5])[F:6])[O:23][...


In [15]:
val_df = pd.DataFrame({'Id': val['id'],
                       'Class': val['class'],
                       'Tokenized Reactants': val['token_reactants'],
                       'Tokenized Products': val['token_products'],
                      'Overall Reaction': val['reactants>reagents>production']})
print(f"The validation set contains {val_df.shape[0]} reactions.")
val_df.head(20)

The validation set contains 5001 reactions.


Unnamed: 0,Id,Class,Tokenized Reactants,Tokenized Products,Overall Reaction
0,US08329716B2,5,N c 1 c c c ( O c 2 c c n c 3 [ n H ] c c c 2 ...,O = C ( N c 1 c c c ( O c 2 c c n c 3 [ n H ] ...,[C:1](=[O:2])([C:3]([F:4])([F:5])[F:6])[O:27][...
1,US06051718,5,C C ( C ) ( C ) O C ( = O ) O C ( = O ) O C ( ...,C O c 1 c c c ( C ( C ) C N C ( = O ) O C ( C ...,[CH3:1][C:2]([CH3:3])([CH3:4])[O:5][C:6](=[O:7...
2,US07504410B2,5,N c 1 c c c ( [ N + ] ( = O ) [ O - ] ) c c 1 ...,O = C ( N c 1 c c c ( [ N + ] ( = O ) [ O - ] ...,[C:1](=[O:2])([C:3]([F:4])([F:5])[F:6])[O:19][...
3,US04960769,5,C C ( C ) ( C ) O C ( = O ) O C ( = O ) O C ( ...,C C ( C ) ( C ) O C ( = O ) N 1 C C C c 2 o n ...,[CH3:1][C:2]([CH3:3])([CH3:4])[O:5][C:6](=[O:7...
4,US20110092505A1,5,C C ( C ) ( C ) O C ( = O ) N 1 C C N ( c 2 c ...,C C ( C ) ( C ) O C ( = O ) N 1 C C N ( c 2 c ...,[CH3:1][C:2]([CH3:3])([CH3:4])[O:5][C:6](=[O:7...
5,US05864045,5,C C ( C ) ( C ) O C ( = O ) O C ( = O ) O C ( ...,C C ( C ) ( C ) O C ( = O ) N C C ( c 1 c c c ...,[CH3:1][C:2]([CH3:3])([CH3:4])[O:5][C:6](=[O:7...
6,US20030166650A1,5,C C ( = O ) c 1 c c c 2 c ( c 1 ) C C C N 2 . ...,C C ( = O ) c 1 c c c 2 c ( c 1 ) C C C N 2 C ...,[CH3:1][C:2]([CH3:3])([CH3:4])[O:5][C:6](=[O:7...
7,US07232834B2,5,C = C [ C @ H ] 1 C N C C [ C @ H ] 1 C C C ( ...,C = C [ C @ H ] 1 C N ( C ( = O ) O C ( C ) ( ...,[CH2:8]=[CH:9][C@H:10]1[CH2:11][NH:12][CH2:13]...
8,US06534531B2,5,C C ( C ) ( C ) O C ( = O ) O C ( = O ) O C ( ...,C C ( C ) ( C ) O C ( = O ) N 1 C C C C ( C ( ...,[CH3:1][C:2]([CH3:3])([CH3:4])[O:5][C:6](=[O:7...
9,US05866685,5,C C ( C ) ( C ) O C ( = O ) O C ( = O ) O C ( ...,C C ( C ) ( C ) O C ( = O ) N 1 C C C ( C C ( ...,[CH3:1][C:2]([CH3:3])([CH3:4])[O:5][C:6](=[O:7...


### 8. Shuffling and Saving <a class="anchor" id="eighthbullet"></a>

The dataset lists reactions grouped by type (the snippet above shows only class 5), so without shuffling the model might learn patterns that do not generalize. Shuffling ensures that each training batch contains a variety of reaction types, reactants, and complexities, which helps the model learn general rules of reactivity and avoid overfitting. After shuffling, the reaction classes appear in random order.

In [16]:
train_rn = train_df.sample(frac=1, random_state=42).reset_index(drop=True)
test_rn = test_df.sample(frac=1, random_state=42).reset_index(drop=True)
val_rn = val_df.sample(frac=1, random_state=42).reset_index(drop=True)
train_rn.head(20)

Unnamed: 0,Id,Class,Tokenized Reactants,Tokenized Products,Overall Reaction
0,US05869694,9,C C O C ( = O ) C [ C @ H ] ( O ) C C l . [ N ...,C C O C ( = O ) C [ C @ H ] ( O ) C N = [ N + ...,[CH2:1]([C@H:2]([CH2:3][C:4]([O:5][CH2:6][CH3:...
1,US20070197512A1,7,C C ( C ) C 1 O c 2 c c c ( C l ) c c 2 N C 1 = O,C C ( C ) C 1 C N c 2 c c ( C l ) c c c 2 O 1,[C:1]1(=[O:15])[CH:2]([CH:3]([CH3:4])[CH3:5])[...
2,US20120184542A1,6,C O c 1 n c c 2 c c ( C ( = O ) N c 3 c c ( C ...,C O c 1 n c c 2 c c ( C ( = O ) N c 3 c c ( C ...,[NH:1]([C@H:2]([CH2:3][CH2:4][NH:5][C:6]([c:7]...
3,US20060223843A1,3,N # C c 1 c n c 2 c c c ( C = O ) n c 2 c 1 . ...,N # C c 1 c n c 2 c c c ( C = C 3 S C ( N C c ...,[CH:1]([c:2]1[cH:3][cH:4][c:5]2[n:6][cH:7][c:8...
4,US07671043B2,2,C N O C . C O C ( = O ) c 1 c c c ( C ( = O ) ...,C O C ( = O ) c 1 c c c ( C ( = O ) N ( C ) O ...,[C:1]([c:2]1[cH:3][cH:4][c:5]([C:6]([O:7][CH3:...
5,US09145380B2,7,N S ( = O ) ( = O ) c 1 c c c c c 1 N S ( = O ...,N S ( = O ) ( = O ) c 1 c c c c c 1 N S ( = O ...,[NH2:1][S:2](=[O:3])(=[O:4])[c:5]1[cH:6][cH:7]...
6,US07932390B2,2,C C O C ( = O ) c 1 c n c ( N ) c 2 c ( C O c ...,C c 1 n n c ( - c 2 c c c c ( O C c 3 c s c 4 ...,[C:1](=[O:2])([c:3]1[cH:4][n:5][c:6]([NH2:7])[...
7,US20100168418A1,6,C O c 1 c c c 2 c ( / C = C / C ( N ) = O ) c ...,C O c 1 c c c 2 c ( / C = C / C ( N ) = O ) c ...,[O:1]1[CH:2]([CH2:3][n:4]2[c:5]3[c:6]([cH:7][c...
8,US20150038528A1,6,C C ( C ) ( C ) O C ( = O ) N [ C @ @ H ] ( C ...,N # C c 1 n c ( C ( = O ) N C C [ C @ H ] ( N ...,[NH:1]([C@@H:2]([CH2:3][CH2:4][NH:5][C:6](=[O:...
9,US20100234356A1,1,O = C ( c 1 c c c ( C l ) c c 1 ) N 1 C C ( O ...,O = C ( c 1 c c c ( C l ) c c 1 ) N 1 C C ( O ...,[CH:1]([c:2]1[cH:3][cH:4][c:5]([Cl:6])[cH:7][c...
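The important property of this step is that whole rows are shuffled together, so each tokenized reactant string stays paired with its product. The sketch below illustrates that idea with the standard library `random` module instead of `DataFrame.sample`; the toy rows and the helper name `shuffle_rows` are made up for illustration.

```python
import random
from collections import Counter

def shuffle_rows(rows, seed=42):
    """Return a shuffled copy of rows; a fixed seed keeps it reproducible."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled

# Toy rows ordered by class, mimicking the class-grouped layout of the dataset
rows = [(cls, f"src_{cls}_{i}", f"tgt_{cls}_{i}")
        for cls in (1, 2, 5) for i in range(4)]
shuffled = shuffle_rows(rows)

# Class counts for the first four rows, before and after shuffling
print(Counter(cls for cls, _, _ in rows[:4]))
print(Counter(cls for cls, _, _ in shuffled[:4]))
```

Because each row is moved as a unit, writing the reactant and product columns to separate files afterwards keeps line `k` of one file matched to line `k` of the other.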


In [17]:
train_rn['Tokenized Reactants'].to_csv("uspto_50k_train_reactants.txt", index=False, header=False)
train_rn['Tokenized Products'].to_csv("uspto_50k_train_products.txt", index=False, header=False)

test_rn['Tokenized Reactants'].to_csv("uspto_50k_test_reactants.txt", index=False, header=False)
test_rn['Tokenized Products'].to_csv("uspto_50k_test_products.txt", index=False, header=False)

val_rn['Tokenized Reactants'].to_csv("uspto_50k_val_reactants.txt", index=False, header=False)
val_rn['Tokenized Products'].to_csv("uspto_50k_val_products.txt", index=False, header=False)
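Since the reactant and product files are consumed line by line as pairs, a quick alignment check after saving can catch mismatched or blank lines early. The following is a minimal sketch under stated assumptions: the helper name `check_parallel_files` is hypothetical, and the demo writes two small temporary files rather than reading the files saved above.

```python
import os
import tempfile

def check_parallel_files(first_path, second_path):
    """Return the shared line count; raise if the files are misaligned."""
    with open(first_path, encoding="utf-8") as f:
        first_lines = f.read().splitlines()
    with open(second_path, encoding="utf-8") as f:
        second_lines = f.read().splitlines()
    if len(first_lines) != len(second_lines):
        raise ValueError(
            f"line counts differ: {len(first_lines)} vs {len(second_lines)}"
        )
    pairs = zip(first_lines, second_lines)
    for number, (first, second) in enumerate(pairs, start=1):
        if not first.strip() or not second.strip():
            raise ValueError(f"blank entry on line {number}")
    return len(first_lines)

# Demo with temporary files standing in for the saved .txt files
with tempfile.TemporaryDirectory() as tmp:
    reactants_file = os.path.join(tmp, "reactants.txt")
    products_file = os.path.join(tmp, "products.txt")
    with open(reactants_file, "w", encoding="utf-8") as f:
        f.write("C C O\nC C ( = O ) O\n")
    with open(products_file, "w", encoding="utf-8") as f:
        f.write("C C = O\nC C ( = O ) O C\n")
    n_pairs = check_parallel_files(reactants_file, products_file)
print(n_pairs)
```

Running the same check on each pair of saved files (train, validation, test) would also flag rows where canonicalization returned an empty string.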