# Data Preprocessing

1. [Loading the Datasets](#firstbullet)
2. [Exploratory Data Analysis](#secondbullet)
3. [Splitting the Dataset](#thirdbullet)
4. [Canonicalizing the Data](#fourthbullet)
5. [Removing the Atom Maps](#fifthbullet)
6. [Tokenizing the Dataset](#sixthbullet)
7. [Preparing the DataFrames](#seventhbullet)
8. [Shuffling and Saving](#eighthbullet)

### 1. Loading the Datasets <a class="anchor" id="firstbullet"></a>

In [None]:
import pandas as pd
from rdkit import Chem

train = pd.read_csv(r"C:\Users\sayye\OneDrive\Documents\New folder (2)\USPTO 50k dataset\uspto50k_train.csv")
test = pd.read_csv(r"C:\Users\sayye\OneDrive\Documents\New folder (2)\USPTO 50k dataset\uspto50k_test.csv")
val = pd.read_csv(r"C:\Users\sayye\OneDrive\Documents\New folder (2)\USPTO 50k dataset\uspto50k_val.csv")

# A glimpse of the datasets
print(train.head(),'\n\n', val.head(), '\n\n', test.head())

### 2. Exploratory Data Analysis <a class="anchor" id="secondbullet"></a>

In [None]:
print(f'Shape of the train dataset:\t{train.shape}\nShape of the validation dataset:\t{val.shape}\nShape of the test dataset:\t{test.shape}')

In [None]:
print(f'Summary statistics of the train dataset:\n{train.describe()}\n\nSummary statistics of the validation dataset:\n{val.describe()}\n\nSummary statistics of the test dataset:\n{test.describe()}')

In [None]:
print(f'Columns in the train dataset:\t{train.columns}\nColumns in the validation dataset:\t{val.columns}\nColumns in the test dataset:\t{test.columns}')

In [None]:
train['source'] = 'train'
val['source'] = 'val'
test['source'] = 'test'

df_all = pd.concat([train, val, test], ignore_index=True)
df_all

The datasets are open source and already curated, so they are largely clean and do not contain null values. The exploration above covers what this project requires.
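The claim about missing values can be checked directly rather than taken on trust. The sketch below counts nulls per column and repeated reaction strings on a small toy frame; the rows are made up for illustration, and only the column names follow the datasets loaded above.

```python
import pandas as pd

# Toy stand-in for one of the splits (rows are illustrative, not from USPTO-50k)
toy = pd.DataFrame({
    "id": ["rxn_1", "rxn_2", "rxn_3"],
    "class": [1, 2, 2],
    "reactants>reagents>production": [
        "CCO.CC(=O)O>>CCOC(C)=O",
        None,                      # deliberately missing
        "CCO.CC(=O)O>>CCOC(C)=O",  # deliberate repeat of the first row
    ],
})

# Missing values per column
null_counts = toy.isna().sum()

# Rows whose reaction string already appeared earlier in the frame
n_duplicates = toy.duplicated(subset="reactants>reagents>production").sum()

print(null_counts)
print(f"Duplicate reactions: {n_duplicates}")
```

Running the same two calls on `train`, `val`, and `test` would confirm whether any rows need to be dropped before further processing.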

### 3. Splitting the Dataset <a class="anchor" id="thirdbullet"></a>

As shown above, each dataset has three columns: the reaction id, the reaction class (an encoded reaction type, e.g., elimination or substitution), and the overall reaction. The overall reaction is best split into its components, i.e., reactants, reagents, and products, for further processing.

In [None]:
train[['reactants', 'reagents', 'products']] = train['reactants>reagents>production'].str.split('>', expand = True)
test[['reactants', 'reagents', 'products']] = test['reactants>reagents>production'].str.split('>', expand = True)
val[['reactants', 'reagents', 'products']] = val['reactants>reagents>production'].str.split('>', expand = True)

#checking the columns
print(train.columns, '\n', val.columns, '\n', test.columns)
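To make the split concrete, here is a minimal sketch on a single reaction string using plain Python. The reaction is an illustrative esterification, not a row from the dataset.

```python
# Illustrative reaction SMILES with no reagents listed
reaction = "CC(=O)O.OCC>>CC(=O)OCC"

# A reaction SMILES has the form reactants>reagents>products,
# so splitting on '>' yields exactly three fields
reactants, reagents, products = reaction.split(">")

print(repr(reactants))  # 'CC(=O)O.OCC'
print(repr(reagents))   # '' (nothing sits between the two '>' characters)
print(repr(products))   # 'CC(=O)OCC'

# Within a field, individual molecules are separated by '.'
reactant_molecules = reactants.split(".")
print(reactant_molecules)
```

An empty reagents field is still a valid field, which is why the `reagents` column can consist of empty strings after the split.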

### 4. Canonicalizing the Data <a class="anchor" id="fourthbullet"></a>

In [None]:
def canonicalize(smiles): # returns '' for an invalid SMILES (MolFromSmiles returns None instead of raising)
    mol = Chem.MolFromSmiles(smiles)
    if mol:
        return Chem.MolToSmiles(mol)
    else:
        return ''

train['reactants'] = train['reactants'].apply(canonicalize)
train['reagents'] = train['reagents'].apply(canonicalize)
train['products'] = train['products'].apply(canonicalize)

test['reactants'] = test['reactants'].apply(canonicalize)
test['reagents'] = test['reagents'].apply(canonicalize)
test['products'] = test['products'].apply(canonicalize)

val['reactants'] = val['reactants'].apply(canonicalize)
val['reagents'] = val['reagents'].apply(canonicalize)
val['products'] = val['products'].apply(canonicalize)

### 5. Removing the Atom Maps <a class="anchor" id="fifthbullet"></a>

In [None]:
def remove_atommapping(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol:
        for atom in mol.GetAtoms():
            atom.SetAtomMapNum(0)
        return Chem.MolToSmiles(mol)
    else:
        return ''

train['reactants'] = train['reactants'].apply(remove_atommapping)
train['reagents'] = train['reagents'].apply(remove_atommapping)
train['products'] = train['products'].apply(remove_atommapping)

test['reactants'] = test['reactants'].apply(remove_atommapping)
test['reagents'] = test['reagents'].apply(remove_atommapping)
test['products'] = test['products'].apply(remove_atommapping)

val['reactants'] = val['reactants'].apply(remove_atommapping)
val['reagents'] = val['reagents'].apply(remove_atommapping)
val['products'] = val['products'].apply(remove_atommapping)
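As a quick sanity check of this step, the sketch below applies the same idea to one molecule written two ways: once with atom maps and once without, in a different atom order. The helper name `strip_atom_maps` and the example strings are illustrative assumptions; the RDKit calls are the ones already used above.

```python
from rdkit import Chem

def strip_atom_maps(smiles):
    """Return canonical SMILES with atom-map numbers cleared ('' if unparsable)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return ""
    for atom in mol.GetAtoms():
        atom.SetAtomMapNum(0)  # 0 means "no map number"
    return Chem.MolToSmiles(mol)

mapped = "[CH3:1][CH2:2][OH:3]"  # ethanol with atom maps (illustrative)
unmapped = "OCC"                 # ethanol, written in a different atom order

print(strip_atom_maps(mapped))
print(strip_atom_maps(unmapped))
print(repr(strip_atom_maps("C1CC")))  # unclosed ring: cannot be parsed
```

Both ethanol strings should print the same SMILES with no `:` left in it, and the unparsable input should come back as an empty string.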

### 6. Tokenizing the Dataset <a class="anchor" id="sixthbullet"></a>

To be able to train a language model, we need to split the strings into tokens.

In [None]:
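import re

REGEX_TOKENIZER = re.compile(r"(\%\([0-9]{3}\)|\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\||\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])")

def tokenize(smiles):
    # Split with the regex so multi-character tokens (e.g. Cl, Br, [nH]) stay intact,
    # instead of separating every single character
    return ' '.join(REGEX_TOKENIZER.findall(smiles))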

train['token_reactants'] = train['reactants'].apply(tokenize)
train['token_reagents'] = train['reagents'].apply(tokenize)
train['token_products'] = train['products'].apply(tokenize)

test['token_reactants'] = test['reactants'].apply(tokenize)
test['token_reagents'] = test['reagents'].apply(tokenize)
test['token_products'] = test['products'].apply(tokenize)

val['token_reactants'] = val['reactants'].apply(tokenize)
val['token_reagents'] = val['reagents'].apply(tokenize)
val['token_products'] = val['products'].apply(tokenize)
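To illustrate why the choice of splitting rule matters, the sketch below compares a character-level split with a regex-based split on one small molecule. The pattern is the same one defined as `REGEX_TOKENIZER` above; the names `SMILES_TOKEN_PATTERN` and `regex_tokenize` and the example SMILES are illustrative assumptions.

```python
import re

# Same pattern as REGEX_TOKENIZER above, compiled under an illustrative name
SMILES_TOKEN_PATTERN = re.compile(
    r"(\%\([0-9]{3}\)|\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\||\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"
)

def regex_tokenize(smiles):
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Round-trip check: the pattern must not silently drop any character
    assert "".join(tokens) == smiles, f"untokenizable characters in {smiles!r}"
    return " ".join(tokens)

smiles = "O=C(Cl)c1ccc(Br)[nH]1"

char_level = " ".join(smiles)       # every character becomes its own token
regex_level = regex_tokenize(smiles)

print(char_level)   # O = C ( C l ) c 1 c c c ( B r ) [ n H ] 1
print(regex_level)  # O = C ( Cl ) c 1 c c c ( Br ) [nH] 1
```

The character-level split breaks `Cl` into a carbon and a stray `l`, and scatters the bracket atom `[nH]` over four tokens, whereas the regex keeps each as one chemically meaningful token.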

### 7. Preparing the DataFrames <a class="anchor" id="seventhbullet"></a>

In [None]:
train_df = pd.DataFrame({'Id': train['id'], 
                         'Class': train['class'], 
                         'Tokenized Reactants': train['token_reactants'],
                         'Tokenized Products': train['token_products'],
                        'Overall Reaction': train['reactants>reagents>production']})
print(f"The training set contains {train_df.shape[0]} reactions.")
train_df.head()

In [None]:
test_df = pd.DataFrame({'Id': test['id'], 
                        'Class': test['class'],
                        'Tokenized Reactants': test['token_reactants'],
                        'Tokenized Products': test['token_products'],
                       'Overall Reaction': test['reactants>reagents>production']})
print(f"The test set contains {test_df.shape[0]} reactions.")
test_df.head()

In [None]:
val_df = pd.DataFrame({'Id': val['id'],
                       'Class': val['class'],
                       'Tokenized Reactants': val['token_reactants'],
                       'Tokenized Products': val['token_products'],
                      'Overall Reaction': val['reactants>reagents>production']})
print(f"The validation set contains {val_df.shape[0]} reactions.")
val_df.head(20)

### 8. Shuffling and Saving the Datasets <a class="anchor" id="eighthbullet"></a>

The dataset lists its reactions ordered by type (the snippet above shows a single reaction class, 5), so without shuffling the model might learn patterns that do not generalize. Shuffling ensures that each training batch contains a variety of reaction types, reactants, and complexities, which helps the model learn general rules of reactivity and reduces the risk of overfitting. After shuffling, the reaction classes appear in random order.

In [None]:
train_rn = train_df.sample(frac=1, random_state=42).reset_index(drop=True)
test_rn = test_df.sample(frac=1, random_state=42).reset_index(drop=True)
val_rn = val_df.sample(frac=1, random_state=42).reset_index(drop=True)
train_rn.head(20)
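Two properties of this shuffle are worth checking: the fixed `random_state` makes it reproducible, and rows move as a unit, so each reactant entry stays paired with its product. A minimal sketch on a toy frame (values are illustrative) follows.

```python
import pandas as pd

# Toy frame sorted by class, mimicking the ordered layout described above
toy = pd.DataFrame({
    "Class": [1, 1, 1, 2, 2, 2],
    "Tokenized Reactants": [f"r{i}" for i in range(6)],
    "Tokenized Products": [f"p{i}" for i in range(6)],
})

shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)
again = toy.sample(frac=1, random_state=42).reset_index(drop=True)

# Same seed, same order: the shuffle is reproducible
reproducible = shuffled.equals(again)

# Rows move as a unit, so reactant r<i> still sits next to product p<i>
pairs_intact = all(
    r[1:] == p[1:]
    for r, p in zip(shuffled["Tokenized Reactants"], shuffled["Tokenized Products"])
)

print(reproducible, pairs_intact)
```

Because the pairing survives the shuffle, the reactant and product columns can later be written to separate files without losing their correspondence.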

In [None]:
train_rn['Tokenized Reactants'].to_csv("uspto_50k_train_reactants.txt", index=False, header=False)
train_rn['Tokenized Products'].to_csv("uspto_50k_train_products.txt", index=False, header=False)

test_rn['Tokenized Reactants'].to_csv("uspto_50k_test_reactants.txt", index=False, header=False)
test_rn['Tokenized Products'].to_csv("uspto_50k_test_products.txt", index=False, header=False)

val_rn['Tokenized Reactants'].to_csv("uspto_50k_val_reactants.txt", index=False, header=False)
val_rn['Tokenized Products'].to_csv("uspto_50k_val_products.txt", index=False, header=False)
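Assuming the reactant and product files are later read as parallel files, where line i of one corresponds to line i of the other, it is worth confirming that each pair has the same number of lines. The sketch below does this on two small temporary files; the file names, contents, and the `count_lines` helper are illustrative.

```python
import os
import tempfile

def count_lines(path):
    """Count the lines in a text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)

# Illustrative stand-ins for a reactants/products file pair
reactant_lines = ["C C O . C C ( = O ) O", "c 1 c c c c c 1 Br"]
product_lines = ["C C O C ( C ) = O", "c 1 c c c c c 1 O"]

with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "toy_reactants.txt")
    tgt_path = os.path.join(tmp, "toy_products.txt")
    for path, lines in ((src_path, reactant_lines), (tgt_path, product_lines)):
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("\n".join(lines) + "\n")
    n_src, n_tgt = count_lines(src_path), count_lines(tgt_path)

print(n_src, n_tgt)
assert n_src == n_tgt, "reactant and product files are out of step"
```

The same `count_lines` comparison can be pointed at each `uspto_50k_*_reactants.txt` and `uspto_50k_*_products.txt` pair written above.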