# Data processing

In this notebook, I load a list of molecules obtained from ChEMBL and process them so that each compound has:
- A standardised SMILES representation
- Its associated InChIKey

In [1]:
# In this codeblock I will import the necessary packages and specify the paths to relevant folders

# Uncomment the two lines below if rdkit or standardiser cannot be found
# %pip install rdkit
# %pip install standardiser

import pandas as pd
import sys
sys.path.append('../src')
from smiles_processing import standardise_smiles

input_file = '../data/reference_library.csv'
output_folder = '../data/'

In [2]:
# In this codeblock I will load the data from the /data folder into a Pandas dataframe and inspect its column headers
df: pd.DataFrame = pd.read_csv(input_file)
df.columns

Index(['smiles'], dtype='object')

In [3]:
# In this codeblock I will convert the molecules to standard SMILES by using the function standardise_smiles from /src
# I will import the function directly from src, not copying it here
standard_smiles: list = standardise_smiles(df['smiles'])

[05:49:09] Can't kekulize mol.  Unkekulized atoms: 3 7
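
The warning above suggests that at least one input could not be parsed. The later TypeError messages mention objects of type float, so I assume (without having checked the source of `standardise_smiles`) that failed entries come back as a non-string placeholder such as None or NaN. Below is a minimal sketch of how such entries could be separated from the usable SMILES before the next step. The helper `split_valid_smiles` and the example list are hypothetical and not part of /src.

```python
def split_valid_smiles(values):
    """Separate usable SMILES strings from failed entries (None, NaN, empty strings)."""
    valid, failed_positions = [], []
    for position, value in enumerate(values):
        # Only non-empty strings can be SMILES; NaN is a float and None is NoneType
        if isinstance(value, str) and value.strip():
            valid.append(value)
        else:
            failed_positions.append(position)
    return valid, failed_positions


# Hypothetical example list, standing in for the output of standardise_smiles
example_values = ["CCO", None, float("nan"), "c1ccccc1", ""]
valid_smiles, failed_positions = split_valid_smiles(example_values)
print(f"{len(valid_smiles)} usable, {len(failed_positions)} failed at positions {failed_positions}")
```

Keeping the positions of the failed entries makes it possible to trace them back to the rows of the original dataframe.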


In [4]:
# In this codeblock I will get the InChIKey representation of the molecules using the RDKit package
from rdkit import Chem

smiles_dict = {}

# Convert each SMILES to a molecule object, then compute its InChIKey and save it to a dictionary
for smiles in standard_smiles:
    try:
        mol_object = Chem.MolFromSmiles(smiles)
    except TypeError as e:
        # Raised for non-string entries (e.g. NaN left by a failed standardisation)
        print(f"A TypeError occurred: {e}")
    else:
        # MolFromSmiles returns None, without raising, when a SMILES cannot be parsed
        if mol_object is None:
            continue
        inchikey = Chem.MolToInchiKey(mol_object)
        smiles_dict[smiles] = inchikey

A TypeError occurred: No registered converter was able to produce a C++ rvalue of type std::__1::basic_string<wchar_t, std::__1::char_traits<wchar_t>, std::__1::allocator<wchar_t>> from this Python object of type float
A TypeError occurred: No registered converter was able to produce a C++ rvalue of type std::__1::basic_string<wchar_t, std::__1::char_traits<wchar_t>, std::__1::allocator<wchar_t>> from this Python object of type float
A TypeError occurred: No registered converter was able to produce a C++ rvalue of type std::__1::basic_string<wchar_t, std::__1::char_traits<wchar_t>, std::__1::allocator<wchar_t>> from this Python object of type float
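
Before saving, it may be worth checking that every value in the dictionary looks like an InChIKey. A standard InChIKey has 27 characters: 14 uppercase letters, a hyphen, 10 uppercase letters, a hyphen, and 1 uppercase letter. The sketch below checks only this layout, not whether the key matches the molecule. The helper `malformed_inchikeys` and the example dictionary are hypothetical.

```python
import re

# Layout of a standard InChIKey: 14 letters, hyphen, 10 letters, hyphen, 1 letter
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def malformed_inchikeys(mapping):
    """Return the SMILES whose InChIKey is missing or does not match the expected layout."""
    return [
        smiles
        for smiles, key in mapping.items()
        if not (isinstance(key, str) and INCHIKEY_PATTERN.match(key))
    ]


# Hypothetical stand-in for smiles_dict: the ethanol key is real, the second entry is deliberately broken
example_dict = {"CCO": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "C": ""}
print(malformed_inchikeys(example_dict))
```

An empty result on the real dictionary would mean that every stored key at least has the right shape.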


In [5]:
# In this codeblock I will save the data as a .csv file containing only the standard SMILES and the InChIKey as columns.
# All data will be saved with informative names in the /data folder

df = pd.DataFrame(list(smiles_dict.items()), columns=['SMILES', 'InChIKey'])

df.to_csv(output_folder + 'smiles_to_inchikeys_conversion.csv', index=False)

# Model Bias Evaluation

Next, I will take the predictions obtained from the Ersilia Model Hub for the dataset of 1000 molecules curated above and examine how they are distributed over their output range (0 to 1 for probabilities, or a model-specific range for regression models).
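
As a rough idea of what "distribution" means here, the sketch below counts how many values fall into equal-width bins over a chosen range. It assumes the predictions are plain numbers inside `[low, high]`. The function `bin_counts` and the scores are placeholders for illustration, not real model output.

```python
from collections import Counter


def bin_counts(values, n_bins=10, low=0.0, high=1.0):
    """Count how many values fall into each of n_bins equal-width bins over [low, high]."""
    width = (high - low) / n_bins
    counts = Counter()
    for value in values:
        # Clamp the index so that a value equal to `high` lands in the last bin
        index = min(int((value - low) / width), n_bins - 1)
        counts[index] += 1
    return [counts.get(i, 0) for i in range(n_bins)]


# Placeholder scores for illustration only
placeholder_scores = [0.02, 0.05, 0.11, 0.48, 0.52, 0.97, 1.0]
print(bin_counts(placeholder_scores, n_bins=5))
```

Counts piled into the first or last bins would hint that a model mostly predicts one class for this library, which is the kind of bias this section looks for.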

In [6]:
# In this codeblock I will load the predictions I've run on Ersilia and saved in the /data folder

In [7]:
# In this codeblock I will create the necessary plots with Matplotlib to observe the distribution of predicted values
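
A minimal sketch of the kind of histogram this cell could produce is shown below. The predictions have not been loaded yet, so the values are random placeholders in the 0 to 1 range, and the axis labels and bin count are my own choices rather than anything fixed by the model.

```python
import random

import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt

# Placeholder values standing in for the real predictions, which are not loaded here
rng = random.Random(0)
placeholder_predictions = [rng.random() for _ in range(1000)]

fig, ax = plt.subplots(figsize=(6, 4))
# 20 equal-width bins over the probability range 0 to 1
counts, bin_edges, _ = ax.hist(placeholder_predictions, bins=20, range=(0, 1))
ax.set_xlabel("Predicted value")
ax.set_ylabel("Number of molecules")
ax.set_title("Distribution of predictions (placeholder data)")
fig.tight_layout()
```

Inside the notebook the `matplotlib.use("Agg")` line can be dropped so the figure renders inline. For a regression model, the `range` argument would need to match that model's output scale.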