# Implementation and evaluation of a computational standardization pipeline for chemical compounds
## Based on ["Trust, But Verify: On the Importance of Chemical Structure Curation in Cheminformatics and QSAR Modeling Research" (D. Fourches, ..., 2010)](https://pubmed.ncbi.nlm.nih.gov/20572635/)


### Introduction 
This notebook showcases the functionality of the Standardization module, following the standardization steps recommended in "Trust, But Verify" (D. Fourches, ..., 2010).
It uses the dataset from the following paper: [Cheminformatics Analysis of Assertions Mined from Literature That Describe Drug-Induced Liver Injury in Different Species](https://pubs.acs.org/doi/10.1021/tx900326k)

<span style='color:red'> For all relative paths in this notebook to work, please make sure you start this notebook from the working directory ./opencadd/docs/tutorials/ </span><br>
Check your current directory with the cell below.

In [1]:
import os
os.getcwd()

'/home/allen/dev/opencadd/docs/tutorials'

In [12]:
# Import pandas and numpy
import pandas as pd
import numpy as np

# Import the modules and Standardization API functions needed
from rdkit import Chem
#from rdkit.Chem.PandasTools import RemoveSaltsFromFrame
from opencadd.compounds.standardization import convert_format, handle_fragments, disconnect_metals, detect_inorganic, remove_salts, normalize

In [3]:
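# Helper function to mark at which step an entry failed the standardization pipeline.
# The value passed in by .apply() is ignored; the function returns the current
# value of the global variable taskNum, which is set at the top of each task cell.
def failMarker(i):
    return taskNum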

### Initial dataset import and cleaning of empty entries
------------------------------------------------
Before the standardization steps start, the dataset is imported as a Pandas DataFrame that includes only the necessary columns. In this case we use the <b>IDs</b>, <b>Names</b> and <b>SMILEs</b> columns.<br>
Then we search for all entries that have no string stored under <b>SMILEs</b> and remove them from the dataset, since they hold no information.<br>
After the import we add a <b>Failed_at</b> column to track in which standardization step an entry failed. 
The initial `taskNum` is 0, which gives every entry a default <b>Failed_at</b> value of 0, where 0 stands for <i>not failed</i>. 

In [4]:
taskNum = 0

# Import the test dataset
dataset = pd.read_csv(r'./data/standardization_test_data.csv')

# Filter for the needed columns
dataset = dataset[['IDs','Names','SMILEs']]

# Remove all empty entries
empty_smiles = dataset[(dataset['SMILEs'].isnull())] 
# The empty_smiles dataframe can be used to check which entries are affected and to review the dataset again.
dataset = dataset[(dataset['SMILEs'].notna())]

# Set an initial value of 0 for all entries in the 'Failed_at' column
dataset['Failed_at'] = dataset['SMILEs'].apply(failMarker)


#Show the current form of the main-dataframe
dataset.tail()

Unnamed: 0,IDs,Names,SMILEs,Failed_at
198,199,Citalopram,CN(C)CCCC1(OCc2cc(ccc12)C#N)c1ccc(F)cc1,0
199,200,Citric acid,OC(=O)CC(O)(CC(O)=O)C(O)=O,0
200,201,zirconium,CCO[Zr](OCC)(OCC)OCC,0
201,202,hemoglobin,CC1=C(C2=CC3=NC(=CC4=C(C(=C([N-]4)C=C5C(=C(C(=...,0
202,203,test_salt,[Al].N.[Ba].[Bi].Br.[Ca].Cl.F.I.[K].[Li].[Mg]....,0


### Step 1: Conversion of SMILES to mol
------------------------------------------
### Convert the SMILES representation of the compounds into RDKit Mol objects

RDKit sanitizes a molecule by default. During sanitization RDKit tries to kekulize the molecule, that is, to assign alternating single and double bonds to its aromatic systems. This step can fail when a structure is marked as aromatic but a required hydrogen position is not given, for example an aromatic nitrogen written as `n` where `[nH]` is needed.

If the conversion from SMILES to mol fails, the affected entries get a <b>Failed_at</b> marker. 

To skip sanitization, `convert_smiles_to_mol` can be called with the argument `sanitize=False`. Keep in mind that sanitization, including kekulization, is what lets RDKit treat different written representations of the same molecule consistently; an unsanitized molecule skips those checks. 
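
The kekulization failure described above can be reproduced with a small sketch. It assumes that `convert_smiles_to_mol` behaves like RDKit's own `Chem.MolFromSmiles`, which is called directly here because the wrapper's internals are not shown in this notebook. The two pyrrole SMILES are illustrative inputs and are not taken from the dataset.

```python
from rdkit import Chem, RDLogger

# Silence the expected "Can't kekulize mol" message for this demonstration.
RDLogger.DisableLog("rdApp.error")

# Pyrrole written without the hydrogen on the nitrogen: RDKit cannot assign
# alternating single and double bonds to the aromatic ring.
ambiguous = "c1cccn1"
# The same ring with the hydrogen position given explicitly.
explicit = "c1ccc[nH]1"

mol_ambiguous = Chem.MolFromSmiles(ambiguous)  # sanitization fails, returns None
mol_explicit = Chem.MolFromSmiles(explicit)    # sanitization succeeds

# Skipping sanitization returns a Mol object, but nothing has been validated.
mol_raw = Chem.MolFromSmiles(ambiguous, sanitize=False)

print(mol_ambiguous is None, mol_explicit is not None, mol_raw is not None)
```

An unsanitized molecule such as `mol_raw` can still fail in later steps, so entries parsed this way are good candidates for manual review.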

References:<br>
https://chemistry.stackexchange.com/questions/116498/what-is-kekulization-in-rdkit<br>
https://rdkit-discuss.narkive.com/QwnqcKcM/another-can-t-kekulize-mol-observation<br>
https://www.rdkit.org/docs/Cookbook.html<br>
https://www.rdkit.org/docs/source/rdkit.Chem.rdmolfiles.html<br>


#### Task 1: Convert to Mol

In [5]:
# Setting up the taskNum
taskNum = 1

# A column called 'mol' is added to the dataframe to store the Mol objects
dataset['mol'] = dataset['SMILEs'].apply(convert_format.convert_smiles_to_mol)

# All entries for which no mol could be generated are moved to a separate dataframe
failed_step_1 = dataset[(dataset['mol'].isnull())]
failed_step_1['Failed_at'] = failed_step_1['Failed_at'].apply(failMarker)
failed_step_1 = failed_step_1[['IDs','Names','SMILEs','Failed_at']]

# Update the dataset by removing all entries without a mol
result1 = dataset[dataset['mol'].notna()]
result1.tail(16)


RDKit ERROR: [20:54:12] Can't kekulize mol.  Unkekulized atoms: 1 2 3 4 5 7 9
RDKit ERROR: 
RDKit ERROR: [20:54:12] Can't kekulize mol.  Unkekulized atoms: 2 3 4 6 7 8 10 11 12
RDKit ERROR: 
RDKit ERROR: [20:54:12] Can't kekulize mol.  Unkekulized atoms: 6 8 10
RDKit ERROR: 
RDKit ERROR: [20:54:12] Can't kekulize mol.  Unkekulized atoms: 7 8 9 10 11 12 13 14 15
RDKit ERROR: 
RDKit ERROR: [20:54:12] Can't kekulize mol.  Unkekulized atoms: 57 58 60
RDKit ERROR: 
RDKit ERROR: [20:54:12] Can't kekulize mol.  Unkekulized atoms: 14 15 16 17 18 19 20 21 23
RDKit ERROR: 
RDKit ERROR: [20:54:12] Can't kekulize mol.  Unkekulized atoms: 11 12 13 15 16 17 19 20 21
RDKit ERROR: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  failed_step_1['Failed_at'] = failed_step_1['Failed_at'].appl

Unnamed: 0,IDs,Names,SMILEs,Failed_at,mol
187,188,Chlortetracycline,CN(C)C1C2CC3C(C(=O)c4c(O)ccc(Cl)c4C3(C)O)=C(O)...,0,<rdkit.Chem.rdchem.Mol object at 0x7f1cec301030>
188,189,Chlorzoxazone,Oc1nc2cc(Cl)ccc2o1,0,<rdkit.Chem.rdchem.Mol object at 0x7f1cec301080>
189,190,Cholestyramine,CC(C)(Oc1ccc(cc1)C(=O)c1ccc(Cl)cc1)C(=O)NCCS(O...,0,<rdkit.Chem.rdchem.Mol object at 0x7f1cec3010d0>
190,191,Chondroitin sulfate,CC(=O)NC1C(O)OC(OS(O)(=O)=O)C(O)C1OC1OC(C(O)C(...,0,<rdkit.Chem.rdchem.Mol object at 0x7f1cec301120>
191,192,Cidofovir,NC1=NC(=O)N(CC(CO)OCP(O)(O)=O)C=C1,0,<rdkit.Chem.rdchem.Mol object at 0x7f1cec301170>
192,193,Cimetidine,CN=C(NCCSCc1nc[nH]c1C)NC#N,0,<rdkit.Chem.rdchem.Mol object at 0x7f1cec3011c0>
193,194,Cinchophen,OC(=O)c1cc(nc2ccccc12)-c1ccccc1,0,<rdkit.Chem.rdchem.Mol object at 0x7f1cec301210>
194,195,Cinoxacin,CCN1N=C(C(O)=O)C(=O)c2cc3OCOc3cc12,0,<rdkit.Chem.rdchem.Mol object at 0x7f1cec301260>
195,196,Ciprofibrate,CC(C)(Oc1ccc(cc1)C1CC1(Cl)Cl)C(O)=O,0,<rdkit.Chem.rdchem.Mol object at 0x7f1cec3012b0>
196,197,Ciprofloxacin,OC(=O)C1=CN(C2CC2)c2cc(N3CCNCC3)c(F)cc2C1=O,0,<rdkit.Chem.rdchem.Mol object at 0x7f1cec301300>
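
The `SettingWithCopyWarning` in the output above appears because `Failed_at` is assigned on a filtered slice of `dataset`. A minimal sketch of one way to avoid it is to take explicit copies of the passed and failed subsets before modifying them. The helper name `split_and_mark` and the toy dataframe are hypothetical and only serve as illustration.

```python
import pandas as pd


def split_and_mark(df, failed_mask, step):
    """Return (passed, failed) as independent copies.

    Rows selected by failed_mask get their 'Failed_at' value set to step.
    """
    failed = df.loc[failed_mask].copy()
    failed["Failed_at"] = step
    passed = df.loc[~failed_mask].copy()
    return passed, failed


# Toy data standing in for the real dataset (hypothetical values).
toy = pd.DataFrame(
    {"IDs": [1, 2, 3], "mol": ["m1", None, "m3"], "Failed_at": [0, 0, 0]}
)

passed, failed = split_and_mark(toy, toy["mol"].isnull(), step=1)
print(failed)
```

Because both results are copies, later column assignments on them neither raise the warning nor touch the original dataframe.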


### Step 2: Removal of Inorganics and Mixtures
--------------------------------------------------

Because molecular descriptors can generally be computed only for organic compounds, all inorganic compounds must be removed before the descriptors are calculated. (Chapter 2.1, Fourches 2010)

To flag and subsequently remove compounds containing inorganic components, we can use the function `detect_inorganic`. It returns `True` when it finds an inorganic component. We can run this flagging as a pre-processing step and discard the flagged compounds. 
"Inorganic compounds are known to have biological effects, like for example toxic effects." (Chapter 2.1, Fourches 2010; citation to be verified)
Because of this potential bioactivity, we cannot tell whether the recorded activity of a mixed compound is caused by its organic or its inorganic part. Such an entry is therefore not usable and can be discarded. Note: discarded entries should be logged so that manual curation remains possible.
An alternative and simple approach would be to run a substring search on every SMILES string and flag any match against a set of inorganic patterns (the pattern set still has to be defined).
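
A minimal sketch of such a substring search, using only the standard library. The pattern set is an assumption: it lists the bracketed metal symbols that occur in the `zirconium` and `test_salt` entries of the test dataset and would need to be extended for real use. The function name `flag_inorganic_smiles` is hypothetical and is not part of the Standardization module.

```python
import re

# Hypothetical pattern set: element symbols treated as markers of inorganic parts.
INORGANIC_SYMBOLS = ["Al", "Ba", "Bi", "Ca", "K", "Li", "Mg", "Na", "Ag", "Sr", "Zn", "Zr"]

# Matches a bracket atom such as "[Zr]", "[Na+]" or "[Ca+2]". The optional digits
# allow an isotope prefix; the lookahead keeps "K" from matching "Kr".
alternatives = "|".join(INORGANIC_SYMBOLS)
INORGANIC_PATTERN = re.compile(rf"\[\d*(?:{alternatives})(?![a-z])[^\]]*\]")


def flag_inorganic_smiles(smiles):
    """Return True if the SMILES string contains one of the listed symbols."""
    return bool(INORGANIC_PATTERN.search(smiles))


examples = [
    "CCO[Zr](OCC)(OCC)OCC",                            # zirconium entry
    "OC(=O)CC(O)(CC(O)=O)C(O)=O",                      # citric acid
    "CCC(CO)Nc1nc(NCc2ccccc2)c2ncn(C(C)C)c2n1.[Ca]",   # entry with a calcium fragment
]
for smiles in examples:
    print(flag_inorganic_smiles(smiles), smiles)
```

A string search like this runs before any Mol object exists, so it can also flag entries that later fail the SMILES conversion.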

Because the treatment of mixtures is not as simple as it appears, the paper (Fourches, 2010) recommends deleting records that contain mixtures. Note: these records can again be logged and curated manually. To ease the curation, various filtering functions could be implemented to help decide which records to keep and which to discard. Three types of mixtures are described. (Open question: check whether an implementation would be easy and fast.) A common and widely used practice is to retain the molecule with the highest molecular weight or the largest number of atoms (Chapter 2.1, Fourches 2010). The paper states, however, that this might not be the best solution, and that mixtures should only be investigated further if there is reason to believe that the biological activity is really caused by the largest molecule and not by the mixture itself.
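
The common practice mentioned above, keeping the component with the largest number of atoms, could be sketched with plain RDKit calls as follows. This is an illustration under assumptions, not the implementation behind `handle_fragments`: the helper names `is_mixture` and `largest_fragment` are hypothetical, and heavy-atom count is used as the size criterion.

```python
from rdkit import Chem


def is_mixture(mol):
    """A record is treated as a mixture if it has more than one fragment."""
    return len(Chem.GetMolFrags(mol)) > 1


def largest_fragment(mol):
    """Return the fragment with the most heavy atoms as its own Mol object."""
    fragments = Chem.GetMolFrags(mol, asMols=True)
    return max(fragments, key=lambda frag: frag.GetNumHeavyAtoms())


# SMILES of the 2-Deoxy-D-glucose entry in the test dataset (two components).
mol = Chem.MolFromSmiles("OCC1OC(O)CC(O)C1O.O1CCOCC1")

print(is_mixture(mol))
print(Chem.MolToSmiles(largest_fragment(mol)))
```

In line with the paper's caution, a record reduced this way should stay flagged so it can be reviewed manually.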

These actions could be performed before the entered SMILES are converted into Mol objects, since some of the described steps rely on string pattern searches.

#### Task 2: Filter entries with inorganic components

In [None]:
# Setting up the taskNum 
taskNum = 2
# getting the valid entries from the step before
dataset = result1

# Check for inorganic structures in the entries
dataset['Inorganics'] = dataset['mol'].apply(detect_inorganic)

# Filter the failed entries
failed_step_2 = dataset[dataset['Inorganics']== True]
failed_step_2['Failed_at'] = failed_step_2['Failed_at'].apply(failMarker)

# Save entries that passed
result2 = dataset[dataset['Inorganics']== False]
failed_step_2.tail(15)

#### Task 3: Filter entries containing fragments

In [None]:
# Setting up the taskNum 
taskNum = 3
# getting the valid entries from the step before
dataset = result2

# Perform remove_fragments on entries
dataset['mol_after'] = dataset['mol'].apply(handle_fragments.remove_fragments)

# Create Smiles for evaluation
dataset['smiles_before'] = dataset['mol'].apply(convert_format.convert_mol_to_smiles)

# Create new SMILEs from the current state for evaluation of performed changes 
dataset['Smiles 3'] = dataset['mol_after'].apply(convert_format.convert_mol_to_smiles)
dataset['noChanges']= dataset['smiles_before'] == dataset['Smiles 3']


# Filter the failed entries
failed_step_3 = dataset[dataset['noChanges']== False]
failed_step_3['Failed_at'] = failed_step_3['Failed_at'].apply(failMarker)

# Save entries that passed
result3 = dataset[dataset['noChanges']== True]
result3 = result3[['IDs','Names','SMILEs','Failed_at','mol']]
failed_step_3.tail()



#### Task 4: Filter entries containing metals

In [None]:
# Setting up the taskNum 
taskNum = 4
# getting the valid entries from the step before
dataset = result3[['IDs','Names','SMILEs','Failed_at','mol']] # Load the result1 subset instead to actually see this step take effect

# Create Smiles for evaluation
dataset['smiles_before'] = dataset['mol'].apply(convert_format.convert_mol_to_smiles)

# Perform disconnect_metals on entries
dataset['mol_after'] = dataset['mol'].apply(disconnect_metals)

# Create new SMILEs from the current state for evaluation of performed changes 
dataset['Smiles 4'] = dataset['mol_after'].apply(convert_format.convert_mol_to_smiles)
dataset['noChanges']= dataset['smiles_before'] == dataset['Smiles 4']

# Filter the failed entries
failed_step_4 = dataset[dataset['noChanges']== False]
failed_step_4['Failed_at'] = failed_step_4['Failed_at'].apply(failMarker)
failed_step_4.tail()

# Save entries that passed
result4 = dataset[dataset['noChanges']== True]
result4 = result4[['IDs','Names','SMILEs','Failed_at','mol']]
result4.tail()

#### Task 5: Filter inorganics again  
<span style='color:red'>This step does not make much sense here, since the entries have not actually been changed since the first inorganic filter</span>


In [None]:
# Setting up the taskNum 
taskNum = 5
# getting the valid entries from the step before
dataset = result4

# Check for inorganic structures in the entries
dataset['Inorganics2'] = dataset['mol'].apply(detect_inorganic)

# Filter the failed entries
failed_step_5 = dataset[dataset['Inorganics2']== True]
failed_step_5['Failed_at'] = failed_step_5['Failed_at'].apply(failMarker)

# Save entries that passed
result5 = dataset[dataset['Inorganics2']== False]
failed_step_5.tail(15)

### Step 3: Structural Conversion and Cleaning
--------------------------------------------------

Some drugs need to be transformed "into their salt form to enhance how the drug dissolves (...) and (to) increase its effectiveness" (https://www.drugs.com/article/pharmaceutical-salts.html (03/12/21)). It is therefore common for chemical compound databases to contain records of salts. If possible, it is recommended to delete records containing salts completely, since, as with inorganic compounds, "most descriptor-generating software (can not process salts)" (Fourches 2010, Chapter 2.2). While not desirable, it is still acceptable to convert such compounds into their neutral forms. Such cases should be tagged, filtered, and afterwards manually curated or compared to the actual neutral form of the compound. 
If we want to continue working on the converted records, we should perform the following steps:
- check whether records contain metals: a difficult case, filter these out (already done one step earlier)
- remove the salts from the record
- neutralize the record (normalization or basic standardization)
- neutralize the charges
- to be discussed: adding or removing hydrogens, both of which have pros and cons. Adding hydrogens may give higher prediction performance but may also introduce noise and thus less reliable models; removing hydrogens might introduce errors in descriptor calculation, because certain cases might not be handled well.
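
The salt removal, charge neutralization and hydrogen handling steps listed above can be sketched with plain RDKit calls. This is a sketch under assumptions and not the implementation behind `remove_salts` or `normalize`: it uses RDKit's `SaltRemover` with its default salt definitions and the `Uncharger` from `rdMolStandardize`, and sodium benzoate is an illustrative input that is not taken from the dataset.

```python
from rdkit import Chem
from rdkit.Chem.SaltRemover import SaltRemover
from rdkit.Chem.MolStandardize import rdMolStandardize

# Sodium benzoate as an illustrative salt record.
mol = Chem.MolFromSmiles("[Na+].[O-]C(=O)c1ccccc1")

# Remove counter-ions that match RDKit's default salt definitions.
stripped = SaltRemover().StripMol(mol)

# Neutralize the remaining charged fragment where possible.
neutral = rdMolStandardize.Uncharger().uncharge(stripped)

# Hydrogens can be made explicit or implicit again, depending on what the
# downstream descriptor calculation expects.
with_hydrogens = Chem.AddHs(neutral)
without_hydrogens = Chem.RemoveHs(with_hydrogens)

print(Chem.MolToSmiles(stripped))
print(Chem.MolToSmiles(neutral))
print(with_hydrogens.GetNumAtoms(), without_hydrogens.GetNumAtoms())
```

Comparing the SMILES before and after each call, as done for the fragment and metal steps, makes it possible to tag every record that was changed for later manual curation.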



In [None]:
# Structural conversion
# Cleaning/removal of salts
# Functions remove_salts
# normalize_molecules
# handle_charges
# handle_hydrogens

#### Task 6: Removing salts <span style='color:red'>currently fails</span>

In [11]:
# Setting up the taskNum 
taskNum = 6
# getting the valid entries from the step before
dataset = result1
#dataset.head(100)
dataset['removed_salts'] = dataset['mol'].apply(remove_salts)
dataset.head(20)

#dataset['removed_salts'] = RemoveSaltsFromFrame(dataset,molCol='mol')
#where_salt = dataset[dataset['removed_salts'].notna()]

RDKit ERROR: [20:55:21] ERROR: Empty structure (repeated 21 times)
RDKit ERROR: [20:55:53] ERROR: Empty structure (repeated 14 times)
RDKit ERROR: [20:56:10] ERROR: Empty structure (repeated 7 times)

Unnamed: 0,IDs,Names,SMILEs,Failed_at,mol,removed_salts
0,1,(R)-Roscovitine,CCC(CO)Nc1nc(NCc2ccccc2)c2ncn(C(C)C)c2n1.[Ca],0,<rdkit.Chem.rdchem.Mol object at 0x7f1cec2f1e90>,<rdkit.Chem.rdchem.Mol object at 0x7f1ce3fb7760>
1,2,17-Methyltestosterone,CC1(O)CCC2C3CCC4=CC(=O)CCC4(C)C3CCC12C,0,<rdkit.Chem.rdchem.Mol object at 0x7f1cec305620>,<rdkit.Chem.rdchem.Mol object at 0x7f1cec305620>
2,3,1-alpha-Hydroxycholecalciferol,CC(C)CCCC(C)C1CCC2C(CCCC12C)=CC=C1CC(O)CC(O)C1=C,0,<rdkit.Chem.rdchem.Mol object at 0x7f1cec305670>,<rdkit.Chem.rdchem.Mol object at 0x7f1cec305670>
3,4,"2,3-Dimercaptosuccinic acid",OC(=O)C(S)C(S)C(O)=O,0,<rdkit.Chem.rdchem.Mol object at 0x7f1cec3056c0>,<rdkit.Chem.rdchem.Mol object at 0x7f1cec3056c0>
4,5,"2,4,6-Trinitrotoluene",Cc1c(cc(cc1N(=O)=O)N(=O)=O)N(=O)=O,0,<rdkit.Chem.rdchem.Mol object at 0x7f1cec305710>,<rdkit.Chem.rdchem.Mol object at 0x7f1cec305710>
5,6,2-Deoxy-D-glucose,OCC1OC(O)CC(O)C1O.O1CCOCC1,0,<rdkit.Chem.rdchem.Mol object at 0x7f1cec305760>,<rdkit.Chem.rdchem.Mol object at 0x7f1cec305760>
6,7,2'-fluoro-5-methylarabinosyluracil,CC1=CN(C2OC(CO)C(O)C2F)C(=O)NC1=O,0,<rdkit.Chem.rdchem.Mol object at 0x7f1cec3057b0>,<rdkit.Chem.rdchem.Mol object at 0x7f1cec3057b0>
7,8,2-Methoxyestradiol,COc1cc2C3CCC4(C)C(O)CCC4C3CCc2cc1O,0,<rdkit.Chem.rdchem.Mol object at 0x7f1cec305800>,<rdkit.Chem.rdchem.Mol object at 0x7f1cec305800>
8,9,4-aminobenzoic acid,Nc1ccc(cc1)C(O)=O,0,<rdkit.Chem.rdchem.Mol object at 0x7f1cec305850>,<rdkit.Chem.rdchem.Mol object at 0x7f1cec305850>
9,10,4-Hydroxytamoxifen,CCC(c1ccccc1)=C(c1ccc(O)cc1)c1ccc(OCCN(C)C)cc1,0,<rdkit.Chem.rdchem.Mol object at 0x7f1cec3058a0>,<rdkit.Chem.rdchem.Mol object at 0x7f1cec3058a0>


In [None]:
test = '[Al].N.[Ba].[Bi].Br.[Ca].Cl.F.I.[K].[Li].[Mg].[Na].[Ag].[Sr].S.O.[Zn]'
mol = convert_format.convert_smiles_to_mol(test)
mol
salt_remover = remove_salts(mol)
salt_remover

#### Task 7: Normalize molecules

In [14]:
# Setting up the taskNum 
taskNum = 7
# getting the valid entries from the step before
dataset = result1
#dataset.head(100)
dataset['normalized'] = dataset['mol'].apply(normalize)
dataset.head()

Unnamed: 0,IDs,Names,SMILEs,Failed_at,mol,removed_salts,normalized
0,1,(R)-Roscovitine,CCC(CO)Nc1nc(NCc2ccccc2)c2ncn(C(C)C)c2n1.[Ca],0,<rdkit.Chem.rdchem.Mol object at 0x7f1cec2f1e90>,<rdkit.Chem.rdchem.Mol object at 0x7f1ce3fb7760>,<rdkit.Chem.rdchem.Mol object at 0x7f1ce3ec0ad0>
1,2,17-Methyltestosterone,CC1(O)CCC2C3CCC4=CC(=O)CCC4(C)C3CCC12C,0,<rdkit.Chem.rdchem.Mol object at 0x7f1cec305620>,<rdkit.Chem.rdchem.Mol object at 0x7f1cec305620>,<rdkit.Chem.rdchem.Mol object at 0x7f1ce3ec5080>
2,3,1-alpha-Hydroxycholecalciferol,CC(C)CCCC(C)C1CCC2C(CCCC12C)=CC=C1CC(O)CC(O)C1=C,0,<rdkit.Chem.rdchem.Mol object at 0x7f1cec305670>,<rdkit.Chem.rdchem.Mol object at 0x7f1cec305670>,<rdkit.Chem.rdchem.Mol object at 0x7f1ce3ec5440>
3,4,"2,3-Dimercaptosuccinic acid",OC(=O)C(S)C(S)C(O)=O,0,<rdkit.Chem.rdchem.Mol object at 0x7f1cec3056c0>,<rdkit.Chem.rdchem.Mol object at 0x7f1cec3056c0>,<rdkit.Chem.rdchem.Mol object at 0x7f1ce3ec5260>
4,5,"2,4,6-Trinitrotoluene",Cc1c(cc(cc1N(=O)=O)N(=O)=O)N(=O)=O,0,<rdkit.Chem.rdchem.Mol object at 0x7f1cec305710>,<rdkit.Chem.rdchem.Mol object at 0x7f1cec305710>,<rdkit.Chem.rdchem.Mol object at 0x7f1ce3ec5490>


#### Task 8: Charges and Hydrogens

### Normalization of Specific Chemotypes

This step is more complex than the general normalization performed in Task 7.

In [None]:
# Normalization of specific chemotypes
# normalize_molecules

In [None]:
# Treatment of tautomeric forms
# handle_tautomers
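
As a placeholder for this step, the following sketch shows one possible way to bring tautomers into a common form with RDKit's `TautomerEnumerator`. The helper name `canonical_tautomer_smiles` is hypothetical, and the two SMILES (2-hydroxypyridine and 2-pyridone) are illustrative inputs that are not taken from the dataset. How a `handle_tautomers` function of the Standardization module would behave is not shown in this notebook.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

enumerator = rdMolStandardize.TautomerEnumerator()


def canonical_tautomer_smiles(smiles):
    """Return the SMILES of the canonical tautomer chosen by RDKit."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(enumerator.Canonicalize(mol))


# Two tautomeric forms of the same compound.
hydroxy_form = canonical_tautomer_smiles("Oc1ccccn1")
oxo_form = canonical_tautomer_smiles("O=c1cccc[nH]1")

print(hydroxy_form, oxo_form, hydroxy_form == oxo_form)
```

The canonical tautomer is a consistent choice for comparison, not necessarily the form that dominates under experimental conditions.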

### Removal of duplicates

In [None]:
# Analysis/removal of duplicates
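
One possible approach for this still empty step is to compare canonical SMILES, sketched below. The toy records, the `canonical` and `is_duplicate` column names, and the choice to keep the first occurrence are all assumptions made for illustration.

```python
import pandas as pd
from rdkit import Chem

# Toy records: the first two rows describe the same molecule with different SMILES.
records = pd.DataFrame(
    {
        "Names": ["ethanol", "ethyl alcohol", "citric acid"],
        "SMILEs": ["CCO", "OCC", "OC(=O)CC(O)(CC(O)=O)C(O)=O"],
    }
)

# Canonical SMILES give every molecule one unique string representation.
records["canonical"] = records["SMILEs"].apply(
    lambda smiles: Chem.MolToSmiles(Chem.MolFromSmiles(smiles))
)

# Flag every repeated structure; the first occurrence is kept.
records["is_duplicate"] = records.duplicated(subset="canonical", keep="first")
unique_records = records[~records["is_duplicate"]].copy()

print(records[["Names", "canonical", "is_duplicate"]])
```

Flagging duplicates before dropping them keeps them available for comparing their recorded activities during manual inspection.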

In [None]:
# Manual inspection

In [None]:
# Concatenation of the failed entries at the end
test = pd.concat([failed_step_1,failed_step_2])
test = test.sort_values(by=['IDs'])
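
The concatenated failures could additionally be summarized per step. The sketch below uses small stand-in dataframes with hypothetical values instead of the real `failed_step_*` dataframes, so it can be run on its own.

```python
import pandas as pd

# Stand-ins for failed_step_1 and failed_step_2 (hypothetical values).
failed_a = pd.DataFrame({"IDs": [5, 2], "Failed_at": [1, 1]})
failed_b = pd.DataFrame({"IDs": [3], "Failed_at": [2]})

# Combine all failures and restore the original order of the entries.
all_failed = pd.concat([failed_a, failed_b], ignore_index=True).sort_values(by="IDs")

# Count how many entries failed at each step of the pipeline.
failures_per_step = all_failed["Failed_at"].value_counts().sort_index()

print(all_failed)
print(failures_per_step)
```

Such a summary shows at a glance which standardization step removes the most entries and therefore deserves the closest manual review.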
