# Implementation and evaluation of a computational standardization pipeline for chemical compounds
## Based on ["Trust, But Verify: On the Importance of Chemical Structure Curation in Cheminformatics and QSAR Modeling Research" (D. Fourches et al., 2010)](https://pubmed.ncbi.nlm.nih.gov/20572635/)


### Introduction 
This notebook showcases the functionality of the Standardization module, following the standardization steps recommended in "Trust, But Verify" (D. Fourches et al., 2010).
It uses the dataset from the following paper: [Cheminformatics Analysis of Assertions Mined from Literature That Describe Drug-Induced Liver Injury in Different Species](https://pubs.acs.org/doi/10.1021/tx900326k)

<span style='color:red'> For all relative paths in this notebook to work, please make sure you are starting this notebook from the working directory ./opencadd/docs/tutorials/  </span><br>
Check your directory with the cell below.

In [69]:
import os
os.getcwd()

'/home/allen/dev/opencadd/docs/tutorials'

In [70]:
#Import pandas and numpy
import pandas as pd
import numpy as np

#import modules and Standardization API functions needed
from rdkit import Chem
from opencadd.compounds.standardization import convert_format,handle_fragments,disconnect_metals,detect_inorganic,remove_salts

In [71]:
# Helper function to mark at which step an entry failed the standardization pipeline.
# It ignores the value passed in by .apply() and returns the current global taskNum.
def failMarker(_):
    return taskNum

### Initial dataset import and cleaning of empty entries
------------------------------------------------
Before the standardization steps start, the dataset is imported as a pandas DataFrame that includes only the necessary columns. In this case we use the <b>IDs</b>, <b>Names</b> and <b>SMILEs</b> columns.<br>
Then we search for all entries that have no string stored under <b>SMILEs</b> and remove them from the dataset, since they hold no structural information.<br>
After the import we add a <b>Failed_at</b> column to track the standardization step at which an entry failed. 
The initial `taskNum` is 0, which gives every entry a default <b>Failed_at</b> value of 0, where 0 stands for <i>not failed</i>. 

In [86]:
taskNum = 0

# Importing the test-dataset 
dataset = pd.read_csv (r'./data/standardization_test_data.csv')

#Filter for needed columns
dataset = dataset[['IDs','Names','SMILEs']]

#Remove all empty entries
empty_smiles = dataset[(dataset['SMILEs'].isnull())] 
# The empty_smiles dataframe can be used to check which entries are affected and to review the dataset again.
dataset = dataset[(dataset['SMILEs'].notna())]

#Setting an initial score of 0 for all entries in the 'Failed_at'-column
dataset['Failed_at'] = dataset['SMILEs'].apply(failMarker)


#Show the current form of the main-dataframe
dataset.head()

Unnamed: 0,IDs,Names,SMILEs,Failed_at
0,1,(R)-Roscovitine,CCC(CO)Nc1nc(NCc2ccccc2)c2ncn(C(C)C)c2n1.[Ca],0
1,2,17-Methyltestosterone,CC1(O)CCC2C3CCC4=CC(=O)CCC4(C)C3CCC12C,0
2,3,1-alpha-Hydroxycholecalciferol,CC(C)CCCC(C)C1CCC2C(CCCC12C)=CC=C1CC(O)CC(O)C1=C,0
3,4,"2,3-Dimercaptosuccinic acid",OC(=O)C(S)C(S)C(O)=O,0
4,5,"2,4,6-Trinitrotoluene",Cc1c(cc(cc1N(=O)=O)N(=O)=O)N(=O)=O,0


### Step 1: Conversion of SMILEs to mol
------------------------------------------
### Convert the SMILES representation of the compounds into RDKit Mol objects

RDKit sanitizes the molecule by default. As part of sanitization, RDKit tries to kekulize the molecule, i.e. to assign alternating single and double bonds to its aromatic rings. This step can fail when a structure is written as aromatic but the hydrogen position on an aromatic heteroatom (for example a ring nitrogen) is not specified, because no valid bond assignment can then be found.

If the conversion from SMILES to mol fails, the affected entries get a <b>Failed_at</b> marker. 

To skip sanitization, `convert_smiles_to_mol` can be called with the argument `sanitize=False`. Keep in mind that kekulization exists to resolve the aromatic notation into one explicit Lewis structure of the same molecule; a molecule that skips it has not been checked for chemical validity. 
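
The kekulization failure described above can be reproduced with plain RDKit calls. The sketch below is a minimal illustration and not the module's implementation; the two pyrrole SMILES are examples chosen here and are not taken from the dataset.

```python
from rdkit import Chem, RDLogger

# Silence RDKit's log so the expected failure does not clutter the output
RDLogger.DisableLog("rdApp.*")

# Aromatic five-membered ring with a nitrogen whose hydrogen is not stated (example input)
ambiguous = "c1cccn1"
# Same ring with the hydrogen position given explicitly
explicit = "c1ccc[nH]1"

# Default behaviour: sanitization (including kekulization) runs, fails, and None is returned
mol_failed = Chem.MolFromSmiles(ambiguous)

# With the hydrogen stated, kekulization succeeds
mol_ok = Chem.MolFromSmiles(explicit)

# Skipping sanitization returns a Mol object, but it has not been validated
mol_raw = Chem.MolFromSmiles(ambiguous, sanitize=False)

print(mol_failed is None, mol_ok is not None, mol_raw is not None)
```

An entry that only parses with `sanitize=False` is probably better flagged for manual review than passed on silently.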

References:<br>
https://chemistry.stackexchange.com/questions/116498/what-is-kekulization-in-rdkit<br>
https://rdkit-discuss.narkive.com/QwnqcKcM/another-can-t-kekulize-mol-observation<br>
https://www.rdkit.org/docs/Cookbook.html<br>
https://www.rdkit.org/docs/source/rdkit.Chem.rdmolfiles.html<br>


In [87]:
# Setting up the taskNum
taskNum = 1

# A column called mol is added to the dataframe to store the Mol objects
dataset['mol'] = dataset['SMILEs'].apply(convert_format.convert_smiles_to_mol)

# All entries for which no mol could be generated are filtered into another dataframe
failed_step_1 = dataset[(dataset['mol'].isnull())]
failed_step_1['Failed_at'] = failed_step_1['Failed_at'].apply(failMarker)
failed_step_1 = failed_step_1[['IDs','Names','SMILEs','Failed_at']]

# Update the dataset by removing all entries without a mol
result1 = dataset[dataset['mol'].notna()]
result1.tail(16)


RDKit ERROR: 
RDKit ERROR: [11:00:01] Can't kekulize mol.  Unkekulized atoms: 2 3 4 6 7 8 10 11 12
RDKit ERROR: 
RDKit ERROR: [11:00:01] Can't kekulize mol.  Unkekulized atoms: 6 8 10
RDKit ERROR: 
RDKit ERROR: [11:00:01] Can't kekulize mol.  Unkekulized atoms: 7 8 9 10 11 12 13 14 15
RDKit ERROR: 
RDKit ERROR: [11:00:01] Can't kekulize mol.  Unkekulized atoms: 57 58 60
RDKit ERROR: 
RDKit ERROR: [11:00:01] Can't kekulize mol.  Unkekulized atoms: 14 15 16 17 18 19 20 21 23
RDKit ERROR: 
RDKit ERROR: [11:00:01] Can't kekulize mol.  Unkekulized atoms: 11 12 13 15 16 17 19 20 21
RDKit ERROR: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  failed_step_1['Failed_at'] = failed_step_1['Failed_at'].apply(failMarker)


Unnamed: 0,IDs,Names,SMILEs,Failed_at,mol
185,186,Chlorpromazine,CN(C)CCCN1c2ccccc2Sc2ccc(Cl)cc12,0,<rdkit.Chem.rdchem.Mol object at 0x7f43810cd990>
186,187,Chlorpropamide,CCCNC(=O)NS(=O)(=O)c1ccc(Cl)cc1,0,<rdkit.Chem.rdchem.Mol object at 0x7f4381050b20>
187,188,Chlortetracycline,CN(C)C1C2CC3C(C(=O)c4c(O)ccc(Cl)c4C3(C)O)=C(O)...,0,<rdkit.Chem.rdchem.Mol object at 0x7f4381050b70>
188,189,Chlorzoxazone,Oc1nc2cc(Cl)ccc2o1,0,<rdkit.Chem.rdchem.Mol object at 0x7f4381050bc0>
189,190,Cholestyramine,CC(C)(Oc1ccc(cc1)C(=O)c1ccc(Cl)cc1)C(=O)NCCS(O...,0,<rdkit.Chem.rdchem.Mol object at 0x7f4381050c10>
190,191,Chondroitin sulfate,CC(=O)NC1C(O)OC(OS(O)(=O)=O)C(O)C1OC1OC(C(O)C(...,0,<rdkit.Chem.rdchem.Mol object at 0x7f4381050c60>
191,192,Cidofovir,NC1=NC(=O)N(CC(CO)OCP(O)(O)=O)C=C1,0,<rdkit.Chem.rdchem.Mol object at 0x7f4381050cb0>
192,193,Cimetidine,CN=C(NCCSCc1nc[nH]c1C)NC#N,0,<rdkit.Chem.rdchem.Mol object at 0x7f4381050d00>
193,194,Cinchophen,OC(=O)c1cc(nc2ccccc12)-c1ccccc1,0,<rdkit.Chem.rdchem.Mol object at 0x7f4381050d50>
194,195,Cinoxacin,CCN1N=C(C(O)=O)C(=O)c2cc3OCOc3cc12,0,<rdkit.Chem.rdchem.Mol object at 0x7f4381050da0>
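
The `SettingWithCopyWarning` in the output above appears because `Failed_at` is assigned on a filtered subset of the dataframe. One common way to avoid it is to take an explicit copy of the subset before assigning. The sketch below uses a small made-up frame; the `"<mol>"` strings and the variable `step` are stand-ins for the real Mol objects and `taskNum`.

```python
import pandas as pd

# Toy frame standing in for the notebook's dataset (values are illustrative)
toy = pd.DataFrame({
    "IDs": [1, 2, 3],
    "SMILEs": ["CCO", "c1cccn1", "CC(=O)O"],
    "mol": ["<mol>", None, "<mol>"],  # None marks a failed conversion
    "Failed_at": [0, 0, 0],
})

step = 1  # plays the role of taskNum

# .copy() makes the subset an independent frame, so assigning to it is unambiguous
failed = toy[toy["mol"].isnull()].copy()
failed["Failed_at"] = step

# Entries that converted successfully continue to the next step
passed = toy[toy["mol"].notna()].copy()

print(failed[["IDs", "Failed_at"]])
print(passed["IDs"].tolist())
```

The original frame keeps its `Failed_at` values of 0, so the failed subsets can still be concatenated at the end without double counting.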


### Step 2: Removal of Inorganics and Mixtures
--------------------------------------------------

Since molecular descriptors can only be computed for organic compounds, all inorganic compounds must be removed before the descriptors are calculated (Fourches 2010, Chapter 2.1).

To flag and then remove compounds that contain inorganic molecules, we can use the function `detect_inorganic`. It returns `True` when it finds an inorganic molecule. We can run this flagging as a pre-processing step and discard the flagged compounds. 
"Inorganic compounds are known to have biological effects, like for example toxic effects." (Fourches 2010, Chapter 2.1; TODO: verify citation)
Because of this potential bioactivity, we cannot tell whether the recorded activity of a mixed compound is caused by its organic or its inorganic part. Such an entry is therefore not usable and can be discarded. Note: these removals should be logged so that manual curation remains possible.
An easy alternative is to run a substring search on every SMILES string and flag each match against a set of inorganic compound patterns (this pattern set still has to be defined).
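
The substring search mentioned above could look roughly like the following sketch. It uses only the standard library. The element set `INORGANIC_ELEMENTS` and the function name `flag_inorganic` are assumptions made for illustration; this is not how `detect_inorganic` is implemented, and a real pattern set would have to be defined carefully.

```python
import re

# Hypothetical pattern set: element symbols treated as inorganic markers.
# This short list is an assumption for illustration only.
INORGANIC_ELEMENTS = {"Ca", "Zr", "Na", "K", "Fe", "Zn", "Pt"}

# Matches a bracket atom such as [Ca], [Na+] or [Zr+4] and captures the element symbol.
# Aromatic bracket atoms such as [nH] start with a lowercase letter and are not matched.
BRACKET_ATOM = re.compile(r"\[\d*([A-Z][a-z]?)[^\]]*\]")

def flag_inorganic(smiles):
    """Return the marker elements found in the bracket atoms of a SMILES string."""
    found = {match.group(1) for match in BRACKET_ATOM.finditer(smiles)}
    return sorted(found & INORGANIC_ELEMENTS)

# Two SMILES strings taken from the dataset shown in this notebook
roscovitine = "CCC(CO)Nc1nc(NCc2ccccc2)c2ncn(C(C)C)c2n1.[Ca]"
cimetidine = "CN=C(NCCSCc1nc[nH]c1C)NC#N"

print(flag_inorganic(roscovitine))
print(flag_inorganic(cimetidine))
```

A string search like this runs before any Mol conversion, so it also covers entries that later fail kekulization, but it only sees what is written in the SMILES and cannot judge chemistry.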

Because treating mixtures is not as simple as it appears, the paper (Fourches, 2010) recommends deleting records that contain mixtures. Note: this removal can also be logged, and the removed set can be curated manually. To ease curation, various filtering functions can be implemented to help decide which records to keep and which to discard. The paper describes three types of mixtures (TODO: check whether an implementation would be easy and fast). A common and widely used practice is to retain the molecule with the highest molecular weight or the largest number of atoms (Fourches 2010, Chapter 2.1). However, the paper states that this might not be the best solution, and that mixtures should only be investigated further if there is reason to believe that the biological activity is really caused by the largest molecule and not by the mixture itself.

These actions might be performed before the entered SMILES are converted into Mol objects, since some of the described steps rely on string pattern searches.
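
As one example of such a string-level check, multi-component records can be flagged before any conversion, because SMILES separates disconnected components with a dot. The helper names below are hypothetical and the sketch is deliberately simple: it flags salts and mixtures alike and does not decide which component to keep.

```python
def split_components(smiles):
    """Split a SMILES string into its dot-separated components."""
    return [part for part in smiles.split(".") if part]

def is_multi_component(smiles):
    """Return True when the SMILES string describes more than one component."""
    return len(split_components(smiles)) > 1

# Two SMILES strings taken from the dataset shown in this notebook
with_calcium = "CCC(CO)Nc1nc(NCc2ccccc2)c2ncn(C(C)C)c2n1.[Ca]"
single = "OC(=O)C(S)C(S)C(O)=O"

print(split_components(with_calcium))
print(is_multi_component(single))
```

Flagged records can be written to a separate dataframe for logging and manual curation, in line with the notes above.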

In [84]:
# Setting up the taskNum 
taskNum = 2
# getting the valid entries from the step before
dataset = result1

# Check for inorganic structures in the entries
dataset['Inorganics'] = dataset['mol'].apply(detect_inorganic)

# Filter the failed entries
failed_step_2 = dataset[dataset['Inorganics']== True]
failed_step_2['Failed_at'] = failed_step_2['Failed_at'].apply(failMarker)

result2 = dataset[dataset['Inorganics']== False]
result2.tail(15)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  failed_step_2['Failed_at'] = failed_step_2['Failed_at'].apply(failMarker)


Unnamed: 0,IDs,Names,SMILEs,Failed_at,mol,Inorganics
185,186,Chlorpromazine,CN(C)CCCN1c2ccccc2Sc2ccc(Cl)cc12,0,<rdkit.Chem.rdchem.Mol object at 0x7f4381137210>,False
186,187,Chlorpropamide,CCCNC(=O)NS(=O)(=O)c1ccc(Cl)cc1,0,<rdkit.Chem.rdchem.Mol object at 0x7f4381137260>,False
187,188,Chlortetracycline,CN(C)C1C2CC3C(C(=O)c4c(O)ccc(Cl)c4C3(C)O)=C(O)...,0,<rdkit.Chem.rdchem.Mol object at 0x7f43811372b0>,False
188,189,Chlorzoxazone,Oc1nc2cc(Cl)ccc2o1,0,<rdkit.Chem.rdchem.Mol object at 0x7f4381137300>,False
189,190,Cholestyramine,CC(C)(Oc1ccc(cc1)C(=O)c1ccc(Cl)cc1)C(=O)NCCS(O...,0,<rdkit.Chem.rdchem.Mol object at 0x7f4381137350>,False
190,191,Chondroitin sulfate,CC(=O)NC1C(O)OC(OS(O)(=O)=O)C(O)C1OC1OC(C(O)C(...,0,<rdkit.Chem.rdchem.Mol object at 0x7f43811373a0>,False
191,192,Cidofovir,NC1=NC(=O)N(CC(CO)OCP(O)(O)=O)C=C1,0,<rdkit.Chem.rdchem.Mol object at 0x7f43811373f0>,False
192,193,Cimetidine,CN=C(NCCSCc1nc[nH]c1C)NC#N,0,<rdkit.Chem.rdchem.Mol object at 0x7f4381137440>,False
193,194,Cinchophen,OC(=O)c1cc(nc2ccccc12)-c1ccccc1,0,<rdkit.Chem.rdchem.Mol object at 0x7f4381137490>,False
194,195,Cinoxacin,CCN1N=C(C(O)=O)C(=O)c2cc3OCOc3cc12,0,<rdkit.Chem.rdchem.Mol object at 0x7f43811374e0>,False


In [None]:
#  Removal of mixtures, inorganics (and eventually organometallics)
# Functions detect_inorganic,remove_fragments, disconnect_metals, detect_inorganic again

In [90]:
# Setting up the taskNum 
taskNum = 3
# getting the valid entries from the step before
dataset = result2

# Create InchIs for evaluation
dataset['InchI_before'] = dataset['mol'].apply(convert_format.convert_mol_to_inchi)

# Perform remove_fragments on entries
dataset['mol_after'] = dataset['mol'].apply(handle_fragments.remove_fragments)

# Create new InchI from the current state for evaluation of performed changes 
dataset['InchI_after'] = dataset['mol_after'].apply(convert_format.convert_mol_to_inchi)
dataset['noChanges_inchi']= dataset['InchI_before'] == dataset['InchI_after']

# Create Smiles for evaluation
dataset['smiles_before'] = dataset['mol'].apply(convert_format.convert_mol_to_smiles)

# Create new SMILEs from the current state for evaluation of performed changes 
dataset['Smiles 3'] = dataset['mol_after'].apply(convert_format.convert_mol_to_smiles)
dataset['noChanges']= dataset['smiles_before'] == dataset['Smiles 3']


# Filter the failed entries
failed_step_3 = dataset[dataset['noChanges_inchi']== False]
failed_step_3['Failed_at'] = failed_step_3['Failed_at'].apply(failMarker)

result3 = dataset[dataset['noChanges_inchi']== True]
#result3 = result3[['IDs','Names','SMILEs','Failed_at']]
result3.head(50)

#TODO: Subset of Mixtures, deletion from main set

RDKit INFO: [1 MetalDisconnector
RDKit INFO: [11:01:16] Running MetalDisconnector
RDKit INFO: [11:01:16] Initializing MetalDisconnector
RDKit INFO: [11:01:16] Running MetalDisconnector
RDKit INFO: [11:01:16] Initializing MetalDisconnector
RDKit INFO: [11:01:16] Running MetalDisconnector
RDKit INFO: [11:01:16] Initializing MetalDisconnector
RDKit INFO: [11:01:16] Running MetalDisconnector
RDKit INFO: [11:01:16] Initializing MetalDisconnector
RDKit INFO: [11:01:16] Running MetalDisconnector
RDKit INFO: [11:01:16] Removed covalent bond between Zr and O
RDKit INFO: [11:01:16] Removed covalent bond between Zr and O
RDKit INFO: [11:01:16] Removed covalent bond between Zr and O
RDKit INFO: [11:01:16] Removed covalent bond between Zr and O
RDKit INFO: [11:09:03] Running FragmentRemover
RDKit INFO: [11:09:03] Running FragmentRemover
RDKit INFO: [11:09:03] Running FragmentRemover
RDKit INFO: [11:09:03] Running FragmentRemover
RDKit INFO: [11:09:03] Running FragmentRemover
RDKit INFO: [11:09:03] 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  failed_step_3['Failed_at'] = failed_step_3['Failed_at'].apply(failMarker)


Unnamed: 0,IDs,Names,SMILEs,Failed_at,mol,Inorganics,InchI_before,mol_after,InchI_after,noChanges_inchi,smiles_before,Smiles 3,noChanges
1,2,17-Methyltestosterone,CC1(O)CCC2C3CCC4=CC(=O)CCC4(C)C3CCC12C,0,<rdkit.Chem.rdchem.Mol object at 0x7f43810f6030>,False,InChI=1S/C20H30O2/c1-18-9-6-14(21)12-13(18)4-5...,<rdkit.Chem.rdchem.Mol object at 0x7f438107f350>,InChI=1S/C20H30O2/c1-18-9-6-14(21)12-13(18)4-5...,True,CC12CCC(=O)C=C1CCC1C2CCC2(C)C1CCC2(C)O,CC12CCC(=O)C=C1CCC1C2CCC2(C)C1CCC2(C)O,True
2,3,1-alpha-Hydroxycholecalciferol,CC(C)CCCC(C)C1CCC2C(CCCC12C)=CC=C1CC(O)CC(O)C1=C,0,<rdkit.Chem.rdchem.Mol object at 0x7f43810f6120>,False,InChI=1S/C27H44O2/c1-18(2)8-6-9-19(3)24-13-14-...,<rdkit.Chem.rdchem.Mol object at 0x7f43810b35d0>,InChI=1S/C27H44O2/c1-18(2)8-6-9-19(3)24-13-14-...,True,C=C1C(=CC=C2CCCC3(C)C2CCC3C(C)CCCC(C)C)CC(O)CC1O,C=C1C(=CC=C2CCCC3(C)C2CCC3C(C)CCCC(C)C)CC(O)CC1O,True
3,4,"2,3-Dimercaptosuccinic acid",OC(=O)C(S)C(S)C(O)=O,0,<rdkit.Chem.rdchem.Mol object at 0x7f43810def80>,False,"InChI=1S/C4H6O4S2/c5-3(6)1(9)2(10)4(7)8/h1-2,9...",<rdkit.Chem.rdchem.Mol object at 0x7f43810b32b0>,"InChI=1S/C4H6O4S2/c5-3(6)1(9)2(10)4(7)8/h1-2,9...",True,O=C(O)C(S)C(S)C(=O)O,O=C(O)C(S)C(S)C(=O)O,True
4,5,"2,4,6-Trinitrotoluene",Cc1c(cc(cc1N(=O)=O)N(=O)=O)N(=O)=O,0,<rdkit.Chem.rdchem.Mol object at 0x7f43810def30>,False,InChI=1S/C7H5N3O6/c1-4-6(9(13)14)2-5(8(11)12)3...,<rdkit.Chem.rdchem.Mol object at 0x7f4381015620>,InChI=1S/C7H5N3O6/c1-4-6(9(13)14)2-5(8(11)12)3...,True,Cc1c([N+](=O)[O-])cc([N+](=O)[O-])cc1[N+](=O)[O-],Cc1c([N+](=O)[O-])cc([N+](=O)[O-])cc1[N+](=O)[O-],True
6,7,2'-fluoro-5-methylarabinosyluracil,CC1=CN(C2OC(CO)C(O)C2F)C(=O)NC1=O,0,<rdkit.Chem.rdchem.Mol object at 0x7f43810dee90>,False,InChI=1S/C10H13FN2O5/c1-4-2-13(10(17)12-8(4)16...,<rdkit.Chem.rdchem.Mol object at 0x7f4381015120>,InChI=1S/C10H13FN2O5/c1-4-2-13(10(17)12-8(4)16...,True,Cc1cn(C2OC(CO)C(O)C2F)c(=O)[nH]c1=O,Cc1cn(C2OC(CO)C(O)C2F)c(=O)[nH]c1=O,True
7,8,2-Methoxyestradiol,COc1cc2C3CCC4(C)C(O)CCC4C3CCc2cc1O,0,<rdkit.Chem.rdchem.Mol object at 0x7f43810dedf0>,False,InChI=1S/C19H26O3/c1-19-8-7-12-13(15(19)5-6-18...,<rdkit.Chem.rdchem.Mol object at 0x7f4381015210>,InChI=1S/C19H26O3/c1-19-8-7-12-13(15(19)5-6-18...,True,COc1cc2c(cc1O)CCC1C2CCC2(C)C(O)CCC12,COc1cc2c(cc1O)CCC1C2CCC2(C)C(O)CCC12,True
8,9,4-aminobenzoic acid,Nc1ccc(cc1)C(O)=O,0,<rdkit.Chem.rdchem.Mol object at 0x7f43810dee40>,False,InChI=1S/C7H7NO2/c8-6-3-1-5(2-4-6)7(9)10/h1-4H...,<rdkit.Chem.rdchem.Mol object at 0x7f4381015030>,InChI=1S/C7H7NO2/c8-6-3-1-5(2-4-6)7(9)10/h1-4H...,True,Nc1ccc(C(=O)O)cc1,Nc1ccc(C(=O)O)cc1,True
9,10,4-Hydroxytamoxifen,CCC(c1ccccc1)=C(c1ccc(O)cc1)c1ccc(OCCN(C)C)cc1,0,<rdkit.Chem.rdchem.Mol object at 0x7f43810deda0>,False,InChI=1S/C26H29NO2/c1-4-25(20-8-6-5-7-9-20)26(...,<rdkit.Chem.rdchem.Mol object at 0x7f4381015080>,InChI=1S/C26H29NO2/c1-4-25(20-8-6-5-7-9-20)26(...,True,CCC(=C(c1ccc(O)cc1)c1ccc(OCCN(C)C)cc1)c1ccccc1,CCC(=C(c1ccc(O)cc1)c1ccc(OCCN(C)C)cc1)c1ccccc1,True
10,11,5 fluorouracil,FC1=CNC(=O)NC1=O,0,<rdkit.Chem.rdchem.Mol object at 0x7f43810ded50>,False,"InChI=1S/C4H3FN2O2/c5-2-1-6-4(9)7-3(2)8/h1H,(H...",<rdkit.Chem.rdchem.Mol object at 0x7f4381015170>,"InChI=1S/C4H3FN2O2/c5-2-1-6-4(9)7-3(2)8/h1H,(H...",True,O=c1[nH]cc(F)c(=O)[nH]1,O=c1[nH]cc(F)c(=O)[nH]1,True
11,12,5-Azacitidine,NC1=NC(=O)N(C=N1)C1OC(CO)C(O)C1O,0,<rdkit.Chem.rdchem.Mol object at 0x7f43810de710>,False,InChI=1S/C8H12N4O5/c9-7-10-2-12(8(16)11-7)6-5(...,<rdkit.Chem.rdchem.Mol object at 0x7f4381015490>,InChI=1S/C8H12N4O5/c9-7-10-2-12(8(16)11-7)6-5(...,True,Nc1ncn(C2OC(CO)C(O)C2O)c(=O)n1,Nc1ncn(C2OC(CO)C(O)C2O)c(=O)n1,True


In [91]:
# Setting up the taskNum 
taskNum = 4
# Note: this step takes the entries from Step 1 (result1), not from the previous step (result3)
dataset = result1[['IDs','Names','SMILEs','Failed_at','mol']]

# Create InchIs for evaluation
dataset['InchI_before'] = dataset['mol'].apply(convert_format.convert_mol_to_inchi)

# Perform disconnect_metals on entries
dataset['mol_after'] = dataset['mol'].apply(disconnect_metals)

# Create new InchI from the current state for evaluation of performed changes 
dataset['InchI_after'] = dataset['mol_after'].apply(convert_format.convert_mol_to_inchi)
dataset['noChanges_inchi']= dataset['InchI_before'] == dataset['InchI_after']

# Create Smiles for evaluation
dataset['smiles_before'] = dataset['mol'].apply(convert_format.convert_mol_to_smiles)

# Create new SMILEs from the current state for evaluation of performed changes 
dataset['Smiles 3'] = dataset['mol_after'].apply(convert_format.convert_mol_to_smiles)
dataset['noChanges']= dataset['smiles_before'] == dataset['Smiles 3']

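# Filter the failed entries
# Note: this filter selects rows whose InChI is unchanged; for the zirconium entry
# only the SMILES-based 'noChanges' column registers the disconnection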
failed_step_4 = dataset[dataset['noChanges_inchi']== True]
failed_step_4['Failed_at'] = failed_step_4['Failed_at'].apply(failMarker)
failed_step_4.tail()


RDKit INFO: [11:09:03] Running FragmentRemover
RDKit INFO: [11:09:03] Running FragmentRemover
RDKit INFO: [11:09:03] Running FragmentRemover
RDKit INFO: [11:09:03] Running FragmentRemover
RDKit INFO: [11:09:03] Running FragmentRemover
RDKit INFO: [11:09:03] Running FragmentRemover
RDKit INFO: [11:09:03] Running FragmentRemover
RDKit INFO: [11:09:03] Running FragmentRemover
RDKit INFO: [11:09:03] Running FragmentRemover
RDKit INFO: [11:09:03] Running FragmentRemover
RDKit INFO: [11:09:03] Running FragmentRemover
RDKit INFO: [11:09:03] Running FragmentRemover
RDKit INFO: [11:09:03] Running FragmentRemover
RDKit INFO: [11:09:03] Running FragmentRemover
RDKit INFO: [11:09:03] Running FragmentRemover
RDKit INFO: [11:09:03] Running FragmentRemover
RDKit INFO: [11:09:03] Running FragmentRemover
RDKit INFO: [11:09:03] Running FragmentRemover
RDKit INFO: [11:09:03] Running FragmentRemover
RDKit INFO: [11:09:03] Running FragmentRemover
RDKit INFO: [11:09:03] Running FragmentRemover

Unnamed: 0,IDs,Names,SMILEs,Failed_at,mol,InchI_before,mol_after,InchI_after,noChanges_inchi,smiles_before,Smiles 3,noChanges
196,197,Ciprofloxacin,OC(=O)C1=CN(C2CC2)c2cc(N3CCNCC3)c(F)cc2C1=O,4,<rdkit.Chem.rdchem.Mol object at 0x7f4381050e40>,InChI=1S/C17H18FN3O3/c18-13-7-11-14(8-15(13)20...,<rdkit.Chem.rdchem.Mol object at 0x7f4381045080>,InChI=1S/C17H18FN3O3/c18-13-7-11-14(8-15(13)20...,True,O=C(O)c1cn(C2CC2)c2cc(N3CCNCC3)c(F)cc2c1=O,O=C(O)c1cn(C2CC2)c2cc(N3CCNCC3)c(F)cc2c1=O,True
197,198,Cisapride,COC1CN(CCCOc2ccc(F)cc2)CCC1NC(=O)c1cc(Cl)c(N)c...,4,<rdkit.Chem.rdchem.Mol object at 0x7f4381050e90>,InChI=1S/C23H29ClFN3O4/c1-30-21-13-19(26)18(24...,<rdkit.Chem.rdchem.Mol object at 0x7f4381045120>,InChI=1S/C23H29ClFN3O4/c1-30-21-13-19(26)18(24...,True,COc1cc(N)c(Cl)cc1C(=O)NC1CCN(CCCOc2ccc(F)cc2)C...,COc1cc(N)c(Cl)cc1C(=O)NC1CCN(CCCOc2ccc(F)cc2)C...,True
198,199,Citalopram,CN(C)CCCC1(OCc2cc(ccc12)C#N)c1ccc(F)cc1,4,<rdkit.Chem.rdchem.Mol object at 0x7f4381050ee0>,InChI=1S/C20H21FN2O/c1-23(2)11-3-10-20(17-5-7-...,<rdkit.Chem.rdchem.Mol object at 0x7f43810450d0>,InChI=1S/C20H21FN2O/c1-23(2)11-3-10-20(17-5-7-...,True,CN(C)CCCC1(c2ccc(F)cc2)OCc2cc(C#N)ccc21,CN(C)CCCC1(c2ccc(F)cc2)OCc2cc(C#N)ccc21,True
199,200,Citric acid,OC(=O)CC(O)(CC(O)=O)C(O)=O,4,<rdkit.Chem.rdchem.Mol object at 0x7f4381050f30>,"InChI=1S/C6H8O7/c7-3(8)1-6(13,5(11)12)2-4(9)10...",<rdkit.Chem.rdchem.Mol object at 0x7f4381045170>,"InChI=1S/C6H8O7/c7-3(8)1-6(13,5(11)12)2-4(9)10...",True,O=C(O)CC(O)(CC(=O)O)C(=O)O,O=C(O)CC(O)(CC(=O)O)C(=O)O,True
200,201,zirconium,CCO[Zr](OCC)(OCC)OCC,4,<rdkit.Chem.rdchem.Mol object at 0x7f4381050f80>,"InChI=1S/4C2H5O.Zr/c4*1-2-3;/h4*2H2,1H3;/q4*-1;+4",<rdkit.Chem.rdchem.Mol object at 0x7f43810451c0>,"InChI=1S/4C2H5O.Zr/c4*1-2-3;/h4*2H2,1H3;/q4*-1;+4",True,CCO[Zr](OCC)(OCC)OCC,CC[O-].CC[O-].CC[O-].CC[O-].[Zr+4],False


In [None]:
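# Sketch for filtering metals (originally pseudocode).
# `detect_metals` is a placeholder for a function that takes a mol and returns True/False;
# it is passed in as an argument because it is not imported in this notebook.
def split_by_metal(records, detect_metals):
    metal_true = []
    metal_false = []

    for smiles in records:
        mol = convert_format.convert_smiles_to_mol(smiles)
        has_metal = detect_metals(mol)
        if has_metal is False:
            metal_false.append(smiles)
        elif has_metal is True:
            metal_true.append(smiles)
        else:
            raise ValueError(f"Something is wrong with {smiles}")
    return metal_true, metal_false

# Usage, assuming no_fragment_record holds the SMILES left after fragment removal:
# metal_true, metal_false = split_by_metal(no_fragment_record, detect_metals)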

### Structural Conversion and Cleaning

Some drugs need to be transformed "into their salt form to enhance how the drug dissolves (...) and (to) increase its effectiveness" (https://www.drugs.com/article/pharmaceutical-salts.html, accessed 03/12/21). It is therefore common for chemical compound databases to contain records of salts. If possible, it is recommended to delete records containing salts completely since, as with inorganic compounds, "most descriptor-generating software (can not process salts)" (Fourches 2010, Chapter 2.2). Although not desirable, converting compounds into their neutral forms is still an acceptable procedure. Such cases should be tagged, filtered and afterwards manually curated or compared to the actual neutral form of the compound. 
If we want to continue working with the converted records, we should perform the following steps:
- check whether records contain metal-containing compounds; these are difficult cases and are filtered out (already done one step earlier)
- remove the salts from the record
- neutralize the record (normalization or basic standardization)
- neutralize the charges
- to be discussed: adding or removing hydrogens, both of which have pros and cons. Adding hydrogens can give higher prediction performance but may introduce noise and thus less reliable models; removing hydrogens might introduce errors when calculating descriptors, since certain cases may not be handled well.
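
The salt removal and charge neutralization steps listed above can be sketched with RDKit's own tools. This is an illustration under assumptions, not the behaviour of the module's `remove_salts` or `handle_charges`: the input molecule is an example chosen here, and the salt definition `"[Cl,Br,Na]"` is a deliberately small, made-up list.

```python
from rdkit import Chem, RDLogger
from rdkit.Chem.SaltRemover import SaltRemover
from rdkit.Chem.MolStandardize import rdMolStandardize

# Keep the standardizer's INFO messages out of the output
RDLogger.DisableLog("rdApp.*")

# Example input chosen for illustration: ethylammonium chloride
mol = Chem.MolFromSmiles("CC[NH3+].[Cl-]")

# Strip counter-ions; the SMARTS list is an assumed, minimal salt definition
remover = SaltRemover(defnData="[Cl,Br,Na]")
stripped = remover.StripMol(mol)

# Neutralize the remaining charge by removing a hydrogen where possible
uncharger = rdMolStandardize.Uncharger()
neutral = uncharger.uncharge(stripped)

stripped_smiles = Chem.MolToSmiles(stripped)
neutral_smiles = Chem.MolToSmiles(neutral)
print(stripped_smiles, "->", neutral_smiles)
```

Records changed by either step can be tagged, for example by comparing the SMILES before and after, so that they remain available for manual curation.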



In [None]:
# Structural conversion
# Cleaning/removal of salts
# Functions remove_salts
# normalize_molecules
# handle_charges
# handle_hydrogens

### Normalization of Specific Chemotypes

More complex than just Normalization.

In [None]:
# Normalization of specific chemotypes
# normalize_molecules

In [None]:
# Treatment of tautomeric forms
# handle_tautomers

### Removal of duplicates

In [None]:
# Analysis/removal of duplicates

In [None]:
# Manual inspection

In [None]:
# Concatenation of results at the end
test = pd.concat([failed_step_1,failed_step_2])
test = test.sort_values(by=['IDs'])
