# Implementation and evaluation of a computational standardization pipeline for chemical compounds
--------------------------------------------------------------

> Based on ["Trust, But Verify: On the Importance of Chemical Structure Curation in Cheminformatics and QSAR Modeling Research" from 2010 (D. Fourches, ...)"](https://pubmed.ncbi.nlm.nih.gov/20572635/)

By Allen Dumler; reviewed by Jaime Rodríguez-Guerra, PhD.

### Introduction 

This notebook serves to display the functionality of the `opencadd.compounds.standardization` subpackage. 

We are following the recommended standardization steps of ["Trust, But Verify" (Fourches et al., 2010)](https://pubmed.ncbi.nlm.nih.gov/20572635/), and using a modified¹ version of the dataset from the following paper: [Cheminformatics Analysis of Assertions Mined from Literature That Describe Drug-Induced Liver Injury in Different Species](https://pubs.acs.org/doi/10.1021/tx900326k).

¹ We added some entries to trigger curation steps not covered by the original data.

### Overview of the pipeline
------------------------------------------

This pipeline has **five** main steps:
1. Structural Conversion
2. Filtering of Inorganics and Mixtures
3. Structural Cleaning 
4. Normalization of Specific Chemotypes
5. Removal of Duplicates

Each step consists of tasks that perform one of the following actions on the dataset:

- filtering
- cleaning
- normalizing

**Filtering** actions assign a score to the affected entries: the number of the filtering task that rejected them. You can use this score to select subsets of the dataset through the column `Filtered_at`.

**Cleaning** actions modify the mol representation of the entry, overwriting it with the version calculated in the task. Affected entries can be selected through the column `Cleaned_at`.

**Normalizing** actions also modify the mol representation of the entry. Affected entries can be selected through the column `Normalized_at`.

At the end of the script, there is the possibility to export subsets of the dataset as a CSV. 
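
As a minimal sketch of how these marker columns can be used, the snippet below selects subsets by marker value and writes one of them as CSV. The four rows are a hand-made toy table borrowing names and SMILES from the dataset; the marker values and the `;` separator are illustrative assumptions, not output of the pipeline.

```python
import io

import pandas as pd

# Toy stand-in for the curated dataset (marker values chosen for illustration).
toy = pd.DataFrame(
    {
        "Names": ["Zileuton", "Water", "covalent_metal", "Modafinil"],
        "SMILES": [
            "CC(c1cc2ccccc2s1)N(O)C(N)=O",
            "O",
            "CCC(=O)O[Na]",
            "NC(=O)CS(=O)C(c1ccccc1)c1ccccc1",
        ],
        "Filtered_at": [0, 2, 3, 0],
        "Normalized_at": [0, 0, 0, 7],
    }
)

# Entries that passed every filter keep the default marker 0.
passed = toy[toy["Filtered_at"] == 0]

# Entries rejected by specific tasks are selected by task number.
rejected = toy[toy["Filtered_at"].isin([2, 3])]

# Entries that were changed by a normalization task.
normalized = toy[toy["Normalized_at"] != 0]

# Export a subset as CSV; an in-memory buffer stands in for a file path.
buffer = io.StringIO()
passed.to_csv(buffer, sep=";", index=False)
csv_text = buffer.getvalue()
```

Replacing the buffer with a path such as `HERE / "data" / "passed.csv"` (a hypothetical file name) would write the subset to disk instead.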

In [1]:
from pathlib import Path

HERE = Path(_dh[-1])
REPO = HERE.parents[1]

print("Tutorial location:", HERE)
print("Repo location:    ", REPO)

Tutorial location: C:\Users\Allen.DESKTOP-O8FR8HB\Documents\DEV\opencadd\docs\tutorials
Repo location:     C:\Users\Allen.DESKTOP-O8FR8HB\Documents\DEV\opencadd


In [2]:
# Import pandas and numpy
import pandas as pd
import numpy as np

# Importing functions from the standardization API
from opencadd.compounds.standardization import (
    convert_format,
    detect_mixtures,
    detect_metals,
    detect_salts,
    detect_inorganics,
    handle_fragments,
    handle_tautomers,
    handle_charges,
    disconnect_metals,
    remove_salts,
    normalize_molecules,
    validate_molecules,
)

In [3]:
# Utility function to compare SMILES
# JRG: Why is this function needed? You can use
#     `operator.ne` from the standard library!
def smiles_string_changed(smiles_old, smiles_new):
    """
    Compares SMILES strings. If they are identical, the value returned is False,
    if they differ the value returned is True.

    Parameters
    ----------
    smiles_old: str
        SMILES string
    smiles_new: str
        SMILES string

    Returns
    -------
    bool
        True if changes were detected, False otherwise
    """
    return smiles_old != smiles_new

### Initial dataset import and cleaning of empty entries
------------------------------------------------
Before any curation steps can be applied, we need to import the dataset as a pandas DataFrame. <br>
At this point you can select the columns you need for the curation process. For our example dataset we will use the columns <b>IDs</b>, <b>Names</b> and <b>SMILEs</b> (the last one is renamed to <b>SMILES</b> in the code).<br>
After that, we search for all entries that have no value stored under <b>SMILES</b> and remove them from the dataset.<br>
After the import, we add a <b>Filtered_at</b> column to track which standardization step filtered the entry. 
The initial `task_number` is 0, which gives every entry a default <b>Filtered_at</b> value of 0; entries that still carry 0 passed without any filtering. The <b>Cleaned_at</b> and <b>Normalized_at</b> columns are initialized in the same way.
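
The import and initialization described above can be sketched compactly with `dropna` and scalar assignment, in line with the reviewer remarks in the code cell. This is only a sketch on a hand-made three-row table, not the tutorial's CSV file; the name `curated` is chosen for illustration.

```python
import pandas as pd

# Toy rows standing in for the CSV file; a missing SMILES is represented by None,
# which is how an empty CSV field typically ends up after pd.read_csv.
raw = pd.DataFrame(
    {
        "IDs": [1, 2, 3],
        "Names": ["Ethanol", "Unknown", "Water"],
        "SMILEs": ["CCO", None, "O"],
    }
)

task_number = 0

curated = (
    raw[["IDs", "Names", "SMILEs"]]
    .rename(columns={"SMILEs": "SMILES"})  # fix the column name
    .dropna(subset=["SMILES"])             # drop entries without a SMILES
    .reset_index(drop=True)                # close the gaps left by dropped rows
)

# A scalar is broadcast to every row, so no apply/lambda is needed.
for marker in ("Filtered_at", "Cleaned_at", "Normalized_at"):
    curated[marker] = task_number
```

Note that `dropna` only removes missing values; entries holding a literal empty string would need an extra comparison such as `curated["SMILES"] != ""`.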

In [4]:
task_number = 0

# Import test-dataset
dataset = pd.read_csv(HERE / "data" / "standardization_test_data.csv", delimiter=";")

# Filter columns
dataset = dataset[["IDs", "Names", "SMILEs"]]
# Rename a column, due to a typo in the original dataset
dataset = dataset.rename(columns={"SMILEs": "SMILES"})

# Delete empty entries from the main set.
# JRG: The usual thing here is to use df.dropna() function, possibly with a subset=XXX option
dataset = dataset[(dataset["SMILES"].notna())]

# Initializing the score to 0 in the 'Filtered_at' column
# JRG: You can use a constant here, I believe: dataset[X] = task_number
dataset["Filtered_at"] = dataset["SMILES"].apply(
    lambda x, task_number=task_number: task_number
)

# Initializing the score to 0 in the 'Cleaned_at' column
# JRG: Same as above
dataset["Cleaned_at"] = dataset["SMILES"].apply(
    lambda x, task_number=task_number: task_number
)

# Initializing the score to 0 in the 'Normalized_at' column
# JRG: Same as above
dataset["Normalized_at"] = dataset["SMILES"].apply(
    lambda x, task_number=task_number: task_number
)


# Reset the index to correct the deletion of the empty entries
dataset = dataset.reset_index(drop=True)

# [Optional] Display empty entries for manual inspection.
# dataset[(dataset["SMILES"].isnull())]
dataset.head()

Unnamed: 0,IDs,Names,SMILES,Filtered_at,Cleaned_at,Normalized_at
0,1,(R)-Roscovitine,CCC(CO)Nc1nc(NCc2ccccc2)c2ncn(C(C)C)c2n1,0,0,0
1,2,17-Methyltestosterone,CC1(O)CCC2C3CCC4=CC(=O)CCC4(C)C3CCC12C,0,0,0
2,3,1-alpha-Hydroxycholecalciferol,CC(C)CCCC(C)C1CCC2C(CCCC12C)=CC=C1CC(O)CC(O)C1=C,0,0,0
3,4,"2,3-Dimercaptosuccinic acid",OC(=O)C(S)C(S)C(O)=O,0,0,0
4,5,"2,4,6-Trinitrotoluene",Cc1c(cc(cc1N(=O)=O)N(=O)=O)N(=O)=O,0,0,0


### Step 1: Encoding Conversion
------------------------------------------

__Convert the SMILES representation format of the compounds into Mol-files__

RDKit performs a sanitization of molecules converted to mol by default. <br>
In addition to some nitro and perchlorate transformations, the following steps are taken²:


- Calculate explicit and implicit valence of all atoms. Fails when atoms have illegal valence.
- Calculate symmetrized SSSR. This is the slowest step and fails in rare cases.
- Kekulize. Fails if a Kekule form cannot be found or non-ring bonds are marked as aromatic.
- Assign radicals if hydrogens set and bonds+hydrogens+charge < valence.
- Set aromaticity, if none is set in the input. Go around the rings and apply the Hückel rule to set atoms and bonds as aromatic.
- Set a conjugated property on bonds where applicable.
- Set hybridization property on atoms.
- Remove chirality markers from sp and sp2 hybridized centers.

If the conversion from SMILES to mol fails, then those SMILES will get a **Filtered_at** marker added. 

> JRG: SMILES contains the end S already. It is not a plural form!

To avoid molecule sanitization, `convert_smiles_to_mol` can be called with the argument `sanitize=False`. Keep in mind that the generation of different Lewis structures serves to find alternative representation formats of the same molecule. 

__Overwrite the SMILES representation with ones compiled from our generated Mol-files__ <br>
In order to register the changes we make to the entries, we have to calculate canonical SMILES with our function `convert_format`, which by default returns a canonical representation. The conversion back to SMILES has to happen since SMILES encodings vary depending on the algorithm used to calculate them. The newly calculated SMILES will be used as a validation parameter to determine any changes made to our entries further down the curation pipeline. 
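
To illustrate why the round trip through a canonical form matters, the sketch below calls RDKit directly rather than the `convert_format` wrappers; that the wrappers behave the same way is an assumption. It shows three different spellings of ethanol collapsing to one canonical SMILES, and a malformed string yielding `None` instead of a molecule.

```python
from rdkit import Chem

# Three valid but different SMILES spellings of ethanol.
variants = ["OCC", "C(O)C", "CCO"]

# MolFromSmiles sanitizes by default; MolToSmiles returns canonical SMILES by default.
canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(smiles)) for smiles in variants}

# A string with an unclosed ring cannot be parsed: RDKit logs an error
# and returns None instead of raising an exception.
broken = Chem.MolFromSmiles("C1CC")
```

Because failed conversions come back as `None`, a null check on the `mol` column is enough to find the entries that need a **Filtered_at** marker.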

References:

² https://molvs.readthedocs.io/en/latest/guide/standardize.html?highlight=sanitize#rdkit-sanitize
* https://chemistry.stackexchange.com/questions/116498/what-is-kekulization-in-rdkit
* https://rdkit-discuss.narkive.com/QwnqcKcM/another-can-t-kekulize-mol-observation
* https://www.rdkit.org/docs/Cookbook.html
* https://www.rdkit.org/docs/source/rdkit.Chem.rdmolfiles.html

#### Task 1: Convert to RDKit Molecule Objects

In [5]:
# Setting up the task_number
task_number = 1

# A column called mol is being added to the dataframe to store the mol-files
# JRG: I think you can use dataset["SMILES"].apply here directly, without loc (huge overhead)
dataset["mol"] = dataset.loc[:, ("SMILES")].apply(convert_format.convert_smiles_to_mol)

# Add task_number to failed entries
dataset.loc[dataset["mol"].isnull(), ["Filtered_at"]] = task_number

# Overwrite SMILES with the canonical ones, which convert_format creates by default

dataset["SMILES"] = dataset.apply(
    lambda row: convert_format.convert_mol_to_smiles(row.mol)
    if row.Filtered_at == 0
    else row.SMILES,
    axis=1,
)

dataset.tail(16)

Unnamed: 0,IDs,Names,SMILES,Filtered_at,Cleaned_at,Normalized_at,mol
943,944,Xipamide,Cc1cccc(C)c1NC(=O)c1cc(S(N)(=O)=O)c(Cl)cc1O,0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...
944,945,Yohimbine,COC(=O)C1C(O)CCC2CN3CCc4c([nH]c5ccccc45)C3CC21,0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...
945,946,Zafirlukast,COc1cc(C(=O)NS(=O)(=O)c2ccccc2C)ccc1Cc1cn(C)c2...,0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...
946,947,Zalcitabine,Nc1ccn(C2CCC(CO)O2)c(=O)n1,0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...
947,948,Zidovudine,Cc1cn(C2CC(NN=N)C(CO)O2)c(=O)[nH]c1=O,0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...
948,949,Zileuton,CC(c1cc2ccccc2s1)N(O)C(N)=O,0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...
949,950,Zinc acetate,CC(=O)O,0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...
950,951,Zolpidem,Cc1ccc(-c2nc3ccc(C)cn3c2CC(=O)N(C)C)cc1,0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...
951,952,zirconium,CCO[Zr](OCC)(OCC)OCC,0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...
952,953,hemoglobin,C=CC1=C(C)c2cc3[n-]c(cc4nc(cc5[nH]c(cc1n2)c(C)...,0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...


### Step 2: Filtering of Inorganics and Mixtures
--------------------------------------------------

Since most cheminformatics applications cannot process inorganic structures, those entries need to be removed before any further processing.<br>
Detecting inorganic structures is divided into two steps:<br>
First, we remove all entries that contain no carbon at all and are therefore not organic.<br>
Second, we filter out all compounds with inorganic substructures. <br>

Similar problems occur for mixtures. Since most applications cannot calculate descriptors for mixtures, filtering has to happen before processing. <br>
Additionally, since "*inorganic compounds are known to have biological effects, like toxic effects*" (Fourches 2010), we often cannot tell whether the organic or the inorganic part of a mixed compound causes the recorded activity. Such an entry is therefore of little use and can be discarded. 


Since the treatment of mixtures is not as simple as it appears, the paper recommends deleting records that contain them.
A common and widely used practice is to retain the molecule with the highest molecular weight or the largest number of atoms. Still, the paper states this might not be the best solution, and mixtures should only be investigated further if there is reason to believe that the largest molecule, and not the mixture itself, causes the biological activity.

#### Task 2: Filter entries without Carbon

The first task to determine if an entry is an organic molecule is to check for the presence of carbon. The function `detect_carbon` does this: it searches the molecule for carbon atoms and returns `True` if it finds at least one, `False` otherwise. All entries that return `False` get the current task number (2) assigned in the *Filtered_at* column.
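
This task and the later filtering tasks share one pattern: run a check only on entries that are still unfiltered, then stamp the failing ones with the task number. A minimal sketch of that pattern as a reusable function follows. The helper `mark_filtered` and the precomputed `n_carbon` column are hypothetical and only keep the example free of RDKit; they are not part of the standardization API.

```python
import pandas as pd


def mark_filtered(df, column, check, task_number, fail_value):
    """Hypothetical helper: run `check` on unfiltered rows and stamp failures."""
    # Rows already filtered by an earlier task are skipped and receive None.
    df[column] = df.apply(
        lambda row: check(row) if row.Filtered_at == 0 else None,
        axis=1,
    )
    # Stamp the rows whose check result equals the failing value.
    df.loc[df[column] == fail_value, "Filtered_at"] = task_number
    return df


# Toy data; `n_carbon` is a made-up column standing in for a real carbon check.
toy = pd.DataFrame(
    {
        "Names": ["Ethanol", "Water", "Unparsable"],
        "n_carbon": [2, 0, 0],
        "Filtered_at": [0, 0, 1],
    }
)

toy = mark_filtered(
    toy,
    column="Carbon_present",
    check=lambda row: row.n_carbon > 0,
    task_number=2,
    fail_value=False,
)
```

The `fail_value` argument exists because the tasks differ in polarity: the carbon check fails on `False`, while the inorganic, mixture, metal and salt checks fail on `True`.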


In [6]:
# Setting up the task_number
task_number = 2

# Check for Carbon
dataset["Carbon_present"] = dataset.apply(
    lambda row: detect_inorganics.detect_carbon(row.mol)
    if row.Filtered_at == 0
    else None,
    axis=1,
)

# Add task_number to failed entries
dataset.loc[dataset["Carbon_present"] == False, ["Filtered_at"]] = task_number

Below you can see all entries that do not contain any Carbon and thereby are inorganic molecules.

In [7]:
dataset[dataset["Filtered_at"] == 2]

Unnamed: 0,IDs,Names,SMILES,Filtered_at,Cleaned_at,Normalized_at,mol,Carbon_present
953,954,test_salt,Br.Cl.F.I.N.O.S.[Ag].[Al].[Ba].[Bi].[Ca].[K].[...,2,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,False
957,959,Water,O,2,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,False


#### Task 3: Filter entries with inorganic components

After filtering out all molecules that contain no carbon, we now inspect the remaining entries for elements that do not occur in organic molecules. Which elements count as such varies slightly with definition and scope. `detect_inorganic` is a suitable function for this task.
We recommend checking which elements the software used downstream can handle. The elements allowed by `detect_inorganic` can be customized by passing a SMARTS pattern, as described shortly below. 
The default set of accepted elements in an organic molecule is Hydrogen, Carbon, Nitrogen, Oxygen, Fluorine, Phosphorus, Sulfur, Chlorine, Selenium, Bromine, Iodine (nonmetals and halogens). <br>
*While Astatine and Tennessine are also considered halogens, they are not included due to their radioactivity and rarity.*
<br>


###### An example of how to set up a custom set of elements and implement them in `detect_inorganic` 
-----------------------------------------------------------------------------------------------------------
Defining a set:<br>
`elements = Chem.MolFromSmarts("[!#1&!#6&!#7&!#8&!#9&!#15&!#16&!#17&!#35&!#53]")` <br>
(To run this, first import RDKit with `from rdkit import Chem`.)

Pass the pattern as a parameter where the `detect_inorganic` function is called:<br>
`lambda row: detect_inorganics.detect_inorganic(row.mol, elements)`
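
To show what such a pattern matches, here is a small sketch using plain RDKit. The function `has_disallowed_atom` is a hypothetical stand-in for the check; how `detect_inorganic` evaluates the pattern internally is not shown in this notebook.

```python
from rdkit import Chem

# Matches any atom that is NOT H, C, N, O, F, P, S, Cl, Br or I.
elements = Chem.MolFromSmarts("[!#1&!#6&!#7&!#8&!#9&!#15&!#16&!#17&!#35&!#53]")


def has_disallowed_atom(mol, pattern=elements):
    """Hypothetical stand-in: True if at least one atom matches the pattern."""
    return mol.HasSubstructMatch(pattern)


ethanol = Chem.MolFromSmiles("CCO")
boronic_acid = Chem.MolFromSmiles("CB(O)O")  # boron is outside the allowed set

ethanol_flagged = has_disallowed_atom(ethanol)
boron_flagged = has_disallowed_atom(boronic_acid)
```

Note that this particular pattern does not list selenium (`#34`), so, unlike the default set described above, it would also flag selenium-containing molecules.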

In [8]:
# Setting up the task_number
task_number = 3


# Check for inorganic structures
dataset["Inorganics"] = dataset.apply(
    lambda row: detect_inorganics.detect_inorganic(row.mol)
    if row.Filtered_at == 0
    else None,
    axis=1,
)

# Add task_number to failed entries
dataset.loc[dataset["Inorganics"] == True, ["Filtered_at"]] = task_number

Below you can see all entries that contain elements other than the allowed ones (Hydrogen, Carbon, Nitrogen, Oxygen, Fluorine, Phosphorus, Sulfur, Chlorine, Selenium, Bromine, Iodine).

In [9]:
dataset[dataset["Filtered_at"] == 3]

Unnamed: 0,IDs,Names,SMILES,Filtered_at,Cleaned_at,Normalized_at,mol,Carbon_present,Inorganics
114,115,Bortezomib,CC(C)CC(NC(=O)C(Cc1ccccc1)NC(=O)c1cnccn1)B(O)O,3,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,True,True
407,408,Gold Sodium Thiomalate,O=C(O)CC(S[Au])C(=O)O,3,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,True,True
533,534,Mersalyl,COC(CNC(=O)c1ccccc1OCC(=O)O)C[Hg]O,3,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,True,True
951,952,zirconium,CCO[Zr](OCC)(OCC)OCC,3,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,True,True
952,953,hemoglobin,C=CC1=C(C)c2cc3[n-]c(cc4nc(cc5[nH]c(cc1n2)c(C)...,3,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,True,True
954,956,covalent_metal,CCC(=O)O[Na],3,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,True,True


#### Task 4: Filter entries containing mixtures

We will use the function `detect_mixtures` as shown below for the filtering of mixtures.
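
How `detect_mixtures` decides is not shown in this notebook. As a rough, independent illustration of what a mixture looks like at the SMILES level: disconnected components are written separated by `.`, so counting the dot-separated parts of a canonical SMILES gives the number of components. The helper name `count_components` is hypothetical.

```python
def count_components(smiles: str) -> int:
    """Hypothetical helper: number of dot-separated components in a SMILES string."""
    return len(smiles.split("."))


examples = {
    "Ethanol": "CCO",
    "1,4-Dioxane": "C1COCCO1.Oc1ccccc1",  # entry 960 of the dataset
}

# An entry with more than one component is a candidate mixture.
flags = {name: count_components(smiles) > 1 for name, smiles in examples.items()}
```

This string heuristic is only reliable for canonical SMILES: ring-closure digits shared across a dot (as in `C1.C1`, which is ethane) can join parts that look separate, so a graph-based fragment count on the molecule object is the safer choice for real data.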

In [10]:
# Setting up the task_number
task_number = 4

# Check for mixtures
dataset["mixture"] = dataset.apply(
    lambda row: detect_mixtures(row.mol) if row.Filtered_at == 0 else None,
    axis=1,
)

# Add task_number to failed entries
dataset.loc[dataset["mixture"] == True, ["Filtered_at"]] = task_number

Below you can see all entries that are mixtures.

In [11]:
dataset[dataset["Filtered_at"] == 4]

Unnamed: 0,IDs,Names,SMILES,Filtered_at,Cleaned_at,Normalized_at,mol,Carbon_present,Inorganics,mixture
958,960,"1,4-Dioxane",C1COCCO1.Oc1ccccc1,4,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,True,False,True


### Step 3: Structural Cleaning 
--------------------------------------------------

Some drugs need to be transformed "into their salt form to enhance how the drug dissolves (...) and (to) increase its effectiveness."³ It is therefore common for chemical compound databases to contain records of salts. If possible, records containing salts should be deleted completely since, similar to inorganic compounds, "most descriptor-generating software (can not process salts)" (Fourches 2010). While not desirable, converting such compounds into their neutral forms is still an acceptable procedure. Cases like this should, however, be tagged, filtered, and afterward manually curated or compared to the actual neutral form of the compound. 
If we want to continue working with the converted records, we should perform the following steps:
- check whether the records contain metals
- remove the salts from the record
- normalize the record (normalization or essential standardization)
- neutralize the charges

<span style="color:red">__Note:__</span>
While it is possible to clean and reuse entries containing metals or salts, those entries will not be curated here since that is beyond the scope of this tutorial. Nevertheless, we outline the steps to take and which functions of the standardization API to use. 

³ (https://www.drugs.com/article/pharmaceutical-salts.html (03/12/21))

#### Task 5: Filter entries containing metals

Entries can contain metals in different forms: either as a regular component of a mixture or as a counterion.<br> In the following steps, we search for those metals. When they are counterions, we disconnect them from the non-metals they are bonded to. 
We might not find any metals here because the previous filtering steps for mixtures and inorganics have already flagged them. Therefore, we can also search the flagged entries and clean them later on. 

In [12]:
# Setting up the task_number
task_number = 5

# Check for metals
dataset["metals"] = dataset.apply(
    lambda row: detect_metals(row.mol) if row.Filtered_at == 0 else None,
    axis=1,
)

# Add task_number to failed entries
dataset.loc[dataset["metals"] == True, ["Filtered_at"]] = task_number

Below you can see all entries containing metals. We didn't find any entries in our filtered set, as already assumed.

In [13]:
dataset[dataset["Filtered_at"] == 5]

Unnamed: 0,IDs,Names,SMILES,Filtered_at,Cleaned_at,Normalized_at,mol,Carbon_present,Inorganics,mixture,metals


So the next step would be to examine our filtered entries.<br>
For that, we make a copy of our current status of the dataset.

In [14]:
score = [3, 4]
failed_entries_copy = dataset[dataset["Filtered_at"].isin(score)].copy()

We then check the copied entries for the presence of metals.

In [15]:
# Check for metals
failed_entries_copy["metals"] = failed_entries_copy.apply(
    lambda row: detect_metals(row.mol) if row.Filtered_at != 0 else None,
    axis=1,
)

In [16]:
failed_entries_copy[failed_entries_copy["metals"] == True]

Unnamed: 0,IDs,Names,SMILES,Filtered_at,Cleaned_at,Normalized_at,mol,Carbon_present,Inorganics,mixture,metals
407,408,Gold Sodium Thiomalate,O=C(O)CC(S[Au])C(=O)O,3,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,True,True,,True
533,534,Mersalyl,COC(CNC(=O)c1ccccc1OCC(=O)O)C[Hg]O,3,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,True,True,,True
951,952,zirconium,CCO[Zr](OCC)(OCC)OCC,3,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,True,True,,True
954,956,covalent_metal,CCC(=O)O[Na],3,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,True,True,,True


One entry has a counterion that can be disconnected. We might consider removing the metals from such entries and re-running this standardization notebook with the cleaned entries. <br> In this case, however, that does not make much sense: the molecules that remain after removing Zirconium would not be functional, and the covalent metal entry would be deleted entirely since all of its substructures are salts.

If it did make sense, the steps would be:
1. disconnect_metals
2. handle_charges.uncharge
3. normalize_molecules.normalize
4. remove_salts
5. handle_charges.uncharge
6. normalize_molecules.normalize
7. handle_fragments.choose_largest_fragment
8. Apply a `Cleaned_at` marker

#### Task 6: Removing salts 

This curation step can be applied to different subsets of the dataset.<br>
First, we will apply this to our entries that passed all steps before. 
Since we filtered out all mixtures in previous steps, any salt found in this step is the only component of its entry. Such entries therefore need to be deleted (filtered).

More interesting might be the inspection of the *inorganics* **(Task 3)** or *mixtures* **(Task 4)**. We could check if any of those mixtures contain salts known in our dictionary. If so, we can delete those salts and reuse the entries if they are free of mixtures.

First, we will search for salts in our dataset. 

In [17]:
# Setting up the task_number
task_number = 6

# Check for salts
dataset["salts"] = dataset.apply(
    lambda row: detect_salts(row.mol) if row.Filtered_at == 0 else None,
    axis=1,
)

# Add task_number to failed entries
dataset.loc[dataset["salts"] == True, ["Filtered_at"]] = task_number

Below you can see all entries containing salts.

In [18]:
dataset[dataset["Filtered_at"] == 6]

Unnamed: 0,IDs,Names,SMILES,Filtered_at,Cleaned_at,Normalized_at,mol,Carbon_present,Inorganics,mixture,metals,salts
22,23,Acetic acid,CC(=O)O,6,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA53...,True,False,False,False,True
199,200,Citric acid,O=C(O)CC(O)(CC(=O)O)C(=O)O,6,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,True,False,False,False,True
275,276,Dimethyl sulfoxide,CS(C)=O,6,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,True,False,False,False,True
335,336,Ethanol,CCO,6,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,True,False,False,False,True
356,357,Ferrous citrate,O=C(O)CC(O)(CC(=O)O)C(=O)O,6,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,True,False,False,False,True
400,401,Glutamic acid,NC(CCC(=O)O)C(=O)O,6,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,True,False,False,False,True
405,406,Glycerol,OCC(O)CO,6,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,True,False,False,False,True
481,482,Lactic acid,CC(O)C(=O)O,6,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,True,False,False,False,True
605,606,Niacin,O=C(O)c1cccnc1,6,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,True,False,False,False,True
613,614,Nicotinic acid,O=C(O)c1cccnc1,6,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,True,False,False,False,True


To demonstrate the removal of salts, we can generate SMILES out of the mol after the deletion of the salts.
We will observe the generation of an empty SMILES string.

In [19]:
# First we make a deep copy of the original dataframe
demo_df = dataset.copy()
demo_df.head()

Unnamed: 0,IDs,Names,SMILES,Filtered_at,Cleaned_at,Normalized_at,mol,Carbon_present,Inorganics,mixture,metals,salts
0,1,(R)-Roscovitine,CCC(CO)Nc1nc(NCc2ccccc2)c2ncn(C(C)C)c2n1,0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA53...,True,False,False,False,False
1,2,17-Methyltestosterone,CC12CCC(=O)C=C1CCC1C2CCC2(C)C1CCC2(C)O,0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA53...,True,False,False,False,False
2,3,1-alpha-Hydroxycholecalciferol,C=C1C(=CC=C2CCCC3(C)C2CCC3C(C)CCCC(C)C)CC(O)CC1O,0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA53...,True,False,False,False,False
3,4,"2,3-Dimercaptosuccinic acid",O=C(O)C(S)C(S)C(=O)O,0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA53...,True,False,False,False,False
4,5,"2,4,6-Trinitrotoluene",Cc1c([N+](=O)[O-])cc([N+](=O)[O-])cc1[N+](=O)[O-],0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA53...,True,False,False,False,False


We then apply our changes to the copy to look at the SMILES generated after removal.

In [20]:
# Applying the remove_salts function to the molecules detected as salts in Task 6.
demo_df["mol"] = demo_df.apply(
    lambda row: remove_salts(row.mol) if row.Filtered_at == 6 else row.mol,
    axis=1,
)

# Generate a SMILES of the entries. (Only for demonstration purposes)
demo_df["SMILES"] = demo_df.apply(
    lambda row: convert_format.convert_mol_to_smiles(row.mol)
    if row.Filtered_at == 6
    else None,
    axis=1,
)

Then we can look at our entries containing salts

In [21]:
demo_df[demo_df["Filtered_at"] == 6]

Unnamed: 0,IDs,Names,SMILES,Filtered_at,Cleaned_at,Normalized_at,mol,Carbon_present,Inorganics,mixture,metals,salts
22,23,Acetic acid,,6,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA64...,True,False,False,False,True
199,200,Citric acid,,6,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA64...,True,False,False,False,True
275,276,Dimethyl sulfoxide,,6,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA64...,True,False,False,False,True
335,336,Ethanol,,6,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA64...,True,False,False,False,True
356,357,Ferrous citrate,,6,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA64...,True,False,False,False,True
400,401,Glutamic acid,,6,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA64...,True,False,False,False,True
405,406,Glycerol,,6,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA64...,True,False,False,False,True
481,482,Lactic acid,,6,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA64...,True,False,False,False,True
605,606,Niacin,,6,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA64...,True,False,False,False,True
613,614,Nicotinic acid,,6,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA64...,True,False,False,False,True


Next, we can filter for salts in the entries screened for inorganics and mixtures (as we did for the metals) to see if any inorganics or mixtures might have been salts.

In [22]:
score = [3, 4]
failed_entries_copy = dataset[dataset["Filtered_at"].isin(score)].copy()

In [23]:
# Check for salts
failed_entries_copy["salts"] = failed_entries_copy.apply(
    lambda row: detect_salts(row.mol) if row.Filtered_at != 0 else None,
    axis=1,
)

In [24]:
failed_entries_copy

Unnamed: 0,IDs,Names,SMILES,Filtered_at,Cleaned_at,Normalized_at,mol,Carbon_present,Inorganics,mixture,metals,salts
114,115,Bortezomib,CC(C)CC(NC(=O)C(Cc1ccccc1)NC(=O)c1cnccn1)B(O)O,3,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,True,True,,,False
407,408,Gold Sodium Thiomalate,O=C(O)CC(S[Au])C(=O)O,3,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,True,True,,,False
533,534,Mersalyl,COC(CNC(=O)c1ccccc1OCC(=O)O)C[Hg]O,3,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,True,True,,,False
951,952,zirconium,CCO[Zr](OCC)(OCC)OCC,3,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,True,True,,,False
952,953,hemoglobin,C=CC1=C(C)c2cc3[n-]c(cc4nc(cc5[nH]c(cc1n2)c(C)...,3,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,True,True,,,False
954,956,covalent_metal,CCC(=O)O[Na],3,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,True,True,,,False
958,960,"1,4-Dioxane",C1COCCO1.Oc1ccccc1,4,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA54...,True,False,True,,False


None of the filtered entries shown above is flagged as containing a salt (the `salts` column is `False` for all of them). If one were, we might consider removing the salt from that entry and applying some normalization actions to it, similar to those listed for the disconnection of metals. <br>

The steps would be:

1. remove_salts
2. handle_charges.uncharge
3. normalize_molecules.normalize
4. handle_fragments.choose_largest_fragment

### Step 4: Normalization of Specific Chemotypes

After filtering all problematic entries in the previous steps and creating subsets to curate entries containing metals and salts, the next task is to apply normalization transformations to the remaining entries to correct functional groups and recombine charges. <br>
The standardization API uses the normalization transformations embedded in the rdMolStandardize package, which are derived from the rules described in the InChI technical manual. <br>

*If available, custom conversion rules can be used, but this requires modifying the `normalize_molecules.normalize` function. (This might be covered in further development of this API.)* 
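
As a minimal sketch of what such a normalization does, the snippet below calls RDKit's `rdMolStandardize.Normalize` directly on the `test_charge_recombination` entry of the dataset. That `normalize_molecules.normalize` wraps exactly this call is an assumption; the comparison at the end mirrors the SMILES comparison used in Task 7.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

# Charge-separated amide form, taken from the test dataset.
mol = Chem.MolFromSmiles("CC([O-])=[N+](C)C")

# Normalize returns a new molecule with RDKit's default transformations applied.
normalized = rdMolStandardize.Normalize(mol)

before = Chem.MolToSmiles(mol)
after = Chem.MolToSmiles(normalized)

# A changed SMILES signals that a normalization rule fired for this entry.
changed = before != after
```

For this entry the charges are recombined into the neutral amide, which matches the `SMILES_after_normalization` value shown in the Task 7 output.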

#### Task 7: Normalization

In [25]:
# Setting up the task_number
task_number = 7

# Normalize the entries, overwrite the previous mol
dataset["mol"] = dataset.apply(
    lambda row: normalize_molecules.normalize(row.mol)
    if row.Filtered_at == 0
    else row.mol,
    axis=1,
)

# Calculate new SMILES for the entries to determine which entries needed to be normalized
dataset["SMILES_after_normalization"] = dataset.apply(
    lambda row: convert_format.convert_mol_to_smiles(row.mol)
    if row.Filtered_at == 0
    else row.SMILES,
    axis=1,
)


# Compare the SMILES for changes after the normalization --> save as Boolean Value
dataset["normalized"] = dataset.apply(
    lambda row: smiles_string_changed(row.SMILES, row.SMILES_after_normalization)
    if row.Filtered_at == 0
    else None,
    axis=1,
)


# Add task_number to normalized entries
dataset.loc[dataset["normalized"] == True, ["Normalized_at"]] = task_number

Below you can see all entries where normalization steps took place.

In [26]:
dataset[dataset["Normalized_at"] == 7]

Unnamed: 0,IDs,Names,SMILES,Filtered_at,Cleaned_at,Normalized_at,mol,Carbon_present,Inorganics,mixture,metals,salts,SMILES_after_normalization,normalized
574,575,Modafinil,NC(=O)CS(=O)C(c1ccccc1)c1ccccc1,0,0,7,<rdkit.Chem.rdchem.Mol object at 0x0000016DA64...,True,False,False,False,False,NC(=O)C[S+]([O-])C(c1ccccc1)c1ccccc1,True
830,831,Sulfinpyrazone,O=C1C(CCS(=O)c2ccccc2)C(=O)N(c2ccccc2)N1c1ccccc1,0,0,7,<rdkit.Chem.rdchem.Mol object at 0x0000016DA64...,True,False,False,False,False,O=C1C(CC[S+]([O-])c2ccccc2)C(=O)N(c2ccccc2)N1c...,True
831,832,Sulindac,CC1=C(CC(=O)O)c2cc(F)ccc2C1=Cc1ccc(S(C)=O)cc1,0,0,7,<rdkit.Chem.rdchem.Mol object at 0x0000016DA64...,True,False,False,False,False,CC1=C(CC(=O)O)c2cc(F)ccc2C1=Cc1ccc([S+](C)[O-]...,True
955,957,test_charge_recombination,CC([O-])=[N+](C)C,0,0,7,<rdkit.Chem.rdchem.Mol object at 0x0000016DA65...,True,False,False,False,False,CC(=O)N(C)C,True


### Final Conversion back to SMILES

Since all actions on the mol-files are finished, we can now generate new SMILES strings from the final mol-files.

In [27]:
dataset["SMILES"] = dataset.apply(
    lambda row: convert_format.convert_mol_to_smiles(row.mol)
    if row.Filtered_at == 0
    else row.SMILES,
    axis=1,
)

#### (Task 8): Tautomers

Since compounds can exist in various tautomeric forms, it can be advantageous to calculate those forms. Because tautomerism is a broad and specialized field, this notebook does not go deeper into the problems related to tautomers. We only provide the tools to generate a canonicalized tautomer and to enumerate all possible tautomers of the final SMILES. 
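
Both operations can be sketched with RDKit's `TautomerEnumerator`. This uses RDKit directly and is not necessarily how `handle_tautomers` is implemented; which tautomer is chosen as canonical depends on the scoring rules of the installed RDKit version, so no particular result is claimed here.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

enumerator = rdMolStandardize.TautomerEnumerator()

# Acetohydroxamic acid, one of the dataset entries with several tautomers.
mol = Chem.MolFromSmiles("CC(=O)NO")

# One representative tautomer, picked by RDKit's scoring scheme.
canonical_tautomer = enumerator.Canonicalize(mol)
canonical_smiles = Chem.MolToSmiles(canonical_tautomer)

# All tautomers found by the enumeration rules, as SMILES strings.
all_tautomers = [Chem.MolToSmiles(taut) for taut in enumerator.Enumerate(mol)]
```

Comparing `canonical_smiles` with the curated SMILES then shows whether the curated structure is already the canonical tautomer.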

In [28]:
# Generate a canonicalized tautomer
dataset["canonicalized_tautomer_smiles"] = dataset.apply(
    lambda row: handle_tautomers.canonicalize_tautomer(row.SMILES)
    if row.Filtered_at != 1
    else None,
    axis=1,
)

# Compare the SMILES for changes after the generation of a canonicalized SMILES --> save as Boolean Value
dataset["new_canonical_tautomer"] = dataset.apply(
    lambda row: smiles_string_changed(row.SMILES, row.canonicalized_tautomer_smiles)
    if row.Filtered_at == 0
    else None,
    axis=1,
)



Below you can see all entries where the canonicalized tautomer differs from the SMILES that resulted from the curation process.

In [29]:
dataset[dataset["new_canonical_tautomer"] == True]

Unnamed: 0,IDs,Names,SMILES,Filtered_at,Cleaned_at,Normalized_at,mol,Carbon_present,Inorganics,mixture,metals,salts,SMILES_after_normalization,normalized,canonicalized_tautomer_smiles,new_canonical_tautomer
0,1,(R)-Roscovitine,CCC(CO)Nc1nc(NCc2ccccc2)c2ncn(C(C)C)c2n1,0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA64...,True,False,False,False,False,CCC(CO)Nc1nc(NCc2ccccc2)c2ncn(C(C)C)c2n1,False,CCC(CO)N=c1[nH]c(=NCc2ccccc2)c2ncn(C(C)C)c2[nH]1,True
11,12,5-Azacitidine,Nc1ncn(C2OC(CO)C(O)C2O)c(=O)n1,0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA64...,True,False,False,False,False,Nc1ncn(C2OC(CO)C(O)C2O)c(=O)n1,False,N=c1ncn(C2OC(CO)C(O)C2O)c(=O)[nH]1,True
18,19,Acenocoumarol,CC(=O)CC(c1ccc([N+](=O)[O-])cc1)c1c(O)oc2ccccc...,0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA64...,True,False,False,False,False,CC(=O)CC(c1ccc([N+](=O)[O-])cc1)c1c(O)oc2ccccc...,False,CC(=O)CC(c1ccc([N+](=O)[O-])cc1)c1c(O)c2ccccc2...,True
21,22,Acetazolamide,CC(=O)Nc1nnc(S(N)(=O)=O)s1,0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA64...,True,False,False,False,False,CC(=O)Nc1nnc(S(N)(=O)=O)s1,False,CC(=O)N=c1[nH]nc(S(N)(=O)=O)s1,True
24,25,Acetohydroxamic acid,CC(=O)NO,0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA64...,True,False,False,False,False,CC(=O)NO,False,CC(O)=NO,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
937,938,Vitamin K,CC(=CCC1=C(C)C(=O)c2ccccc2C1=O)CCCC(C)CCCC(C)C...,0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA65...,True,False,False,False,False,CC(=CCC1=C(C)C(=O)c2ccccc2C1=O)CCCC(C)CCCC(C)C...,False,CC(C=Cc1c(C)c(O)c2ccccc2c1O)=CCCC(C)CCCC(C)CCC...,True
941,942,Warfarin,CC(=O)CC(c1ccccc1)c1c(O)oc2ccccc2c1=O,0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA65...,True,False,False,False,False,CC(=O)CC(c1ccccc1)c1c(O)oc2ccccc2c1=O,False,CC(=O)CC(c1ccccc1)c1c(O)c2ccccc2oc1=O,True
946,947,Zalcitabine,Nc1ccn(C2CCC(CO)O2)c(=O)n1,0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA65...,True,False,False,False,False,Nc1ccn(C2CCC(CO)O2)c(=O)n1,False,N=c1ccn(C2CCC(CO)O2)c(=O)[nH]1,True
947,948,Zidovudine,Cc1cn(C2CC(NN=N)C(CO)O2)c(=O)[nH]c1=O,0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA65...,True,False,False,False,False,Cc1cn(C2CC(NN=N)C(CO)O2)c(=O)[nH]c1=O,False,Cc1cn(C2CC(N=NN)C(CO)O2)c(=O)[nH]c1=O,True


In addition to calculating a canonicalized tautomer, we can use the function `enumerate_tautomer`, which returns all possible tautomers of a SMILES string. This is practical for a detailed check of a single entry, but storing the results for every entry in the data frame would be impractical. We therefore recommend enumerating tautomers only for specifically selected entries.<br>
An example for the entry with the ID 12 is shown below:

In [30]:
# Extract the SMILES for the entry with the ID you want to enumerate the tautomers for
smiles_to_enumerate_tautomer = dataset.loc[dataset["IDs"] == 12, "SMILES"].iloc[0]

# Apply the enumerate_tautomer function on that SMILES
handle_tautomers.enumerate_tautomer(smiles_to_enumerate_tautomer)

{'N=c1ncn(C2OC(CO)C(O)C2O)c(=O)[nH]1',
 'N=c1ncn(C2OC(CO)C(O)C2O)c(O)n1',
 'Nc1ncn(C2OC(CO)C(O)C2O)c(=O)n1'}
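
For readers who want to see what such helpers might look like underneath, here is a minimal sketch that uses RDKit's `rdMolStandardize.TautomerEnumerator` directly. It is an assumption that `handle_tautomers` works along these lines; the function names `canonical_tautomer_smiles` and `enumerate_tautomer_smiles` are hypothetical and not part of `opencadd`, and the example molecule (2-hydroxypyridine) is not taken from the dataset.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def canonical_tautomer_smiles(smiles):
    """Return the SMILES of RDKit's canonical tautomer, or None if parsing fails.

    Hypothetical helper, not the opencadd implementation.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    enumerator = rdMolStandardize.TautomerEnumerator()
    return Chem.MolToSmiles(enumerator.Canonicalize(mol))


def enumerate_tautomer_smiles(smiles):
    """Return the set of SMILES of all tautomers RDKit enumerates (hypothetical helper)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return set()
    enumerator = rdMolStandardize.TautomerEnumerator()
    return {Chem.MolToSmiles(tautomer) for tautomer in enumerator.Enumerate(mol)}


# 2-hydroxypyridine / 2-pyridone is a textbook tautomer pair
tautomers = enumerate_tautomer_smiles("Oc1ccccn1")
canonical = canonical_tautomer_smiles("Oc1ccccn1")
print(canonical, tautomers)
```

The canonical tautomer is one member of the enumerated set, chosen by RDKit's scoring scheme. It is a reproducible representative, not necessarily the form that dominates under experimental conditions.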

### Step 5: Removal of duplicates

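Since RDKit can generate canonical SMILES, we can search for duplicate entries in our data frame by comparing SMILES strings. Note that only the entries with `Filtered_at == 0` had their `SMILES` regenerated from the mol object in the final conversion step; filtered entries still carry their original input strings.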

> JRG: Are you actually comparing canonical smiles here? I think they are just the raw values present in the dataset, aren't they? You might need to do a round trip MolFromSmiles->MolToSmiles to get the canonical version. Molecule comparison is tricky! Best way is to resort to graph homology, but we are not doing that now. Just ensure you are indeed using canonical smiles.


In [31]:
from itertools import combinations

# Compare every pair of entries exactly once and report identical SMILES strings
entries = zip(dataset["IDs"], dataset["SMILES"])
for (ID, SMILES), (compared_ID, compared_SMILES) in combinations(entries, 2):
    if SMILES == compared_SMILES and ID != compared_ID:
        print("Following IDs have identical SMILES", ID, "and", compared_ID)

Following IDs have identical SMILES 2 and 552
Following IDs have identical SMILES 23 and 799
Following IDs have identical SMILES 23 and 950
Following IDs have identical SMILES 30 and 507
Following IDs have identical SMILES 43 and 764
Following IDs have identical SMILES 47 and 883
Following IDs have identical SMILES 55 and 407
Following IDs have identical SMILES 58 and 864
Following IDs have identical SMILES 65 and 218
Following IDs have identical SMILES 168 and 169
Following IDs have identical SMILES 200 and 357
Following IDs have identical SMILES 211 and 212
Following IDs have identical SMILES 253 and 513
Following IDs have identical SMILES 322 and 326
Following IDs have identical SMILES 327 and 646
Following IDs have identical SMILES 606 and 614
Following IDs have identical SMILES 736 and 737
Following IDs have identical SMILES 751 and 935
Following IDs have identical SMILES 795 and 796
Following IDs have identical SMILES 799 and 950
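
The pairwise comparison above reports a triple such as 23, 799 and 950 as three separate pairs, and its cost grows quadratically with the number of entries. As a minimal sketch of an alternative, the entries can be grouped by their SMILES string in a single pass. The helper `group_duplicates` and the toy data are hypothetical; the optional `canonicalize` argument is a hook where an RDKit round trip (`MolFromSmiles` followed by `MolToSmiles`) could be plugged in if the strings are not yet canonical.

```python
from collections import defaultdict


def group_duplicates(ids, smiles_list, canonicalize=lambda smiles: smiles):
    """Group entry IDs by (optionally canonicalized) SMILES string.

    Only groups with more than one member are returned.
    """
    groups = defaultdict(list)
    for entry_id, smiles in zip(ids, smiles_list):
        groups[canonicalize(smiles)].append(entry_id)
    return {key: members for key, members in groups.items() if len(members) > 1}


# Toy data (hypothetical, not taken from the dataset)
ids = [1, 2, 3, 4, 5]
smiles = ["CCO", "c1ccccc1", "CCO", "CC(=O)O", "CCO"]

duplicates = group_duplicates(ids, smiles)
print(duplicates)  # {'CCO': [1, 3, 5]}
```

With the notebook's data frame, the call would be `group_duplicates(dataset["IDs"], dataset["SMILES"])`, which returns each set of duplicates once, keyed by the shared SMILES.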


### Export of the dataset

Several export options are now available.<br>
You can: <br>
- export the whole dataset, with its scores in dedicated columns
- filter for only the entries that passed all filtering steps (i.e., have a score of 0 in the *Filtered_at* column)
- filter for the entries that were filtered at a specific step, to perform further transformations on them (e.g. remove salts)
- export all the mol objects you need into an SDF with the function `convert_mol_to_sdf`, by first generating an array of all mol entries (`mol_array`) and passing it to the function together with a filename `fn` (e.g. `convert_mol_to_sdf(mol_array, fn="my_sdf_export")`)
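
As a minimal sketch of the first three options, the selections can be written with plain pandas boolean indexing. The data frame `toy_dataset` and its values are hypothetical stand-ins for the curated dataset; only the column names `IDs`, `SMILES` and `Filtered_at` are taken from this notebook.

```python
import pandas as pd

# Toy stand-in for the curated data frame (values are hypothetical)
toy_dataset = pd.DataFrame(
    {
        "IDs": [1, 2, 3],
        "SMILES": ["CCO", "[Na+].[Cl-]", "c1ccccc1"],
        "Filtered_at": [0, 3, 0],
    }
)

# Entries that passed every filtering step
passed = toy_dataset[toy_dataset["Filtered_at"] == 0]

# Entries that were filtered at some step, e.g. for further processing
filtered = toy_dataset[toy_dataset["Filtered_at"] != 0]

# Export as CSV text; the mol column (if present) is dropped because
# the object representation is not useful in a text file
csv_text = passed.drop(columns=["mol"], errors="ignore").to_csv(index=False)
print(csv_text)
```

Passing a file name to `to_csv` (for example `to_csv("curated.csv", index=False)`) writes the file to disk instead of returning a string.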

In [32]:
dataset.head()

Unnamed: 0,IDs,Names,SMILES,Filtered_at,Cleaned_at,Normalized_at,mol,Carbon_present,Inorganics,mixture,metals,salts,SMILES_after_normalization,normalized,canonicalized_tautomer_smiles,new_canonical_tautomer
0,1,(R)-Roscovitine,CCC(CO)Nc1nc(NCc2ccccc2)c2ncn(C(C)C)c2n1,0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA64...,True,False,False,False,False,CCC(CO)Nc1nc(NCc2ccccc2)c2ncn(C(C)C)c2n1,False,CCC(CO)N=c1[nH]c(=NCc2ccccc2)c2ncn(C(C)C)c2[nH]1,True
1,2,17-Methyltestosterone,CC12CCC(=O)C=C1CCC1C2CCC2(C)C1CCC2(C)O,0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA64...,True,False,False,False,False,CC12CCC(=O)C=C1CCC1C2CCC2(C)C1CCC2(C)O,False,CC12CCC(=O)C=C1CCC1C2CCC2(C)C1CCC2(C)O,False
2,3,1-alpha-Hydroxycholecalciferol,C=C1C(=CC=C2CCCC3(C)C2CCC3C(C)CCCC(C)C)CC(O)CC1O,0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA64...,True,False,False,False,False,C=C1C(=CC=C2CCCC3(C)C2CCC3C(C)CCCC(C)C)CC(O)CC1O,False,C=C1C(=CC=C2CCCC3(C)C2CCC3C(C)CCCC(C)C)CC(O)CC1O,False
3,4,"2,3-Dimercaptosuccinic acid",O=C(O)C(S)C(S)C(=O)O,0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA64...,True,False,False,False,False,O=C(O)C(S)C(S)C(=O)O,False,O=C(O)C(S)C(S)C(=O)O,False
4,5,"2,4,6-Trinitrotoluene",Cc1c([N+](=O)[O-])cc([N+](=O)[O-])cc1[N+](=O)[O-],0,0,0,<rdkit.Chem.rdchem.Mol object at 0x0000016DA64...,True,False,False,False,False,Cc1c([N+](=O)[O-])cc([N+](=O)[O-])cc1[N+](=O)[O-],False,Cc1c([N+](=O)[O-])cc([N+](=O)[O-])cc1[N+](=O)[O-],False
