# Implementation and evaluation of a computational standardization pipeline for chemical compounds
--------------------------------------------------------------

> Based on ["Trust, But Verify: On the Importance of Chemical Structure Curation in Cheminformatics and QSAR Modeling Research" (D. Fourches, ..., 2010)](https://pubmed.ncbi.nlm.nih.gov/20572635/)

By Allen Dumler; reviewed by Jaime Rodríguez-Guerra, PhD.

### Introduction 

This notebook demonstrates the functionality of the `opencadd.compounds.standardization` subpackage.

We are following the recommended standardization steps of ["Trust, But Verify" (Fourches et al., 2010)](https://pubmed.ncbi.nlm.nih.gov/20572635/), and using a modified¹ version of the dataset from the following paper: [Cheminformatics Analysis of Assertions Mined from Literature That Describe Drug-Induced Liver Injury in Different Species](https://pubs.acs.org/doi/10.1021/tx900326k).

¹ We added some entries to trigger curation steps not covered by the original data.

### Overview of the pipeline
------------------------------------------

This pipeline has **five** main steps:
1. Structural Conversion
2. Filtering of Inorganics and Mixtures
3. Structural Cleaning 
4. Normalization of Specific Chemotypes
5. Removal of Duplicates

Each step consists of tasks that perform actions on the dataset. <br>
Actions are:
- filtering
- cleaning
- normalizing

**Filtering** actions assign a score to the affected entries. The score is the number of the filtering task that flagged the entry. You can use it to select subsets of the dataset via the column **Filtered_at**.

**Cleaning** actions modify the mol representation of the entry, overwriting it with the version calculated in the task. You can select the affected subsets of the dataset via the column **Cleaned_at**.

**Normalizing** actions also modify the mol representation of the entry. You can select the affected subsets of the dataset via the column **Normalized_at**.
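
The filtering pattern described above can be sketched with plain pandas. This is only an illustration on toy data: the helper `run_filter_task` and the dot-based predicate are hypothetical and are not part of `opencadd`.

```python
import pandas as pd


def run_filter_task(df, task_number, column, predicate):
    """Apply a filtering task to all entries that are still unfiltered.

    `predicate` is a hypothetical stand-in for a detect_* function and
    returns True when an entry should be filtered out.
    """
    passed = df["Filtered_at"] == 0
    # Entries filtered by an earlier task are not evaluated again and stay None
    df[column] = [
        predicate(smiles) if ok else None
        for smiles, ok in zip(df["SMILEs"], passed)
    ]
    # Record the task number for every entry flagged by the predicate
    flagged = df[column].eq(True)
    df.loc[flagged, "Filtered_at"] = task_number
    return df


toy = pd.DataFrame(
    {
        "IDs": [1, 2, 3],
        "SMILEs": ["CCO", "CCO.O1CCOCC1", "[Na+].[Cl-]"],
        "Filtered_at": [0, 0, 2],  # the third entry was filtered by task 2
    }
)

# Toy predicate: treat dot-separated SMILES as mixtures
toy = run_filter_task(toy, 4, "mixture", lambda smiles: "." in smiles)
print(toy)
```

The third entry keeps its earlier score of 2 because filtered entries are skipped by later tasks.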

At the end of the script, there is the possibility to export subsets of the dataset as a CSV. 
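
As a minimal sketch of such an export, the following toy example selects subsets by their **Filtered_at** score and writes one of them as CSV. The data is made up, and the in-memory buffer merely stands in for a file path.

```python
import io

import pandas as pd

toy = pd.DataFrame(
    {
        "IDs": [1, 2, 3, 4],
        "Names": ["a", "b", "c", "d"],  # hypothetical names
        "Filtered_at": [0, 3, 0, 4],
    }
)

# Entries that passed every filtering task
passed = toy[toy["Filtered_at"] == 0]

# Entries removed by any task, ordered by the task that removed them
removed = toy[toy["Filtered_at"] != 0].sort_values("Filtered_at")

# Write to an in-memory buffer here; pass a file path instead to write to disk
buffer = io.StringIO()
removed.to_csv(buffer, index=False)
print(buffer.getvalue())
```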

In [1]:
from pathlib import Path

HERE = Path(_dh[-1])
REPO = HERE.parents[1]

print("Tutorial location:", HERE)
print("Repo location:    ", REPO)

Tutorial location: /home/allen/dev/opencadd/docs/tutorials
Repo location:     /home/allen/dev/opencadd


In [2]:
# Import pandas and numpy
import pandas as pd
import numpy as np

# Importing functions from the standardization API
from opencadd.compounds.standardization import (
    convert_format,
    handle_fragments,
    disconnect_metals,
    detect_inorganics,
    remove_salts,
    normalize,
    handle_tautomers,
    validate_molecules,
    detect_mixtures,
    detect_metals,
    detect_salts,
    handle_charges,
)

### Initial dataset import and cleaning of empty entries
------------------------------------------------
Before any curation steps can be applied, we need to import the dataset as a pandas DataFrame. <br>
At this point, you can select the columns you need for the curation process. For our example dataset, we will use the columns <b>IDs</b>, <b>Names</b> and <b>SMILEs</b>.<br>
After that, we search for all entries with an empty <b>SMILEs</b> field and remove them from the dataset.<br>
After the import, we add a <b>Filtered_at</b> column to track which standardization step filtered the entry. 
The initial `task_number` is 0, which leads to a default <b>Filtered_at</b> value of 0 for all entries, where 0 marks the entries that passed without any filtering. 

In [3]:
task_number = 0

# Import test-dataset
dataset = pd.read_csv(HERE / "data" / "standardization_test_data.csv")

# Filter columns
dataset = dataset[["IDs", "Names", "SMILEs"]]

# Delete entries without a SMILES string from the main set
# and reset the index to close the gaps left by the deletion.
dataset = dataset[dataset["SMILEs"].notna()].reset_index(drop=True)

# Initialize the 'Filtered_at' column with the initial task number (0)
dataset["Filtered_at"] = task_number

# [Optional] To inspect the empty entries manually, run this before the deletion above.
# dataset[dataset["SMILEs"].isnull()]

### Step 1: Structural Conversion
------------------------------------------

__Convert the SMILES representation format of the compounds into Mol-files__

RDKit performs a sanitization of molecules converted to mol by default. <br>
In addition to some Nitro and Perchlorate transformations the following steps are taken²:


- Calculate explicit and implicit valence of all atoms. Fails when atoms have illegal valence.
- Calculate symmetrized SSSR. This is the slowest step and fails in rare cases.
- Kekulize. Fails if a Kekule form cannot be found or non-ring bonds are marked as aromatic.
- Assign radicals if hydrogens set and bonds+hydrogens+charge < valence.
- Set aromaticity, if none set in input. Go round rings, Huckel rule to set atoms+bonds as aromatic.
- Set a conjugated property on bonds where applicable.
- Set hybridization property on atoms.
- Remove chirality markers from sp and sp2 hybridized centers.

If the conversion from SMILES to mol fails, the entry gets the current task number assigned in the **Filtered_at** column. 

To avoid molecule sanitization, `convert_smiles_to_mol` can be called with the argument `sanitize=False`. Keep in mind that the generation of different Lewis structures during sanitization serves to find alternative representations of the same molecule. 
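
To illustrate the effect of sanitization, here is a minimal sketch that uses RDKit directly rather than the `opencadd` wrapper. It assumes that the wrapper behaves like `Chem.MolFromSmiles`, which this notebook does not confirm.

```python
from rdkit import Chem, RDLogger

RDLogger.DisableLog("rdApp.*")  # silence RDKit error messages for this demo

smiles = "c1cccc1"  # aromatic five-membered carbon ring that cannot be kekulized

# Default behaviour: parsing includes sanitization, which fails for this input
sanitized = Chem.MolFromSmiles(smiles)
print(sanitized)  # None

# Without sanitization a Mol object is returned, but it has not been checked
raw = Chem.MolFromSmiles(smiles, sanitize=False)
print(raw.GetNumAtoms())  # 5

# Sanitization can be run afterwards; with catchErrors=True the failing
# operation is returned instead of raising an exception
failed_step = Chem.SanitizeMol(raw, catchErrors=True)
print(failed_step)
```

Molecules created with `sanitize=False` should be treated with care, since later steps may rely on properties that only sanitization sets.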

References:

² https://molvs.readthedocs.io/en/latest/guide/standardize.html?highlight=sanitize#rdkit-sanitize
* https://chemistry.stackexchange.com/questions/116498/what-is-kekulization-in-rdkit
* https://rdkit-discuss.narkive.com/QwnqcKcM/another-can-t-kekulize-mol-observation
* https://www.rdkit.org/docs/Cookbook.html
* https://www.rdkit.org/docs/source/rdkit.Chem.rdmolfiles.html

#### Task 1: Convert to Mol

In [4]:
# Setting up the task_number
task_number = 1

# A column called 'mol' is added to the dataframe to store the converted molecules
dataset["mol"] = dataset["SMILEs"].apply(convert_format.convert_smiles_to_mol)

# Add task_number to failed entries
dataset.loc[dataset["mol"].isnull(), ["Filtered_at"]] = task_number

dataset.head(16)

RDKit ERROR: [01:10:41] Can't kekulize mol.  Unkekulized atoms: 1 2 3 4 5 7 9
RDKit ERROR: 
RDKit ERROR: [01:10:41] Can't kekulize mol.  Unkekulized atoms: 2 3 4 6 7 8 10 11 12
RDKit ERROR: 
RDKit ERROR: [01:10:41] Can't kekulize mol.  Unkekulized atoms: 6 8 10
RDKit ERROR: 
RDKit ERROR: [01:10:41] Can't kekulize mol.  Unkekulized atoms: 7 8 9 10 11 12 13 14 15
RDKit ERROR: 
RDKit ERROR: [01:10:41] Can't kekulize mol.  Unkekulized atoms: 57 58 60
RDKit ERROR: 
RDKit ERROR: [01:10:41] Can't kekulize mol.  Unkekulized atoms: 14 15 16 17 18 19 20 21 23
RDKit ERROR: 
RDKit ERROR: [01:10:41] Can't kekulize mol.  Unkekulized atoms: 11 12 13 15 16 17 19 20 21
RDKit ERROR: 


Unnamed: 0,IDs,Names,SMILEs,Filtered_at,mol
0,1,(R)-Roscovitine,CCC(CO)Nc1nc(NCc2ccccc2)c2ncn(C(C)C)c2n1.[Ca],0,<rdkit.Chem.rdchem.Mol object at 0x7f8df7280490>
1,2,17-Methyltestosterone,CC1(O)CCC2C3CCC4=CC(=O)CCC4(C)C3CCC12C,0,<rdkit.Chem.rdchem.Mol object at 0x7f8df727bf80>
2,3,1-alpha-Hydroxycholecalciferol,CC(C)CCCC(C)C1CCC2C(CCCC12C)=CC=C1CC(O)CC(O)C1=C,0,<rdkit.Chem.rdchem.Mol object at 0x7f8df727ba80>
3,4,"2,3-Dimercaptosuccinic acid",OC(=O)C(S)C(S)C(O)=O,0,<rdkit.Chem.rdchem.Mol object at 0x7f8df727bee0>
4,5,"2,4,6-Trinitrotoluene",Cc1c(cc(cc1N(=O)=O)N(=O)=O)N(=O)=O,0,<rdkit.Chem.rdchem.Mol object at 0x7f8df727bb70>
5,6,2-Deoxy-D-glucose,OCC1OC(O)CC(O)C1O.O1CCOCC1,0,<rdkit.Chem.rdchem.Mol object at 0x7f8df722bda0>
6,7,2'-fluoro-5-methylarabinosyluracil,CC1=CN(C2OC(CO)C(O)C2F)C(=O)NC1=O,0,<rdkit.Chem.rdchem.Mol object at 0x7f8df722bc60>
7,8,2-Methoxyestradiol,COc1cc2C3CCC4(C)C(O)CCC4C3CCc2cc1O,0,<rdkit.Chem.rdchem.Mol object at 0x7f8df722bcb0>
8,9,4-aminobenzoic acid,Nc1ccc(cc1)C(O)=O,0,<rdkit.Chem.rdchem.Mol object at 0x7f8df722bd50>
9,10,4-Hydroxytamoxifen,CCC(c1ccccc1)=C(c1ccc(O)cc1)c1ccc(OCCN(C)C)cc1,0,<rdkit.Chem.rdchem.Mol object at 0x7f8df722bd00>


### Step 2: Filtering of Inorganics and Mixtures
--------------------------------------------------

Since most cheminformatics applications cannot process inorganic structures, those entries need to be removed before any processing.<br>
Detecting inorganic structures is divided into two steps:<br>
First, we remove all entries that do not contain any carbon at all and are therefore not organic.<br>
Second, we filter out all compounds with inorganic substructures. <br>

Similar problems occur for mixtures. Since most applications cannot calculate descriptors for mixtures, filtering has to happen before processing. <br>
Additionally, since "*inorganic compounds are known to have biological effects, like toxic effects*" (Fourches 2010), we often cannot distinguish whether the organic or the inorganic part causes the recorded activity of a mixed compound. Such an entry is therefore of little use and can be discarded. 


Since the treatment of mixtures is not as simple as it appears, the paper recommends deleting records containing mixtures.
A common and widely used practice is to retain the molecule with the highest molecular weight or the largest number of atoms. Still, the paper states that this might not be the best solution, and mixtures should only be investigated further if there is reason to believe that the largest molecule, and not the mixture itself, causes the biological activity.

#### Task 2: Filter entries without Carbon

The first task in determining whether an entry is an organic molecule is to check for the presence of carbon. The function `detect_carbon` does this by searching for carbon atoms. If it finds at least one carbon atom, it returns **True**, otherwise **False**. All entries that return **False** get the current task number (2) assigned in the *Filtered_at* column.
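
For intuition, a carbon check can be sketched in a few lines of plain RDKit. The helper `has_carbon` is hypothetical and only illustrates the idea; it is not the implementation of `detect_carbon`.

```python
from rdkit import Chem


def has_carbon(mol):
    """Return True if the molecule contains at least one carbon atom."""
    return any(atom.GetAtomicNum() == 6 for atom in mol.GetAtoms())


print(has_carbon(Chem.MolFromSmiles("CCO")))   # True
print(has_carbon(Chem.MolFromSmiles("N.Cl")))  # False
```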


In [5]:
# Setting up the task_number
task_number = 2

# Check for Carbon
dataset["Carbon_present"] = dataset.apply(
    lambda row: detect_inorganics.detect_carbon(row.mol)
    if row.Filtered_at == 0
    else None,
    axis=1,
)

# Add task_number to failed entries
dataset.loc[dataset["Carbon_present"] == False, ["Filtered_at"]] = task_number

Below you can see all entries that do not contain any Carbon and thereby are inorganic molecules.

In [6]:
dataset[dataset["Filtered_at"] == 2]

Unnamed: 0,IDs,Names,SMILEs,Filtered_at,mol,Carbon_present
202,203,test_salt,[Al].N.[Ba].[Bi].Br.[Ca].Cl.F.I.[K].[Li].[Mg]....,2,<rdkit.Chem.rdchem.Mol object at 0x7f8df7248850>,False
203,204,test_duplicate,[Al].N.[Ba].[Bi].Br.[Ca].Cl.F.I.[K].[Li].[Mg]....,2,<rdkit.Chem.rdchem.Mol object at 0x7f8df72488a0>,False


#### Task 3: Filter entries with inorganic components

Having filtered out all molecules that do not contain any carbon, we now inspect the remaining entries for elements that do not occur in organic molecules. Which elements fall into this category may vary slightly depending on the definition and scope. `detect_inorganic` is a suitable function for this task.
We recommend checking which elements the software used later can handle. The allowed elements in `detect_inorganic` can be customized by passing a SMARTS pattern, as described below. 
The default set of accepted elements in an organic molecule is Hydrogen, Carbon, Nitrogen, Oxygen, Fluorine, Phosphorus, Sulfur, Chlorine, Selenium, Bromine, Iodine (nonmetals and halogens). <br>
*While Astatine and Tennessine are also considered halogens, they are not included due to their radioactivity and rarity.*
<br>


###### An example of how to set up a custom set of elements and implement them in `detect_inorganic` 
-----------------------------------------------------------------------------------------------------------
Defining a set:<br>
`elements = Chem.MolFromSmarts("[!#1&!#6&!#7&!#8&!#9&!#15&!#16&!#17&!#35&!#53]")` <br>
(To run this, first import RDKit's `Chem` module: `from rdkit import Chem`)

Pass the set as a parameter where `detect_inorganic` is called:<br>
`lambda row: detect_inorganics.detect_inorganic(row.mol, elements)`
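
To see what such a pattern matches, the following sketch applies the SMARTS from above with plain RDKit. It assumes that the check boils down to a substructure match, which is an assumption about `detect_inorganic` rather than something this notebook confirms. Note that this custom pattern does not list Selenium (#34), so unlike the default set it would also flag selenium-containing compounds.

```python
from rdkit import Chem

# Matches any atom that is NOT H, C, N, O, F, P, S, Cl, Br or I
elements = Chem.MolFromSmarts("[!#1&!#6&!#7&!#8&!#9&!#15&!#16&!#17&!#35&!#53]")

bortezomib = Chem.MolFromSmiles("CC(C)CC(NC(=O)C(Cc1ccccc1)NC(=O)c1cnccn1)B(O)O")
ethanol = Chem.MolFromSmiles("CCO")

# Bortezomib contains boron, which is not in the allowed set
print(bortezomib.HasSubstructMatch(elements))  # True
print(ethanol.HasSubstructMatch(elements))     # False
```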

In [7]:
# Setting up the task_number
task_number = 3


# Check for inorganic structures
dataset["Inorganics"] = dataset.apply(
    lambda row: detect_inorganics.detect_inorganic(row.mol)
    if row.Filtered_at == 0
    else None,
    axis=1,
)

# Add task_number to failed entries
dataset.loc[dataset["Inorganics"] == True, ["Filtered_at"]] = task_number

Below you can see all entries that contain elements other than the allowed ones (Hydrogen, Carbon, Nitrogen, Oxygen, Fluorine, Phosphorus, Sulfur, Chlorine, Selenium, Bromine, Iodine).

In [8]:
dataset[dataset["Filtered_at"] == 3]

Unnamed: 0,IDs,Names,SMILEs,Filtered_at,mol,Carbon_present,Inorganics
0,1,(R)-Roscovitine,CCC(CO)Nc1nc(NCc2ccccc2)c2ncn(C(C)C)c2n1.[Ca],3,<rdkit.Chem.rdchem.Mol object at 0x7f8df7280490>,True,True
114,115,Bortezomib,CC(C)CC(NC(=O)C(Cc1ccccc1)NC(=O)c1cnccn1)B(O)O,3,<rdkit.Chem.rdchem.Mol object at 0x7f8df7245d00>,True,True
200,201,zirconium,CCO[Zr](OCC)(OCC)OCC,3,<rdkit.Chem.rdchem.Mol object at 0x7f8df72487b0>,True,True
201,202,hemoglobin,CC1=C(C2=CC3=NC(=CC4=C(C(=C([N-]4)C=C5C(=C(C(=...,3,<rdkit.Chem.rdchem.Mol object at 0x7f8df7248800>,True,True
204,206,covalent_metal,CCC(=O)O[Na],3,<rdkit.Chem.rdchem.Mol object at 0x7f8df72488f0>,True,True


#### Task 4: Filter entries containing mixtures

We will use the function `detect_mixtures` as shown below for the filtering of mixtures.
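
For intuition, one way to recognize a mixture is to count the disconnected fragments of a molecule. The sketch below does this with plain RDKit; the helper `count_fragments` is hypothetical, and whether `detect_mixtures` works this way is an assumption.

```python
from rdkit import Chem


def count_fragments(mol):
    """Return the number of disconnected fragments in the molecule."""
    # GetMolFrags returns one tuple of atom indices per fragment
    return len(Chem.GetMolFrags(mol))


mixture = Chem.MolFromSmiles("OCC1OC(O)CC(O)C1O.O1CCOCC1")
single = Chem.MolFromSmiles("CC(O)=O")

print(count_fragments(mixture))  # 2
print(count_fragments(single))   # 1
```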

In [9]:
# Setting up the task_number
task_number = 4

# Check for mixtures
dataset["mixture"] = dataset.apply(
    lambda row: detect_mixtures(row.mol) if row.Filtered_at == 0 else None,
    axis=1,
)

# Add task_number to failed entries
dataset.loc[dataset["mixture"] == True, ["Filtered_at"]] = task_number

Below you can see all entries that are mixtures.

In [10]:
dataset[dataset["Filtered_at"] == 4]

Unnamed: 0,IDs,Names,SMILEs,Filtered_at,mol,Carbon_present,Inorganics,mixture
5,6,2-Deoxy-D-glucose,OCC1OC(O)CC(O)C1O.O1CCOCC1,4,<rdkit.Chem.rdchem.Mol object at 0x7f8df722bda0>,True,False,True


### Step 3: Structural Cleaning 
--------------------------------------------------

Some drugs need to be transformed "into their salt form to enhance how the drug dissolves (...) and (to) increase its effectiveness."³ Therefore, it is common for chemical compound databases to contain records of salts. If possible, the suggestion is to delete the records containing salts completely, since, similar to inorganic compounds, "most descriptor-generating software (can not process salts)" (Fourches 2010). While not desirable, it is still an acceptable procedure to convert compounds into their neutral forms. But cases like this should be tagged, filtered, and afterward manually curated or compared to the concrete neutral form of that compound. 
If we want to continue working on the converted records, we should perform the following steps:
- check whether the records contain metals
- remove the salts from the record
- neutralize the record (normalization or essential standardization)
- neutralize the charges


³ (https://www.drugs.com/article/pharmaceutical-salts.html (03/12/21))

In [11]:
# Structural conversion
# Cleaning/removal of salts
# Functions remove_salts
# normalize_molecules
# handle_charges
# handle_hydrogens

#### Task 5: Filter entries containing metals

Entries can contain metals in different forms: either as a separate component of a mixture or as a counterion.<br> In the following steps, we search for those metals. When they are a counterion, we disconnect them from the non-metals they are bonded to. 
We might not find any metals because the previous filtering steps for mixtures and inorganics have already flagged those entries. Therefore, we can also search the flagged entries and clean them later on. 

In [12]:
# Setting up the task_number
task_number = 5

# Check for metals
dataset["metals"] = dataset.apply(
    lambda row: detect_metals(row.mol) if row.Filtered_at == 0 else None,
    axis=1,
)

# Add task_number to failed entries
dataset.loc[dataset["metals"] == True, ["Filtered_at"]] = task_number

Below you can see all entries containing metals. We didn't find any entries in our filtered set, as already assumed.

In [13]:
dataset[dataset["Filtered_at"] == 5]

Unnamed: 0,IDs,Names,SMILEs,Filtered_at,mol,Carbon_present,Inorganics,mixture,metals


So the next step would be to examine our "failed" entries.<br>
For that, we make a copy of our current status of the dataset.

In [14]:
score = [3, 4]
failed_entries_copy = dataset[dataset["Filtered_at"].isin(score)].copy()

And we check for the presence of metals here

In [15]:
# Check for metals
failed_entries_copy["metals"] = failed_entries_copy.apply(
    lambda row: detect_metals(row.mol) if row.Filtered_at != 0 else None,
    axis=1,
)

In [16]:
failed_entries_copy[failed_entries_copy["metals"] == True]

Unnamed: 0,IDs,Names,SMILEs,Filtered_at,mol,Carbon_present,Inorganics,mixture,metals
200,201,zirconium,CCO[Zr](OCC)(OCC)OCC,3,<rdkit.Chem.rdchem.Mol object at 0x7f8df72487b0>,True,True,,True
204,206,covalent_metal,CCC(=O)O[Na],3,<rdkit.Chem.rdchem.Mol object at 0x7f8df72488f0>,True,True,,True


One entry has a counterion that can be disconnected. We might consider removing the metals in those entries and re-running this standardization script with the cleaned entries. <br> In this case, however, this does not make much sense, since the molecule resulting from the removal of Zirconium would not be functional and the covalent metal would just 

But what we could have done, if it made sense:
1. disconnect_metals
2. handle_charges.uncharge
3. remove_salts
4. handle_fragments.choose_largest_fragment
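
As a rough sketch of such a sequence, the following example uses RDKit's `rdMolStandardize` module directly on the `covalent_metal` test entry. It assumes that the `opencadd` functions listed above behave similarly, which is not confirmed here. The order also differs from the list: the largest fragment is selected before neutralizing, because the result of uncharging may depend on the counterion still being present.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

mol = Chem.MolFromSmiles("CCC(=O)O[Na]")

# Break the covalent bond between the metal and the oxygen
disconnector = rdMolStandardize.MetalDisconnector()
disconnected = disconnector.Disconnect(mol)
print(Chem.MolToSmiles(disconnected))

# Keep only the largest fragment, which drops the sodium ion
chooser = rdMolStandardize.LargestFragmentChooser()
largest = chooser.choose(disconnected)

# Neutralize the remaining carboxylate
uncharger = rdMolStandardize.Uncharger()
neutral = uncharger.uncharge(largest)
print(Chem.MolToSmiles(neutral))
```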

#### Task 5.5: Reintegrate entry after metal disconnection

In [17]:
task_number = 5.5

# Disconnect metals
dataset["mol"] = dataset.apply(
    lambda row: disconnect_metals(row.mol) if row.metals == True else row.mol,
    axis=1,
)

# Neutralize the charges of the disconnected entry
dataset["mol"] = dataset.apply(
    lambda row: handle_charges.uncharge(row.mol) if row.metals == True else row.mol,
    axis=1,
)


# Add task_number to cleaned entries
dataset.loc[dataset["metals"] == True, ["Cleaned_at"]] = task_number

In [18]:
dataset[dataset["Cleaned_at"] == 5.5]

Unnamed: 0,IDs,Names,SMILEs,Filtered_at,mol,Carbon_present,Inorganics,mixture,metals,Cleaned_at


#### Task 6: Removing salts 

This curation step can be applied to different subsets of the dataset.<br>
First, we will apply this to our entries that passed all steps before. 
Since we filtered all mixtures out in previous steps, all salts found in this step are the only compound in the entry. Therefore they need to be deleted (filtered).

More interesting might be the inspection of the *inorganics* **(Task 3)** or *mixtures* **(Task 4)**. We could check if any of those mixtures contain salts known in our dictionary. If so, we can delete those salts and reuse the entries if they are free of mixtures.
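
To illustrate dictionary-based salt stripping, here is a minimal sketch using RDKit's `SaltRemover` with a deliberately tiny, hypothetical salt definition. Whether `remove_salts` relies on the same mechanism or dictionary is an assumption.

```python
from rdkit import Chem
from rdkit.Chem.SaltRemover import SaltRemover

# Hypothetical salt dictionary for this demo: chlorine and bromine only
remover = SaltRemover(defnData="[Cl,Br]")

# A mixture of trimethylamine and hydrochloric acid
with_salt = Chem.MolFromSmiles("CN(C)C.Cl")
stripped = remover.StripMol(with_salt)
print(Chem.MolToSmiles(stripped))

# An entry that consists only of a known salt
only_salt = Chem.MolFromSmiles("Cl")
emptied = remover.StripMol(only_salt)
print(emptied.GetNumAtoms())
```

When an entry consists only of fragments from the dictionary, stripping can leave an empty molecule, which corresponds to the empty SMILES strings shown further down in this task.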

First, we will search for salts in our dataset. 

In [19]:
# Setting up the task_number
task_number = 6

# Check for salts
dataset["salts"] = dataset.apply(
    lambda row: detect_salts(row.mol) if row.Filtered_at == 0 else None,
    axis=1,
)

# Add task_number to failed entries
dataset.loc[dataset["salts"] == True, ["Filtered_at"]] = task_number

Below you can see all entries containing salts.

In [20]:
dataset[dataset["Filtered_at"] == 6]

Unnamed: 0,IDs,Names,SMILEs,Filtered_at,mol,Carbon_present,Inorganics,mixture,metals,Cleaned_at,salts
22,23,Acetic acid,CC(O)=O,6,<rdkit.Chem.rdchem.Mol object at 0x7f8df722bf80>,True,False,False,False,,True
199,200,Citric acid,OC(=O)CC(O)(CC(O)=O)C(O)=O,6,<rdkit.Chem.rdchem.Mol object at 0x7f8df7248760>,True,False,False,False,,True


To demonstrate the removal of salts, we can generate SMILES out of the mol after the deletion of the salts.
We will observe the generation of an empty SMILES string.

In [21]:
# First we make a deep copy of the original dataframe
demo_df = dataset.copy()
demo_df.head()

Unnamed: 0,IDs,Names,SMILEs,Filtered_at,mol,Carbon_present,Inorganics,mixture,metals,Cleaned_at,salts
0,1,(R)-Roscovitine,CCC(CO)Nc1nc(NCc2ccccc2)c2ncn(C(C)C)c2n1.[Ca],3,<rdkit.Chem.rdchem.Mol object at 0x7f8df7280490>,True,True,,,,
1,2,17-Methyltestosterone,CC1(O)CCC2C3CCC4=CC(=O)CCC4(C)C3CCC12C,0,<rdkit.Chem.rdchem.Mol object at 0x7f8df727bf80>,True,False,False,False,,False
2,3,1-alpha-Hydroxycholecalciferol,CC(C)CCCC(C)C1CCC2C(CCCC12C)=CC=C1CC(O)CC(O)C1=C,0,<rdkit.Chem.rdchem.Mol object at 0x7f8df727ba80>,True,False,False,False,,False
3,4,"2,3-Dimercaptosuccinic acid",OC(=O)C(S)C(S)C(O)=O,0,<rdkit.Chem.rdchem.Mol object at 0x7f8df727bee0>,True,False,False,False,,False
4,5,"2,4,6-Trinitrotoluene",Cc1c(cc(cc1N(=O)=O)N(=O)=O)N(=O)=O,0,<rdkit.Chem.rdchem.Mol object at 0x7f8df727bb70>,True,False,False,False,,False


We then apply our changes to the copy to look at the SMILES generated after removal.

In [22]:
# Applying the remove_salts function to the molecules detected as salts in Task 6.
demo_df["mol"] = demo_df.apply(
    lambda row: remove_salts(row.mol) if row.Filtered_at == 6 else row.mol,
    axis=1,
)

# Generate the SMILES of the entries. (Only for demonstration purposes)
demo_df["SMILEs"] = demo_df.apply(
    lambda row: convert_format.convert_mol_to_smiles(row.mol)
    if row.Filtered_at == 6
    else None,
    axis=1,
)

Then we can look at our entries containing salts

In [23]:
demo_df[demo_df["Filtered_at"] == 6]

Unnamed: 0,IDs,Names,SMILEs,Filtered_at,mol,Carbon_present,Inorganics,mixture,metals,Cleaned_at,salts
22,23,Acetic acid,,6,<rdkit.Chem.rdchem.Mol object at 0x7f8df71d4030>,True,False,False,False,,True
199,200,Citric acid,,6,<rdkit.Chem.rdchem.Mol object at 0x7f8df718b8f0>,True,False,False,False,,True


Next, we can filter for salts in the entries screened for inorganics and mixtures (as we did for the metals) to see if any inorganics or mixtures might have been salts.

In [24]:
score = [3, 4]
failed_entries_copy = dataset[dataset["Filtered_at"].isin(score)].copy()

In [25]:
# Check for salts
failed_entries_copy["salts"] = failed_entries_copy.apply(
    lambda row: detect_salts(row.mol) if row.Filtered_at != 0 else None,
    axis=1,
)

In [26]:
failed_entries_copy

Unnamed: 0,IDs,Names,SMILEs,Filtered_at,mol,Carbon_present,Inorganics,mixture,metals,Cleaned_at,salts
0,1,(R)-Roscovitine,CCC(CO)Nc1nc(NCc2ccccc2)c2ncn(C(C)C)c2n1.[Ca],3,<rdkit.Chem.rdchem.Mol object at 0x7f8df7280490>,True,True,,,,True
5,6,2-Deoxy-D-glucose,OCC1OC(O)CC(O)C1O.O1CCOCC1,4,<rdkit.Chem.rdchem.Mol object at 0x7f8df722bda0>,True,False,True,,,False
114,115,Bortezomib,CC(C)CC(NC(=O)C(Cc1ccccc1)NC(=O)c1cnccn1)B(O)O,3,<rdkit.Chem.rdchem.Mol object at 0x7f8df7245d00>,True,True,,,,False
200,201,zirconium,CCO[Zr](OCC)(OCC)OCC,3,<rdkit.Chem.rdchem.Mol object at 0x7f8df72487b0>,True,True,,,,False
201,202,hemoglobin,CC1=C(C2=CC3=NC(=CC4=C(C(=C([N-]4)C=C5C(=C(C(=...,3,<rdkit.Chem.rdchem.Mol object at 0x7f8df7248800>,True,True,,,,False
204,206,covalent_metal,CCC(=O)O[Na],3,<rdkit.Chem.rdchem.Mol object at 0x7f8df72488f0>,True,True,,,,False


One of the filtered entries in this example contained a salt. We might consider removing the salt from this mixture and re-running this standardization script with the cleaned entry. 

#### Task 7: Normalize molecules

`handle_charges.uncharge` attempts to neutralize charges by adding and/or removing hydrogens where possible.
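
The description above matches the behaviour of RDKit's `Uncharger`. As a minimal sketch, assuming that `handle_charges.uncharge` works in a comparable way, neutralizing an acetate ion looks like this:

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

uncharger = rdMolStandardize.Uncharger()

acetate = Chem.MolFromSmiles("CC(=O)[O-]")
neutral = uncharger.uncharge(acetate)

# The negative oxygen gains a hydrogen and becomes a neutral hydroxyl group
print(Chem.MolToSmiles(neutral))  # CC(=O)O
```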

In [27]:
# TODO: Finish all steps here

# Setting up the task_number
task_number = 7

# Normalize only the entries that passed all previous filtering tasks
dataset["normalized"] = dataset.apply(
    lambda row: normalize(row.mol) if row.Filtered_at == 0 else None,
    axis=1,
)

# Keep a reference to the result of this task for the following steps
result7 = dataset
result7.head()

#### Task 8: Charges and Hydrogens TODO

In [None]:
# TODO: Add the functionality here

### Step 4: Normalization of Specific Chemotypes

More complex than just Normalization.

In [None]:
# TODO: Finish all steps here

# Normalization of specific chemotypes
# normalize_molecules

In [None]:
# TODO: Finish all steps here

# Treatment of tautomeric forms
# handle_tautomers

#### Task 9: Generate a canonicalized tautomer on SMILEs entries

In [None]:
# TODO: Finish all steps here

# Setting up the task_number
task_number = 9

dataset = result7

# Generate a canonicalized tautomer for each SMILEs entry
dataset["canonicalized tautomer"] = dataset["SMILEs"].apply(
    handle_tautomers.canonicalize_tautomer
)
dataset.head()

In [None]:
dataset.tail()

### Step 5: Removal of Duplicates

In [None]:
# TODO: Fine tune the output

# Analysis/removal of duplicates

# Setting up the task_number
task_number = 10

dataset = result7

# Find all duplicate occurrences in SMILEs
dataset["duplicate?"] = dataset.duplicated(subset=["SMILEs"])

# Filter the duplicates out; copy to avoid modifying a view of the dataframe
failed_step_10 = dataset[dataset["duplicate?"]].copy()
failed_step_10["Filtered_at"] = task_number

dataset = dataset[~dataset["duplicate?"]]
failed_step_10.tail()
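
The cell above compares the raw strings in the **SMILEs** column, so the same molecule written in two different ways is not recognized as a duplicate. A possible refinement, sketched here on toy data with plain RDKit, is to compare canonical SMILES instead. The column name `canonical` is hypothetical.

```python
import pandas as pd
from rdkit import Chem

toy = pd.DataFrame(
    {
        "IDs": [1, 2, 3],
        "SMILEs": ["OCC", "CCO", "CC(O)=O"],  # the first two are both ethanol
    }
)

# Canonical SMILES give the same string for the same molecule
toy["canonical"] = toy["SMILEs"].apply(
    lambda smiles: Chem.MolToSmiles(Chem.MolFromSmiles(smiles))
)

print(toy.duplicated(subset=["SMILEs"]).tolist())     # [False, False, False]
print(toy.duplicated(subset=["canonical"]).tolist())  # [False, True, False]
```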

In [None]:
# Manual inspection

# TODO: Create csv-exports for better readability of the subsets or jupyter notebook searchable tables

In [None]:
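# Concatenation of the filtered subsets for a final inspection
failed_step_1 = result7[result7["Filtered_at"] == 1]
failed_step_2 = result7[result7["Filtered_at"] == 2]
test = pd.concat([failed_step_1, failed_step_2])
test = test.sort_values(by=["IDs"])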

In [None]:
test

In [None]:
test = convert_format.convert_smiles_to_mol("CCO[Zr](OCC)(OCC)OCC")
test = disconnect_metals(test)
test = handle_charges.uncharge(test)
test = remove_salts(test)
# test = handle_fragments.choose_largest_fragment(test)
test