<a href="https://colab.research.google.com/github/sladem-tox/Rdkit-stuff/blob/main/InChI_Clean.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Canonical SMILES and Data Cleaning

When you download a new dataset from a publication, it will frequently contain a series of SMILES strings representing molecules. But SMILES strings are not necessarily canonical, so two seemingly different SMILES strings can actually describe the same molecule.

It is therefore necessary to clean the data by converting each SMILES to canonical SMILES or to an InChI representation.
More on chemical representations can be found here:
https://chem.libretexts.org/Courses/University_of_Arkansas_Little_Rock/ChemInformatics_(2017)%3A_Chem_4399_5399/2.3%3A_Chemical_Representations_on_Computer%3A_Part_III
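
To make the problem concrete, here is a minimal sketch (not part of the original notebook) using ethanol as a toy example. The three input strings are illustrative choices; each is a valid SMILES for the same molecule, and RDKit collapses them to a single canonical string.

```python
from rdkit import Chem

# Three different, equally valid ways of writing ethanol as SMILES
# (illustrative inputs, not taken from the dataset).
ethanol_variants = ["CCO", "OCC", "C(O)C"]

# Parse each string and write it back out in RDKit's canonical form.
# A set keeps only the distinct canonical strings.
canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(smi)) for smi in ethanol_variants}

print(canonical)
```

If the set contains a single entry, all the inputs describe the same structure.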


In [1]:
# First we download our data from GitHub.
# Although Colab can mount your Google Drive, it is a pain; it is actually quicker to read files straight from your GitHub account.
import pandas as pd
df = pd.read_csv("https://github.com/sladem-tox/Resbaz/raw/main/molecule_Phenethylamines_valid.csv")

In [6]:
df.head(2)

Unnamed: 0,Number,Name,SMILES,LogA
0,1,"1-(4-bromo-2,5dimethoxyphenyl)propan-2-amine",CC(Cc(c(OC)c1)cc(OC)c1Br)N,2.176
1,2,"2,5-dimethoxy-4-chloroamphetamine",CC(Cc(c(OC)c1)cc(OC)c1Cl)N,2.123


In [None]:
!pip install rdkit datamol molfeat

In [5]:
from rdkit import Chem, DataStructs
from rdkit.Chem import PandasTools, AllChem
import pandas as pd
import datamol as dm
from molfeat.trans import MoleculeTransformer

In [7]:
PandasTools.AddMoleculeColumnToFrame(df,'SMILES','Molecule')
df[["SMILES","Molecule"]].head(1)

Unnamed: 0,SMILES,Molecule
0,CC(Cc(c(OC)c1)cc(OC)c1Br)N,<rdkit.Chem.rdchem.Mol object at 0x7d2423af1620>


In [10]:
# Check that all SMILES were successfully converted to molecule objects.
df.Molecule.isna().sum()

0

Sometimes a few molecules can't be converted to mol objects because RDKit cannot interpret their SMILES; these show up as missing values in the Molecule column. Remove those rows before continuing.
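
A minimal sketch of that removal step is shown here on a toy frame rather than on `df`. The frame name `toy`, the deliberately broken SMILES `C1CC` (an unclosed ring), and the use of a plain `apply` instead of `PandasTools` are all assumptions made for illustration.

```python
import pandas as pd
from rdkit import Chem, RDLogger

# Silence RDKit's parse-error messages for this demonstration.
RDLogger.DisableLog("rdApp.*")

# Toy data: the last SMILES opens ring bond 1 but never closes it,
# so RDKit cannot parse it and returns None.
toy = pd.DataFrame({"SMILES": ["CCO", "c1ccccc1", "C1CC"]})
toy["Molecule"] = toy["SMILES"].apply(Chem.MolFromSmiles)

# Count the failures, then keep only rows that produced a molecule.
n_failed = toy["Molecule"].isna().sum()
clean = toy[toy["Molecule"].notna()].reset_index(drop=True)

print(f"Dropped {n_failed} unparseable SMILES; {len(clean)} rows remain.")
```

The same filter, `df[df.Molecule.notna()]`, should apply to the real frame, since failed conversions are stored there as missing values too.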

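Also, some SMILES look different but describe the same structure (e.g. the same molecule written with lowercase aromatic atoms versus explicit Kekulé double bonds, as in the pair below). Tautomers are a different case: they are distinct structures and generally get different canonical SMILES. Here is some sample code to check whether two SMILES encode the same structure.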

In [11]:
# Canonical SMILES tester

from rdkit import Chem

# Define your two SMILES strings
smiles1 = "Nc1ccc(Oc2cccc(Oc3ccc(N)cc3)c2)cc1"
smiles2 = "NC1=CC=C(OC2=CC(OC3=CC=C(N)C=C3)=CC=C2)C=C1"

# Create RDKit molecules from SMILES
mol1 = Chem.MolFromSmiles(smiles1)
mol2 = Chem.MolFromSmiles(smiles2)

# Check if the molecules were successfully created
if mol1 is None or mol2 is None:
    print("Invalid SMILES provided.")
else:
    # Generate canonical SMILES
    canonical_smiles1 = Chem.MolToSmiles(mol1, isomericSmiles=True)
    canonical_smiles2 = Chem.MolToSmiles(mol2, isomericSmiles=True)

    # Compare the canonical SMILES
    if canonical_smiles1 == canonical_smiles2:
        print("The structures are the same.")
    else:
        print("The structures are different.")


The structures are the same.


One way to check the uniqueness of your molecules is to generate an InChIKey for each one. Here is how to do that from the RDKit molecule objects in the 'Molecule' column.

In [8]:
df['inchi_key'] = df.Molecule.apply(Chem.MolToInchiKey)

In [12]:
df.head(1)

Unnamed: 0,Number,Name,SMILES,LogA,Molecule,inchi_key
0,1,"1-(4-bromo-2,5dimethoxyphenyl)propan-2-amine",CC(Cc(c(OC)c1)cc(OC)c1Br)N,2.176,<rdkit.Chem.rdchem.Mol object at 0x7d2423af1620>,FXMWUTGUCAKGQL-UHFFFAOYSA-N
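
Once every row has an InChIKey, duplicated molecules can be flagged by looking for repeated keys. The sketch below does this on a small toy frame; the name `toy` and the three input SMILES are assumptions for illustration, and the same `duplicated` call should work on `df` with its `inchi_key` column.

```python
import pandas as pd
from rdkit import Chem

# Toy data: the first two strings are both ethanol, the third is ethylamine.
toy = pd.DataFrame({"SMILES": ["CCO", "OCC", "CCN"]})

# Build the InChIKey directly from each SMILES string.
toy["inchi_key"] = toy["SMILES"].apply(
    lambda smi: Chem.MolToInchiKey(Chem.MolFromSmiles(smi))
)

# keep=False marks every member of a duplicated group, not just the repeats,
# which makes it easy to inspect the rows side by side.
dupes = toy[toy.duplicated(subset="inchi_key", keep=False)]

print(dupes)
```

Inspecting the flagged rows before deleting anything is a cautious choice: duplicated structures may carry different measured values that need to be reconciled.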


The SMILES can also be converted to canonical SMILES. It is a good idea to do this early in the project.

In [13]:
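# isomericSmiles=True is already the default for MolToSmiles. A bare positional True
# here would be consumed by Series.apply itself rather than passed to MolToSmiles.
df['CanonSmi'] = df.Molecule.apply(Chem.MolToSmiles)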

In [19]:
df.head(1)

Unnamed: 0,Number,Name,SMILES,LogA,Molecule,inchi_key,CanonSmi
0,1,"1-(4-bromo-2,5dimethoxyphenyl)propan-2-amine",CC(Cc(c(OC)c1)cc(OC)c1Br)N,2.176,<rdkit.Chem.rdchem.Mol object at 0x7d2423af1620>,FXMWUTGUCAKGQL-UHFFFAOYSA-N,COc1cc(CC(C)N)c(OC)cc1Br


In [20]:
# Create a DataFrame with the two test SMILES from the canonical SMILES tester above.
new_smiles = pd.DataFrame({'SMILES': ["Nc1ccc(Oc2cccc(Oc3ccc(N)cc3)c2)cc1", "NC1=CC=C(OC2=CC(OC3=CC=C(N)C=C3)=CC=C2)C=C1"]})

# Concatenate this new DataFrame with the existing df
df = pd.concat([df, new_smiles], ignore_index=True)


In [30]:
# Regenerate the Molecule column
PandasTools.AddMoleculeColumnToFrame(df,'SMILES','Molecule')

In [31]:
# Use the Molecule column to generate Canonical SMILES
df['CanonSmi'] = df.Molecule.apply(Chem.MolToSmiles)  # isomericSmiles=True is the default

In [32]:
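# Note that the bottom two entries in the CanonSmi column are identical, while the corresponding entries in the SMILES column are not.
# Their inchi_key is empty because that column was not regenerated after the new rows were added.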

df.tail(6)

Unnamed: 0,Number,Name,SMILES,LogA,Molecule,inchi_key,CanonSmi
110,113.0,"3,5-dimethoxy-4-methallyloxy phenethylamine",CC(COc(c(OC)cc(CCN)c1)c1OC)=C,1.04,<rdkit.Chem.rdchem.Mol object at 0x7d24235cfa00>,FOXJFBFFGULACD-UHFFFAOYSA-N,C=C(C)COc1c(OC)cc(CCN)cc1OC
111,114.0,"2-(benzo[d][1,3]dioxol-5-yl)-2-methoxyethanamine",COC(CN)c(cc1)cc2c1OCO2,0.477,<rdkit.Chem.rdchem.Mol object at 0x7d24235cfa70>,KUTKTMOZFCYDLZ-UHFFFAOYSA-N,COC(CN)c1ccc2c(c1)OCO2
112,115.0,"2,5,beta-trimethoxy-4-bromophenethylamine",COC(CN)c(c(OC)c1)cc(OC)c1Br,1.301,<rdkit.Chem.rdchem.Mol object at 0x7d24235cfae0>,FYTLQNZPDWLGNU-UHFFFAOYSA-N,COc1cc(C(CN)OC)c(OC)cc1Br
113,116.0,"2-(7-methoxybenzo[d][1,3]dioxol-5-yl)ethanamine",COc1c2OCOc2cc(CCN)c1,0.0,<rdkit.Chem.rdchem.Mol object at 0x7d24235cfb50>,ORXQUAPZHKCCAX-UHFFFAOYSA-N,COc1cc(CCN)cc2c1OCO2
114,,,Nc1ccc(Oc2cccc(Oc3ccc(N)cc3)c2)cc1,,<rdkit.Chem.rdchem.Mol object at 0x7d24235cfbc0>,,Nc1ccc(Oc2cccc(Oc3ccc(N)cc3)c2)cc1
115,,,NC1=CC=C(OC2=CC(OC3=CC=C(N)C=C3)=CC=C2)C=C1,,<rdkit.Chem.rdchem.Mol object at 0x7d24235cfc30>,,Nc1ccc(Oc2cccc(Oc3ccc(N)cc3)c2)cc1
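
As a possible follow-up, the derived columns can be rebuilt in one step and the duplicates dropped. This is a sketch under stated assumptions: `add_identifiers` is a hypothetical helper (not an RDKit or pandas function), and the toy frame reuses the two test SMILES plus the first SMILES from the dataset.

```python
import pandas as pd
from rdkit import Chem


def add_identifiers(frame, smiles_col="SMILES"):
    """Hypothetical helper: return a copy with CanonSmi and inchi_key rebuilt."""
    out = frame.copy()
    mols = out[smiles_col].apply(Chem.MolFromSmiles)
    # Leave None where parsing failed, so bad rows stay easy to spot.
    out["CanonSmi"] = mols.apply(
        lambda m: Chem.MolToSmiles(m) if m is not None else None
    )
    out["inchi_key"] = mols.apply(
        lambda m: Chem.MolToInchiKey(m) if m is not None else None
    )
    return out


toy = pd.DataFrame({"SMILES": [
    "Nc1ccc(Oc2cccc(Oc3ccc(N)cc3)c2)cc1",           # aromatic form
    "NC1=CC=C(OC2=CC(OC3=CC=C(N)C=C3)=CC=C2)C=C1",  # Kekulé form, same molecule
    "CC(Cc(c(OC)c1)cc(OC)c1Br)N",                   # first molecule in the dataset
]})

toy = add_identifiers(toy)

# Keep the first row of each canonical SMILES and discard later repeats.
deduplicated = toy.drop_duplicates(subset="CanonSmi").reset_index(drop=True)

print(deduplicated[["SMILES", "CanonSmi", "inchi_key"]])
```

Rebuilding both identifier columns together avoids the empty `inchi_key` entries seen in the last two rows above.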
