
### Canonical SMILES and Data Cleaning

A dataset downloaded from a publication frequently contains a series of SMILES strings representing molecules. SMILES strings are not necessarily canonical, so two seemingly different SMILES can represent the same molecule.

It is therefore necessary to clean the data by converting each SMILES to canonical SMILES or to an InChI representation.
More on this can be found here:
https://chem.libretexts.org/Courses/University_of_Arkansas_Little_Rock/ChemInformatics_(2017)%3A_Chem_4399_5399/2.3%3A_Chemical_Representations_on_Computer%3A_Part_III


In [1]:
# First, download the data from GitHub.
# Colab can mount your Google Drive, but reading the file straight from a GitHub URL is quicker.
import pandas as pd
df = pd.read_csv("https://github.com/sladem-tox/Resbaz/raw/main/molecule_Phenethylamines_valid.csv")

In [2]:
df.head(2)

Unnamed: 0,Number,Name,SMILES,LogA
0,1,"1-(4-bromo-2,5dimethoxyphenyl)propan-2-amine",CC(Cc(c(OC)c1)cc(OC)c1Br)N,2.176
1,2,"2,5-dimethoxy-4-chloroamphetamine",CC(Cc(c(OC)c1)cc(OC)c1Cl)N,2.123


In [3]:
!pip install rdkit datamol molfeat

Collecting rdkit
  Downloading rdkit-2023.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.7 MB)
Collecting datamol
  Downloading datamol-0.11.4-py3-none-any.whl (381 kB)
Collecting molfeat
  Downloading molfeat-0.9.4-py3-none-any.whl (163 kB)
Collecting loguru (from datamol)
  Downloading loguru-0.7.2-py3-none-any.whl (62 kB)
Collecting selfies (from datamol)
  Downloading selfies-2.1.1-py3-none-any.whl (35 kB)
Collecting s3fs>=2021.9 (from molfeat)
  Downloading s3fs-2023.9.2-py3-

In [4]:
from rdkit import Chem, DataStructs
from rdkit.Chem import PandasTools, AllChem
import pandas as pd
import datamol as dm
from molfeat.trans import MoleculeTransformer

In [5]:
# This code takes the SMILES column and generates a molecule object for each SMILES string in a new column "Molecule"
PandasTools.AddMoleculeColumnToFrame(df,'SMILES','Molecule')
df[["SMILES","Molecule"]].head(1)

Unnamed: 0,SMILES,Molecule
0,CC(Cc(c(OC)c1)cc(OC)c1Br)N,<rdkit.Chem.rdchem.Mol object at 0x7a09f7e3a1f0>


In [6]:
# Check that all SMILES were successfully converted to molecule objects.
# If some of the SMILES don't convert that will break subsequent code based on molecule objects so best to quickly check that they all worked.
df.Molecule.isna().sum()

0

Some SMILES cannot be converted to Mol objects because RDKit cannot parse them; these show up as missing values in the Molecule column.

<b>Remove broken SMILES before continuing.</b>
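
One way to do the removal is sketched below. It assumes the molecules are parsed with `Chem.MolFromSmiles`, which returns `None` for a SMILES it cannot parse; the toy DataFrame `demo` and its contents are made up for illustration and are not part of the dataset.

```python
import pandas as pd
from rdkit import Chem

# Hypothetical toy data: the second SMILES has an unclosed ring and cannot be parsed.
demo = pd.DataFrame({"SMILES": ["CCO", "C1CC", "c1ccccc1"]})

# MolFromSmiles returns None (and logs a parse error) when a SMILES is invalid.
demo["Molecule"] = demo["SMILES"].apply(Chem.MolFromSmiles)

# Keep a record of the rows that failed, then drop them from the working table.
broken = demo[demo["Molecule"].isna()]
clean = demo[demo["Molecule"].notna()].reset_index(drop=True)

print(f"Removed {len(broken)} unparseable SMILES: {broken['SMILES'].tolist()}")
```

Keeping the `broken` rows around makes it easy to go back and repair the offending SMILES by hand rather than silently losing data.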

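Here is an example of two SMILES that look different but describe the same molecule. They differ only in notation: one uses aromatic (lowercase) atoms, the other uses explicit alternating double bonds (Kekulé form) and traverses the rings in a different order.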

Tautomers are structural isomers that readily interconvert, typically by moving a hydrogen atom. The standard InChI treats many (but not all) such mobile hydrogens as equivalent, so two tautomers of this kind get the same InChIKey, whereas their canonical SMILES stay different.

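The Canonical SMILES tester in the next cell generates canonical SMILES with `isomericSmiles=True`, which keeps stereochemistry and isotope labels in the output so that stereoisomers remain distinguishable. SMILES do not encode conformation, so conformers of one molecule always share the same SMILES.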

In [7]:
# Canonical SMILES tester

from rdkit import Chem

# Define your two SMILES strings
smiles1 = "Nc1ccc(Oc2cccc(Oc3ccc(N)cc3)c2)cc1"
smiles2 = "NC1=CC=C(OC2=CC(OC3=CC=C(N)C=C3)=CC=C2)C=C1"

# Create RDKit molecules from SMILES
mol1 = Chem.MolFromSmiles(smiles1)
mol2 = Chem.MolFromSmiles(smiles2)

# Check if the molecules were successfully created
if mol1 is None or mol2 is None:
    print("Invalid SMILES provided.")
else:
    # Generate canonical SMILES
    canonical_smiles1 = Chem.MolToSmiles(mol1, isomericSmiles=True)
    canonical_smiles2 = Chem.MolToSmiles(mol2, isomericSmiles=True)

    # Compare the canonical SMILES
    if canonical_smiles1 == canonical_smiles2:
        print("The structures are the same.")
    else:
        print("The structures are different.")


The structures are the same.
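
The pair tested above carries no stereochemistry, so the `isomericSmiles` flag makes no difference there. The following sketch shows where it does matter, using the two enantiomers of alanine as an illustrative example (these molecules are not taken from the dataset).

```python
from rdkit import Chem

# The two enantiomers of alanine differ only in the chirality tag (@ vs @@).
l_ala = Chem.MolFromSmiles("C[C@H](N)C(=O)O")   # L-alanine
d_ala = Chem.MolFromSmiles("C[C@@H](N)C(=O)O")  # D-alanine

# With isomericSmiles=True the stereocentre is written out, so the strings differ.
iso_l = Chem.MolToSmiles(l_ala, isomericSmiles=True)
iso_d = Chem.MolToSmiles(d_ala, isomericSmiles=True)

# With isomericSmiles=False the stereo information is dropped and the strings collapse.
flat_l = Chem.MolToSmiles(l_ala, isomericSmiles=False)
flat_d = Chem.MolToSmiles(d_ala, isomericSmiles=False)

print("Isomeric SMILES equal:    ", iso_l == iso_d)
print("Non-isomeric SMILES equal:", flat_l == flat_d)
```

Whether enantiomers should count as duplicates depends on the endpoint being modelled, so the choice of flag is worth making deliberately.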


One way to check the uniqueness of your molecules is to generate an InChIKey for each one.

Here is how to do that from the RDKit molecule objects.

In [8]:
# Here we take the "Molecule" column from df and convert each molecule to an InChIKey
df['inchi_key'] = df.Molecule.apply(Chem.MolToInchiKey)

In [9]:
df.head(1)  # Note the new inchi_key column.

Unnamed: 0,Number,Name,SMILES,LogA,Molecule,inchi_key
0,1,"1-(4-bromo-2,5dimethoxyphenyl)propan-2-amine",CC(Cc(c(OC)c1)cc(OC)c1Br)N,2.176,<rdkit.Chem.rdchem.Mol object at 0x7a09f7e3a1f0>,FXMWUTGUCAKGQL-UHFFFAOYSA-N
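
Because InChIKeys are derived from the standard InChI, they can also merge tautomers that canonical SMILES keeps apart. The sketch below illustrates this with 2-hydroxypyridine and 2-pyridone, a pair chosen here for illustration only; other tautomer types may not be merged by the standard InChI.

```python
from rdkit import Chem

# Two tautomers of the same compound, written as different SMILES.
hydroxypyridine = Chem.MolFromSmiles("Oc1ccccn1")   # 2-hydroxypyridine
pyridone = Chem.MolFromSmiles("O=c1cccc[nH]1")      # 2-pyridone

# Canonical SMILES keeps the hydrogen where it was drawn, so the two strings differ.
smi_a = Chem.MolToSmiles(hydroxypyridine)
smi_b = Chem.MolToSmiles(pyridone)
print("Canonical SMILES equal:", smi_a == smi_b)

# The standard InChI treats this hydrogen as mobile, so the InChIKeys can be compared directly.
key_a = Chem.MolToInchiKey(hydroxypyridine)
key_b = Chem.MolToInchiKey(pyridone)
print("InChIKeys equal:", key_a == key_b)
```

This is one reason the InChIKey column and the canonical SMILES column can disagree about how many unique molecules a dataset contains.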


Also, the SMILES can be converted to canonical SMILES.

It is a good idea to do this early in the project.

In [10]:
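# Here we generate canonical SMILES from the molecule objects with the isomericSmiles=True flag set.
# The flag must be passed by keyword: a bare positional True would be consumed by pandas' apply(), not by MolToSmiles.
df['CanonSmi'] = df.Molecule.apply(Chem.MolToSmiles, isomericSmiles=True)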

In [11]:
df.head(1)

Unnamed: 0,Number,Name,SMILES,LogA,Molecule,inchi_key,CanonSmi
0,1,"1-(4-bromo-2,5dimethoxyphenyl)propan-2-amine",CC(Cc(c(OC)c1)cc(OC)c1Br)N,2.176,<rdkit.Chem.rdchem.Mol object at 0x7a09f7e3a1f0>,FXMWUTGUCAKGQL-UHFFFAOYSA-N,COc1cc(CC(C)N)c(OC)cc1Br


In [12]:
# Create a DataFrame with the two test SMILES from the Canonical SMILES tester above.
new_smiles = pd.DataFrame({'SMILES': ["Nc1ccc(Oc2cccc(Oc3ccc(N)cc3)c2)cc1", "NC1=CC=C(OC2=CC(OC3=CC=C(N)C=C3)=CC=C2)C=C1"]})

# Concatenate this new DataFrame with the existing df
df = pd.concat([df, new_smiles], ignore_index=True)


In [13]:
# Regenerate the Molecule column
PandasTools.AddMoleculeColumnToFrame(df,'SMILES','Molecule')

In [14]:
# Use the Molecule column to generate canonical SMILES (flag passed by keyword so it reaches MolToSmiles)
df['CanonSmi'] = df.Molecule.apply(Chem.MolToSmiles, isomericSmiles=True)

In [15]:
# Now note that the bottom two SMILES in the CanonSmi column are the same *but* those in the corresponding SMILES column are not the same.

df.tail(6)

Unnamed: 0,Number,Name,SMILES,LogA,Molecule,inchi_key,CanonSmi
110,113.0,"3,5-dimethoxy-4-methallyloxy phenethylamine",CC(COc(c(OC)cc(CCN)c1)c1OC)=C,1.04,<rdkit.Chem.rdchem.Mol object at 0x7a09f7ec5a10>,FOXJFBFFGULACD-UHFFFAOYSA-N,C=C(C)COc1c(OC)cc(CCN)cc1OC
111,114.0,"2-(benzo[d][1,3]dioxol-5-yl)-2-methoxyethanamine",COC(CN)c(cc1)cc2c1OCO2,0.477,<rdkit.Chem.rdchem.Mol object at 0x7a09f7ec5a80>,KUTKTMOZFCYDLZ-UHFFFAOYSA-N,COC(CN)c1ccc2c(c1)OCO2
112,115.0,"2,5,beta-trimethoxy-4-bromophenethylamine",COC(CN)c(c(OC)c1)cc(OC)c1Br,1.301,<rdkit.Chem.rdchem.Mol object at 0x7a09f7ec5af0>,FYTLQNZPDWLGNU-UHFFFAOYSA-N,COc1cc(C(CN)OC)c(OC)cc1Br
113,116.0,"2-(7-methoxybenzo[d][1,3]dioxol-5-yl)ethanamine",COc1c2OCOc2cc(CCN)c1,0.0,<rdkit.Chem.rdchem.Mol object at 0x7a09f7ec5b60>,ORXQUAPZHKCCAX-UHFFFAOYSA-N,COc1cc(CCN)cc2c1OCO2
114,,,Nc1ccc(Oc2cccc(Oc3ccc(N)cc3)c2)cc1,,<rdkit.Chem.rdchem.Mol object at 0x7a09f7ec5bd0>,,Nc1ccc(Oc2cccc(Oc3ccc(N)cc3)c2)cc1
115,,,NC1=CC=C(OC2=CC(OC3=CC=C(N)C=C3)=CC=C2)C=C1,,<rdkit.Chem.rdchem.Mol object at 0x7a09f7ec5c40>,,Nc1ccc(Oc2cccc(Oc3ccc(N)cc3)c2)cc1


In [18]:
# Find duplicates in ONE dataset using the duplicated method from pandas
import pandas as pd

def has_duplicate_inchi(df):
    # Check for duplicates in the 'inchi_key' column
    duplicates = df.duplicated(subset='inchi_key', keep=False)

    # Check if there are any duplicates and print the result
    if duplicates.any():
        print("There are duplicate InChI values in the DataFrame.")
        return True
    else:
        print("No duplicate InChI values found in the DataFrame.")
        return False

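# Example usage:
# Regenerate the keys first: the two rows appended above have no inchi_key yet,
# and pandas would count their missing values (NaN) as duplicates of each other.
df['inchi_key'] = df.Molecule.apply(Chem.MolToInchiKey)
has_duplicate_inchi(df)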


There are duplicate InChI values in the DataFrame.


True
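
Knowing that duplicates exist is usually only the first step; the next is to inspect and remove them. The sketch below shows one way to do that with pandas, using a made-up table `toy` with placeholder keys (`KEY-1`, `KEY-2`) rather than real InChIKeys.

```python
import pandas as pd

# Hypothetical toy data; in the notebook this would be df with its 'inchi_key' column.
toy = pd.DataFrame({
    "Name": ["a", "b", "a_again", "no_key"],
    "inchi_key": ["KEY-1", "KEY-2", "KEY-1", None],
})

# Ignore rows without a key so missing values are not reported as duplicates.
keyed = toy.dropna(subset=["inchi_key"])

# keep=False marks every member of a duplicated group, which is handy for inspection.
dupes = keyed[keyed.duplicated(subset="inchi_key", keep=False)]
print(dupes)

# keep="first" retains one row per key and discards the later repeats.
deduped = keyed.drop_duplicates(subset="inchi_key", keep="first").reset_index(drop=True)
print(f"{len(keyed)} rows with keys, {len(deduped)} after de-duplication")
```

If duplicated molecules carry different measured values, it may be better to review or average them than to keep whichever row happens to come first.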

In [None]:
# Find overlaps in two datasets
import pandas as pd

def count_and_report_overlapping_inchi_keys(df1, df2):
    # Extract the 'inchi_key' columns from both DataFrames, ignoring missing keys
    inchi_keys1 = set(df1['inchi_key'].dropna())
    inchi_keys2 = set(df2['inchi_key'].dropna())

    # Calculate the count of overlapping values
    overlapping_count = len(inchi_keys1.intersection(inchi_keys2))

    if overlapping_count > 0:
        result = f"The number of overlapping InChI Keys between df1 and df2 is: {overlapping_count}"
    else:
        result = "No overlapping InChI Keys found between df1 and df2."

    return result

# Example usage:
# Assuming 'df1' and 'df2' are your DataFrames with 'inchi_key' columns
count_and_report_overlapping_inchi_keys(df1, df2)
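
A count tells you that two datasets overlap but not which molecules are involved. The following sketch, which assumes two hypothetical tables `df_a` and `df_b` with placeholder keys, lists the shared rows and the rows unique to the first table.

```python
import pandas as pd

# Hypothetical toy tables with placeholder keys standing in for real InChIKeys.
df_a = pd.DataFrame({"Name": ["x", "y", "z"], "inchi_key": ["KEY-1", "KEY-2", "KEY-3"]})
df_b = pd.DataFrame({"Name": ["p", "q"], "inchi_key": ["KEY-3", "KEY-9"]})

# An inner merge on the key returns one row per shared molecule, with columns from both tables.
shared = df_a.merge(df_b, on="inchi_key", how="inner", suffixes=("_a", "_b"))
print(shared)

# isin() gives a boolean mask, so negating it keeps the rows found only in df_a.
unique_to_a = df_a[~df_a["inchi_key"].isin(df_b["inchi_key"])]
print(unique_to_a)
```

Removing the shared rows from one table is a common precaution when one dataset is used for training and the other for testing.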

