# Computing the Morgan fingerprints of molecules and comparing them 

Please cite our *Nature Protocols* paper, which features this Jupyter notebook: 

Tran-Nguyen, V. K., Junaid, M., Simeon, S. & Ballester, P. J. A practical guide to machine-learning scoring for structure-based virtual screening. *Nat. Protoc.* **18**, 3460–3511 (2023).

This Jupyter notebook computes the Morgan fingerprint of each input molecule, then calculates the pairwise Tanimoto similarity between those fingerprints. Please refer to our *Nature Protocols* paper cited above for more information.

## 1. Install all required Python dependencies 

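Several Python dependencies must be installed beforehand. Set up the `protocol-env` conda environment from the file `protocol-env.yml` (downloadable from our GitHub repository), for example with `conda env create -f protocol-env.yml`, then activate it with `conda activate protocol-env` before running the cell below.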
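Before running the imports, it can help to confirm that the environment is active and complete. The following is a minimal sketch, not part of the published protocol: the helper name `missing_packages` and the list `REQUIRED` are hypothetical, and the list only covers the top-level packages imported in the next cell.

```python
from importlib.util import find_spec

# Top-level packages imported by this notebook (assumed list, adjust as needed)
REQUIRED = ("pandas", "numpy", "rdkit", "oddt")


def missing_packages(names=REQUIRED):
    """Return the names of the packages that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```

If any package is reported as missing, the most likely cause is that the notebook kernel is not running inside the protocol-env environment.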

In [None]:
import pandas as pd
import oddt.pandas as opd
from oddt.pandas import ChemDataFrame
from rdkit import DataStructs
from rdkit import Chem
from rdkit.Chem.PandasTools import RenderImagesInAllDataFrames
from rdkit.Chem import AllChem
import numpy as np

## 2. Load input file(s) 

Input compounds must be provided in SDF format for this notebook.

If you have separate files (e.g. training and test sets): first use Steps 2.1 and 2.2 to load them separately, concatenate them, and save the concatenated SDF file on your computer; then go to Steps 2.3 and 2.4.

If you have only one SDF file (already concatenated): skip Steps 2.1 and 2.2 and go straight to Steps 2.3 and 2.4.

In [None]:
# STEP 2.1: Load separate sdfs and concatenate them:

dataset_1 = opd.read_sdf('Provide_the_pathway_to_your_first_input_sdf')
dataset_2 = opd.read_sdf('Provide_the_pathway_to_your_second_input_sdf')
dataset = pd.concat([dataset_1, dataset_2])

# If users have more than two input sdfs, add them as follows (remove the # before running the code):
# dataset_n = opd.read_sdf('Provide_the_pathway_to_your_nth_input_sdf')
# dataset = pd.concat([dataset_1, dataset_2, dataset_3, ..., dataset_n])

In [None]:
# STEP 2.2: Save the concatenated sdf on your computer:

ChemDataFrame.to_sdf(dataset, "Provide_the_pathway_to_the_directory_where_your_concatenated_sdf_is_stored",
                     molecule_column = 'mol', columns = list(dataset.columns))

In [None]:
# STEP 2.3: Define a function to load a concatenated sdf:

import gzip
import logging

log = logging.getLogger(__name__)

def LoadSDF(filename, idName='ID', molColName='ROMol', includeFingerprints=False,
            isomericSmiles=True, smilesName=None, embedProps=False, removeHs=True,
            strictParsing=True, sanitize=False):
    """Read the input sdf and return it as a pandas DataFrame.

    If embedProps=True, all properties are also kept on the Mol objects in the molecule column.
    If molColName=None, molecules are not stored in the DataFrame (only properties are read).
    If sanitize=False (default), only the sanitization steps listed below are applied;
    if sanitize=True, RDKit's full sanitization is applied while reading.
    """
    if isinstance(filename, str):
        # Open plain or gzip-compressed files in binary mode
        if filename.lower().endswith(".gz"):
            f = gzip.open(filename, "rb")
        else:
            f = open(filename, 'rb')
        close = f.close
    else:
        # Assume an already opened file-like object
        f = filename
        close = None
    records = []
    indices = []
    for i, mol in enumerate(
        Chem.ForwardSDMolSupplier(f, sanitize=sanitize, removeHs=removeHs,
                                  strictParsing=strictParsing)):
        # Skip records that RDKit could not parse (print their position in the file)
        if mol is None:
            print(i)
            continue
        if not sanitize:
            # Partial sanitization; errors are caught instead of raised
            Chem.SanitizeMol(mol, Chem.SanitizeFlags.SANITIZE_FINDRADICALS |
                             Chem.SanitizeFlags.SANITIZE_KEKULIZE |
                             Chem.SanitizeFlags.SANITIZE_SETAROMATICITY |
                             Chem.SanitizeFlags.SANITIZE_SETCONJUGATION |
                             Chem.SanitizeFlags.SANITIZE_SETHYBRIDIZATION |
                             Chem.SanitizeFlags.SANITIZE_SYMMRINGS,
                             catchErrors=True)
        # Collect the SD properties of the molecule as one row of the DataFrame
        row = dict((k, mol.GetProp(k)) for k in mol.GetPropNames())
        if molColName is not None and not embedProps:
            for prop in mol.GetPropNames():
                mol.ClearProp(prop)
        # The molecule title line is stored in the ID column
        if mol.HasProp('_Name'):
            row[idName] = mol.GetProp('_Name')
        if smilesName is not None:
            try:
                row[smilesName] = Chem.MolToSmiles(
                    mol, isomericSmiles=isomericSmiles)
            except Exception:
                log.warning(
                    'No valid smiles could be generated for molecule %s', i)
                row[smilesName] = None
        if molColName is not None and not includeFingerprints:
            row[molColName] = mol
        elif molColName is not None:
            # Private RDKit helper, only needed when includeFingerprints=True
            from rdkit.Chem.PandasTools import _MolPlusFingerprint
            row[molColName] = _MolPlusFingerprint(mol)
        records.append(row)
        indices.append(i)

    if close is not None:
        close()
    RenderImagesInAllDataFrames(images=True)
    return pd.DataFrame(records, index=indices)

In [None]:
# STEP 2.4: Load the concatenated sdf using the function defined above:

dataset_sdf = LoadSDF("Provide_the_pathway_to_the_directory_where_your_concatenated_sdf_is_stored")

## 3. Compute the Morgan fingerprints of input molecules

In [None]:
# Here we compute Morgan fingerprints of radius 2, 2048 bits:

fps = list()
for mol in dataset_sdf['ROMol']:
    fps.append(AllChem.GetMorganFingerprintAsBitVect(mol, radius = 2, nBits = 2048))

dataset_sdf["FP_2048b_r2"] = fps

## 4. Calculate the Tanimoto similarity of Morgan fingerprints and create a similarity matrix 
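For two fingerprints A and B, the Tanimoto similarity is the number of bits set in both divided by the number of bits set in at least one of them, so it ranges from 0 (no shared bits) to 1 (identical bit sets). The sketch below illustrates the formula in plain Python on sets of on-bit indices; the function name `tanimoto` and the example bit indices are made up for illustration, and returning 0.0 for two empty fingerprints is a convention chosen here, not necessarily what RDKit does.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Both fingerprints are empty: similarity is undefined, 0.0 is an assumed convention
        return 0.0
    return len(a & b) / union


# Toy fingerprints (hypothetical bit indices, not computed from real molecules)
fp_1 = {1, 5, 9, 42}
fp_2 = {5, 9, 42, 77, 100}

# 3 shared bits out of 6 distinct bits in total
print(tanimoto(fp_1, fp_2))  # 0.5
```

The cell below computes the same quantity with RDKit for every pair of input molecules.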

In [None]:
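# Put the fingerprints in a plain list so that they can be passed to RDKit directly
# (the DataFrame index may have gaps if some molecules could not be read):
fps = list(dataset_sdf["FP_2048b_r2"])
size = len(fps)
similarity_matrix = np.zeros((size, size))

np_fps = list()
for idx, fp in enumerate(fps):
    # Keep a NumPy copy of each fingerprint (not used below, available for later analyses)
    np_fp = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, np_fp)
    np_fps.append(np_fp)
    # Calculate the Tanimoto similarity between this fingerprint and all fingerprints
    similarity = DataStructs.BulkTanimotoSimilarity(fp, fps, returnDistance=0)
    # Save it as row idx of the similarity matrix
    similarity_matrix[idx] = similarity

# Label rows and columns with the molecule IDs (read from the title line of each molecule)
df_similarity = pd.DataFrame(similarity_matrix)
df_similarity.columns = list(dataset_sdf['ID'])
df_similarity.index = list(dataset_sdf['ID'])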

In [None]:
# Save the similarity matrix as a csv:

df_similarity.to_csv("Provide_the_pathway_to_the_directory_to_store_the_output_similarity_matrix_in_csv")
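Once the matrix is saved, a common follow-up is to look up the most similar other molecule for each compound. The sketch below is one possible way to do this with the standard library only; it assumes the csv layout that pandas writes by default (an empty corner cell, then the molecule IDs as column headers and as the first cell of each row). The function name `nearest_neighbours` and the three-molecule example are hypothetical, and the function expects at least two molecules.

```python
import csv
import io


def nearest_neighbours(csv_text):
    """Map each molecule ID to (ID of its most similar other molecule, similarity)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    ids = rows[0][1:]  # header row: empty corner cell, then the column IDs
    result = {}
    for i, row in enumerate(rows[1:]):
        scores = [float(value) for value in row[1:]]
        # Exclude the diagonal (self-similarity) by position, not by ID
        best_j = max((j for j in range(len(ids)) if j != i),
                     key=lambda j: scores[j])
        result[row[0]] = (ids[best_j], scores[best_j])
    return result


# Made-up similarity matrix for three molecules, in the layout described above
example = """,mol_A,mol_B,mol_C
mol_A,1.0,0.25,0.6
mol_B,0.25,1.0,0.4
mol_C,0.6,0.4,1.0
"""

for query, (neighbour, score) in nearest_neighbours(example).items():
    print(query, "->", neighbour, score)
```

To use it on your own output, read the saved csv as text (for example with `open(path).read()`) and pass the result to the function.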