<a href="https://colab.research.google.com/github/russodanielp/intro_cheminformatics/blob/google_colab/Lab%2005%20-%20Molecular%20Descriptors/colab_completed.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Molecular Descriptors

## Aim of this lab

To understand and calculate common types of molecular descriptors, including chemical fingerprints.  

### Objectives

* Calculate Mordred Descriptors 
* Calculate MACCS Keys
* Calculate Morgan Fingerprints


### Molecular Descriptors

Molecular descriptors are the foundation of any quantitative structure-activity relationship (QSAR).  Because we have a computational representation of molecules (e.g., graphs), we can calculate molecular attributes, called descriptors, which are quantitative measures inherent in their chemical structure.  Depending on the software you use, there can be just a few descriptors or several thousand.  

Numerous sets of chemical descriptors exist.  For example, [Molecular Operating Environment](https://www.chemcomp.com/Products.htm) and [Dragon](http://www.talete.mi.it/products/dragon_description.htm) are commercial software packages that are often used to calculate molecular descriptors for sets of molecules.  However, there are several open-source solutions as well.  

Chemical descriptors are generally broken up into two categories.  

1) Molecular descriptors - usually continuous values (real-valued numbers, floats) describing inherent molecular attributes, e.g., molecular weight, logP, etc.

2) Molecular fingerprints - binary (0, 1) or count-based (integer) values describing the presence or number of substructures in a chemical. 
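
As a toy illustration of the two categories, the sketch below derives both kinds of values from a bare atom-count table for diazepam (C16H13ClN2O). The element list, the rounded atomic masses, and the helper names are assumptions made for this example only; real descriptor and fingerprint software works on the full molecular graph rather than on a formula.

```python
# Rounded standard atomic masses (g/mol); assumed values for illustration.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "Cl": 35.45}

# Fixed, ordered list of elements that defines the fingerprint positions.
ELEMENT_KEYS = ["C", "N", "O", "S", "Cl", "Br"]

# Diazepam, C16H13ClN2O, as a table of atom counts.
diazepam = {"C": 16, "H": 13, "Cl": 1, "N": 2, "O": 1}


def molecular_weight(atom_counts):
    """Continuous descriptor: sum of atomic masses weighted by atom count."""
    return sum(ATOMIC_MASS[element] * n for element, n in atom_counts.items())


def count_fingerprint(atom_counts):
    """Count-based fingerprint: number of atoms of each keyed element."""
    return [atom_counts.get(element, 0) for element in ELEMENT_KEYS]


def binary_fingerprint(atom_counts):
    """Binary fingerprint: 1 if the keyed element is present, else 0."""
    return [1 if atom_counts.get(element, 0) > 0 else 0 for element in ELEMENT_KEYS]


mw = molecular_weight(diazepam)
counts = count_fingerprint(diazepam)
bits = binary_fingerprint(diazepam)

print(round(mw, 2))  # a single real-valued number
print(counts)        # [16, 2, 1, 0, 1, 0]
print(bits)          # [1, 1, 1, 0, 1, 0]
```

The descriptor is one real number per molecule, while each fingerprint is a fixed-length vector whose positions have the same meaning for every molecule.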

### Traditional Molecular Descriptors

Here we will calculate traditional molecular descriptors.  Mordred calculates a variety of molecular descriptors (more than 1,800 in total, counting both 2D and 3D descriptors).

In [None]:
!pip install rdkit
!pip install mordred 



First, let's import our dataset.

In [None]:
import pandas as pd
from rdkit.Chem import PandasTools

In [None]:
df = PandasTools.LoadSDF('DIAZEPAM_w_name.sdf')
df.head(3)

Unnamed: 0,Name,MolSmiles,Bio_Activity,ID,ROMol
0,Mol_0,CC(C)(C)OC(=O)c1c2n(cn1)-c3ccccc3C(=O)N(C2)C,-1.28,,<rdkit.Chem.rdchem.Mol object at 0x7f62645c4970>
1,Mol_1,CN1Cc2c(ncn2-c3ccc(cc3C1=O)Cl)C(=O)OC,-0.62,,<rdkit.Chem.rdchem.Mol object at 0x7f62645c49e0>
2,Mol_2,CCCOC(=O)c1c2n(cn1)-c3ccc(cc3C(=O)N(C2)C)Cl,-0.13,,<rdkit.Chem.rdchem.Mol object at 0x7f62645c4a50>


The [Mordred](https://github.com/mordred-descriptor/mordred) software is a molecular descriptor calculator available in Python.  The paper describing the software can be found [here](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0258-y?ref=https://githubhelp.com).

The package is built around two main classes, `Descriptor` and `Calculator`.
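
The division of labour between the two classes can be pictured with a toy analogue: each descriptor object knows how to compute one value, and a calculator holds a list of them and applies every one to a molecule. The sketch below is not Mordred's implementation; `ToyDescriptor`, `ToyCalculator`, and the atom-count "molecule" are hypothetical names invented for this illustration. The atom counts are the `nC`, `nH`, `nN`, and `nO` values that Mordred reports for the first molecule later in this lab.

```python
class ToyDescriptor:
    """Hypothetical descriptor: a name plus a function computing one value."""

    def __init__(self, name, func):
        self.name = name
        self.func = func

    def __call__(self, mol):
        return self.func(mol)


class ToyCalculator:
    """Hypothetical calculator: applies every registered descriptor to a molecule."""

    def __init__(self, descriptors):
        self.descriptors = list(descriptors)

    def __len__(self):
        return len(self.descriptors)

    def __call__(self, mol):
        # Return values keyed by descriptor name, in registration order.
        return {d.name: d(mol) for d in self.descriptors}


# A "molecule" here is just a table of atom counts (an assumption for the toy).
toy_mol = {"C": 17, "H": 19, "N": 3, "O": 3}

toy_calc = ToyCalculator([
    ToyDescriptor("nAtom", lambda m: sum(m.values())),
    ToyDescriptor("nHeavyAtom", lambda m: sum(n for el, n in m.items() if el != "H")),
    ToyDescriptor("nHetero", lambda m: sum(n for el, n in m.items() if el not in ("C", "H"))),
])

toy_result = toy_calc(toy_mol)
print(len(toy_calc))  # 3
print(toy_result)     # {'nAtom': 42, 'nHeavyAtom': 23, 'nHetero': 6}
```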

In [None]:
from mordred import Calculator, descriptors

calc_2d = Calculator(descriptors, ignore_3D=True)
 
num_2d = len(calc_2d)

print("There are", num_2d, "descriptors")

There are 1613 descriptors


In [None]:
one_mol = df.ROMol.iloc[0]

result = calc_2d(one_mol)
print(result)

Result({'ABC': 18.19951399504173, 'ABCGG': 15.54425468171948, 'nAcid': 0, 'nBase': 0, 'SpAbs_A': 28.422705531766375, 'SpMax_A': 2.5162242867902327, 'SpDiam_A': 4.868888912157172, 'SpAD_A': 28.422705531766375, 'SpMAD_A': 1.2357698057289728, 'LogEE_A': 4.077276458148709, 'VE1_A': 4.155551626568056, 'VE2_A': 0.180676157676872, 'VE3_A': 2.2573543045406486, 'VR1_A': 169.3863538843396, 'VR2_A': 7.364624081927809, 'VR3_A': 5.965091346325872, 'nAromAtom': 11, 'nAromBond': 11, 'nAtom': 42, 'nHeavyAtom': 23, 'nSpiro': 0, 'nBridgehead': 0, 'nHetero': 6, 'nH': 19, 'nB': 0, 'nC': 17, 'nN': 3, 'nO': 3, 'nS': 0, 'nP': 0, 'nF': 0, 'nCl': 0, 'nBr': 0, 'nI': 0, 'nX': 0, 'ATS0dv': 348.0, 'ATS1dv': 356.0, 'ATS2dv': 529.0, 'ATS3dv': 588.0, 'ATS4dv': 570.0, 'ATS5dv': 419.0, 'ATS6dv': 318.0, 'ATS7dv': 231.0, 'ATS8dv': 120.0, 'ATS0d': 145.0, 'ATS1d': 176.0, 'ATS2d': 297.0, 'ATS3d': 335.0, 'ATS4d': 348.0, 'ATS5d': 287.0, 'ATS6d': 240.0, 'ATS7d': 191.0, 'ATS8d': 162.0, 'ATS0s': 202.72916666666666, 'ATS1s': 142.

In [None]:
result[0:10]

[18.19951399504173,
 15.54425468171948,
 0,
 0,
 28.422705531766375,
 2.5162242867902327,
 4.868888912157172,
 28.422705531766375,
 1.2357698057289728,
 4.077276458148709]

In [None]:
result['ABC']

18.19951399504173

In [None]:
desc = calc_2d.pandas(df.ROMol)

100%|██████████| 42/42 [00:22<00:00,  1.84it/s]


In [None]:
desc.head()

Unnamed: 0,ABC,ABCGG,nAcid,nBase,SpAbs_A,SpMax_A,SpDiam_A,SpAD_A,SpMAD_A,LogEE_A,...,SRW10,TSRW10,MW,AMW,WPath,WPol,Zagreb1,Zagreb2,mZagreb1,mZagreb2
0,18.199514,15.544255,0,0,28.422706,2.516224,4.868889,28.422706,1.23577,4.077276,...,10.165121,72.073527,313.142641,7.455777,1093,41,126.0,150.0,8.951389,4.819444
1,16.417934,14.214142,0,0,27.011149,2.519343,4.893595,27.011149,1.286245,3.989157,...,10.085684,69.611613,305.056719,9.244143,820,40,114.0,139.0,7.75,4.611111
2,17.832148,15.088248,0,0,29.520464,2.52002,4.894438,29.520464,1.283498,4.070201,...,10.110542,71.859767,333.088019,8.540718,1114,42,122.0,147.0,8.25,5.111111
3,18.050928,15.35815,0,0,28.66838,2.520785,4.895415,28.66838,1.246451,4.073665,...,10.135432,71.957867,333.088019,8.540718,1094,42,124.0,149.0,8.861111,4.944444
4,19.246361,15.749331,0,0,31.472768,2.520896,4.895338,31.472768,1.311365,4.158025,...,10.221396,78.568529,345.088019,8.6272,1273,43,134.0,163.0,7.611111,5.111111


In [None]:
desc.describe()

Unnamed: 0,ABC,ABCGG,nAcid,nBase,SpAbs_A,SpMax_A,SpDiam_A,SpAD_A,SpMAD_A,LogEE_A,...,SRW10,TSRW10,MW,AMW,WPath,WPol,Zagreb1,Zagreb2,mZagreb1,mZagreb2
count,42.0,42.0,42.0,42.0,42.0,42.0,42.0,42.0,42.0,42.0,...,42.0,42.0,42.0,42.0,42.0,42.0,42.0,42.0,42.0,42.0
mean,19.24812,16.220855,0.0,0.0,31.215106,2.543648,4.924842,31.215106,1.282254,4.138268,...,10.250665,74.70535,345.767467,8.463921,1284.404762,45.02381,133.952381,163.119048,8.247189,5.270172
std,2.214991,1.482307,0.0,0.0,3.584472,0.029436,0.058403,3.584472,0.033037,0.112348,...,0.156224,3.831548,39.313421,0.567942,374.659719,4.734348,15.960244,19.790614,0.986402,0.537785
min,14.894331,13.60112,0.0,0.0,24.701552,2.512669,4.796048,24.701552,1.217932,3.898359,...,9.965194,68.32582,271.095691,7.061338,617.0,35.0,104.0,128.0,6.638889,4.194444
25%,17.522647,15.087931,0.0,0.0,28.093175,2.520694,4.894266,28.093175,1.264775,4.046199,...,10.119955,71.782756,314.625073,8.046478,981.25,42.0,120.5,145.25,7.611111,4.888889
50%,19.036051,16.2364,0.0,0.0,30.581541,2.527067,4.918369,30.581541,1.286245,4.138302,...,10.230529,74.329461,345.088019,8.521554,1247.5,44.5,133.0,162.0,8.0,5.194444
75%,21.170284,17.572906,0.0,0.0,34.044096,2.575053,4.955581,34.044096,1.304365,4.239563,...,10.393003,77.876474,370.619966,8.837389,1577.0,48.75,148.0,180.75,9.118056,5.607639
max,23.379612,18.60208,0.0,0.0,38.542131,2.596222,5.034503,38.542131,1.338338,4.328142,...,10.514855,81.942054,417.068804,9.772683,2049.0,53.0,162.0,197.0,10.284722,6.222222


### Molecular Fingerprints

Molecular fingerprints are usually binary and describe the presence or absence of certain chemical substructures.  

Generally, they are either key-based, meaning they denote the presence or absence of a predefined chemical fragment or set of atoms, or hashed, meaning they do not rely on a predefined set of structures.  Here we will calculate an example of each. 

* MACCS Keys [Ref.](https://pubs.acs.org/doi/10.1021/ci010132r)

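Also known as MDL keys, these are 166 predefined substructures that were developed for substructure and database searching.  RDKit returns them as a 167-bit vector because position 0 is unused, so that bit numbers line up with key numbers. 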

* Morgan Fingerprints [Ref.](https://pubs.acs.org/doi/10.1021/ci100050t)

Morgan fingerprints, also known as extended-connectivity fingerprints (ECFP), are a type of fingerprint that considers the environment around each atom in a molecule.  They rely on the [Morgan Algorithm ](https://pubs.acs.org/doi/10.1021/c160017a018) to enumerate the substructures centered on each atom out to a certain number of bonds (e.g., every atom within 3 bonds of the central atom).  That number of bonds is the radius, and the number in an ECFP name is the corresponding diameter, i.e., twice the radius.  So, ECFP6 fingerprints cover the environment of every atom out to a radius of 3 bonds (a diameter of 6), which is why the code below passes a radius of 3 for ECFP6 and 2 for ECFP4.  To keep track of substructures, a [hashing algorithm](https://en.wikipedia.org/wiki/Hash_function) is applied to assign each one an integer identifier, which makes it possible to tell which molecules share which substructures.  Because these identifiers can get pretty large, it's often necessary to "fold" them into a small predefined length (e.g., 1024, 2048).  
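
The hashing and folding steps can be sketched without any chemistry toolkit. In the toy below, each substructure is represented by a plain text label, hashed to a large integer, and folded into a short bit vector by taking the remainder modulo the vector length. The labels, the choice of SHA-256, and the helper names are assumptions made for illustration; RDKit uses its own atom invariants and hash function, so the bit positions here will not match real Morgan fingerprints.

```python
import hashlib


def hash_substructure(label):
    """Map a substructure label to a large, reproducible integer identifier."""
    digest = hashlib.sha256(label.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")  # 32-bit identifier


def fold(identifiers, n_bits):
    """Fold integer identifiers into a fixed-length binary vector."""
    bits = [0] * n_bits
    for identifier in identifiers:
        bits[identifier % n_bits] = 1  # distinct identifiers may share a position
    return bits


# Hypothetical atom environments for one molecule, written as text labels.
environments = ["C", "N", "O", "C(=O)O", "c1ccccc1", "C(=O)N(C)C"]

identifiers = [hash_substructure(label) for label in environments]
folded = fold(identifiers, n_bits=16)

print(identifiers)
print(folded)
```

Because several identifiers can land on the same position, a shorter vector loses more information; this is the trade-off behind choosing lengths such as 1024 or 2048.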


In [None]:
from rdkit.Chem import AllChem
from rdkit.Chem import MACCSkeys
from rdkit import Chem
import pandas as pd

First we write a function to calculate fingerprints of each type.  

In [None]:
def calc_fp_from_mol(mol, method="maccs", n_bits=2048):
    """
    Encode a molecule from a RDKit Mol into a fingerprint.

    Parameters
    ----------
    mol : RDKit Mol
        The RDKit molecule.

    method : str
        The type of fingerprint to use: "maccs", "ecfp4", or "ecfp6".
        Default is MACCS keys.

    n_bits : int
        The length of the Morgan fingerprint (ignored for MACCS keys).

    Returns
    -------
    list of int
        The fingerprint as a list of 0/1 values.

    """

    if method == "maccs":
        return list(MACCSkeys.GenMACCSKeys(mol))
    elif method == "ecfp4":
        return list(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits))
    elif method == "ecfp6":
        return list(AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=n_bits))
    else:
        print(f"Warning: Wrong method specified: {method}. Default will be used instead.")
        return list(MACCSkeys.GenMACCSKeys(mol))
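
The radius values 2 and 3 passed to the Morgan function above follow from the ECFP naming convention, in which the trailing number is a diameter in bonds and the radius is half of it. A minimal sketch of that conversion is shown below; `ecfp_radius` is a hypothetical helper written for this lab, not part of RDKit, and unlike the function above it raises an error for an unrecognized name instead of falling back to a default.

```python
def ecfp_radius(method):
    """Return the Morgan radius implied by a name such as 'ecfp4' or 'ecfp6'."""
    name = method.lower()
    if not name.startswith("ecfp") or not name[4:].isdigit():
        raise ValueError(f"Not an ECFP-style name: {method!r}")
    diameter = int(name[4:])
    if diameter % 2 != 0:
        raise ValueError(f"ECFP diameter must be even, got {diameter}")
    return diameter // 2  # radius = diameter / 2


print(ecfp_radius("ecfp4"))  # 2
print(ecfp_radius("ECFP6"))  # 3
```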

### MACCS Fingerprints

In [None]:
maccs_list = []

for mol in df.ROMol.tolist():
    maccs = calc_fp_from_mol(mol, method="maccs")
    maccs_list.append(maccs)

maccs = pd.DataFrame(maccs_list)
print("There are", maccs.shape[1],"MACCS fingerprints")
maccs.head()

There are 167 MACCS fingerprints


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,157,158,159,160,161,162,163,164,165,166
0,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,0
1,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,0
2,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,0
3,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,0
4,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,0
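
A wide table of zeros and ones is hard to read, so it can help to list which positions are actually set for a molecule. The sketch below does this for a plain Python list; the short example vector is made up for illustration, and with the real data a row could be passed as `maccs.iloc[0].tolist()`.

```python
def on_bits(fingerprint):
    """Return the positions of all nonzero entries in a fingerprint vector."""
    return [position for position, value in enumerate(fingerprint) if value]


# Made-up 12-bit vector standing in for one row of the fingerprint table.
example_fp = [0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1]

set_positions = on_bits(example_fp)
density = len(set_positions) / len(example_fp)

print(set_positions)      # [2, 5, 6, 11]
print(round(density, 3))  # 0.333
```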


### Morgan Fingerprints

Calculate Morgan fingerprints with a bond diameter of 6 (radius 3), folded into 1024 bits.  

In [None]:
ecfp_list = []

for mol in df.ROMol.tolist():
    ecfp6 = calc_fp_from_mol(mol, method="ecfp6", n_bits=1024)
    ecfp_list.append(ecfp6)

ecfp6 = pd.DataFrame(ecfp_list)
print("There are", ecfp6.shape[1], "ECFP6 fingerprint bits")
ecfp6.head()

There are 1024 ECFP6 fingerprint bits


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1014,1015,1016,1017,1018,1019,1020,1021,1022,1023
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


Let's set the indexes as the names of our molecules and save to a CSV file.  

In [None]:
names = df.Name

desc.index = names
maccs.index = names
ecfp6.index = names

desc.to_csv('mordred.csv')
maccs.to_csv('maccs.csv')
ecfp6.to_csv('ecfp6.csv')