<a href="https://colab.research.google.com/github/russodanielp/intro_cheminformatics/blob/google_colab/Lab%2005%20-%20Molecular%20Descriptors/colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Molecular Descriptors

## Aim of this lab

To understand and calculate common types of molecular descriptors, including chemical fingerprints.  

### Objectives

* Calculate Mordred Descriptors 
* Calculate MACCS Keys
* Calculate Morgan Fingerprints


### Molecular Descriptors

Molecular descriptors are the foundation of any quantitative structure-activity relationship (QSAR).  Because we have a computational representation of molecules (e.g., graphs), we can calculate molecular attributes, called descriptors, which are quantitative measures inherent in their chemical structure.  Depending on the software you use, there can be just a few descriptors or several thousand.  

Numerous sets of chemical descriptors exist.  For example, [Molecular Operating Environment](https://www.chemcomp.com/Products.htm) and [Dragon](http://www.talete.mi.it/products/dragon_description.htm) are commercial products that are often used to calculate molecular descriptors for sets of molecules.  However, there are several open-source alternatives as well.  

Chemical descriptors are generally divided into two categories.  

1) Molecular descriptors - usually continuous (real-valued numbers, floats) values describing inherent molecular attributes, e.g., molecular weight, logP, etc.

2) Molecular fingerprints - binary (0, 1) or count-based (integer) values describing the presence or number of substructures in a chemical. 
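
To make the two categories concrete, here is a minimal sketch using RDKit. Ethanol is only an illustrative choice, and `Descriptors.MolWt`, `Descriptors.MolLogP`, and `MACCSkeys.GenMACCSKeys` stand in for whichever descriptor and fingerprint calculators you end up using.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, MACCSkeys

# Ethanol, chosen only as a small illustrative molecule
mol = Chem.MolFromSmiles("CCO")

# 1) Molecular descriptors: continuous, real-valued properties
mol_weight = Descriptors.MolWt(mol)   # average molecular weight
log_p = Descriptors.MolLogP(mol)      # Crippen estimate of logP

# 2) Molecular fingerprint: one 0/1 value per predefined substructure key
maccs_bits = list(MACCSkeys.GenMACCSKeys(mol))

print(f"MolWt = {mol_weight:.2f}, logP = {log_p:.2f}")
print(f"{sum(maccs_bits)} of {len(maccs_bits)} MACCS bits are set")
```

The descriptors are single floating-point numbers, while the fingerprint is a fixed-length vector of zeros and ones.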

### Traditional Molecular Descriptors

Here we will calculate traditional molecular descriptors.  Mordred calculates a wide variety of molecular descriptors (more than 1,800 two- and three-dimensional descriptors in total).

First, let's import our dataset.

The [Mordred](https://github.com/mordred-descriptor/mordred) software is a molecular descriptor calculator available in Python.  The paper describing the software can be found [here](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0258-y?ref=https://githubhelp.com).

The package is built around two main classes: `Descriptor` and `Calculator`.
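
As a rough sketch of how `Calculator` is typically used, the helper below registers all 2D descriptors and returns them as a table. The function name `calc_mordred_descriptors` and its SMILES input are assumptions made for illustration, not part of Mordred or of this lab's dataset, and the sketch does no error handling for SMILES that fail to parse.

```python
from rdkit import Chem


def calc_mordred_descriptors(smiles_list):
    """Return a DataFrame with one row per molecule and one column per descriptor."""
    # Imported here so the rest of the notebook still loads if Mordred is not installed
    from mordred import Calculator, descriptors

    mols = [Chem.MolFromSmiles(smi) for smi in smiles_list]

    # Register every descriptor in the package; ignore_3D=True skips those
    # that need 3D coordinates
    calc = Calculator(descriptors, ignore_3D=True)

    df = calc.pandas(mols)
    df.index = smiles_list  # label each row with its input SMILES
    return df


# Example call (hypothetical input):
# desc_df = calc_mordred_descriptors(["CCO", "c1ccccc1"])
```

Each column of the returned table corresponds to one registered `Descriptor`.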

### Molecular Fingerprints

Molecular fingerprints are usually binary and describe the presence or absence of certain chemical substructures.  

Generally, they are either key-based, meaning each position denotes the presence or absence of a predefined chemical fragment or set of atoms, or hashed, meaning they do not rely on a predefined set of structures.  Here we will calculate an example of each. 

* MACCS Keys [Ref.](https://pubs.acs.org/doi/10.1021/ci010132r)

Also known as MDL keys, these are 166 predefined substructures that were developed for substructure and database searching. 

* Morgan Fingerprints [Ref.](https://pubs.acs.org/doi/10.1021/ci100050t)

Morgan fingerprints, also known as extended-connectivity fingerprints (ECFP), are a type of fingerprint that considers the environment around each atom in a molecule.  They rely on a variant of the [Morgan algorithm](https://pubs.acs.org/doi/10.1021/c160017a018) to enumerate the substructure centered on each atom out to a certain number of bonds (e.g., all atoms within 3 bonds of the central atom).  This number is called the radius, and the diameter is twice the radius.  So, ECFP6 fingerprints encode the circular substructures around every atom with a diameter of up to 6 bonds (radius 0 to 3).  To keep track of the substructures, a [hashing algorithm](https://en.wikipedia.org/wiki/Hash_function) is applied to assign each one an integer identifier, which makes it possible to see which molecules share which substructures.  Because these identifiers can get pretty large, it's often necessary to "fold" them into a smaller predefined length (e.g., 1024 or 2048 bits).  
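
The hashing and folding steps can be illustrated with a toy sketch in plain Python. This is not RDKit's implementation: the helper `fold_identifiers`, the use of `zlib.crc32` as the hash, and the SMILES-like environment labels are all assumptions chosen to keep the example small.

```python
import zlib


def fold_identifiers(substructures, n_bits=1024):
    """Hash each substructure label to an integer and fold it into n_bits positions."""
    bits = [0] * n_bits
    for label in substructures:
        identifier = zlib.crc32(label.encode("utf-8"))  # large integer identifier
        bits[identifier % n_bits] = 1                   # folding: keep only the remainder
    return bits


# Hypothetical atom environments written as SMILES-like strings
environments = ["C", "O", "CC", "CO", "CCO"]
fingerprint = fold_identifiers(environments, n_bits=1024)
print(sum(fingerprint), "bits set out of", len(fingerprint))
```

Because many identifiers are squeezed into a fixed number of positions, two different substructures can land on the same bit. Shorter fingerprints make such collisions more likely.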


First, we write a function to calculate fingerprints of each type.  

In [None]:
from rdkit.Chem import AllChem, MACCSkeys


def calc_fp_from_mol(mol, method="maccs", n_bits=2048):
    """
    Encode an RDKit Mol as a fingerprint.

    Parameters
    ----------
    mol : RDKit Mol
        The RDKit molecule.

    method : str
        The type of fingerprint to use: "maccs", "ecfp4", or "ecfp6".
        Default is MACCS keys.

    n_bits : int
        The length of the folded Morgan fingerprint. Ignored for MACCS keys.

    Returns
    -------
    list of int
        The fingerprint as a list of 0/1 values.

    """

    if method == "maccs":
        return list(MACCSkeys.GenMACCSKeys(mol))
    elif method == "ecfp4":
        # Radius 2 corresponds to a diameter of 4 bonds
        return list(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits))
    elif method == "ecfp6":
        # Radius 3 corresponds to a diameter of 6 bonds
        return list(AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=n_bits))
    else:
        # Unknown method: warn and fall back to MACCS keys
        print(f"Warning: Unknown method specified: {method}. MACCS keys will be used instead.")
        return list(MACCSkeys.GenMACCSKeys(mol))

### MACCS Fingerprints
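
A minimal sketch of calculating MACCS keys for several molecules and collecting them in a table follows. The three molecules in `smiles_by_name` are hypothetical stand-ins for the lab's dataset, and the variable names are illustrative.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import MACCSkeys

# Hypothetical example molecules; replace with the lab's dataset
smiles_by_name = {
    "ethanol": "CCO",
    "benzene": "c1ccccc1",
    "acetic acid": "CC(=O)O",
}

rows = []
for name, smi in smiles_by_name.items():
    mol = Chem.MolFromSmiles(smi)
    rows.append(list(MACCSkeys.GenMACCSKeys(mol)))  # one 0/1 value per key

# One row per molecule, one column per bit position
maccs_df = pd.DataFrame(rows, index=list(smiles_by_name))
print(maccs_df.shape)
```

Note that RDKit returns 167 bits rather than 166: position 0 is unused, so that bit *i* corresponds to key *i*.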

### Morgan Fingerprints

Calculate Morgan fingerprints with a bond diameter of 6 (radius 3), folded into 1024 bits.  
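
As a sketch of this step for a single molecule, the block below uses the same RDKit call as `calc_fp_from_mol`. Aspirin is only an illustrative choice. The optional `bitInfo` dictionary is included to show which atom environments produced each bit.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Aspirin, used here only as an illustrative molecule
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# Radius 3 corresponds to a bond diameter of 6 (ECFP6); nBits sets the folded length
bit_info = {}
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=1024, bitInfo=bit_info)
ecfp6_bits = list(fp)

# bit_info maps each set bit to the (atom index, radius) pairs that produced it
largest_radius = max(radius for hits in bit_info.values() for _, radius in hits)
print(len(ecfp6_bits), "bits,", sum(ecfp6_bits), "set, largest radius:", largest_radius)
```

Recent RDKit releases may emit a deprecation warning for this function and point to `rdFingerprintGenerator` instead. The call is kept here for consistency with the function defined earlier in the lab.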

Let's set the index to the names of our molecules and save the result to a CSV file.  
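
A minimal pandas sketch of this step is shown below. The tiny 4-bit fingerprints, the molecule names, and the file name `fingerprints.csv` are all hypothetical placeholders for the real fingerprint table.

```python
import pandas as pd

# Hypothetical names and toy 4-bit fingerprints standing in for the real data
names = ["ethanol", "benzene", "acetic acid"]
fps = [
    [0, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 1, 1],
]

fp_df = pd.DataFrame(fps)
fp_df.index = names               # use the molecule names as the row index
fp_df.index.name = "name"
fp_df.to_csv("fingerprints.csv")  # hypothetical output file name

# Reading the file back with index_col=0 restores the names as the index
restored = pd.read_csv("fingerprints.csv", index_col=0)
print(restored)
```

Storing the names in the index keeps every column of the saved file a fingerprint bit, which makes the table easier to pass straight to a modeling library later.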