# Molecular Descriptors

## Aim of this lab

To understand and calculate common types of molecular descriptors, including chemical fingerprints.  

### Objectives

* Calculate RDKit Descriptors 
* Calculate MACCS Keys
* Calculate Morgan Fingerprints


### Molecular Descriptors

Molecular descriptors are the foundation of any quantitative structure-activity relationship (QSAR).  Because we have a computational representation of molecules (e.g., graphs), we can calculate molecular attributes, called descriptors, which are quantitative measures inherent in their chemical structure.  Depending on the software you use, there can be a handful of descriptors or several thousand.  

Numerous sets of chemical descriptors exist.  For example, [Molecular Operating Environment](https://www.chemcomp.com/Products.htm) and [Dragon](http://www.talete.mi.it/products/dragon_description.htm) are commercial software packages that are often used to calculate molecular descriptors for sets of molecules.  Several open-source alternatives are available as well.  

Chemical descriptors are generally broken up into two categories.  

1) Molecular descriptors - continuous (real-valued numbers, floats) values describing inherent molecular attributes.  E.g., molecular weight, logP, etc.

2) Molecular fingerprints - Binary (0, 1) or count-based (integers) values describing the number or presence of substructures in a chemical. 

### Traditional Molecular Descriptors

Here we will calculate traditional molecular descriptors.  RDKit calculates a variety of molecular descriptors (around 200 in total) and chemical fingerprints.  Details on each descriptor and fingerprint can be found [here](https://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors). 

RDKit provides functions that calculate a single descriptor for a single molecule.  

For example, we can calculate the topological polar surface area (TPSA) or logP.
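As a minimal sketch of the single-descriptor case, the snippet below computes TPSA and the Crippen logP estimate for one molecule. The choice of ethanol as the example molecule and the variable names are assumptions made for illustration.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# Ethanol, chosen only as a small illustrative molecule
mol = Chem.MolFromSmiles("CCO")

# Topological polar surface area (in square angstroms)
tpsa = Descriptors.TPSA(mol)

# Crippen estimate of the octanol-water partition coefficient
logp = Descriptors.MolLogP(mol)

print(f"TPSA: {tpsa:.2f}")
print(f"LogP: {logp:.2f}")
```

Each function takes an RDKit molecule object and returns a single number, so calling one function per descriptor quickly becomes tedious for larger descriptor sets.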

But for large datasets and QSAR modeling, we'll likely want to calculate all descriptors for all molecules.  For that, we'll write a function that calculates every descriptor available in RDKit and apply it to every chemical in our dataset.
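One way such a function might look is sketched below. The function name `calc_all_descriptors` and the small SMILES list are hypothetical; the sketch relies on `Descriptors._descList`, a private list of `(name, function)` pairs that may change between RDKit versions.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors


def calc_all_descriptors(smiles):
    """Return a dict mapping descriptor name to value for one SMILES string.

    Returns None if RDKit cannot parse the SMILES.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # _descList holds (name, function) pairs for the descriptors RDKit ships
    return {name: func(mol) for name, func in Descriptors._descList}


# Hypothetical toy dataset: ethanol, benzene, acetic acid
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
rows = [calc_all_descriptors(smi) for smi in smiles_list]

print(f"{len(rows)} molecules, {len(rows[0])} descriptors each")
```

Returning one dictionary per molecule keeps the sketch free of extra dependencies; a list of such dictionaries can be converted into a table later if needed.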

### Molecular Fingerprints

Molecular fingerprints are usually binary and describe the presence or absence of certain chemical substructures.  

Generally, they are either key-based, meaning each position denotes the presence or absence of a predefined chemical fragment or set of atoms, or hashed, meaning they do not rely on a predefined set of structures.  Here we will calculate an example of each. 

* MACCS Keys [Ref.](https://pubs.acs.org/doi/10.1021/ci010132r)

Also known as MDL keys, these are 166 predefined substructures that were developed for substructure and database searching. 

* Morgan Fingerprints [Ref.](https://pubs.acs.org/doi/10.1021/ci100050t)

Morgan fingerprints, also known as extended-connectivity fingerprints (ECFP), are a type of fingerprint that considers the environment around each atom in a molecule.  They rely on a variant of the [Morgan algorithm](https://pubs.acs.org/doi/10.1021/c160017a018) to enumerate the circular substructures centered on each atom, out to a certain number of bonds (e.g., all atoms within 3 bonds of the central atom).  That number of bonds is the radius, and twice the radius is called the diameter.  So, ECFP6 fingerprints (diameter 6, radius 3) encode the environment of every atom out to 3 bonds away.  To keep track of substructures, a [hashing algorithm](https://en.wikipedia.org/wiki/Hash_function) assigns each one an integer identifier, so molecules that share a substructure also share its identifier.  Because these identifiers can get pretty large, it's often necessary to "fold" them into a small predefined length (e.g., 1024 or 2048 bits), at the cost of occasional collisions where different substructures set the same bit.  
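The folding step can be illustrated with a short pure-Python sketch. This is not RDKit's implementation; it only shows the general idea of mapping large identifiers onto a fixed number of bits with a modulo operation. The function name `fold_identifiers` and the identifier values are made up for illustration.

```python
def fold_identifiers(identifiers, n_bits=1024):
    """Fold large integer identifiers into a fixed-length bit list."""
    bits = [0] * n_bits
    for ident in identifiers:
        # The remainder picks which bit this identifier switches on
        bits[ident % n_bits] = 1
    return bits


# Hypothetical substructure identifiers; the last one is built to
# collide with the second (it differs by exactly n_bits)
example_ids = [3218693969, 2245384272, 864662311, 2245384272 + 1024]

fp = fold_identifiers(example_ids, n_bits=1024)
on_bits = [i for i, bit in enumerate(fp) if bit]

print(f"{len(example_ids)} identifiers set {len(on_bits)} bits")
```

Because two of the four identifiers land on the same position, fewer bits are set than there are identifiers. Longer fingerprints make such collisions less likely but use more memory.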


First we write a function to calculate fingerprints of each type.  

### MACCS Fingerprints
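A minimal sketch of a MACCS helper, assuming SMILES strings as input. The function name `calc_maccs` is hypothetical. Note that RDKit returns a 167-bit vector for the 166 keys, because bit 0 is unused so that bit positions match the key numbering.

```python
from rdkit import Chem
from rdkit.Chem import MACCSkeys


def calc_maccs(smiles):
    """Return the MACCS keys of one SMILES string as a list of 0/1 ints.

    Returns None if RDKit cannot parse the SMILES.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    fp = MACCSkeys.GenMACCSKeys(mol)
    # Convert the bit vector to a plain Python list of integers
    return [int(bit) for bit in fp.ToBitString()]


# Acetic acid, chosen only as a small illustrative molecule
maccs_bits = calc_maccs("CC(=O)O")
print(f"Length: {len(maccs_bits)}, bits set: {sum(maccs_bits)}")
```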

### Morgan Fingerprints

Calculate Morgan fingerprints at a bond diameter of 6 (radius 3), folded into 1024 bits.  
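A sketch of how this might be done with RDKit's fingerprint generator interface. RDKit takes a radius rather than a diameter, so a diameter of 6 becomes `radius=3`. The function name `calc_morgan` and the example molecule are assumptions made for illustration.

```python
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# Diameter 6 corresponds to radius 3; fpSize is the folded length in bits
morgan_gen = rdFingerprintGenerator.GetMorganGenerator(radius=3, fpSize=1024)


def calc_morgan(smiles):
    """Return the folded Morgan fingerprint of a SMILES string as 0/1 ints.

    Returns None if RDKit cannot parse the SMILES.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    fp = morgan_gen.GetFingerprint(mol)
    # Convert the bit vector to a plain Python list of integers
    return [int(bit) for bit in fp.ToBitString()]


# Aspirin, chosen only as an illustrative molecule
morgan_bits = calc_morgan("CC(=O)Oc1ccccc1C(=O)O")
print(f"Length: {len(morgan_bits)}, bits set: {sum(morgan_bits)}")
```

Creating the generator once and reusing it for every molecule avoids rebuilding it on each call, which helps when the function is applied to a whole dataset.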