# Week 6: RDKit (part 2)
Last week, we learned about the RDKit library and how to use it to manipulate and visualize molecules. This week, we will go deeper and cover the following topics:

* Substructure matching
    * Maximum common substructure
* Conformer generation

## Substructure matching
Substructure matching is a common task in cheminformatics. It is used to find a specific substructure in a molecule. RDKit provides a few functions for substructure searching. One of them is [`HasSubstructMatch`](https://www.rdkit.org/docs/source/rdkit.Chem.rdchem.html#rdkit.Chem.rdchem.Mol.HasSubstructMatch), which returns `True` if the molecule contains the substructure and `False` otherwise. Another function is [`GetSubstructMatch`](https://www.rdkit.org/docs/source/rdkit.Chem.rdchem.html#rdkit.Chem.rdchem.Mol.GetSubstructMatch), which returns the indices of the atoms in the molecule that match the substructure. Let's see an example of substructure matching using the `HasSubstructMatch` function.
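
Before working with the drug dataset, here is a minimal sketch of both functions on toy molecules. The molecules (aspirin as target, a benzene ring and a lone nitrogen as queries) are assumptions chosen for illustration and are not part of the course data.

```python
from rdkit import Chem

# Toy molecules chosen for illustration (not from the course dataset)
aspirin = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
benzene = Chem.MolFromSmiles("c1ccccc1")
nitrogen = Chem.MolFromSmiles("N")  # aspirin contains no nitrogen atom

# HasSubstructMatch answers a yes/no question
has_ring = aspirin.HasSubstructMatch(benzene)
has_nitrogen = aspirin.HasSubstructMatch(nitrogen)

# GetSubstructMatch returns the matching atom indices of the target,
# ordered like the atoms of the query; the tuple is empty if nothing matches
ring_atoms = aspirin.GetSubstructMatch(benzene)
no_match = aspirin.GetSubstructMatch(nitrogen)

print(has_ring, has_nitrogen)
print(ring_atoms, no_match)
```

Note that the query is always passed as the argument, while the method is called on the molecule being searched.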

Let's start by loading the RDKit library and the drug molecules from last week's exercise.

In [None]:
# some imports
from rdkit import Chem
from rdkit.Chem import PandasTools
import pandas as pd
from pathlib import Path
import os

In [None]:
current_file = Path(os.path.abspath(''))
csv_file = current_file.parent / "week_05" / "chembl_drugs.csv"
df = pd.read_csv(csv_file, sep= ";")
df.head()

In [None]:
# remove the rows with NaN values for SMILES (you did it last week)
df = # TODO your code here

In [None]:
# add molecule objects to the df from SMILES in a column named "mol" using RDKit functions (PandasTools or Chem module)
# TODO your code here

### Tyrosine Kinase Inhibitors
Tyrosine kinases are enzymes that catalyze the transfer of phosphate groups from ATP to tyrosine residues in proteins. The phosphorylation of proteins by tyrosine kinases is an important mechanism for the activation of signal transduction pathways in response to extracellular stimuli. The uncontrolled (or excessive) activity of tyrosine kinases can lead to cancer. Therefore, tyrosine kinases are an important target for cancer therapy. In the ATP binding pocket of these kinases, the adenine ring of ATP forms hydrogen bonds with the protein backbone of the so-called "hinge region". Tyrosine kinase inhibitors are designed to mimic ATP, and there are substructures ("motifs") known to enable binding to the hinge region. For more information, see this paper: [Designing of kinase hinge binders: A medicinal chemistry perspective](https://onlinelibrary.wiley.com/doi/epdf/10.1111/cbdd.14024?saml_referrer)

First, let's get the motif that binds the hinge region of tyrosine kinases:

![image.png](attachment:image.png)

In [None]:
indazole_smi = # TODO type the SMILES from the molecule in the picture and check if they match
indazole = Chem.MolFromSmiles(indazole_smi)
indazole

Let's test the [`HasSubstructMatch`](https://www.rdkit.org/docs/source/rdkit.Chem.rdchem.html#rdkit.Chem.rdchem.Mol.HasSubstructMatch) function on one example molecule to see how it works.

In [None]:
test_mol = df["mol"][0]
test_mol

In [None]:
# TODO your code here: Read the documentation in the link above to understand how to check if indazole is in the test_mol

Now, let's query all the drug molecules for the presence of the hinge binding motif. Use the [`HasSubstructMatch`](https://www.rdkit.org/docs/source/rdkit.Chem.rdchem.html#rdkit.Chem.rdchem.Mol.HasSubstructMatch) function in combination with the pandas `apply` method to check whether each molecule contains the motif.

In [None]:
df['has_indazole'] = # TODO your code here

The `USAN Definition` column roughly tells us what class of drug we are looking at. We'll remove rows with missing values in this column for analysis purposes.

In [None]:
# remove rows that do not have a USAN definition
df = df.dropna(subset=["USAN Definition"])
df.head()

Filter the molecules:
* remove the molecules that do not contain the motif
* only keep molecules that are tyrosine kinase inhibitors

In [None]:
# 1) only keep those with the motif
df_fil = # TODO your code here
df_fil.head()

In order to filter by target/mechanism of action, we will look at the USAN Definitions, which tell us the type of molecules we're dealing with.

In [None]:
# extract all USAN Definitions
moa = df_fil['USAN Definition'].unique().tolist()
moa

In [None]:
# counts of moa in df_fil
df_fil['USAN Definition'].value_counts()

In this list of molecule counts, we can see that anticancer drugs (antineoplastics) are the most common, but "tyrosine kinase inhibitors" are also frequent. They appear under different names. Filter the dataframe so that the entry in the `USAN Definition` column only needs to contain the string "tyrosine kinase inhibitor" and does not need to match it completely.

In [None]:
# 2) only keep tyrosine kinase inhibitors
tki = df_fil[df_fil['USAN Definition'].str.contains("tyrosine kinase inhibitor")]
tki
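
The difference between an exact comparison and a partial match with `str.contains` can be sketched on a toy dataframe. The labels below are made up for illustration and are not taken from the ChEMBL file. The `na=False` argument is an optional safeguard that treats missing values as non-matching, which matters only if missing values have not been dropped beforehand.

```python
import pandas as pd

# Made-up labels for illustration only (not taken from the ChEMBL file)
toy = pd.DataFrame({
    "USAN Definition": [
        "antineoplastics (tyrosine kinase inhibitors)",
        "tyrosine kinase inhibitors",
        "antibacterials",
        None,
    ]
})

# Exact comparison: only rows whose label equals the string completely
exact = toy[toy["USAN Definition"] == "tyrosine kinase inhibitors"]

# Partial match: rows whose label contains the string anywhere
partial = toy[toy["USAN Definition"].str.contains("tyrosine kinase inhibitor", na=False)]

print(len(exact), len(partial))
```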

In [None]:
# visualize the remaining TKI structures using the PandasTools.FrameToGridImage function
# TODO your code here

The function [`GetSubstructMatch`](https://www.rdkit.org/docs/source/rdkit.Chem.rdchem.html#rdkit.Chem.rdchem.Mol.GetSubstructMatch) returns the indices of the atoms in the molecule that match the substructure. Let's use this function to visualize the substructure in the first molecule.
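
As a hedged sketch of the idea on a toy case, the snippet below matches an amide fragment in paracetamol and passes the matched indices to `Draw.MolToImage` through its `highlightAtoms` keyword. The molecules are assumptions chosen for illustration, and drawing requires an RDKit installation with image support (Pillow).

```python
from rdkit import Chem
from rdkit.Chem import Draw

# Toy example (not from the course dataset): paracetamol and an amide fragment
paracetamol = Chem.MolFromSmiles("CC(=O)Nc1ccc(O)cc1")
amide = Chem.MolFromSmiles("CC(=O)N")

# Tuple of atom indices in paracetamol, ordered like the atoms of the query
match = paracetamol.GetSubstructMatch(amide)
print(match)

# The matched atoms are drawn highlighted in the resulting PIL image
img = Draw.MolToImage(paracetamol, size=(300, 300), highlightAtoms=list(match))
```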

In [None]:
mol = tki['mol'].iloc[0]
match_indices = # TODO your code here: Look up the documentation to understand how to get the indices of the substructure in the molecule
match_indices

In [None]:
#  highlight the indazole motif using the Draw.MolToImage function
from rdkit.Chem import Draw
img = # TODO your code here
img

### Thalidomide case
Let's look at a different example:

We'll extract the structure for the drug thalidomide and the substructure for the imide group. Thalidomide is a drug that was used in the late 1950s and early 1960s to treat morning sickness in pregnant women. However, it was later found to cause birth defects, and it was withdrawn from the market. The imide group is a functional group in which two carbonyl groups are bonded to the same nitrogen atom. Only the R-enantiomer of thalidomide is effective as a sedative, while the S-enantiomer is teratogenic. The imide group is present in both enantiomers. Let's check which enantiomer of thalidomide is present in our database.

In [None]:
s_thalidomide = Chem.MolFromSmiles('O=C1c2ccccc2C(=O)N1[C@H]3CCC(=O)NC3=O')
s_thalidomide

In [None]:
# find the Thalidomide molecule object in the dataframe
thalidomide = # TODO your code here
type(thalidomide)

In [None]:
# check if the Thalidomide molecule object has the S motif using the HasSubstructMatch function
# TODO your code here

In [None]:
thalidomide

It tells us that the molecule does contain the S-enantiomer. However, when we look at the molecule, we see that no stereochemistry is actually defined. RDKit does not consider stereochemistry by default when searching for substructures. We can change this by setting the `useChirality` parameter to `True`. Let's see if we can find the S-enantiomer of thalidomide in our database.
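
The effect of `useChirality` can be sketched on a small toy molecule before applying it to thalidomide. The SMILES below are illustrative assumptions and are unrelated to the dataset.

```python
from rdkit import Chem

# Toy molecules for illustration (unrelated to the dataset)
target = Chem.MolFromSmiles("CC[C@H](F)Cl")    # stereocentre defined
target_flat = Chem.MolFromSmiles("CCC(F)Cl")   # no stereochemistry defined
query_same = Chem.MolFromSmiles("C[C@H](F)Cl")
query_mirror = Chem.MolFromSmiles("C[C@@H](F)Cl")

# Default behaviour: stereochemistry is ignored, so both queries match
default_hits = (
    target.HasSubstructMatch(query_same),
    target.HasSubstructMatch(query_mirror),
)

# With useChirality=True the mirror-image query no longer matches
chiral_hits = (
    target.HasSubstructMatch(query_same, useChirality=True),
    target.HasSubstructMatch(query_mirror, useChirality=True),
)

# A chiral query does not match a target without defined stereochemistry
flat_hit = target_flat.HasSubstructMatch(query_same, useChirality=True)

print(default_hits, chiral_hits, flat_hit)
```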

In [None]:
# check if thalidomide contains the S-enantiomer by enforcing chirality
# TODO your code here

In this example, we query for the full molecule and not just a substructure. Combined with a check in the reverse direction (or a comparison of atom counts), this is a good approach to check whether two molecules are identical.
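
A match in one direction only shows that the query is contained in the target, so an identity check needs a little more. The sketch below uses toy molecules and a hypothetical helper `same_molecule` that compares canonical SMILES, which is one common way to test identity. By default `Chem.MolToSmiles` includes stereochemistry in the output.

```python
from rdkit import Chem

# Toy molecules for illustration
benzene = Chem.MolFromSmiles("c1ccccc1")
toluene = Chem.MolFromSmiles("Cc1ccccc1")
toluene_again = Chem.MolFromSmiles("c1ccc(C)cc1")  # same molecule, different atom order

# Benzene is contained in toluene, but not the other way round
one_way = toluene.HasSubstructMatch(benzene)
reverse = benzene.HasSubstructMatch(toluene)


def same_molecule(mol_a, mol_b):
    """Hypothetical helper: compare two molecules via their canonical SMILES."""
    return Chem.MolToSmiles(mol_a) == Chem.MolToSmiles(mol_b)


print(one_way, reverse)
print(same_molecule(toluene, toluene_again), same_molecule(toluene, benzene))
```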

### Maximum common substructure (MCS)
Back to our example with tyrosine kinase inhibitors:

We saw that with prior knowledge of which substructure to look for, we can find interesting tyrosine kinase inhibitors among chemotherapeutics. But what if we don't know what substructure to look for? This is where the maximum common substructure (MCS) comes in. The MCS is the largest substructure shared by two or more molecules. It is useful, for example, for identifying the common core of a series of related compounds. The MCS can be found using the `rdkit.Chem.rdFMCS` module.

In [None]:
from rdkit.Chem import rdFMCS

In [None]:
tki_mols = tki['mol'].tolist()
tki_mols[0]

Now, let's find the MCS between the list of molecules. Use the [`rdFMCS.FindMCS`](https://www.rdkit.org/docs/source/rdkit.Chem.rdFMCS.html) function. What does this function return and how can you visualize it?

In [None]:
mcs = # TODO your code here
mcs

We need to extract the SMARTS pattern from the MCS result and then visualize it. The SMARTS notation is similar to the SMILES notation in that it is string based, but it describes patterns rather than full structures. Here's a [link](https://ics.uci.edu/~dock/manuals/DaylightTheoryManual/theory.smarts.html#:~:text=All%20SMILES%20expressions%20are%20also,when%20used%20as%20a%20SMARTS.) to a quick overview of SMARTS for the curious.
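
The full round trip from a list of molecules to an MCS pattern can be sketched on three toy molecules that share a benzene ring. The molecules are assumptions for illustration and `rdFMCS.FindMCS` is called with its default settings.

```python
from rdkit import Chem
from rdkit.Chem import rdFMCS

# Toy molecules sharing a benzene ring: toluene, phenol, aniline
smiles = ["Cc1ccccc1", "Oc1ccccc1", "Nc1ccccc1"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

# FindMCS returns a result object, not a molecule
result = rdFMCS.FindMCS(mols)
print(result.numAtoms, result.numBonds, result.smartsString)

# The SMARTS string can be turned into a query molecule for matching or drawing
pattern = Chem.MolFromSmarts(result.smartsString)
all_match = all(m.HasSubstructMatch(pattern) for m in mols)
print(all_match)
```

Since the three substituents are different elements, only the six ring atoms are common to all molecules.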

In [None]:
mcs_smarts = mcs.smartsString
mcs_mol = # TODO your code here: get a mol object from a SMARTS string
mcs_mol

Great! This is exactly what we expected. The MCS is the substructure that we initially used to filter the molecules. Therefore, it is not surprising to find it as the MCS. What happens if we look at all the tyrosine kinase inhibitors? Will the MCS be the same?

Extract all the molecules from the unfiltered dataframe and find the MCS. What do you observe?

In [None]:
# extract all TKI molecules from the dataframe
tki_all = df[df['USAN Definition'].str.contains("tyrosine kinase inhibitor")]
mols_all = tki_all['mol'].tolist()
len(tki_all)

In [None]:
# extract the MCS (no new code here - copy from above!)
mcs = # TODO your code here: find the MCS of all TKI molecules
mcs_smarts = # TODO your code here: get the SMARTS string of the MCS
mcs_mol = # TODO your code here: get a mol object from the SMARTS string
mcs_mol

Hmm, this does not look like what we expected. Furthermore, finding a linear fragment as the common substructure of a set of drugs is somewhat surprising. Let's look at a few molecules and see if that makes sense:

In [None]:
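from rdkit.Chem import Draw
# show the MCS next to the first ten TKI molecules; the legends are sliced so they stay aligned with the molecules
Draw.MolsToGridImage([mcs_mol] + mols_all[:10], molsPerRow=5, subImgSize=(300, 300), legends=["MCS"] + tki_all['USAN Definition'].tolist()[:10])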

We cannot observe this substructure as a standalone chain in any of the molecules! What went wrong?

If we look into the documentation of the [`rdFMCS.FindMCS`](https://www.rdkit.org/docs/source/rdkit.Chem.rdFMCS.html) function, we see that there is an argument `completeRingsOnly` that is set to `False` by default. This means that the MCS may contain fragments of rings, so ring atoms can appear in the result without the ring being closed. Let's set this argument to `True` and see if that changes the result.
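
The effect of `completeRingsOnly` can be sketched on two small bicyclic toy molecules. The SMILES are assumptions for illustration; with the default settings the common substructure includes atoms of a ring that is only partially matched, while `completeRingsOnly=True` restricts the result to a closed ring.

```python
from rdkit import Chem
from rdkit.Chem import rdFMCS

# Two small bicyclic toy molecules for illustration
mols = [Chem.MolFromSmiles("CCC1CC2C1CN2"), Chem.MolFromSmiles("C1CC2C1CC2")]

# Default: ring atoms may be matched even if their ring is not fully included
default = rdFMCS.FindMCS(mols)

# Only complete rings are allowed in the common substructure
rings_only = rdFMCS.FindMCS(mols, completeRingsOnly=True)

print(default.numAtoms, default.smartsString)
print(rings_only.numAtoms, rings_only.smartsString)
```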

In [None]:
# extract MCS again but this time with complete rings only
mcs = # TODO your code here
mcs_smarts = mcs.smartsString
mcs_mol = Chem.MolFromSmarts(mcs_smarts)
mcs_mol

Well, that looks even stranger. Maybe the structures are just too diverse. We can assume that there are multiple substructures that can engage the hinge region and imitate the base of ATP. Let's lower the threshold that defines the fraction of molecules that need to contain the substructure. We can do this by setting the `threshold` argument in the `rdFMCS.FindMCS` function. Let's lower it stepwise until we get something more reasonable.

In [None]:
# extract MCS again but this time adding a threshold of 0.5
mcs = # TODO your code here
mcs_smarts = mcs.smartsString
mcs_mol = Chem.MolFromSmarts(mcs_smarts)
mcs_mol

Now, this already looks more interesting! This means that at least 50% of the TKIs in our dataset contain this pattern. Let's visualize our molecules again together with the found MCS to see if it makes sense.

In [None]:
Draw.MolsToGridImage([mcs_mol] + mols_all, molsPerRow=5, subImgSize=(300, 300), legends=["MCS"] + tki_all['USAN Definition'].tolist())

## Conformer generation
A molecule can adopt multiple conformations, where a conformation is a specific geometric arrangement of the atoms in the molecule. Conformer generation is the process of generating different conformations of a molecule. It is important in many areas of chemistry, as a molecule's properties, bioactivity, and reactivity often depend on its conformation. RDKit provides a few functions for conformer generation in the `rdkit.Chem.AllChem` module. One of them is the [`EmbedMolecule`](https://www.rdkit.org/docs/source/rdkit.Chem.rdDistGeom.html) function, which generates a single conformation of a molecule. Another function is the [`EmbedMultipleConfs`](https://www.rdkit.org/docs/source/rdkit.Chem.rdDistGeom.html) function, which generates multiple conformations of a molecule. Let's see this in action on the S-thalidomide structure.
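
Before turning to thalidomide, here is a minimal sketch of both functions on butane. The molecule and the `randomSeed` value are assumptions for illustration; the seed only serves to make the embedding reproducible. Hydrogens are added first because they influence the 3D geometry.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Toy molecule for illustration: butane with explicit hydrogens
butane = Chem.AddHs(Chem.MolFromSmiles("CCCC"))

# EmbedMolecule modifies the molecule in place and returns the conformer id
# (-1 would signal that the embedding failed)
conf_id = AllChem.EmbedMolecule(butane, randomSeed=42)
n_single = butane.GetNumConformers()

# EmbedMultipleConfs replaces existing conformers by default and returns the new ids
conf_ids = list(AllChem.EmbedMultipleConfs(butane, numConfs=10, randomSeed=42))
n_multi = butane.GetNumConformers()

# Each conformer stores one 3D coordinate per atom
coords = butane.GetConformer(conf_ids[0]).GetPositions()
print(conf_id, n_single, n_multi, coords.shape)
```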

In [None]:
from rdkit.Chem import AllChem

First, let's generate one conformer for the S-Thalidomide using [`EmbedMolecule`](https://www.rdkit.org/docs/source/rdkit.Chem.rdDistGeom.html):

In [None]:
# TODO your code here: generate one conformer (note: the function does not return a new molecule object but modifies the existing one)
s_thalidomide

RDKit already complains that we should consider calling `AddHs()`. Why is that? Let's add the hydrogens and try again.

In [None]:
s_thalidomide = # TODO your code here: add hydrogens
# TODO your code here: generate one conformer again
s_thalidomide

In [None]:
# see how many conformers the molecule has (should return 1)
s_thalidomide.GetNumConformers()

We already discussed that one molecule can usually have multiple conformations. You can observe this by rerunning the cell above several times. You should be able to see a new conformer every time. 

Now, let's generate `10` conformers with [`EmbedMultipleConfs`](https://www.rdkit.org/docs/source/rdkit.Chem.rdDistGeom.html) and visualize them.

In [None]:
# TODO your code here: generate 10 conformers (again, the function does not return a new molecule object but modifies the existing one)

In [None]:
# see how many conformers the molecule has (should return 10)
s_thalidomide.GetNumConformers()

For the set of conformers we can calculate the RMSDs between them.

**What is the RMSD?**
The root-mean-square deviation (RMSD) of two structures is a measure of the average distance between the atoms of two superimposed structures. The basic equation for calculating RMSD is below:

$$RMSD = \sqrt{\frac{1}{n}\sum_{i=1}^{n}d_i^2}$$

where $n$ is the number of atoms, and $d_i$ is the distance between the positions of atom $i$ in the two structures. The lower the RMSD, the more similar the structures are.
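
The formula can be sketched directly with NumPy. The helper `rmsd` and the coordinates below are hypothetical and only illustrate the arithmetic; the sketch assumes that both structures list the same atoms in the same order and are already superimposed, so no alignment is performed.

```python
import numpy as np


def rmsd(coords_a, coords_b):
    """Hypothetical helper: RMSD between two (n, 3) coordinate arrays.

    Assumes identical atom ordering and already superimposed structures.
    """
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    squared_distances = np.sum((a - b) ** 2, axis=1)  # d_i^2 for every atom
    return float(np.sqrt(squared_distances.mean()))


# Two made-up "structures" with two atoms each, shifted by 1 along z
structure_a = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
structure_b = [[0.0, 0.0, 1.0], [1.0, 0.0, 1.0]]

print(rmsd(structure_a, structure_b))
print(rmsd(structure_a, structure_a))
```

Every atom is displaced by exactly 1, so the RMSD between the two structures is 1, and the RMSD of a structure with itself is 0.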

Let's calculate the RMSDs between the conformers.

In [None]:
rmsds = # TODO your code here: calculate the RMSD between the conformers (note: you can write a RMSD function yourself or look up the RDKit documentation for a function that does this for you)

In [None]:
# plot distribution of RMSDs
import matplotlib.pyplot as plt
plt.hist(rmsds, bins=20);
plt.ylabel("Frequency")
plt.xlabel("RMSD")

Great! We can see that the molecule has multiple conformations that are actually different. This is important to consider when working with molecules, as the properties of a molecule can depend on its conformation. Often we are interested in the lowest-energy conformer since it is the most stable one. Let's minimize the energy of the conformers by using a force field. A force field is a set of equations and parameters that describe the potential energy of a molecule as a function of the positions of its atoms. We will change the arrangement of the atoms slightly in order to minimize the potential energy. RDKit has multiple force fields implemented, from which we will use the Universal Force Field (UFF).

These three functions are used:
* [`UFFGetMoleculeForceField`](https://www.rdkit.org/docs/source/rdkit.Chem.rdForceFieldHelpers.html#rdkit.Chem.rdForceFieldHelpers.UFFGetMoleculeForceField)
* [`CalcEnergy`](https://www.rdkit.org/docs/source/rdkit.ForceField.rdForceField.html#rdkit.ForceField.rdForceField.ForceField.CalcEnergy)
* [`Minimize`](https://www.rdkit.org/docs/source/rdkit.ForceField.rdForceField.html#rdkit.ForceField.rdForceField.ForceField.Minimize)
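
How the three functions fit together can be sketched on butane. The molecule, the seed, and the `maxIts` value are assumptions for illustration. Note the `confId` argument, which selects the conformer the force field operates on.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Toy molecule for illustration: butane with three embedded conformers
butane = Chem.AddHs(Chem.MolFromSmiles("CCCC"))
conf_ids = list(AllChem.EmbedMultipleConfs(butane, numConfs=3, randomSeed=42))

toy_energies = []
for conf_id in conf_ids:
    # Build a UFF force field for this specific conformer
    ff = AllChem.UFFGetMoleculeForceField(butane, confId=conf_id)
    ff.Initialize()
    before = ff.CalcEnergy()
    # Minimize updates the conformer coordinates in place
    ff.Minimize(maxIts=500)
    after = ff.CalcEnergy()
    toy_energies.append((before, after))
    print(f"Conformer {conf_id}: before {before:.4f}; after {after:.4f}")
```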

In [None]:
# loop over all conformers, minimize and calculate the energy
energies = []
for conf in s_thalidomide.GetConformers():
    ff = # TODO your code here: create a force field object with the first function
    energy_before = # TODO your code here: calculate the energy using the force field ff and second function
    # TODO your code here: minimize the energy using the force field and the third function
    energy_after = # TODO your code here: calculate the energy using the force field ff and second function
    energies.append(energy_after)
    print(f"Energy before: {energy_before:.4f}; Energy after: {energy_after:.4f}")
energies

We saw that each conformer gets adapted and the calculated energy is lowered after minimization. This is a good approach to get the most stable conformer of a molecule. Let's calculate the RMSDs between the minimized conformers.

In [None]:
rmsds = AllChem.GetConformerRMSMatrix(s_thalidomide)
plt.hist(rmsds, bins=20);
plt.ylabel("Frequency")
plt.xlabel("RMSD")

We can see that even though we minimized the structures, the conformers are still different from each other. This is likely because there are several "good" conformers for this small structure. We can still select the minimal-energy conformer based on the calculated energies using NumPy's `argmin` function.

In [None]:
import numpy as np
min_idx = np.argmin(energies)
min_energy = energies[min_idx]
min_energy