## Setting up the Google Colab

In order to prepare the Google Colab, please run the two cells below. After the second cell has run, restart the session.

In [None]:
!git clone https://github.com/wutobias/practicals-2023
!cp -r practicals-2023/3/Notebooks/* .

In [None]:
!pip install rdkit pandas scikit-learn numpy matplotlib nglview
!apt-get install -y pymol openbabel

# Structure Based Design

In this practical we will use structural information to guide the design of drug molecules. We will use the three-dimensional structure of a protein and try to find molecules that form a stable complex with it. Note how this differs from our previous practical, where we only had information about the chemical structure and binding affinity of the ligands.
We will use the same target protein as in the previous practical, Human cAMP-dependent protein kinase (PKA), and the same set of binders for our structure-based design.

## Searching the PDB

Now that we have the chemical structures and binding affinities from our last practical, we want to start exploring structural information about the target. For this we will go to the PDB database (https://www.rcsb.org) and enter the UniProt ID of our target protein, `P17612`, into the search box. You will now see a list of structures that match our target protein. Look at the information displayed for each entry.

1.) What is X-ray crystal structure determination?

2.) Suggest criteria to pick the best structure from the list.

## Looking at the Protein Structure

### General Overview of the PDB data

Now pick the structure with pdb code `3OVV` by clicking on the entry.

1.) We have picked a structure that already contains an inhibitor (i.e. a holo structure). Could we have also picked one with an empty binding site (i.e. an apo structure)? Could the choice of crystal structure bias our results?

2.) Look at the panel `Sequence Annotations` and explain what the "DISORDER" chart tells you. Why is it higher at the ends of the sequence?

### Structural Overview

Click on `Structure` right next to `Structure in 3D` to inspect the three-dimensional structure of the protein bound to an inhibitor.

1.) Can you identify the inhibitor?

### Identifying the Binding Site

On the right panel click `+ Add` and select `Type->Protein` and pick the representation mode `Molecular Surface`.

1.) Can you identify the binding site? 

2.) Is the binding site buried or solvent exposed?

### Electron Density and Interaction Analysis

Click on `Electron Density` right next to `Structure in 3D` to inspect the electron density. First click on the inhibitor again. The blue transparent surface now displays the electron density and the dashed lines display interatomic interactions.

1.) Is the inhibitor nicely modelled into the electron density?

2.) What are the big red spheres?

3.) Describe the interactions that the inhibitor can form. Which of these will be stronger, and which will be weaker?

## Recovering our database

We will recover our database from the previous practical and load it into a pandas dataframe (run the cell below).

In [None]:
import pandas as pd
output_df = pd.read_csv("./binders.csv")

from rdkit.Chem import Descriptors, Draw, PandasTools
molecules = pd.DataFrame(
    {"molecule_chembl_id"   : output_df["molecule_chembl_id"], 
     "smiles" : output_df["smiles"],
     "pIC50"  : output_df["pIC50"],
     "active" : [1 for _ in range(output_df.shape[0])]
    })
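
As a reminder of what the `pIC50` column holds: pIC50 is the negative base-10 logarithm of the IC50 expressed in mol/L, so larger values mean more potent binders. The sketch below assumes IC50 values given in nM; the function name `ic50_nm_to_pic50` is ours and is not part of the practical's helper module.

```python
import math

def ic50_nm_to_pic50(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 mol/L, hence pIC50 = 9 - log10(IC50 in nM)
    return 9.0 - math.log10(ic50_nm)

# Illustrative values only (not taken from binders.csv)
for ic50_nm in (1.0, 1000.0, 100000.0):
    print(f"IC50 = {ic50_nm:>9.1f} nM -> pIC50 = {ic50_nm_to_pic50(ic50_nm):.2f}")
```

A pIC50 of 4 therefore corresponds to an IC50 of 100 µM, which is a very weak binder.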

## Docking

Next we will run the docking using the program `autodock-vina`. This will take some time, so you won't do this now during the practical. Instead, your instructor has already carried out the docking for you. Of course, you are free to repeat it yourself after the practical.

In [None]:
from helpers import read_pdbqt
molecules = read_pdbqt(molecules)
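
The helper functions used in this notebook hide the actual docking call. As a rough illustration of what a Vina run involves, the sketch below only assembles a command line and does not execute it. The flags are standard AutoDock Vina command-line options; the function name `build_vina_command`, the 20 Å box edge, and the reading of the coordinates `[-7.731, -8.501, 19.163]` (taken from the `ad4v_dock` call later in this notebook) as the center of the search box are assumptions made for illustration, and the instructor's actual settings may differ.

```python
import shlex

def build_vina_command(receptor, ligand, out, center,
                       box_size=(20.0, 20.0, 20.0),
                       exhaustiveness=8, num_modes=9):
    """Assemble an AutoDock Vina command line (nothing is executed here)."""
    cx, cy, cz = center
    sx, sy, sz = box_size
    return [
        "vina",
        "--receptor", receptor,      # rigid protein in PDBQT format
        "--ligand", ligand,          # flexible ligand in PDBQT format
        "--out", out,                # docked poses are written here
        "--center_x", str(cx), "--center_y", str(cy), "--center_z", str(cz),
        "--size_x", str(sx), "--size_y", str(sy), "--size_z", str(sz),
        "--exhaustiveness", str(exhaustiveness),
        "--num_modes", str(num_modes),
    ]

# Assumed box center; box size and search settings are illustrative.
cmd = build_vina_command("3ovv_prot.pdbqt", "ligand.pdbqt", "ligand_dock.pdbqt",
                         center=(-7.731, -8.501, 19.163))
print(shlex.join(cmd))
```

The printed command could be pasted into a terminal where `vina` is installed. Larger `exhaustiveness` values make the search more thorough but slower.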

## Re-dock the crystal structure

Next, we want to re-dock the ligand from the crystal structure. This was already done for you and the results can be visualized below (run the cell first).
The docked and the experimentally resolved ligand do not overlap for the pose with the best score. Change the pose index `POSE_IDX` until you find a docked pose that matches the experimentally resolved structure well.

**Note**: If nothing is displayed, we will do this exercise together.

1.) Describe how the correctly and incorrectly docked poses compare with the crystal structure of the ligand.

2.) Also run the second cell below to print the scores for each of the docking poses. What does this tell you qualitatively about the shape of the scoring function surface?

In [None]:
import nglview as nv
POSE_IDX = 0

from google.colab import output
output.enable_custom_widget_manager()

view = nv.NGLWidget()
view.add_structure(
    nv.FileStructure("ligand.pdbqt"))
view.add_structure(
    nv.FileStructure("ligand_dock.pdbqt"))
view._remote_call('setSelection', target='compList', args=[f"/{POSE_IDX}"], 
               kwargs=dict(component_index=1))
view.add_structure(
    nv.FileStructure("3ovv_prot.pdbqt"))
view

In [None]:
with open("ligand_dock.pdbqt", "r") as fopen:
    counts = 0
    for line in fopen:
        # Vina writes one "REMARK VINA RESULT:" line per pose;
        # the fourth whitespace-separated field is the score.
        if "REMARK VINA RESULT:" in line:
            fields = line.split()
            print(f"Pose {counts}, Score {fields[3]}")
            counts += 1
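
Comparing poses by eye can be complemented with a number: the root-mean-square deviation (RMSD) between the atoms of a docked pose and the crystal pose. The sketch below is a minimal, dependency-free version that runs on synthetic coordinates. It assumes that both files list the ligand atoms in the same order and uses every atom record it finds; the function names are ours and are not part of the `helpers` module.

```python
import math

def read_pdbqt_models(text):
    """Return a list of poses; each pose is a list of (x, y, z) tuples."""
    models, current = [], []
    for line in text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            # Fixed-width PDB columns 31-54 hold the x, y, z coordinates
            current.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
        elif line.startswith("ENDMDL"):
            models.append(current)
            current = []
    if current:  # a file without MODEL/ENDMDL records holds a single pose
        models.append(current)
    return models

def rmsd(coords_a, coords_b):
    """RMSD between two poses with identical atom ordering (no alignment)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("poses must have the same non-zero number of atoms")
    squared = sum((a - b) ** 2
                  for atom_a, atom_b in zip(coords_a, coords_b)
                  for a, b in zip(atom_a, atom_b))
    return math.sqrt(squared / len(coords_a))

def _atom_line(serial, x, y, z):
    """Build a minimal PDB-style ATOM record (synthetic demo data only)."""
    head = ("ATOM  " + f"{serial:5d}" + "  C   LIG A   1").ljust(30)
    return head + f"{x:8.3f}{y:8.3f}{z:8.3f}"

# Synthetic two-atom "ligand": pose 0 is shifted by 2 Å, pose 1 matches exactly.
crystal_text = "\n".join([_atom_line(1, 0, 0, 0), _atom_line(2, 1.5, 0, 0)])
docked_text = "\n".join([
    "MODEL 1", _atom_line(1, 0, 0, 2), _atom_line(2, 1.5, 0, 2), "ENDMDL",
    "MODEL 2", _atom_line(1, 0, 0, 0), _atom_line(2, 1.5, 0, 0), "ENDMDL",
])
crystal = read_pdbqt_models(crystal_text)[0]
rmsds = [rmsd(crystal, pose) for pose in read_pdbqt_models(docked_text)]
for pose_idx, value in enumerate(rmsds):
    print(f"Pose {pose_idx}: RMSD to crystal pose = {value:.2f} Å")
```

With the files from this practical, one would read the contents of `ligand.pdbqt` and `ligand_dock.pdbqt` into strings and pass them to `read_pdbqt_models`. No superposition is applied, which is appropriate here because both ligands sit in the same receptor coordinate frame.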

## Visualize docked ligands

Next, we will visualize some of the docked ligands and analyze their interactions with the target protein. For that purpose, download the file `docked_pdb.zip` and unpack it. Upload the pdb files to the PLIP server (https://plip-tool.biotec.tu-dresden.de/plip-web/plip/index) to analyse and visualize the interactions of the docked complexes.

1.) Identify hydrogen bonds and hydrophobic interactions.

2.) Below is a list of the ligands sorted by docking score. What are the differences between a high-scoring and a low-scoring ligand?

In [None]:
molecules.sort_values("score-0")

## Develop your own Molecules

Here you have the chance to develop your own molecules by entering their SMILES codes into the Python dictionary below. They will be docked and saved into a combined (i.e. receptor + ligand) `.pdb` file, which can be uploaded to the PLIP server and analysed.

In [None]:
my_molecules = {
    "Mol-1" : "CCCCCNCC"
}

from rdkit.Chem import Descriptors, Draw, PandasTools
my_molecules = pd.DataFrame(
    {"molecule_chembl_id"   : my_molecules.keys(), 
     "smiles" : my_molecules.values(),
     "pIC50"  : [999999. for _ in my_molecules],
     "active" : [0 for _ in my_molecules],
    })
PandasTools.RenderImagesInAllDataFrames(images=True)
PandasTools.AddMoleculeColumnToFrame(my_molecules, "smiles", includeFingerprints=True)

from helpers import ad4v_dock
ad4v_dock(my_molecules, "3ovv_prot.pdbqt", [ -7.731, -8.501, 19.163])

from helpers import read_pdbqt
my_molecules = read_pdbqt(my_molecules)

from helpers import combine_pdbqt
combine_pdbqt(my_molecules, "3ovv_prot.pdbqt")

In [None]:
my_molecules

## Extra Analysis: Lack of negative binding data

Note: This analysis is not mandatory for the practical. Still, you're welcome to do it.

Ultimately we will want to explore how well our docking method is able to identify true binders and separate them from the non-binders. This is another way of saying that we want our docking method to generate many true positives and only few (or no) false positives. This will only work if our dataset contains **both** true negative examples (i.e. non-binders) and true positive examples (binders). However, we only have access to true binders, because this is what is usually published in the literature. To circumvent this problem, we will have to generate virtual non-binders, also called decoys. These decoy molecules are generated using the method of "Property-matched Decoys". For this to work, we will first cluster our molecules and then pick the cluster centers as the input for the `DUD-E` webserver (dude.docking.org/generate). Run the cell below to retrieve the SMILES codes of the cluster centers.
Note: You don't have to generate these decoys now. Your instructor has already generated them.

1.) What are true positives and what are false positives?

2.) What are "Property-matched Decoys"? See this paper (DOI): doi.org/10.1021/jm300687e

3.) What is clustering and what does it achieve in this context?
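
To make the terminology concrete, the sketch below counts true and false positives for a handful of made-up docking scores at one score threshold. All numbers, the threshold of -8.0, and the function name `confusion_counts` are invented for illustration and do not come from the practical's data.

```python
def confusion_counts(labels, scores, threshold):
    """Count (TP, FP, TN, FN) when score <= threshold is predicted as 'binder'."""
    tp = fp = tn = fn = 0
    for is_binder, score in zip(labels, scores):
        # Docking scores are more negative for better predicted binding
        predicted_binder = score <= threshold
        if predicted_binder and is_binder:
            tp += 1
        elif predicted_binder and not is_binder:
            fp += 1
        elif not predicted_binder and not is_binder:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

labels = [1, 1, 1, 0, 0, 0]                     # 1 = known binder, 0 = decoy (made up)
scores = [-9.1, -8.4, -6.0, -8.8, -5.5, -5.1]   # made-up docking scores
tp, fp, tn, fn = confusion_counts(labels, scores, threshold=-8.0)
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"true positive rate  = {tp / (tp + fn):.2f}")
print(f"false positive rate = {fp / (fp + tn):.2f}")
```

Moving the threshold trades true positives against false positives, which is why a single threshold is rarely enough to judge a docking method.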

In [None]:
from rdkit.Chem import AllChem as Chem
fps = list()
for smi in output_df["smiles"]:
    rdmol = Chem.MolFromSmiles(smi)
    fps.append(
        Chem.GetMorganFingerprintAsBitVect(rdmol, 2))
from helpers import ClusterFps
results = ClusterFps(fps, 0.6)
for idx_list in results:
    smi = output_df.loc[idx_list[0], "smiles"]
    print(smi)
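
We have not inspected what `ClusterFps` does internally. Assuming it follows the common Butina-style recipe (group fingerprints whose Tanimoto distance is below a cutoff and report the cluster center first), the idea can be sketched without RDKit as follows. Fingerprints are represented here as plain Python sets of "on" bit indices, and the greedy loop is a simplified variant, not the exact Butina algorithm.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 1.0

def butina_like_clusters(fingerprints, distance_cutoff):
    """Greedy clustering; each cluster is [center, member, member, ...]."""
    n = len(fingerprints)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i):
            distance = 1.0 - tanimoto(fingerprints[i], fingerprints[j])
            if distance <= distance_cutoff:
                neighbors[i].add(j)
                neighbors[j].add(i)
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # The unassigned molecule with the most unassigned neighbors becomes
        # the next center (ties are broken by the lowest index).
        center = max(sorted(unassigned),
                     key=lambda idx: len(neighbors[idx] & unassigned))
        members = sorted(neighbors[center] & unassigned)
        clusters.append([center] + members)
        unassigned -= {center, *members}
    return clusters

# Toy fingerprints: two obvious groups of similar bit patterns
toy_fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6}, {10, 11, 12}, {10, 11, 13}]
clusters = butina_like_clusters(toy_fps, distance_cutoff=0.6)
print(clusters)
```

Picking only the first index of each cluster, as the cell above does, gives a small set of structurally diverse representatives to submit to the decoy generator.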

## Combining the decoy database and our binder database

Now we will add the decoys to the database. We will add a column `active` that equals `1` if the molecule is a binder and `0` if it is a decoy.

1.) Compare some of the binders and non-binders. Do they look similar?

In [None]:
from rdkit.Chem import Descriptors, Draw, PandasTools
molecules = pd.DataFrame(
    {"molecule_chembl_id"   : output_df["molecule_chembl_id"], 
     "smiles" : output_df["smiles"],
     "pIC50"  : output_df["pIC50"],
     "active" : [1 for _ in range(output_df.shape[0])]
    })
smiles_list = list()
import glob
for path in glob.glob("dude-decoys/decoys/decoys.*.picked"):
    with open(path, "r") as fopen:
        for line in fopen:
            line = line.replace("ligand", "")
            c = line.rstrip().lstrip().split()
            smiles_list.append(c[0])
for idx, smi in enumerate(smiles_list[:100]):
    Nrows = molecules.shape[0]
    molecules.loc[Nrows] = [f"Decoy-{idx}", smi, 99999., 0]
PandasTools.RenderImagesInAllDataFrames(images=True)
PandasTools.AddMoleculeColumnToFrame(molecules, "smiles", includeFingerprints=True)

## Receiver Operating Characteristic

Next, we will look at how our docking method performs on the task of distinguishing binders from decoys. For that we will plot the ROC (Receiver Operating Characteristic) curve and compute the AUC (area under the curve).

1.) What is the ROC and what does it tell us qualitatively?

2.) What does the AUC tell us? Is a value of 0.5 good?

In [None]:
from sklearn import metrics
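y_true = list()
y_pred = list()
# Binders with pIC50 < 4 (decoys carry a placeholder pIC50 of 99999
# and are therefore not selected by this filter).
for row_idx, row in molecules[molecules.pIC50 < 4].iterrows():
    # Skip molecules without a docking score (None or NaN)
    if pd.notna(row["score-0"]):
        y_true.append(row["active"])
        # Negate the score so that a higher value means "more likely a binder"
        y_pred.append(-row["score-0"])
# Decoys
for row_idx, row in molecules[molecules.active == 0].iterrows():
    if pd.notna(row["score-0"]):
        y_true.append(row["active"])
        y_pred.append(-row["score-0"])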

fpr, tpr, thresholds = metrics.roc_curve(y_true, y_pred)
roc_auc = metrics.auc(fpr, tpr)
display = metrics.RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc)
display.plot()
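
One way to build intuition for the AUC is that it equals the fraction of (binder, decoy) pairs in which the binder receives the better score, with ties counted as one half. The sketch below computes this directly on made-up numbers. The function name `roc_auc_pairwise` and the data are ours and serve only as an illustration.

```python
def roc_auc_pairwise(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0      # binder ranked above decoy
            elif pos == neg:
                wins += 0.5      # a tie counts as half
    return wins / (len(positives) * len(negatives))

y_true_demo = [1, 1, 1, 0, 0, 0]                       # 1 = binder, 0 = decoy (made up)
docking_scores = [-9.1, -8.4, -6.0, -8.8, -5.5, -5.1]  # made-up scores
# Negate, as in the cell above, so that higher means "more likely a binder"
y_score_demo = [-s for s in docking_scores]
auc_demo = roc_auc_pairwise(y_true_demo, y_score_demo)
print(f"AUC = {auc_demo:.3f}")
```

A scoring function that cannot tell binders from decoys ranks about half of the pairs correctly, while a perfect one ranks all of them correctly.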