# T002 · Diffusion-based docking models

**Note:** This talktorial is a part of TeachOpenCADD, a platform that aims to teach domain-specific skills and to provide pipeline templates as starting points for research projects.

Authors:

- Hamza Ibrahim, CADD seminars 2023, Universität des Saarlandes (UdS)
- Michael Bockenköhler, 2023,  [Volkamer lab](https://volkamerlab.org), Universität des Saarlandes (UdS)
- Andrea Volkamer, 2023,  [Volkamer lab](https://volkamerlab.org), Universität des Saarlandes (UdS)

## Aim of this talktorial

In this talktorial, we present two state-of-the-art classes of generative models. We first define generative models and explain the fundamentals of both classes. Afterward, we explore how one of them is applied in drug discovery, which challenges were encountered, and how they were solved.

### Contents in *Theory*

* Generative models
    * Denoising diffusion probabilistic model (DDPM)
        * Forward process
        * Reverse process
        * DDPM training
            * Loss function
            * Network architecture
    * Score-based generative model
        * Score model with stochastic differential equations (SDEs)
* Diffusion-based docking models
    * Ligand pose manifold
    * Product space diffusion
    * Model architecture

### Contents in *Practical*

* Data preparation
    - Download PDB structure
    - Prepare input data
* DiffDock implementation
* Denoising visualization

### References

* Generative Modeling by Estimating Gradients of the Data Distribution ([_arXiv_ (2019), 1907.05600](https://arxiv.org/abs/1907.05600))
* Denoising Diffusion Probabilistic Models ([_arXiv_ (2020), 2006.11239](https://arxiv.org/abs/2006.11239))
* Score-based generative modeling through stochastic differential equations ([_arXiv_ (2021), 2011.13456](https://arxiv.org/abs/2011.13456))
* DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking ([_arXiv_ (2023), 2210.01776](https://arxiv.org/abs/2210.01776))
* Equivariant Graph Neural Networks ([_arXiv_ (2022), 2102.09844](https://arxiv.org/abs/2102.09844))
* Deep Unsupervised Learning using Nonequilibrium Thermodynamics ([_arXiv_ (2015), 1503.03585](https://arxiv.org/pdf/1503.03585.pdf))
* [Diffusion Model Clearly Explained!](https://medium.com/@steinsfu/diffusion-model-clearly-explained-cd331bd41166)
* [Generative Modeling by Estimating Gradients of the Data Distribution by Yang Song](https://yang-song.net/blog/2021/score/#connection-to-diffusion-models-and-others)

## Theory

## Generative models

Generative models are a category of machine learning models that learn the distribution of a given training data set in order to generate new data from it. The models discussed in this talktorial do so by injecting noise into the input data and learning to remove it again. In a nutshell: ***"Creating noise from data is easy; creating data from noise is generative modeling."*** ([_arXiv_ (2021), 2011.13456](https://arxiv.org/abs/2011.13456)).

In this section, we discuss two advanced techniques used in generative modeling. The first is the denoising diffusion probabilistic model (DDPM), which generates new data by _"denoising"_ noisy input and thereby learns to approximate the data distribution. The second is the score-based generative model, which uses stochastic differential equations (SDEs) and the **score function** to reconstruct the data.

### Denoising diffusion probabilistic model

The DDPM, the so-called "diffusion model", is inspired by non-equilibrium thermodynamics in physics ([_arXiv_ (2015), 1503.03585](https://arxiv.org/pdf/1503.03585.pdf)). It learns to generate new data through two complementary processes, each represented as a sequence of random variables that forms a Markov chain.


1. Forward diffusion process → add noise to input data.
2. Reverse diffusion process → denoise noised data.
<a id='fig1'></a>

![DGM processes figure](images/basics_dgm.png)

*Figure 1:* 
Black arrows represent the forward diffusion process, while the blue arrow represents the reverse diffusion process.
Figure is taken from: [Medium article](https://medium.com/@steinsfu/diffusion-model-clearly-explained-cd331bd41166).

#### 1. Forward process

The first process adds Gaussian noise sequentially to the input data $x_0$ over $T$ steps. After enough steps, $x_T$ becomes pure static noise, as in [_figure 1_](#fig1). Each state $x_t$ is sampled from its predecessor $x_{t-1}$ as follows:

$$
q(x_{t}|x_{t-1}) = \mathcal{N}(x_{t};\ \mu_t = \sqrt{1 - \beta_t}\, x_{t-1},\ \Sigma_t = \beta_t \mathbf{I}), \tag{1}
$$
where $q(x_{t}|x_{t-1})$ denotes the probability of $x_t$ conditioned on the previous state $x_{t-1}$, and $\beta_t \in (0, 1)$ is the variance of the noise added in step $t$.

$\mu_t$ and $\Sigma_t$ represent the mean and covariance of the next-state distribution, respectively.


Because the added noise is normally distributed, we can compute the state at any given step directly, without iterating through all previous steps. This is done with the [reparametrization trick](https://medium.com/@steinsfu/diffusion-model-clearly-explained-cd331bd41166#228f), which lets us sample $x_t$ at any time step $t$ directly from $x_0$ and makes the forward diffusion process much faster:
$$
x_{t} = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon, \tag{2}
$$

where $\bar\alpha_t = \prod_{s = 1}^{t}(1 - \beta_s)$, and $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ is a single Gaussian noise term that merges the per-step noises $\epsilon_0, \dots, \epsilon_{t-2}, \epsilon_{t-1} \sim \mathcal{N}(0, \mathbf{I})$.
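
The closed-form sampling of equation (2) can be sketched in a few lines. This is a minimal illustration and not code from any diffusion library: the linear $\beta$ schedule (from $10^{-4}$ to $0.02$ over 1000 steps, as used by Ho et al.), the scalar "data point", and the helper names `make_alpha_bars` and `sample_x_t` are assumptions made for this example. It uses only the Python standard library.

```python
import math
import random
import statistics


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return [alpha_bar_1, ..., alpha_bar_T] for a linear beta schedule."""
    alpha_bars = []
    running_product = 1.0
    for step in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * step / (num_steps - 1)
        running_product *= 1.0 - beta
        alpha_bars.append(running_product)
    return alpha_bars


def sample_x_t(x0, alpha_bar_t, rng):
    """Sample x_t directly from x_0 (equation 2); returns (x_t, epsilon)."""
    epsilon = rng.gauss(0.0, 1.0)
    x_t = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * epsilon
    return x_t, epsilon


rng = random.Random(0)
alpha_bars = make_alpha_bars(num_steps=1000)
x0 = 3.0  # a single scalar "data point" (hypothetical)

# The mean shrinks towards 0 and the spread grows towards 1 as t increases.
for t in (1, 250, 1000):
    samples = [sample_x_t(x0, alpha_bars[t - 1], rng)[0] for _ in range(5000)]
    print(t, round(statistics.mean(samples), 2), round(statistics.stdev(samples), 2))
```

At the last step, the samples are practically indistinguishable from standard normal noise, which is what the forward process is designed to achieve.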

#### 2. Reverse process

Unfortunately, we cannot simply invert the forward process to recover $x_0$ from $x_t$, because the true reverse distribution $q(x_{t-1}|x_{t})$ is intractable. The **reverse diffusion process** therefore approximates $q(x_{t-1}|x_{t})$ with a deep learning model (e.g. a neural network) with parameters $\theta$. The model predicts a conditional probability distribution $p_{\theta}(x_{t-1}|x_{t})$, which is modeled as a Gaussian distribution:

$$
p_{\theta}(x_{t-1}|x_{t}) = \mathcal{N}(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)), \tag{3}
$$

By learning these conditional probability densities, the model reconstructs the original image $x_0$ from the noisy image $x_t$ step by step, as illustrated in [_figure 1_](#fig1). This allows meaningful information to be extracted from the noisy representation.

Having explained the two main processes of diffusion models, we now turn to training the model.

#### Train a diffusion model

The objective of training a diffusion model is to reconstruct the original data from its noised version by learning the original data distribution.

To train a generative model effectively, we need to define a suitable loss function and the architecture of the deep learning model. In this section, we briefly explain the loss function and the network architecture commonly employed in diffusion models, and then the training process.


##### Loss function

Diffusion models have some similarities with variational autoencoders [(VAEs)](https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73). Both are generative models that learn the data distribution in order to generate new data. Both are trained by maximizing the log-likelihood of the training data, which guides the model towards better prediction accuracy:

$$
\underset{\theta}{\text{max}}\sum_{i=1}^{N}\log{p_\theta(x_i)} \tag{4}
$$


In diffusion models, the log-likelihood is intractable. However, we can optimize it indirectly by optimizing the [variational lower bound](https://en.wikipedia.org/wiki/Evidence_lower_bound). Skipping the mathematical details, [Ho et al. (2020)](https://arxiv.org/pdf/2006.11239.pdf) simplified the loss function to:
$$
L_{t}^{simple} = \mathbb{E}_{t, x_0, \epsilon}\left[\| \epsilon - \epsilon_\theta(x_t,t)\|^2\right] \tag{5}
$$
where:

$\epsilon \sim \mathcal{N}(0, \mathbf{I})$ is the actual noise added, which follows a standard normal distribution.

$\epsilon_\theta(x_t,t) = \epsilon_\theta(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon, t)$ denotes the noise predicted by the neural network, where $x_t$ is obtained with the reparametrization trick mentioned before.


A loss function measures the difference between predicted and true values. As shown above, the simplified loss function is the mean squared error (MSE) between the added noise and the predicted noise.

In the context of diffusion models, the true value is the noise that was added to an image. By learning to predict this noise, the model learns to recover the original data distribution from the "noisy" version of the input data.
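
To make equation (5) concrete, the following sketch estimates the simplified loss by Monte Carlo sampling at one fixed time step. Everything here is a toy assumption for illustration: the data are scalars drawn from $\mathcal{N}(0, 1)$, $\bar\alpha_t$ is fixed to 0.5, and the two "noise predictors" are hand-written functions instead of a neural network. For this toy data, the best possible prediction is $\sqrt{1 - \bar\alpha_t}\, x_t$, so its loss should be clearly lower than that of a predictor that always returns zero.

```python
import math
import random


def simple_loss(predict_noise, alpha_bar_t, num_samples=20000, seed=0):
    """Monte Carlo estimate of equation (5) at one fixed time step t."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x0 = rng.gauss(0.0, 1.0)       # toy data: x_0 ~ N(0, 1)
        epsilon = rng.gauss(0.0, 1.0)  # the noise that is actually added
        x_t = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * epsilon
        total += (epsilon - predict_noise(x_t)) ** 2
    return total / num_samples


alpha_bar_t = 0.5  # hypothetical noise level


def zero_predictor(x_t):
    """A useless model that always predicts 'no noise'."""
    return 0.0


def optimal_linear_predictor(x_t):
    """For x_0 ~ N(0, 1): E[epsilon | x_t] = sqrt(1 - alpha_bar_t) * x_t."""
    return math.sqrt(1.0 - alpha_bar_t) * x_t


print("zero predictor:   ", round(simple_loss(zero_predictor, alpha_bar_t), 3))
print("optimal predictor:", round(simple_loss(optimal_linear_predictor, alpha_bar_t), 3))
```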

Once the loss function has been chosen, we can go to the next step, which is selecting an appropriate network architecture and training the diffusion model.


##### Network architecture

The most important requirement for the network is that its input and output have identical dimensionality. Therefore, a [U-Net](https://theaisummer.com/unet-architectures/) is commonly used as the network architecture for prediction tasks in diffusion models. More information can be found in the [appendix](#appendix).

By minimizing the loss function with gradient descent, the model is trained until it converges.
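
Training by gradient descent can be sketched with a deliberately tiny stand-in for the network. This is an assumption-laden toy, not a U-Net: the "model" is a single learnable weight that predicts the noise as `weight * x_t`, the data are scalars from $\mathcal{N}(0, 1)$, and the noise level $\bar\alpha_t = 0.5$ is fixed. Under these assumptions the weight should converge to about $\sqrt{1 - \bar\alpha_t}$.

```python
import math
import random

rng = random.Random(0)
alpha_bar_t = 0.5  # fixed noise level for this toy example

# Build a fixed training set of (x_t, epsilon) pairs from toy data x_0 ~ N(0, 1).
pairs = []
for _ in range(5000):
    x0 = rng.gauss(0.0, 1.0)
    epsilon = rng.gauss(0.0, 1.0)
    x_t = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * epsilon
    pairs.append((x_t, epsilon))


def loss_and_gradient(weight):
    """Mean squared error of epsilon_hat = weight * x_t and its derivative."""
    loss = 0.0
    gradient = 0.0
    for x_t, epsilon in pairs:
        residual = weight * x_t - epsilon
        loss += residual ** 2
        gradient += 2.0 * residual * x_t
    return loss / len(pairs), gradient / len(pairs)


weight = 0.0         # the single learnable parameter (stand-in for theta)
learning_rate = 0.1
for step in range(200):
    loss, gradient = loss_and_gradient(weight)
    weight -= learning_rate * gradient  # plain gradient descent update

print("learned weight:       ", round(weight, 3))
print("sqrt(1 - alpha_bar_t):", round(math.sqrt(1.0 - alpha_bar_t), 3))
```

A real diffusion model replaces the single weight with a deep network and samples a new time step $t$ for every training example, but the update rule follows the same pattern.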

We now have a clear outline of the DDPM. In the next section, we explain another type of generative model: the score-based generative model.

### Score-based generative model

Suppose we are given a data set $\{x_1, \dots, x_{N-1}, x_N\}$ whose points follow a certain probability distribution, denoted as $p(x)$. The primary goal of the score model is to fit a model to the given distribution $p(x)$ so that new data points can be generated by sampling from the learned distribution.

First, however, we need a good way to represent the probability distribution. One way is to model the probability density function (PDF) directly. Let $f_\theta(x) \in \mathbb{R}$ be a real-valued function parameterized by the learnable parameter $\theta$. The PDF of the probabilistic model is then defined as follows:

$$
p_\theta(x)= \frac{e^{-f_\theta(x)}}{Z_\theta}, \tag{6}
$$

where $Z_\theta > 0$ is the normalizing constant, which depends on $\theta$. The model can be trained by maximizing the log-likelihood of the PDF as in equation (4).

However, a general normalizing constant $Z_\theta$ is intractable most of the time, so we cannot compute $p_\theta(x)$. To overcome the intractability of $Z_\theta$, the **score function** is modeled instead of the PDF. The score function of a distribution $p(x)$ is $\nabla_x \log p(x)$, and the score-based model $s_\theta(x)$ is trained to approximate it:

$$
s_\theta(x) \approx \nabla_x \log {p(x)} \tag{7}
$$

$s_\theta(x)$ can be parameterized without the need to evaluate $Z_\theta$. Since $Z_\theta$ does not depend on $x$, its gradient with respect to $x$ is zero and the term drops out, as in equation (8):

$$
s_\theta(x) = \nabla_x \log p_\theta(x) = - \nabla_x f_\theta(x) - \nabla_x \log Z_\theta =  - \nabla_x f_\theta(x) \tag{8}
$$

As shown in [_figure 2_](#fig2), score functions can be parameterized without any normalization. In contrast, a PDF has to stay normalized whenever it changes, because the area under its curve must integrate to one.

<a id='fig2'></a>

<p float="left">
  <img src="images/ebm.gif" height="280" width="400" />
  <img src="images/score.gif" height="280" width="400" /> 
</p>

*Figure 2:* 
Parameterization of probability density functions (on the left) and score functions (on the right). 
Figures are taken from: [Yang-Song blogpost](https://yang-song.net/blog/2021/score/#connection-to-diffusion-models-and-others).
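
The claim that the score does not depend on the normalizing constant can be checked numerically. The sketch below assumes a hypothetical one-dimensional energy $f_\theta(x) = (x - \mu)^2 / (2\sigma^2)$, for which the true normalizing constant is known, and compares the finite-difference gradient of $\log p_\theta(x)$ for the true constant and for an arbitrary wrong constant with the analytic score $-\nabla_x f_\theta(x)$.

```python
import math

MU, SIGMA = 1.0, 2.0  # hypothetical model parameters (theta)


def energy(x):
    """f_theta(x) for a Gaussian-shaped energy."""
    return (x - MU) ** 2 / (2.0 * SIGMA ** 2)


def log_density(x, normalizing_constant):
    """log p_theta(x) = -f_theta(x) - log Z (logarithm of equation 6)."""
    return -energy(x) - math.log(normalizing_constant)


def numerical_score(x, normalizing_constant, step=1e-5):
    """Central finite difference of log p_theta(x) with respect to x."""
    upper = log_density(x + step, normalizing_constant)
    lower = log_density(x - step, normalizing_constant)
    return (upper - lower) / (2.0 * step)


def analytic_score(x):
    """-d/dx f_theta(x), the right-hand side of equation (8)."""
    return -(x - MU) / SIGMA ** 2


true_z = SIGMA * math.sqrt(2.0 * math.pi)
for x in (-3.0, 0.0, 4.0):
    # The last three columns agree: the constant has no influence on the score.
    print(x, numerical_score(x, true_z), numerical_score(x, 123.0), analytic_score(x))
```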

#### - Score model with stochastic differential equations (SDEs)

Perturbing the data with multiple noise scales has been shown to improve the model's ability to generate high-quality samples. By increasing the number of noise scales to infinity, the exact log-likelihood can also be obtained. Since noise is injected into the data, the concept is similar to the DDPM. The main difference is that the noise perturbation is now a continuous-time [stochastic process](https://en.wikipedia.org/wiki/Stochastic_process#:~:text=A%20stochastic%20process%20is%20defined,measurable%20with%20respect%20to%20some), as shown in [figure 3](#fig3).

<a id='fig3'></a>

<img src="images/sde_schematic.jpg" height=400 width =1000  style="margin-left: auto; margin-right: auto;">

*Figure 3:* 
An overview of a forward and reverse SDE in general.
Figure is taken from: ([_arXiv_ (2021), 2011.13456](https://arxiv.org/abs/2011.13456)).

As in the DDPM, we now have a forward process and a reverse process, both described by SDEs. Keep in mind that the SDE is not unique: there are different ways to add noise perturbations, and the example shows only one of them. In the reverse SDE, the score function is used to generate data, as demonstrated in [_figure 3_](#fig3), and the score-based model is trained with score matching. For a more comprehensive understanding, you can find detailed information about training a score-based generative model in the [score model training](#scoremodel) section of the appendix.
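
How a reverse SDE turns noise into data can be sketched with a toy case in which the score is known exactly, so that no model has to be trained. The assumptions are all choices made for this example: the forward SDE is plain Brownian motion, $dx = dw$; the data distribution is $\mathcal{N}(2, 0.5^2)$, so the perturbed distribution at time $t$ is $\mathcal{N}(2, 0.5^2 + t)$ with score $-(x - 2) / (0.5^2 + t)$; and the reverse SDE is integrated with simple Euler-Maruyama steps.

```python
import math
import random
import statistics

DATA_MEAN, DATA_STD = 2.0, 0.5  # toy data distribution p_0 = N(2, 0.5^2)
T_END = 25.0                    # final time of the forward SDE
NUM_STEPS = 500


def score(x, t):
    """Exact score of p_t = N(DATA_MEAN, DATA_STD^2 + t) for the SDE dx = dw."""
    return -(x - DATA_MEAN) / (DATA_STD ** 2 + t)


def reverse_sde_sample(rng):
    """Euler-Maruyama integration of the reverse SDE from t = T_END down to 0."""
    dt = T_END / NUM_STEPS
    x = rng.gauss(0.0, math.sqrt(T_END))  # start from the noise prior N(0, T_END)
    for step in range(NUM_STEPS):
        t = T_END - step * dt
        # Reverse SDE with drift 0 and diffusion 1:
        # follow the score, then add fresh noise.
        x = x + score(x, t) * dt + math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x


rng = random.Random(0)
samples = [reverse_sde_sample(rng) for _ in range(2000)]
print("mean:", round(statistics.mean(samples), 2))
print("std: ", round(statistics.stdev(samples), 2))
```

The generated samples should have a mean and standard deviation close to those of the toy data distribution. In a real score-based model, the function `score` is replaced by the trained network $s_\theta(x, t)$.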

By now, two main state-of-the-art classes of generative models have been covered. In the next section, we discuss a novel tool called _DiffDock_, which employs a generative model to tackle the challenge of molecular docking in the field of cheminformatics.

### Diffusion-based docking model

The main concepts behind generative models have now been explained. It is important to note that applying a generative model to molecular docking is not as straightforward as described above. In this section, we discuss the challenges encountered when applying generative models, especially score-based generative models, to molecular docking and how these obstacles were overcome in a real application.

#### 1. Ligand pose manifold

To build a diffusion-based docking model, we first need a manifold that suits ligand poses. A ligand pose $L$ is an element of $\mathbb{R}^{3n}$, where $n$ is the number of atoms. If we run the forward diffusion directly in this space without restricting the degrees of freedom, the ligands end up with unreasonable bond lengths and angles, as in [_figure 4_](#fig4).

<a id='fig4'></a>

<img src="images/absurd_ligand.png"  style="margin-left: auto; margin-right: auto;">

*Figure 4:* 
Randomizing bond length and angles without keeping local structures fixed.

A solution to this problem is presented in the [DiffDock paper](https://arxiv.org/pdf/2210.01776v2.pdf). Inspired by traditional docking approaches, the authors start from ligands that are already embedded in 3D space using RDKit, which sets the bond lengths and angles of the atoms. Instead of treating a ligand as an element of Euclidean space, they describe a ligand pose by four main components.

1. Local structures such as bond lengths, bond angles, chirality, and ring structures are generated using RDKit and kept fixed in order to maintain the integrity of the predicted ligand and the model.

2. The position of the ligand is left flexible so that the ligand can find the pocket and fit into it. It is described by the 3D translation group $\mathbb{R}^3$.

3. The orientation of the ligand is described by a rotation $R \in SO(3)$, which corresponds to a 3D rigid rotation around the center of mass of the ligand.

<img src="images/rotation.gif"  style="width:300px;">


*Figure 5:* 
The GIF shows an example of the rotation of a methyl group in an ethane structure. Figure taken from [Proteopedia](https://proteopedia.org/wiki/index.php/Dihedral/Index).

4. The torsion angles are left flexible so that the ligand can fit into the pocket, where $\mathit{Torsions} \in \mathbb{T}^m$ describes the changes in torsion angles around the $m$ rotatable bonds of the ligand, with one copy of the 2D rotation group $SO(2)$ per bond.

<img src="images/Phipsi-AH.gif"  style="width:400px;">

*Figure 6:* 
The GIF illustrates torsion angles and how they change. As shown, a torsion angle ϕ is defined by four consecutively bonded atoms. The first three atoms and the last three atoms each define a half-plane, and the angle between these two half-planes is the torsion angle ϕ. Figure taken from [Proteopedia](https://proteopedia.org/wiki/index.php/Dihedral/Index).
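
The effect of the flexible components on the coordinates can be sketched with a hypothetical four-atom chain. This is an illustration only and not DiffDock code: the coordinates are made up, all atoms are given equal mass (so the centroid stands in for the center of mass), and the helper names are invented for this example. The sketch applies a rigid-body update (rotation plus translation) and a torsion update, and then checks that bond lengths stay fixed while the torsion angle changes.

```python
import math


def sub(a, b):
    return [a[i] - b[i] for i in range(3)]


def dot(a, b):
    return sum(a[i] * b[i] for i in range(3))


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def distance(a, b):
    return math.sqrt(dot(sub(a, b), sub(a, b)))


def rotate_about_axis(point, origin, axis, angle):
    """Rodrigues' formula: rotate `point` by `angle` (radians) around the
    line through `origin` with direction `axis`."""
    length = math.sqrt(dot(axis, axis))
    k = [component / length for component in axis]
    v = sub(point, origin)
    k_cross_v, k_dot_v = cross(k, v), dot(k, v)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [origin[i] + v[i] * cos_a + k_cross_v[i] * sin_a
            + k[i] * k_dot_v * (1.0 - cos_a) for i in range(3)]


def rigid_update(coords, translation, axis, angle):
    """Rotate all atoms around the centroid, then translate them."""
    centroid = [sum(atom[i] for atom in coords) / len(coords) for i in range(3)]
    rotated = [rotate_about_axis(atom, centroid, axis, angle) for atom in coords]
    return [[atom[i] + translation[i] for i in range(3)] for atom in rotated]


def torsion_update(coords, bond, moving_atoms, angle):
    """Rotate `moving_atoms` around the bond (i, j); other atoms stay fixed."""
    i, j = bond
    axis = sub(coords[j], coords[i])
    return [rotate_about_axis(atom, coords[j], axis, angle)
            if index in moving_atoms else list(atom)
            for index, atom in enumerate(coords)]


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle (radians) defined by four consecutive atoms."""
    b0, b1, b2 = sub(p0, p1), sub(p2, p1), sub(p3, p2)
    b1_length = math.sqrt(dot(b1, b1))
    b1_unit = [component / b1_length for component in b1]
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = [b0[i] - dot(b0, b1_unit) * b1_unit[i] for i in range(3)]
    w = [b2[i] - dot(b2, b1_unit) * b1_unit[i] for i in range(3)]
    return math.atan2(dot(cross(b1_unit, v), w), dot(v, w))


# Hypothetical planar four-atom chain (torsion angle 0).
ligand = [[-0.5, 0.9, 0.0], [0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.0, 0.9, 0.0]]
moved = rigid_update(ligand, translation=[5.0, -2.0, 1.0], axis=[0.0, 0.0, 1.0], angle=0.7)
twisted = torsion_update(ligand, bond=(1, 2), moving_atoms={3}, angle=math.pi / 3)

print("torsion before / after rigid update / after torsion update (degrees):")
print([round(math.degrees(dihedral(*pose)), 1) for pose in (ligand, moved, twisted)])
```

Both updates leave every bond length unchanged, which is the reason for diffusing over translations, rotations, and torsions instead of over raw atom coordinates.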

This parameterization introduces a new challenge: the same change of pose can be achieved by several different combinations of rotations and torsion angle changes. The strategy used in DiffDock is to _disentangle_ the degrees of freedom involved in docking, i.e. to isolate the modification of torsion angles from the other transformations, rotation and translation.

To make sure that torsion changes are independent of the rigid-body motion during docking, an RMSD alignment is performed after each torsion update, so that rotations and translations are orthogonal to torsion modifications.

With these components, ligand poses can be mapped to a submanifold $\mathcal{M}_c \subset \mathbb{R}^{3n}$ on which the diffusion process can easily be defined. On this submanifold $\mathcal{M}_c$, ligand poses are represented in $(m + 6)$ dimensions, where $m$ denotes the number of rotatable bonds.

Fortunately, there is a smooth mapping between the ligand pose submanifold and a product space. As a result, displacements within the manifold of ligand poses can be mapped to the product space, which is denoted as $\mathbb{P} = \mathbb{R}^3 \times SO(3) \times \mathbb{T}^m$.
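
As a quick worked example, consider a hypothetical ligand (the numbers are chosen for illustration and are not taken from the paper) with $n = 30$ atoms and $m = 7$ rotatable bonds. Diffusing over the product space instead of the raw coordinates reduces the number of dimensions as follows:

$$
\dim \mathbb{R}^{3n} = 3 \cdot 30 = 90, \qquad \dim \mathcal{M}_c = \underbrace{3}_{\text{translation}} + \underbrace{3}_{\text{rotation}} + \underbrace{7}_{\text{torsions}} = m + 6 = 13
$$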

#### 2. Product space diffusion

After mapping the ligand pose manifold to the product space, a score-based generative model with an SDE is trained with **score matching** according to [Song et al. 2019](https://arxiv.org/abs/1907.05600) to estimate the score of the diffusion kernel on the product space and to sample from it. This raises another problem: how to run the diffusion on the product space.

Most existing score-based generative models are designed for data in Euclidean space. However, [De Bortoli et al. 2022](https://arxiv.org/pdf/2202.02763.pdf) developed Riemannian score-based generative models (SGMs), which operate on Riemannian manifolds and make it possible to build SGMs on various manifolds.

The main idea is to consider the score model not as a vector field on Euclidean space, but as a vector field on the manifold: at every point of the manifold, the score and the score model are elements of the tangent space at that point, as represented in [_figure 7_](#fig7). 


<a id='fig7'></a>

<img src="images/tangent_space.png"  style="width:400px;">

*Figure 7:* The tangent space, denoted as ${T_xM}$, is the set of all possible tangent vectors $v$ at a point $x \in \mathcal{M}$.

As mentioned before, the product space is the product of three manifolds. The forward diffusion on the product space therefore proceeds independently on each manifold, according to [Rodolà et al., 2019](https://arxiv.org/abs/1809.10940), and the tangent space is the direct sum of the tangent spaces of the individual manifolds:

$$
T_g \mathbb{P} = T_r\mathbb{R}^3 \oplus T_R SO(3) \oplus T_\theta SO(2)^m 
$$
where $SO(2)^m = \mathbb{T}^m$. Therefore, we can sample from the diffusion kernel and regress against the true score independently within each group.
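
Independent diffusion on the components can be sketched for two of the three spaces. The sketch assumes a Gaussian kernel for the translation in $\mathbb{R}^3$ and a wrapped normal kernel for every torsion angle, which are common choices for these spaces; the rotation kernel on $SO(3)$ is more involved and omitted here. The function names, the truncation of the periodic sum, and the noise levels are assumptions made for this example and do not reproduce DiffDock's implementation.

```python
import math
import random


def wrap_angle(angle):
    """Map any angle to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def perturb_pose(translation, torsions, sigma_tr, sigma_tor, rng):
    """Noise the translation and every torsion angle independently."""
    new_translation = [r + sigma_tr * rng.gauss(0.0, 1.0) for r in translation]
    new_torsions = [wrap_angle(theta + sigma_tor * rng.gauss(0.0, 1.0))
                    for theta in torsions]
    return new_translation, new_torsions


def _image_terms(delta, sigma, num_images):
    """Shifted angles delta + 2*pi*d and their stabilised Gaussian weights."""
    shifts = [delta + 2.0 * math.pi * d for d in range(-num_images, num_images + 1)]
    exponents = [-(s ** 2) / (2.0 * sigma ** 2) for s in shifts]
    largest = max(exponents)
    weights = [math.exp(e - largest) for e in exponents]
    return shifts, weights, largest


def wrapped_normal_log_density(delta, sigma, num_images=10):
    """log p(delta) of a wrapped normal, truncated to 2*num_images + 1 images."""
    _, weights, largest = _image_terms(delta, sigma, num_images)
    return largest + math.log(sum(weights)) - math.log(sigma * math.sqrt(2.0 * math.pi))


def wrapped_normal_score(delta, sigma, num_images=10):
    """d/d(delta) log p(delta): the regression target for one torsion angle."""
    shifts, weights, _ = _image_terms(delta, sigma, num_images)
    return sum(-s / sigma ** 2 * w for s, w in zip(shifts, weights)) / sum(weights)


rng = random.Random(0)
translation, torsions = perturb_pose([0.0, 0.0, 0.0], [3.0, -3.0, 0.5],
                                     sigma_tr=1.0, sigma_tor=2.0, rng=rng)
print("noised translation:", [round(r, 2) for r in translation])
print("noised torsions:   ", [round(theta, 2) for theta in torsions])
print("torsion score at 0.8 rad:", round(wrapped_normal_score(0.8, 1.0), 4))
```

Because each component is noised separately, the true score of each kernel can be computed separately as well, which is what makes the independent regression per group possible.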

#### 3. Model architecture

The confidence model and the score model are built from E(3)-equivariant neural networks using [e3nn](https://arxiv.org/abs/2207.09453). For more information on E(3)-equivariant networks, you can refer to [__Talktorial T036__](https://projects.volkamerlab.org/teachopencadd/talktorials/T036_e3_equivariant_gnn.html).

The score-based generative model simulates the reverse diffusion: starting from a "noisy" ligand pose relative to the protein, it uses the reverse SDE to denoise the pose and find the right binding pocket.

The confidence model is responsible for ranking the generated ligand poses, whose number can be chosen arbitrarily. It is trained as a classifier to rank the poses and find the best generated conformers, as demonstrated in [_figure 8_](#fig8).

Unlike traditional docking tools, DiffDock does not predict the affinity of a ligand to a protein. It predicts a confidence score, which indicates how reliable a generated pose is. The higher the confidence score, the better the expected quality of the generated conformer.

<a id='fig8'></a>

![DiffDock workflow](images/DiffDock.png)

*Figure 8:* 
Overview of the DiffDock workflow. Left: The model takes as input the separate ligand and protein
structures. Center: Score-based generative model where random initial poses are denoised via a reverse SDE over translational, rotational, and torsional degrees of freedom. Right: The sampled poses are ranked by the
confidence model to produce a final prediction and confidence score.
Figure and description taken from: [arXiv 2023](https://arxiv.org/pdf/2210.01776v2.pdf).

DiffDock has shown a significant improvement in comparison to traditional docking. It achieved a 38% top-1 success rate, i.e. predictions with an RMSD below 2 Å, whereas the best traditional docking tool in the comparison reached 23%.

## Practical

* Data preparation
    - Download PDB structure
    - Prepare input file
* DiffDock implementation
* Denoising visualization

In the practical part, we apply _DiffDock_, a molecular docking tool trained as a score-based generative model. It is open source and published on [GitHub](https://github.com/gcorso/DiffDock).

It is highly recommended to create the environment from the `environment.yml` file with Anaconda so that the code executes without errors.

#### Import dependencies

In [7]:
import os
import urllib.request  # `import urllib` alone does not guarantee that `urllib.request` is available
from pathlib import Path

from IPython.display import display, HTML
from PIL import Image
from pymol import cmd

In [8]:
HERE = Path(_dh[-1])
DATA = HERE / "data"

### Data preparation

Before starting _DiffDock_ implementation, input data has to be prepared. 

#### Prepare protein structure

To download a protein structure, set `protein_id` to the PDB code of the structure you want to use. You can also use your own protein structure: place the PDB file in the `DATA` directory and set `protein_id` to its file name without the `.pdb` extension.

In [9]:
#By default, it's set to transcription inhibitor, PDB ID is `6moa`

protein_id = '6moa'

In [10]:
# Download the PDB structure if it is not yet in the `DATA` directory.

pdb_file = DATA / f"{protein_id}.pdb"
if not pdb_file.exists():
    DATA.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(f"https://files.rcsb.org/download/{protein_id}.pdb", pdb_file)
else:
    print(f'PDB structure with ID {protein_id} already downloaded.')

PDB structure with ID 6moa already downloaded.


It is highly recommended to prepare your query protein structure before you start working with it. Many protein structures contain incomplete or inaccurate information, such as missing hydrogen atoms, wrong protonation states or tautomers, or misplaced hydrogen atoms of the protein and the corresponding ligand. Ignoring such inaccuracies can strongly affect the reliability of your docking results.

An example of an open source tool that you can use to prepare your protein target is [Protoss](https://proteins.plus/help/protoss). It can be accessed through the [Protein Plus server](https://proteins.plus/). You can upload your protein of interest, and the tool predicts missing hydrogen atoms, determines reasonable protonation states, and provides the coordinates of the hydrogen atoms in the binding pocket of the co-crystallized structure.

To prepare the query protein, follow these steps:
1. Upload your query protein and follow the instructions of the [Protoss tool](https://proteins.plus/help/protoss).
2. Once the protein structure is prepared, download it and save it in the `DATA` directory.
3. Rename the saved file by appending `_processed` to the `PDB ID`, as shown in the next cell.
4. If you followed all previous steps, execute the next cell.

In [11]:
# Switch the protein ID to the prepared protein structure, if it is available

if (DATA / f"{protein_id}_processed.pdb").exists():
    protein_id = f"{protein_id}_processed"

#### Specify your query ligand

In this step, you specify your ligand input, either as the path to an SDF file or as a SMILES string. To use an SDF file, set the `ligand` variable to the path of the SDF file.

Unlike traditional docking tools, _DiffDock_ processes one ligand per query. Therefore, if you want to dock more than one ligand to the same protein, add each of them to the `ligand` list and a matching ID to the `molecule_id` list.

In [12]:
# set ligand to ligand SMILES/SDF's path here
ligand = ["O=C(O)c1cc(/OCc2cccc3ccccc23)ccc1O"]

#Add molecule ID to this list
molecule_id = ['Molecule_1']

### DiffDock implementation

The first step is to download the _DiffDock_ software from its [GitHub repository](https://github.com/gcorso/DiffDock.git).
To keep it compatible with the environment file, we check out a specific version (commit) of _DiffDock_.

In [13]:
if "DiffDock" not in os.listdir(HERE):
    !git clone https://github.com/gcorso/DiffDock
else:
    print("DiffDock is already cloned.")

# check out a specific version (commit) of DiffDock
os.chdir("DiffDock")
!git checkout 3d45728415d2603dbceef5f6952817157f62d216
os.chdir("../")

DiffDock is already cloned.
HEAD is now at 3d45728 fix inference_utils.py #90


#### Configure your docking settings

The following settings were found to offer the best trade-off between computational cost and accuracy of the results.

- __ligands_per_complex__ is the number of poses generated for every complex.
- __inference_steps__ is the number of steps into which the denoising schedule is divided.
- __actual_steps__ is the number of denoising steps that are actually performed.
- __batch_size__ is the number of samples processed per batch. Reducing the batch size decreases the computational resources required for execution.

In [14]:
# The default settings for DiffDock are as follows

ligands_per_complex = "5"
inference_steps = "20"
actual_steps = "18"
batch_size = "10"

The first execution takes around 15 minutes because it precomputes and caches the distribution tables for $SO(3)$ and the torus $\mathbb{T}$. Afterward, it automatically downloads the checkpoints of the evolutionary scale modeling (ESM) protein language model, which can take around 10 minutes depending on your internet connection.

_DiffDock_ generates ESM embeddings of the protein sequence, as the log output below shows. ESM can also be used to predict the 3D protein structure, which is useful if your input is in FASTA format and you want to predict the 3D protein structure as well. For further information on how and why to use ESM, see [ESM's repository](https://github.com/facebookresearch/esm).

In [15]:
os.chdir("DiffDock/")
# `mol_id` is not passed on: every run uses `protein_id` as the complex name
for mol_id, smiles in zip(molecule_id, ligand):
    diffdock_cmd = f"python -m inference --protein_path ../data/{protein_id}.pdb --ligand '{smiles}' --complex_name {protein_id} --out_dir {DATA} --inference_steps {inference_steps} --samples_per_complex {ligands_per_complex} --save_visualisation --batch_size {batch_size} --actual_steps {actual_steps} --no_final_step_noise"
    os.system(diffdock_cmd)
os.chdir("../")

Generating ESM language model embeddings
Processing 1 of 1 batches (1 sequences)
HAPPENING | confidence model uses different type of graphs than the score model. Loading (or creating if not existing) the data for the confidence model now.
Size of test dataset:  1


1it [00:40, 40.66s/it]


Failed for 0 complexes
Skipped 0 complexes
Results are in /home/ibrahim/Github/CADDSeminar_2023/notebook/T02_DiffusionBasedDocking/data


### Denoising visualization
In this step, we visualize the denoising process of the generated ligands within a Jupyter notebook cell. With the **--save_visualisation** argument, the model saves the inference steps of every ligand in one PDB file in the results folder, which contains the ligand coordinates of every step.

To visualize the process here, we follow three main steps:

- Split ligands in PDB file for every inference step to separate PDB files 

In [16]:
def split_ligands(pdb_path, rank):
    """
    Separate ligands of every inference step in a pdb file to individual pdb files

    @args: 
    pdb_path : .pdb file with multiple coordinates of a ligand for every inference step 
    
    rank : the rank of the predicted molecule (only used for the status message)

    @return:
    None; individual pdb files for each inference step are written to the
    `DATA/<protein_id>/visualization` directory
    """
    
    # read pdb file
    with open(pdb_path, "r") as pdb_file:
        pdb_string = pdb_file.read()

    # split into one chunk per MODEL record; text before the first record is dropped
    pdb_models = pdb_string.strip().split("MODEL")[1:]

    # create `visualization` directory
    output_dir = DATA / protein_id / "visualization"
    output_dir.mkdir(parents=True, exist_ok=True)

    # write individual ligands, numbered from 1
    for model, pdb_model in enumerate(pdb_models, start=1):
        with open(output_dir / f"{model}.pdb", "w") as pdb_output_file:
            # restore the `MODEL` keyword that was consumed by the split
            pdb_output_file.write("MODEL" + pdb_model)
    
    print(f"Rank {rank} ligand in different inference steps is now separated into individual PDB files.")

- Take screenshot of every ligand in every inference step with the query protein

In [17]:
def create_screenshots(inference_steps, rank):
    """
    Load PDB file of a query protein and every ligand from the inference steps and create an image for every inference step.

    @args:
    ------
    inference_steps : number of denoising steps (`inference_steps + 1` ligand poses are loaded)

    rank : the rank of the predicted molecule

    @return:
    --------
    None; screenshots of the query protein, the reference ligand, and the ligand pose of every denoising step are saved as PNG files
    """

    visualization_dir = DATA / protein_id / "visualization"
    
    for model in range(1, int(inference_steps) + 2):

        # load protein and reference ligand and set display preferences
        cmd.load(str(DATA / f"{protein_id}.pdb"), 'protein')
        cmd.select('ligand', 'hetatm')
        cmd.color('white', 'protein')
        cmd.show('surface', 'protein')
        cmd.hide('cartoon', 'protein')
        cmd.set('transparency', 0.5, 'protein')
        cmd.color('green', 'ligand')

        # load the predicted ligand pose of this inference step
        ligand_file = visualization_dir / f"{model}.pdb"
        cmd.load(str(ligand_file), 'predicted_ligand')

        # change predicted ligand color to yellow
        cmd.color('yellow', 'predicted_ligand')

        cmd.bg_color("white")
        cmd.orient('protein')
        cmd.remove("solvent") 
        # create a PNG image
        cmd.png(str(visualization_dir / f"image_{model}_{rank}.png"), dpi=300)
        cmd.delete('all')

        # clean directory from individual pdb files 
        os.remove(ligand_file)

    print(f"Screenshots are taken for rank {rank} ligand")

- Create a graphics interchange format (GIF) for every generated ligand in different inference steps with the query protein

In [18]:
def create_gif(inference_steps, rank):
    """
    Create a graphics interchange format (GIF) for every generated ligand in different inference steps with the query protein

    @args:
    ------
    inference_steps : number of denoising steps (`inference_steps + 1` screenshots are combined)

    rank : the rank of the predicted molecule

    @return:
    --------
    None; the GIF of the denoising process of the ligand-protein complex is saved in the `visualization` directory
    """

    visualization_dir = DATA / protein_id / "visualization"
        
    # create a list that contains all paths of ligand-protein screenshots
    image_files = [visualization_dir / f"image_{i}_{rank}.png" for i in range(1, int(inference_steps) + 2)]
    frames = [Image.open(image) for image in image_files]
    
    # save frames of the GIF as an endless loop
    frames[0].save(visualization_dir / f"reverse_diffusion_rank_{rank}.gif", append_images=frames[1:], save_all=True, duration=100, disposal=2, loop=0)

    # close the images, then clean the folder from screenshots
    for frame, image in zip(frames, image_files):
        frame.close()
        os.remove(image)

    print(f"GIF of rank {rank} ligand is now ready!")

Note that visualizing the ligands might take some time, especially for more than three generated ligands, because high-quality screenshots are generated.

In [27]:
# skip this step if the "visualization" folder (with the GIFs) already exists
if "visualization" not in os.listdir(DATA / protein_id):
    for rank in range(1,int(ligands_per_complex)+1):
        
        split_ligands(f"{DATA}/{protein_id}/rank{rank}_reverseprocess.pdb", rank)
        create_screenshots(inference_steps, rank)
        create_gif(inference_steps, rank)

Rank 1 ligand in different inference steps is now separated into individual PDB files.


In [20]:
gif_paths = [(f"{str(DATA)}/{protein_id}/visualization/reverse_diffusion_rank_{rank}.gif") for rank in range(1, int(ligands_per_complex)+1)]

# Generate HTML code to display the GIFs side by side
html_code = '<div style="display:flex;">'
for gif_path in gif_paths:
    html_code += f'<img src="{gif_path}" style="margin-right: 10px;">'
html_code += '</div>'

# Display the HTML code
display(HTML(html_code))

## Discussion

In this talktorial, we explained the fundamentals of two main types of generative models and highlighted their ability to generate new data from minimal input information. We then presented a case study from the field of cheminformatics that applies such a model to molecular docking. There is still room for improvement in this field, primarily because chemical data has distinct characteristics and poses challenges that other types of input data do not.

In the presented case study, _DiffDock_, the researchers had to find a way to introduce noise to the chemical data, which is a crucial step in implementing a score-based generative model. It was also challenging to identify a suitable manifold that can accommodate the noised chemical data and allows diffusion across its product space.

_DiffDock_ demonstrated a significant improvement in docking power, with a 38% top-1 success rate (ligand RMSD below 2 Å) on the PDBBind benchmark. However, it is not without limitations. Like many molecular docking approaches, it assumes that the given protein structure is a holo-structure (bound structure). Additionally, some proteins have cryptic binding sites, which _DiffDock_ does not detect. Addressing these limitations can increase the success rate of molecular docking approaches in future research.
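
Docking success rates such as the one quoted above are usually reported as the fraction of predicted poses whose ligand RMSD to the crystal pose is below 2 Å. The following is a minimal sketch of that metric, not the DiffDock evaluation code. The function names `ligand_rmsd` and `success_rate` and the toy coordinates are hypothetical, and the sketch assumes identical atom ordering in both poses, without alignment or symmetry correction.

```python
import numpy as np


def ligand_rmsd(predicted, reference):
    """RMSD (in Å) between two (n_atoms, 3) coordinate arrays with identical atom order."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    squared_distances = np.sum((predicted - reference) ** 2, axis=1)
    return float(np.sqrt(np.mean(squared_distances)))


def success_rate(rmsds, threshold=2.0):
    """Fraction of poses with an RMSD below the threshold (2 Å is assumed here)."""
    return float(np.mean(np.asarray(rmsds) < threshold))


# Toy coordinates (hypothetical, three atoms) standing in for a crystal ligand pose
reference = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0]])
good_pose = reference + np.array([0.5, 0.0, 0.0])  # rigid shift of 0.5 Å
bad_pose = reference + np.array([0.0, 3.0, 0.0])   # rigid shift of 3.0 Å

rmsds = [ligand_rmsd(good_pose, reference), ligand_rmsd(bad_pose, reference)]
print(rmsds, success_rate(rmsds))
```

In this toy case, one of the two poses counts as a success, because a rigid shift changes every atom position by the same distance and the RMSD therefore equals the shift length.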

## Quiz

1. What is special about _DiffDock_ in comparison to traditional docking methods? Would you consider these differences to be advantageous or disadvantageous, and why?
2. What is the unit of the predicted _DiffDock_ output, and how does it compare to the outputs of traditional molecular docking tools? Is this output preferable, and why?
3. Why does _DiffDock_ assume that the ligand-protein structure is a holo-structure? How could this assumption be overcome?

## Appendix
<a id='appendix'></a>

In this supplementary section, we provide detailed explanations of important concepts that are required for a comprehensive understanding of the discussed topics.

The U-Net architecture is based on an encoder-decoder structure. During encoding, the spatial dimensions decrease while the number of channels increases, which keeps the important features of the input data. In contrast, during decoding the spatial dimensions increase while the number of channels decreases, so that the output has the same spatial dimensions as the input data, as illustrated in [figure 9](#fig9).

<a id='fig9'></a>

<img src="images/Unet-architecture.png"  style="margin-left: auto; margin-right: auto;">

*Figure 9:* 
An overview of U-Net architecture.
Figure is taken from: [AI Summer](https://theaisummer.com/static/fa507fda71846a516801bccb19474aec/0012b/Unet-architecture.png).
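
The change of dimensions described above can be traced with a few lines of Python. This is only a shape calculation, not a network implementation. It assumes that every encoder level halves the spatial size and doubles the number of channels, and that the decoder mirrors the encoder. The function name `unet_shapes` and its default values are hypothetical, and skip connections are omitted.

```python
def unet_shapes(in_size=64, base_channels=32, depth=3):
    """Trace (channels, height, width) through a simplified U-Net."""
    encoder = []
    channels, size = base_channels, in_size
    for _ in range(depth):
        encoder.append((channels, size, size))
        # Downsampling: halve the spatial size, double the channels
        channels, size = channels * 2, size // 2

    bottleneck = (channels, size, size)

    decoder = []
    for _ in range(depth):
        # Upsampling: double the spatial size, halve the channels
        channels, size = channels // 2, size * 2
        decoder.append((channels, size, size))

    return encoder, bottleneck, decoder


encoder, bottleneck, decoder = unet_shapes()
for name, shapes in [("encoder", encoder), ("bottleneck", [bottleneck]), ("decoder", decoder)]:
    for n_channels, height, width in shapes:
        print(f"{name:>10}: channels={n_channels}, height={height}, width={width}")
```

The last decoder level has the same spatial size as the first encoder level, which is why the network output can have the same spatial dimensions as the input.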

#### Score model training 
<a id='scoremodel'></a>
To train a score-based model, we need to compare the score predicted by the model with the score of the actual data distribution. This is done by minimizing a function that measures the distance between the true score and the predicted score, such as the Fisher divergence, which is defined as:

$$
\mathbb{E}_{p(x)}\left[\lVert \nabla_x \log p(x) - s_\theta(x) \rVert_2^2\right] \tag{9}
$$

The ground-truth data score is unknown, which makes it infeasible to compute the Fisher divergence directly. However, **score matching** makes training feasible, because it minimizes the Fisher divergence without requiring an estimate of the true data score.
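
To make the Fisher divergence more concrete, the following minimal sketch estimates it by Monte Carlo sampling for a toy case in which the true score is known analytically. This is an illustration only: in practice the true score is not available, which is exactly why score matching is needed. The assumptions are that the data follow a standard normal distribution and that the model is a unit-variance Gaussian with mean `mu`. The names `true_score`, `model_score`, and `fisher_divergence` are hypothetical.

```python
import numpy as np


def true_score(x):
    """Score of the standard normal N(0, 1): d/dx log p(x) = -x."""
    return -x


def model_score(x, mu):
    """Score of the toy model N(mu, 1): d/dx log q(x) = -(x - mu)."""
    return -(x - mu)


def fisher_divergence(samples, mu):
    """Monte Carlo estimate of E_p[(true score - model score)^2] in one dimension."""
    difference = true_score(samples) - model_score(samples, mu)
    return float(np.mean(difference ** 2))


rng = np.random.default_rng(seed=0)
samples = rng.standard_normal(10_000)  # draws from the data distribution p(x)

fd_matched = fisher_divergence(samples, mu=0.0)  # model equals the data distribution
fd_shifted = fisher_divergence(samples, mu=1.5)  # model mean is off by 1.5
print(fd_matched, fd_shifted)
```

In this toy setting the two scores differ by the constant `mu`, so the divergence equals `mu ** 2`: it vanishes when the model matches the data distribution and grows as the model mean moves away.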

Training alone, however, does not achieve the main objective of a score-based model, which is generating new data. For that purpose, [Langevin dynamics](https://en.wikipedia.org/wiki/Langevin_dynamics) is used as a sampling method that generates new data using only the score function, as shown in [figure 10](#fig10).


<a id='fig10'></a>
<img src="images/smld.jpg"  style="margin-left: auto; margin-right: auto;">

*Figure 10:* 
An overview of score-based generative modeling with score matching and Langevin dynamics.
Figure and description are taken from: [Yang-Song blogpost](https://yang-song.net/blog/2021/score/#connection-to-diffusion-models-and-others).
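
Sampling with Langevin dynamics can be sketched as follows. Starting from arbitrary points, each step applies the update $x \leftarrow x + \epsilon \, s(x) + \sqrt{2\epsilon} \, z$ with $z \sim \mathcal{N}(0, 1)$, so only the score function is needed. This is a minimal sketch under stated assumptions: the analytical score of a one-dimensional Gaussian stands in for a trained score model, and the step size, number of steps, target parameters, and function names are illustrative choices, not values from the talktorial.

```python
import numpy as np

TARGET_MEAN = 3.0
TARGET_STD = 0.5


def score(x):
    """Analytical score of N(TARGET_MEAN, TARGET_STD^2), standing in for a trained model."""
    return -(x - TARGET_MEAN) / TARGET_STD**2


def langevin_sampling(score_fn, n_samples=5000, n_steps=500, step_size=1e-2, seed=0):
    """Generate samples using only the score function (unadjusted Langevin dynamics)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5.0, 5.0, size=n_samples)  # arbitrary initialization
    for _ in range(n_steps):
        noise = rng.standard_normal(n_samples)
        # Move along the score and add Gaussian noise scaled by sqrt(2 * step_size)
        x = x + step_size * score_fn(x) + np.sqrt(2.0 * step_size) * noise
    return x


samples = langevin_sampling(score)
print(samples.mean(), samples.std())
```

The sample mean and standard deviation should end up close to the target values. Because the step size is finite, the match is only approximate, and it improves as the step size decreases and the number of steps increases.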