<div class="alert alert-block alert-info">

<b>Thank you for contributing to TeachOpenCADD!</b>

</div>

<div class="alert alert-block alert-info">

<b>Set up your PR</b>: Please check out our <a href="https://github.com/volkamerlab/teachopencadd/issues/41">issue</a> on how to set up a PR for new talktorials, including standard checks and TODOs.

</div>

# Diffusion-based docking models

**Note:** This talktorial is a part of TeachOpenCADD, a platform that aims to teach domain-specific skills and to provide pipeline templates as starting points for research projects.

Authors:

- Hamza Ibrahim, CADD seminars 2023, Universität des Saarlandes (UdS)
- Michael Bockenköhler, 2023,  [Volkamer lab](https://volkamerlab.org), Universität des Saarlandes (UdS)
- Andrea Volkamer, 2023,  [Volkamer lab](https://volkamerlab.org), Universität des Saarlandes (UdS)

## Aim of this talktorial

This talktorial introduces a class of generative models called diffusion probabilistic models. You will learn what diffusion models are and how they are applied to molecular docking.

### Contents in *Theory*


* Diffusion generative models (DGM).
    1. Forward process
    2. Reverse process
    3. Training a diffusion model
* Diffusion-based docking models.
    1. Ligand pose manifold
    2. Product space diffusion

### Contents in *Practical*

* Data preparation.
    - Download PDB structure
    - Prepare input file
* Diffusion-based docking model.
    - DiffDock implementation
    

### References

* Score-based generative modeling through stochastic differential equations: [<i>arXiv</i> (2021)](https://arxiv.org/pdf/2011.13456.pdf) 
* Equivariant Graph Neural Networks: [<i>arXiv</i> (2022)](https://arxiv.org/pdf/2102.09844.pdf)
* Structure-based Drug Design with Equivariant Diffusion Models: [<i>arXiv</i> (2022)](https://arxiv.org/pdf/2210.13695.pdf)
* DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking: [<i>arXiv</i> (2023)](https://arxiv.org/pdf/2210.01776v2.pdf)
* [Diffusion Model Clearly Explained!](https://medium.com/@steinsfu/diffusion-model-clearly-explained-cd331bd41166)
* Deep Unsupervised Learning using Nonequilibrium Thermodynamics: [Sohl-Dickstein et al. <i>arXiv</i> (2015)](https://arxiv.org/pdf/1503.03585.pdf)
* Generative Modeling by Estimating Gradients of the Data Distribution: [Song et al. <i>arXiv</i> (2019)](https://arxiv.org/abs/1907.05600)
* Denoising Diffusion Probabilistic Models [Ho et al. <i>arXiv</i> (2020)](https://arxiv.org/abs/2006.11239)
* [Talktorial 008](https://projects.volkamerlab.org/teachopencadd/talktorials/T008_query_pdb.html)


## Theory

### Diffusion generative models (DGM).

Generative models are a category of machine learning models that can generate new data resembling a given training data set. A diffusion probabilistic model (DPM) is a type of generative model inspired by non-equilibrium thermodynamics ([Sohl-Dickstein et al. 2015](https://arxiv.org/pdf/1503.03585.pdf)). A DPM depends on two complementary processes, each represented by a sequence of random variables organized as a Markov chain:
1. Forward Diffusion Process → add noise to input data.
2. Reverse Diffusion Process → denoise noised data.

#### 1. Forward process

The forward process adds Gaussian noise to the input data $\mathbf{x}_0$ sequentially over $T$ steps. As $T \to \infty$, $\mathbf{x}_T$ becomes pure noise, as shown in _figure 1_. Each state $\mathbf{x}_t$ is sampled from its predecessor $\mathbf{x}_{t-1}$ as follows:
$$
q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_t = \sqrt{1 - \beta_t}\,\mathbf{x}_{t-1}, \boldsymbol{\Sigma}_t = \beta_t \mathbf{I}),
$$
where $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ denotes the distribution of the next state $\mathbf{x}_t$, and $\beta_t \in (0, 1)$ is the variance of the noise added at step $t$.

$\boldsymbol{\mu}_t$ and $\boldsymbol{\Sigma}_t$ represent the mean and covariance of the next-state distribution, respectively.

Using the [reparametrization trick](https://medium.com/@steinsfu/diffusion-model-clearly-explained-cd331bd41166#228f), a closed-form formula can be derived that allows us to sample $\mathbf{x}_t$ at any time step directly from $\mathbf{x}_0$. This makes the forward diffusion process much faster:
$$
\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon,
$$

where $\bar{\alpha}_t = \prod_{s = 1}^{t}(1 - \beta_s)$, and $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ merges the per-step noise terms $\epsilon_0, \dots, \epsilon_{t-2}, \epsilon_{t-1} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ into a single Gaussian.
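The closed-form sampling above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation of any particular library: the function names are made up for this example, and the linear schedule from $10^{-4}$ to $0.02$ over 1000 steps is only an assumed, commonly used choice.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variances beta_1..beta_T (illustrative values, an assumption)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s) for every step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def sample_x_t(x0, t, alpha_bars, rng):
    """Sample x_t directly from x_0 (t is 1-based). Returns (x_t, eps)."""
    a_bar = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a_bar) * xi + math.sqrt(1.0 - a_bar) * ei for xi, ei in zip(x0, eps)]
    return x_t, eps


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

x0 = [1.0, -0.5, 2.0]  # toy data point
x_early, _ = sample_x_t(x0, 10, alpha_bars, rng)    # still close to x0
x_late, _ = sample_x_t(x0, 1000, alpha_bars, rng)   # essentially pure noise
```

Because $\bar{\alpha}_t$ shrinks towards zero as $t$ grows, the contribution of $\mathbf{x}_0$ fades and the sample approaches a standard normal draw.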

#### 2. Reverse process

Unfortunately, the forward process cannot simply be inverted to recover $\mathbf{x}_0$ from $\mathbf{x}_t$, because the true reverse distribution $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ is intractable. Instead, the **reverse diffusion process** approximates $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ with a deep learning model (e.g. a neural network) with parameters $\theta$. The model predicts the conditional probability distribution $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$, which is modeled as a Gaussian distribution:

$$
p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)).
$$

By learning these conditional probability densities, the model reconstructs the original image $\mathbf{x}_0$ from the noisy image $\mathbf{x}_t$ step by step, as illustrated in _figure 1_. This allows meaningful information to be extracted from the noisy representation.


![Forward and reverse diffusion processes](images/basics_dgm.png)

*Figure 1:* 
Black arrows represent the forward diffusion process, while the blue arrow represents the reverse diffusion process.
Figure taken from: [Medium article](https://medium.com/@steinsfu/diffusion-model-clearly-explained-cd331bd41166).

#### 3. Training a diffusion model.

To train a diffusion model effectively, a loss function has to be defined. The combination of $q$ and $p_\theta$ in a DPM closely resembles the framework of a variational autoencoder (VAE). Therefore, the model is optimized in the same way, by minimizing a variational bound on the negative log-likelihood of the training data set. [Ho et al. (2020)](https://arxiv.org/pdf/2006.11239.pdf) simplified the loss function to:
$$
L_t^{simple} = \mathbb{E}_{\mathbf{x}_0, t, \epsilon}\left[\lVert \epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, t) \rVert^2\right]
$$
where:

$\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is the actual noise added, which follows a standard normal distribution.

$\epsilon_\theta(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, t) = \epsilon_\theta(\mathbf{x}_t, t)$ denotes the noise predicted by the neural network, with $\mathbf{x}_t$ obtained through the reparametrization trick mentioned before.

Experimentally, optimizing this simplified loss, i.e. the mean squared error (MSE) between the true and the predicted noise, was found to outperform optimizing the original evidence lower bound (ELBO).
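To make the simplified loss concrete, the following sketch evaluates one Monte Carlo sample of it for a toy data point. The two "models" are hypothetical stand-ins for a trained network: one always predicts zero noise, the other is an oracle that knows $\mathbf{x}_0$ and therefore recovers the noise exactly. All names and values are assumptions made for illustration.

```python
import math
import random


def simple_loss(x0, a_bar_t, t, eps_model, rng):
    """One Monte Carlo sample of || eps - eps_model(x_t, t) ||^2 for a given x_0 and step t."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    # Reparametrization trick: build x_t from x_0 and the sampled noise
    x_t = [math.sqrt(a_bar_t) * xi + math.sqrt(1.0 - a_bar_t) * ei for xi, ei in zip(x0, eps)]
    eps_pred = eps_model(x_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


x0 = [1.0, -0.5, 2.0]  # toy data point
a_bar_t = 0.5          # assumed value of alpha_bar at the chosen step
t = 100                # assumed step index, only passed through to the model


def zero_model(x_t, t):
    """Hypothetical untrained model: always predicts zero noise."""
    return [0.0 for _ in x_t]


def oracle_model(x_t, t):
    """Hypothetical perfect model: inverts the reparametrization using the known x_0."""
    return [(xi - math.sqrt(a_bar_t) * x0i) / math.sqrt(1.0 - a_bar_t) for xi, x0i in zip(x_t, x0)]


loss_zero = simple_loss(x0, a_bar_t, t, zero_model, random.Random(0))
loss_oracle = simple_loss(x0, a_bar_t, t, oracle_model, random.Random(0))
```

In actual training, the expectation is approximated by averaging this quantity over mini-batches of data points, random steps $t$ and random noise draws, and the gradient with respect to $\theta$ is used to update the network.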

[Ho et al. 2020](https://arxiv.org/pdf/2006.11239.pdf) kept the variance fixed in their simplified loss function and let the network learn only the mean. However, [Nichol and Dhariwal 2021](https://arxiv.org/abs/2102.09672) extended the network to learn the variance as well.

A [U-Net](https://theaisummer.com/unet-architectures/) is usually used as the network architecture, as done by [Ho et al. 2020](https://arxiv.org/pdf/2006.11239.pdf). Its output has the same size as its input.
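With the variance kept fixed at $\beta_t$, a single reverse step can be written in terms of the predicted noise as $\mathbf{x}_{t-1} = \frac{1}{\sqrt{1 - \beta_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\epsilon_\theta(\mathbf{x}_t, t)\right) + \sqrt{\beta_t}\,\mathbf{z}$ with $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, which corresponds to the sampling procedure described by Ho et al. (2020). The sketch below runs this loop on a toy problem. The schedule, the step count and the oracle noise predictor (exact only for a data set that consists of a single point) are assumptions for illustration, not a trained model.

```python
import math
import random


def reverse_step(x_t, t, betas, alpha_bars, eps_model, rng):
    """One ancestral sampling step x_t -> x_{t-1} with fixed variance beta_t (t is 1-based)."""
    beta_t = betas[t - 1]
    a_bar_t = alpha_bars[t - 1]
    eps_pred = eps_model(x_t, t)
    coef = beta_t / math.sqrt(1.0 - a_bar_t)
    mean = [(xi - coef * ei) / math.sqrt(1.0 - beta_t) for xi, ei in zip(x_t, eps_pred)]
    if t == 1:
        return mean  # no noise is added in the final step
    sigma_t = math.sqrt(beta_t)
    return [mi + sigma_t * rng.gauss(0.0, 1.0) for mi in mean]


# Toy setup: 50 steps (assumed schedule) and a data set consisting of the single point x0_true
betas = [1e-4 + i * (0.2 - 1e-4) / 49 for i in range(50)]
alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

x0_true = [1.0, -0.5, 2.0]


def oracle_eps_model(x_t, t):
    """Hypothetical stand-in for a trained network: exact noise for a one-point data set."""
    a_bar = alpha_bars[t - 1]
    return [(xi - math.sqrt(a_bar) * x0i) / math.sqrt(1.0 - a_bar) for xi, x0i in zip(x_t, x0_true)]


rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in x0_true]  # start from pure noise x_T
for t in range(len(betas), 0, -1):
    x = reverse_step(x, t, betas, alpha_bars, oracle_eps_model, rng)
```

With the oracle in place of a network, the chain ends at the single training point, which makes the example easy to check. A real model would instead produce new samples from the learned data distribution.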

### Diffusion-based docking model.

#### 1. Ligand pose manifold.

To build a diffusion-based docking model, we first need a manifold that suits ligand poses. A ligand pose with $n$ atoms is a point $L \in \mathbb{R}^{3n}$. If we run the forward diffusion in this space without restricting the degrees of freedom, the ligands end up with unreasonable bond lengths and angles.

A solution to this problem is presented in the [DiffDock paper](https://arxiv.org/pdf/2210.01776v2.pdf). Inspired by traditional docking approaches, the authors start from a ligand that is already embedded in 3D space using RDKit, which fixes its bond lengths and angles. Instead of treating a ligand as an element of Euclidean space, they describe a ligand pose by the following components, which map it onto a submanifold $M_c \subset \mathbb{R}^{3n}$:
1. Local structures such as bond lengths, bond angles, chirality and ring structures are kept fixed to maintain the integrity of the molecule. Diffusion therefore occurs over an $(m+6)$-dimensional manifold, where $m$ is the number of rotatable bonds.
2. The position of the ligand is left flexible in $\mathbb{R}^3$ so that it can find the pocket.
3. The rotation, where $Rotation \in SO(3)$, corresponds to rotating the ligand around its center of mass.
4. The torsion angles are flexible so that the ligand can fit into the pocket, where $Torsions \in \mathbb{T}^m$.

The manifold of ligand poses can now be mapped to the product space $\mathbb{P} = \mathbb{R}^3 \times SO(3) \times \mathbb{T}^m$.
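The following sketch shows what an element of this product space looks like in code: a translation vector, a rotation and one angle per rotatable bond, giving $3 + 3 + m = m + 6$ degrees of freedom. The class and function names are hypothetical, and the rotation is represented as a unit quaternion, which is one of several possible parameterizations of $SO(3)$ and not necessarily the one used by DiffDock.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class PoseUpdate:
    """Hypothetical container for one element of R^3 x SO(3) x T^m."""

    translation: tuple  # element of R^3
    rotation: tuple     # unit quaternion (w, x, y, z) representing an element of SO(3)
    torsions: tuple     # m angles in [-pi, pi], an element of T^m

    @property
    def degrees_of_freedom(self):
        # 3 (translation) + 3 (rotation) + m (torsions)
        return 3 + 3 + len(self.torsions)


def random_pose_update(num_rotatable_bonds, rng, translation_scale=1.0):
    """Draw a random element of the product space (illustrative distributions)."""
    translation = tuple(rng.gauss(0.0, translation_scale) for _ in range(3))
    # Normalizing a 4D Gaussian vector gives a uniformly distributed unit quaternion
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    rotation = tuple(c / norm for c in q)
    torsions = tuple(rng.uniform(-math.pi, math.pi) for _ in range(num_rotatable_bonds))
    return PoseUpdate(translation, rotation, torsions)


rng = random.Random(0)
update = random_pose_update(num_rotatable_bonds=7, rng=rng)
print(update.degrees_of_freedom)
```

Applying such an update to an RDKit conformer would move the ligand rigidly and twist its rotatable bonds while leaving all bond lengths and angles untouched.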

#### 2. Product space diffusion

Score matching

Developing a diffusion process on the product space is not possible with the standard DGM formulation, which assumes Euclidean space. However, [De Bortoli et al. 2022](https://arxiv.org/pdf/2202.02763.pdf) developed Riemannian score-based generative models, which can describe distributions on such manifolds. In this setting, the score model predicts an element of the tangent space.

Standard score matching ([Song et al. 2019](https://arxiv.org/abs/1907.05600)) is used to train the model. All that remains is a method to sample from and compute the score of the diffusion kernel on $\mathbb{P}$.

We start with the forward diffusion process. Since $\mathbb{P}$ is a product space, its tangent space is the direct sum of the tangent spaces of the individual manifolds, so the forward diffusion can be carried out on each manifold independently. The forward stochastic differential equation (SDE) is defined as follows:
$$
d\mathbf{x} = \sqrt{\frac{d\sigma^2(t)}{dt}}\, d\mathbf{w}
$$
where $\sigma^2$ is $\sigma^2_{tr}$, $\sigma^2_{rot}$ or $\sigma^2_{tor}$ for the translational, rotational and torsional component, respectively, and $\mathbf{w}$ denotes the corresponding [Brownian motion](https://en.wikipedia.org/wiki/Brownian_motion).
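For the translational component, which lives in ordinary $\mathbb{R}^3$, this SDE can be simulated with a simple Euler-Maruyama scheme. The sketch below assumes a geometric noise schedule $\sigma(t) = \sigma_{min}^{1-t}\,\sigma_{max}^{t}$ for $t \in [0, 1]$, so that $\sqrt{d\sigma^2(t)/dt} = \sigma(t)\sqrt{2 \ln(\sigma_{max}/\sigma_{min})}$. The schedule and the numeric values are assumptions chosen for illustration; the rotational and torsional components need manifold-aware sampling and are not covered here.

```python
import math
import random


def sigma(t, sigma_min, sigma_max):
    """Assumed geometric noise schedule sigma(t) for t in [0, 1]."""
    return sigma_min ** (1.0 - t) * sigma_max ** t


def diffusion_coefficient(t, sigma_min, sigma_max):
    """g(t) = sqrt(d sigma^2(t) / dt) for the geometric schedule above."""
    return sigma(t, sigma_min, sigma_max) * math.sqrt(2.0 * math.log(sigma_max / sigma_min))


def diffuse_translation(x0, sigma_min, sigma_max, num_steps, rng):
    """Euler-Maruyama simulation of dx = g(t) dw in R^3 from t = 0 to t = 1."""
    dt = 1.0 / num_steps
    x = list(x0)
    for k in range(num_steps):
        g = diffusion_coefficient(k * dt, sigma_min, sigma_max)
        # Brownian increment over dt has standard deviation sqrt(dt)
        x = [xi + g * math.sqrt(dt) * rng.gauss(0.0, 1.0) for xi in x]
    return x


sigma_min, sigma_max = 0.1, 19.0  # illustrative values (assumption)
rng = random.Random(0)
ligand_center = [0.0, 0.0, 0.0]
noised_center = diffuse_translation(ligand_center, sigma_min, sigma_max, 1000, rng)
```

Under this SDE, the variance accumulated between $t = 0$ and $t = 1$ is $\sigma^2(1) - \sigma^2(0)$ per coordinate, so at the end of the forward process the ligand position is spread out on the scale of $\sigma_{max}$.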

##### Confidence score

![Overview of DiffDock](images/DiffDock.png)

*Figure 2:* 
Overview of DiffDock. Left: The model takes as input the separate ligand and protein structures. Center: Randomly sampled initial poses are denoised via a reverse diffusion over translational, rotational, and torsional degrees of freedom. Right: The sampled poses are ranked by the confidence model to produce a final prediction and confidence score.
Figure and description taken from: [arXiv 2023](https://arxiv.org/pdf/2210.01776v2.pdf).

### Use one protein and one ligand to implement diffdock

## Practical

We will now run DiffDock, a molecular docking tool based on a diffusion generative model. It is open source and available on [GitHub](https://github.com/gcorso/DiffDock).

#### Import dependencies

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import os 
import csv
import urllib.request
from pathlib import Path

In [2]:
HERE = Path(_dh[-1])
DATA = HERE / "data"

### Data preparation

#### Download PDB structure

In [3]:
# Specify the PDB code of the structure you want to download
pdb_code = '6w70'
# Download the PDB structure
urllib.request.urlretrieve(f'http://files.rcsb.org/download/{pdb_code}.pdb', f'data/{pdb_code}.pdb')

('data/6w70.pdb', <http.client.HTTPMessage at 0x7f31a7b05f70>)

In [4]:
# Give ligand SMILES here
smiles_ligand = "COc1ccc(cc1)n2c3c(c(n2)C(=O)N)CCN(C3=O)c4ccc(cc4)N5CCCCC5=O"

#### Prepare DiffDock input file

In [5]:
def prepare_diffdock_input(protein_path, smiles_ligand, output_name):
    """
    Create a CSV file that follows the DiffDock input template.

    Parameters
    ----------
    protein_path : str
        Path to the protein file relative to the talktorial directory;
        "../" is prepended because DiffDock is run from inside its own directory.
    smiles_ligand : str
        SMILES string of the query ligand.
    output_name : str
        Name of the CSV file to create in the data directory.

    Returns
    -------
    str
        Path to the CSV file.
    """
    # Each row pairs the same target with one ligand SMILES.
    header = ["complex_name", "protein_path", "ligand_description", "protein_sequence"]
    output_file = DATA / output_name

    if not output_file.exists():
        with open(output_file, "w", newline="") as file:
            # Create the CSV writer object and write the header and data rows
            writer = csv.writer(file)
            writer.writerow(header)
            writer.writerow(["", f"../{protein_path}", smiles_ligand, ""])
    else:
        print("Output file is already created.")

    print("Protein and corresponding ligand structure are now ready for DiffDock.")

    return str(output_file)

In [6]:
protein_path = f"data/{pdb_code}.pdb"
output_name = "diffdock_input.csv"

output_path = prepare_diffdock_input(protein_path, smiles_ligand, output_name)

Output file is already created.
Protein and corresponding ligand structure are now ready for DiffDock.
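Since every row of the input file pairs the target with one ligand, the same template can hold several ligands for one protein. The following self-contained sketch writes such a file to a temporary directory and reads it back to check its structure. The function name, the complex names and the second ligand (ethanol, `CCO`) are placeholders chosen for illustration; only the column names are taken from the template above.

```python
import csv
import tempfile
from pathlib import Path

HEADER = ["complex_name", "protein_path", "ligand_description", "protein_sequence"]


def write_diffdock_csv(csv_path, protein_path, ligands):
    """Write one row per ligand; `ligands` maps a complex name to a SMILES string."""
    with open(csv_path, "w", newline="") as file:
        writer = csv.writer(file)
        writer.writerow(HEADER)
        for complex_name, smiles in ligands.items():
            writer.writerow([complex_name, protein_path, smiles, ""])
    return Path(csv_path)


# Hypothetical labels; the second ligand is only a placeholder
ligands = {
    "6w70_query": "COc1ccc(cc1)n2c3c(c(n2)C(=O)N)CCN(C3=O)c4ccc(cc4)N5CCCCC5=O",
    "6w70_placeholder": "CCO",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_file = write_diffdock_csv(Path(tmp_dir) / "diffdock_input_multi.csv", "../data/6w70.pdb", ligands)
    # Read the file back to verify its structure
    with open(csv_file, newline="") as file:
        rows = list(csv.DictReader(file))

print(f"{len(rows)} complexes written")
```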


### Diffusion-based molecular docking.
#### Diffusion implementation

##### Download DiffDock software from Github repository

In [7]:
if "DiffDock" not in os.listdir(HERE):
    !git clone https://github.com/gcorso/DiffDock.git
else:
    print("DiffDock is already cloned.")

DiffDock is already cloned.


#### Configure your docking settings (Doesn't work right now, working on a cpu environment file)

In [8]:
protein_ligand_csv = output_path
results_path = "results/"
samples_per_complex = "5"
inference_steps = "20"
actual_steps = "18"

In [10]:
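import subprocess

# Pass the arguments as a list so that paths containing spaces are handled correctly,
# and run inside the DiffDock directory without changing the notebook's working directory.
diffdock_cmd = ["python", "-m", "inference", "--protein_ligand_csv", protein_ligand_csv, "--out_dir", results_path, "--inference_steps", inference_steps, "--samples_per_complex", samples_per_complex, "--batch_size", "10", "--actual_steps", actual_steps, "--no_final_step_noise"]
subprocess.run(diffdock_cmd, cwd=HERE / "DiffDock")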

Traceback (most recent call last):
  File "/home/hamza/mambaforge-pypy3/envs/dockm8/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/hamza/mambaforge-pypy3/envs/dockm8/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/hamza/Desktop/Bioinformatics master UdS/SS23/CADDSeminar_2023/notebook/T02_DiffusionBasedDocking/DiffDock/inference.py", line 14, in <module>
    from utils.inference_utils import InferenceDataset, set_nones
  File "/home/hamza/Desktop/Bioinformatics master UdS/SS23/CADDSeminar_2023/notebook/T02_DiffusionBasedDocking/DiffDock/utils/inference_utils.py", line 5, in <module>
    from esm import FastaBatchedDataset, pretrained
ModuleNotFoundError: No module named 'esm'
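The traceback above shows that the run fails because the `esm` module is missing from the environment. A small pre-flight check like the one below can report such problems before the docking run is started. The helper name is made up for this sketch, and the list of required modules contains only the one named in the traceback; DiffDock's own environment file remains the authoritative source for its dependencies. To our knowledge, the `esm` module is distributed on PyPI under the name `fair-esm`.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


required = ["esm"]  # taken from the traceback; other dependencies may be missing as well
missing = missing_modules(required)

if missing:
    print(f"Missing modules: {', '.join(missing)}. Install them before running DiffDock.")
else:
    print("All checked modules are available.")
```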


## Discussion

Wrap up the talktorial's content here and discuss pros/cons and open questions/challenges.

## Quiz

Ask three questions that the user should be able to answer after doing this talktorial. Choose important take-aways from this talktorial for your questions.

1. Question
2. Question
3. Question

<div class="alert alert-block alert-info">

<b>Useful checks at the end</b>: 
    
<ul>
<li>Clear output and rerun your complete notebook. Does it finish without errors?</li>
<li>Check if your talktorial's runtime is as expected. If not, try to find out which step(s) take unexpectedly long.</li>
<li>Flag code cells with <code># NBVAL_CHECK_OUTPUT</code> that have deterministic output and should be tested within our Continuous Integration (CI) framework.</li>
</ul>

</div>