<div class="alert alert-block alert-info">

<b>Thank you for contributing to TeachOpenCADD!</b>

</div>

<div class="alert alert-block alert-info">

<b>Set up your PR</b>: Please check out our <a href="https://github.com/volkamerlab/teachopencadd/issues/41">issue</a> on how to set up a PR for new talktorials, including standard checks and TODOs.

</div>

# · Diffusion-based docking models

**Note:** This talktorial is a part of TeachOpenCADD, a platform that aims to teach domain-specific skills and to provide pipeline templates as starting points for research projects.

Authors:

- Hamza Ibrahim, CADD seminars 2023, Universität des Saarlandes (UdS)
- Michael Bockenköhler, 2023,  [Volkamer lab](https://volkamerlab.org), Universität des Saarlandes (UdS)
- Andrea Volkamer, 2023,  [Volkamer lab](https://volkamerlab.org), Universität des Saarlandes (UdS)

## Aim of this talktorial

This talktorial presents a new class of generative models called diffusion probabilistic models. You will learn what diffusion models are and how they are applied to molecular docking.

### Contents in *Theory*


* Diffusion generative models (DGM).
    - Forward process
    - Reverse process
    - Training a diffusion model
* Diffusion-based docking models.
    - Ligand pose manifold
    - Product space diffusion

### Contents in *Practical*

* Diffusion-based docking model.
    - Forward process
    - Reverse process
    - Loss function
    - Training
    

### References

* Denoising Diffusion Probabilistic Models: [Ho et al. <i>arXiv</i> (2020)](https://arxiv.org/pdf/2006.11239.pdf)
* Score-based generative modeling through stochastic differential equations: [<i>arXiv</i> (2021)](https://arxiv.org/pdf/2011.13456.pdf)
* Equivariant Graph Neural Networks: [<i>arXiv</i> (2022)](https://arxiv.org/pdf/2102.09844.pdf)
* Structure-based Drug Design with Equivariant Diffusion Models: [<i>arXiv</i> (2022)](https://arxiv.org/pdf/2210.13695.pdf)
* DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking: [<i>arXiv</i> (2023)](https://arxiv.org/pdf/2210.01776v2.pdf)
* [Diffusion Model Clearly Explained!](https://medium.com/@steinsfu/diffusion-model-clearly-explained-cd331bd41166)
* Deep Unsupervised Learning using Nonequilibrium Thermodynamics: [Sohl-Dickstein et al. <i>arXiv</i> (2015)](https://arxiv.org/pdf/1503.03585.pdf)
* Generative Modeling by Estimating Gradients of the Data Distribution: [Song et al. <i>arXiv</i> (2019)](https://arxiv.org/abs/1907.05600)


## Theory

### Diffusion generative models (DGM).

Generative models are a category of machine learning models that can generate new data resembling a given training data set. The diffusion probabilistic model (DPM) is a type of generative model inspired by non-equilibrium thermodynamics ([Sohl-Dickstein et al. 2015](https://arxiv.org/pdf/1503.03585.pdf)). A DPM depends on two reciprocal processes, each of which is a sequence of random variables organized as a Markov chain:
1. Forward Diffusion Process → add noise to input data.
2. Reverse Diffusion Process → denoise noised data.

##### Forward process

The forward process adds Gaussian noise to the input data $x_0$ sequentially over $T$ steps. As $T \to \infty$, $x_T$ becomes pure noise (for image data, an image of static noise), as shown in Figure 1. Each state $x_t$ is sampled from its predecessor $x_{t-1}$ as follows:
$$
q(x_t \mid x_{t-1}) = \mathcal{N}(x_t;\ \mu_t = \sqrt{1 - \beta_t}\, x_{t-1},\ \Sigma_t = \beta_t \mathbf{I}),
$$
where $q(x_t \mid x_{t-1})$ denotes the distribution of the next state $x_t$, $\mu_t$ denotes the mean of this distribution, $\Sigma_t$ denotes its covariance, and $\beta_t \in (0, 1)$ is the amount of noise added at step $t$.

Using the [reparametrization trick](https://medium.com/@steinsfu/diffusion-model-clearly-explained-cd331bd41166#228f), a closed-form expression can be derived that lets us sample $x_t$ at any time step directly from $x_0$. This makes the forward diffusion process much faster:
$$
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
$$

where $\bar{\alpha}_t = \prod_{s = 1}^{t} (1 - \beta_s)$, and $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ merges the independent Gaussian noise terms $\epsilon_0, \dots, \epsilon_{t-2}, \epsilon_{t-1}$ of the individual steps.
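The closed-form sampling above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the linear noise schedule, its end points (`beta_start`, `beta_end`), the number of steps, and the function names `make_linear_schedule` and `q_sample` are chosen here for demonstration and are not taken from a specific implementation.

```python
import numpy as np


def make_linear_schedule(n_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Return beta_t and the cumulative product alpha_bar_t for t = 1..n_steps.

    The linear schedule and its end points are assumptions for illustration.
    """
    betas = np.linspace(beta_start, beta_end, n_steps)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) in a single step (t is 1-based)."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t - 1]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return x_t, eps


rng = np.random.default_rng(seed=0)
betas, alpha_bars = make_linear_schedule()
x0 = rng.uniform(-1.0, 1.0, size=(5000, 2))  # toy 2D data set

# The signal fades and the noise takes over as t grows
for t in (1, 250, 1000):
    x_t, _ = q_sample(x0, t, alpha_bars, rng)
    print(f"t={t:4d}  alpha_bar={alpha_bars[t - 1]:.4f}  std={x_t.std():.3f}")
```

Because $\bar{\alpha}_t$ shrinks towards zero, the contribution of $x_0$ vanishes for large $t$ and the samples approach a standard normal distribution.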

##### Reverse process

Unfortunately, we cannot simply invert the forward process to recover $x_0$ from $x_t$: the true reverse distribution $q(x_{t-1} \mid x_t)$ is intractable. The *reverse diffusion process* is therefore approximated by a deep learning model (e.g., a neural network) with parameters $\theta$, which predicts the conditional probability distribution $p_\theta(x_{t-1} \mid x_t)$:

$$
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)),
$$
Starting from the noisy image $x_t$, $x_0$ is reconstructed step by step from these learned conditional probability densities, as shown in Figure 1. This allows meaningful information to be extracted from the noisy representation.


![Forward and reverse diffusion processes](images/basics_dgm.png)

*Figure 1:* 
Black arrows represent the forward diffusion process, while the blue arrow represents the reverse diffusion process.
Figure taken from: [Medium article](https://medium.com/@steinsfu/diffusion-model-clearly-explained-cd331bd41166).
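As a rough sketch of how the reverse process is used for sampling, the snippet below applies the update rule of Ho et al. 2020 with the variance fixed to $\beta_t$. Several parts are assumptions for illustration: the linear schedule, the function name `p_sample_step`, and above all `oracle_eps`, a hypothetical stand-in for the trained network that is only exact for a data set containing a single point.

```python
import numpy as np


def p_sample_step(x_t, t, eps_model, betas, alpha_bars, rng):
    """One reverse step x_t -> x_{t-1} with fixed variance sigma_t^2 = beta_t."""
    beta_t = betas[t - 1]
    a_bar_t = alpha_bars[t - 1]
    eps_hat = eps_model(x_t, t)
    # Mean of p_theta(x_{t-1} | x_t), expressed through the predicted noise
    mean = (x_t - beta_t / np.sqrt(1.0 - a_bar_t) * eps_hat) / np.sqrt(1.0 - beta_t)
    if t == 1:
        return mean  # no noise is added in the final step
    return mean + np.sqrt(beta_t) * rng.standard_normal(x_t.shape)


# Assumed linear schedule (illustrative values)
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bars = np.cumprod(1.0 - betas)

# Hypothetical "oracle" noise predictor for a data set that contains a single
# point; a trained network would take its place in practice.
target = np.array([0.5, -1.0])


def oracle_eps(x_t, t):
    a_bar_t = alpha_bars[t - 1]
    return (x_t - np.sqrt(a_bar_t) * target) / np.sqrt(1.0 - a_bar_t)


rng = np.random.default_rng(seed=0)
x = rng.standard_normal((4, 2))  # start from pure noise x_T
for t in range(len(betas), 0, -1):
    x = p_sample_step(x, t, oracle_eps, betas, alpha_bars, rng)
print(x)  # every row ends up at `target`
```

With a real model, `oracle_eps` is replaced by the learned noise predictor $\epsilon_\theta(x_t, t)$, and the samples follow the learned data distribution instead of collapsing onto one point.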

#### Training a diffusion model.

To train a diffusion model, we need to define a loss function. Taking a step back, the combination of $q$ and $p$ in a DPM resembles that of a variational autoencoder (VAE). The model can therefore be optimized in a similar way, by minimizing a bound on the negative log-likelihood of the training data set. [Ho et al. 2020](https://arxiv.org/pdf/2006.11239.pdf) made a few simplifications to this loss function. Skipping the details of the [full proof](https://lilianweng.github.io/posts/2021-07-11-diffusion-models/#reverse-diffusion-process), the result is:
$$
L_t^{\text{simple}} = \mathbb{E}_{x_0, t, \epsilon}\left[ \left\| \epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ t\right) \right\|^2 \right]
$$
where: 

$\epsilon \sim \mathcal{N}(0, \mathbf{I})$ is the actual noise added.

$\epsilon_\theta(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ t) = \epsilon_\theta(x_t, t)$ denotes the noise predicted by the neural network, with $x_t$ obtained through the reparametrization trick mentioned before.

It was found experimentally that optimizing this simplified loss function, i.e., the mean squared error (MSE) between the true and the predicted noise, gives better results than optimizing the original evidence lower bound (ELBO).

[Ho et al. 2020](https://arxiv.org/pdf/2006.11239.pdf) kept the variance fixed in their simplified loss function and let the network learn only the mean. [Nichol and Dhariwal 2021](https://arxiv.org/abs/2102.09672) later extended the model so that the network also learns the variance.

A [U-Net](https://theaisummer.com/unet-architectures/) is commonly used as the network architecture, as in [Ho et al. 2020](https://arxiv.org/pdf/2006.11239.pdf). Its output has the same size as its input, which is required for predicting the noise added to each input element.
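To make the simplified loss concrete, here is a minimal NumPy sketch of a Monte Carlo estimate of $L^{\text{simple}}$ for one batch. The schedule, the toy data set, and the two noise predictors (`zero_model` and `oracle_model`) are assumptions for illustration only; in an actual training loop, a neural network such as a U-Net would be used and the loss would be minimized with a gradient-based optimizer.

```python
import numpy as np


def simple_loss(eps_model, x0, alpha_bars, rng):
    """Monte Carlo estimate of L_simple for one batch x0 of shape (batch, dim)."""
    batch = x0.shape[0]
    # Draw one random time step (1-based) and one noise vector per sample
    t = rng.integers(1, len(alpha_bars) + 1, size=batch)
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t - 1][:, None]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    # Squared L2 distance between true and predicted noise, averaged over the batch
    return np.mean(np.sum((eps - eps_model(x_t, t)) ** 2, axis=1))


# Assumed linear schedule (illustrative values)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))

# Toy data set: copies of a single 2D point, for which the added noise can be
# recovered exactly from x_t (only possible in this toy setting)
target = np.array([0.5, -1.0])
x0 = np.tile(target, (4096, 1))


def zero_model(x_t, t):
    """Baseline predictor that always predicts zero noise."""
    return np.zeros_like(x_t)


def oracle_model(x_t, t):
    """Hypothetical perfect predictor for the single-point data set."""
    a_bar = alpha_bars[t - 1][:, None]
    return (x_t - np.sqrt(a_bar) * target) / np.sqrt(1.0 - a_bar)


rng = np.random.default_rng(seed=0)
loss_zero = simple_loss(zero_model, x0, alpha_bars, rng)
loss_oracle = simple_loss(oracle_model, x0, alpha_bars, rng)
print(f"zero predictor:   {loss_zero:.3f}")
print(f"oracle predictor: {loss_oracle:.3e}")
```

The zero predictor yields a loss close to 2, the expected squared norm of two-dimensional standard normal noise, while the oracle drives the loss to numerical zero. A trained network lies between these two extremes.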

### Diffusion-based docking model.

##### Ligand pose manifold.

To build a diffusion-based docking model, we first need a manifold that suits ligand poses. A ligand pose is a point $L \in \mathbb{R}^{3n}$, where $n$ is the number of atoms. If we ran the forward diffusion directly in this space, without restricting the degrees of freedom, the noised ligands would have unreasonable bond lengths and angles.

A solution to this problem is presented in the [DiffDock paper](https://arxiv.org/pdf/2210.01776v2.pdf). Inspired by traditional docking approaches, the authors start from a ligand that is already embedded in 3D space using RDKit, which sets its bond lengths and angles. Instead of treating a ligand pose as an arbitrary element of Euclidean space, they restrict it to a submanifold $M_c \subset \mathbb{R}^{3n}$, described by four main characteristics:
1. Local structures such as bond lengths, bond angles, chirality, and ring structures are kept fixed to maintain the integrity of the ligand. Diffusion therefore takes place on an $(m+6)$-dimensional manifold, where $m$ is the number of rotatable bonds.
2. The position is left flexible so that the ligand can find the pocket; translations are elements of $\mathbb{R}^3$.
3. The orientation is left flexible; rotations around the center of mass of the ligand are elements of $SO(3)$.
4. The torsion angles are left flexible so that the ligand can fit into the pocket; they are elements of the torus $\mathbb{T}^m$.

Ligand poses can now be mapped to the product space $\mathbb{P} = \mathbb{R}^3 \times SO(3) \times \mathbb{T}^m$.
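The degrees of freedom listed above can be illustrated with a small NumPy sketch that perturbs a pose by a translation (3 parameters), a rotation (3 parameters), and $m$ torsion angles, which together give the $m + 6$ dimensions. Everything here is an assumption for illustration: the coordinates are random numbers instead of a real ligand, the unweighted centroid stands in for the center of mass, the noise is plain Gaussian instead of the dedicated distributions on $SO(3)$ and the torus described in the DiffDock paper, and the torsion angles are only tracked, not applied to the coordinates.

```python
import numpy as np


def rotation_matrix(rot_vec):
    """Rotation matrix from an axis-angle vector (Rodrigues' formula)."""
    angle = np.linalg.norm(rot_vec)
    if angle < 1e-12:
        return np.eye(3)
    k = rot_vec / angle
    K = np.array([[0.0, -k[2], k[1]], [k[2], 0.0, -k[0]], [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)


def perturb_pose(coords, translation, rot_vec):
    """Rigid-body update: rotate around the centroid, then translate."""
    center = coords.mean(axis=0)
    R = rotation_matrix(rot_vec)
    return (coords - center) @ R.T + center + translation


def wrap_torsions(angles):
    """Map torsion angles onto the torus, i.e., the interval [-pi, pi)."""
    return (angles + np.pi) % (2.0 * np.pi) - np.pi


def pairwise_distances(coords):
    """All interatomic distances, used to check that the ligand stays rigid."""
    return np.linalg.norm(coords[:, None] - coords[None], axis=-1)


rng = np.random.default_rng(seed=0)
coords = rng.normal(size=(6, 3))  # hypothetical ligand with n = 6 atoms
torsions = np.zeros(2)  # m = 2 rotatable bonds

translation = 0.5 * rng.standard_normal(3)  # noise in R^3
rot_vec = 0.3 * rng.standard_normal(3)  # simple stand-in for noise on SO(3)
new_coords = perturb_pose(coords, translation, rot_vec)
torsions = wrap_torsions(torsions + 2.0 * rng.standard_normal(2))  # noise on T^m

# Rigid-body moves leave all interatomic distances unchanged
print(np.allclose(pairwise_distances(coords), pairwise_distances(new_coords)))
```

Translations and rotations move the ligand as a rigid body, so bond lengths and angles are untouched. This is the reason for diffusing over the product space instead of over the raw atom coordinates in $\mathbb{R}^{3n}$.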

##### Product space diffusion

![Overview of DiffDock](images/DiffDock.png)

*Figure 2:* 
Overview of DIFFDOCK. Left: The model takes as input the separate ligand and protein structures. Center: Randomly sampled initial poses are denoised via a reverse diffusion over translational, rotational, and torsional degrees of freedom. Right: The sampled poses are ranked by the confidence model to produce a final prediction and confidence score.
Figure and description taken from: [arXiv 2023](https://arxiv.org/pdf/2210.01776v2.pdf).

##### Confidence score

## Practical

* How the diffusion process occurs
* Mapping a ligand onto the submanifold
* Train a model?

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_swiss_roll

# Optional plot styling; `helper_plot` is a local module and may be missing
try:
    from helper_plot import hdr_plot_style
    hdr_plot_style()
except ImportError:
    pass


def sample_batch(size, noise=1.0):
    """Sample 2D points from the swiss roll data set, scaled down by 10."""
    x, _ = make_swiss_roll(size, noise=noise)
    # Keep the first and third coordinates, which form the 2D spiral
    return x[:, [0, 2]] / 10.0


# Plot 10,000 samples
data = sample_batch(10**4).T
plt.figure(figsize=(16, 12))
plt.scatter(*data, alpha=0.5, color="red", edgecolor="white", s=40)


<div class="alert alert-block alert-info">

<b>Imports</b>: Please add all your imports on top of this section, ordered by standard library / 3rd party packages / our own (<code>teachopencadd.*</code>). 
Read more on imports and import order in the <a href="https://www.python.org/dev/peps/pep-0008/#imports">"PEP 8 -- Style Guide for Python Code"</a>.
    
</div>

In [2]:
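from pathlib import Path
# `_dh` is IPython's directory history; its last entry is the current directory
HERE = Path(_dh[-1])
DATA = HERE / "data"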

<div class="alert alert-block alert-info">

<b>Relative paths</b>: Please define all paths relative to this talktorial's path by using the global variable <code>HERE</code>.
If your talktorial has input/output data, please define the global <code>DATA</code>, which points to this talktorial's data folder (check out the default folder structure of each talktorial).
    
</div>

### Title

_Explain what you will do and why here in the Markdown cell. This includes everything that has to do with the talktorial's storytelling._

In [3]:
# Add comments in the code cell if you want to comment on coding decisions

<div class="alert alert-block alert-info">

<b>Functions</b>: 

<ul>
<li>Please add <a href="https://numpydoc.readthedocs.io/en/latest/format.html">numpy docstrings</a> to your functions.</li>
<li>Please expose all variables used within a function in the function's signature (i.e. they must be function parameters), unless they are created within the scope of the function.</li>
<li>Please add comments to the steps performed in the function.</li>
<li>Please use meaningful function and parameter names. This applies also to variable names.</li>
</ul>
    
</div>

In [4]:
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors


def calculate_ro5_properties(smiles):
    """
    Test if input molecule (SMILES) fulfills Lipinski's rule of five.

    Parameters
    ----------
    smiles : str
        SMILES for a molecule.

    Returns
    -------
    pandas.Series
        Molecular weight, number of hydrogen bond acceptors/donor and logP value
        and Lipinski's rule of five compliance for input molecule.
    """
    # RDKit molecule from SMILES
    molecule = Chem.MolFromSmiles(smiles)
    # Calculate Ro5-relevant chemical properties
    molecular_weight = Descriptors.ExactMolWt(molecule)
    n_hba = Descriptors.NumHAcceptors(molecule)
    n_hbd = Descriptors.NumHDonors(molecule)
    logp = Descriptors.MolLogP(molecule)
    # Check each of the four Ro5 conditions
    conditions = [molecular_weight <= 500, n_hba <= 10, n_hbd <= 5, logp <= 5]
    # Ro5 is fulfilled if no more than one out of four conditions is violated
    ro5_fulfilled = sum(conditions) >= 3
    # Return the properties together with the Ro5 compliance flag
    return pd.Series(
        [molecular_weight, n_hba, n_hbd, logp, ro5_fulfilled],
        index=["molecular_weight", "n_hba", "n_hbd", "logp", "ro5_fulfilled"],
    )

### Title 2

_Explain what you will do and why here in the Markdown cell. This includes everything that has to do with the talktorial's storytelling._

In [5]:
# Add comments in the code cell if you want to comment on coding decisions

## Discussion

Wrap up the talktorial's content here and discuss pros/cons and open questions/challenges.

## Quiz

Ask three questions that the user should be able to answer after doing this talktorial. Choose important take-aways from this talktorial for your questions.

1. Question
2. Question
3. Question

<div class="alert alert-block alert-info">

<b>Useful checks at the end</b>: 
    
<ul>
<li>Clear output and rerun your complete notebook. Does it finish without errors?</li>
<li>Check if your talktorial's runtime is as expected. If not, try to find out which step(s) take unexpectedly long.</li>
<li>Flag code cells with <code># NBVAL_CHECK_OUTPUT</code> that have deterministic output and should be tested within our Continuous Integration (CI) framework.</li>
</ul>

</div>