# Denoising diffusion probabilistic models (DDPMs)


## Summary of the theory 

#### Markov chains

1. forward: $q(x^{(1:N)}\,|\,x^{(0)}) = \prod_{k=1}^{N} q(x^{(k)}\,|\,x^{(k-1)})$, where 
$$q(x^{(k)}\,|\,x^{(k-1)}) = \mathcal{N}(x^{(k)}; \sqrt{1-\beta_k}x^{(k-1)}, \beta_k \mathrm{I}_d)$$

2. reverse: $p_\theta(x^{(0:N)}) = p(x^{(N)}) \prod_{k=1}^{N} p_\theta(x^{(k-1)}\,|\,x^{(k)})$, where 
 
\begin{equation*}
  p_\theta(x^{(k-1)}\,|\,x^{(k)}) = \mathcal{N}\Big(x^{(k-1)}; \mathrm{\mu}_\theta(x^{(k)}, k), \sigma_k^2\mathrm{I}_d\Big)\,.
\end{equation*}

3. generated density: $p_\theta(x^{(0)})= \int p_\theta(x^{(0:N)})\, dx^{(1:N)}$.
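
The forward chain in item 1 can be simulated one transition at a time. The following is a minimal sketch; the schedule `beta`, the step count `N`, and the stand-in data `x0` are illustrative assumptions, and `forward_chain` is a hypothetical helper that is not used elsewhere in this notebook.

```python
import torch

torch.manual_seed(0)

# Illustrative values (assumptions for this sketch only)
N = 100                                   # number of forward steps
beta = torch.linspace(0.001, 0.02, N)     # beta_1, ..., beta_N
x0 = torch.randn(5, 2)                    # stand-in for 5 data points in R^2


def forward_chain(x0, beta):
    """Return the path [x^(0), ..., x^(N)], sampled one transition at a time."""
    path = [x0]
    x = x0
    for b in beta:
        # x^(k) ~ N(sqrt(1 - beta_k) x^(k-1), beta_k I_d)
        x = torch.sqrt(1.0 - b) * x + torch.sqrt(b) * torch.randn_like(x)
        path.append(x)
    return path


path = forward_chain(x0, beta)
print(len(path), path[-1].shape)
```

The path has $N+1$ entries because it includes the initial state $x^{(0)}$.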

#### Upper bound on the negative log-likelihood

\begin{equation}
  \begin{aligned}
    &\mathbb{E}_{q_0}\big(-\log p_\theta(x^{(0)})\big) \\
    = & \mathbb{E}_{q_0}\Big[-\log \Big(\int p_\theta(x^{(0:N)})\, dx^{(1:N)}\Big)\Big]  \\
    =& \mathbb{E}_{q_0}\Big[-\log \Big(\int \frac{p_\theta(x^{(0:N)})}{q(x^{(1:N)}\,|\,x^{(0)})} q(x^{(1:N)}\,|\,x^{(0)})\, dx^{(1:N)}\Big)\Big] \\
    \le & \mathbb{E}_{q_0}\Big[\int -\log\Big(\frac{p_\theta(x^{(0:N)})}{q(x^{(1:N)}\,|\,x^{(0)})}\Big) q(x^{(1:N)}\,|\,x^{(0)})\, dx^{(1:N)}\Big] \\
    =& \mathbb{E}_{\mathbb{Q}}\Big( -\log \frac{p_\theta(x^{(0:N)})}{q(x^{(1:N)}\,|\,x^{(0)})}\Big)\\
    =& \mathbb{E}_{\mathbb{Q}}\Big( -\log p(x^{(N)}) - \sum_{k=1}^{N}
    \log\frac{p_\theta(x^{(k-1)}\,|\,x^{(k)})}{q(x^{(k)}\,|\,x^{(k-1)})}\Big)=:L \,,
  \end{aligned}
  \label{variational-bound}
\end{equation}

#### Idea

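Since $L$ bounds the expected negative log-likelihood from above (by Jensen's inequality, with $\mathbb{Q}$ denoting the joint law of the forward chain started from $x^{(0)} \sim q_0$), we optimize $\theta$, i.e. the mean $\mathrm{\mu}_\theta(x^{(k)},k)$, by minimizing $L$.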

#### Parametrization

The derivation in the lecture notes suggests the parametrization

$$\mathrm{\mu}_\theta(x^{(k)}, k) = \frac{1}{\sqrt{\alpha_k}} \Big(x^{(k)}- \frac{\beta_k}{\sqrt{1-\bar{\alpha}_k}} \epsilon_\theta(x^{(k)}, k)\Big)$$
where $\epsilon_\theta(x^{(k)}, k)$ is a function modeled by a neural network.
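
A short sketch of where this form comes from, given here as standard Gaussian conditioning and not as a reproduction of the lecture notes. Write $\alpha_k = 1-\beta_k$ and $\bar{\alpha}_k = \prod_{i=1}^k \alpha_i$, with the convention $\bar{\alpha}_0 = 1$. The forward posterior $q(x^{(k-1)}\,|\,x^{(k)}, x^{(0)})$ is Gaussian with mean

$$\tilde{\mu}_k(x^{(k)}, x^{(0)}) = \frac{\sqrt{\bar{\alpha}_{k-1}}\,\beta_k}{1-\bar{\alpha}_k}\, x^{(0)} + \frac{\sqrt{\alpha_k}\,(1-\bar{\alpha}_{k-1})}{1-\bar{\alpha}_k}\, x^{(k)}\,.$$

Writing $x^{(k)} = \sqrt{\bar{\alpha}_k}\,x^{(0)} + \sqrt{1-\bar{\alpha}_k}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, \mathrm{I}_d)$, substituting $x^{(0)} = \big(x^{(k)} - \sqrt{1-\bar{\alpha}_k}\,\epsilon\big)/\sqrt{\bar{\alpha}_k}$, and using $\beta_k + \alpha_k(1-\bar{\alpha}_{k-1}) = 1-\bar{\alpha}_k$ gives

$$\tilde{\mu}_k = \frac{1}{\sqrt{\alpha_k}} \Big(x^{(k)} - \frac{\beta_k}{\sqrt{1-\bar{\alpha}_k}}\, \epsilon\Big)\,.$$

Replacing the unknown noise $\epsilon$ by the network output $\epsilon_\theta(x^{(k)}, k)$ yields the parametrization of $\mathrm{\mu}_\theta$ stated above.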


#### Simplified Loss function

   $$L_{\mathrm{simple}}(\theta) = \mathbb{E}_{k, x^{(0)}, \epsilon} \Big[\big|\epsilon_\theta(\sqrt{\bar{\alpha}_k}x^{(0)} + \sqrt{1-\bar{\alpha}_k} \epsilon, k) - \epsilon\big|^2 \Big]$$


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as colors

import torch
import torch.nn as nn
import math

from torch.utils.data import DataLoader

import seaborn as sns

from sklearn.datasets import make_circles, make_moons

### Set parameters in DDPM

1. $\beta_k$,    which is used in $q(x^{(k)}\,|\,x^{(k-1)}) = \mathcal{N}(x^{(k)}; \sqrt{1-\beta_k}x^{(k-1)}, \beta_k \mathrm{I}_d)$
2.  $\alpha_k = 1 - \beta_k\,,\quad \bar{\alpha}_k = \prod_{i=1}^k \alpha_i$, which are constants in  

    $q(x^{(k)}|x^{(0)}) = \mathcal{N}(x^{(k)}; \sqrt{\bar{\alpha}_k}x^{(0)}, (1-\bar{\alpha}_k) \mathrm{I}_d)$.
    
3. $\sigma_k^2 = \beta_k$, which is used in 

    $p_\theta(x^{(k-1)}\,|\,x^{(k)}) = \mathcal{N}\Big(x^{(k-1)}; \mathrm{\mu}_\theta(x^{(k)}, k), \sigma_k^2\mathrm{I}_d\Big)$
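
The closed form for $q(x^{(k)}\,|\,x^{(0)})$ in item 2 can be checked numerically by propagating the conditional mean coefficient and variance through the one-step transitions. This is only a sanity-check sketch; the schedule values below are assumptions chosen for illustration.

```python
import torch

# Illustrative linear schedule (assumption for this sketch only)
N = 100
beta = torch.linspace(0.001, 0.02, N, dtype=torch.float64)
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)

# Given x^(0), x^(k) is Gaussian with mean m_k * x^(0) and covariance v_k * I_d.
# One forward step maps (m, v) to (sqrt(alpha_k) * m, alpha_k * v + beta_k).
mean_coef = torch.empty(N, dtype=torch.float64)
var = torch.empty(N, dtype=torch.float64)
m, v = 1.0, 0.0
for k in range(N):
    m = torch.sqrt(alpha[k]) * m
    v = alpha[k] * v + beta[k]
    mean_coef[k] = m
    var[k] = v

# Compare with the closed form sqrt(alpha_bar_k) and 1 - alpha_bar_k
print(torch.allclose(mean_coef, torch.sqrt(alpha_bar)))
print(torch.allclose(var, 1.0 - alpha_bar))
```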

In [None]:
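# number of diffusion steps
N = 100

# linear noise schedule beta_1, ..., beta_N (entry i holds the value for k = i+1)
beta_min = 0.001
beta_max = 0.02
beta_k = torch.linspace(beta_min, beta_max, N)

# alpha_k = 1 - beta_k
alpha_k = 1.0 - beta_k

# alpha_bar_k = alpha_1 * ... * alpha_k (cumulative product)
alpha_bar_k = torch.cumprod(alpha_k, dim=0)

# standard deviation of the reverse transitions, sigma_k^2 = beta_k
sigma_k = torch.sqrt(beta_k)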

#### display the parameters

In [None]:
print(f'beta_k={beta_k}, length={beta_k.shape[0]}')
print(f'alpha_k={alpha_k}, length={alpha_k.shape[0]}')
print(f'alpha_bar_k={alpha_bar_k}, length={alpha_bar_k.shape[0]}')

### feedforward neural network to represent $\epsilon(x,k):\mathbb{R}^d \times \{1,\dots, N\} \rightarrow \mathbb{R}^d$

The goal is to learn the mean $\mathrm{\mu}_\theta(x^{(k)}, k)$ in the transition density 
$p_\theta(x^{(k-1)}\,|\,x^{(k)}) = \mathcal{N}\Big(x^{(k-1)}; \mathrm{\mu}_\theta(x^{(k)}, k), \sigma_k^2\mathrm{I}_d\Big)$ of the reverse Markov chain.

We use the parametrization

$$\mathrm{\mu}_\theta(x^{(k)}, k) = \frac{1}{\sqrt{\alpha_k}} \Big(x^{(k)}- \frac{\beta_k}{\sqrt{1-\bar{\alpha}_k}} \epsilon_\theta(x^{(k)}, k)\Big)$$
where $\epsilon_\theta(x^{(k)}, k)$ is a function modeled by a neural network.

In [None]:
class noise_predictor(nn.Module):
    
    def __init__(self, dim, N=1.0):
        super().__init__()
        
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 100),
            nn.Tanh(),
            nn.Linear(100, 100), 
            nn.Tanh(),                      
            nn.Linear(100, 100),             
            nn.Tanh(),            
            nn.Linear(100, 100), 
            nn.Tanh(),            
            nn.Linear(100, dim),             
       )
        self.N = N * 1.0
        
    
    def forward(self, x, k):

        # combine x and the normalized step index k/N into one tensor
        state = torch.cat((x, k/self.N), dim=1)
        
        # pass input to the network
        output = self.net(state)
        
        return output
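
The input construction in `forward` can be checked in isolation. The sketch below assumes a batch of 4 states in $\mathbb{R}^2$ and $N=100$ (illustrative values); it shows that dividing the integer step indices by a float yields a float tensor that can be concatenated with `x`.

```python
import torch

N = 100                                              # assumed number of diffusion steps
x = torch.randn(4, 2)                                # batch of 4 states in R^2
k = torch.randint(low=1, high=N + 1, size=(4, 1))    # integer step indices in {1, ..., N}

# k / N is a float tensor with values in (0, 1]; it becomes the last input column
state = torch.cat((x, k / float(N)), dim=1)
print(state.shape, state.dtype)
```

Note that `k` must have shape `(batch, 1)`, not `(batch,)`, for the concatenation along `dim=1` to work.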

### Training 

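Loss function (with $k$ sampled uniformly from $\{1,\dots,N\}$, $x^{(0)}$ drawn from the training data, and $\epsilon \sim \mathcal{N}(0, \mathrm{I}_d)$):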

   $$L_{\mathrm{simple}}(\theta) = \mathbb{E}_{k, x^{(0)}, \epsilon} \Big[\big|\epsilon_\theta(\sqrt{\bar{\alpha}_k}x^{(0)} + \sqrt{1-\bar{\alpha}_k} \epsilon, k) - \epsilon\big|^2 \Big]$$
   

In [None]:
def training(X, model, learning_rate=1e-3, batch_size=1000, total_epochs=1000):
    
    # determine dimension from training data
    dim = X.shape[1]

    # Adam
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    # change the dataset to PyTorch tensor
    dataset = torch.tensor(X, dtype=torch.float32).reshape(-1,dim)

    # define a dataloader
    data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True)

    loss_list = []

    for epoch in range(total_epochs):   # for each epoch

        for idx, data in enumerate(data_loader):  # loop over all mini-batches 

            # for each state in the mini-batch, uniformly sample an index from 1,..., N
            k = torch.randint(low=1, high=N+1, size=(data.shape[0], 1))  

            # noise (standard Gaussian random variables)
            epsilon = torch.randn_like(data) 
            
            # xk given x0=data
            xk = torch.sqrt(alpha_bar_k[k-1]) * data + torch.sqrt(1-alpha_bar_k[k-1]) * epsilon

            # evaluate the model
            noise_pred = model(xk, k) 

            loss = torch.mean(torch.sum((noise_pred - epsilon)**2, dim=1)) 
            
            # reset gradients accumulated in the previous step
            optimizer.zero_grad()
            
            # backpropagation: compute gradients of the loss
            loss.backward()

            # update weights
            optimizer.step()

            if idx == 0:
                # record the loss    
                loss_list.append(loss.item())  
                if epoch % 100 == 0:
                    print ('epoch=%d\n   loss=%.4f' % (epoch, loss.item()))   
                    
    return loss_list         
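
As a reference point for the loss values printed during training: a predictor that always outputs zero has $L_{\mathrm{simple}} = \mathbb{E}|\epsilon|^2 = d$, so a trained network should reach values below $d$. The sketch below estimates this baseline by Monte Carlo. The names `simple_loss` and `zero_predictor`, the schedule, and the stand-in data are assumptions made for illustration only.

```python
import torch

torch.manual_seed(0)

# Illustrative setup (assumptions for this sketch only)
dim = 2
N = 100
beta = torch.linspace(0.001, 0.02, N)
alpha_bar = torch.cumprod(1.0 - beta, dim=0)


def simple_loss(predictor, x0):
    """Monte Carlo estimate of L_simple for one batch x0 of shape (batch, dim)."""
    k = torch.randint(low=1, high=N + 1, size=(x0.shape[0], 1))
    eps = torch.randn_like(x0)
    xk = torch.sqrt(alpha_bar[k - 1]) * x0 + torch.sqrt(1 - alpha_bar[k - 1]) * eps
    return torch.mean(torch.sum((predictor(xk, k) - eps) ** 2, dim=1))


def zero_predictor(x, k):
    """Hypothetical baseline that ignores its input and predicts zero noise."""
    return torch.zeros_like(x)


x0 = torch.randn(50000, dim)     # stand-in for training data
baseline = simple_loss(zero_predictor, x0).item()
print(baseline)                  # should be close to dim
```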

### generate new samples by sampling the reverse Markov chain: 

$$
  x^{(k-1)} \sim p_\theta(x^{(k-1)}\,|\,x^{(k)}) = \mathcal{N}\Big(x^{(k-1)};
    \mathrm{\mu}_\theta(x^{(k)}, k), \sigma_k^2\mathrm{I}_d\Big)\,, \qquad k=N, \dots, 1\,.
$$
  where $\sigma_k^2 = \beta_k$ and
  $$\mathrm{\mu}_\theta(x^{(k)}, k) = \frac{1}{\sqrt{\alpha_k}} \Big(x^{(k)}- \frac{\beta_k}{\sqrt{1-\bar{\alpha}_k}} \epsilon_\theta(x^{(k)}, k)\Big)$$

Therefore, 
  
$$\begin{aligned}
x^{(k-1)} =& \mathrm{\mu}_\theta(x^{(k)}, k) + \sigma_k z \\
= & \frac{1}{\sqrt{\alpha_k}} \Big(x^{(k)}-
  \frac{\beta_k}{\sqrt{1-\bar{\alpha}_k}} \epsilon_\theta(x^{(k)}, k)\Big) +
  \sigma_k z\,, \quad ~\mathrm{where}~ z \sim \mathcal{N}(0, \mathrm{I}_d)\,.
\end{aligned}  
$$

In [None]:
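def reverse_sampling(model, M, dim):
    
    # no gradients are needed when sampling
    with torch.no_grad():
       
        # sample x^(N) from the prior (standard Gaussian)
        x = torch.randn(M, dim)
    
        for k in reversed(range(1, N+1)):
            # no noise is added in the last step (k = 1)
            if k > 1: 
                z = torch.randn(M, dim)
            else:
                z = torch.zeros(M, dim)
        
            # predicted noise, scaled by beta_k / sqrt(1 - alpha_bar_k)
            tmp = beta_k[k-1] / torch.sqrt(1-alpha_bar_k[k-1]) * model(x, torch.ones(M,1) * k)
        
            # mean of the reverse transition
            mu_theta = 1.0 / torch.sqrt(alpha_k[k-1]) * (x - tmp)
        
            # x^(k-1) = mu_theta + sigma_k * z
            x = mu_theta + sigma_k[k-1] * z
    
    return x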

### prepare the dataset

In [None]:
n_samples = 10000

X, Y = make_moons(n_samples, noise=0.05, random_state=10)

fig, ax = plt.subplots(1,1, figsize=(5, 4))

ax.scatter(X[:, 0], X[:, 1])
ax.set_title("dataset")

plt.tight_layout()
plt.show()

print (X.shape)

### define the neural network model 

In [None]:
model = noise_predictor(dim=2, N=N)

### training 

In [None]:
# batch-size
batch_size = 1000

# total training epochs
total_epochs = 5000

# training 
loss_list = training(X, model, learning_rate=1e-3, batch_size=batch_size, total_epochs=total_epochs)

# plot the evolution of the loss function during training
fig, ax = plt.subplots(1,1, figsize=(5, 4))
ax.plot(loss_list)
ax.set_xlabel('epoch')
ax.set_title('loss vs epoch')    

### generate new samples 

In [None]:
X_gen = reverse_sampling(model, M=10000, dim=2)

X_gen = X_gen.detach().numpy()

### compare generated data with training data

In [None]:
fig, ax = plt.subplots(1,2, figsize=(8, 4))

ax[0].scatter(X[:, 0], X[:, 1])
ax[0].set_title("data")
ax[1].scatter(X_gen[:, 0], X_gen[:, 1])
ax[1].set_title("generated")
plt.tight_layout()
plt.show()
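
The scatter plots give a visual comparison. A rough numerical complement is to compare the first two moments of the two samples. The helper `moment_gap` below is a hypothetical addition, not part of the original notebook, and it is demonstrated on synthetic arrays. Matching moments is only a necessary condition for matching distributions.

```python
import numpy as np


def moment_gap(a, b):
    """Distance between sample means and between sample covariances of two (n, d) arrays."""
    mean_gap = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
    cov_gap = np.linalg.norm(np.cov(a, rowvar=False) - np.cov(b, rowvar=False))
    return mean_gap, cov_gap


# Synthetic demonstration: b has the same distribution as a, c is shifted by 1
rng = np.random.default_rng(0)
a = rng.normal(size=(5000, 2))
b = rng.normal(size=(5000, 2))
c = b + 1.0

print(moment_gap(a, b))   # both gaps small
print(moment_gap(a, c))   # mean gap clearly larger
```

In this notebook the helper could be called as `moment_gap(X, X_gen)`.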