# Homework 1: Autoregressive models

## Task 1: Theory (5pt)

1. Consider the MADE model with a single hidden layer. The input object is $\mathbf{x} \in \mathbb{R}^m$. We denote by $\mathbf{W} \in \mathbb{R}^{h \times m}$ the matrix of weights between the input and the hidden layer, and by $\mathbf{V} \in \mathbb{R}^{m \times h}$ the matrix of weights between the hidden and the output layer ($h$ is the number of neurons in the hidden layer). Let us generate the correct autoregressive masks $\mathbf{M}_{\mathbf{W}} \in \mathbb{R}^{h \times m}$ and $\mathbf{M}_{\mathbf{V}} \in \mathbb{R}^{m \times h}$ (the generation algorithm is given in Lecture 1) for the natural ordering of the variables
$$
    p(\mathbf{x}) = p(x_1) \cdot p(x_2 | x_1) \cdot \dots \cdot p(x_m | x_{m-1}, \dots, x_1).
$$ 
Each mask is a binary matrix of 0 and 1. Let's introduce the matrix $\mathbf{M} = \mathbf{M}_{\mathbf{V}} \mathbf{M}_{\mathbf{W}}$. Prove that:
    * $\mathbf{M}$ is strictly lower triangular (has zeros on the diagonal and above the diagonal);
    * $\mathbf{M}_{ij}$  is equal to the number of paths in the network graph between the output neuron $\hat{x}_i$ and the input neuron $x_j$.
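Before writing the proof, it can help to check both claims numerically. The sketch below is not the Lecture 1 algorithm verbatim: it assumes the usual MADE construction, in which each hidden unit gets a degree drawn from $\{1, \dots, m - 1\}$, an input $x_j$ is connected to a hidden unit when $j$ is at most its degree, and an output $\hat{x}_i$ is connected to a hidden unit when its degree is strictly less than $i$. The sizes `m`, `h` and the seed are illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
m, h = 5, 8  # illustrative sizes

# Assumed construction: each hidden unit gets a degree in {1, ..., m - 1}.
degrees = rng.integers(1, m, size=h)
idx = np.arange(1, m + 1)  # input/output indices 1..m (natural ordering)

# M_W[k, j] = 1 if hidden unit k may see input x_j (j <= degree of k).
M_W = (idx[None, :] <= degrees[:, None]).astype(int)  # shape (h, m)
# M_V[i, k] = 1 if output i may see hidden unit k (degree of k < i).
M_V = (degrees[None, :] < idx[:, None]).astype(int)   # shape (m, h)

M = M_V @ M_W  # shape (m, m)

# Brute-force path count: a path output i <- hidden k <- input j exists
# only if both of its edges are unmasked.
paths = np.zeros((m, m), dtype=int)
for i, k, j in itertools.product(range(m), range(h), range(m)):
    paths[i, j] += M_V[i, k] * M_W[k, j]

print(M)
```

A single random draw proves nothing, but a failing check would point to a mistake in the mask construction.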

```
your solution for task 1
```


2. Suppose we have two generative models for images of size $W \times H \times C$, where $W$ is the image width, $H$ is the image height, and $C$ is the number of channels. The first model $p_1(\mathbf{x} | \boldsymbol{\theta})$ outputs a discrete distribution $\text{Categorical}(\boldsymbol{\pi})$ for each pixel, where $\boldsymbol{\pi} = (\pi_1, \dots,  \pi_{256})$. The second model $p_2(\mathbf{x} | \boldsymbol{\theta})$ models a discrete distribution by discretizing a continuous mixture of logistic distributions
$$
    p(\nu | \boldsymbol{\mu}, \mathbf{s}, \boldsymbol{\pi}) = \sum_{k=1}^K \pi_k p(\nu | \mu_k, s_k).
$$
$$
    P(x | \boldsymbol{\mu}, \mathbf{s}, \boldsymbol{\pi}) = P(x + 0.5 | \boldsymbol{\mu}, \mathbf{s}, \boldsymbol{\pi}) - P(x - 0.5 | \boldsymbol{\mu}, \mathbf{s}, \boldsymbol{\pi}).
$$
    Each of the models outputs parameters of pixel distributions.

    * Calculate the dimensions of the output tensor for the model $p_1(\mathbf{x} | \boldsymbol{\theta})$ and for the model $p_2(\mathbf{x} | \boldsymbol{\theta})$. 
    * For what number of mixture components $K$ does the number of elements of the output tensor for $p_2(\mathbf{x} | \boldsymbol{\theta})$ become greater than that for $p_1(\mathbf{x} | \boldsymbol{\theta})$?
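The discretization formula above can be explored numerically. The sketch below is only an illustration and does not answer the questions: it evaluates the mixture CDF for $K = 3$ components with made-up parameters (`pi`, `mu`, `s` are hypothetical values) and turns it into a probability mass over the integer intensities $0, \dots, 255$.

```python
import numpy as np

def logistic_cdf(v, mu, s):
    """CDF of the logistic distribution with location mu and scale s."""
    return 1.0 / (1.0 + np.exp(-(v - mu) / s))

def mixture_cdf(v, mu, s, pi):
    """CDF of the mixture: sum_k pi_k * CDF_k(v)."""
    v = np.asarray(v, dtype=float)[..., None]  # broadcast over components
    return np.sum(pi * logistic_cdf(v, mu, s), axis=-1)

# Hypothetical parameters for K = 3 components.
pi = np.array([0.5, 0.3, 0.2])
mu = np.array([40.0, 128.0, 200.0])
s = np.array([10.0, 20.0, 8.0])

x = np.arange(256)
# Probability mass of each integer intensity: CDF(x + 0.5) - CDF(x - 0.5).
pmf = mixture_cdf(x + 0.5, mu, s, pi) - mixture_cdf(x - 0.5, mu, s, pi)

# The differences telescope, so the total mass is CDF(255.5) - CDF(-0.5).
total = mixture_cdf(255.5, mu, s, pi) - mixture_cdf(-0.5, mu, s, pi)
print(pmf.sum(), total)
```

Note that the mass is fully described by the $3K$ numbers in `pi`, `mu`, and `s`, regardless of how many intensity levels there are.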

```
your solution for task 2
```

3. In the course, we will meet different divergences (not only $KL$). So let's get acquainted with the class of $\alpha$-divergences:
$$
    D_{\alpha}(p || q) = \frac{4}{1 - \alpha^2} \left( 1 - \int p(x)^{\frac{1 + \alpha}{2}}q(x)^{\frac{1 - \alpha}{2}}dx\right).
$$
For each $\alpha \in [-\infty; +\infty]$ the function $D_{\alpha} (p || q)$ measures the dissimilarity of the two distributions, and its properties depend on $\alpha$.
	  
      Prove that for $\alpha \rightarrow 1$ the divergence $D_{\alpha}(p || q) \rightarrow KL(p || q)$, and for $\alpha \rightarrow -1$ the divergence $D_{\alpha}(p || q) \rightarrow KL(q || p)$. (Hint: use the fact that $t^\epsilon = \exp(\epsilon \ln t) = 1 + \epsilon \ln t + O(\epsilon^2)$.)
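The two limits can be checked numerically before proving them. The sketch below assumes discrete distributions, so the integral becomes a sum; the vectors `p` and `q` and the values $\alpha = \pm 0.999$ are arbitrary illustrative choices.

```python
import numpy as np

def alpha_divergence(p, q, alpha):
    """Discrete analogue of D_alpha(p || q): the integral becomes a sum."""
    integral = np.sum(p ** ((1 + alpha) / 2) * q ** ((1 - alpha) / 2))
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - integral)

def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return np.sum(p * np.log(p / q))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

near_plus_one = alpha_divergence(p, q, 0.999)
near_minus_one = alpha_divergence(p, q, -0.999)
print(near_plus_one, kl(p, q))    # the two numbers should nearly agree
print(near_minus_one, kl(q, p))   # the two numbers should nearly agree
```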

```
your solution for task 3
```

Now it is time to move on to the practical part of the homework.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import io
import itertools
import pickle
import os
from tqdm import tqdm

import torch
import torch.optim as optim
import torch.utils.data as data
import torch.nn as nn
import torch.nn.functional as F

from torchvision.utils import make_grid

USE_CUDA = torch.cuda.is_available()

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Use the following functions to train your models.

In [None]:
def get_cross_entropy_loss(scores, labels):
    # ====
    # your code
    
    # ====


def test_get_cross_entropy_loss():
    input = torch.tensor([[1, 2, 3, 4],[5, 6, 7, 8]], dtype=torch.float32)
    target = torch.tensor([3, 1], dtype=torch.long)

    assert np.allclose(get_cross_entropy_loss(input, target).numpy(), 1.4402)

test_get_cross_entropy_loss()
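
The value 1.4402 expected by the test can be reproduced by hand. The NumPy sketch below assumes the loss is the softmax cross-entropy averaged over the batch, which is what the test value suggests; it covers only the two-dimensional `(batch, classes)` case, whereas the models below produce scores with an extra trailing dimension.

```python
import numpy as np

scores = np.array([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]])
labels = np.array([3, 1])

# log-softmax via the log-sum-exp trick (subtract the row maximum for stability)
shifted = scores - scores.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

# negative log-likelihood of the correct class, averaged over the batch
nll = -log_probs[np.arange(len(labels)), labels].mean()
print(round(nll, 4))  # 1.4402
```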

Do not change these functions.

In [None]:
def train_epoch(model, train_loader, optimizer, use_cuda):
    model.train()
  
    train_losses = []
    for x in train_loader:
        if use_cuda:
            x = x.cuda()
        loss = get_cross_entropy_loss(model(x), x)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_losses.append(loss.item())
    return train_losses


def eval_model(model, data_loader, use_cuda):
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for x in data_loader:
            if use_cuda:
                x = x.cuda()
            loss = get_cross_entropy_loss(model(x), x)
            total_loss += loss * x.shape[0]
        avg_loss = total_loss / len(data_loader.dataset)
    return avg_loss.item()


def train_model(model, train_loader, test_loader, epochs, lr, use_tqdm=False, use_cuda=False):
    if use_cuda:
        model = model.cuda()
    optimizer = optim.Adam(model.parameters(), lr=lr)

    train_losses = []
    test_losses = [eval_model(model, test_loader, use_cuda)]
    if use_tqdm:
        forrange = tqdm(range(epochs))
    else:
        forrange = range(epochs)
    for epoch in forrange:
        model.train()
        train_losses.extend(train_epoch(model, train_loader, optimizer, use_cuda))
        test_loss = eval_model(model, test_loader, use_cuda)
        test_losses.append(test_loss)

    return train_losses, test_losses


def plot_training_curves(train_losses, test_losses):
    plt.figure(figsize=(8, 6))
    n_epochs = len(test_losses) - 1
    x_train = np.linspace(0, n_epochs, len(train_losses))
    x_test = np.arange(n_epochs + 1)

    plt.plot(x_train, train_losses, label='train loss')
    plt.plot(x_test, test_losses, label='test loss')
    plt.legend()
    plt.title('training curves')
    plt.xlabel('Epoch')
    plt.ylabel('NLL')


def load_pickle(path, flatten=True):
    with open(path, 'rb') as f:
        data = pickle.load(f)
    train_data = data['train'].astype('float32')[:, :, :, [0]] > 128
    test_data = data['test'].astype('float32')[:, :, :, [0]] > 128
    train_data = np.transpose(train_data.astype('uint8'), (0, 3, 1, 2))
    test_data = np.transpose(test_data.astype('uint8'), (0, 3, 1, 2))
    if flatten:
        train_data = train_data.reshape(-1, 28 * 28)
        test_data = test_data.reshape(-1, 28 * 28)
    return train_data, test_data


def show_samples(samples, title, nrow=10):
    samples = torch.FloatTensor(samples).reshape(-1, 28, 28)
    samples = torch.unsqueeze(samples, dim=1)
    grid_img = make_grid(samples, nrow=nrow)
    plt.figure()
    plt.title(title)
    plt.imshow(grid_img.permute(1, 2, 0))
    plt.axis('off')
    plt.show()


def visualize_mnist_images(data, title):
    idxs = np.random.choice(len(data), replace=False, size=(100,))
    images = data[idxs]
    show_samples(images, title)

## Task 2: MADE on 2D data (5pt)

Train a MADE model on a single image (see the paper for details: https://arxiv.org/abs/1502.03509).

You will work with bivariate data of the form $x = (x_0,x_1)$, where $x_0, x_1 \in \{0, \dots, \text{n_bins} - 1\}$ (i.e. categorical random variables).

Implement and train a MADE model through MLE to represent $p(x_0, x_1)$ on the given image, with any autoregressive ordering of your choosing ($p(x_0, x_1) = p(x_0)p(x_1 | x_0)$ or $p(x_0, x_1) = p(x_1)p(x_0 | x_1)$). 

We advise you to think about which conditional distribution you want to fit and what MADE's masks should look like. It may be useful to one-hot encode your inputs.
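
To make the hint about masks and one-hot encoding concrete, here is a small NumPy sketch for the ordering $p(x_0, x_1) = p(x_0)p(x_1 | x_0)$ with one hidden layer. It is an illustration under assumptions, not the required implementation: the sizes are made up, the names `unit_index` and `hidden_degree` are hypothetical, and masks are stored as `(in_features, out_features)`.

```python
import numpy as np

nin, bins, hidden = 2, 4, 3  # illustrative sizes

# One-hot encoding: each variable occupies `bins` consecutive input units.
x = np.array([[1, 3], [0, 2]])
one_hot = np.eye(bins)[x].reshape(len(x), nin * bins)

# Every one-hot unit inherits the index (1..nin) of the variable it encodes.
unit_index = np.repeat(np.arange(1, nin + 1), bins)
# With nin = 2 the only admissible hidden degree is 1.
hidden_degree = np.ones(hidden, dtype=int)

# Masks stored as (in_features, out_features).
mask_in = (unit_index[:, None] <= hidden_degree[None, :]).astype(int)
mask_out = (hidden_degree[:, None] < unit_index[None, :]).astype(int)

# Number of input-to-output paths reaching each output unit.
paths_per_output = np.ones((1, nin * bins)) @ mask_in @ mask_out
print(paths_per_output)
```

The outputs for $x_0$ receive no paths at all, so $p(x_0)$ is modelled by the output biases alone, while every output for $x_1$ is reached from every one-hot unit of $x_0$ through every hidden unit.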

You do not have to change these functions (except for the path to the data file). Download the file from here: https://drive.google.com/file/d/1GUthJrA5fBpvi593Swo36t8zaFw9Dyak/view?usp=sharing

In [None]:
def generate_2d_data(count, bins):
    # change the path to the image
    im = Image.open(os.path.join('drive', 'My Drive', 'DGM', 'homework_supplementary', 'dgm.png')).resize((bins, bins)).convert('L')
    im = np.array(im).astype('float32')
    dist = im / im.sum()

    pairs = list(itertools.product(range(bins), range(bins)))
    idxs = np.random.choice(len(pairs), size=count, replace=True, p=dist.reshape(-1))
    samples = np.array([pairs[i] for i in idxs])

    split = int(0.8 * len(samples))
    return dist, samples[:split], samples[split:]


def plot_2d_data(train_data, test_data):
    bins = int(max(test_data.max(), train_data.max()) - min(test_data.min(), train_data.min())) + 1
    train_dist, test_dist = np.zeros((bins, bins)), np.zeros((bins, bins))
    for i in range(len(train_data)):
        train_dist[train_data[i][0], train_data[i][1]] += 1
    train_dist /= train_dist.sum()

    for i in range(len(test_data)):
        test_dist[test_data[i][0], test_data[i][1]] += 1
    test_dist /= test_dist.sum()

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 6))
    ax1.set_title('Train Data')
    ax1.imshow(train_dist, cmap='gray')
    ax1.axis('off')
    ax1.set_xlabel('x1')
    ax1.set_ylabel('x0')

    ax2.set_title('Test Data')
    ax2.imshow(test_dist, cmap='gray')
    ax2.axis('off')
    ax2.set_xlabel('x1')
    ax2.set_ylabel('x0')

    plt.show()
    
    
def plot_2d_distribution(true_dist, learned_dist):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 8))
    ax1.imshow(true_dist, cmap='gray')
    ax1.set_title('True Distribution')
    ax1.axis('off')
    ax2.imshow(learned_dist, cmap='gray')
    ax2.set_title('Learned Distribution')
    ax2.axis('off')

In [None]:
COUNT = 20000
BINS = 60

image, train_data, test_data = generate_2d_data(COUNT, BINS)
plot_2d_data(train_data, test_data)

Now we have to implement a masked dense layer. It is the core component of the MADE model. It acts like a usual dense layer, but first multiplies the weights elementwise by a predefined binary mask.
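
The effect of the mask can be seen in a plain NumPy sketch. This is not the `MaskedLinear` class itself; the helper `masked_linear` and all sizes are hypothetical, and the mask here has the same `(out_features, in_features)` shape as the weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
in_features, out_features = 3, 2

weight = rng.normal(size=(out_features, in_features))
bias = rng.normal(size=out_features)
x = rng.normal(size=(4, in_features))  # batch of 4 inputs

def masked_linear(x, weight, bias, mask):
    """Dense layer whose weights are multiplied elementwise by a binary mask."""
    return x @ (weight * mask).T + bias

ones = np.ones_like(weight)
zeros = np.zeros_like(weight)
# Cut only the connection from input 0 to output 1.
partial = ones.copy()
partial[1, 0] = 0

full_out = masked_linear(x, weight, bias, ones)      # ordinary dense layer
zero_out = masked_linear(x, weight, bias, zeros)     # only the bias survives
partial_out = masked_linear(x, weight, bias, partial)
```

With `partial`, output 1 no longer reacts to changes in input 0, which is exactly the kind of dependency cut that the autoregressive masks rely on.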

In [None]:
class MaskedLinear(nn.Linear):
    def __init__(self, in_features, out_features):
        super().__init__(in_features, out_features)
        self.register_buffer('mask', torch.ones(out_features, in_features))

    def set_mask(self, mask):
        self.mask.data.copy_(torch.from_numpy(mask.astype(np.uint8).T))

    def forward(self, input):
        # ====
        # your code
        
        # ====


layer = MaskedLinear(2, 2)

x = torch.tensor([1, 2], dtype=torch.float32)
output = layer(x).detach().numpy()

layer.set_mask(np.array([[0, 0], [0, 0]]))
assert np.allclose(layer(x).detach().numpy(), layer.bias.detach().numpy())

layer.set_mask(np.array([[1, 1], [1, 1]]))
assert np.allclose(layer(x).detach().numpy(), output)

In [None]:
def to_one_hot(labels, d):
    """
        The function takes categorical labels of size: batch_size x n_dims.
        One-hot encodes them to d bins and then reshapes the result to batch_size x (n_dims * d)
    """
    assert len(labels.shape) == 2
    one_hot = F.one_hot(labels.to(torch.int64), d)
    return one_hot.view((labels.shape[0], -1)).float()

    
class MADE(nn.Module):
    def __init__(self, nin, bins, hidden_sizes):
        super().__init__()
        self.nin = nin
        self.nout = nin * bins
        self.bins = bins
        self.hidden_sizes = hidden_sizes
        # we will use the trivial ordering of input units
        self.ordering = np.arange(self.nin)

        # ====
        # your code
        # define a simple MLP (sequence of MaskedLinear and ReLU) neural net 
        # self.net = nn.Sequential(list of layers)
        # do not place ReLU at the end of the network!
        # note: the first layer of model should have nin * bins input units
        
        # ====

        self.create_mask()  # builds self.masks and applies them to the MaskedLinear layers

        
    def create_mask(self):
        # ====
        # your code
        # 1) Fix the ordering of the input units from 1 to m (self.ordering).
        # 2) Assign a random number k from 1 to m − 1 to each hidden unit.
        #    This number gives the maximum number of input units to which the unit can be connected.
        # 3) Each hidden unit with number k is connected to the units of the previous layer
        #    whose numbers are less than or equal to k.
        # 4) Each output unit with number k is connected to the units of the previous layer
        #    whose numbers are strictly less than k.
        # 5) Store the list of masks in self.masks (it is used below and in visualize_masks).
        
        # ====

        # set the masks in all MaskedLinear layers
        layers = [l for l in self.net.modules() if isinstance(l, MaskedLinear)]
        for l, m in zip(layers, self.masks):
            l.set_mask(m)

    def visualize_masks(self):
        prod = self.masks[0]
        for idx, m in enumerate(self.masks):
            plt.figure(figsize=(3, 3))
            plt.title(f'layer: {idx}')
            plt.imshow(m, vmin=0, vmax=1, cmap='gray')
            plt.show()

            if idx > 0:
                prod=prod.dot(m)

        plt.figure(figsize=(3, 3))
        plt.title('prod')
        plt.imshow(prod, vmin=0, vmax=1, cmap='gray')
        plt.show()

    def forward(self, x):
        assert len(x.size()) == 2
        assert x.shape[1] == self.nin

        # ====
        # your code
        # 1) apply one hot encoding to x
        # 2) apply the model
        # 3) reshape and transpose the output to (batch_size, self.bins, self.nin)
        
        # ====

    def sample(self, n, use_cuda=True):
        # read carefully and understand the sampling process
        xs = []
        for _ in range(n):
            x = torch.randint(0, self.bins, (1, self.nin))
            if use_cuda:
                x = x.cuda()
            for it in range(self.nin):
                probs = F.softmax(self(x)[0], dim=0).T
                distr = torch.distributions.categorical.Categorical(probs)
                x[0, it] = distr.sample()[it]
            xs.append(x)
        xs = torch.cat(xs)
        return xs.cpu().numpy()

In [None]:
# ====
# your code
HIDDEN_SIZES = 
# ====

model = MADE(2, BINS, HIDDEN_SIZES)


def test_model_output(model):
    assert [10, BINS, 2] == list(model(torch.randint(0, BINS, (10, 2))).size())


def test_create_mask(model):
    prod = np.ones((1, BINS * 2))
    for m in model.masks:
        assert set(np.unique(m)).issubset((True, False))
        prod = prod.dot(m)
    assert np.allclose(prod, np.repeat(np.array([[0, BINS * np.prod(HIDDEN_SIZES)]]), BINS))

test_create_mask(model)
test_model_output(model)

Now we will visualize the model masks. This should help you check whether the model is correct.

In [None]:
model.visualize_masks()

In [None]:
# ====
# your code
# you have to choose these parameters by yourself
BATCH_SIZE = 
EPOCHS = 
LR = 
# ====

train_loader = data.DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)
test_loader = data.DataLoader(test_data, batch_size=BATCH_SIZE)
train_losses, test_losses = train_model(model, train_loader, test_loader, epochs=EPOCHS, lr=LR, use_cuda=USE_CUDA)

assert test_losses[-1] < 4.0

In [None]:
def get_distribution(model, use_cuda=True):
    x = np.mgrid[0:model.bins, 0:model.bins].reshape(2, model.bins ** 2).T
    # ====
    # your code
    # 1) take the model output for the grid x (shape: bins ** 2, bins, 2)
    # 2) apply log_softmax to get log probs (shape: bins ** 2, bins, 2)
    # 3) apply torch.gather to gather values indexed by grid x (shape: bins ** 2, 2)
    # 4) sum the log probs over dim=1 (shape: bins ** 2)
    # 5) exponentiate it (shape: bins ** 2)
    # 6) return an array BINS x BINS with probabilities of each pixel

    
    # ====


distribution = get_distribution(model, USE_CUDA)
assert distribution.shape == (BINS, BINS)

plot_training_curves(train_losses, test_losses)
plot_2d_distribution(image, distribution)
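
The gather-and-sum steps listed in `get_distribution` can be traced on a toy example. The NumPy sketch below replaces the model by hand-made logits (an assumption for illustration: the logits for $x_0$ ignore the input and the logits for $x_1$ depend on $x_0$ only, as a correct autoregressive model would behave) and uses `np.take_along_axis` as the NumPy counterpart of `torch.gather`.

```python
import numpy as np

bins = 3
rng = np.random.default_rng(0)

# Grid of all (x0, x1) pairs, shape (bins ** 2, 2), built as in get_distribution.
grid = np.mgrid[0:bins, 0:bins].reshape(2, bins ** 2).T

# Stand-in for the model output, shape (bins ** 2, bins, 2):
# logits for x0 ignore the input, logits for x1 depend on x0 only.
logits_x0 = rng.normal(size=bins)
logits_x1_given_x0 = rng.normal(size=(bins, bins))
logits = np.stack(
    [np.tile(logits_x0, (bins ** 2, 1)), logits_x1_given_x0[grid[:, 0]]],
    axis=2,
)

# log_softmax over the bins axis
shifted = logits - logits.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

# Pick the log-probability of the observed bin for each variable.
picked = np.take_along_axis(log_probs, grid[:, None, :], axis=1)[:, 0, :]

# Sum the two conditionals in log space, exponentiate, and reshape to a grid.
joint = np.exp(picked.sum(axis=1)).reshape(bins, bins)
print(joint.sum())
```

Because the stand-in logits respect the autoregressive structure, the resulting table is a proper joint distribution; checking that your learned distribution sums to one is a cheap test of the masks.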

In [None]:
# draw samples from model 

samples = model.sample(5000, use_cuda=USE_CUDA)
plot_2d_data(train_data, samples)
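
The `sample` method draws the coordinates one at a time in the autoregressive order: the initial random values are placeholders, and each coordinate is overwritten using a distribution that depends only on the coordinates already sampled. The same idea, known as ancestral sampling, is sketched below in NumPy with a hypothetical tabular joint distribution standing in for the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
bins = 4

# Hypothetical joint distribution over (x0, x1), playing the role of the model.
joint = rng.random((bins, bins))
joint /= joint.sum()

p_x0 = joint.sum(axis=1)               # marginal p(x0)
p_x1_given_x0 = joint / p_x0[:, None]  # each row is p(x1 | x0)

def ancestral_sample(n):
    """Draw n pairs by sampling x0 first, then x1 conditioned on x0."""
    samples = np.empty((n, 2), dtype=int)
    for i in range(n):
        x0 = rng.choice(bins, p=p_x0)
        x1 = rng.choice(bins, p=p_x1_given_x0[x0])
        samples[i] = (x0, x1)
    return samples

samples = ancestral_sample(20000)

# Compare the empirical frequencies with the distribution we sampled from.
empirical = np.zeros((bins, bins))
np.add.at(empirical, (samples[:, 0], samples[:, 1]), 1)
empirical /= empirical.sum()
print(np.abs(empirical - joint).max())
```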

## Task 3: MADE on MNIST (3pt)


You do not have to change these functions (except for the path to the data file). Download the file from here: https://drive.google.com/file/d/1Ms-RBybrueI3_w2CRj7lM9mYjfvFRL6w/view?usp=sharing

In [None]:
# change the path to the file
train_data, test_data = load_pickle(os.path.join('drive', 'My Drive', 'DGM', 'homework_supplementary', 'mnist.pkl'))
visualize_mnist_images(train_data, 'MNIST samples')

In [None]:
# ====
# your code
HIDDEN_SIZES = 
# ====

BINS = 2

model = MADE(28 * 28, BINS, HIDDEN_SIZES)

def test_model_output(model):
    assert [10, BINS, 28 * 28] == list(model(torch.randint(0, 2, (10, 28 * 28))).size())


test_model_output(model)

In [None]:
# visualize your masks and make sure that they are correct
model.visualize_masks()

In [None]:
# ====
# your code
# you have to choose these parameters by yourself
BATCH_SIZE = 
EPOCHS = 
LR = 
# ====

train_loader = data.DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)
test_loader = data.DataLoader(test_data, batch_size=BATCH_SIZE)
train_losses, test_losses = train_model(model, train_loader, test_loader, epochs=EPOCHS, lr=LR, use_tqdm=True, use_cuda=USE_CUDA)

assert test_losses[-1] < 0.16

In [None]:
plot_training_curves(train_losses, test_losses)

In [None]:
samples = model.sample(25, use_cuda=USE_CUDA)
show_samples(samples, title='MNIST samples', nrow=5)