# Variational Autoencoder for the MNIST-Data-Set

[Tobias Haase](https://tchaase.github.io/)

In the following I adapted [a blog post on VAEs for this particular data-set](https://debuggercafe.com/convolutional-variational-autoencoder-in-pytorch-on-mnist-dataset/). I took the basic code from that post and then tried to understand what is going on from there.

## Set Up
Firstly I am loading the required modules.

In [3]:

import torch 
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import torchvision
import torchvision.transforms as transforms
from torchvision.utils import save_image

from torch.utils.data import DataLoader
from torchvision.utils import make_grid

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from tqdm import tqdm
import imageio 



Then I will define the functions I will use to train and validate the network. 

First, the loss function is defined. The KL divergence is computed manually from its closed form. The function takes three inputs: the reconstruction loss, and the mean and the log variance of the latent distribution produced by the encoder.
The final loss is the sum of the reconstruction loss and the KL divergence, so that sum is returned.


In [4]:

def final_loss(bce_loss, mu, logvar):
    """
    This function will add the reconstruction loss (BCELoss) and the 
    KL-Divergence.
    KL-Divergence = -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    :param bce_loss: reconstruction loss
    :param mu: the mean from the latent vector
    :param logvar: log variance from the latent vector
    """
    BCE = bce_loss 
    KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return BCE + KLD
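
As a sanity check on the closed-form KL term, the following minimal sketch compares it against PyTorch's own `torch.distributions.kl_divergence` for a diagonal Gaussian versus a standard normal prior. The tensors `mu` and `logvar` are random stand-ins, and `closed_form_kld` is a hypothetical helper that only repeats the KLD line from `final_loss`.

```python
import torch
from torch.distributions import Normal, kl_divergence

def closed_form_kld(mu, logvar):
    # KL( N(mu, sigma^2) || N(0, 1) ), summed over all elements
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

torch.manual_seed(0)
mu = torch.randn(4, 16)       # stand-in batch of 4 latent means
logvar = torch.randn(4, 16)   # stand-in log variances

q = Normal(mu, torch.exp(0.5 * logvar))                 # approximate posterior
p = Normal(torch.zeros_like(mu), torch.ones_like(mu))   # standard normal prior
reference = kl_divergence(q, p).sum()

kld = closed_form_kld(mu, logvar)
print(torch.allclose(kld, reference, rtol=1e-4, atol=1e-4))
```

If the two values agree, the sign and the factor of the manual formula are consistent with the library's definition.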


Next, the training function is defined. It takes the model, the data, the device, the optimizer, and the criterion. The criterion is the loss function, i.e. the quantity that is minimized during training.

In [5]:
def train(model, dataloader, dataset, device, optimizer, criterion):
    model.train()
    running_loss = 0.0
    counter = 0
    for i, data in tqdm(enumerate(dataloader), total=len(dataloader)):
        counter += 1  # counts the batches processed in this epoch
        data = data[0]  # each batch is (images, labels); the labels are not needed
        data = data.to(device)
        optimizer.zero_grad()
        reconstruction, mu, logvar = model(data)
        bce_loss = criterion(reconstruction, data)
        loss = final_loss(bce_loss, mu, logvar)
        loss.backward()  # backpropagation: computes the gradients of all parameters that contributed to the loss
        running_loss += loss.item()  # accumulate the loss of every batch
        optimizer.step()  # update the parameters once, using the gradients just computed
    train_loss = running_loss / counter  # average loss per batch
    return train_loss
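
`train` returns `running_loss / counter`, i.e. the average loss per batch. With a criterion that sums over the batch (as the `BCELoss(reduction='sum')` used later in this notebook does), that number scales with the batch size. The sketch below, using made-up numbers and two hypothetical helpers, shows how a per-batch average differs from a per-image average.

```python
def per_batch_loss(batch_loss_sums):
    # What running_loss / counter computes: average of the summed batch losses
    return sum(batch_loss_sums) / len(batch_loss_sums)

def per_sample_loss(batch_loss_sums, num_samples):
    # Alternative: divide the total loss by the number of images seen
    return sum(batch_loss_sums) / num_samples

# Hypothetical example: three batches with 32, 32 and 16 images,
# each image contributing a loss of exactly 100.0
batch_sizes = [32, 32, 16]
batch_loss_sums = [100.0 * n for n in batch_sizes]

batch_avg = per_batch_loss(batch_loss_sums)
sample_avg = per_sample_loss(batch_loss_sums, sum(batch_sizes))
print(batch_avg)    # depends on the batch sizes
print(sample_avg)   # 100.0, independent of the batch sizes
```

Either normalization works for tracking progress, as long as the same one is used when comparing runs.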

Lastly, we need to validate that the training worked. Two things differentiate this from the training function. First, there is no backpropagation step here: the evaluation does not affect the model's parameters. Second, the reconstructions of the last batch are returned so that they can be saved as images with the functions from the utility functions section.


In [6]:

def validate(model, dataloader, dataset, device, criterion):
    model.eval()
    running_loss = 0.0
    counter = 0
    with torch.no_grad():
        for i, data in tqdm(enumerate(dataloader), total=len(dataloader)):
            counter += 1
            data = data[0]
            data = data.to(device)
            reconstruction, mu, logvar = model(data)
            bce_loss = criterion(reconstruction, data)
            loss = final_loss(bce_loss, mu, logvar)
            running_loss += loss.item()

            # keep the reconstruction of the last batch of every epoch;
            # len(dataloader) also counts a final, smaller batch
            if i == len(dataloader) - 1:
                recon_images = reconstruction
    val_loss = running_loss / counter
    return val_loss, recon_images

### Utility functions
This section contains utility functions for saving plots and images. They are kept separate so that they do not clutter up the training part, as they are not relevant to the training itself.

In [7]:
to_pil_image = transforms.ToPILImage()  # converts a tensor to a PIL image; used later to generate the .gif

# Converts the image grids to PIL images, then to numpy arrays (the format imageio accepts), and saves them as a .gif.
def image_to_vid(images):
    imgs = [np.array(to_pil_image(img)) for img in images]
    imageio.mimsave('../outputs/generated_images.gif', imgs)

# Saves the reconstructions produced by the VAE in one epoch as a single image.
def save_reconstructed_images(recon_images, epoch):
    save_image(recon_images.cpu(), f"../outputs/output{epoch}.jpg")  # save_image comes from torchvision.utils

# Finally, the training and validation losses are plotted with matplotlib and saved.
def save_loss_plot(train_loss, valid_loss):
    # loss plots
    plt.figure(figsize=(10, 7))
    plt.plot(train_loss, color='orange', label='train loss')
    plt.plot(valid_loss, color='red', label='validation loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    plt.savefig('../outputs/loss.jpg')
    plt.show()

## Model

Next I will define the parameters for the model. 
The kernel size is 4x4. This means that a kernel 4 pixels wide and 4 pixels high is used for the convolution.

In [8]:
kernel_size = 4 # (4, 4) kernel
init_channels = 8 # initial number of filters, i.e. the number of output channels of the first layer
image_channels = 1 # MNIST images are grayscale
latent_dim = 16 # latent dimension for sampling
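
To see what a 4x4 kernel does to the image size, the standard output-size formulas for `nn.Conv2d` and `nn.ConvTranspose2d` can be evaluated by hand. The sketch below assumes a 32x32 input (the size the images are resized to later in this notebook) and the stride and padding values of the encoder and decoder layers in the model defined below. The helper names `conv_out` and `conv_transpose_out` are made up for illustration.

```python
def conv_out(size, kernel, stride, padding):
    # Output size of nn.Conv2d along one spatial dimension (dilation 1)
    return (size + 2 * padding - kernel) // stride + 1

def conv_transpose_out(size, kernel, stride, padding):
    # Output size of nn.ConvTranspose2d (dilation 1, output_padding 0)
    return (size - 1) * stride - 2 * padding + kernel

kernel_size = 4
# (stride, padding) of the four encoder and the four decoder layers
encoder_layers = [(2, 1), (2, 1), (2, 1), (2, 0)]
decoder_layers = [(1, 0), (2, 1), (2, 1), (2, 1)]

size = 32  # assumed input height and width
enc_sizes = [size]
for stride, padding in encoder_layers:
    size = conv_out(size, kernel_size, stride, padding)
    enc_sizes.append(size)

dec_sizes = [size]
for stride, padding in decoder_layers:
    size = conv_transpose_out(size, kernel_size, stride, padding)
    dec_sizes.append(size)

print(enc_sizes)  # [32, 16, 8, 4, 1]
print(dec_sizes)  # [1, 4, 8, 16, 32]
```

Under these assumptions the encoder shrinks each image to a single 1x1 position per channel, and the decoder grows it back to 32x32.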

In the next step the full model is formulated.

In [9]:
# define a Conv VAE
class ConvVAE(nn.Module):
    def __init__(self):
        super(ConvVAE, self).__init__()
 
        # encoder: four strided convolutions that shrink a 32x32 image step by step (32 -> 16 -> 8 -> 4 -> 1)
        self.enc1 = nn.Conv2d(
            in_channels=image_channels, out_channels=init_channels, kernel_size=kernel_size, 
            stride=2, padding=1
        )
        self.enc2 = nn.Conv2d(
            in_channels=init_channels, out_channels=init_channels*2, kernel_size=kernel_size, 
            stride=2, padding=1
        )
        self.enc3 = nn.Conv2d(
            in_channels=init_channels*2, out_channels=init_channels*4, kernel_size=kernel_size, 
            stride=2, padding=1
        )
        self.enc4 = nn.Conv2d(
            in_channels=init_channels*4, out_channels=64, kernel_size=kernel_size, 
            stride=2, padding=0
        )
        # fully connected layers for learning representations <-- this is the bottleneck
        self.fc1 = nn.Linear(64, 128)
        # The 128 hidden units feed two heads that output the mean and the log variance of the latent distribution.
        self.fc_mu = nn.Linear(128, latent_dim)
        self.fc_log_var = nn.Linear(128, latent_dim)
        self.fc2 = nn.Linear(latent_dim, 64)

        # decoder 
        # The decoder mirrors the encoder in reverse order, starting with the layer without padding and moving on to the layers with padding.
        self.dec1 = nn.ConvTranspose2d(
            in_channels=64, out_channels=init_channels*8, kernel_size=kernel_size, 
            stride=1, padding=0
        )
        self.dec2 = nn.ConvTranspose2d(
            in_channels=init_channels*8, out_channels=init_channels*4, kernel_size=kernel_size, 
            stride=2, padding=1
        )
        self.dec3 = nn.ConvTranspose2d(
            in_channels=init_channels*4, out_channels=init_channels*2, kernel_size=kernel_size, 
            stride=2, padding=1
        )
        self.dec4 = nn.ConvTranspose2d(
            in_channels=init_channels*2, out_channels=image_channels, kernel_size=kernel_size, 
            stride=2, padding=1
        )

    # The reparameterization trick is what allows us to backpropagate through the model. Gradients cannot flow through a random sampling node,
    # so the randomness is moved into eps, and the sample becomes a deterministic function of mu and log_var.
    def reparameterize(self, mu, log_var):
        """
        :param mu: mean from the encoder's latent space
        :param log_var: log variance from the encoder's latent space
        """
        std = torch.exp(0.5*log_var) # standard deviation
        eps = torch.randn_like(std) # `randn_like` returns a tensor of the same shape as std, filled with standard normal noise
        sample = mu + (eps * std) # sampling
        return sample
 
    def forward(self, x):
        # encoding
        x = F.relu(self.enc1(x))
        x = F.relu(self.enc2(x))
        x = F.relu(self.enc3(x))
        x = F.relu(self.enc4(x))
        batch, _, _, _ = x.shape
        x = F.adaptive_avg_pool2d(x, 1).reshape(batch, -1)  # average each of the 64 feature maps down to 1x1, then flatten to (batch, 64) for fc1
        hidden = self.fc1(x)
        # get `mu` and `log_var`
        mu = self.fc_mu(hidden)
        log_var = self.fc_log_var(hidden)
        # get the latent vector through reparameterization
        z = self.reparameterize(mu, log_var)
        z = self.fc2(z) # project the sampled latent vector back to 64 features for the decoder
        z = z.view(-1, 64, 1, 1)
 
        # decoding
        x = F.relu(self.dec1(z))
        x = F.relu(self.dec2(x))
        x = F.relu(self.dec3(x))
        reconstruction = torch.sigmoid(self.dec4(x)) # sigmoid maps the outputs to (0, 1), which matches the pixel range and is what BCELoss expects as input
        return reconstruction, mu, log_var
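
The effect of the reparameterization trick can be checked in isolation. The following minimal sketch uses small stand-in tensors rather than the model itself, repeats the three lines of `reparameterize`, and confirms that gradients reach both `mu` and `log_var` even though the sample contains random noise.

```python
import torch

torch.manual_seed(0)
mu = torch.zeros(3, requires_grad=True)        # stand-in latent mean
log_var = torch.zeros(3, requires_grad=True)   # stand-in latent log variance

std = torch.exp(0.5 * log_var)
eps = torch.randn_like(std)   # the randomness lives here; eps needs no gradient
z = mu + eps * std            # deterministic in mu and log_var once eps is drawn

z.sum().backward()
print(mu.grad)        # dz/dmu = 1 for every element
print(log_var.grad)   # dz/dlog_var = 0.5 * eps * std
```

Sampling `z` directly from a distribution object with `.sample()` would not provide these gradients, which is why the noise is drawn separately.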

## Training

Firstly a few preparations are done. 

In [10]:
matplotlib.style.use('ggplot')
# The device is set to the GPU if CUDA is available (as it is for me), otherwise to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# initialize the model: a convolutional VAE, i.e. the encoder and decoder use convolutions to process the images; it is moved to the chosen device
model = ConvVAE().to(device)
# set the learning parameters
lr = 0.001  # this is also the default learning rate of optim.Adam
epochs = 10
batch_size = 32
# Training runs for 10 epochs with batches of 32 images.
optimizer = optim.Adam(model.parameters(), lr=lr)  # Adam adapts the step size of each parameter using running averages of the gradient and of its square
criterion = nn.BCELoss(reduction='sum')  # BCE stands for binary cross entropy; here it is summed over all pixels of the batch
# a list to save all the reconstructed images in PyTorch grid format
grid_images = []
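
The choice of `reduction='sum'` matters because the KL term in `final_loss` is also summed over the batch and the latent dimensions, so a summed reconstruction loss keeps both terms on the same scale. The sketch below, using random stand-in tensors instead of real images, shows how the `'sum'` and `'mean'` reductions relate.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
target = torch.rand(2, 1, 32, 32)                  # stand-in "images" with values in [0, 1]
recon = torch.sigmoid(torch.randn(2, 1, 32, 32))   # sigmoid keeps the outputs in (0, 1)

bce_sum = nn.BCELoss(reduction='sum')(recon, target)
bce_mean = nn.BCELoss(reduction='mean')(recon, target)

# 'sum' adds up every pixel of every image; 'mean' divides that by the element count
print(torch.allclose(bce_sum, bce_mean * recon.numel(), rtol=1e-4))
```

With `'mean'`, the reconstruction term would be smaller by a factor of 2048 in this example, and the KL term would dominate the loss unless it were rescaled as well.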




The data is transformed and loaded in the following. 

In [11]:
# --- Data Transformation: 
transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
])
# The original MNIST images have a size of 28x28 pixels. Above they are resized to 32x32 and converted to tensors with values in [0, 1].

# --- training set and train data loader
trainset = torchvision.datasets.MNIST(
    root='../input', train=True, download=True, transform=transform
)
# The data-set is loaded via torchvision.datasets and downloaded if it is not already present.
trainloader = DataLoader(
    trainset, batch_size=batch_size, shuffle=True
)
# The DataLoader does the actual loading. It groups the data into batches of batch_size images and reshuffles the training data every epoch.

# --- validation set and validation data loader
testset = torchvision.datasets.MNIST(
    root='../input', train=False, download=True, transform=transform
)
testloader = DataLoader(
    testset, batch_size=batch_size, shuffle=False
)

Thus, every batch contains 32 samples, and in every epoch the whole training set is run through the model once.
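
With `batch_size = 32`, the number of batches per epoch follows from the data-set sizes (60000 training and 10000 test images). The helper `batches_per_epoch` below is hypothetical. It mirrors how a `DataLoader` counts batches with and without `drop_last`.

```python
import math

def batches_per_epoch(num_samples, batch_size, drop_last=False):
    # Number of batches a DataLoader yields for these settings
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)

batch_size = 32
train_batches = batches_per_epoch(60000, batch_size)
test_batches = batches_per_epoch(10000, batch_size)
last_test_batch = 10000 - (test_batches - 1) * batch_size

print(train_batches)     # 1875
print(test_batches)      # 313
print(last_test_batch)   # 16 images in the final, smaller test batch
```

The test set does not divide evenly by 32, so its last batch is only half full. This is worth keeping in mind whenever code relies on the index of the last batch.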

In [12]:
print(f"The training portion of the data-set has a length of {len(trainset)}")
print(f"The test portion of the data-set has a length of {len(testset)}")

trainset

The training portion of the data-set has a length of 60000
The test portion of the data-set has a length of 10000


Dataset MNIST
    Number of datapoints: 60000
    Root location: ../input
    Split: Train
    StandardTransform
Transform: Compose(
               Resize(size=(32, 32), interpolation=bilinear, max_size=None, antialias=None)
               ToTensor()
           )

Again empty lists are initialized for the training loss and the validation loss to allow for later plotting. 

In [None]:
train_loss = []
valid_loss = []

Here the training loop is run. The following steps are repeated for every epoch: draw batches from the training set, compute the loss, backpropagate, update the parameters, and then evaluate on the test set.
The bread and butter of this are the train and validate functions. They require the model, the data loader and data-set to train or validate on,
the device (for me the GPU), and the criterion (binary cross entropy, as defined previously). The train function additionally takes the optimizer (Adam).


In [1]:
for epoch in range(epochs):
    print(f"Epoch {epoch+1} of {epochs}")
    train_epoch_loss = train(
        model, trainloader, trainset, device, optimizer, criterion
    )
    valid_epoch_loss, recon_images = validate(
        model, testloader, testset, device, criterion
    )
    # Next the values that result from this will be stored. 
    train_loss.append(train_epoch_loss)
    valid_loss.append(valid_epoch_loss)
    # save the reconstructed images from the validation loop
    save_reconstructed_images(recon_images, epoch+1)
    # convert the reconstructed images to PyTorch image grid format, which can then be saved!
    image_grid = make_grid(recon_images.detach().cpu())
    grid_images.append(image_grid)
    print(f"Train Loss: {train_epoch_loss:.4f}")
    print(f"Val Loss: {valid_epoch_loss:.4f}")  # printed as a float with 4 decimals

# save the reconstructions as a .gif file
image_to_vid(grid_images)
# save the loss plots to disk
save_loss_plot(train_loss, valid_loss)
print('TRAINING COMPLETE')
