# Variational AutoEncoder

![](https://cdn-images-1.medium.com/max/2600/1*22cSCfmktNIwH5m__u2ffA.png)

A **VAE** can generate new data similar to the data it has seen during training. But before diving into VAEs, let's understand how a simple **AutoEncoder** (AE) works. <br><br>
An AutoEncoder consists of two parts: an **encoder** network and a **decoder** network.<br>
![](https://blog.keras.io/img/ae/autoencoder_schema.jpg)
The **encoder** takes your data as input and produces a continuous representation (aka the **latent variable**) of the given samples. This representation should have a smaller dimension than the data, so that the AE has to keep only the most important information and throw away the rest, much like PCA, except that the mapping is non-linear. <br><br>
The **decoder** takes this representation as input and tries to reconstruct the original data. <br><br>
The loss of this network is a distance between the original input and the reconstructed output. For images it is usually the pixel-wise **Binary CrossEntropy Loss**.
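
To make the encoder/decoder/loss picture concrete, here is a minimal sketch of a plain autoencoder. The class name `TinyAutoEncoder`, the layer sizes and the fake 28x28 input are assumptions for illustration only; they are not the model used later in this notebook.

```python
import torch
from torch import nn
import torch.nn.functional as F

class TinyAutoEncoder(nn.Module):
    """Hypothetical fully connected autoencoder for flattened 28x28 images."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: compress the input into a small latent vector
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder: map the latent vector back to pixel space, values in [0, 1]
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

torch.manual_seed(0)
model = TinyAutoEncoder()
x = torch.rand(16, 784)                  # fake batch, pixel values in [0, 1]
recon, z = model(x)
loss = F.binary_cross_entropy(recon, x)  # pixel-wise BCE, averaged over all pixels
print(z.shape, recon.shape, loss.item())
```

The latent vector has 32 values against 784 input values, which is the bottleneck that forces the network to compress.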

## Variational AE

A VAE makes a small change to the latent representation.
The encoder now produces two vectors: a vector of means and a vector of standard deviations.
![](https://cdn-images-1.medium.com/max/1600/1*CiVcrrPmpcB1YGMkTF7hzA.png)
Using these mean and std vectors we can sample any number of latent vectors and pass them through the decoder.
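
As a rough sketch of this idea, the encoder below has two output heads, one for the means and one for the standard deviations. `GaussianEncoder`, its layer sizes and the number of samples are made up for the example. Predicting the log-variance and exponentiating it is a common way to keep the std positive, and it is also what the model later in this notebook does.

```python
import torch
from torch import nn

class GaussianEncoder(nn.Module):
    """Hypothetical encoder that outputs a mean and a std vector per sample."""
    def __init__(self, input_dim=784, latent_dim=8):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, 64), nn.ReLU())
        self.mu_head = nn.Linear(64, latent_dim)
        self.logvar_head = nn.Linear(64, latent_dim)

    def forward(self, x):
        h = self.hidden(x)
        mu = self.mu_head(h)
        # Predict log-variance so that std = exp(0.5 * logvar) is always positive
        std = torch.exp(0.5 * self.logvar_head(h))
        return mu, std

torch.manual_seed(0)
encoder = GaussianEncoder()
mu, std = encoder(torch.rand(4, 784))

# Draw 5 latent samples per input from N(mu, std^2)
n_samples = 5
z = mu + std * torch.randn(n_samples, *mu.shape)
print(z.shape)  # torch.Size([5, 4, 8])
```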

## Kullback–Leibler divergence
This model needs a new loss term. KL divergence measures how different two distributions are. The exact formula: 
![](https://cdn-images-1.medium.com/max/1600/0*opyFpDwDt0H8rfCv) <br>
P and Q are the **Probability Density Functions** (PDFs) of the two distributions. Since log(1) = 0, the KL divergence is 0 when Q and P are equal. Note that it is not symmetric, so it is not a true distance.<br><br>
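
A small numeric illustration may help here. It uses discrete distributions (a sum instead of an integral) and made-up probabilities `p` and `q`, so treat it as a sketch of the behaviour rather than the exact formula from the image:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions given as probability arrays."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = [0.1, 0.4, 0.5]
q = [0.3, 0.3, 0.4]

print(kl_divergence(p, p))  # 0.0, identical distributions
print(kl_divergence(p, q))  # positive
print(kl_divergence(q, p))  # also positive, but a different value
```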
For VAEs it is the divergence between the distribution that the encoder predicts for the latent variable and the standard normal distribution. This pushes the latent space towards a normal distribution.<br> The derived formula for VAEs is:
![](https://cdn-images-1.medium.com/max/1200/1*uEAxCmyVKxzZOJG6afkCCg.png)
where μ is the mean and σ is the standard deviation predicted by the encoder. <br><br>
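
The sketch below checks the closed form numerically against `torch.distributions`. It assumes the formula in the image is the usual one for a diagonal Gaussian against N(0, 1), written in terms of the log-variance (the same expression the `loss_function` further down uses); the tensor shapes are arbitrary.

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
mu = torch.randn(4, 8)        # batch of 4 mean vectors
logvar = torch.randn(4, 8)    # matching log-variances, log(sigma^2)

# Closed form: -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2) over latent dims
kl_closed = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)

# Reference value computed by torch.distributions
q = Normal(mu, torch.exp(0.5 * logvar))
p = Normal(torch.zeros_like(mu), torch.ones_like(mu))
kl_ref = kl_divergence(q, p).sum(dim=1)

print(torch.allclose(kl_closed, kl_ref, atol=1e-4))
```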
If we used only the KL loss, the encodings would be scattered randomly near the center of the latent space, regardless of the input. The decoder would not be able to reconstruct anything from this noise. Our goal is to place different classes in different clusters. <br><br>
The **reconstruction loss** helps us separate the classes. It is a distance between the original data and the generated data. For images, the reconstruction loss is typically pixel-wise BCE or MSE. <br><br>
The **final loss** is the sum of these two terms. <br>
`Loss = KL + ReconstructLoss`
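
In code, the combined loss could look roughly like this. `vae_loss` is a hypothetical helper that sums both terms over the batch; the notebook's own `loss_function` below averages the BCE per pixel and scales the KL term accordingly, so the absolute numbers differ.

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_x, x, mu, logvar):
    """Reconstruction term plus KL term, both summed over the batch."""
    recon = F.binary_cross_entropy(recon_x, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl, recon, kl

torch.manual_seed(0)
x = torch.rand(4, 784)                         # targets in [0, 1]
recon_x = torch.sigmoid(torch.randn(4, 784))   # stand-in for a decoder output
mu, logvar = torch.zeros(4, 8), torch.zeros(4, 8)

total, recon, kl = vae_loss(recon_x, x, mu, logvar)
# With mu = 0 and sigma = 1 the KL term vanishes, so only reconstruction remains
print(total.item(), recon.item(), kl.item())
```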

## Reparametrization trick
You have probably wondered: how do gradients from the reconstruction loss reach the encoder through the sampling step? Sampling is not differentiable.<br> This is solved by a method called `reparametrization`.<br>
The key insight is that `N(μ, σ) == N(0, 1) * σ + μ`: a sample from `N(μ, σ)` can be written as `ε * σ + μ`, where `ε` is drawn from `N(0, 1)`.<br>
The encoder predicts the means and sigmas, and they are combined with standard normal noise. The randomness now lives only in `ε`, so gradients can flow to the encoder through `μ` and `σ`.
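
A tiny sketch shows that gradients really do reach `μ` and `σ` this way. The tensors here are stand-ins for encoder outputs, and the log-variance parametrization is an assumption that matches the model below:

```python
import torch

torch.manual_seed(0)
mu = torch.zeros(3, requires_grad=True)
logvar = torch.zeros(3, requires_grad=True)

std = torch.exp(0.5 * logvar)
eps = torch.randn_like(std)   # noise from N(0, 1), carries no gradient
z = mu + eps * std            # differentiable in mu and logvar

z.sum().backward()
print(mu.grad)       # tensor([1., 1., 1.]) because dz/dmu = 1
print(logvar.grad)   # equals 0.5 * eps * std
```

If `z` were drawn directly with a sampling call on `N(μ, σ)`, there would be no such path from `z` back to `mu` and `logvar`.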

## Why should we use a VAE instead of an AE?
A VAE is more complicated and requires some math, so what is the point? <br><br>
As I said above, a VAE makes a strong assumption about the distribution of the latent variable. This helps us because the vectors of samples from the same class lie close together in a continuous region. That gives better clustering and interpolation, and makes generating new data easier. If the latent space had gaps between clusters, sampling from those gaps would produce bad results.

# Time to plunge into the code

In [None]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt

import torch
from torch import nn, optim
import torch.nn.functional as F
from torchvision import datasets, transforms
from torchvision.utils import save_image
from torch.utils.data import Dataset, DataLoader
from torch.autograd import Variable

from PIL import Image

from tqdm import tqdm_notebook as tqdm

In [None]:
batch_size = 32

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Custom dataset

In [None]:
class DogDataset(Dataset):
    def __init__(self, img_dir, transform1=None, transform2=None):
    
        self.img_dir = img_dir
        self.img_names = os.listdir(img_dir)
        self.transform1 = transform1
        self.transform2 = transform2
        
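        # Load every image once and apply the deterministic preprocessing up front
        self.imgs = []
        for img_name in self.img_names:
            with Image.open(os.path.join(img_dir, img_name)) as img:
                # Force 3 channels so that Normalize with 3 means/stds always works
                img = img.convert('RGB')
            
            if self.transform1 is not None:
                img = self.transform1(img)
                
            self.imgs.append(img)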

    def __getitem__(self, index):
        img = self.imgs[index]
        
        if self.transform2 is not None:
            img = self.transform2(img)
        
        return img

    def __len__(self):
        return len(self.imgs)

In [None]:
# First preprocessing of data
transform1 = transforms.Compose([transforms.Resize(64),
                                transforms.CenterCrop(64)])

# Data augmentation and converting to tensors
random_transforms = [transforms.RandomRotation(degrees=10)]
transform2 = transforms.Compose([transforms.RandomHorizontalFlip(p=0.5),
                                 transforms.RandomApply(random_transforms, p=0.3), 
                                 transforms.ToTensor(),
                                 transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
                                 
train_dataset = DogDataset(img_dir='../input/all-dogs/all-dogs/',
                           transform1=transform1,
                           transform2=transform2)

train_loader = DataLoader(dataset=train_dataset,
                          batch_size=batch_size,
                          shuffle=True,
                          num_workers=4)

### Examples of data

In [None]:
x = next(iter(train_loader))

fig = plt.figure(figsize=(25, 16))
for ii, img in enumerate(x):
    ax = fig.add_subplot(4, 8, ii + 1, xticks=[], yticks=[])
    
    img = img.numpy().transpose(1, 2, 0)
    plt.imshow((img+1.)/2.)

## VAE Model
The code below is based on https://github.com/atinghosh/VAE-pytorch

In [None]:
class VAE(nn.Module):
    def __init__(self, latent_dim=128, no_of_sample=10, batch_size=32, channels=3):
        super(VAE, self).__init__()
        
        self.no_of_sample = no_of_sample
        self.batch_size = batch_size
        self.channels = channels
        self.latent_dim = latent_dim
        
        
        # Encoder
        def convlayer_enc(n_input, n_output, k_size=4, stride=2, padding=1, bn=False):
            block = [nn.Conv2d(n_input, n_output, kernel_size=k_size, stride=stride, padding=padding, bias=False)]
            if bn:
                block.append(nn.BatchNorm2d(n_output))
            block.append(nn.LeakyReLU(0.2, inplace=True))
            return block
        
        self.encoder = nn.Sequential(
            *convlayer_enc(self.channels, 64, 4, 2, 2),               # (64, 33, 33)
            *convlayer_enc(64, 128, 4, 2, 2),                         # (128, 17, 17)
            *convlayer_enc(128, 256, 4, 2, 2, bn=True),               # (256, 9, 9)
            *convlayer_enc(256, 512, 4, 2, 2, bn=True),               # (512, 5, 5)
            nn.Conv2d(512, self.latent_dim*2, 4, 1, 1, bias=False),   # (latent_dim*2, 4, 4)
            nn.LeakyReLU(0.2, inplace=True)
        )
        
        
        # Decoder
        def convlayer_dec(n_input, n_output, k_size=4, stride=2, padding=0):
            block = [
                nn.ConvTranspose2d(n_input, n_output, kernel_size=k_size, stride=stride, padding=padding, bias=False),
                nn.BatchNorm2d(n_output),
                nn.ReLU(inplace=True),
            ]
            return block
        
        self.decoder = nn.Sequential(
            *convlayer_dec(self.latent_dim, 512, 4, 2, 1),           # (512, 8, 8)
            *convlayer_dec(512, 256, 4, 2, 1),                       # (256, 16, 16)
            *convlayer_dec(256, 128, 4, 2, 1),                       # (128, 32, 32)
            *convlayer_dec(128, 64, 4, 2, 1),                        # (64, 64, 64)
            nn.ConvTranspose2d(64, self.channels, 3, 1, 1),          # (3, 64, 64)
            nn.Sigmoid()
        )

    def encode(self, x):
        '''return mu_z and logvar_z'''
        x = self.encoder(x)
        return x[:, :self.latent_dim, :, :], x[:, self.latent_dim:, :, :]
    
    def decode(self, z):
        z = self.decoder(z)
        return z.view(-1, 3 * 64 * 64)

    def reparameterize(self, mu, logvar):
        if self.training:
            # std = exp(0.5 * logvar), since logvar = log(sigma^2)
            std = torch.exp(0.5 * logvar)

            # Draw several latent samples per input: z = mu + eps * std, eps ~ N(0, 1)
            sample_z = []
            for _ in range(self.no_of_sample):
                eps = torch.randn_like(std)
                sample_z.append(mu + eps * std)
            return sample_z
        
        else:
            # At evaluation time use the mean as a deterministic encoding
            return mu

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        
        if self.training:
            # z is a list of sampled latent codes, decode each one
            return [self.decode(z_i) for z_i in z], mu, logvar
        else:
            return self.decode(z), mu, logvar

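    def loss_function(self, recon_x, x, mu, logvar):
        # Inputs are normalized to [-1, 1], while the decoder ends in a Sigmoid
        # and BCE expects targets in [0, 1], so map the targets back to [0, 1]
        target = (x.view(-1, 3 * 64 * 64) + 1) / 2

        if self.training:
            # Average the reconstruction loss over the sampled latent codes
            BCE = 0
            for recon_x_one in recon_x:
                BCE += F.binary_cross_entropy(recon_x_one, target)
            BCE /= len(recon_x)
        else:
            BCE = F.binary_cross_entropy(recon_x, target)

        # KL divergence between N(mu, sigma^2) and N(0, 1), scaled to match
        # the per-pixel mean used by the BCE term
        KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        KLD /= x.size(0) * 3 * 64 * 64

        return BCE + KLD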

Input dim is `64 x 64 x 3 = 12288`<br>
Latent dim is `4 x 4 x 32 = 512` (with `latent_dim = 32`, set in the next cell)<br>
The bottleneck is 24 times smaller than the input image, so the autoencoder has to keep only the most important information.
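
These numbers can be checked with the standard output-size formulas for `nn.Conv2d` and `nn.ConvTranspose2d` (dilation 1, no `output_padding`). The helpers `conv_out` and `deconv_out` are written just for this check; the kernel, stride and padding values are read off the model definition above.

```python
def conv_out(size, kernel, stride, padding):
    """Output side length of nn.Conv2d (dilation 1) for a square input."""
    return (size + 2 * padding - kernel) // stride + 1

def deconv_out(size, kernel, stride, padding):
    """Output side length of nn.ConvTranspose2d (dilation 1, no output_padding)."""
    return (size - 1) * stride - 2 * padding + kernel

# Encoder: four convs (kernel 4, stride 2, padding 2), then one (kernel 4, stride 1, padding 1)
size = 64
enc_sizes = []
for kernel, stride, padding in [(4, 2, 2)] * 4 + [(4, 1, 1)]:
    size = conv_out(size, kernel, stride, padding)
    enc_sizes.append(size)
print(enc_sizes)  # [33, 17, 9, 5, 4]

# Decoder: four transposed convs (kernel 4, stride 2, padding 1), then a 3x3 stride-1 layer
dec_sizes = []
for kernel, stride, padding in [(4, 2, 1)] * 4 + [(3, 1, 1)]:
    size = deconv_out(size, kernel, stride, padding)
    dec_sizes.append(size)
print(dec_sizes)  # [8, 16, 32, 64, 64]

latent_dim = 32
input_dim = 64 * 64 * 3
bottleneck = 4 * 4 * latent_dim
print(input_dim, bottleneck, input_dim // bottleneck)  # 12288 512 24
```

With padding 2 the intermediate encoder maps are 33, 17, 9 and 5 pixels wide, and the last convolution brings them to the 4 x 4 grid that the decoder expects.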

In [None]:
lr = 0.001
epochs = 50
latent_dim = 32

model = VAE(latent_dim, batch_size=batch_size).to(device)
optimizer = optim.Adam(model.parameters(), lr=lr)

### Image for validation

In [None]:
plt.imshow((x[0].numpy().transpose(1, 2, 0)+1)/2)
plt.show()

## Train loop

In [None]:
for epoch in range(1, epochs+1):
    model.train()
    print(f'Epoch {epoch} start')
    
    for batch_idx, data in enumerate(train_loader):
        data = data.to(device)
        optimizer.zero_grad()

        recon_batch, mu, logvar = model(data)
        loss = model.loss_function(recon_batch, data, mu, logvar)

        loss.backward()
        optimizer.step()
        
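    # Reconstruct the validation image to monitor progress
    model.eval()
    with torch.no_grad():
        recon_img, _, _ = model(x[:1].to(device))
    img = recon_img.view(3, 64, 64).cpu().numpy().transpose(1, 2, 0)
    
    # The decoder ends in a Sigmoid, so pixels are already in [0, 1]
    plt.imshow(img)
    plt.show()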

## Check how well the VAE reconstructs images

In [None]:
model.eval()
with torch.no_grad():
    reconstructed, mu, _ = model(x.to(device))
reconstructed = reconstructed.view(-1, 3, 64, 64).cpu().numpy().transpose(0, 2, 3, 1)

fig = plt.figure(figsize=(25, 16))
for ii, img in enumerate(reconstructed):
    ax = fig.add_subplot(4, 8, ii + 1, xticks=[], yticks=[])
    # The decoder ends in a Sigmoid, so pixels are already in [0, 1]
    ax.imshow(img)

## Walk in latent space from one dog to another

In [None]:
first_dog_idx = 0
second_dog_idx = 1

with torch.no_grad():
    start, end = mu[first_dog_idx], mu[second_dog_idx]
    # 32 evenly spaced steps from the first encoding (t=0) to the second (t=1)
    t = torch.linspace(0, 1, 32, device=start.device).view(-1, 1, 1, 1)
    walk = start + t * (end - start)
    walk = model.decoder(walk).cpu().numpy().transpose(0, 2, 3, 1)

fig = plt.figure(figsize=(25, 16))
for ii, img in enumerate(walk):
    ax = fig.add_subplot(4, 8, ii + 1, xticks=[], yticks=[])
    # Decoder output is already in [0, 1] because of the final Sigmoid
    ax.imshow(img)

## Generate random noise and run the decoder on it

In [None]:
# Sample latent codes from the standard normal prior and decode them
with torch.no_grad():
    samples = torch.randn(32, latent_dim, 4, 4, device=device)
    samples = model.decoder(samples).cpu().numpy().transpose(0, 2, 3, 1)

fig = plt.figure(figsize=(25, 16))
for ii, img in enumerate(samples):
    ax = fig.add_subplot(4, 8, ii + 1, xticks=[], yticks=[])
    # Decoder output is already in [0, 1] because of the final Sigmoid
    ax.imshow(img)

I was inspired by:
 - https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf
 - https://github.com/atinghosh/VAE-pytorch