# Lab 5 Part I: Image Autoencoder with CNNs


------------------------------------------------------
*Neural Networks. Bachelor in Data Science and Engineering*

*Pablo M. Olmos pamartin@ing.uc3m.es*

*Aurora Cobo Aguilera acobo@tsc.uc3m.es*

------------------------------------------------------

In this notebook, we'll build a convolutional autoencoder to compress the CIFAR10 dataset. The encoder portion will be made of convolutional and pooling layers and the decoder will be made of **transpose convolutional layers** that learn to "upsample" a compressed representation.

Note: a big part of the following material is a personal wrap-up of [Facebook's Deep Learning Course in Udacity](https://www.udacity.com/course/deep-learning-pytorch--ud188). So all credit goes to them!

In [None]:
from IPython.display import Image
from IPython.display import HTML

Image(url= "https://iq.opengenus.org/content/images/2019/03/autoencoder_1.png", width=800, height=200)

## Part I. Load CIFAR10. Visualize images

Let's copy the code that we've used in previous labs to load, normalize and visualize CIFAR10 images.

In [None]:
import torch
import numpy as np
from torchvision import datasets
import torchvision.transforms as transforms

import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [None]:
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)

trainloader = torch.utils.data.DataLoader(trainset, batch_size=256,
                                          shuffle=True, num_workers=2)

testset = datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)

testloader = torch.utils.data.DataLoader(testset, batch_size=256,
                                         shuffle=False, num_workers=2)

Let's visualize one image ...

In [None]:
traindata = iter(trainloader)

images, labels = next(traindata)

print(images[1].shape)

def rescale(img):
    img = img / 2 + 0.5     # unnormalize to plot
    npimg = img.numpy()
    return np.transpose(npimg, (1, 2, 0))

plt.imshow(rescale(images[0,:,:,:]))

> **Exercise**: Check the range of every pixel in a normalized image. This will help you to select the appropriate activation function in the autoencoder.
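
As a hint, the following minimal sketch works on a synthetic tensor rather than on CIFAR10. It assumes only that `ToTensor()` produces values in [0, 1] and that `Normalize` with mean 0.5 and standard deviation 0.5 computes `(x - 0.5) / 0.5` per channel. The names `x`, `y` and `x_back` are illustrative.

```python
import torch

# Synthetic "image" with values in [0, 1], the range produced by ToTensor()
x = torch.rand(3, 32, 32)
x[0, 0, 0] = 0.0  # force the extreme values to be present
x[0, 0, 1] = 1.0

# What Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)) computes per channel
y = (x - 0.5) / 0.5

print("normalized range:", y.min().item(), y.max().item())

# The rescale() helper above applies the inverse map before plotting
x_back = y / 2 + 0.5
print("round trip recovers the input:", torch.allclose(x_back, x))
```

You can run the same `min()` / `max()` check on a real mini-batch from `trainloader`.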

In [None]:
# YOUR CODE HERE



> **Exercise:** Create a validation set with 5000 images
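
One possible way to hold out part of a dataset is `torch.utils.data.random_split`. The sketch below is only a hint on a toy `TensorDataset` (100 random "images", 10 held out); the sizes, the seed and the variable names are assumptions for illustration, not the lab's reference solution.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Toy dataset standing in for CIFAR10: 100 random images with dummy labels
toy = TensorDataset(torch.randn(100, 3, 32, 32),
                    torch.zeros(100, dtype=torch.long))

# A seeded generator makes the split reproducible
generator = torch.Generator().manual_seed(0)
train_part, valid_part = random_split(toy, [90, 10], generator=generator)

# The validation data does not need shuffling
valid_loader = DataLoader(valid_part, batch_size=5, shuffle=False)

print(len(train_part), len(valid_part))
```

The two parts are disjoint subsets of the original dataset, so no validation image is seen during training.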

In [None]:
# YOUR CODE HERE



---
## Part II. Create the convolutional autoencoder

### Encoder
The encoder part of the network will be a typical convolutional pyramid. Each convolutional layer 
will be followed by a max-pooling layer to reduce the dimensions of the layers. 
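
To see how such a pyramid shrinks the spatial dimensions, here is a small shape check. The channel widths (8 and 4) and the layer names are arbitrary choices for illustration and are not the ones requested in the exercise below.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 3, 32, 32)  # a fake mini-batch of two CIFAR10-sized images

# 3x3 convolutions with padding=1 keep height and width unchanged
stage1 = nn.Conv2d(3, 8, kernel_size=3, stride=1, padding=1)
stage2 = nn.Conv2d(8, 4, kernel_size=3, stride=1, padding=1)

# Each 2x2 max-pool with stride 2 halves height and width
pool = nn.MaxPool2d(2, 2)

h = pool(torch.relu(stage1(x)))  # 32x32 -> 16x16
z = pool(torch.relu(stage2(h)))  # 16x16 -> 8x8

print(h.shape, z.shape)
```

Two conv + pool stages therefore take a 32x32 input down to 8x8, with the number of channels set by the last convolution.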

### Decoder

The decoder, though, might be new to you. It needs to convert a narrow representation into a wide, reconstructed image. For example, with MNIST the representation could be the 7x7x4 output of a max-pool layer. This is the output of the encoder, but also the input to the decoder. We want to get a 28x28x1 image out of the decoder, so we need to work our way back up from the compressed representation. A schematic of the network is shown below.

**Note:** For CIFAR10 we start with 32x32x3 images, and the encoder output will be of dimension 8x8xC, where C is the number of feature maps at the output. We will study the effect of C on the quality of the reconstructed images.

In [None]:
Image(url= "https://iq.opengenus.org/content/images/2019/03/autoencoder_3.png", width=400)


In the MNIST example, the encoder output has size 7x7x4 = 196. The original images have size 28x28 = 784, so the encoded vector is 25% of the size of the original image. These are just suggested sizes for each of the layers. Feel free to change the depths and sizes; in fact, you're encouraged to add layers to make this representation even smaller! Remember that our goal here is to find a small representation of the input data.
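
The same arithmetic can be repeated for CIFAR10, where the input has 32x32x3 = 3072 values and the latent representation has 8x8xC. The helper `latent_ratio` below is a hypothetical name introduced only for this sketch.

```python
def latent_ratio(height, width, channels, latent_h, latent_w, latent_c):
    """Fraction of the input size taken up by the latent representation."""
    return (latent_h * latent_w * latent_c) / (height * width * channels)

# MNIST example from the text: 7x7x4 latent vs. 28x28x1 input
mnist_ratio = latent_ratio(28, 28, 1, 7, 7, 4)
print("MNIST:", mnist_ratio)

# CIFAR10: 8x8xC latent vs. 32x32x3 input, for a few values of C
for C in (2, 4, 16, 32):
    size = 8 * 8 * C
    ratio = latent_ratio(32, 32, 3, 8, 8, C)
    print("C = {:2d}: latent size {:4d}, ratio {:.3f}".format(C, size, ratio))
```

Note that with C=16 the latent representation still holds one third as many values as the input image, so the compression is fairly mild.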

### Transpose Convolutions, Decoder

This decoder uses **transposed convolutional** layers to increase the width and height of the input layers. They work almost exactly the same as convolutional layers, but in reverse. A stride in the input layer results in a larger stride in the transposed convolution layer. For example, if you have a 3x3 kernel, a 3x3 patch in the input layer will be reduced to one unit in a convolutional layer. Comparatively, one unit in the input layer will be expanded to a 3x3 patch in a transposed convolution layer. PyTorch provides us with an easy way to create the layers, [`nn.ConvTranspose2d`](https://pytorch.org/docs/stable/nn.html#convtranspose2d). 
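
Both statements can be checked on toy tensors. In this sketch the layer names `up2` and `expand` and the tensor sizes are illustrative assumptions; only `nn.ConvTranspose2d` itself comes from the text.

```python
import torch
import torch.nn as nn

# kernel_size=2 with stride=2 doubles height and width
x = torch.randn(1, 4, 8, 8)
up2 = nn.ConvTranspose2d(4, 16, kernel_size=2, stride=2)
y = up2(x)
print(y.shape)  # 8x8 -> 16x16

# A single input unit is expanded to a 3x3 patch by a 3x3 kernel
single = torch.ones(1, 1, 1, 1)
expand = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=1, bias=False)
with torch.no_grad():
    patch = expand(single)
print(patch.shape)

# With an input value of 1 and no bias, the patch is just the kernel itself
print(torch.allclose(patch, expand.weight))
```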

It is important to note that transpose convolution layers can lead to artifacts in the final images, such as checkerboard patterns. This is due to overlap in the kernels which can be avoided by setting the stride and kernel size equal. In [this Distill article](http://distill.pub/2016/deconv-checkerboard/) from Augustus Odena, *et al*, the authors show that these checkerboard artifacts can be avoided by resizing the layers using nearest neighbor or bilinear interpolation (upsampling) followed by a convolutional layer.  For simplicity, **we will put a convolution layer after the transpose convolutional layers to remove the artifacts.**
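
For comparison, the resize-then-convolve alternative described in the Distill article could look roughly like the sketch below. This is not the approach used in this lab; the block name `upsample_block` and the channel counts are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Nearest-neighbor upsampling followed by a regular convolution
upsample_block = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="nearest"),  # 8x8 -> 16x16, no learned weights
    nn.Conv2d(4, 16, kernel_size=3, padding=1),   # learned filter, keeps 16x16
)

x = torch.randn(1, 4, 8, 8)
y = upsample_block(x)
print(y.shape)
```

The output shape matches that of a transposed convolution with `kernel_size=2, stride=2`, so such a block could in principle be swapped in for one.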

> **Exercise:** Complete the following code, in which we build the autoencoder using a series of convolutional layers, pooling layers, and transpose convolutional layers. When building the decoder, recall that transpose convolutional layers can upsample an input by a factor of 2 using a stride and kernel_size of 2. 

In [None]:
import torch.nn as nn
from torch import optim
import time

# define the NN architecture
class ConvAutoencoder(nn.Module):
    def __init__(self,C=4):
        super().__init__()

        #C is the number of feature maps of the encoder's output

        self.n_channel_latent = C

        print("The dimension of the latent representation is {0:d}".format((32 // 4) ** 2 * self.n_channel_latent))
        
        ## Encoder layers ##
        # conv layer (depth from 3 --> 16), 3x3 kernels
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=16, 
                               kernel_size=3, stride=1, padding=1)
        
        # conv layer (depth from 16 --> C), 3x3 kernels
        # YOUR CODE HERE
        self.conv2 = # YOUR CODE HERE
        
        # Max pool layer
        self.pool = nn.MaxPool2d(2, 2)
        
        ## decoder layers ##
        ## a kernel of 2 and a stride of 2 will increase the spatial dims by a factor of 2

        # trans conv layer (depth C --> 16). We increase the spatial dims by a factor of 2
        self.t_conv1 = nn.ConvTranspose2d(self.n_channel_latent, 16, kernel_size=2, stride=2)

        # trans conv layer (depth 16 --> 16). We increase the spatial dims by a factor of 2
        self.t_conv2 = # YOUR CODE HERE

        # conv layer (depth 16 --> 3) no spatial reduction!!
        self.conv3 = nn.Conv2d(in_channels=16, out_channels=3, 
                               kernel_size=1, stride=1, padding=0)        

        ## BN Layers

        # One BN layer after each of the first two convolutional layers

        self.BN_1 = # YOUR CODE HERE

        self.BN_2 = # YOUR CODE HERE

        # One BN layer after each of the two transpose convolutional layers

        self.BN_3 = # YOUR CODE HERE

        self.BN_4 = # YOUR CODE HERE

        # We use ReLU activation for all hidden layers
        self.relu    = nn.ReLU()
        
        # And the appropriate activation at the decoder's output
        self.decoder_activation = # YOUR CODE HERE
        

    def forward(self, x):
        


        # YOUR CODE HERE (many lines)        
        

        # x is input batch of images
        # latent is the encoder output        
        
        return x,latent

In [None]:
# initialize the NN with C=16 feature maps at the encoder's output
model = ConvAutoencoder(C=16)
print(model)

---
## Part III. Training

> **Exercise:** Complete the class below to include a training method which monitors the reconstruction loss in both the training and validation datasets. In [this blog](https://medium.com/udacity-pytorch-challengers/a-brief-overview-of-loss-functions-in-pytorch-c0ddb78068f7) you can find a good summary of PyTorch loss functions. For the problem at hand, decide which one is more convenient. 
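
To get a feeling for how PyTorch loss objects behave on image-like tensors, the sketch below evaluates two common regression losses on a tiny hand-made example. The tensors are made up, and the sketch does not settle which criterion to use in the exercise.

```python
import torch
import torch.nn as nn

target = torch.tensor([[0.0, 0.5], [-1.0, 1.0]])  # "original" pixels
recon = torch.tensor([[0.0, 0.0], [-1.0, 0.0]])   # "reconstructed" pixels

# Mean squared error: (0 + 0.25 + 0 + 1) / 4
mse = nn.MSELoss()(recon, target)

# Mean absolute error: (0 + 0.5 + 0 + 1) / 4
mae = nn.L1Loss()(recon, target)

print(mse.item(), mae.item())
```

Both criteria average over all elements by default, so the value does not depend on the batch size.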


In [None]:
class ConvAutoencoder_extended(ConvAutoencoder):
    
    def __init__(self, epochs=10, lr=0.01,C=4):
        
        super().__init__(C)
        self.lr = lr    
        self.optim = optim.Adam(self.parameters(), self.lr)   
        self.epochs = epochs

        self.criterion = # YOUR CODE HERE

        self.loss_during_training = []
        self.valid_loss_during_training = []

        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.to(self.device)
        
    def trainloop(self,trainloader,validloader):
        
        # Optimization Loop
        
        for e in range(int(self.epochs)):
            
            running_loss = 0.
            start_time = time.time()
            for images, labels in trainloader:
        
                labels = images # To train an Autoencoder the label is the input

                # Move input and label tensors to the default device 
                images, labels = images.to(self.device), labels.to(self.device)

                self.optim.zero_grad()
            
                out = self.forward(images)[0]

                loss =  self.criterion(out, labels)
                
                running_loss += loss.item()

                loss.backward()
                
                self.optim.step()
                   
            self.loss_during_training.append(running_loss/len(trainloader))
            self.valid_loss_during_training.append(self.eval_performance(validloader))
                

            if(e % 1 == 0): # Every epoch

                print("Epoch %d. Training loss: %f, Validation loss: %f, Time per epoch: %f seconds" 
                      %(e,self.loss_during_training[-1],self.valid_loss_during_training[-1],
                       (time.time() - start_time)))

    def eval_performance(self,dataloader):

        # YOUR CODE HERE (many lines)
                

        

> **Exercise:** Train the model for 5 epochs with 16 feature maps at the output of the encoder (C=16), and plot both the training and validation losses. Do you observe overfitting?

In [None]:
# Your code here


In [None]:
# Your code here


### Checking out the results

Below we plot some of the **test images** along with their reconstructions. These look a little rough around the edges, likely due to the artifacts that tend to happen with transpose layers.

> **Exercise:** Complete the code (just one line!)

In [None]:
# obtain one batch of test images
dataiter = iter(testloader)
images, labels = next(dataiter)

batch_size=256

images_cuda, labels_cuda = images.to(autoencoder.device), labels.to(autoencoder.device)

# reconstruction and latent representation for the minibatch
output,latent = # YOUR CODE HERE

# prep images for display


# output is resized into a batch of images
output = output.view(batch_size, 3, 32, 32)
# use detach when it's an output that requires_grad. We use .cpu() to move the result back to cpu from gpu
output = output.cpu().detach()

# plot the first ten input images and then reconstructed images
fig, axes = plt.subplots(nrows=2, ncols=10, sharex=True, sharey=True, figsize=(25,4))

# input images on top row, reconstructions on bottom
for i in range(10):
    axes[0,i].imshow(rescale(images[i,:,:,:]))
    axes[1,i].imshow(rescale(output[i,:,:,:]))

    axes[0,i].get_xaxis().set_visible(False) # Hide the axis ticks and labels
    axes[0,i].get_yaxis().set_visible(False)

    axes[1,i].get_xaxis().set_visible(False) # Hide the axis ticks and labels
    axes[1,i].get_yaxis().set_visible(False)
    

## Part IV. Visualizing the effect of the encoder size

To analyze the effect of C, the number of feature maps at the encoder's output, plot the validation loss obtained after training our autoencoder for 5 epochs, for different values of C between 2 and 32.

In [None]:
# Your code here

## Part V. Visualize the encoder's output in 2D using t-SNE

[t-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) is a tool to visualize high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE has a cost function that is not convex, i.e. with different initializations we can get different results.
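
The dependence on the initialization is easy to observe on toy data. The sketch below uses random Gaussian points instead of latent codes; the sizes, the perplexity and the seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))  # 60 toy points in 20 dimensions

# Same data, same settings, two different random initializations
emb_a = TSNE(n_components=2, perplexity=10, init="random",
             random_state=0).fit_transform(X)
emb_b = TSNE(n_components=2, perplexity=10, init="random",
             random_state=1).fit_transform(X)

print(emb_a.shape, emb_b.shape)
print("identical embeddings:", np.allclose(emb_a, emb_b))
```

Fixing `random_state` is a simple way to make a t-SNE plot reproducible between runs.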

It is highly recommended to use another dimensionality reduction method (e.g. PCA for dense data or TruncatedSVD for sparse data) to reduce the number of dimensions to a reasonable amount (e.g. 100 or 200 variables) if the number of features is very high. At the end, this is what we used the autoencoder for!
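
As a minimal sketch of that two-step recipe, the code below first reduces toy 500-dimensional data to 50 dimensions with PCA and then runs t-SNE on the result. The data and all sizes are made up for illustration; in this lab the autoencoder plays the role of the first step.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 500))  # 80 toy points with many features

# Step 1: linear reduction to a moderate number of dimensions
X_reduced = PCA(n_components=50, random_state=0).fit_transform(X)

# Step 2: non-linear embedding in 2D for visualization
X_2d = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(X_reduced)

print(X_reduced.shape, X_2d.shape)
```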

> **Exercise:** For C=4, compute the latent representation of a mini-batch of CIFAR10 images. Don't forget to move the latent representation to the CPU (in case a GPU was used).

In [None]:
# YOUR CODE HERE

We run t-SNE using the scikit-learn implementation.

In [None]:

from sklearn.manifold import TSNE

# Flatten the latent representation of each image into a single vector
latent = latent_cpu.reshape(batch_size,-1)
images_np = np.squeeze(images)

latent_tsne = TSNE(n_components=2).fit_transform(latent)


In [None]:
# With this code we can visualize images in a 2D scatter plot

from matplotlib.offsetbox import OffsetImage, AnnotationBbox
 
def plot_latent_space_with_images(images,latent,xmin=-1,xmax=1,ymin=-1,ymax=1):
 
    # images --> Minibatch of images (torch tensor on the CPU, since rescale() calls .numpy())
    # latent --> Matrix of 2D representations (numpy array!)
 
    f, ax = plt.subplots(1,1,figsize=(8, 8))
    # ax is the Axes object we draw on
    ax.clear()
    for i in range(len(images)):
        im = OffsetImage(rescale(images[i,:,:,:]))
        ab = AnnotationBbox(im, latent[i,:],frameon=False)
        ax.add_artist(ab)
    # Set the axis limits from the arguments; choose them to cover the range of the latent projections
    ax.set_xlim(xmin,xmax)
    ax.set_ylim(ymin,ymax)
    ax.set_title('Latent space Z with Images')

In [None]:
plot_latent_space_with_images(images, latent_tsne,-10,10,-10,10)

Do the neighbours in the 2D space make any sense?