
# **Exercise 10: Convolutional Neural Networks Part 2** AUTOENCODER

## **Exercise 09.1: Theory** 

### **1.1 Describe the basic building blocks of a CNN in your own words**



**1. Feature Extractor / Backbone:** Extracts (visual) features from the raw data necessary to solve the trained task. The feature extractor of a typical CNN consists of:

**1.1 Convolutional Layers:** Convolution layers are layers where filters are applied to the original image or to other feature maps. The most important parameters of the layer are: 

**Number of Filters (out_channels):** The number of applied filters. This number equals the number of feature maps (channels) in the output block.

**Stride:** Sets how far the filter moves in each step. For an image we can set the step size for the horizontal and vertical direction.

**Kernel Size:** The size of a convolution filter (for image width and height)

**Padding:** Controls the height and width of the convolution layer output. For images, each side of the input is extended with pixels filled with a predefined value (often zero).
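
How kernel size, stride and padding interact can be made concrete with the standard output-size formula for a convolution, $\lfloor (n + 2p - k)/s \rfloor + 1$ (dilation 1). The following is a minimal sketch; the helper name `conv_output_size` is an assumption made for illustration and is not part of any library.

```python
def conv_output_size(n: int, kernel_size: int, stride: int = 1, padding: int = 0) -> int:
    """Output length along one spatial axis: floor((n + 2p - k) / s) + 1 (dilation 1)."""
    return (n + 2 * padding - kernel_size) // stride + 1


# Three stacked convolutions with kernel 3, stride 2 and padding 1,
# applied to a 28-pixel-wide (MNIST) and a 32-pixel-wide (Cifar10) image.
mnist_sizes = [28]
cifar_sizes = [32]
for _ in range(3):
    mnist_sizes.append(conv_output_size(mnist_sizes[-1], kernel_size=3, stride=2, padding=1))
    cifar_sizes.append(conv_output_size(cifar_sizes[-1], kernel_size=3, stride=2, padding=1))

print(mnist_sizes)  # [28, 14, 7, 4]
print(cifar_sizes)  # [32, 16, 8, 4]
```

The number of filters does not appear in the formula: it only sets the number of output channels, not the spatial size.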

**1.2 Pooling Layers:**
Pooling layers are used to reduce the size of feature maps by pooling the features in the feature map. The most important parameters are kernel size, stride and padding. There are several types of pooling layers:

**Max Pooling:** Select the maximum value in the pooling kernel.

**Average Pooling:** Calculate the average value of all values in the pooling kernel.

**Global Max/Average Pooling:** Summarizes each entire feature map into a single value by selecting its maximum value or calculating its average value.
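
The pooling variants can be compared on a tiny feature map. This is a minimal sketch using the functional pooling calls from `torch.nn.functional`; the 4x4 input and the variable names are made up for illustration, and global pooling is expressed here through adaptive pooling with an output size of 1.

```python
import torch
import torch.nn.functional as F

# A single 4x4 feature map with values 0..15, shape [batch, channels, height, width].
x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

# 2x2 windows moved with stride 2 halve the height and width.
max_pooled = F.max_pool2d(x, kernel_size=2, stride=2)  # shape [1, 1, 2, 2]
avg_pooled = F.avg_pool2d(x, kernel_size=2, stride=2)  # shape [1, 1, 2, 2]

# Global pooling reduces every feature map to a single value.
global_max = F.adaptive_max_pool2d(x, output_size=1)   # shape [1, 1, 1, 1]
global_avg = F.adaptive_avg_pool2d(x, output_size=1)   # shape [1, 1, 1, 1]

print(max_pooled.squeeze().tolist())        # [[5.0, 7.0], [13.0, 15.0]]
print(avg_pooled.squeeze().tolist())        # [[2.5, 4.5], [10.5, 12.5]]
print(global_max.item(), global_avg.item())  # 15.0 7.5
```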

**2. Head:** The top of the network that uses the extracted features from the backbone and makes the final prediction for the given task. More complex models can have multiple heads that solve different tasks but share a backbone/feature extractor.






### **1.2 Describe two advantages of a CNN over an MLP for image-based classification**



1. More parameter-efficient: A small filter with shared weights is slid over the image instead of connecting every pixel to every neuron of a fully-connected layer.
2. It is easier for the CNN to use spatial information: The input image remains in its original form and is not converted into a flat, 1D vector.
3. Better visual interpretability: The learned filters can be interpreted visually using various techniques. (Examples of online tools and articles: https://tensorspace.org/html/playground/lenet.html, https://distill.pub/2017/feature-visualization/, https://distill.pub/2018/building-blocks/)
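
The parameter-efficiency argument can be checked with simple arithmetic. The sketch below counts the weights and biases of one convolutional layer and of one fully-connected layer reading the same Cifar10-sized input; the layer sizes are assumptions chosen for illustration, and the two layers do not produce outputs of the same size.

```python
# Both layers read a 3-channel 32x32 image (Cifar10-sized input).
in_channels, height, width = 3, 32, 32
out_channels = 32  # filters in the convolution / units in the fully-connected layer
kernel_size = 3

# Convolution: one k x k kernel per (input channel, filter) pair plus one bias per filter.
conv_params = in_channels * out_channels * kernel_size * kernel_size + out_channels

# Fully-connected: every input value is connected to every unit, plus one bias per unit.
dense_params = (in_channels * height * width) * out_channels + out_channels

print(conv_params)   # 896
print(dense_params)  # 98336
```

The convolution's parameter count is independent of the image size, while the fully-connected layer grows with every additional pixel.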



## **Exercise 09.2: Application**

### **2.1 Read the Cifar10 or MNIST Dataset and plot some examples (train/eval/test split)**

We use Pytorch to read the Cifar10 or MNIST dataset for our model and make sure that it is split into train/eval/test subsets. Neither Cifar10 nor MNIST has an official eval subset, so we set aside part of the training subset as our own eval subset. To check that the dataset is read correctly, we plot some examples together with their class using Matplotlib.

Tip: torchvision.datasets is a simple, high-level API for downloading and reading datasets.
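
The train/eval split can be tried out without downloading anything. This is a minimal sketch that applies `torch.utils.data.random_split` to a small stand-in `TensorDataset`; the toy data, its size and the variable names are assumptions made for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in for a real dataset: 100 blank grayscale 28x28 "images" with labels 0..9.
images = torch.zeros(100, 1, 28, 28)
labels = torch.arange(100) % 10
toy_dataset = TensorDataset(images, labels)

# A dedicated generator makes the split reproducible independently of the global seed.
generator = torch.Generator().manual_seed(42)
toy_train, toy_eval = random_split(toy_dataset, [80, 20], generator=generator)

# The two subsets do not share any sample index.
overlap = set(toy_train.indices) & set(toy_eval.indices)
print(len(toy_train), len(toy_eval), len(overlap))  # 80 20 0
```

Passing an explicit `generator` is one option; relying on `torch.manual_seed`, as the notebook does, works as well but ties the split to every other random call made before it.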

In [None]:
DATASET = "MNIST" # MNIST or CIFAR

In [None]:
import torch

from torchvision import datasets
# some handy functions to use along widgets
from IPython.display import display, Markdown, clear_output
# widget packages
import ipywidgets as widgets

# Transforms allow us to apply transformations on the data. 
# A simple example is the transform "transforms.ToTensor()" which allows us to return a dataset of tensors instead of PIL images.
# Other transformations include normalizations and augmentations, such as cropping parts of an image.
from torchvision import transforms

# For random processes in Torch, we can set an arbitrary number as the seed. This makes the program deterministic and the random processes behave exactly the same every time the program is started.
torch.manual_seed(42)

# All datasets are subclasses of torch.utils.data.Dataset and have a similar API. 
# With the first argument (root) we specify the location where the data is located or saved.
# If train=True the training set is loaded, if False the test set is loaded.
# Many datasets can be downloaded automatically when download=True. 
# The transform argument allows us to apply transformations to the data. These transformations can consist of several transformations (especially augmentations) in real applications
if DATASET == "MNIST":
  dataset = datasets.MNIST('../data', train=True, download=True, transform=transforms.ToTensor())
else:
  dataset = datasets.CIFAR10('../data', train=True, download=True, transform=transforms.ToTensor())

# Randomly split a dataset into non-overlapping new datasets of given lengths.
# We use this function to create a training and evaluation set from the official training data.
# MNIST: the training set will have 50000 examples, the evaluation set 10000.
# Cifar10: the training set will have 40000 examples, the evaluation set 10000.
if DATASET == "MNIST":
  train_dataset, eval_dataset = torch.utils.data.random_split(dataset, [50000, 10000])
else:
  train_dataset, eval_dataset = torch.utils.data.random_split(dataset, [40000, 10000])

# The test set has 10000 examples.
if DATASET == "MNIST":
  test_dataset = datasets.MNIST('../data', train=False, transform=transforms.ToTensor())
else:
  test_dataset = datasets.CIFAR10('../data', train=False, transform=transforms.ToTensor())

# Print information of the dataset.
print("Successfully read the dataset:")
print(f"The training data contains {len(train_dataset)} examples.")
print(f"The evaluation data contains {len(eval_dataset)} examples.")
print(f"The test data contains {len(test_dataset)} examples.")


Now we can use the dataset to plot some examples.

In [None]:
import matplotlib.pyplot as plt

# The label of each given sample (image) is simply a number. For visualization, we can create a map that maps this number to the class name.  
cifar10_labels_map = {
    0: "airplane",
    1: "automobile",
    2: "bird",
    3: "cat",
    4: "deer",
    5: "dog",
    6: "frog",
    7: "horse",
    8: "ship",
    9: "truck",
}

# Prepare the plot.
figure = plt.figure(figsize=(8, 8))

# We plot 9 images in a 3x3 grid.
cols, rows = 3, 3

# For each image we want to plot:
for i in range(1, cols * rows + 1):
    # We create an index of the dataset at random to randomly select a sample.
    sample_idx = torch.randint(len(train_dataset), size=(1,)).item()

    # With this index we select the sample and its label.
    img, label = train_dataset[sample_idx]

    # And we plot the image with the corresponding label name.
    figure.add_subplot(rows, cols, i)
    if DATASET != "MNIST":
      plt.title(cifar10_labels_map[label])
    plt.axis("off")

    # IMPORTANT: The sample is given in [c,h,w] shape ([image-channels, image-height, image-width], e.g., [3,32,32] for Cifar10). 
    # Pytorch uses this format for performance reasons on the hardware.
    # To plot the image we need to bring it into the [h,w,c] shape that Matplotlib expects.
    img = img.permute((1,2,0))
    plt.imshow(img.squeeze())

plt.show()

### **2.2 Build your own CNN with Pytorch**

In [None]:
import torch.nn as nn
from torchsummary import summary

import torch.nn.functional as F

# We define a CNN class that is a subclass of nn.Module.
# nn.Module is the base class for all neural network modules in Pytorch and provides basic functions.
class CNN(nn.Module):
    def __init__(self, isMnist=False, preEmbeddingChannels=8):
        super(CNN, self).__init__()

        in_channels = 1 if isMnist else 3

        # Convolutional Layer 1 with 32 filters and kernel 3
        self.conv1 = nn.Conv2d(in_channels, 32, kernel_size=3,stride=2,padding=1)
        # Convolutional Layer 2 with 64 filters and kernel 3
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3,stride=2,padding=1)
        # Convolutional Layer 3 with preEmbeddingChannels filters and kernel 3
        self.conv3 = nn.Conv2d(64, preEmbeddingChannels, kernel_size=3,stride=2,padding=1)

        self._size = None

        # TODO: define the missing layers        


    def encode(self,x):
      x = F.relu(self.conv1(x))
      x = F.relu(self.conv2(x))
      x = F.relu(self.conv3(x))

      if self._size is None:
          self._size = x.shape
        
      e = torch.flatten(x, 1) # flatten all dimensions except batch
      return e

    def decode(self,e):
        # this will give you the images with the same shape as after the last encoding conv layer
        x = e.view(-1,self._size[1],self._size[2],self._size[3])

        # TODO: perform the decoding using transpose convolutions and activations
        return x

        
    def forward(self, x):
        # encode the image
        e = self.encode(x)

        # decode back to image
        x = self.decode(e)        
        return x, e

### **2.2.2 Print the summary of the model**
This helps us to iteratively define the missing layers until we are back to our target image shapes:
 - $1 \times 28 \times 28$ for MNIST
 - $3 \times 32 \times 32$ for CIFAR
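
When choosing the missing decoder layers, it can help to compute the output size of a transposed convolution by hand first. The sketch below uses the output-size formula documented for `nn.ConvTranspose2d` with dilation 1, $(n - 1) \cdot s - 2p + k + \text{output\_padding}$. The helper name `conv_transpose_output_size` and the per-layer `output_padding` values are assumptions for illustration; they show one possible way back to the target shapes, not the only one.

```python
def conv_transpose_output_size(n, kernel_size, stride=1, padding=0, output_padding=0):
    """Output length of a transposed convolution along one axis (dilation 1)."""
    return (n - 1) * stride - 2 * padding + kernel_size + output_padding


# The encoder (kernel 3, stride 2, padding 1) ends at a 4x4 feature map for both datasets.
# Hypothetical decoder: three transposed convolutions with kernel 3, stride 2, padding 1.
mnist_sizes = [4]
for output_padding in (0, 1, 1):  # 4 -> 7 -> 14 -> 28
    mnist_sizes.append(conv_transpose_output_size(mnist_sizes[-1], 3, 2, 1, output_padding))

cifar_sizes = [4]
for output_padding in (1, 1, 1):  # 4 -> 8 -> 16 -> 32
    cifar_sizes.append(conv_transpose_output_size(cifar_sizes[-1], 3, 2, 1, output_padding))

print(mnist_sizes)  # [4, 7, 14, 28]
print(cifar_sizes)  # [4, 8, 16, 32]
```

The first MNIST layer needs `output_padding=0` because the encoder rounded 7 up to 4; with `output_padding=1` the decoder would reach 8 instead of 7.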

In [None]:
_model = CNN(isMnist=(DATASET=="MNIST")).cuda()

if DATASET == "MNIST":
  summary(_model, (1, 28, 28))
else:
  summary(_model, (3, 32, 32))

### **2.3 Train your model on Cifar10. Use Tensorboard to log your runs.**
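
An autoencoder is trained to reproduce its input, so the loss compares the reconstruction with the original image. As a minimal sketch, assuming the mean squared error used in this notebook, the toy tensors below (made up for illustration) show that `F.mse_loss` with its default reduction averages the squared error over all elements.

```python
import torch
import torch.nn.functional as F

# Toy 2x2 "image" and an imperfect reconstruction of it.
image = torch.tensor([[0.0, 0.5], [1.0, 0.0]])
reconstruction = torch.tensor([[0.0, 0.0], [1.0, 1.0]])

# The default reduction='mean' averages the squared error over all elements.
loss = F.mse_loss(reconstruction, image)
manual = ((reconstruction - image) ** 2).mean()

print(loss.item(), manual.item())  # 0.3125 0.3125
```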

In [None]:
import torch.nn.functional as F

def train(train_loader, model, optimizer, epoch, log_interval, device):
    # First, we put the model into training mode. 
    model.train()
    
    # We iterate over the entire data set. 
    # The variable data contains the images. 
    # The variable target contains the labels of the images (not needed for the autoencoder).
    # batch_idx is the index of the current batch in the epoch.
    for batch_idx, (data, target) in enumerate(train_loader):

        # We put the data onto the selected device.
        # Shapes: 
        # data: [b,c,h,w]
        # target: [b] - Single number of the label for each sample.
        # An autoencoder reconstructs its input, so the target is the data itself and the labels are discarded.
        data = data.to(device)
        target = data

        # Sets the gradients of all optimized torch.Tensors to zero.
        optimizer.zero_grad()

        # We call the forward function of our model to get the reconstruction and the embedding for our data.
        # output shape: [b,c,h,w] - The reconstructed images (once the decoder is completed).
        output, embedding = model(data)

        # We calculate the training loss.
        loss = F.mse_loss(output, target)

        # Computes the gradient of current tensor w.r.t. graph leaves (input).
        loss.backward()

        # All optimizers implement a step() method that updates the parameters.
        optimizer.step()

        # We print the results each log_interval steps.
        if batch_idx % log_interval == 0:
          step = batch_idx * len(data)
          epoch_progress = round((100. * batch_idx / len(train_loader)),2)
          print(f"Train Epoch: {epoch} [Step {step} from {len(train_loader.dataset)} ({epoch_progress}%)]\t Loss: {loss.item()}")
    # Return the loss of the last batch as a plain Python float for logging.
    return loss.item()

In [None]:
import torch.nn.functional as F

# This function validates the model on the given data.
def val(data_loader, model, training_epoch, device, mode="eval"):
    # First, we put the model into evaluation mode. 
    model.eval()
    
    # We define a variable to accumulate the loss.
    loss = 0

    # We do not want our validation function to participate in the gradient calculation (more on this in the next lecture).
    with torch.no_grad():
        # We iterate over the entire data set. 
        # The variable data contains the images. 
        # The variable target contains the labels of the images (not needed for the autoencoder).
        for data, target in data_loader:

            # We put the data onto the selected device.
            # Shapes: 
            # data: [b,c,h,w]
            # target: [b] - Single number of the label for each sample.
            # An autoencoder reconstructs its input, so the target is the data itself.
            data = data.to(device)
            target = data

            # We call the forward function of our model to get the reconstruction and the embedding for our data.
            # output shape: [b,c,h,w] - The reconstructed images (once the decoder is completed).
            output, embedding = model(data)

            # We calculate the validation loss.
            loss += F.mse_loss(output, target, reduction='mean').item()  # sum up the mean loss of each batch

    # The summed per-batch mean losses are divided by the number of batches to get the average loss.
    loss /= len(data_loader)        

    # We print the results.
    print(f"\nValidation for {mode} data after Epoch: {training_epoch}\t Loss {loss}\n")
    return loss

In [None]:
%load_ext tensorboard

In [None]:
%tensorboard --logdir ./runs

In [None]:
from torch.utils.tensorboard import SummaryWriter
import torchvision.models as models

# The writer to write data into tensorboard.
writer = SummaryWriter()

# Training Hyperparameters

# Autoencoder: 
# Decide how many channels the resulting feature map could have AFTER encoding
pre_embedding_channels = 4

# We want to run our model on a GPU, so we choose "cuda" as the device. Alternatively, we can run our model on the CPU with "cpu".
device = "cuda"
# The batch size will be explained in more detail in the next lecture. With batch_size we define the number of samples that are fed to the model in one step.
batch_size = 64

# The number of epochs we want to train the model. 
epochs = 25

# The learning rate used in the optimizer 
learning_rate = 0.0001

# The number of classes in the dataset
num_classes = 10 

# Print results each log_interval_steps steps.
log_interval_steps = 100

# Test the model after each epoch.
eval_interval_epochs = 1

# With the DataLoader we can load data from our dataset step by step. 
# With the argument batch_size we can specify how many samples are loaded in one batch.
# With shuffle we can specify whether the data set is shuffled and returned in a random order. With shuffle=True, the data set is reshuffled at the start of every epoch.
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
eval_loader = torch.utils.data.DataLoader(eval_dataset, batch_size=batch_size)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size)

# We load the model into the VRAM of the GPU.
model = CNN(isMnist=(DATASET=="MNIST"),preEmbeddingChannels=pre_embedding_channels).to(device)
# model = models.resnet18(pretrained=False, num_classes=num_classes).to(device) 
# model = models.efficientnet_b0(pretrained=False, num_classes=num_classes).to(device) 

# To plot the model graph in tensorboard, we need an example input.
example_image = train_dataset[0][0].unsqueeze(0).to(device)
#writer.add_graph(model, example_image)

# Print model
print(model)

# The optimizer used to train the model
# With model.parameters() we inform the optimizer about the parameters that should be optimized.
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# For each epoch
for epoch in range(1, epochs + 1):
    # We train the model on the entire train data.
    train_loss = train(train_loader, model, optimizer, epoch, log_interval_steps, device)

    # Check if we want to evaluate the model at the end of this epoch
    if epoch % eval_interval_epochs == 0:
        # Evaluate the model on the entire eval data.
        eval_loss = val(eval_loader, model, epoch, device)
        # Write the training loss and the eval loss into tensorboard.
        writer.add_scalar('Loss/train', train_loss, epoch)
        writer.add_scalar('Loss/eval', eval_loss, epoch)

In [None]:
# We test our final tuned model on the test data to report our final results.
test_loss = val(test_loader, model, epoch, device, mode="test")
# Write the test loss into tensorboard.
writer.add_scalar('Loss/test', test_loss, epoch)

### **2.4 Visualize the results**

In [None]:
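import numpy as np
def plotRandomSampleReconstruction():
  # Pick a random sample from the test set.
  sample_idx = torch.randint(len(test_dataset), size=(1,)).item()
  img, _ = test_dataset[sample_idx]

  # img is already a tensor; add a batch dimension and move it to the device.
  batch = img.unsqueeze(0).to(device)

  # No gradients are needed for visualization.
  with torch.no_grad():
    output, e = model(batch)

  # Bring the original image into [h,w,c] shape for Matplotlib.
  original = img.permute((1,2,0))
  output = output.cpu().numpy()
  output = np.squeeze(output)

  if DATASET != "MNIST":
    output = np.moveaxis(output,0,-1)

  # Left: original image, right: reconstruction.
  fig, (ax1, ax2) = plt.subplots(1, 2)
  ax1.imshow(np.squeeze(original))
  ax2.imshow(np.squeeze(output))
  plt.show()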

In [None]:
button = widgets.Button(description='Plot Random Sample Reconstruction')
out = widgets.Output()
def on_button_clicked(_):
      with out:
          clear_output()
          plotRandomSampleReconstruction()
button.on_click(on_button_clicked)
widgets.VBox([button,out])