In [1]:
import math
import numpy as np
from matplotlib import pyplot as plt
from sklearn.manifold import TSNE


# Advanced Machine Learning SS 2023
## Group Portfolio for the Final Grade

Welcome to the final portfolio of the Advanced Machine Learning Course 2023. Please read the following instructions carefully. 

<div class="alert alert-block alert-info">

* In this Jupyter Notebook, you find **12 exercises**. Each exercise is about one of the lectures you heard this semester.

* The portfolio should be solved as a **group of three** that was registered in Stud.IP. Deadline for handing the portfolio in is the **19th of July 2023 23:59 CET (German time)**.

* Do **not** add, delete, or rearrange cells. Also, only modify those cells which say to insert an answer or code (double click on the cell that says "Insert your answer here")

* If you did not participate in the tutorials and want to run the code segments in the portfolio, follow the installation instructions you find on Stud.IP under files -> 0. Tutorials -> Installation Guides 

* If you include code that is intended to be run to generate a solution, make sure that we can run your code without any modifications to the submitted notebook (i.e., no missing imports, top-to-bottom execution of cells, no absolute paths, etc.)

* Exercise 11, the reflection on AI tools, is **mandatory** for the portfolio.

* Please fill out the individual contribution statement at the end
</div>

*Good luck!*

##### **Please enter the names and student ID numbers of all group members here:**

### Exercise 1: NN Basics (Kieran)

Consider the following neural network with an input, a hidden and an output layer, trained with binary cross-entropy as the loss function. The hidden layer uses ReLU and the output layer uses sigmoid as activation function. The output label can be either 1 or 0. For the calculations, the input x and the weights of the hidden and output layers are given, and the label for this input is 0. <br>

![alternative text](NN_diagram_1.png) <br>
$input, x = \begin{bmatrix}x_{1} \\x_{2} \end{bmatrix} =\begin{bmatrix}-1 \\2 \end{bmatrix}$ and label, y = 0<br>
$W^{T}_{hidden} = \begin{bmatrix} w_{11}&w_{21}\\w_{12}&w_{22}\end{bmatrix}= \begin{bmatrix} 0.2&0.1\\-0.1&0.2\end{bmatrix} $ <br>
a = $W^{T}_{hidden}x$ <br>
output of the hidden layer, $h = ReLU(a)$ <br>
$W^{T}_{out} = \begin{bmatrix}0.1 & 0.2 \end{bmatrix} $ <br>
$a_{out}$ = $W^{T}_{out}h$ <br>
output of the output layer, $y' = \sigma (a_{out})$ <br>

Show that gradient descent with back-propagation reduces the Binary Cross Entropy loss using the following steps: <br>

a) Compute the current loss $L_{(1)}$ using forward propagation. (2 points) <br>


$input, x = \begin{bmatrix}x_{1} \\x_{2} \end{bmatrix} =\begin{bmatrix}-1 \\2 \end{bmatrix}$ and label, y = 0<br><br>
$W^{T}_{hidden} = \begin{bmatrix} w_{11}&w_{21}\\w_{12}&w_{22}\end{bmatrix}= \begin{bmatrix} 0.2&0.1\\-0.1&0.2\end{bmatrix} $ <br><br>
a = $W^{T}_{hidden}x$ <br><br>
a = $\begin{bmatrix} 0.2 \cdot -1 + 0.1 \cdot 2 \\ -0.1 \cdot -1 + 0.2 \cdot 2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0.5 \end{bmatrix}$<br><br>
output of the hidden layer, $h = ReLU(a) = \begin{bmatrix} \max(0,0) \\ \max(0,0.5) \end{bmatrix} = \begin{bmatrix} 0 \\ 0.5 \end{bmatrix}$ <br><br>
$W^{T}_{out} = \begin{bmatrix}0.1 & 0.2 \end{bmatrix} $ <br><br>
$a_{out}$ = $W^{T}_{out}h = 0.1 \cdot 0 + 0.2 \cdot 0.5 = 0.1$ <br><br>
output of the output layer, $y' = \sigma (a_{out}) = \frac{1}{1+e^{-a_{out}}} = \frac{1}{1+e^{-0.1}}$ <br><br>
Binary Cross Entropy Loss: $L_{(1)}(y,y') = - ((y \cdot \log(y')) + (1-y) \cdot \log(1-y')) = - 1 \cdot \log(1-y') = - \log(1-y')$<br><br>

In [12]:
loss = - math.log(1 - (1/(1+ math.exp(-0.1))))
print('L_1 = ', loss)

L_1 =  0.744396660073571
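
The forward pass above can also be reproduced step by step. The following is a minimal NumPy sketch using only the values from the exercise statement; the variable names are illustrative choices, not part of the exercise.

```python
import numpy as np

# Values taken from the exercise statement
x = np.array([-1.0, 2.0])               # input
y = 0.0                                 # label
W_hidden_T = np.array([[0.2, 0.1],
                       [-0.1, 0.2]])    # W^T_hidden
W_out_T = np.array([0.1, 0.2])          # W^T_out

a = W_hidden_T @ x                      # pre-activation of the hidden layer
h = np.maximum(0.0, a)                  # ReLU
a_out = W_out_T @ h                     # pre-activation of the output neuron
y_pred = 1.0 / (1.0 + np.exp(-a_out))   # sigmoid

# Binary cross-entropy for a single sample
loss_1 = -(y * np.log(y_pred) + (1.0 - y) * np.log(1.0 - y_pred))
print("a =", a, "h =", h, "a_out =", a_out)
print("L_1 =", loss_1)
```

Keeping the intermediate values `a`, `h` and `y_pred` around is convenient because back-propagation reuses all of them.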


b) Compute the gradients of all weights using Back-propagation and apply Gradient descent with a learning rate of 0.1 to update the weights. (3 points) <br>


$\frac{\partial L_{(1)}}{\partial y'} = \frac{\partial - ((y \cdot \log(y')) + (1-y) \cdot \log(1-y'))}{\partial y'} = - (\frac{y}{y'} - \frac{1-y}{1-y'}) = - (\frac{y(1-y') - y'(1-y)}{y'(1-y')}) = - (\frac{y - y'}{y'(1-y')}) = \frac{y'-y}{y'(1-y')}$<br><br>
$\frac{\partial y'}{\partial a_{out}} = \frac{\partial \frac{1}{1+e^{-a_{out}}}}{\partial a_{out}} = \frac{0 - \frac{\partial (1+e^{-a_{out}})}{\partial a_{out}}}{(1+e^{-a_{out}})^2} = \frac{1}{1+e^{-a_{out}}} (\frac{e^{-a_{out}}}{1+e^{-a_{out}}}) = y' (\frac{1 + e^{-a_{out}} - 1}{1+e^{-a_{out}}}) = y' (1 - \frac{1}{1+e^{-a_{out}}}) = y'(1-y')$<br><br>
$\frac{\partial a_{out}}{\partial{w_1}} = \frac{\partial w_1 \cdot h_1 + w_2 \cdot h_2}{\partial w_1} = h_1 = 0$<br><br>
$\frac{\partial a_{out}}{\partial{w_2}} = \frac{\partial w_1 \cdot h_1 + w_2 \cdot h_2}{\partial w_2} = h_2 = 0.5$<br><br>
$\frac{\partial a_{out}}{\partial{h_1}} = \frac{\partial w_1 \cdot h_1 + w_2 \cdot h_2}{\partial h_1} = w_1 = 0.1$<br><br>
$\frac{\partial a_{out}}{\partial{h_2}} = \frac{\partial w_1 \cdot h_1 + w_2 \cdot h_2}{\partial h_2} = w_2 = 0.2$<br><br>
$\frac{\partial h_1}{\partial{a_1}} = \frac{\partial \max(0,a_1)}{\partial a_1} = 1$ as $a_1 \geq 0$ (using the convention that the derivative at $a_1 = 0$ is $1$), otherwise $0$<br><br>
$\frac{\partial h_2}{\partial{a_2}} = \frac{\partial \max(0,a_2)}{\partial a_2} = 1$ as $a_2 \geq 0$, otherwise $0$<br><br>
$\frac{\partial a_1}{\partial{w_{11}}} = \frac{\partial w_{11} \cdot x_1 + w_{21} \cdot x_2}{\partial w_{11}} = x_1$<br><br>
$\frac{\partial a_1}{\partial{w_{21}}} = \frac{\partial w_{11} \cdot x_1 + w_{21} \cdot x_2}{\partial w_{21}} = x_2$<br><br>
$\frac{\partial a_2}{\partial{w_{12}}} = \frac{\partial w_{12} \cdot x_1 + w_{22} \cdot x_2}{\partial w_{12}} = x_1$<br><br>
$\frac{\partial a_2}{\partial{w_{22}}} = \frac{\partial w_{12} \cdot x_1 + w_{22} \cdot x_2}{\partial w_{22}} = x_2$<br><br>


So, we can now easily calculate the partial derivatives of the loss function with respect to the weights:<br><br>
$\frac{\partial L_{(1)}}{\partial w_1} = \frac{\partial L_{(1)}}{\partial y'} \cdot \frac{\partial y'}{\partial a_{out}} \cdot \frac{\partial a_{out}}{\partial{w_1}} = \frac{y'-y}{y'(1-y')} \cdot y'(1-y') \cdot h_1 = h_1 (y'-y) = 0$<br><br>
$\frac{\partial L_{(1)}}{\partial w_2} = \frac{\partial L_{(1)}}{\partial y'} \cdot \frac{\partial y'}{\partial a_{out}} \cdot \frac{\partial a_{out}}{\partial{w_2}} = \frac{y'-y}{y'(1-y')} \cdot y'(1-y') \cdot h_2 = h_2 (y'-y) = 0.5y'$<br><br>
$\frac{\partial L_{(1)}}{\partial w_{11}} = \frac{\partial L_{(1)}}{\partial y'} \cdot \frac{\partial y'}{\partial a_{out}} \cdot \frac{\partial a_{out}}{\partial{h_1}} \cdot \frac{\partial h_1}{\partial{a_1}} \cdot \frac{\partial a_1}{\partial{w_{11}}} = \frac{y'-y}{y'(1-y')} \cdot y'(1-y') \cdot w_1 \cdot 1 \cdot x_1 = w_1 \cdot x_1 (y' - y) = 0.1 \cdot (-1) \cdot y' = -0.1y'$<br><br>
$\frac{\partial L_{(1)}}{\partial w_{21}} = \frac{\partial L_{(1)}}{\partial y'} \cdot \frac{\partial y'}{\partial a_{out}} \cdot \frac{\partial a_{out}}{\partial{h_1}} \cdot \frac{\partial h_1}{\partial{a_1}} \cdot \frac{\partial a_1}{\partial{w_{21}}} = \frac{y'-y}{y'(1-y')} \cdot y'(1-y') \cdot w_1 \cdot 1 \cdot x_2 = w_1 \cdot x_2 (y'-y) = 0.1 \cdot 2 \cdot y' = 0.2y'$<br><br>
$\frac{\partial L_{(1)}}{\partial w_{12}} = \frac{\partial L_{(1)}}{\partial y'} \cdot \frac{\partial y'}{\partial a_{out}} \cdot \frac{\partial a_{out}}{\partial{h_2}} \cdot \frac{\partial h_2}{\partial{a_2}} \cdot \frac{\partial a_2}{\partial{w_{12}}} = \frac{y'-y}{y'(1-y')} \cdot y'(1-y') \cdot w_2 \cdot 1 \cdot x_1 = w_2 \cdot x_1 (y'-y) = 0.2 \cdot (-1) \cdot y' = -0.2y'$<br><br>
$\frac{\partial L_{(1)}}{\partial w_{22}} = \frac{\partial L_{(1)}}{\partial y'} \cdot \frac{\partial y'}{\partial a_{out}} \cdot \frac{\partial a_{out}}{\partial{h_2}} \cdot \frac{\partial h_2}{\partial{a_2}} \cdot \frac{\partial a_2}{\partial{w_{22}}} = \frac{y'-y}{y'(1-y')} \cdot y'(1-y') \cdot w_2 \cdot 1 \cdot x_2 = w_2 \cdot x_2 (y'-y) = 0.2 \cdot 2 \cdot y' = 0.4y'$<br><br>

We can now calculate the new weights with a learning rate of 0.1:<br><br>
$w^{(2)}_1 = w_1 - 0.1 \cdot \frac{\partial L_{(1)}}{\partial w_1} = w_1 - 0.1 \cdot 0 = w_1 = 0.1$<br><br>
$w^{(2)}_2 = w_2 - 0.1 \cdot \frac{\partial L_{(1)}}{\partial w_2} = w_2 - 0.1 \cdot 0.5y' = 0.2 - 0.05y'$<br><br>
$w^{(2)}_{11} = w_{11} - 0.1 \cdot \frac{\partial L_{(1)}}{\partial w_{11}} = w_{11} - 0.1 \cdot (-0.1y') = 0.2 + 0.01y'$<br><br>
$w^{(2)}_{21} = w_{21} - 0.1 \cdot \frac{\partial L_{(1)}}{\partial w_{21}} = w_{21} - 0.1 \cdot 0.2y' = 0.1 - 0.02y'$<br><br>
$w^{(2)}_{12} = w_{12} - 0.1 \cdot \frac{\partial L_{(1)}}{\partial w_{12}} = w_{12} - 0.1 \cdot (-0.2y') = -0.1 + 0.02y'$<br><br>
$w^{(2)}_{22} = w_{22} - 0.1 \cdot \frac{\partial L_{(1)}}{\partial w_{22}} = w_{22} - 0.1 \cdot 0.4y' = 0.2 - 0.04y'$<br><br>


c) Perform Forward-propagation again with the updated weights and recompute the loss $L_{(2)}$ (2 points)

$input, x = \begin{bmatrix}x_{1} \\x_{2} \end{bmatrix} =\begin{bmatrix}-1 \\2 \end{bmatrix}$ and label, y = 0<br><br>
$W^{T}_{hidden(2)} = \begin{bmatrix} w^{(2)}_{11}&w^{(2)}_{21}\\w^{(2)}_{12}&w^{(2)}_{22}\end{bmatrix}= \begin{bmatrix} 0.2 + 0.01y'_{(1)}&0.1 - 0.02y'_{(1)}\\-0.1 + 0.02y'_{(1)}&0.2 - 0.04y'_{(1)}\end{bmatrix} $ <br><br>
a = $W^{T}_{hidden(2)}x$ <br><br>
a = $\begin{bmatrix} (0.2 + 0.01y'_{(1)}) \cdot (-1) + (0.1 - 0.02y'_{(1)}) \cdot 2 \\ (-0.1 + 0.02y'_{(1)}) \cdot (-1) + (0.2 - 0.04y'_{(1)}) \cdot 2 \end{bmatrix} = \begin{bmatrix} -0.05y'_{(1)} \\ 0.5 - 0.1y'_{(1)}\end{bmatrix}$<br><br>
output of the hidden layer, $h = ReLU(a) = \begin{bmatrix} \max(0,-0.05y'_{(1)}) \\ \max(0,0.5 - 0.1y'_{(1)}) \end{bmatrix} = \begin{bmatrix} 0 \\ 0.5 - 0.1y'_{(1)} \end{bmatrix}$, since $0 < y'_{(1)} < 1$ <br><br>
$W^{T}_{out(2)} = \begin{bmatrix}0.1 & 0.2 - 0.05y'_{(1)} \end{bmatrix} $ <br><br>
$a_{out}$ = $W^{T}_{out(2)}h = 0.1 \cdot 0 + (0.2 - 0.05y'_{(1)}) \cdot (0.5 - 0.1y'_{(1)}) = (0.2 - 0.05y'_{(1)}) \cdot (0.5 - 0.1y'_{(1)})$ <br><br>
output of the output layer, $y'_{(2)} = \sigma (a_{out}) = \frac{1}{1+e^{-a_{out}}}$ <br><br>
Binary Cross Entropy Loss: $L_{(2)}(y,y') = - ((y \cdot \log(y')) + (1-y) \cdot \log(1-y')) = - 1 \cdot \log(1-y') = - \log(1-y')$<br><br>


In [9]:
y_dash_1 = 1/(1 + math.exp(-0.1))
a_out = (0.2 - 0.05*(y_dash_1)) * (0.5 - 0.1*(y_dash_1))
loss = - math.log(1 - (1/(1+ math.exp(- a_out))))
print('L_2 = ', round(loss, 4))


L_2 =  0.7328

Since $L_{(2)} \approx 0.7328 < L_{(1)} \approx 0.7444$, the gradient descent step has reduced the Binary Cross Entropy loss.
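
One way to cross-check the hand calculation is to run a single gradient-descent step in code. The sketch below chains the local derivatives listed in part b), including the convention that the ReLU derivative is 1 at $a = 0$. It is an illustrative NumPy implementation with made-up helper names (`forward`, `bce`), not part of the exercise template.

```python
import numpy as np

def forward(W_h, w_o, x):
    """Forward pass: returns hidden pre-activation, hidden output and prediction."""
    a = W_h @ x
    h = np.maximum(0.0, a)
    a_out = w_o @ h
    y_pred = 1.0 / (1.0 + np.exp(-a_out))
    return a, h, y_pred

def bce(y, y_pred):
    """Binary cross-entropy for a single sample."""
    return -(y * np.log(y_pred) + (1.0 - y) * np.log(1.0 - y_pred))

x = np.array([-1.0, 2.0])
y = 0.0
W_h = np.array([[0.2, 0.1], [-0.1, 0.2]])   # rows: (w11, w21) and (w12, w22)
w_o = np.array([0.1, 0.2])                  # (w1, w2)
lr = 0.1

a, h, y_pred = forward(W_h, w_o, x)
loss_before = bce(y, y_pred)

# Backward pass via the chain rule
delta_out = y_pred - y                      # dL/da_out = y' - y
grad_w_o = delta_out * h                    # dL/dw_i = (y' - y) * h_i
relu_grad = (a >= 0).astype(float)          # convention: derivative 1 at a = 0
delta_h = delta_out * w_o * relu_grad       # dL/da_j = (y' - y) * w_j * ReLU'(a_j)
grad_W_h = np.outer(delta_h, x)             # dL/dW[j, k] = dL/da_j * x_k

# Gradient descent moves against the gradient
W_h_new = W_h - lr * grad_W_h
w_o_new = w_o - lr * grad_w_o

_, _, y_pred_new = forward(W_h_new, w_o_new, x)
loss_after = bce(y, y_pred_new)
print("loss before:", loss_before)
print("loss after: ", loss_after)
```

Note that the gradient of a hidden weight contains the output weight $w_j$ (from $\partial a_{out} / \partial h_j$), not the hidden activation $h_j$.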


### Exercise 2: Regularization and Optimization (Benedikt)

#### a) Questions

| Model | 1. | 2. | 3. | 4. |
| --- | --- | --- | --- | --- |
| Train Error | 0.39 | 0.25 | 0.5 | 0.15 |
| Test Error | 0.4 | 0.3 | 0.5 | 0.6 |

Let's say you're training a neural network to classify male and female avatar faces. Your dataset is difficult: a person manages to achieve an accuracy of only 85% on it. At different stages during the development of your model, you achieve the four train/test error tuples shown above.

**1. Explain why you would look at the train and test error, what the difference between them is, and how they relate to the error of the model. (maximum 5 sentences):** (1 point)

The train and test errors capture different phenomena in a DNN, and the model can be evaluated by looking at each of them on its own and at both together. The train error alone shows whether the model has learned the training data or is too strongly biased, in other words whether the underlying assumptions are appropriate (for example, the assumptions made when using an LDA). The test error shows how well the model performs in the field, on unseen data. Taken together, the difference between the two errors is what matters: no difference means that the model generalizes as well as the underlying assumptions allow. In general, the difference can be interpreted as variance.
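
As a small illustration of this reading, the gaps can be computed directly from the table. The sketch below treats the human error (1 - 0.85 = 0.15) as a rough stand-in for the lowest achievable error, which is an assumption made here and not something the exercise states.

```python
# Errors from the table above as (train error, test error) per model.
# Using the human error as a reference level is an assumption of this sketch.
human_error = 0.15
models = {1: (0.39, 0.40), 2: (0.25, 0.30), 3: (0.50, 0.50), 4: (0.15, 0.60)}

gaps = {}
for name, (train_err, test_err) in models.items():
    bias_gap = round(train_err - human_error, 2)    # distance of training error from human level
    variance_gap = round(test_err - train_err, 2)   # generalization gap
    gaps[name] = (bias_gap, variance_gap)
    print(f"Model {name}: bias gap = {bias_gap:.2f}, variance gap = {variance_gap:.2f}")
```

A large first value points towards underfitting (high bias), a large second value towards overfitting (high variance).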

**2. Match the model performance and Bias/Variance situations to each dartboard hit illustration in the figure below and give a short explanation (1 sentence) to each combination. Also match each illustration with a model from the table above.** (2 points)

*Example solution:*
* A - *Bias/Variance: a, Model performance: a, Model: 1*
* B - *Bias/Variance: b, Model performance: b, Model: 2*

Bias/Variance: **a.** high variance + high bias, **b.** high variance, **c.** high bias, **d.** low bias + low variance <br>
Model performance: **a.** best model, **b.** worst model, **c.** overfitting, **d.** underfitting

![Figure 1](images/dartboard_illustration_Q2.jpg)

If the model is the dart player, the task would be to learn where the middle of the board is (training error -> bias) and to show where the middle of the board is (test error -> variance).

A - Bias/Variance: d, Model performance: a, Model: - 
The middle of the board is learned and is found by the model. 

B - Bias/Variance: b, Model performance: c, Model: -
The middle of the board is learned but is not precisely found by the model.

D - Bias/Variance: c, Model performance: d, Model: -
The middle is not learned, and the wrongly learned middle is not found by the model.

C - Bias/Variance: a, Model performance: b, Model: -
The middle is not learned, but the wrongly learned middle is found by the model.

The table values (models) cannot be clearly assigned to a single combination of dartboard, Bias/Variance and model performance. Therefore, two separate assignments are given below: one between the table models and the dartboards (left), and one between the table models and the model performance options (right):

A - Model: 1  |  a - Model: 2 

&nbsp;
B - Model: 4  |  b - Model: 4  

&nbsp;
C - Model: 2  |  c - Model: 4  

&nbsp;
D - Model: 3  |  d - Model: 3  

&nbsp;

**3. Why is Regularization important in machine learning and how does the L2 regularization affect the model training process? Give another example of a regularization method!** (1 point)

Regularization is important if a model tends to overfit. In general, regularization methods are techniques that force the model to generalize, typically by keeping the weights small. L2 regularization affects the training process by adding a penalty term to the loss function. This penalty term is proportional to the squared magnitude of the model's weights, so very large weights incur a high cost. L2 regularization also introduces another tunable hyperparameter, alpha, which scales the impact of the L2 term. The goal is to find a balance between the cost of wrongly classified training data and the added cost of large weights. 
Another regularization method is dropout, in which randomly chosen neurons are dropped during training. The model cannot rely on specific weights during training, so it starts to generalize. This is comparable to the idea behind random forests.
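
Both ideas can be made concrete in a few lines of PyTorch. This is a minimal sketch with made-up weights and a hypothetical regularization strength `alpha`; `nn.Dropout` is the standard PyTorch dropout layer, which drops each activation independently with probability `p`.

```python
import torch
from torch import nn

torch.manual_seed(0)

# L2 penalty: alpha times the sum of squared weights, added to the data loss.
weights = torch.tensor([[3.0, -4.0], [0.0, 0.5]])
alpha = 0.01                                   # hypothetical regularization strength
l2_penalty = alpha * weights.pow(2).sum()      # 0.01 * (9 + 16 + 0 + 0.25)

# Dropout: each activation is zeroed with probability p during training,
# and the surviving activations are rescaled by 1 / (1 - p).
dropout = nn.Dropout(p=0.5)
activations = torch.ones(1000)
dropout.train()
dropped = dropout(activations)
dropout.eval()
kept = dropout(activations)                    # dropout is the identity in evaluation mode

print("L2 penalty:", l2_penalty.item())
print("fraction zeroed in training:", (dropped == 0).float().mean().item())
print("unchanged in eval:", torch.equal(kept, activations))
```

The large entries (3 and -4) dominate the penalty, which is why L2 regularization mainly discourages a few very large weights.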

**4. Is overfitting an issue for massive-scale models, like large-language models? What evidence for or against overfitting do you see?** (1 point)

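Overfitting means memorizing the training data, which happens when a model has many learnable weights relative to the amount of training data. Large language models such as GPT-3 have 175 billion parameters to learn (CGPT), so they need a correspondingly massive amount of training data; accidentally ending up with too much data is not a realistic concern. 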

#### b) Implementation
Use the following code cell to:

**1. Implement your own binary Cross-Entropy loss function.**  (2 points)

**2. Add L2 regularization using the L2 regularized *binary* objective function.** (2 points)

Notes:
* Do not change existing code, only add your code. Follow the hints in the comments.
* Implement BINARY cross-entropy loss. There will be NO points for implementing categorical CE!
* Do NOT use loops when implementing the loss function (no "for", "while")! We will deduct points if loops are used!

In [22]:
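import torch
import torch.nn.functional as F
from torch import nn 

class MyCrossEntropyLoss(nn.Module):
    
    def __init__(self, params, l2=0):
        super().__init__()
        self.l2 = l2
        # store the parameters as a list so they can be iterated on every call
        # (model.parameters() returns a generator that is exhausted after one pass)
        self.para = list(params)

    def forward(self, y_predicted, y_target):
        # map the raw network outputs to probabilities; sigmoid takes no dim argument
        y_predicted = F.sigmoid(y_predicted)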
        
        ### start; your code here
        # cross-entropy term
        # take log of predicted output probabilities of your samples and then multiply with target vectors
        # for stability: clamp the probabilities to [eps, 1 - eps] to avoid nan/inf values when taking log of 0
        eps = 1e-7
        y_predicted = torch.clamp(y_predicted, eps, 1 - eps)
        log_y_pre_1 = torch.log(y_predicted)
        log_y_pre_0 = torch.log(1 - y_predicted)
        # loss function
        loss_vec = y_target * log_y_pre_1 + (1-y_target) * log_y_pre_0
        # sum CE loss over all samples
        loss_abs = torch.sum(loss_vec)
        # take average of the CE sum to get avg loss per sample (size(0) = number of samples)
        loss_neg = loss_abs / loss_vec.size(0)
        # negate the result
        loss = -1 * loss_neg
        ### end of your code;

        # using L2 regularization
        if self.l2 > 0:
            # loop over all parameters (weight matrices and bias vectors)
            for p in self.para:
                # p.data contains the current parameter values. In the used network all weights are matrices, so
                # we filter biases based on that fact
                if len(p.data.size()) == 1:
                    # skip bias vectors
                    continue
                    
                ### start; your code here
                # loss term for L2 regularization
                # calculate L2 term (squared L2 norm of weight matrix)
                l2_term = torch.norm(p)**2 # CGPT
                # calculate weight of L2 term using self.l2
                weight = self.l2 / (2 * loss_vec.size(0))
                # multiply weight and term and add to CE loss
                loss = loss + l2_term*weight
                ### end of your code;


        return loss
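
A quick sanity check for a hand-written loss is to compare it against PyTorch's built-in implementation. The sketch below uses a compact, functional variant (the name `bce_with_l2` and the random test data are illustrative, not part of the template) and compares it with `F.binary_cross_entropy_with_logits`, assuming the L2 weighting $\frac{\lambda}{2N}$.

```python
import torch
import torch.nn.functional as F

def bce_with_l2(logits, targets, weights=(), l2=0.0, eps=1e-7):
    """Binary cross-entropy on raw outputs plus an optional L2 term."""
    probs = torch.sigmoid(logits).clamp(eps, 1 - eps)
    ce = -(targets * torch.log(probs) + (1 - targets) * torch.log(1 - probs)).mean()
    n = logits.size(0)                                   # number of samples
    penalty = sum(w.pow(2).sum() for w in weights)       # squared L2 norms of the weight matrices
    return ce + l2 / (2 * n) * penalty

torch.manual_seed(0)
logits = torch.randn(8, 1)                               # raw network outputs
targets = torch.randint(0, 2, (8, 1)).float()            # binary labels
W = torch.randn(4, 3)                                    # stand-in weight matrix

ours = bce_with_l2(logits, targets)
reference = F.binary_cross_entropy_with_logits(logits, targets)
print(ours.item(), reference.item())

regularized = bce_with_l2(logits, targets, weights=[W], l2=0.1)
print(regularized.item())
```

With `l2=0` the two values should agree up to floating-point error; with `l2 > 0` the loss grows by the weighted squared norm of `W`.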

**3. What does the ```y_predicted = F.sigmoid(y_predicted, dim=1)``` do and what is the benefit of it ?** (1 point)

`F.sigmoid()` computes a new tensor with the same shape as `y_predicted`, whose entries are sigmoid(y_predicted_i). The sigmoid function maps each y_i onto the interval [0, 1], so the output can be interpreted as a probability. Applying it inside the loss reflects a common PyTorch pattern in which the output activation is not part of the model itself. 

### Exercise 3 : End-to-End Systems (Kieran)

**1. Explain what an End-to-End Deep Learning System is by discussing an example in the context of Machine Translation** (2 points)

End-to-End learning describes training a single model (in particular a neural network) to solve a complex problem as a whole. Machine Translation is the ML task of translating text from one language to another without any human input or interference. End-to-End Machine Translation, also commonly called Neural Machine Translation, approaches this task with a single, self-contained model. This model, most commonly a sequence-to-sequence (Seq2Seq) neural network, receives a sentence or text block in the source language as input and translates it into a semantically similar, fluent block of text in the target language. The model is completely closed and interconnected and, being trained on large amounts of paired data from both languages, learns general patterns to create its output. For each new output word, the entire context is available and used (each of the input words, along with all previously generated output), as shown in the diagram below:  

<figure width='100%' style="display:flex; align-items: center; justify-content: center">
  <img src='./images/end-2-end.png' width='50%'>
</figure>

Such End-to-End Machine Translation models have huge advantages over their predecessors as, with enough quality training data, they are able to 'understand' the broader context and meaning of an input. This meaning can then be translated into a cohesive, semantically complex and contextual output. A very clear example of an End-to-End learning system would be the task of creating an image from a French sentence: a modular system may first use a Machine Translation module to translate the French sentence into an English sentence, before using an image generation module to create an image from this English sentence. An End-to-End system would generate an image directly from the input French sentence. 

text-src: https://phrase.com/blog/posts/neural-machine-translation/ <br>
img-src: https://opennmt.net/

**2. The American superhero Daredevil, who is blind, wants to question the only suspect regarding the investigation he is conducting. However, the suspect is deaf and only communicates in Italian Sign Language. He hires you and asks you to build a system that will enable him to interrogate this suspect.** (4 points)

i) What would a system that solves this task look like? Give a rough outline of both an End-to-End system and a modular system (input/output and, roughly, the involved processes/modules for each system).

ii) What are the advantages/disadvantages of each of your systems (End2End vs modular) you described in i)?

Grading conditions / important notes:

For (i): Make sure to identify all sub-problems your system needs to tackle in order to appropriately solve the described task.

<i>i) What would a system that solves this task look like?</i><br><br>
Speech/Braille -> Italian Sign Language (Video or Robot)<br><br>
<strong>End-To-End system:<br></strong>
Input: Speech<br>
 -> End-To-End Module<br>
Output: Italian Sign Language (Video or Robot)<br><br>

<strong>Modular system:<br></strong>
Input: Speech<br>
 -> Acoustic Model (Audio Feature Extraction)<br>
 -> Phonetic Model (Sound to Syllables or 'phoneme')<br>
 -> Word Composition Model (Syllable to Words) <br>
 -> Machine Translation module (Translate from English to Italian Sign Language): Can also be broken down into sub-modules (For example https://aclanthology.org/N03-1017/)<br>
    &nbsp; &nbsp; - Language Module<br>
    &nbsp; &nbsp; - Translation Module<br>
 -> Video Creation / Robot Controller Model (Convert vector representations of Italian Sign Language Words to machine instructions)<br>
Output: Italian Sign Language (Video or Robot)<br><br>

Italian Sign Language (Video or Robot) -> Speech/Braille<br><br>
<strong>End-To-End system:<br></strong>
Input: Italian Sign Language<br>
 -> End-To-End Module<br>
Output: Speech/Braille<br><br>

<strong>Modular system:<br></strong>
Input: Italian Sign Language<br>
 -> Image capture model (Capture each individual gesture from a video of the suspect communicating through Italian Sign Language) <br>
 -> Gesture Meaning Model (Translates the gestures into words with meaning) <br>
 -> Machine Translation module (Translate from Italian Sign Language to English):<br>
    &nbsp; &nbsp; - Language Module<br>
    &nbsp; &nbsp; - Translation Module<br>
 -> Speech Creation model (Words given to speech generation model) or <br>
 -> Braille Creation (Words given to machine instructions for the Braille to be printed)<br>
Output: Speech/Braille<br><br>
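
The structural difference between the two designs can be sketched in code. In the following illustration, every stage function is a hypothetical stand-in for a trained module (each one only tags its input); the point is the composition, not the models themselves.

```python
from typing import Callable, List

Stage = Callable[[str], str]

def make_pipeline(stages: List[Stage]) -> Stage:
    """Chain stages so that each one consumes the previous stage's output."""
    def run(data: str) -> str:
        for stage in stages:
            data = stage(data)
        return data
    return run

# Hypothetical stand-ins for trained modules
def gesture_recognition(video: str) -> str:
    return f"glosses({video})"

def sign_to_english(glosses: str) -> str:
    return f"english({glosses})"

def text_to_speech(text: str) -> str:
    return f"audio({text})"

modular_system = make_pipeline([gesture_recognition, sign_to_english, text_to_speech])

# An end-to-end system offers the same interface but hides all intermediate steps.
def end_to_end_system(video: str) -> str:
    return f"audio(<learned mapping of {video}>)"

print(modular_system("lis_video"))
print(end_to_end_system("lis_video"))
```

In the modular version, a single stage (for example the translation module) can be swapped or inspected on its own, whereas the end-to-end version exposes only its input and output.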

<i>ii) What are the advantages/disadvantages of each of your systems (End2End vs Modular) you described in i)?</i><br><br>
<strong>End-To-End:<br></strong>
Advantages:<br>
 - Doesn't need any deep knowledge or expertise about Braille or Italian Sign Language<br>
 - Can be implemented by someone without in-depth Computer Science knowledge<br><br>

Disadvantages:<br>
 - It is very difficult to modify the system - what if the next suspect is deaf but only communicates in French Sign Language? Then the whole model would have to be retrained<br>
 - If something goes wrong it is very difficult to troubleshoot<br>
 - Cannot incorporate preexistent universally recognized modules<br>
 - Very costly to train in terms of time and energy use <br>
 - Huge amount of training data needed: the model requires a lot of paired Italian Sign Language and Speech/Braille data to be trained<br><br>

<strong>Modular System:<br></strong>
Advantages:<br>
 - It is possible to adapt this system to the next case, for example if another suspect only speaks Portuguese<br>
 - Each step is very clear so it is easy to troubleshoot if an error occurs<br>
 - It could be possible to completely use preexistent recognized modules - increasing time and energy efficiency<br><br>
  
Disadvantages:<br>
 - Expert knowledge needed for the setup and connection of each component - requires a lot of human resources and management<br>
 - If a module needs to be created and trained, rather than being imported, then it may require expert knowledge (for example in Braille or Italian Sign Language)

text-src: https://towardsdatascience.com/e2e-the-every-purpose-ml-method-5d4f20dafee4

**3. An Autonomous Driving system can also be designed as an end-to-end system. Explain the main components of an Autonomous Driving system, how the components are connected, and how each contributes to the overall system.** (2 points)

Notes: Use "[Standard Driven Software Architecture for Fully Autonomous Vehicles](https://www.atlantis-press.com/journals/jase/125934832/view)" by Serban et al. to support your explanations.

Autonomous Driving is the term used to describe the process of automating the control of automobiles. In general, "the automation of any task is a control loop which receives input from sensors, performs some reasoning and acts upon the environment (possibly through actuators)" (txt-src(1)). For a long time, the dominant view was that three elements were needed for any autonomous robot: a sensing system, a planning system and an execution system (SPA). Autonomous driving, however, poses such a difficult challenge that this simple model cannot capture the complexity of the task. For full automation, the main question seems to be the balancing of complex tasks and simple reactions: the system has to have a view of the entire route it needs to take to get to the destination while also constantly reacting to hazards and objects in the immediate vicinity. An autonomous driving system therefore requires both reactive (high frequency, quick computation) and deliberative (broader environment understanding, slow computation) components. 

<figure width='100%' style="display:flex; align-items: center; justify-content: center">
  <img src='./images/autonomous_driving.png' width='50%'>
</figure>
This image provides an overview of a suggested component structure (txt-src(1)), from the interface for inputs into the model (Sensors Abstraction), through a hierarchy of 5 elements (Sensor Fusion, World Model, Behavior Generation, Planning, Vehicle Control) to the resulting actions (Actuators Interface). An End-to-End System would replace the entire hierarchical structure between the Input and Output with a single network. The inputs are made up of environmental inputs (RADAR, LIDAR and cameras), global positioning (GPS, including routing), communication with other vehicles and traffic elements (V2X) and the internal vehicle state (Speed, momentum etc.). The output is the control of the vehicle, which can be broken down into lateral control (Steering) and longitudinal control (Brake, Throttle, Transmission). The components depicted between these input and output states respond to different types of information provided to the system. As stated above, the system requires a very complex hierarchy of decision making, "For example, one can not only judge the distances to the surrounding objects, but also the relevance of the decision in achieving the goal. Is it worth to overtake the car in front if the vehicle must turn right in a relatively short distance after the overtake" (txt-src(1)). A suggested 'tee and join/pipe-and-filter' structure of hierarchy and connection is displayed below. It enables the 'Sensor Fusion' component to react instantaneously to dynamic objects in the immediate surroundings, while the 'Behavior Generation' component continuously monitors the global route.

<figure width='100%' style="display:flex; align-items: center; justify-content: center; gap: 50px">
  <img src='./images/tee_and_join_pipeline.png' width='20%'>
  <img src='./images/pipe_and_filter.png' width='30%'> 
</figure>

An End-to-End Autonomous Driving system would bypass the complex interconnectivity issues between the various components, as the entire system would be a single, connected decision-making module. However, in doing so, we would lose the understanding and control of the hierarchy of decision making. The added opacity of an End-to-End system is concerning as, even in a modular system, "To this moment it is not clear how autonomous vehicles will behave in case an accident cannot be avoided and which risk to minimize" (txt-src(1)). For example, how would we want an autonomous driving system to respond in the following situations? How would we truly be able to troubleshoot, verify and validate the decision-making process of an End-to-End system?

<figure width='100%' style="display:flex; align-items: center; justify-content: center">
  <img src='./images/traffic_situations.png' width='50%'>
</figure>

Finally, the entire process depends on the car manufacturers, the Original Equipment Manufacturers (OEM). Each car brand and model would provide completely different input and output parameters. A modular system could be designed with constant awareness of the integration of such systems, allowing the adaptation of certain components, nodes or blocks to individual OEM architectures. How costly would it be, by contrast, to ensure that an End-to-End system is truly verified for a specific integration?

text-src(1): https://www.atlantis-press.com/journals/jase/125934832/view <br>
text-src(2): https://www.researchgate.net/publication/326638158_Tactical_Safety_Reasoning_A_Case_for_Autonomous_Vehicles <br>
img-src(1-3): https://www.atlantis-press.com/journals/jase/125934832/view <br>
img-src(4): https://www.researchgate.net/publication/326638158_Tactical_Safety_Reasoning_A_Case_for_Autonomous_Vehicles


### Exercise 4: RNN/LSTM

In [41]:
import os
import cv2
import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from skimage import io
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm
from matplotlib import pyplot as plt
from sklearn.preprocessing import LabelEncoder
import torchvision.transforms as transforms
import torchvision.datasets as dsets
from torch.autograd import Variable

train_dataset = dsets.MNIST(root='./data', 
                            train=True, 
                            transform=transforms.ToTensor(),
                            download=True)

test_dataset = dsets.MNIST(root='./data', 
                           train=False, 
                           transform=transforms.ToTensor())


batch_size = 100
n_iters = 3000
num_epochs = n_iters / (len(train_dataset) / batch_size)
num_epochs = int(num_epochs)

train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=batch_size, 
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                          batch_size=batch_size, 
                                          shuffle=False)

input_dim = 28
hidden_dim = 64 # for example

output_dim = 10 
layer_dim = 1


In [64]:
# The RNN
class RNNModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
        super(RNNModel, self).__init__()
        # Hidden dimensions
        self.hidden_dim = hidden_dim
        
        # Number of hidden layers
        self.layer_dim = layer_dim
        # Building your RNN
        self.rnn = nn.RNN(input_dim, hidden_dim, layer_dim, batch_first=True, nonlinearity='tanh')

        # Readout layer
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, x):
        # Initialize hidden state with zeros
        if torch.cuda.is_available():
            h0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim).cuda()
        else:
            h0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim)

        #Define the forward steps
            
        out, hn = self.rnn(x, h0.detach())
        out = self.fc(out[:, -1, :]) 
        
        return out
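
The tensor shapes flowing through this model can be checked without downloading MNIST. The following sketch uses the same hyperparameters as the notebook and a random batch standing in for 100 MNIST images (the random data is an assumption for illustration); each image is treated as a sequence of 28 rows with 28 pixels per row.

```python
import torch
import torch.nn as nn

# Same hyperparameters as in the notebook
input_dim, hidden_dim, layer_dim, output_dim = 28, 64, 1, 10
rnn = nn.RNN(input_dim, hidden_dim, layer_dim, batch_first=True, nonlinearity='tanh')
fc = nn.Linear(hidden_dim, output_dim)

images = torch.rand(100, 1, 28, 28)          # shape of one batch from the MNIST DataLoader
x = images.view(-1, 28, input_dim)           # (batch, seq_len = 28 rows, 28 pixels per row)
h0 = torch.zeros(layer_dim, x.size(0), hidden_dim)

out, hn = rnn(x, h0)
logits = fc(out[:, -1, :])                   # read out the last time step only

print(out.shape)     # torch.Size([100, 28, 64])
print(hn.shape)      # torch.Size([1, 100, 64])
print(logits.shape)  # torch.Size([100, 10])
```

For a single-layer, unidirectional RNN, `out[:, -1, :]` and `hn[0]` hold the same values, so either could feed the readout layer.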

a) Consider the above lines of code of the RNN model (try to answer briefly and precisely):

1. Are there any line(s) of code that prevent exploding or vanishing gradients? (1 point) <br> 


*Insert your answer here*

2. If yes, identify the line(s) and explain how they prevent exploding or vanishing gradients. <br>
3. If no, add line(s) of code that prevent them and explain how they work.<br>(Answer either 2 or 3) (3 points) <br>


*Insert your answer here*

4. Describe one other method to prevent exploding or vanishing gradients. (1 point)

*Insert your answer here*

b) Implement a bidirectional GRU model with a dropout of 0.1 in the following code. (4 points) <br>


In [76]:
# b) Implementation of Bi-GRU: complete TODOs
class GRUModel(nn.Module):
  def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
    super(GRUModel,self).__init__()    
    # TODO: Hidden dimensions
    self.hidden_dim =
    
    # TODO: Number of hidden layers
    self.layer_dim = 

    # TODO: Building your Bi-GRU
    self.gru = 
         
    #TODO: linear forward layer
    self.fc = 

  def forward(self,x):
      # TODO: Initialize hidden state with zeros (a GRU has no separate cell state)
      if torch.cuda.is_available():
          h0 = 
      else:
          h0 = 

      
      # TODO: Define the forward steps      
      out, hn = 
      out = 
    
      return out

c) Check the accuracies of the RNN and the GRU, and list two possible changes that could improve the classifier performance. (1 point)

In [77]:
model_rnn  = RNNModel(input_dim, hidden_dim, layer_dim, output_dim)
model_gru = GRUModel(input_dim, hidden_dim, layer_dim, output_dim)
#Move to GPU if available
if torch.cuda.is_available():
    model_rnn.cuda()
    model_gru.cuda()
    
#Instantiate the Loss
criterion = nn.CrossEntropyLoss()

#Instantiate the Optimizer
learning_rate = 0.1
optimizer_rnn = torch.optim.SGD(model_rnn.parameters(), lr=learning_rate)
optimizer_gru = torch.optim.SGD(model_gru.parameters(), lr=learning_rate) 


In [None]:
# RNN Training
# Number of steps to unroll
seq_dim = 28  

iter = 0
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # Load images as Variable
        if torch.cuda.is_available():
            images = Variable(images.view(-1, seq_dim, input_dim).cuda())
            labels = Variable(labels.cuda())
        else:
            images = Variable(images.view(-1, seq_dim, input_dim))
            labels = Variable(labels)
            
        # Clear gradients w.r.t. parameters
        optimizer_rnn.zero_grad()
        
        # Forward pass to get output/logits
        outputs = model_rnn(images)
        
        # Calculate Loss
        loss = criterion(outputs, labels)
        
        # Getting gradients w.r.t. parameters
        loss.backward()
        
        # Updating parameters
        optimizer_rnn.step()
        
        iter += 1
        
        if iter % 500 == 0:
            # Calculate Accuracy         
            correct = 0
            total = 0
            # Iterate through test dataset
            for images, labels in test_loader:
                if torch.cuda.is_available():
                    images = Variable(images.view(-1, seq_dim, input_dim).cuda())
                else:
                    images = Variable(images.view(-1, seq_dim, input_dim))
                
                # Forward pass only to get logits/output
                outputs = model_rnn(images)
                
                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)
                
                # Total number of labels
                total += labels.size(0)
                
                # Total correct predictions
                if torch.cuda.is_available():
                    correct += (predicted.cpu() == labels.cpu()).sum()
                else:
                    correct += (predicted == labels).sum()
            
            accuracy = 100 * correct / total
            
            # Print Loss
            print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.item(), accuracy))

In [None]:
# GRU Training

# Number of steps to unroll
seq_dim = 28  

iter = 0
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # Load images as Variable
        if torch.cuda.is_available():
            images = Variable(images.view(-1, seq_dim, input_dim).cuda())
            labels = Variable(labels.cuda())
        else:
            images = Variable(images.view(-1, seq_dim, input_dim))
            labels = Variable(labels)
            
        # Clear gradients w.r.t. parameters
        optimizer_gru.zero_grad()
        
        # Forward pass to get output/logits
        outputs = model_gru(images)
        
        # Calculate Loss
        loss = criterion(outputs, labels)
        
        # Getting gradients w.r.t. parameters
        loss.backward()
        
        # Updating parameters
        optimizer_gru.step()
        
        iter += 1
        
        if iter % 500 == 0:
            # Calculate Accuracy         
            correct = 0
            total = 0
            # Iterate through test dataset
            for images, labels in test_loader:
                if torch.cuda.is_available():
                    images = Variable(images.view(-1, seq_dim, input_dim).cuda())
                else:
                    images = Variable(images.view(-1, seq_dim, input_dim))
                
                # Forward pass only to get logits/output
                outputs = model_gru(images)
                
                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)
                
                # Total number of labels
                total += labels.size(0)
                
                # Total correct predictions
                if torch.cuda.is_available():
                    correct += (predicted.cpu() == labels.cpu()).sum()
                else:
                    correct += (predicted == labels).sum()
            
            accuracy = 100 * correct / total
            
            # Print Loss
            print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.item(), accuracy))

*Insert your answer here*

### Exercise 5: Convolutional Neural Networks (CNN)

The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class.

Download the dataset to your hard disk and extract the files. The data is provided as train-test split and the labels of the respective images are given by the name of the parent folder. Download the CIFAR10 dataset from  https://www.kaggle.com/datasets/swaroopkml/cifar10-pngs-in-folders?resource=download or from https://owncloud.csl.uni-bremen.de/s/9mNnmeA7esyEpnC. The corresponding paper of the CIFAR10 dataset: *"Learning Multiple Layers of Features from Tiny Images", Alex Krizhevsky, 2009*.

Take the last 20 % of the training images of each class and save them in a separate folder structure for validation, e.g.:  
*\user\data\cifar10\validate\airplane\4001.png  
\user\data\cifar10\validate\airplane\4002.png  
...  
\user\data\cifar10\validate\bird\4001.png  
...*  
This leads to the following data split for each class: 4,000/1,000/1,000 images for training/validation/testing respectively.


```Important:``` change the directory path in the following code cell. <br>

First, we will do our imports, load the data and define a custom dataset. **Read** the following code cell carefully, it is used to prepare the data for the exercise.

In [None]:
root_dir = ### Insert your path here!! We will change it back to our path for the corrections
tr_dir = os.path.join(root_dir, 'cifar10_split', 'train')
cv_dir = os.path.join(root_dir, 'cifar10_split', 'validate')  
tt_dir = os.path.join(root_dir, 'cifar10_split', 'test')

# Create datalist from directory for each subset
tr_file_list = []
cv_file_list = []
tt_file_list = []

# Load the training data
for path, subdirs, files in os.walk(tr_dir):
    for name in files:
        file_dir = os.path.join(path, name)
        label = os.path.basename(path)  # Parent folder name is the class label (works on any OS)
        tr_file_list.append(file_dir + ' ' + label)

with open(os.path.join(root_dir, 'tr_data_list.txt'), 'w') as file:
    for item in tr_file_list:
        file.write("%s\n" % item)        
        
# Load the validation data
for path, subdirs, files in os.walk(cv_dir):
    for name in files:
        file_dir = os.path.join(path, name)
        label = os.path.basename(path)  # Parent folder name is the class label (works on any OS)
        cv_file_list.append(file_dir + ' ' + label)

with open(os.path.join(root_dir, 'cv_data_list.txt'), 'w') as file:
    for item in cv_file_list:
        file.write("%s\n" % item)

# Load the test data
for path, subdirs, files in os.walk(tt_dir):
    for name in files:
        file_dir = os.path.join(path, name)
        label = os.path.basename(path)  # Parent folder name is the class label (works on any OS)
        tt_file_list.append(file_dir + ' ' + label)

with open(os.path.join(root_dir, 'tt_data_list.txt'), 'w') as file:
    for item in tt_file_list:
        file.write("%s\n" % item)

#### Define custom dataset
class cifar10Dataset(Dataset):
    def __init__(self, data_list_path):
        self.data_list_path = data_list_path
        self.data_save_dirs = []            # Stores the save dirs for the images
        self.labels = []
        self.label_encoder = LabelEncoder() # Encodes strings to integers for classification
        self._init_data()
        
    def __len__(self):                      # Get total number of dataset samples
        return len(self.data_save_dirs)

    def __getitem__(self, idx):             # Return an item
        data = io.imread(self.data_save_dirs[idx]) # Load image from hard disk
        lab = self.labels[idx]              # Load class label
        data = data.transpose(2, 0, 1).astype('float32') # Change HxWxC to channels first (CxHxW) and cast to float32
        data = data/255.0                   # Normalize input
        #print(data.shape)
        #plt.imshow(cv2.cvtColor(data.reshape(32, 32, 3), cv2.COLOR_BGR2RGB))
        return data, lab
        
    def _init_data(self):                   # Initialize data
        with open(self.data_list_path, 'r') as file:
            data_list = file.readlines()    # Read file list         
                
        for d in data_list:
            s, l = d.rsplit(' ', 1)         # Split into path and label at the last space (paths may contain spaces)
            self.data_save_dirs.append(s)
            self.labels.append(l.rstrip('\n'))
        
        # Encode string labels to integers
        self.label_encoder.fit(self.labels)
        self.labels = self.label_encoder.transform(self.labels)

#### a) Explain the Model  (6 points)
In the code cell below, a convolutional neural network is defined. **Answer the following questions concerning this model.** Each question relates to a specific line of code, which is marked in the code cell. Keep your answers short and precise.

In [None]:
class myNetwork(nn.Module):
    def __init__(self):
        super(myNetwork, self).__init__()
        self.conv1 = nn.Conv2d(3, 8, 5)                                    #1
        self.conv2 = nn.Conv2d(8, 16, 5)
        self.max1 = nn.MaxPool2d(kernel_size=2, stride=2)                  #2
        self.max2 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc1 = nn.Linear(400, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, ??)                                       #3
        
    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)                                                      #4
        x = self.max1(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = self.max2(x)                                                   #5
        x = x.view(-1, 400)                                                #6
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        x = F.relu(x)
        x = self.fc3(x)
        return x

*Insert your answer here*

i) For the first convolutional layer, what are the kernel size and the number of kernels? How many trainable weights does this layer have? (1 point)

*Insert your answer here*

ii) In your own words: What is the purpose of the max pooling operation? (1 point)

*Insert your answer here*

iii) We introduced the CIFAR-10 dataset before in the introduction of this exercise. What is the correct output size of the final linear layer and why? Add this number in the code cell by replacing ```??```. (1 point)

*Insert your answer here*

iv) What does this line of code do and why? (1 point)

*Insert your answer here*

v) Consider the output volume of the last convolutional layer. Before being passed to the dense layers, this output volume is processed by a max pooling layer. Given the shape of the output volume, what is its shape after the max pooling operation? (1 point)

*Insert your answer here*

vi) What does this line of code do and why? (1 point)

*Insert your answer here*
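The questions above rely on two standard formulas: the parameter count of a convolutional layer and the spatial output size of a convolution or pooling layer. The following is a minimal sketch of both, assuming square kernels and no dilation. The helper names and the example layer (1 input channel, 4 kernels of size 3x3 on a 28x28 input) are hypothetical and deliberately differ from `myNetwork`.

```python
def conv2d_param_count(in_channels, out_channels, kernel_size, bias=True):
    """Trainable parameters of a 2D convolution with a square kernel."""
    weights = in_channels * out_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)


def spatial_out_size(in_size, kernel_size, stride=1, padding=0):
    """Output height/width of a convolution or pooling layer (no dilation)."""
    return (in_size + 2 * padding - kernel_size) // stride + 1


# Hypothetical layer, not the one from the exercise
params = conv2d_param_count(1, 4, 3)                     # 1*4*3*3 weights + 4 biases
after_conv = spatial_out_size(28, 3)                     # 28x28 input, 3x3 kernel
after_pool = spatial_out_size(after_conv, 2, stride=2)   # 2x2 max pool, stride 2
print(params, after_conv, after_pool)
```

The same two helpers can be applied layer by layer to any stack of convolution and pooling layers to obtain the input size expected by the first dense layer.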

#### b) Model Training (2 points)

Give a step by step description of the training loop for one batch in one epoch. Explicitly state at which point the model is modified. Do not use code and explain the necessity of each step. (Bullet points)

*Insert your answer here*

#### c) Calculate Convolution (1 point)

Given the input A and the kernel K, calculate the two dimensional convolution result. Calculate the result only for fully overlapping positions (valid) between A and K (hence padding is not required). We define stride=1.

![](Matrix_A.png) &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ![](Kernel_K.png)
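A "valid" two-dimensional convolution with stride 1 can be sketched in a few lines of NumPy. This is an illustrative sketch only: `A_example` and `K_example` are hypothetical values, not the matrices from the figures. Note that deep learning frameworks usually compute a cross-correlation (kernel not flipped), whereas the mathematical convolution flips the kernel first; the `flip_kernel` argument is an assumption added here to cover both conventions.

```python
import numpy as np


def conv2d_valid(a, k, flip_kernel=False):
    """Slide kernel k over a at all fully overlapping positions (stride 1)."""
    a = np.asarray(a, dtype=float)
    k = np.asarray(k, dtype=float)
    if flip_kernel:
        k = k[::-1, ::-1]  # mathematical convolution flips the kernel
    out_h = a.shape[0] - k.shape[0] + 1
    out_w = a.shape[1] - k.shape[1] + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = a[i:i + k.shape[0], j:j + k.shape[1]]
            out[i, j] = np.sum(window * k)
    return out


# Hypothetical input and kernel
A_example = np.arange(16).reshape(4, 4)
K_example = np.array([[1, 0], [0, -1]])
result = conv2d_valid(A_example, K_example)
print(result)
```

For an input of size n x n and a kernel of size k x k, the result has size (n - k + 1) x (n - k + 1).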



*Insert your answer here*

####  d) Receptive Field (2 points)

i) Explain the concept of the receptive field of CNNs and why this is an important consideration when designing the model architecture. (1 point)

*Insert your answer here*

ii) Calculate the exact size of the receptive field of the myNetwork class defined above (ignoring fully connected layers). Explain the calculation and make clear what the result is. (1 point)

*Insert your answer here*
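The receptive field of a stack of convolution and pooling layers can be computed with a simple recurrence: each layer enlarges the receptive field by `(kernel_size - 1)` times the current "jump" (the distance in input pixels between neighbouring outputs), and the jump is multiplied by the layer's stride. The sketch below assumes no dilation; the function name and the example layer stack are hypothetical and are not the layers of `myNetwork`.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    """
    rf, jump = 1, 1
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump  # growth scaled by the current jump
        jump *= stride                  # strides compound through the stack
    return rf


# Hypothetical stack: two 3x3 convolutions (stride 1) and one 2x2 pooling (stride 2)
example_layers = [(3, 1), (3, 1), (2, 2)]
rf_example = receptive_field(example_layers)
print(rf_example)
```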

### Exercise 6: Generative Models (Benedikt)

#### a) Variational Autoencoders

**1. Explain the principle of Maximum Likelihood (ML) for training Variational Autoencoders (VAEs). Why can we not optimize VAEs with the exact likelihood, but use an approximation instead?**  (2 points)

The idea is to learn a latent space Z with a prior $p_\theta(z)$ and a conditional distribution $p_\theta(x|z)$ (the decoder), and to choose the parameters so that the likelihood of the training data, $p_\theta(x) = \int p_\theta(x|z)\,p_\theta(z)\,dz$, is maximized; samples from the trained model should then resemble samples from the original data distribution. The exact likelihood cannot be used because this integral over all latent codes, and with it the true posterior $p_\theta(z|x)$, is intractable. An approximate posterior $q_\phi(z|x)$ (the encoder) is learned instead, and a lower bound on the likelihood is optimized.

**2. Describe the Evidence Lower Bound (ELBO) for training variational autoencoders, what are the main components of the ELBO loss, and how are those components implemented in practice?** Note: You do not have to show the complete derivation of the loss for the ML approximation here for full points.  (2 points)

The two main components of the ELBO are:
$$E_{q_\phi(z|x)}[\log(p_\theta(x|z))]$$
which is the expected log-likelihood of the input under the decoder (the reconstruction term), and
$$-D_{KL}(q_\phi(z|x)||p_\theta(z))$$
which penalizes the divergence of the encoded distribution from the prior.

The first part is usually computed using a simple mean squared error loss, the second one ...
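As an illustration of how the two ELBO components are commonly turned into a loss, the following NumPy sketch computes the negative ELBO for a batch. It rests on assumptions that are not stated in the answer above: the encoder outputs the mean `mu` and log-variance `log_var` of a diagonal Gaussian, the prior is a standard normal (which gives the KL term in closed form), and the reconstruction term is approximated by a summed squared error. The function name is hypothetical.

```python
import numpy as np


def negative_elbo(x, x_recon, mu, log_var):
    """Negative ELBO per batch: reconstruction error plus KL divergence.

    Assumes q(z|x) = N(mu, diag(exp(log_var))) and p(z) = N(0, I).
    """
    # Reconstruction term: squared error summed over features
    recon = np.sum((x - x_recon) ** 2, axis=1)
    # Closed-form KL divergence between a diagonal Gaussian and N(0, I)
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1)
    return np.mean(recon + kl), np.mean(recon), np.mean(kl)


# Hypothetical batch with one sample and a two-dimensional latent space
x = np.array([[1.0, 2.0]])
loss_perfect, _, kl_zero = negative_elbo(x, x, np.zeros((1, 2)), np.zeros((1, 2)))
_, _, kl_shifted = negative_elbo(x, x, np.array([[1.0, 0.0]]), np.zeros((1, 2)))
print(loss_perfect, kl_zero, kl_shifted)
```

The KL term is zero exactly when the encoded distribution equals the prior and grows as the encoder mean moves away from zero.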

#### b) Generative Adversarial Networks

**1. Briefly explain the main principle of Generative Adversarial Networks (GANs).**  (1 point)

A trained GAN is essentially a DNN that maps a noise vector drawn from a known distribution, such as a Gaussian, to a fake data sample. This DNN is called the generator. During training, a second NN called the discriminator is added, which tries to distinguish the fake data from the real data. The training can be described as a minimax game whose solution is a Nash equilibrium.
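For reference, the minimax game is commonly written as the following value function (the standard formulation; the symbols $D$ for the discriminator, $G$ for the generator, $p_{data}$ for the data distribution and $p_z$ for the noise distribution are notation introduced here and not used elsewhere in this answer):

$$\min_G \max_D V(D, G) = E_{x \sim p_{data}(x)}[\log D(x)] + E_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

The discriminator maximizes $V$ by assigning high probability to real samples and low probability to generated ones, while the generator minimizes $V$ by producing samples that the discriminator accepts as real.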

**2. Mention and briefly explain at least 3 possible pitfalls that can occur during GAN training.**  (2 points)

A Nash equilibrium is only a state in which neither player (players = the generator and the discriminator, strategies = their weights) gains an advantage by changing its strategy. It is therefore not guaranteed that the equilibrium that is found, and with it the generator, solves the problem well. Furthermore, it is possible that no equilibrium is found at all, depending on the chosen strategies. 
Two phenomena indicate that a "bad" equilibrium has been found:
* Mode collapse - the produced fake samples differ only slightly from each other. 
* Discriminator dominance - the discriminator cannot be fooled, so there is almost no gradient for the generator to learn from. 

A third phenomenon that occurs during training is that the learning curves fluctuate strongly because the two models are trained competitively. This phenomenon is called training instability.

#### c) (Stable) Diffusion Models

**1. We discussed Diffusion models in the lecture. Briefly explain the key difference between Diffusion Models and VAEs.**  (1 point)

In VAEs, the mapping to the latent space has to be learned; in other words, the encoder is a NN. In a Diffusion model, the latent space is "reached" by repeatedly adding noise to a data sample (the latent variables are the noised samples $x_1, \dots, x_T$), so only the reverse, denoising direction has to be learned.
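The fixed, non-learned "encoder" of a diffusion model can be sketched as follows. This is a minimal example under assumptions that go beyond the answer above: Gaussian noise, a linear variance schedule `betas` as commonly used in DDPM-style models, and the closed-form expression that samples a noised version of `x0` directly at step `t`. The function name, the schedule values and the example batch are hypothetical.

```python
import numpy as np


def forward_diffuse(x0, t, betas, rng):
    """Sample a noised version of x0 at (0-based) step t in closed form."""
    alphas_bar = np.cumprod(1.0 - betas)       # cumulative signal retention
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise
    return x_t, noise


rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)          # assumed linear noise schedule
x0 = rng.standard_normal((4, 8))               # hypothetical batch of "data"
x_early, _ = forward_diffuse(x0, 10, betas, rng)
x_late, _ = forward_diffuse(x0, 999, betas, rng)
```

After few steps the sample is still close to the data (`x_early`), while after the last step almost no signal remains (`x_late`). No parameters are learned in this direction.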

**2) What are the main differences or improvements of Stable Diffusion models compared to pure Diffusion models?** (2 points)

Taking Stable Diffusion models to be the latent diffusion models described in the following paper (https://arxiv.org/pdf/2112.10752.pdf), the main improvement is that the diffusion process runs in a lower-dimensional latent space instead of in pixel space. This reduces the computational cost of training and increases the inference speed.

**3) According to the original Stable Diffusion paper, briefly explain how conditioning on various modalities like text or images is mapped to the intermediate layers of the UNet.**  (2 points)

Each modality has a so-called domain-specific encoder ($\tau_\theta$). It maps the input $y$, such as text, to an intermediate representation, which is then mapped through cross-attention layers to the intermediate layers of the UNet.

### Exercise 7: Transformers

Compare VisualBERT: A Simple and Performant Baseline for Vision and Language by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang (https://arxiv.org/pdf/1908.03557.pdf) and SAM: Segment Anything by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, Ross Girshick (https://arxiv.org/pdf/2304.02643.pdf). Try to keep your answers short and precise. 

a) Compare the basic architectures and the loss functions used in these models, with a short explanation of the purpose of each block in the architectures described in the papers. (2 points) <br> 

*Insert your answer here*

b) Can VisualBERT be used for the same application as SAM (even if only partially)? Explain how or why not. (3 points)

*Insert your answer here*

c) Mention at least three drawbacks of a transformer model w.r.t. VisualBERT. (3 points)

*Insert your answer here*

d) Mention any other model that can overcome the drawback(s) found in c) and explain how. The model may come from any other research paper or from your own ideas. (Provide a proper citation of the research paper.) (2 points)

*Insert your answer here*

### Exercise 8: Reinforcement Learning (Benedikt)

You develop a chat bot which uses a customizable large language model to generate its output. The chatbot is designed to give advice on travel arrangements. After every session, the users give a star rating of the overall interaction. In every step of a conversation, the history of the conversation (as text) as well as a vector (containing nominal values for specific aspects, such as language complexity, tone, chattiness, ...) customizing the speaking style of the chatbot are used. For every user, there exists a user profile which gets updated over the course of the interaction. For this purpose, a user profiler model receives features from the text input during the conversation and a video camera stream of the user and predicts a vector of user attributes such as gender, age, emotional state, level of attention, etc. The LLM and the user profiler are pre-trained and should be considered fixed. You now should include a Reinforcement Learning module which yields the customization vector for a given (history of) user attribute vectors. 


a) Draw a diagram which illustrates the planned system, with boxes denoting the relevant components and agents in the scenario and annotated arrows showing the data flow between the components/agents. (3 points) 

![](images/Ex8-diagramm.png)

b) Reflect on why or why not it makes sense to frame this task as a Reinforcement Learning problem. What alternative do you see? Compare the advantages and disadvantages! (3 points)

It makes sense to frame this as a Reinforcement Learning problem because the model can learn in the field. The quantity to be predicted is the customization vector for the pre-trained large language model. To learn this action (the customization vector), the reward given by the user is used to learn a policy for a given user characteristic (state = attribute vector). It would also be possible to fine-tune the large language model before launching the application. For this approach, a ground truth would have to be collected. 
Because the general functionality is already given (the user receives advice on travel arrangements) and only the user experience has to be improved, this architecture fits the use case. Another advantage is that no ground truth has to be collected, which is normally expensive. The main disadvantage is that the performance at the beginning of the RL training is worse than that of an already fine-tuned model.

c) Make suggestions for the reward function, the state space, and the action space of the Reinforcement Learning agent that learns to predict the customization vector. (3 points)

* State space: (gender, age, emotion of face, emotion of text, language, length of text, length of session, time, ...) 
* Action space: the parameters that define the language of the LLM (tone, length of sentences, ratio between questions and recommendations, chattiness) 
* Reward function: the reward function is divided into two parts. The first part rates the emotion of the face, the emotion of the text, the language, and the length of the text, as well as changes in these values over the conversation. The second part is the star rating that the user gives for the overall session. How strongly each of the two parts contributes to the total reward has to be tuned. 


d) Do you consider this a deterministic RL problem? Why (not)? (1 point)

In a deterministic model, the action and the state at time $t_0$ fully determine the next state at time $t_1$. Here, the state includes the emotions of a human user, which do not depend only on the chatbot's actions, so I understand this as a stochastic problem: an action taken in a state at $t_0$ leads to a given state at $t_1$ only with a certain probability. 

### Exercise 9: Explainable AI (Kieran)

#### a) Interpretable vs. Explainable AI (1 Point)

Please shortly explain the difference between Interpretable AI and Explainable AI and give an example method for each (maximum 2 sentences).

Interpretable AI refers to models whose output can be explained with purely theoretical statements of cause and effect, for example ridge regression. Explainable AI describes the process of measuring the effect that each of the parameters (and elements/sections of the input) has on the output of the model, for example with LIME (Local Interpretable Model-agnostic Explanations) or perturbation-based approaches such as masking certain areas of an input image to explain the activations of an image classification CNN.

#### b) Attention (3 Points)

The following figure was included in the *Attention Models* lecture. Please briefly describe what is shown there. Then comment on this specific example and attention models in general from the viewpoint of Explainable AI. You might want to consult literature on the discussion of attention as explanation (Maximum 8 sentences)

![](images/explainable_ai_task.png)

This image is an example of 'explainable AI' as it displays, in the situation of a Machine Translation task from English to French, how much 'attention' is given to each input word in considering the generation of each output word. The model is made up of a Bidirectional RNN encoder and a decoder which 'searches through a source sentence', taking into account the last output 'hidden state' and the entire encoded sentence (as shown below). 

<img src='./images/attention_architecture.png' width='20%'>

The encoder RNN maps an English sentence input (of variable length), to an encoded representation (of fixed length), which is then decoded to a French sentence (of variable length); how the encoder and decoder are trained simultaneously is described in text-src(2). We can understand the image provided above as a grey-scale (0=black, 1=white) representation of the matrix $A = (\alpha_{ij})$, where each $\alpha_{ij}$ represents the "probability that the target word $y_i$ is aligned to, or translated from, a source word $x_j$" (text-src(1)). Defined as the 'weight' of each annotation state $h_j$, $\alpha$ is computed by:

$$
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum^{T_x}_{k=1}\exp(e_{ik})}
$$

where

$$
e_{ij} = a(s_{i-1}, h_j)
$$

and $a$ is considered to be the 'alignment model' of our system which is jointly trained as a feed forward neural network along with the encoder/decoder components (text-src(1)). The training gradient of this 'alignment model' can even be used in the back propagation of the whole translation model at each step. The image provided shows how the Neural Machine Translation model has learnt the reordering of particular phrases "European Economic Area" -> "zone économique européenne" along with the need to search for a noun in order to determine the choice of "le", "la", "les", or "l'". This so-called 'soft-alignment' that is possible in such an encoder/decoder network (in comparison to word-for-word translation) is explained by the jointly trained alignment model $a$, which can then be clearly represented as above.


text-src(1):  https://arxiv.org/pdf/1409.0473.pdf <br>
text-src(2): https://arxiv.org/pdf/1406.1078.pdf <br>
img-src: https://arxiv.org/pdf/1409.0473.pdf <br>
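The attention weights $\alpha_{ij}$ defined above are a row-wise softmax over the alignment scores $e_{ij}$, and the matrix they form is what the heat map visualizes. The NumPy sketch below illustrates this under stated assumptions: the scores and annotations are random placeholders rather than outputs of a trained alignment model, the function names are hypothetical, and the context vector $c_i = \sum_j \alpha_{ij} h_j$ is taken from text-src(1) rather than from the answer above.

```python
import numpy as np


def attention_weights(scores):
    """Row-wise softmax: scores[i, j] = e_ij  ->  alpha[i, j]."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)


def context_vectors(alpha, annotations):
    """Weighted sum of annotations h_j for every target position i."""
    return alpha @ annotations


rng = np.random.default_rng(0)
e = rng.standard_normal((3, 5))   # hypothetical: 3 target words, 5 source words
h = rng.standard_normal((5, 4))   # hypothetical annotations h_j of size 4
alpha = attention_weights(e)
c = context_vectors(alpha, h)
```

Each row of `alpha` sums to one, so a row can be read as a distribution over source words for one target word.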

#### c) Layerwise Relevance Propagation (5 Points)

For this task, you can use the LRP live demo (https://lrpserver.hhi.fraunhofer.de/image-classification) which uses a model trained on (a subset of) the ImageNet dataset (https://github.com/BVLC/caffe/tree/master/models/bvlc_reference_caffenet). You can find the list of classes here: https://image-net.org/challenges/LSVRC/2012/browse-synsets

* Find or create an image that is strikingly misclassified by the model. You can either choose a pre-existing image or manipulate it (e.g. change the context, add another object).
* Use the relevance assignment from LRP to explain which class label is assigned by the model to that image (choose the image such that the assigned class is justifiable to a human observer).
* Explain your choice of LRP propagation rule and parameters $\beta$ and $\epsilon$. Add screenshots of different configurations and compare them to your chosen configuration. **Hint:** The *LRP Alpha-Beta* rule is roughly comparable to the *LRP-$\gamma$* rule that was introduced in the lecture. For the exact formula please refer to page 21 (equation 60) of the original paper: [On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation (2015)](https://doi.org/10.1371/journal.pone.0130140). 

**As your solution, report the input image, the chosen LRP parameters, as well as a screenshot of the resulting heatmaps and an accompanying explanation.** <br>



<figure width='100%' style="display:flex; align-items: center; justify-content: center">
  <img src='./images/balloon_chain_heatmap_075.jpg' width='100%'>
</figure>

The <i>BVLC Reference CaffeNet</i> model misclassifies this image of a balloon, manipulated through the addition of a chain in the bottom left corner, as a syringe (with confidence between 11.5 and 12%). We are able to justify this class assignment with the help of the <i>Black Fire-Red</i> heat map (Blue $\sim$ Relevance $<< 0$, Red/Yellow $\sim$ Relevance $>> 0$) provided by the LRP live demo, using the <i>LRP Alpha-Beta</i> Relevance Propagation Formula with parameter $\beta = 0.75$. Let us first observe classification and heatmap of the original, unchanged image of the balloon:



<figure width='100%' style="display:flex; align-items: center; justify-content: center">
  <img src='./images/balloon_heatmap_075.jpg' width='100%'>
</figure>

We can see in this balloon heat map that the complete, oval shape of the balloon and the knot at the bottom of the balloon are significant, bright red regions of pixels that contribute to the classification. We, as human observers, can interpret this observation as the described attributes being the most 'recognizable' or 'important' aspects of a balloon. The string coming from the bottom of the balloon also displays an interesting combination of blue and red mapping - possibly a result of the ImageNet Balloon class being trained on Hot Air Balloons, not the type of balloon that is displayed in this image. In contrast, the bottom half of the balloon and the knot do not appear to be as relevant in the misclassification case - leading to the class 'balloon' not appearing in the top 10 most probable classes. The question then is, why the first image is being classified as a 'syringe'? It is clear that, in spite of the 'chain' class also existing in the ImageNet dataset, only the chain's top end features prominently in the heat map and the whole object is therefore not so significant in the classification. Otherwise, we can only make educated guesses based on the information provided to us by the LRP demo. It is possible, as the top of the balloon is still considered relevant, that perhaps the model is associating this shape with a human shoulder or similar body part, into which the fluid from a syringe could be injected. Another piece of evidence that could back up this interpretation is that the end of the chain, which rests in the middle of the balloon, is also considered very relevant - perhaps it is being interpreted as the end of a syringe, resting on the skin of a human limb? It is obvious that the fact that the chain breaks the complete oval shape of the balloon is very significant in this misclassification. <br>

Let us now discuss the choice of our parameter $\beta = 0.75$ through a brief analysis of <i>LRP Alpha-Beta</i> Relevance Propagation and Pixel Explanation through Layerwise Relevance Propagation in general. We denote each neuron in our neural network as $x_i$, with the weight between neurons $x_i$ and $x_j$ being defined as $w_{ij}$, as shown in the image below. In the forward pass, let us define $z_{ij} = x_iw_{ij}$ and $z_j = \sum_{i} z_{ij} + b_j$ with $b_j$ being a bias term. Finally, our $x_j$ in the next layer is calculated by passing $z_j$ through an activation function $x_j = g(z_j)$, for example tanh or ReLU. We are then able to use these values to compute the 'relevance' $R_j$ of a neuron $x_j$ with back-propagation; with these relevances being calculated as a function of upper-layer relevances and $\beta$: 
$$R^{(l,l+1)}_{j \leftarrow k} = R^{(l+1)}_k \cdot ((1-\beta) \cdot \frac{z^+_{jk}}{z^+_k} + \beta \cdot \frac{z^-_{jk}}{z^-_k}).$$

Here, the positive and negative parts of $z_{jk}$ and $z_{k}$ are denoted by '+' and '-'. Now $R^{(l)}_j= \sum_k R^{(l,l+1)}_{j \leftarrow k}$, with the superscript $(l)$ denoting the layer of the neurons, as shown below.

<figure width='100%' style="display:flex; align-items: center; justify-content: center">
  <img src='./images/pixel_explanation_through_layerwise_relevance_propagation.jpg' width='50%'>
</figure>
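The propagation rule above can be sketched for a single dense layer as follows. This is only a minimal illustration under stated assumptions, not the implementation used by the LRP demo: the function name `lrp_alpha_beta_dense` and the toy numbers are hypothetical, and the bias term is ignored so that the relevance of each output neuron is fully redistributed onto its inputs.

```python
import numpy as np

def lrp_alpha_beta_dense(x, w, relevance_out, beta):
    """Redistribute the relevances R_k of one dense layer onto its inputs x_j.

    Assumption: the bias is ignored. Relevance is then conserved whenever an
    output neuron receives both positive and negative contributions.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    z = x[:, None] * w                      # z_jk = x_j * w_jk, shape (n_in, n_out)
    z_pos = np.clip(z, 0.0, None)           # positive parts z_jk^+
    z_neg = np.clip(z, None, 0.0)           # negative parts z_jk^-

    def safe_ratio(part):
        total = part.sum(axis=0)            # z_k^+ or z_k^-
        ratio = np.zeros_like(part)
        # Leave the ratio at 0 where an output has no contribution of this sign
        np.divide(part, total, out=ratio, where=total != 0)
        return ratio

    # R_{j<-k} = R_k * ((1 - beta) * z_jk^+ / z_k^+ + beta * z_jk^- / z_k^-)
    messages = relevance_out * ((1.0 - beta) * safe_ratio(z_pos) + beta * safe_ratio(z_neg))
    return messages.sum(axis=1)             # R_j = sum_k R_{j<-k}

# Toy example (made-up numbers): 3 input neurons, 2 output neurons
x = np.array([1.0, 2.0, 0.5])
w = np.array([[1.0, -1.0], [-0.5, 1.0], [2.0, -2.0]])
relevance_out = np.array([1.0, 2.0])
relevance_in = lrp_alpha_beta_dense(x, w, relevance_out, beta=0.25)
print(relevance_in, relevance_in.sum())
```

In this toy case the input relevances sum to the output relevances, and raising `beta` shifts relevance towards the inputs with negative contributions.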

So, we may understand adjusting the parameter $\beta$ as controlling how much significance the positive and negative contributions have: as $\beta$ grows, the importance of the negative parts grows. Testing this out on the LRP live demo, we found that $\beta=0.75$ best brought out the attributes and interesting features described above. 

<figure width='100%' style="display:flex; align-items: center; justify-content: space-between;">
  <img src='./images/balloon_chain_heatmap_0.jpg' width='20%'>
  <img src='./images/balloon_chain_heatmap_025.jpg' width='20%'>
  <img src='./images/balloon_chain_heatmap_05.jpg' width='20%'>
  <img src='./images/balloon_chain_heatmap_1.jpg' width='20%'>
</figure>

$\beta$ Values from left to right: $\beta = 0, \beta = 0.25, \beta=0.5, \beta=1$. <br><br>

text-src: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0130140 <br>
img-src: https://journals.plos.org/plosone/article/figure?id=10.1371/journal.pone.0130140.g005 <br>

### Exercise 10: Sustainable AI (Benedikt)

a) Give a concrete example in which two SDGs come into conflict with each other during the development of an image-generating Diffusion Model (such as DALL-E 2, Stable Diffusion, etc.). Give references to real-world sources (not necessarily academic papers, but also news articles, blog posts, etc.) which document the claimed (positive or negative) impact with regard to the respective SDGs. Reflect on the potential trade-offs. (2 points)

"We conducted an internal audit of our filtering of sexual content to see if it concentrated or exacerbated any particular biases in the training data. We found that our initial approach to filtering of sexual content reduced the quantity of generated images of women in general, and we made adjustments to our filtering approach as a result." (https://github.com/openai/dalle-2-preview/blob/main/system-card.md) This quote is taken from the "Model training data paragraph". 

As the quote shows, the SDGs _gender equality_ and _good health and well-being_ (some sexual content could undermine the latter) came into conflict with each other during the training of _DALL-E 2_. Because _DALL-E 2_ should not produce sexual content, the training data was filtered, and the filtered dataset ended up biased so that fewer pictures of women were generated. This phenomenon could be described as a "well-intentioned bias"; it shows that filtering data to optimize one objective can give rise to other risks, such as discrimination. In this specific case there is no real trade-off, because once the under-representation of women is recognized, the dataset can be expanded. 

I have not found an article that critically examines how the dataset for an image-generating diffusion model is created. However, some months ago I read an article about the data that ChatGPT was trained on. The article criticized that disturbing content was labeled by humans living in economically weak countries, and that this labeling work negatively affected the mental health of the employees. As _DALL-E 2_ and _ChatGPT_ are both developed by _OpenAI_, the same approach was probably chosen. In this case the SDG _good health and well-being_ is affected. It is hard to find a sustainable solution to this problem, because doing no data selection at all has the same effect: the SDG _good health and well-being_ is violated. 


b) You are currently preparing the training for a Convolutional Neural Network for image classification. You decide to not only focus on pure accuracy but that you want to be able to find a trade-off between model accuracy and the required energy for training. Describe a strategy to introduce this trade-off. Consider the concrete technical changes to the model, data, or training process which would be necessary. Furthermore, describe how your approach allows you to choose the trade-off (e.g., focusing more or less on energy consumption vs. accuracy) for different use cases. (2 points)


To reduce the energy consumption of a CNN for image classification the following steps are proposed:
1. Use an already trained CNN such as ResNet-50 or EfficientNetB0 (whichever suits your problem best) and, if the task you train on is of interest to others, make your project open source. 
2. Define the accuracy that is sufficient to perform the task, and stop training once it is reached. 
3. Do not let the trainable weights explode. 
4. Train your model with "difficult" data. 

In order to implement the above strategy in practice:
1. This is already best practice and is called transfer learning.  
2. This is easy to implement in the training loop: check the accuracy and break once the required accuracy is reached (just another hyperparameter).
3. The loss function is the place to intervene when you want a specific behavior from your model, so one option may be to penalize the size of the weights in combination with a strategic dropout. The implementation of the strategic dropout is the key to this, and I do not know whether it works. (Its influence would be controllable with a hyperparameter, as in regularization.)
4. This requires specific logistics, for example in combination with n-fold cross-validation: at the end of the first training cycle you could, if you wanted, keep only the misclassified data in your training and test sets. Probably the best way is to keep a certain ratio between correctly and incorrectly classified data. This ratio is a hyperparameter between 0 and 1 (0 = no data is pruned, 1 = all correctly classified data is pruned).

Steps 2 to 4 each provide a hyperparameter for scaling, so for different use cases you simply tune them. 
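Steps 2 and 4 above could be sketched roughly as follows. This is a minimal sketch under assumptions, not a tested training pipeline: the helper names `train_until_target` and `select_training_indices`, the callback `run_epoch`, and the toy numbers are all hypothetical and only illustrate where the two hyperparameters (target accuracy and prune ratio) enter.

```python
import numpy as np

def train_until_target(run_epoch, target_accuracy, max_epochs):
    """Step 2: stop as soon as the validation accuracy reaches the target.

    `run_epoch(epoch)` is assumed to train for one epoch and return the
    validation accuracy as a float.
    """
    accuracy = float("nan")
    for epoch in range(1, max_epochs + 1):
        accuracy = run_epoch(epoch)
        if accuracy >= target_accuracy:
            return epoch, accuracy          # energy for the remaining epochs is saved
    return max_epochs, accuracy

def select_training_indices(correct_mask, prune_ratio, seed=0):
    """Step 4: drop a fraction `prune_ratio` of the correctly classified samples.

    0 = no data is pruned, 1 = all correctly classified data is pruned.
    Misclassified ("difficult") samples are always kept.
    """
    correct_mask = np.asarray(correct_mask, dtype=bool)
    rng = np.random.default_rng(seed)
    correct_idx = np.flatnonzero(correct_mask)
    wrong_idx = np.flatnonzero(~correct_mask)
    n_keep = int(round((1.0 - prune_ratio) * len(correct_idx)))
    kept_correct = rng.choice(correct_idx, size=n_keep, replace=False)
    return np.sort(np.concatenate([wrong_idx, kept_correct]))

# Toy usage with made-up accuracies and a made-up correctness mask
fake_accuracies = [0.60, 0.70, 0.85, 0.90]
stopped_at, reached = train_until_target(lambda e: fake_accuracies[e - 1], 0.80, max_epochs=4)
kept = select_training_indices([True] * 8 + [False] * 2, prune_ratio=0.5)
print(stopped_at, reached, kept)
```

A use case that prioritizes energy would lower `target_accuracy` and raise `prune_ratio`; a use case that prioritizes accuracy would do the opposite.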



c) Pick any readily available age classifier. Furthermore, pick a dataset which has annotations for distribution of races within the data set, for example https://github.com/joojs/fairface (labels: https://github.com/dchen236/FairFace). In an exemplary way, analyze whether you observe a racial bias in the performance of the age classifier. Show a plot or table to demonstrate your findings and document your approach in a reproducible way. Your results do not have to be conclusive or representative but show an understanding of how to tackle this research question. (6 points)


In [1]:
import os
import pandas as pd
import torch as th
from torchvision.io import read_image
from torch.utils.data import Dataset
from transformers import ViTFeatureExtractor, ViTForImageClassification
from torch.utils.data import DataLoader
from tqdm import tqdm
from collections import Counter


from root_path import ROOT_PATH


class FairFaceImageDataset(Dataset):
    def __init__(self, annotations_file, img_dir, transform=None, target_transform=None):
        self.df_fairface = pd.read_csv(annotations_file)
        self.img_dir = img_dir
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.df_fairface['age'])

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.df_fairface['file'][idx])
        image = read_image(img_path)
        label = self.df_fairface['age'][idx]
        if self.transform:
            image = self.transform(image)
        if self.target_transform:
            label = self.target_transform(label)
        return image, label
    
    def get_df(self):
        return self.df_fairface


In [2]:
# https://huggingface.co/nateraw/vit-age-classifier

class AgePredictionByVit():

    __id2label = {
    0: "0-2",
    1: "3-9",
    2: "10-19",
    3: "20-29",
    4: "30-39",
    5: "40-49",
    6: "50-59",
    7: "60-69",
    8: "more than 70",
    }   

    def __init__(self, img_dir, annotations_file) -> None:
        self.data = FairFaceImageDataset(annotations_file=annotations_file, img_dir=img_dir)
        self.loader = DataLoader(self.data, batch_size=1, shuffle=False)
        self.model = ViTForImageClassification.from_pretrained('nateraw/vit-age-classifier')
        self.transforms = ViTFeatureExtractor.from_pretrained('nateraw/vit-age-classifier')
        self.df_fairface = self.data.get_df()
        self.prediction = None
        self.labels = None
        self.df_misclassified = None
        self.tolerance = 0.02  # currently unused
        self.biased_dict = {}

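    def _estimate_dataset(self):
        """Run the classifier over the whole dataset and return per-image class probabilities."""
        outputs = []

        self.model.eval()
        with th.no_grad():                  # inference only, no gradients needed
            for data in tqdm(self.loader):
                inputs, self.labels = data  # note: only the labels of the last batch are kept
                # Preprocess the raw image tensor the way the ViT model expects it
                inputs = self.transforms(inputs, return_tensors='pt')
                outputs.append(self.model(**inputs))
        # Convert the logits to probabilities over the age classes
        return [output.logits.softmax(1) for output in outputs]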
    
     
    def predict(self, qualify=True):
        probas = self._estimate_dataset()
        self.prediction = [proba.argmax(1).item() for proba in probas]
        if qualify:
            self._qualify()
            

    def _qualify(self):
        """Collect all rows whose predicted age bin differs from the FairFace annotation."""
        misclassified = []
        for i, prediction in enumerate(self.prediction):
            if self.__id2label[prediction] != self.df_fairface['age'][i]:
                misclassified.append(self.df_fairface.iloc[i])
        self.df_misclassified = pd.DataFrame(misclassified)


    def biased(self, feature='race'):
        # A positive deviation means the group is over-represented among the
        # misclassified images relative to its share of the whole dataset.

        # Relative frequency of each group in the whole dataset
        feature_counter = Counter(self.df_fairface[feature])
        rel_feature = pd.DataFrame.from_dict(feature_counter, orient='index', columns=['Rel']) / sum(feature_counter.values())
        self.biased_dict[f"rel_{feature}"] = rel_feature

        # Relative frequency of each group among the misclassified images
        rel_mis_counter = pd.DataFrame(0, index=rel_feature.index, columns=rel_feature.columns)
        mis_counter = Counter(self.df_misclassified[feature])
        for index_value in rel_mis_counter.index:
            # A Counter returns 0 for groups without any misclassified image
            rel_mis_counter.at[index_value, 'Rel'] += mis_counter[index_value]

        rel_mis_counter = rel_mis_counter.div(sum(mis_counter.values()))
        self.biased_dict[f"rel_mis_{feature}"] = rel_mis_counter

        # Difference between the two shares, per group
        self.biased_dict[f"deviation_{feature}"] = rel_mis_counter - rel_feature
        print(self.biased_dict)

    def biased_on_age(self, age, feature):
        # Not implemented: check whether a single age bin is biased; same procedure
        # as in biased(), but restricted to the rows of one age_id
        pass

    def biased_on_age_feature(self, age, feature, specific):
        # Not implemented: check whether a single age bin is biased for one specific
        # group, e.g. [race][Middle Eastern]; same procedure as in biased(), but
        # restricted to the rows of one age_id and one feature value
        pass


In [3]:
img_dir = ROOT_PATH / "images" / "ex10" / "test" 
annotations_file = ROOT_PATH / "images" / "ex10" / "test" / "fairface_label_test.csv"

age_prediction = AgePredictionByVit(img_dir=img_dir, annotations_file=annotations_file)
age_prediction.predict()
age_prediction.biased(feature='race')

100%|██████████| 99/99 [00:29<00:00,  3.39it/s]

{'rel_race':                       Rel
East Asian       0.121212
Indian           0.181818
Black            0.151515
White            0.181818
Middle Eastern   0.080808
Latino_Hispanic  0.161616
Southeast Asian  0.121212, 'rel_mis_race':                       Rel
East Asian       0.166667
Indian           0.166667
Black            0.100000
White            0.133333
Middle Eastern   0.133333
Latino_Hispanic  0.166667
Southeast Asian  0.133333, 'deviation_race':                       Rel
East Asian       0.045455
Indian          -0.015152
Black           -0.051515
White           -0.048485
Middle Eastern   0.052525
Latino_Hispanic  0.005051
Southeast Asian  0.012121}
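The deviations printed above compare each group's share among the misclassified images with its share of the dataset. A complementary and more direct view is the error rate per group, which can also serve as the requested table. The following is only a sketch under assumptions: the helper `error_rate_by_group`, the column name `age_pred`, and the toy data frame are hypothetical and are not results from FairFace.

```python
import pandas as pd

def error_rate_by_group(df, group_col="race", label_col="age", pred_col="age_pred"):
    """Return sample count, error count and error rate for each group."""
    is_error = df[label_col] != df[pred_col]
    table = (
        df.assign(is_error=is_error)
          .groupby(group_col)["is_error"]
          .agg(n="count", n_errors="sum", error_rate="mean")
    )
    return table.sort_values("error_rate", ascending=False)

# Made-up toy data, only to show the shape of the resulting table
toy = pd.DataFrame({
    "race": ["group_A", "group_A", "group_A", "group_A", "group_B", "group_B"],
    "age": ["20-29", "30-39", "20-29", "3-9", "20-29", "30-39"],
    "age_pred": ["20-29", "20-29", "20-29", "3-9", "30-39", "20-29"],
})
error_table = error_rate_by_group(toy)
print(error_table)
```

With only 99 test images, as in the run above, each group contains few samples, so any differences in such a table should be read as indicative rather than conclusive.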





### Exercise 11: Dimensionality Reduction (Kieran)
In exercise sheet 5, one of the tasks was to visualize the 2D latent space of a Variational Autoencoder trained on the MNIST digits dataset. It was intentionally limited to 2D to make visualization easier. One of the findings of that analysis was that the size of the latent space might be too small for the VAE to properly separate the different digits. In real-life applications one might even use a latent dimension in the hundreds. Therefore, we will now work with an example closer to reality by using a bigger latent space. Once the dimensionality of the latent space is larger than 3, it becomes difficult to visualize. Dimensionality reduction will help us. 

In [3]:
import random
from typing import Tuple, Dict, List
import torchvision.utils as vutils

import numpy as np
import torch
from torch import Tensor
from pathlib import Path
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import sys

RANDOM_SEED = 42

# Set random seeds
random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

In [4]:
# Load vae.py - must be placed in the same directory as this notebook.
import sys  
sys.path.insert(0, '.')
from vae import VAE, MNISTDecoder, MNISTEncoder

MODEL_PATH = Path("./vae-model-32.pth")
LATENT_DIM = 32

def load_vae(model_path: Path = MODEL_PATH, latent_dim: int  = 32) -> VAE:
    """
    Initializes the variational auto-encoder and loads the state dict saved under model_path.
    """
    if not model_path.exists():
        sys.exit(f"Found no model file under: {model_path.absolute()}.\n"
                 f"Please download the vae_model.pth from StudIP and place it in the directory of this notebook.")
    else:
        print(f"Loading VAE from path: {model_path.absolute()}")
    # Use the function argument rather than the global constant
    mnist_encoder = MNISTEncoder(latent_dim=latent_dim)
    mnist_decoder = MNISTDecoder(latent_dim=latent_dim)
    vae = VAE(mnist_encoder, mnist_decoder, latent_dim)
    vae_state_dict = torch.load(model_path, map_location=torch.device('cpu'))
    vae.load_state_dict(vae_state_dict)
    return vae

vae_model = load_vae(MODEL_PATH)
vae_model.eval()
print(vae_model)

Loading VAE from path: /home/kieran/Documents/Uni/SoSe23/Advanced Machine Learning/aml-sose23/Portfolio 2023/vae-model-32.pth
VAE(
  (encoder): MNISTEncoder(
    (hidden_layers): Sequential(
      (0): Linear(in_features=784, out_features=512, bias=True)
      (1): ReLU()
      (2): Linear(in_features=512, out_features=256, bias=True)
      (3): ReLU()
    )
    (out_mu): Linear(in_features=256, out_features=32, bias=True)
    (out_var): Linear(in_features=256, out_features=32, bias=True)
  )
  (decoder): MNISTDecoder(
    (hidden_layers): Sequential(
      (0): Linear(in_features=32, out_features=256, bias=True)
      (1): ReLU()
      (2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): Linear(in_features=256, out_features=512, bias=True)
      (4): ReLU()
      (5): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (out_layer): Linear(in_features=512, out_features=784, bias=True)
    (out_act_fn): T

In [5]:
import utils
import seaborn as sns
import pandas as pd 

DATA_PATH = Path(".") / "data"
print("Loading MNIST data...")

_, mnist_dev_set = utils.get_mnist_train_dev_loaders(DATA_PATH, batch_size=64, flatten_img=True)

print("...MNIST data loaded")

Loading MNIST data...
...MNIST data loaded


a) Implement the missing sections from the following code cell. (5 points)

In [None]:
# This cell creates the outputs ./output/ex_11_latent.png and ./output/ex_11_data.png
# Takes around 2 Minutes to run

def plot_latent_vector_scatter_plot(
        vae: VAE,
        mnist_dev_loader: DataLoader,
        num_samples_per_class: int = 1000,
        num_classes: int = 10,
        show: bool = True
) -> List[Tensor]:
    # We create a dictionary of 1000 samples per MNIST class.
    # The dict has the form Dict[int, List[Tensor]],
    # where the key is the MNIST digit and the value is a list of flattened MNIST tensors
    label_to_dev_samples = utils.get_label_to_dev_samples(mnist_dev_loader, num_samples_per_class, num_classes)
    # Use the VAE to encode the MNIST digits to sampled latent vectors
    # The latent vectors are saved in this list
    all_dev_sample_latent_vecs = []
    all_dev_sample_vecs = []
    # Also store the labels in the same order for visualization purposes
    all_labels = []
    # HINT: Feel free to use functions from utils.py and vae.py
    ## YOUR CODE HERE START
    for i in label_to_dev_samples.keys():
        for elem in label_to_dev_samples[i]:
            all_dev_sample_vecs.append(elem)
            all_dev_sample_latent_vecs.append(vae.encode_data_to_sampled_latent_vec(elem))
            all_labels.append(i)
    ## YOUR CODE HERE END
    
    # Use t-sne to reduce the latent embedding to two dimensions
    ## YOUR CODE HERE START
    tsne_latent = TSNE(n_components=2, learning_rate='auto',init='random', perplexity=50).fit_transform(pd.DataFrame(all_dev_sample_latent_vecs))
    tsne_data = TSNE(n_components=2, learning_rate='auto',init='random', perplexity=50).fit_transform(pd.DataFrame(all_dev_sample_vecs))
    ## YOUR CODE HERE END
    
    ## Create a Scatter plot of the latent vectors
    ## YOUR CODE HERE (START)
    colors = ["#FFB300", # Vivid Yellow
    "#803E75", # Strong Purple
    "#FF6800", # Vivid Orange
    "#A6BDD7", # Very Light Blue
    "#C10020", # Vivid Red
    "#CEA262", # Grayish Yellow
    "#817066", # Medium Gray
    "#007D34", # Vivid Green
    "#F6768E", # Strong Purplish Pink
    "#00538A", # Strong Blue
    ]
    latent_fig, latent_ax = plt.subplots(1,1,figsize=(20, 20))
    data_fig, data_ax = plt.subplots(1,1,figsize=(20, 20))
    tsne_latent_result = (pd.DataFrame({'tsne_1': tsne_latent[:,0], 'tsne_2': tsne_latent[:,1], 'number': all_labels}))
    tsne_data_result = (pd.DataFrame({'tsne_1': tsne_data[:,0], 'tsne_2': tsne_data[:,1], 'number': all_labels}))
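    sns.scatterplot(x='tsne_1', y='tsne_2', hue='number', style='number', palette=colors, data=tsne_data_result, ax=data_ax, s=25).set(title='Validation Data')
    sns.scatterplot(x='tsne_1', y='tsne_2', hue='number', style='number', palette=colors, data=tsne_latent_result, ax=latent_ax, s=25).set(title='Latent Data')
    # Save both figures so that the answer cells below can embed them
    Path("./output").mkdir(exist_ok=True)
    data_fig.savefig("./output/ex_11_data.png")
    latent_fig.savefig("./output/ex_11_latent.png")
    if show:
        plt.show()
    ## YOUR CODE HERE END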

    return all_dev_sample_latent_vecs

print("This code can take approx 2 minutes to run")
latent_code_dev_samples = plot_latent_vector_scatter_plot(vae_model, mnist_dev_set, show=True)

In [None]:
# This cell creates the outputs ./output/ex_11_perplexities.png
# Takes around 10 Minutes to run

def plot_latent_vector_scatter_plot(
        vae: VAE,
        mnist_dev_loader: DataLoader,
        num_samples_per_class: int = 1000,
        num_classes: int = 10,
        show: bool = True
) -> List[Tensor]:
    # We create a dictionary of 1000 samples per MNIST class.
    # The dict has the form Dict[int, List[Tensor]],
    # where the key is the MNIST digit and the value is a list of flattened MNIST tensors
    label_to_dev_samples = utils.get_label_to_dev_samples(mnist_dev_loader, num_samples_per_class, num_classes)
    # Use the VAE to encode the MNIST digits to sampled latent vectors
    # The latent vectors are saved in this list
    all_dev_sample_latent_vecs = []
    # Also store the labels in the same order for visualization purposes
    all_labels = []
    # HINT: Feel free to use functions from utils.py and vae.py
    ## YOUR CODE HERE START
    for i in label_to_dev_samples.keys():
        for elem in label_to_dev_samples[i]:
            all_dev_sample_latent_vecs.append(vae.encode_data_to_sampled_latent_vec(elem))
            all_labels.append(i)
    ## YOUR CODE HERE END
    
    # Use t-sne to reduce the latent embedding to two dimensions
    ## YOUR CODE HERE START
    tsne_data = []
    perplexity_values = [2, 5, 10, 20, 50, 100, 500, 1000]
    latent_df = pd.DataFrame(all_dev_sample_latent_vecs)
    # One t-SNE embedding per perplexity value
    for perplexity in perplexity_values:
        tsne_data.append(TSNE(n_components=2, learning_rate='auto', init='random', perplexity=perplexity).fit_transform(latent_df))
    ## YOUR CODE HERE END
    
    ## Create a Scatter plot of the latent vectors
    ## YOUR CODE HERE (START)
    colors = ["#FFB300", # Vivid Yellow
    "#803E75", # Strong Purple
    "#FF6800", # Vivid Orange
    "#A6BDD7", # Very Light Blue
    "#C10020", # Vivid Red
    "#CEA262", # Grayish Yellow
    "#817066", # Medium Gray
    "#007D34", # Vivid Green
    "#F6768E", # Strong Purplish Pink
    "#00538A", # Strong Blue
    ]
    fig, axs = plt.subplots(len(tsne_data), 1, figsize=(20, 160))
    for i in range(len(tsne_data)):
        tsne_result = pd.DataFrame({'tsne_1': tsne_data[i][:, 0], 'tsne_2': tsne_data[i][:, 1], 'number': all_labels})
        sns.scatterplot(x='tsne_1', y='tsne_2', hue='number', style='number', palette=colors, data=tsne_result, ax=axs[i], s=25).set(title='Perplexity=' + str(perplexity_values[i]))
    # Save the figure so that the answer cell below can embed it
    Path("./output").mkdir(exist_ok=True)
    fig.savefig("./output/ex_11_perplexities.png")
    if show:
        plt.show()
    ## YOUR CODE HERE END

    return all_dev_sample_latent_vecs

print("This code can take approx 10 minutes to run")
latent_code_dev_samples = plot_latent_vector_scatter_plot(vae_model, mnist_dev_set, show=True)

b) In the following you will have to answer a couple of questions regarding this result and t-SNE. As this was not covered in detail in the lecture, we provide you with the original paper of t-SNE (https://www.researchgate.net/publication/228339739_Viualizing_data_using_t-SNE) as well as a short introduction on how to interpret t-SNE visualizations (https://distill.pub/2016/misread-tsne/). Note, however, that the questions do NOT require mathematical reasoning, but rather build on general characteristics of t-SNE. 

i) What does this visualization tell us? (3 points)

The most significant thing that this visualization tells us is that the VAE, trained on the training data, successfully retains the class-relevant information of the images while reducing their dimensionality. This becomes clear when comparing the t-SNE representations (with perplexity 50) of the original sample vectors (length 784, left) and of the encoded latent vectors (length 32, right):

<figure width='100%' style="display:flex; align-items: center; justify-content: space-between">
  <img src='./output/ex_11_data.png' width="45%">
  <img src='./output/ex_11_latent.png' width="45%">
</figure>

In discussing these clusterings, it is firstly important to reiterate, as outlined in (txt-src(2)), that the size of each of the clusters has no meaning and the distances between the clusters do not necessarily have to be significant. Furthermore, it is stated in txt-src(2) that random noise does not always look random and, as such, we do not believe that we should make many assumptions from outliers that appear in other classes. Having said that, it is interesting to see that the number 9 and number 4 clusters have an intertwined geometry; number 9 seems to split number 4 into two halves. This leads us to another important thing that we can learn from this visualization and that is that the hyperparameters have a lot of significance (particularly the perplexity)! The two visualizations shown above display this interesting interconnectedness between the 9 and 4 clusters, as do visualizations with perplexity higher than 50. However, if the perplexity is lower than 50 then these two clusters separate, as shown here:

<figure width='100%' style="display:flex; align-items: center; justify-content: center">
  <img src='./output/ex_11_perplexities.png' width="50%">
</figure>

The perplexity is, in the most basic terms, a parameter that balances how much attention is paid to small local neighborhoods of the data (the smaller the perplexity, the smaller the neighborhood that is taken into account). Of all the parameters, it had the most noticeable effect on the resulting visualizations. We hypothesize that the splitting of the number 4 cluster in this t-SNE representation could reflect the two possible ways in which a four can be written (with a parallel or slanted top):


<figure width='100%' style="display:flex; align-items: center; justify-content: center">
  <img src='./images/different_number_4.jpg' width="50%">
</figure>

However, this remains simply a question or hypothesis for future studies as we have not followed this thought further.

txt-src(1): https://www.researchgate.net/publication/228339739_Viualizing_data_using_t-SNE
txt-src(2): https://distill.pub/2016/misread-tsne/

img-src: https://bdtechtalks.com/2021/01/04/semi-supervised-machine-learning/

ii) Does this representation tell us something about the similarities of different numbers? If so, give an example. Otherwise, explain why not. (1 point)

As already stated above, the distances between clusters and the sizes of clusters do not generally have any meaning. We can, however, state that throughout the visualizations with varying parameters there appear to be two groups of clusters that remain together. Firstly, clusters 4 and 9 are always immediate neighbors and, with perplexity $\geq 50$, they become intertwined. This pattern could be explained by the similar forms that 4 and 9 have. Furthermore, one could perhaps see the similarity of the numbers 3, 5 and 8 (all include two vertically stacked loop-like forms) in the fact that their clusters can arguably be observed to form a metagroup throughout all the visualizations.

iii) In this example we visualized the validation data in order to explore what the learned latent vectors look like on data that was not seen during training. Could such a visualization also be beneficial if the training data is used and this is performed before training any model? If so, explain in what way. (1 point)

As already discussed in section i) above, we used t-SNE to visualize both the data vectors and the latent vectors in order to observe whether any significant class information is lost in the VAE's dimensionality reduction. This comparison is only meaningful because we use validation data that the model has not seen. Observing the training data before and after training would not provide particularly useful information about the VAE's dimensionality reduction, as the separation of the class clusters could be artificially inflated: we would simply be testing the VAE on the data with which it has been trained, and we would gain no information on how effective the model is or whether it is merely overfitting.


### Exercise 12: AI Tools Reflection

This task is mandatory! Did you use AI tools (ChatGPT, ...) for the work on this portfolio? Note that we did not forbid the use of such tools, but the disclosure of their application is obligatory. For what steps during the work on the portfolio did you use which AI tools? Do you consider the results to be helpful (and in what way)? How did you formulate queries and how did you process and validate the output of the models? (0 points)

Benedikt used ChatGPT and DeepL in all exercises. The following list gives, per exercise, the specific questions that were put to ChatGPT (CGPT):
2: Is dropout a form of regularization? | I want the element wise product of to matrices in pytorch. | I want the sum of a vector in pytorch. | I want to multiply a scalar to a tensor in pytorch. | I want to implement the squared L2 norm of weight matrix in pytorch. | How many trainable weigths has large-language model? 
6: What is a Nash equilibrium? | My answer for 6 b) 2. I asked ChatGPT if he see this as I do. | 
    6.C) Has the encoder in VAEs has to be learned?  | In Diffusion Models no Encoder has to be learned, right? | What is the name of the stable diffusion model paper? | what is inference speed? | Is the UNet specificly used in stable Diffusion models and not in diffusion models? | Which objective has the UNEt in a Diffusion model? | In the latent diffusion model is the domain specific encoder specific for each y or is it learned for all various modalities?
8: Explain Q-learning for me. 
10: initialize dataframe from column names of other dataframe | and the index | I want to add the values of the Counter(dict{}) to the count of the df 

&nbsp;

I used ChatGPT to ask specific questions and to check my answers, as well as to produce code for me. For code generation, and to some extent for verifying answers I gave, I found it useful. For explaining things that I have no real understanding of myself, I do not find CGPT helpful. 
I validated the output on my own, as most of the questions I asked served to check my understanding of the problem.  


<hr>
<hr>

### Individual Contribution Statement: 
Please state who of the group members contributed which part of the portfolio.

In [None]:
1. Kieran
2. Benedikt
3. Kieran
4. Leon
5. Leon
6. Benedikt
7. Leon
8. Benedikt
9. Kieran
10. Benedikt
11. Kieran