# Assignment 4

<br> 
    <center>
        <img src="src/A4_fashion-mnist.png" width="600"/>
    </center>
</br>

In this last assignment, you will develop your own **generative neural network** to create novel fashion items.
For training, we use the **Fashion-MNIST** dataset, and you will

* develop, train, and test convolutional neural networks,
* gain insight into **conditional** variational autoencoders,
* and explore **embeddings**.

For this exercise, you can switch back to the environment ```APML``` used during class.

Please provide solutions to all exercises below and send me all notebooks by **31st of May 2024**.

***

## Part I: Fashion-MNIST label prediction

In class, we worked with the MNIST benchmark dataset of 70'000 images 
(usually 28 x 28 pixels) of hand-written digits. Zalando Research provides a similar
but more challenging dataset with their [Fashion-MNIST](https://github.com/zalandoresearch/fashion-mnist)
dataset of 70'000 images (28 x 28 pixels) of different fashion items belonging to the 
following ten classes:

In [None]:
fashion_labels = {0: 'T-shirt',
                  1: 'Trouser',
                  2: 'Pullover',
                  3: 'Dress',
                  4: 'Coat',
                  5: 'Sandal',
                  6: 'Shirt',
                  7: 'Sneaker',
                  8: 'Bag',
                  9: 'Ankle boot'}

In order to load the dataset, let us define the following function

In [None]:
import os
import gzip
import numpy as np

os.makedirs('output', exist_ok=True)

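def load_fashion_mnist(path, kind='train'):
    """Load Fashion-MNIST images and labels from gzipped IDX files in `path`."""
    labels_path = os.path.join(path,
                               f'Fashion-MNIST_{kind}-labels.gz'
                              )
    images_path = os.path.join(path,
                               f'Fashion-MNIST_{kind}-images.gz'
                               )

    # The label file starts with an 8-byte header (magic number, item count).
    with gzip.open(labels_path, 'rb') as lbpath:
        labels = np.frombuffer(lbpath.read(), dtype=np.uint8,
                               offset=8)

    # The image file starts with a 16-byte header (magic number, count, rows, columns).
    with gzip.open(images_path, 'rb') as imgpath:
        images = np.frombuffer(imgpath.read(), dtype=np.uint8,
                               offset=16).reshape(len(labels), 28, 28, 1)

    # Cast to float32 so that images and labels can be fed to the networks directly.
    return images.astype(np.float32), labels.astype(np.float32)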

and load the dataset in the following cell:

In [None]:
img_train, label_train = load_fashion_mnist(path='data', kind='train')
img_test, label_test = load_fashion_mnist(path='data', kind='test')

print(f"img_train.shape:\t{img_train.shape}")
print(f"label_train.shape:\t{label_train.shape}")
print(f"img_test.shape:\t\t{img_test.shape}")
print(f"label_test.shape:\t{label_test.shape}")
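The `offset=8` and `offset=16` arguments in `load_fashion_mnist` skip the headers of the IDX file format in which the dataset is distributed. The following sketch writes a tiny synthetic image file and reads it back the same way. The file name `toy-images.gz` and the 2 x 2 image size are made up for this illustration and are not part of the assignment data.

```python
import gzip
import os
import struct
import tempfile

import numpy as np

# Build a tiny synthetic IDX image file: 3 images of 2 x 2 pixels (toy values).
num_images, rows, cols = 3, 2, 2
pixels = np.arange(num_images * rows * cols, dtype=np.uint8)

# Header: magic number 2051 (images), then count, rows, cols as big-endian 32-bit ints.
header = struct.pack('>IIII', 2051, num_images, rows, cols)

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, 'toy-images.gz')  # hypothetical file name
    with gzip.open(tmp_path, 'wb') as f:
        f.write(header + pixels.tobytes())
    with gzip.open(tmp_path, 'rb') as f:
        raw = f.read()

# Read the header fields back, then skip them via offset=16 as in the loader.
magic, count, n_rows, n_cols = struct.unpack('>IIII', raw[:16])
toy_images = np.frombuffer(raw, dtype=np.uint8,
                           offset=16).reshape(count, n_rows, n_cols, 1)
print(magic, count, n_rows, n_cols, toy_images.shape)
```

Label files follow the same pattern with a shorter header (magic number and count only), which is why the loader uses `offset=8` there.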

### Exercise I.1

First, let's get a feeling for the dataset. 

* Visualise the first 20 images of the **test set**. Plot all images as subplots in one big figure with four rows and five columns.
* As the title for each subplot, set the label string provided in ```fashion_labels``` above.

Hints: Remember that subplots were discussed in [Notebook 1](../Notebooks/1-Python_Concepts.ipynb). Your solution could include the following parts

*  ```axs.ravel()```
*  ```ax.axis('off')```
*  ```cmap='binary_r'```

Note that your own solution might not require all of these.
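As a rough illustration of how the hinted pieces can fit together, the following sketch draws a small 2 x 3 grid of random placeholder images. The array `demo_images` and the titles are made up for this example; the exercise itself uses the test images and ```fashion_labels``` with a different grid size.

```python
import numpy as np
import matplotlib.pyplot as plt

# Random placeholder data standing in for real images (assumption for this sketch).
rng = np.random.default_rng(0)
demo_images = rng.random((6, 28, 28, 1))

fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(6, 4))

# axs is a 2D array of axes; ravel() flattens it so that one loop covers all panels.
for idx, ax in enumerate(axs.ravel()):
    ax.imshow(demo_images[idx, :, :, 0], cmap='binary_r')
    ax.set_title(f"image {idx}")
    ax.axis('off')  # hide ticks and frame

fig.tight_layout()
```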



### Exercise I.2

Normalise the input images, such that pixel values are in the range between 0.0 and 1.0.
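One common way to do this, assuming the raw pixel values are 8-bit intensities between 0 and 255, is to divide by 255. The toy array `raw` below is made up for illustration and stands in for the image arrays.

```python
import numpy as np

# Toy stand-in for raw images with 8-bit pixel values stored as floats.
raw = np.array([[0.0, 51.0], [127.5, 255.0]], dtype=np.float32)

# Dividing by the largest possible 8-bit value maps [0, 255] onto [0.0, 1.0].
scaled = raw / 255.0
print(scaled.min(), scaled.max())
```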

### Exercise I.3

Train a convolutional neural network to predict the fashion item label for a
new image. Define the following architecture:

* 28 $ \times $ 28 $ \times $ 1 neurons in the input layer for the greyscale image
* 6 conv. filters of size $3 \times 3$ with ReLU activation followed by a $2 \times 2$ max pooling
* 16 conv. filters of size $3 \times 3$ with ReLU activation followed by a $2 \times 2$ max pooling
* flatten these feature maps in a single vector 
* 1024 neurons in a dense layer with ReLU activation
* 512 neurons in a dense layer with ReLU activation
* 10 output neurons for our classes in the output layer with a softmax activation
* use the ```sparse_categorical_crossentropy``` loss
* store training results in ```fashion_cnn_history```
* store weights in ```output/Fashion-MNIST_CNN_weights.h5```
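To check the layer sizes you get from `model.summary()`, it can help to work out the spatial dimensions by hand. The sketch below assumes the Keras defaults for the convolution and pooling layers (`'valid'` padding, convolution stride 1, pooling stride equal to the pool size); the helper names `conv_out` and `pool_out` are made up for this illustration.

```python
def conv_out(size, kernel=3, stride=1):
    """Spatial output size of a convolution with 'valid' padding."""
    return (size - kernel) // stride + 1


def pool_out(size, pool=2):
    """Spatial output size of 'valid' max pooling with stride equal to the pool size."""
    return (size - pool) // pool + 1


size = 28
size = pool_out(conv_out(size))   # first block:  28 -> 26 -> 13
size = pool_out(conv_out(size))   # second block: 13 -> 11 -> 5
flat_features = size * size * 16  # 16 feature maps after the second block
print(size, flat_features)
```

If you choose a different padding or stride, the numbers change accordingly.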

You can use the following hyperparameters but also explore other choices:

In [None]:
learning_rate = 0.01
batch_size = 16
num_epochs = 10

Fill in the cell below:

In [None]:
import numpy as np
from tensorflow import keras

keras.utils.set_random_seed(123)

fashion_cnn = ...

### Exercise I.4

Visualise the training outcome:

1. Use ```fashion_cnn_history``` to plot ```accuracy``` and ```loss``` for the training set and ```val_accuracy``` and ```val_loss``` for the test set. 
2. Reuse your plotting routine of ```Exercise I.1``` and add the predicted fashion label to the title of the 20 test images you visualised earlier.
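The softmax output layer returns one probability per class, so a predicted label can be obtained by taking the index of the largest entry. The sketch below uses a made-up probability array `probs` instead of real model output.

```python
import numpy as np

fashion_labels = {0: 'T-shirt', 1: 'Trouser', 2: 'Pullover', 3: 'Dress', 4: 'Coat',
                  5: 'Sandal', 6: 'Shirt', 7: 'Sneaker', 8: 'Bag', 9: 'Ankle boot'}

# Made-up softmax outputs for two images (each row sums to 1).
probs = np.full((2, 10), 0.02)
probs[0, 9] = 0.82  # most probability mass on class 9
probs[1, 2] = 0.82  # most probability mass on class 2

# The predicted class is the index of the largest probability in each row.
predicted_ids = np.argmax(probs, axis=1)
predicted_names = [fashion_labels[int(i)] for i in predicted_ids]
print(predicted_names)
```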

***

## Part II: Training a generative model

In [Notebook 6](../Notebooks/6-VAE.ipynb), we have discussed the concept of
variational autoencoders (VAEs) in unsupervised learning. There, we used an encoder to compress our input
data into a latent representation $Z$, which is supposed to capture the relevant
information of our input and allow us to reconstruct the original input. The following
illustration depicts the general idea of the architecture:

<br>
    <center>
        <img src="src/A4_VAE_illustration.png" width="400"/>
    </center>
</br>

In this second part, we want to extend the standard VAE to the **conditional VAE**
as illustrated below:

<br>
    <center>
        <img src="src/A4_condVAE_illustration.png" width="400"/>
    </center>
</br>

If labels are available, we can make use of these labels as additional inputs to
the encoder and decoder.

In this part, you can generate novel fashion items. With the hyperparameters specified below 
(in particular, after 20 training epochs and using 10 latent dimensions), you can expect
your reconstructions for the first 20 images of the test set to look like the following:

<br>
    <center>
        <img src="src/A4_VAE_test_reconstructions.png" width="400"/>
    </center>
</br>

(Compare this to your plots of the first 20 images of the test set in Exercises I.1 and I.4.)

### There are two subparts:

* **Subpart II.A**: In the first subpart, you are asked to implement the conditional VAE. This will require some extension of what you have encountered in the course. Please try to complete this subpart first. However, if you get stuck and cannot complete the implementation, please proceed with **subpart II.B**. Please keep your intermediate results and attempts in the coding cells! If you were successful with subpart II.A, **you can skip subpart** II.B.
* **Subpart II.B**: In the second subpart, you can implement the standard VAE as encountered during the course in [Notebook 6](../Notebooks/6-VAE.ipynb). Please do this subpart only if you cannot implement the conditional VAE in subpart II.A.

***

## Subpart II.A - Fashion Conditional VAE

In this subpart, you can try to implement the conditional VAE. You can reuse parts of your CNN in Exercise I.3 as the encoder.

What you need to do:

* Recall the VAE implementation covered in [Notebook 6](../Notebooks/6-VAE.ipynb). You can follow closely what we have used there.
* The illustration above is supposed to give you an idea of the implementation. In particular, you will need to adjust the class methods ```encode(self, x)``` and ```decode(self, z, apply_sigmoid=False)``` to include the target labels ```y```.
* You will need to adjust any other function and method which calls these functions to include the target labels ```y```, too.
* Finally, your encoder and decoder definitions ```self.encoder``` and ```self.decoder``` will need an adjustment in the input layers to account for ```y```.

With these changes, you should be able to implement the conditional VAE.

Define the following encoder architecture:

* Figure out what the input layer needs to look like. It should be a minor adjustment to what you have used in Exercise I.3.
* 6 conv. filters of size $3 \times 3$ with stride $2$ with ReLU activation followed by a $2 \times 2$ max pooling
* 16 conv. filters of size $3 \times 3$ with stride $2$ with ReLU activation followed by a $2 \times 2$ max pooling
* flatten these feature maps in a single vector 
* 1024 neurons in a dense layer with ReLU activation
* 512 neurons in a dense layer with ReLU activation
* ```2*latent_dim``` output neurons in a dense layer with **no activation** for the mean and log variance parameters in the layer leading to the latent representation 
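The last encoder layer returns ```2*latent_dim``` numbers per image, which are split into a mean and a log variance before sampling the latent code. The NumPy sketch below illustrates this step under the assumption that the first half holds the mean and the second half the log variance; check Notebook 6 for the convention actually used there. The array `encoder_out` is a random stand-in for real encoder output.

```python
import numpy as np

latent_dim = 10
batch = 4
rng = np.random.default_rng(123)

# Random stand-in for the encoder output: 2*latent_dim values per image.
encoder_out = rng.standard_normal((batch, 2 * latent_dim)).astype(np.float32)

# Assumed layout: first half is the mean, second half the log variance.
mean, logvar = np.split(encoder_out, 2, axis=1)

# Reparameterisation: z = mean + sigma * eps with sigma = exp(0.5 * logvar).
eps = rng.standard_normal(mean.shape).astype(np.float32)
latent_z = mean + np.exp(0.5 * logvar) * eps
print(latent_z.shape)
```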

Define the following decoder architecture, which can be viewed as a reversal of the encoder steps:

* Figure out what the input layer for the latent representation needs to look like. It should be a minor adjustment to what you have seen in [Notebook 6](../Notebooks/6-VAE.ipynb)
* 512 neurons in a dense layer with ReLU activation
* 7x7x16 neurons in a dense layer with ReLU activation
* Reshape the output of the previous layer to ```target_shape=(7, 7, 16)```
* 16 transposed 2d conv. filters of size $3 \times 3$ with stride $2$, ```same``` padding, with ReLU activation
* 6 transposed 2d conv. filters of size $3 \times 3$ with stride $2$, ```same``` padding, with ReLU activation
* 1 transposed 2d conv. filter of size $3 \times 3$ with stride $1$, ```same``` padding, **no activation** for the output
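To see why this decoder ends at the image size, note that a transposed convolution with ```same``` padding multiplies the spatial size by its stride (this is the behaviour of the Keras layer; the helper name below is made up for this sketch).

```python
def conv_transpose_same_out(size, stride):
    """Spatial output size of a transposed convolution with 'same' padding."""
    return size * stride


shape = (7, 7, 16)  # after the Reshape layer
for filters, stride in [(16, 2), (6, 2), (1, 1)]:
    side = conv_transpose_same_out(shape[0], stride)
    shape = (side, side, filters)
    print(shape)
```

The last printed shape is the shape of a single decoded image.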

Additionally:
* use ```keras.optimizers.Adam``` as the optimizer
* use the same loss as seen in [Notebook 6](../Notebooks/6-VAE.ipynb)
* store weights in ```output/Fashion-MNIST_condVAE_weights.h5```

Hints: The following considerations might help you with the implementation.
1. The input images in ```img_train``` and ```img_test``` are of shape ```[batch_size, 28, 28, 1]```. For the encoder input, expand the labels ```y``` into a tensor of the same shape ```[batch_size, 28, 28, 1]```, where each of the 28x28 entries is just the label.
2. The latent code ```latent_z``` will be of shape ```[batch_size, latent_dim]```. For the decoder input, reshape the labels ```y``` into a tensor of shape ```[batch_size, 1]```.

You can use the following code for this:

```Python
# merge image and label
x_shapes = tf.shape(x)
y = tf.reshape(y, [self.batch_size, 1, 1, 1])
x = tf.concat([x, y*tf.ones([x_shapes[0], x_shapes[1], x_shapes[2], 1])], axis=3)
```

and

```Python
# merge noise and label
y = tf.reshape(y, [self.batch_size, 1])
z = tf.concat([z, y], axis=1)
```

3. If you get an error like <center><img src="src/A4_Potential_Error_Msg.png" width="800"/></center> execute the cell where you define your conditional VAE class and training functions etc. once more.
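To get a feeling for what the two label-merging snippets in the hints do, the same operations can be mimicked in NumPy. This is only a sketch with placeholder arrays (`x`, `z` and `y` are filled with made-up values); in your model, the TensorFlow versions from the hints are applied to real batches.

```python
import numpy as np

batch, latent_dim = 2, 3
x = np.zeros((batch, 28, 28, 1), dtype=np.float32)  # placeholder images
z = np.zeros((batch, latent_dim), dtype=np.float32)  # placeholder latent codes
y = np.array([1.0, 9.0], dtype=np.float32)           # one label per image

# Encoder input: broadcast each label to a 28 x 28 plane and append it as an extra channel.
y_plane = y.reshape(batch, 1, 1, 1) * np.ones((batch, 28, 28, 1), dtype=np.float32)
x_cond = np.concatenate([x, y_plane], axis=3)

# Decoder input: append the label as one extra column next to the latent code.
z_cond = np.concatenate([z, y.reshape(batch, 1)], axis=1)
print(x_cond.shape, z_cond.shape)
```

The printed shapes indicate what the input layers of the encoder and the decoder have to accept.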

You can use the following hyperparameters but also explore other choices:

In [None]:
learning_rate = 0.0001
batch_size = 32
num_epochs = 20
latent_dim = 10

num_train = img_train.shape[0]
num_test = img_test.shape[0]

### Exercise II.A.1

Put your attempt at implementing the conditional VAE below:

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

keras.utils.set_random_seed(123)

class condVAE(keras.Model):
    ...  # Please fill in

### Exercise II.A.2

For the test set images, plot the reconstructions below.

Hint: You will need to use something similar to 

```Python
predictions = condVAE_model.sample(y=label_test[:batch_size], eps=latent_z)
```

for this.

### Exercise II.A.3

Plot the latent space below, i.e. plot the first two latent dimensions of the latent representation of
the first batch of test data points.

Hint: You can use ```cmap='tab10'``` as a categorical colour map and

```Python
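# fix one tick per class so that each tick gets the matching label
cbar = plt.colorbar(ticks=list(fashion_labels.keys()))
cbar.ax.set_yticklabels(list(fashion_labels.values()))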
```

to have the fashion item labels displayed in the colorbar (instead of integer values).


***

## Subpart II.B - Fashion VAE

Implementing a conditional VAE is a challenging task! If you could not complete subpart II.A, 
you can implement the standard VAE in this subpart instead. Please leave your attempt on II.A in the cells above as they are!

Recall the VAE implementation covered in [Notebook 6](../Notebooks/6-VAE.ipynb). You can follow closely what we have used there.

Define the following encoder architecture:

* 28 $ \times $ 28 $ \times $ 1 neurons in the input layer for the greyscale image
* 6 conv. filters of size $3 \times 3$ with stride $2$ with ReLU activation followed by a $2 \times 2$ max pooling
* 16 conv. filters of size $3 \times 3$ with stride $2$ with ReLU activation followed by a $2 \times 2$ max pooling
* flatten these feature maps in a single vector 
* 1024 neurons in a dense layer with ReLU activation
* 512 neurons in a dense layer with ReLU activation
* ```2*latent_dim``` output neurons in a dense layer with **no activation** for the mean and log variance parameters in the layer leading to the latent representation 
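Because the convolutions now use stride $2$ in addition to the pooling, the feature maps shrink quickly. The sketch below works out the spatial size after the two blocks for both padding modes, assuming the usual Keras output-size rules; the helper names are made up for this illustration.

```python
import math


def conv_out(size, kernel=3, stride=2, padding='valid'):
    """Spatial output size of a strided convolution."""
    if padding == 'same':
        return math.ceil(size / stride)
    return (size - kernel) // stride + 1


def pool_out(size, pool=2):
    """Spatial output size of 'valid' max pooling with stride equal to the pool size."""
    return (size - pool) // pool + 1


flat_sizes = {}
for padding in ('valid', 'same'):
    size = 28
    for _ in range(2):  # two conv + pooling blocks
        size = pool_out(conv_out(size, padding=padding))
    flat_sizes[padding] = size * size * 16  # 16 feature maps after the second block
print(flat_sizes)
```

Comparing the result with `model.summary()` is a quick way to confirm that your encoder matches the intended architecture.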

Define the following decoder architecture, which can be viewed as a reversal of the encoder steps:

* An input layer which expects ```input_shape=(latent_dim,)```
* 512 neurons in a dense layer with ReLU activation
* 7x7x16 neurons in a dense layer with ReLU activation
* Reshape the output of the previous layer to ```target_shape=(7, 7, 16)```
* 16 transposed 2d conv. filters of size $3 \times 3$ with stride $2$, ```same``` padding, with ReLU activation
* 6 transposed 2d conv. filters of size $3 \times 3$ with stride $2$, ```same``` padding, with ReLU activation
* 1 transposed 2d conv. filter of size $3 \times 3$ with stride $1$, ```same``` padding, **no activation** for the output

Additionally:
* use ```keras.optimizers.Adam``` as the optimizer
* use the same loss as seen in [Notebook 6](../Notebooks/6-VAE.ipynb)
* store weights in ```output/Fashion-MNIST_VAE_weights.h5```

Hint: If you get an error like <center><img src="src/A4_Potential_Error_Msg.png" width="800"/></center> execute the cell where you define your VAE class and training functions etc. once more.

You can use the following hyperparameters but also explore other choices:

In [None]:
learning_rate = 0.0001
batch_size = 32
num_epochs = 20
latent_dim = 10

num_train = img_train.shape[0]
num_test = img_test.shape[0]

### Exercise II.B.1

Put your implementation of the VAE below:

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

keras.utils.set_random_seed(123)

class VAE(keras.Model):
    ...  # Please fill in

### Exercise II.B.2

For the test set images, plot the reconstructions below.

Hint: You will need to use something similar to 

```Python
predictions = VAE_model.sample(latent_z)
```

for this.

### Exercise II.B.3

Plot the latent space below, i.e. plot the first two latent dimensions of the latent representation of
the first batch of test data points.

Hint: You can use ```cmap='tab10'``` as a categorical colour map and

```Python
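# fix one tick per class so that each tick gets the matching label
cbar = plt.colorbar(ticks=list(fashion_labels.keys()))
cbar.ax.set_yticklabels(list(fashion_labels.values()))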
```

to have the fashion item labels displayed in the colorbar (instead of integer values).


***

## Part III: Exploring embeddings

In Part II, you should have implemented a way to generate novel, "artificial" fashion
items by sampling points in the latent space and decoding these samples into images. Your 
procedure could look something like this for subpart II.A:

```Python
generated_images = condVAE_model.sample(y=label_to_generate, eps=latent_z)
```

or like this for subpart II.B:

```Python
generated_images = VAE_model.sample(eps=latent_z)
```

with shapes 
* ```label_to_generate.shape = (batch_size,)```
* ```latent_z.shape = (batch_size, latent_dim)```
* ```generated_images.shape = (batch_size, 28, 28, 1)```

Make use of this sampling procedure in the following exercise.

### Exercise III.1

Generate random samples from the standard normal distribution 
with the correct shape for ```latent_z``` and create novel, artificial fashion items.

If you successfully implemented the conditional VAE, create the label input similar to ```y=LABELID*np.ones(batch_size).astype(np.float32)```
and choose ```LABELID``` as ```1``` (i.e. 'Trouser') and ```9``` (i.e. 'Ankle boot').
You can also choose to condition the decoder on other labels.
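The inputs for the sampling call can be prepared along the following lines. This is a minimal NumPy sketch; the seed and the use of `np.random.default_rng` are choices made for this example only.

```python
import numpy as np

batch_size = 32
latent_dim = 10
LABELID = 1  # 'Trouser'

rng = np.random.default_rng(123)

# Standard normal samples, one latent vector per image to generate.
latent_z = rng.standard_normal((batch_size, latent_dim)).astype(np.float32)

# Constant label vector as described above (only needed for the conditional VAE).
label_to_generate = LABELID * np.ones(batch_size).astype(np.float32)
print(latent_z.shape, label_to_generate.shape)
```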

Plot your novel, artificial fashion items similar to your results in II.A.2 or II.B.2.
You should expect some of them to look quite freaky as you mix different fashion items. :)

Fill in the cell below:

### Bonus Exercise III.2

#### This last part is not required to complete assignment 4!

It is difficult to visualise latent spaces with more than three dimensions. 
For this reason, you have visualised only two latent dimensions in
Exercises II.A.3 or II.B.3. 

However, there are tools to generate two-dimensional visualisations of 
higher-dimensional data, such as **t-distributed Stochastic Neighbor Embedding (t-SNE)**.
You can read up on it in the [scikit-learn documentation on t-SNE](https://scikit-learn.org/stable/modules/manifold.html#t-sne).

You can explore its capabilities below if you want to.

First, load the latent representation for a batch of the test data.
You can copy part of your results used in Exercises II.A.3 or II.B.3 here.

Fill in the cell below:

In [None]:
latent_z = ...  # Please fill in

Let's assume the latent representation for the images with labels ```label_test[:batch_size]```
is given by ```latent_z```. Then you can execute the following to display the t-SNE visualisation:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

tsne_visualisation = TSNE(n_components=2, 
                          learning_rate='auto',
                          init='random', 
                          perplexity=3).fit_transform(latent_z)

plt.scatter(tsne_visualisation[:,0], tsne_visualisation[:,1],  c=label_test[:batch_size], cmap='tab10')
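# fix one tick per class so that each tick gets the matching label
cbar = plt.colorbar(ticks=list(fashion_labels.keys()))
cbar.ax.set_yticklabels(list(fashion_labels.values()))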

plt.show()