> **DO NOT EDIT IF INSIDE course Github folder**


# Architectures and concepts

Part 5.1: Convolutional neural networks<br>
Part 5.2: Recurrent neural networks<br>
Part 5.3: Transfer learning<br>
Part 5.4: VAEs<br>
Part 5.5: GANs


[**Feedback**](https://ulfaslak.com/vent)

In [15]:
%matplotlib inline

import matplotlib.pylab as plt
import numpy as np
import random, sys, io
import requests as rq
from bs4 import BeautifulSoup

import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.callbacks import LambdaCallback
from keras.datasets import mnist

## Exercises

### 5.1: Convolutional Neural Networks

#### Pen and paper

To sharpen your intuition for how CNNs compute on input data, I have a few small quizzes for you. First, we'll consider the size of the parameter space.

> **Ex. 5.1.1**: Imagine you have a CNN with just one convolutional layer with a single filter. All it does is take an input image and produce an activation map. The dimensionality of the filter in your convolutional layer is $5 \times 5 \times 3$. How many weights (or *parameters*) are there in this model?
>
> *Hint*: Don't forget the bias!

Here's the formula for computing the size of the activation map resulting from a convolution. 
If you have a filter that is $F$ wide, your input image is $W_0$ wide, you are padding the edges by
$P$ pixels and your stride is $S$, the resulting image will have width/height:

$$ W_1 = \frac{W_0 - F + 2P}{S} + 1 $$
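
As a quick sanity check, the formula can be written as a small helper. This is only a sketch: the function name `conv_output_width` and the example numbers are made up for illustration and are not the ones from the exercises below.

```python
def conv_output_width(w0: int, f: int, p: int, s: int) -> float:
    """Width of the activation map: W1 = (W0 - F + 2P) / S + 1."""
    return (w0 - f + 2 * p) / s + 1


# Hypothetical example: 32-pixel-wide input, 3-wide filter, no padding, stride 1
w1 = conv_output_width(w0=32, f=3, p=0, s=1)
print(w1)  # 30.0
```

The helper returns a float on purpose, so you can see directly whether a given combination of $W_0$, $F$, $P$ and $S$ gives a whole number of pixels.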

> **Ex. 5.1.2**: You input an image of dimensions $28 \times 28 \times 3$, use a padding of 2, a stride of 1,
and then slide your $5 \times 5 \times 3$ filter across the image. What is the dimensionality of the resulting activation map?

> **Ex. 5.1.3**: Let's say you now want to use a stride of 2, instead of 1. What problem does this immediately cause?

*Maxpooling* is a method used a lot in CNNs, which downsamples an activation map. It is used primarily to reduce the number of parameters and computations needed in the network, and to avoid overfitting. Here's an illustration of how it works:

![img](http://cs231n.github.io/assets/cnn/maxpool.jpeg)

In *Max*pooling, for each $2 \times 2$ square in your activation map, you pick the largest value in that square. You do this independently for every depth slice in your activation map.
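
The description above can be sketched in plain NumPy for a single depth slice. This is a minimal illustration, not how Keras implements it: the helper name `maxpool_2x2` and the $4 \times 4$ example values are made up, and the sketch assumes the height and width are even.

```python
import numpy as np


def maxpool_2x2(activation_map: np.ndarray) -> np.ndarray:
    """Max pooling with a 2 x 2 window and stride 2 on one (height, width) depth slice."""
    h, w = activation_map.shape
    assert h % 2 == 0 and w % 2 == 0, "this sketch assumes even height and width"
    # Split the map into non-overlapping 2 x 2 blocks, then take the max within each block
    blocks = activation_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))


depth_slice = np.array([
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
])
pooled = maxpool_2x2(depth_slice)
print(pooled)
# [[6 8]
#  [3 4]]
```

For an activation map with several depth slices, you would apply the same operation to each slice separately.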

**Note:** In Keras, the dimensions of the data are ordered a little differently from what you may expect. The first index
runs over datapoints, the second and third are the height and width of your images, and the last is the number of channels. So if
you have a batch of data containing 100 datapoints, each one an RGB image (so 3 channels: red, green, blue)
with resolution $128 \times 128$, then the dimensionality of your input data is (100, 128, 128, 3).
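
To make the layout concrete, here is a small NumPy sketch of such a batch. It assumes Keras' default `channels_last` data format, and the variable names are only for illustration.

```python
import numpy as np

# A batch of 100 RGB images at 128 x 128 resolution,
# ordered as (datapoints, height, width, channels)
batch = np.zeros((100, 128, 128, 3), dtype="float32")

first_image = batch[0]       # a single datapoint
red_channel = batch[..., 0]  # the first channel of every image in the batch

print(batch.shape)        # (100, 128, 128, 3)
print(first_image.shape)  # (128, 128, 3)
print(red_channel.shape)  # (100, 128, 128)
```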

> **Ex. 5.1.4**: Given the activation map below, what is the corresponding activation map after maxpooling ($2 \times 2$ filter, stride 2)? Run it through a Keras maxpooling layer (check out [the docs](https://keras.io/layers/pooling/)), and report the dimensionality.
>
> *Hint: In Keras, layers (e.g.* `MaxPooling2D` *or* `MaxPool2D`*) are classes. An instance of such a class (e.g.* `mypool = MaxPool2D()`*) acts like a function.*

In [5]:
a = np.random.random(size=(10, 28, 28, 1))  # Create a 10 x 28 x 28 x 1 array of random numbers
activation_map = keras.backend.variable(a)  # Load it as a TensorFlow variable

#### CNNs in Keras

For the sake of example, I have implemented a neural network with a single convolutional layer in Keras below.

In [30]:
model = Sequential([
    Conv2D(filters=10, kernel_size=3, strides=(1, 1), padding='valid'),
    MaxPooling2D(pool_size=(2, 2), strides=2),  # imported above; `MaxPool2D` is an alias that is not imported
    Flatten(),
    Dense(10)
])

In the following exercise you will use the MNIST dataset again. Here is **some code that prepares** `x_train` and `x_test`, and `y_train` and `y_test` for you.

In [64]:
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Reshape data so it has a channel dimension
rows, cols = x_train.shape[-2:]
x_train = x_train.reshape(x_train.shape[0], rows, cols, 1)
x_test = x_test.reshape(x_test.shape[0], rows, cols, 1)

# Convert pixel intensities to values between 0 and 1
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
    
# Convert target vectors to one-hot encoding
num_classes = len(set(y_train))
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

> **Ex. 5.1.5**: Implement Nielsen's [last convolutional neural network](http://neuralnetworksanddeeplearning.com/chap6.html#exercise_683491)
(the one with two convolutional layers and dropout), and score an accuracy higher than 98%. It doesn't have to be
fully identical, but his solution is pretty great, so getting close is a cheap way to score a high accuracy.

### 5.2: Recurrent Neural Networks

#### Modeling text

Text prediction is a good place to start when learning about RNNs, because most of us humans have a pretty well
optimized inner model for text prediction ourselves. We can, therefore, easily assess the performance of a neural
network in executing this task.

Below is some code that loads the screenplay for Tarantino's 1994 film 'Pulp Fiction'. I recommend reading through the
first 20 lines or so to get a feeling for the language and style used (and enjoy probably the best written screenplay
in the history of film).

In [17]:
response = rq.get("http://www.dailyscript.com/scripts/pulp_fiction.html")
text = BeautifulSoup(response.content, "html.parser").getText()
print(text[:2000])



"PULP FICTION" -- by Quentin Tarantino & Roger Avary


                                      "PULP FICTION"

                                            By

                             Quentin Tarantino & Roger Avary

                

               PULP [pulp] n.

               1. A soft, moist, shapeless mass or matter.

               2. A magazine or book containing lurid subject matter and 
               being characteristically printed on rough, unfinished paper.

               American Heritage Dictionary: New College Edition

               INT. COFFEE SHOP – MORNING

               A normal Denny's, Spires-like coffee shop in Los Angeles. 
               It's about 9:00 in the morning. While the place isn't jammed, 
               there's a healthy number of people drinking coffee, munching 
               on bacon and eating eggs.

               Two of these people are a YOUNG MAN and a YOUNG WOMAN. The 
               Young Man has a slight working-class English acce

> **Ex. 5.2.1:** What is the most used symbol in this screenplay and what accuracy would a model constantly predicting this symbol obtain? In other words, what is the "baseline accuracy"?

I have adapted some code for text generation from [this Keras example](https://keras.io/examples/lstm_text_generation/), and inserted questions in the code (look for comments starting with `Q`) for you to answer in the exercise below.

The code fits an LSTM recurrent neural network model to the `text` variable (the Pulp Fiction manuscript). Execute it and see it run. It fits over 50 epochs, so you will probably want to interrupt it (hit `Esc` and then `I` twice) before solving the next exercise though.

In [None]:
# Q1: What is the purpose of this block? When is `char_indices` used? What about `indices_char`?
chars = sorted(list(set(text)))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# Q2: What is the purpose of this block? What do the `seqlen` and `step` parameters do?
seqlen = 40
step = seqlen
sentences = []
for i in range(0, len(text) - seqlen - 1, step):
    sentences.append(text[i: i + seqlen + 1])

# Q3: What about this block? What is `x` and what is `y`? Why do they have this dimensionality?
x = np.zeros((len(sentences), seqlen, len(chars)), dtype=bool)  # `np.bool` was removed in NumPy 1.24
y = np.zeros((len(sentences), seqlen, len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    # Q3a: What happens in this loop?
    for t, (char_in, char_out) in enumerate(zip(sentence[:-1], sentence[1:])):
        x[i, t, char_indices[char_in]] = 1
        y[i, t, char_indices[char_out]] = 1


# Q4: Here we build the model. What does the `return_sequences` argument do? Why the dense layer at the end?
model = Sequential()
model.add(LSTM(128, input_shape=(seqlen, len(chars)), return_sequences=True))
model.add(Dense(len(chars), activation='softmax'))

model.compile(
    loss='categorical_crossentropy',
    optimizer=RMSprop(learning_rate=0.01),
    metrics=['categorical_crossentropy', 'accuracy']
)

def sample(preds, temperature=1.0):
    """Helper function to sample an index from a probability array."""
    preds = np.asarray(preds).astype('float64')
    preds = np.exp(np.log(preds) / temperature)  # softmax
    preds = preds / np.sum(preds)                #
    probas = np.random.multinomial(1, preds, 1)  # sample index
    return np.argmax(probas)                     #


def on_epoch_end(epoch, _):
    """Function invoked at end of each epoch. Prints generated text."""
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(text) - seqlen - 1)
    
    # Q5: What does diversity do?
    for diversity in [0.2, 0.5, 1.0]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + seqlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, seqlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.
            
            # Q6: What is the dimensionality of `preds`? Why do we input `preds[0, -1]` to the `sample` function?
            preds = model.predict(x_pred, verbose=0)
            next_index = sample(preds[0, -1], diversity)
            next_char = indices_char[next_index]

            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

model.fit(x, y,
          batch_size=128,
          epochs=50,
          callbacks=[print_callback])

> **Ex. 5.2.2**: Add a callback for Tensorboard, so you can log the training process. Start training the network (takes ~10 minutes on my computer). While it's running move on to the next question.

> **Ex. 5.2.3**: Answer the questions in the code above (look for code comments starting with `Q`).

> **Ex. 5.2.4**: Did the network finish training? Consider the generated text across epochs.
1. In the early epochs (0-10), the generated text looks very bad. Can you explain why the low diversity generated text contains almost only the symbol " " (that is, spaces)?
2. The high diversity generated text is messed up too, but in a different way. Explain how.
3. In later epochs (20-30) what do you notice is off about the low diversity generated text?

> **Ex. 5.2.5**: For the network trained over all 50 epochs, generate a longer piece of text
(say 5000 symbols long). Use the sentence `text[1486:1526]` as seed (starts with 'YOUNG MAN' ends with 'No, ')
and set diversity to 0.5.
Describe which features of the screenplay, and of language in general, the network learned in only 50 epochs.
Also describe what serious mistakes it makes.

> **Ex. 5.2.6**: Do the same as above, but for 40 random letters (e.g. smash away on your keyboard) as seed. What happens? Can you explain why?

### 5.3: Transfer learning

We will follow [a very nice blog post](https://machinelearningmastery.com/how-to-use-transfer-learning-when-developing-convolutional-neural-network-models/) written by Jason Brownlee of 'Machine Learning 
Mastery' for most of these exercises. In his blog post, Jason takes the reader through
the process of using pretrained models in Keras. Below I have outlined the steps you
will go through with reference to his blog post. I strongly recommend you read from the
top down to 'Models for Transfer Learning' before proceeding.

#### Loading pretrained models

The first practical thing we need to figure out when doing transfer learning is loading pretrained models. Keras makes this very easy by offering a number of pretrained models for image classification which can be downloaded through their [Applications API](https://keras.io/applications/#densenet). 

##### Applications API arguments

When loading pretrained models, we will want to provide some arguments that depend on what
we want to do with the model after loading. Below I ask you to explain, in your own words,
what some of these parameters do. See the Applications API reference for some of the models
and the 'Models for Transfer Learning' section in Jason's blog post for help.

> **Ex. 5.3.1**: In your own words, explain what the following function arguments do in
the different model loading functions:
1. `include_top`
1. `weights`
1. `input_shape`
1. `pooling`
1. `classes`
1. Explain what 'global pooling' does, and why it is needed when `include_top=False`

##### Load a model and predict an image

> **Ex. 5.3.2**: Following Jason's example under 'Pre-Trained Model as Classifier'
classify [this image](https://66.media.tumblr.com/tumblr_mc46e7Zm4R1qbqngeo1_1280.jpg).
Print not just the most likely label, but everything that `decode_predictions` returns.
>
> ***Important***: *Don't use VGG as he does. It's 500 MB to download, and will take too long.
> Use one of the smaller models instead ([here](https://keras.io/applications/#documentation-for-individual-models)'s an overview of model sizes), such as DenseNet121.*

#### Adapting pretrained models

##### Simple feature extractor for ML prediction

By removing the last layer, we can turn a pretrained convolutional neural network into a
feature extractor. We can then use it to extract features of a large number of images and
classify those using any machine learning model. Jason describes this under 'Pre-Trained Model as Feature Extractor Preprocessor'.

> **Ex. 5.3.3:** Extract features for every datapoint in the [fashion-mnist dataset](https://keras.io/datasets/#fashion-mnist-database-of-fashion-articles), and build a feature matrix X. Train an SVM classifier on the learned features, and report the accuracy on the test data.
>
> *Hint: You can import SVM from sklearn. It has a simple API, just check out some of the examples on the [documentation page](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).*

##### Changing the prediction task (switching out the last layer)

Another way to achieve roughly the same thing is to remove the last layer and insert a new one with a different number of outputs. Jason describes this under 'Pre-Trained Model as Feature Extractor in Model'.

> **Ex. 5.3.4**: Do the same as above, but by following Jason's example under 'Pre-Trained Model as Feature Extractor in Model'.
Compare to the accuracy you got in Ex. 5.3.3.

### 5.4: Variational Autoencoders

Assuming you have watched [this video](https://www.youtube.com/watch?v=9zKuYvjFFS8), answer the questions below. I also throw in some questions that link to other sources, to prompt you for a deeper understanding of some of the intuition behind VAEs:

> **Ex. 5.4.1**: What is typically the input and output of an autoencoder? What loss function can be used?

> **Ex. 5.4.2**: What is the "bottleneck" of an autoencoder? What can it be used for?

> **Ex. 5.4.3**: Purely in terms of architecture, what is the difference between an autoencoder and a variational autoencoder (VAE)?

> **Ex. 5.4.4**: Regular autoencoders are trained to minimize a loss function with no regard to how the latent space is organized. Therefore, continuity is not guaranteed and similar datapoints may not be close to each other. We can thus say that the network is overfitting, because it uses any organization of training points in this space to minimize the loss, and is, therefore, not likely to work well with unseen data. VAEs are a regularized form of autoencoders, invented to solve this problem. Importantly, they guarantee that similar points are close in the latent space. How do they achieve this?
>
> * How are datapoints represented in the VAE latent space? What is the intuition behind this?
> * How is the loss function different? What is the purpose of the second term (the KL divergence)?
>
> *Hint: Check out this [blog post](https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73) and read the section "Intuitions about the regularisation"*

> **Ex. 5.4.5**: How is the latent vector sampled from the mean and standard deviation vectors? Explain the "reparameterization trick" and why it is necessary.

> **Ex. 5.4.6**: What is the motivation behind the disentangled VAE (or *$\beta$-VAE*)?
What happens if $\beta$ is too high? What happens when it is too small?


> **Ex. 5.4.EXTRA**: If you are curious about why such radical generalization
performance increases can be achieved by just including a single new hyperparameter
in the cost function, check out [the original paper](https://openreview.net/references/pdf?id=Sy2fzU9gl)
from Google DeepMind. In it, under "$\beta$-VAE FRAMEWORK DERIVATION" you will
find the intuition behind this small but powerful design modification.


> **Ex. 5.4.7**: Give some examples of what autoencoders can be used for. Creativity allowed.

### 5.5: Generative Adversarial Networks

Assuming you have watched [this video](https://www.youtube.com/watch?v=dCKbRCUyop8), answer the questions below. I also throw in some questions that link to other sources, to prompt you for a deeper understanding of some of the intuition behind GANs:

> **Ex. 5.5.1**: Explain in your own words how the GAN works. Touch upon:
>
> * What do the generator and discriminator networks do?
> * What are their respective input and output?
> * What would the accuracy of the discriminator be, faced with a perfect generator?

> **Ex. 5.5.2**: What is "progressive growing"?

> **Ex. 5.5.3**: In StyleGAN, what is the purpose of the mapping network?

> **Ex. 5.5.4**: How do you transform one image to another using backprop and
gradient descent? Why does this not always work that well? How is transfer learning
used to make it work?

> **Ex. 5.5.5**: From [19:20](https://www.youtube.com/watch?v=dCKbRCUyop8&feature=youtu.be&t=1160),
outline in bullets the pipeline for obtaining the latent vector for a query image.

> **Ex. 5.5.6**: So why go through all this trouble just to find, basically, the
point in the latent space that represents a given image? This gets explained at
[22:39](https://youtu.be/dCKbRCUyop8?t=1359). Summarize the idea and utility of
labeling the points in the latent space.

> **Ex. 5.5.7**: Besides modeling faces, can you give some examples of what GANs can be used for?