In [None]:
"""
You need to run this cell for the code in following cells to work.
"""

# Enable module reloading
%load_ext autoreload
%autoreload 2

%load_ext tensorboard

import datetime
import os
import sys
sys.path.append('..')

import tensorflow as tf
import tensorflow.keras as keras
import numpy as np

from week_4.backstage.load_data import load_data
from week_4.backstage.utils import *

# Week 4

__Goals for this week__

We start working with _TensorFlow_ - a modern deep learning framework. We will implement a multilayer perceptron model and train it.

__Feedback__

This lab is a work in progress. If you notice a mistake, let us know or make a pull request. Please also fill in the [questionnaire](https://forms.gle/r27nBAvnMC7jbjJ58) after you finish this lab to give us feedback.

## Reminder

__You submit your project proposal in a week.__

## TensorFlow

TensorFlow (TF) is a state-of-the-art framework for neural network development, training and deployment. It provides:

1. Basic building blocks for models - layers, loss functions, activation functions, etc. Most of the neural models are built from very common building blocks.
2. Auto-differentiation. We do not have to calculate the derivatives of the loss function w.r.t. the parameters by hand. Instead, these quantities are derived automatically. Training algorithms, such as SGD, are also available via a handy API.
3. Toolset for visualization, deployment, distributed computing, etc.

We will use _TensorFlow 2.0_ in our labs.
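
The auto-differentiation point from the list above can be sketched on a toy function. This is a minimal illustration, not part of the lab code: the function $y = x^2 + 2x$ and the value $x = 3$ are chosen only for the example.

```python
import tensorflow as tf

x = tf.Variable(3.0)  # a scalar parameter we differentiate with respect to

with tf.GradientTape() as tape:
    y = x * x + 2.0 * x  # y = x^2 + 2x, recorded by the tape

dy_dx = tape.gradient(y, x)  # analytically 2x + 2, i.e. 8 at x = 3
print(dy_dx.numpy())
```

We never wrote the derivative ourselves; TF derived it from the recorded operations.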

### Tensors

`Tensor` is the basic TF type. Constants created with `tf.constant` are immutable, while variables created with `tf.Variable` can be changed later. This is why variables are used as model parameters. Notice that each tensor has a `shape` and a `dtype`:

In [None]:
np_array = np.arange(15, dtype=np.float64).reshape(5, 3)  # np.float was removed in recent NumPy versions
print(np_array)

tf_tensor = tf.constant(np_array)
print(tf_tensor)
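
The difference between constants and variables can be sketched as follows. The values are made up for illustration; `assign` and `assign_add` are the standard `tf.Variable` methods for in-place updates.

```python
import numpy as np
import tensorflow as tf

c = tf.constant([1.0, 2.0])  # immutable tensor
v = tf.Variable([1.0, 2.0])  # mutable, suitable for model parameters

v.assign([3.0, 4.0])         # overwrite the stored values in place
v.assign_add([1.0, 1.0])     # in-place addition
print(v.numpy())

# Each tensor carries a shape and a dtype. A tensor built from a NumPy
# array inherits the dtype of that array.
from_numpy = tf.constant(np.arange(3, dtype=np.float64))
print(c.shape, c.dtype)
print(from_numpy.shape, from_numpy.dtype)
```

Python floats default to `float32` in TF, while the NumPy array above yields a `float64` tensor. Mixing the two dtypes in one operation raises an error, so it is worth checking `dtype` when something does not add up.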


We can do the usual matrix operations with these tensors, e.g. $\sigma(\mathbf{Wx} + \mathbf{b})$. Common operators, such as `+` for addition or `@` for matrix multiplication are supported:

In [None]:
def sigma(x):
    return 1 / (1 + tf.exp(-x))

w = tf.constant([
    [0.1, 0.2, 0.3],
    [0.2, 0.1, 0.3],
    [0.3, 0.1, 0.2]
])

x = tf.constant([
    [0.5],
    [-0.3],
    [0.2],
])

b = tf.constant([
    [0.3],
    [-0.4],
    [0.2],
])

print(sigma(w @ x + b))


Note that we need to define $\mathbf{x}$ and $\mathbf{b}$ as 2D tensors with shape `(3, 1)` if we want them to behave like column vectors.
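
Why the shape matters can be sketched with NumPy, used here only as a stand-in under the assumption that TF's `+` follows the same broadcasting rules. If the bias is a flat vector of shape `(3,)`, the addition silently broadcasts to a matrix instead of producing a column vector.

```python
import numpy as np

w = np.full((3, 3), 0.1)                  # placeholder weight matrix
x_col = np.array([[0.5], [-0.3], [0.2]])  # shape (3, 1), a column vector
b_col = np.array([[0.3], [-0.4], [0.2]])  # shape (3, 1), a column vector
b_flat = np.array([0.3, -0.4, 0.2])       # shape (3,), a flat vector

print((w @ x_col + b_col).shape)   # (3, 1): the intended result
print((w @ x_col + b_flat).shape)  # (3, 3): broadcasting produced a matrix
```

No error is raised in the second case, which makes this kind of mistake easy to miss.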

### Defining models

We will use a high-level API for model definition called `keras`. Within this API we have the `tf.keras.Model` class for models. A model is the basic unit that transforms an input $x$ into an output $\hat{y}$ and that can be trained via SGD or a similar algorithm.

We will define it using predefined `layers`. Layers are reusable atomic units of computation, e.g. `Dense` is an implementation of the MLP layer equation $\sigma(\mathbf{Wx} + \mathbf{b})$.

In [None]:
class MultilayerPerceptron(keras.Model):  # Subclassing
    
    def __init__(self, dim_output, dim_hidden):  # init is used to initialize the layers we will use
        super(MultilayerPerceptron, self).__init__(name='multilayer_perceptron')
        self.dim_output = dim_output
        self.dim_hidden = dim_hidden

        self.layer_1 = keras.layers.Dense(
            units=dim_hidden)  # units = how many neurons in the layer
        self.layer_2 = keras.layers.Dense(
            units=dim_output)

    def call(self, x):  # call defines the flow of the computation, e.g. in this particular model
                        # we simply call the two layers one after the other
        h = self.layer_1(x)
        y = self.layer_2(h)
        return y


__Exercise 4.1:__ Check the [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) of the `Dense` layer. Its arguments give us many options; you should know at least the first five. What is missing in the definition used above?

### Training models

We will use this model to classify the _Iris_ dataset from the previous lab. Training models defined like this is really easy:

In [None]:
data = load_data('iris.csv', num_classes=3)

# Reminder how these data look
for x, y in list(zip(data.x, data.y))[:5]:  # First 5 samples
    print(x, y)

In [None]:
model = MultilayerPerceptron(  # We create a new model
    dim_output=3,
    dim_hidden=32)

model.compile(  # By compiling we prepare the model for training
    optimizer=keras.optimizers.SGD(learning_rate=0.003),  # We pick an optimizer algorithm
    loss='mean_squared_error',  # We pick a loss function
    metrics=['accuracy'])  # We pick evaluation metrics

model.fit(  # Fit runs the training over provided data
    x=data.x,
    y=data.y,
    batch_size=4,
    epochs=20)


This is the selling point of modern neural frameworks. The model is trained via SGD, but we do not need to calculate the derivatives. Instead, they are calculated automatically by TF. We also do not need to program how SGD works, nor do we need to define the loss functions or metrics.

Everything we did manually last week is now hidden behind the `fit` function. You should already be familiar with all the concepts used in the code above, such as `epochs`, `batch_size`, `metrics`, `loss`, `optimizer`, etc.
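
To make concrete what `fit` hides, here is a minimal sketch of a single SGD update for one parameter. The sample `(x, y)`, the learning rate and the one-parameter model `w * x` are assumptions made for illustration. It uses `tf.GradientTape`, which is covered in more detail later in this lab.

```python
import tensorflow as tf

w = tf.Variable(0.0)  # single model parameter
x, y = 2.0, 4.0       # one hypothetical training sample
learning_rate = 0.1

with tf.GradientTape() as tape:
    loss = (w * x - y) ** 2  # squared error of the prediction w * x

grad = tape.gradient(loss, w)       # analytically 2 * (w * x - y) * x = -16
w.assign_sub(learning_rate * grad)  # SGD update: w <- w - learning_rate * grad
print(w.numpy())                    # approximately 1.6
```

`fit` repeats this kind of update for every parameter and every batch, and also tracks the metrics along the way.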

### Programming assignment 4.2: Multilayer Perceptron [1pt]

Extend the `MultilayerPerceptron` definition above so that the model can have an arbitrary number of layers and an arbitrary activation function. Check the call below to see what it should look like:

In [None]:
model = MultilayerPerceptron(
    dim_output=3,
    dim_hidden=32,
    num_layers=3,
    activation=keras.activations.sigmoid)

# compile and fit are the same as above
model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.01),
    loss='mean_squared_error',
    metrics=['accuracy'])

model.fit(
    x=data.x,
    y=data.y,
    batch_size=4,
    epochs=20)


### Building blocks

Apart from automatic training, TF also provides us with a lot of pre-programmed parts, such as:

- [Loss functions](https://www.tensorflow.org/api_docs/python/tf/keras/losses), such as mean squared error.
- [Activation functions](https://www.tensorflow.org/api_docs/python/tf/keras/activations), such as sigmoid or ReLU.
- [Optimizers](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers), such as SGD.
- [Metrics](https://www.tensorflow.org/api_docs/python/tf/keras/metrics), such as accuracy.
- [Layers](https://www.tensorflow.org/api_docs/python/tf/keras/layers), such as Dense layer, which is basically the MLP layer $\sigma(\mathbf{Wx} + \mathbf{b})$.
- [Initializers](https://www.tensorflow.org/api_docs/python/tf/keras/initializers), such as Glorot (Xavier) initialization.
- and many other features.

You have seen an example of each of these parts in the code above. E.g. we can use `loss='mean_squared_error'` because a loss function with that name is defined in `keras.losses`. You should be able to program most of your projects with these pre-programmed building blocks, but you can of course define your own blocks by following the documentation.

### Evaluation

`fit` also supports holding out a test set, called a `validation set` in TF. With the `validation_split` argument, `fit` sets aside part of the data for evaluation. This is the same concept we used last week. Check the [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit) of `fit` to see what other options it has - you should understand most of them. Run the following code to see what it looks like:


In [None]:
model = MultilayerPerceptron(
    dim_output=3,
    dim_hidden=32,
    num_layers=3,
    activation=keras.activations.sigmoid)

model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.003),
    loss='mean_squared_error',
    metrics=['accuracy'])

model.fit(
    x=data.x,
    y=data.y,
    batch_size=4,
    epochs=20,
    validation_split=0.2)  # This was added
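
One detail worth knowing: according to the `fit` documentation, `validation_split` takes the validation samples from the end of the provided data, before any shuffling. The sketch below mimics that behaviour with NumPy on hypothetical labels sorted by class. Whether `load_data` already shuffles the Iris samples is not shown here, so treat the sorted order as an assumption.

```python
import numpy as np

num_samples = 150
validation_split = 0.2
labels = np.repeat([0, 1, 2], 50)  # hypothetical labels, sorted by class

# Index where the validation part starts (a simplified version of what fit does)
split_at = round(num_samples * (1 - validation_split))

# With sorted data, the held-out tail contains only the last class
print(np.unique(labels[split_at:]))

# Shuffling beforehand mixes the classes in the held-out tail
rng = np.random.default_rng(seed=0)
shuffled = labels[rng.permutation(num_samples)]
print(np.unique(shuffled[split_at:]))
```

If the validation accuracy looks suspiciously bad or good, checking the order of the samples is a cheap first step.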


## TensorBoard

_TensorBoard_ (TB) is a great visualization tool for training TF models. First, we need to tell the model that it should create TB-related logs during training. The easiest way is to use callbacks: 

In [None]:
model = MultilayerPerceptron(
    dim_output=3,
    dim_hidden=32,
    num_layers=3,
    activation=keras.activations.sigmoid)

model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.003),
    loss='mean_squared_error',
    metrics=['accuracy'])

tensorboard_callback = keras.callbacks.TensorBoard(
    log_dir=os.path.join("logs", timestamp()),
    histogram_freq=1)

model.fit(
    x=data.x,
    y=data.y,
    batch_size=4,
    epochs=20,
    validation_split=0.2,
    callbacks=[tensorboard_callback],  # Callback
    verbose=0)  # Suppressing text output
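
The `timestamp` helper comes from the lab's `utils` module and is not shown in this notebook. A plausible stand-in is sketched below, assuming it returns a date-time string. The name `make_log_dir` and the format string are hypothetical. The point is that every run gets its own subdirectory, so TensorBoard can show the runs separately.

```python
import datetime
import os


def make_log_dir(root="logs"):
    """Return a unique log directory path for one training run (hypothetical helper)."""
    run_name = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    return os.path.join(root, run_name)


log_dir = make_log_dir()
print(log_dir)
```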

Now we can visualize the results with TensorBoard. We can start it from the terminal, or directly from Jupyter (run the next cell). After you start TB, you can access it from the notebook, or more conveniently directly from your browser at http://localhost:6006.

The first two tabs, _Scalars_ and _Graphs_, are the most interesting for you right now. The first shows how various quantities change over the epochs, for both the train and the validation data. You can also see the results of multiple runs at the same time. Run the training above again, but change some hyperparameters, e.g. the learning rate. Then you can directly compare the results in TB.

_Graphs_ shows a graph of your model, i.e. how it computes its results. By double-clicking you can open individual parts and see how they are defined, e.g. open your model and then a dense layer within it to see how it is defined.

In [None]:
%tensorboard --logdir logs --bind_all

### Programming assignment 4.3: Multilayer Perceptron, Part 2 [1pt]

First, run the current model with TB callback to get some results. Then implement the following changes:

1. Use _softmax_ as an activation function for the __last__ layer.
2. Use _categorical crossentropy_ as a loss function

You can find both these functions in the links of pre-programmed building blocks above. Then compare the results of this new implementation in TensorBoard.
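
To see what these two functions compute, here is a small NumPy sketch for a single sample. It is an illustration of the math only, not the Keras implementation. The logits and the one-hot label are made-up values.

```python
import numpy as np


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtracting the max improves numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)


def categorical_crossentropy(y_true, y_pred):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -np.sum(y_true * np.log(y_pred))


logits = np.array([2.0, 1.0, 0.1])  # hypothetical outputs of the last layer before activation
y_true = np.array([1.0, 0.0, 0.0])  # one-hot label: the sample belongs to class 0

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)
print(probs, probs.sum())
print(loss)
```

With a one-hot label, the loss reduces to the negative log of the probability assigned to the correct class, so it is small only when the model is confident and correct.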

#### Submission

Save the code for your `MultilayerPerceptron` and your training commands in a `mlp.py` file and submit it to AIS. You need to complete __PA 4.2__ before proceeding to __PA 4.3__. There are no tests this week, so consult your teacher about the submission if needed.

## Gradient Tape

`fit` is a very convenient way of training neural models, but sometimes we need more flexibility and control. For example, with `fit` we cannot track the training step by step (e.g. for debugging). The model is compiled into a computation graph in the background, so a debugging `print` placed inside the model will not show the actual values at each step. E.g., try printing the value of `h` in the model's `call`.

Instead we can use the so-called `GradientTape`. With this tape the debugging print of `h` will run. Check the following code; it is very similar to how we defined SGD in previous labs:

In [None]:
model = MultilayerPerceptron(
    dim_output=3,
    dim_hidden=32)

optimizer = keras.optimizers.SGD(learning_rate=0.01)
loss_function = keras.losses.MeanSquaredError()

# loss_function = keras.losses.CategoricalCrossentropy()
# You can use cross-entropy loss if you completed PA 4.3
    
def step(xs, ys):  # This has the same meaning as step function in previous labs
    
    with tf.GradientTape() as tape:
        preds = model(xs)  # Model predictions
        loss = loss_function(ys, preds)  # The value of loss function comparing the true
                                         # values ys with predictions

    gradient = tape.gradient(
        target=loss,
        sources=model.trainable_variables)  # Calculate the gradient of loss function w.r.t. model parameters.
                                            # This behaves the same as gradient methods from previous labs.
        
    optimizer.apply_gradients(zip(gradient, model.trainable_variables))  # Applies the computed gradient on current
                                                                         # parameter values.
    
def loss(xs, ys):
    preds = model(xs)
    return loss_function(ys, preds)
    
num_epochs = 100
batch_size = 5
num_samples = len(data.x)

# Training loop
for e in range(num_epochs):
    for i in np.arange(0, num_samples, batch_size):  # Batching
        step(data.x[i:i+batch_size], data.y[i:i+batch_size])
    print('Epoch:', e, 'Loss:', loss(data.x, data.y).numpy())
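
The graph compilation mentioned at the start of this section can be reproduced in isolation with `tf.function`. The sketch below is an illustration only; the function `double` and the `trace_count` list are hypothetical names. Python side effects run only while the graph is being traced, whereas `tf.print` becomes part of the graph and runs on every call.

```python
import tensorflow as tf

trace_count = []  # records how many times the Python body actually runs


@tf.function
def double(x):
    trace_count.append(1)   # Python side effect: executed only during tracing
    tf.print("value:", x)   # graph operation: executed on every call
    return 2 * x


first = double(tf.constant(1.0))
second = double(tf.constant(2.0))  # same dtype and shape, so the graph is reused

print(len(trace_count))  # 1: the Python body ran only once
```

In the training loop above, `step` is a plain Python function, so everything in it runs eagerly and ordinary `print` calls work. If you need the same behaviour with `fit`, `compile` accepts `run_eagerly=True`, at the cost of slower training.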
        

## Further Reading

Check out TF [Tutorials](https://www.tensorflow.org/tutorials) and [Guide](https://www.tensorflow.org/guide) for some further reading. Note that all the documents there are in fact Jupyter notebooks, so you can download them and run them here. At this point, you should check:

- _Tutorials > ML basics with Keras_ - for some basic practical use cases.

And you can also check:

- _Guide > Keras_ - for more in-depth explanation of how TF works.

You can also check this [notebook](https://colab.research.google.com/drive/1UCJt8EYjlzCs1H1d1X0iDGYJsHKwu-NO) showcasing and explaining additional features.

## Correct Answers

__E 4.1:__ The activation function is missing. `Dense` uses linear activation by default. We need to use the `activation` argument to add an activation function. There we can use our own activation functions, such as `sigma` defined previously, or some of the pre-programmed activation functions from the `tf.keras.activations` module.
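
The answer can be checked with a short sketch. The input values are made up, and the weights are read back with `get_weights()` so the comparison is done in NumPy. Note that Keras treats each input sample as a row, so the layer computes `x @ W + b` rather than $\mathbf{Wx} + \mathbf{b}$.

```python
import numpy as np
import tensorflow as tf

x = tf.constant([[0.5, -0.3, 0.2]])  # one sample with three features (a row)

# Dense without the activation argument: a purely linear layer
linear_layer = tf.keras.layers.Dense(units=2)
y = linear_layer(x)                   # the first call creates the weights
kernel, bias = linear_layer.get_weights()
manual = x.numpy() @ kernel + bias    # the same computation done by hand

# Dense with an activation: outputs are squashed by the sigmoid
sigmoid_layer = tf.keras.layers.Dense(
    units=2, activation=tf.keras.activations.sigmoid)
z = sigmoid_layer(x)

print(np.allclose(y.numpy(), manual, atol=1e-5))
print(z.numpy())
```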