# Deep Learning in Python 
## Session 02 - Keras Advanced Concepts

- *Course*: Big Data and Language Technologies
- *Date*: 11.04.2022

This session covers more advanced concepts of Deep Learning in Python with Keras. We will build on the ideas from the last session and look at ways to customize the workflow in more detail. We will also learn how to solve some of the problems we ran into during the last session.

## Setup

In [1]:
import tensorflow as tf
import numpy as np

## Loading Data

This time, we will simply use a wrapper provided by Keras to load up the IMDB dataset that we explored in the last session. For reference, see the [API docs](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb/load_data).

In [2]:
INDEX_FROM = 3
NUM_WORDS = 1000  # keep only the most frequent words
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.imdb.load_data(
    num_words=NUM_WORDS, index_from=INDEX_FROM
)

Note that this already provides us with a train-test split.

This dataset is already built using word indices instead of word strings. For transforming text from and to indices using the word index, see [this example](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb/get_word_index#example).

**Exercise**: Explore the first 3 samples of X_train by converting them back to strings. What is going wrong? Why?

## `tf.data.Dataset`

Using `tf.data.Dataset`, we can represent very large datasets, which will become important later in the semester. TensorFlow handles many of the features needed for this internally.

**Exercise**: Use `tf.data.Dataset.from_generator` ([docs](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_generator)) to convert our ndarray-based dataset to a `tf.data.Dataset`. Provide an `output_signature` of the form `(X, y)`; the generator has to yield its samples in the same format.

Using `tf.data.Dataset.from_tensor_slices` is probably difficult here because the reviews have different lengths and the data is not padded yet.
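As a rough illustration of how `from_generator` and `output_signature` fit together, here is a minimal sketch on toy data. The names `toy_X`, `toy_y`, `toy_generator` and `toy_ds` are hypothetical stand-ins and not part of the exercise; the dtypes are an assumption as well.

```python
import numpy as np
import tensorflow as tf

# Toy stand-in for the IMDB arrays: variable-length index lists and binary labels.
toy_X = [[1, 14, 22], [1, 5], [1, 9, 9, 3]]
toy_y = [1, 0, 1]

def toy_generator():
    # Yield one (tokens, label) pair per sample, matching the signature below.
    for tokens, label in zip(toy_X, toy_y):
        yield np.array(tokens, dtype=np.int32), np.int32(label)

toy_ds = tf.data.Dataset.from_generator(
    toy_generator,
    output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.int32),  # sequence of unknown length
        tf.TensorSpec(shape=(), dtype=tf.int32),       # scalar label
    ),
)

first_tokens, first_label = next(toy_ds.as_numpy_iterator())
print(first_tokens, first_label)
```

The `None` in the first `TensorSpec` is what allows samples of different lengths to live in the same dataset.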

Converting the data back to numpy is easy:

In [None]:
next(train_ds.as_numpy_iterator())

## Dataset persistence

TensorFlow makes it quite easy to save and load a `tf.data.Dataset`.

### Using `tf.data.experimental.save` and `load`

`tf.data.experimental.save` ([docs](https://tensorflow.google.cn/api_docs/python/tf/data/experimental/save)) and `load` ([docs](https://tensorflow.google.cn/api_docs/python/tf/data/experimental/load)) can be used to persist a Dataset to storage. This will create multiple files (shards). In newer TensorFlow releases, the same functionality is also available as `Dataset.save` and `Dataset.load`.
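A minimal sketch of a save/load round trip on a toy dataset might look as follows. The temporary directory and the names `toy_ds`, `path` and `restored` are assumptions made for illustration. The sketch falls back to the experimental functions when `Dataset.save` is not available; very old TensorFlow versions may additionally require an `element_spec` argument for `load`.

```python
import os
import tempfile

import tensorflow as tf

toy_ds = tf.data.Dataset.range(5)

# Hypothetical target location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "toy_ds")

if hasattr(toy_ds, "save"):
    # Newer TensorFlow releases: methods on the Dataset class.
    toy_ds.save(path)
    restored = tf.data.Dataset.load(path)
else:
    # Older releases: the experimental functions used in this session.
    tf.data.experimental.save(toy_ds, path)
    restored = tf.data.experimental.load(path)

restored_values = [int(v) for v in restored.as_numpy_iterator()]
print(restored_values)
```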

**Exercise**: Save and load our dataset to storage.

Let's test it again:

In [None]:
next(train_ds.as_numpy_iterator())

Note: The [TFRecord format](https://www.tensorflow.org/tutorials/load_data/tfrecord) is the traditional method to save serialized data, which might save memory.

## `map` and `filter`

Using `map` ([docs](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map)) and `filter` ([docs](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#filter)) on a `tf.data.Dataset` is very convenient, because the functions are applied on the fly, only when elements are actually requested.

It is recommended to use the `tf.function` decorator ([docs](https://www.tensorflow.org/api_docs/python/tf/function)) to improve performance if possible.
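To show the general pattern, here is a small sketch that chains `filter` and `map` with `tf.function`-decorated functions on a toy range dataset. The function names `is_even` and `square` are made up for this illustration.

```python
import tensorflow as tf

@tf.function
def is_even(x):
    # Predicate for filter: must return a scalar boolean tensor.
    return tf.equal(x % 2, 0)

@tf.function
def square(x):
    # Transformation for map: applied to each remaining element.
    return x * x

toy_ds = tf.data.Dataset.range(6).filter(is_even).map(square)

result = [int(v) for v in toy_ds.as_numpy_iterator()]
print(result)  # [0, 4, 16]
```

Nothing is computed when the pipeline is defined; the functions only run once the iterator requests elements.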

**Exercise**: From the `train_ds`, filter out all reviews shorter than 100 tokens.

### \* Bonus: `flat_map`

`map` allows us to modify Dataset samples 1-to-1. If we want to split certain samples into a varying number of samples, we can use `flat_map` ([docs](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#flat_map)).
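A minimal sketch of the 1-to-many behavior, using a toy dataset in which every number `n` is expanded into `n` copies of itself (the names `toy_ds` and `expanded` are hypothetical):

```python
import tensorflow as tf

toy_ds = tf.data.Dataset.range(1, 4)  # elements 1, 2, 3

# The function passed to flat_map must return a Dataset for every element;
# the resulting datasets are flattened into one stream.
expanded = toy_ds.flat_map(
    lambda n: tf.data.Dataset.from_tensors(n).repeat(n)
)

values = [int(v) for v in expanded.as_numpy_iterator()]
print(values)  # [1, 2, 2, 3, 3, 3]
```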

**Exercise**: Use `flat_map` on `train_ds` to split up long reviews into samples of 100 tokens each.

In [None]:
it = train_ds.as_numpy_iterator()
for _ in range(3):
    print(next(it))

## Batch, shuffle, repeat

To make our dataset usable for training, we need to batch it (split it into batches), repeat it (so we can train for multiple epochs) and shuffle it (to avoid using the same order every time).

In this task, you will learn that the order of these operations indeed matters!

Let's create a dummy dataset:

In [12]:
DUMMY_DS_SIZE = 30
dummy_ds = tf.data.Dataset.range(DUMMY_DS_SIZE)
DUMMY_BATCHSIZE = 10
DUMMY_BUFFERSIZE = 2 * 10

**Exercise**: Roll the dice to determine the order in which you will implement shuffle, batch and repeat. Try to spot flaws in the results by inspecting 5 epochs.

In [13]:
print(np.random.choice(["Shuffle, repeat, batch",
                       "Repeat, shuffle, batch",
                       "Batch, shuffle, repeat"]))

Repeat, shuffle, batch


### Applying what we found out

**Exercise**: Shuffle, repeat and batch (using `padded_batch`) our `train_ds`.

In [17]:
BATCHSIZE = 64
BUFFERSIZE = 2 * 64
train_ds = ...  # TODO
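To illustrate what `padded_batch` does on its own, here is a sketch on a toy dataset of variable-length sequences. The dataset `ragged_ds` is a made-up stand-in, and the sketch deliberately leaves out shuffling and repeating.

```python
import tensorflow as tf

# Toy dataset of sequences with lengths 1 to 4: [1], [2, 2], [3, 3, 3], [4, 4, 4, 4]
ragged_ds = tf.data.Dataset.range(1, 5).map(lambda n: tf.fill([n], n))

# Each batch is padded with zeros up to the longest sequence in that batch.
padded_ds = ragged_ds.padded_batch(2)

batches = list(padded_ds.as_numpy_iterator())
print(batches[0])  # [[1 0]
                   #  [2 2]]
print(batches[1].shape)  # (2, 4)
```

Because padding happens per batch, different batches can end up with different sequence lengths.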

## Custom Layers

Keras allows you to define custom layers. This is useful for:
1. Combining multiple pre-defined layers into a single custom layer
2. Defining the layer weights explicitly
3. Modifying gradients
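
As an illustration of the first use case, the following sketch combines a `Dense` and a `Dropout` layer into a single custom layer. The class name `DenseDropoutBlock` and its parameters are hypothetical and only serve as an example.

```python
import tensorflow as tf

class DenseDropoutBlock(tf.keras.layers.Layer):
    """Hypothetical block: a ReLU dense layer followed by dropout."""

    def __init__(self, units, rate=0.5, **kwargs):
        super().__init__(**kwargs)
        self.dense = tf.keras.layers.Dense(units, activation="relu")
        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, inputs, training=False):
        x = self.dense(inputs)
        # Dropout is only active when training=True is passed.
        return self.dropout(x, training=training)

block = DenseDropoutBlock(units=4)
outputs = block(tf.ones((2, 3)))
print(outputs.shape)  # (2, 4)
```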

### "Custom" dense layer

**Exercise**: Re-implement a dense layer using a subclass of the `tf.keras.layers.Layer` class ([docs](https://keras.io/api/layers/base_layer/)).

### \* Bonus: "Custom" dropout layer

**Exercise**: Re-implement a dropout layer using a subclass of the `Layer` class ([docs](https://keras.io/api/layers/base_layer/)).

If you want to dive deeper into defining custom layers, see [this guide](https://keras.io/guides/making_new_layers_and_models_via_subclassing/).

## Custom Loss

Sometimes the predefined losses provided by TensorFlow/Keras do not fully fit our needs. See the [docs](https://keras.io/api/losses/#creating-custom-losses) for how to create custom losses based on `y_true` and `y_pred`.
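
The general shape of such a loss can be sketched as a plain function of `y_true` and `y_pred` that returns one value per sample. The example below re-implements a mean squared error purely for illustration; the function name is made up.

```python
import tensorflow as tf

def my_squared_error(y_true, y_pred):
    # Cast the labels so that both tensors share the same dtype.
    y_true = tf.cast(y_true, y_pred.dtype)
    # Reduce over the last axis to get one loss value per sample.
    return tf.reduce_mean(tf.square(y_true - y_pred), axis=-1)

y_true = tf.constant([[0.0, 1.0], [1.0, 1.0]])
y_pred = tf.constant([[0.5, 1.0], [1.0, 0.0]])

per_sample = my_squared_error(y_true, y_pred).numpy()
print(per_sample)  # [0.125 0.5]
```

A function like this can be passed to `model.compile` via the `loss` argument.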

**Exercise**: Define a custom loss that computes a weighted crossentropy to rebalance the classes (label `0` and `1`).

### The `add_loss()` API

Regularization losses are not based solely on a comparison of `y_true` and `y_pred`. The `add_loss()` API lets us use layer weights in the loss computation. See the [docs](https://keras.io/api/losses/#the-addloss-api).
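
A minimal sketch of this idea, assuming a hypothetical layer `PenalizedScale` that scales its inputs by a trainable vector and adds an L2-style penalty on that vector:

```python
import tensorflow as tf

class PenalizedScale(tf.keras.layers.Layer):
    """Hypothetical layer: element-wise scaling with a penalty on the scale weights."""

    def __init__(self, rate=0.1, **kwargs):
        super().__init__(**kwargs)
        self.rate = rate

    def build(self, input_shape):
        self.scale = self.add_weight(
            name="scale",
            shape=(input_shape[-1],),
            initializer="ones",
            trainable=True,
        )

    def call(self, inputs):
        # The penalty depends only on the layer weights, not on y_true or y_pred.
        self.add_loss(self.rate * tf.reduce_sum(tf.square(self.scale)))
        return tf.multiply(inputs, self.scale)

layer = PenalizedScale(rate=0.1)
_ = layer(tf.ones((2, 3)))

penalty = float(layer.losses[0])
print(penalty)  # roughly 0.3 (0.1 * three weights initialized to 1)
```

During `fit()`, losses registered this way are added to the main loss.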

## Custom Training Loops

By subclassing `tf.keras.Model`, we can customize what happens during `fit()` at a more fine-grained level than with callbacks.

Further details on customizing the behavior of `fit()` can be found in [this guide](https://keras.io/guides/customizing_what_happens_in_fit/).
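
As a rough sketch of the idea, the following hypothetical `CustomModel` overrides `train_step` and computes its own loss with a `GradientTape`. The toy data, the module-level `loss_fn` and the choice to report the raw batch loss (instead of a running average) are simplifying assumptions, not the approach prescribed by the guide.

```python
import numpy as np
import tensorflow as tf

loss_fn = tf.keras.losses.MeanSquaredError()

class CustomModel(tf.keras.Model):
    def train_step(self, data):
        # fit() passes one batch at a time; here it is an (x, y) tuple.
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = loss_fn(y, y_pred)
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        # The returned dict is what fit() logs and stores in the history.
        return {"loss": loss}

inputs = tf.keras.Input(shape=(4,))
outputs = tf.keras.layers.Dense(1)(inputs)
model = CustomModel(inputs, outputs)
model.compile(optimizer="adam")  # no loss here, train_step computes it

x = np.random.random((64, 4)).astype("float32")
y = np.random.random((64, 1)).astype("float32")
history = model.fit(x, y, epochs=1, batch_size=16, verbose=0)
print(history.history["loss"])
```

Everything else that `fit()` offers, such as callbacks and batching, keeps working unchanged.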

## TensorBoard

TensorBoard is a browser application that lets you monitor the training progress. To view the generated logs, run the following commands:

In [None]:
%load_ext tensorboard
%tensorboard --logdir ./logs
# alternatively: !tensorboard --logdir ./logs

We use a special callback to generate the data that TensorBoard will be visualizing:

In [23]:
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="./logs", update_freq=1)
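
The callback takes effect once it is passed to `fit()`. A minimal sketch with a toy model and random data, using a temporary directory as a stand-in for `./logs` (the model, the data and the variable names are assumptions for illustration):

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

log_dir = tempfile.mkdtemp()  # stand-in for "./logs"
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, update_freq=1)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

x = np.random.random((64, 4)).astype("float32")
y = np.random.random((64, 1)).astype("float32")

# The callback writes event files to log_dir while the model trains.
model.fit(x, y, epochs=1, batch_size=16, verbose=0, callbacks=[tensorboard_callback])

log_entries = os.listdir(log_dir)
print(log_entries)
```

Pointing `--logdir` at the same directory makes these logs visible in TensorBoard.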

## Assembling everything