In [53]:
"""
You need to run this cell for the code in following cells to work.
"""

# Enable module reloading
%load_ext autoreload
%autoreload 2

# Enable interactive plots
%matplotlib notebook

import sys
sys.path.append('..')

# Week 8

__Goals for this week__

We will discuss recurrent neural networks in this lab. They are useful for processing sequences of data, such as sentences, time series, etc. We will also talk about word representations for natural language processing.

__Feedback__

This lab is a work in progress. If you notice a mistake, let us know or even make a pull request. Please also fill in the [questionnaire](https://forms.gle/r27nBAvnMC7jbjJ58) after you finish this lab to give us feedback.


## Recurrent Neural Networks

_Recurrent neural networks_ (RNNs) are the last major neural architecture we will talk about during our labs. They are used to process data sequences. One-dimensional CNNs can also be used for sequence processing, but RNNs should be better at modeling long-term dependencies between individual inputs. RNNs are also more versatile for sequence data, e.g. they can be used for tasks that expect a sequence as an output, or that expect a separate label for each input.

The _recurrent cell_ lies at the heart of RNNs. A cell is the basic operation performed as we process one step from a series of $N$ inputs $\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_N$. For step $i$, the cell looks like this:

   y
> cell >
   x
   
- $\mathbf{x}_i$ is the $i$-th input
- $\mathbf{y}_i$ is the $i$-th output
- $\mathbf{s}_i$ is the state of the cell for the $i$-th step
   
All of these quantities are vectors. As the figure above illustrates, at each step the cell depends on two inputs: the input for the step itself and the state of the cell from the previous step. Because the cell "sees" the state from previous steps, the layer can process the current step while using the knowledge about all the previous steps.

We can imagine the recurrent layer as a series of cell operations:

   y       y
> cell > cell >
   x       x
   (with the path highlighted)
   
Note that when we follow the flow of computation leading to output $\mathbf{y}_i$, we can see that it depends on all the previous inputs $\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_i$. It also depends on the initial state $\mathbf{s}_0$, which is usually a trainable parameter vector of the model.

### Recurrent cell variants

Multiple variants of recurrent cells exist, e.g. this is the definition of the _Elman cell_:

$$
\mathbf{h}_i = \sigma(\mathbf{W}_{in}\mathbf{x}_i + \mathbf{W}_{hid}\mathbf{s}_{i-1} + \mathbf{b}_{hid}) \\
\mathbf{y}_i = \sigma(\mathbf{W}_{out}\mathbf{h}_i + \mathbf{b}_{out})  \\
\mathbf{s}_i = \mathbf{h}_i \\
$$

The definitions of $\mathbf{h}_i$ and $\mathbf{y}_i$ are very similar to an MLP; the only difference is that $\mathbf{h}_i$ also depends on the state of the cell from the previous step. In this case the state $\mathbf{s}_i$ is simply the value of the hidden layer $\mathbf{h}_i$ within the cell. Note that the same parameters (weights and biases) are used for each step. The computation done by a cell is the same for each step, only the inputs of the cell ($\mathbf{x}_i$ and $\mathbf{s}_{i-1}$) differ.
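The equations above can be sketched in NumPy as follows. This is only a minimal illustration: the dimensions, the random initialization and the function names (`elman_step`, `elman_forward`) are assumptions made for this example, and $\sigma$ is taken to be the logistic sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elman_step(x, s_prev, params):
    """One step of the Elman cell: returns (y_i, s_i)."""
    h = sigmoid(params['W_in'] @ x + params['W_hid'] @ s_prev + params['b_hid'])
    y = sigmoid(params['W_out'] @ h + params['b_out'])
    return y, h  # the new state s_i is simply h_i

def elman_forward(xs, s0, params):
    """Apply the same cell (same parameters) to every step of the sequence."""
    s, ys = s0, []
    for x in xs:
        y, s = elman_step(x, s, params)
        ys.append(y)
    return np.stack(ys), s

# Hypothetical sizes, chosen only for illustration
rng = np.random.default_rng(0)
in_dim, hid_dim, out_dim, n_steps = 3, 4, 2, 5
params = {
    'W_in': rng.normal(size=(hid_dim, in_dim)),
    'W_hid': rng.normal(size=(hid_dim, hid_dim)),
    'b_hid': np.zeros(hid_dim),
    'W_out': rng.normal(size=(out_dim, hid_dim)),
    'b_out': np.zeros(out_dim),
}
xs = rng.normal(size=(n_steps, in_dim))
ys, s_last = elman_forward(xs, np.zeros(hid_dim), params)
print(ys.shape, s_last.shape)  # (5, 2) (4,)
```

Note that `params` is created once and reused in every call of `elman_step`; only `x` and the state change between steps.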

Simple cells like these do not perform very well. The information is transformed by a matrix multiplication at each time step. This tends to dilute the information, and the network "forgets" what it has seen in the past. This limits the use of simple recurrent cells to relatively short sequences. Simple cells are also quite unstable to train because they suffer from the so-called _exploding/vanishing gradient_ [FIXME: reference] problem.

Instead of these simple cells, we usually use more complex cells that were developed to address the issues we mentioned. The most common of these are _LSTM_ and _GRU_ cells [FIXME: references]. Check the further reading section if you are interested in why they tend to work better than vanilla recurrent cells.
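The effect of the repeated matrix multiplication can be illustrated with a deliberately simplified sketch. We assume a purely linear recurrence (the activation function is ignored) and a hidden weight matrix that is a scaled orthogonal matrix, so that the norm of a back-propagated vector changes by exactly the factor `scale` at every step. The helper `norm_after` is hypothetical and exists only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Orthogonal matrix: multiplying by it preserves the norm of a vector
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))

def norm_after(scale, n_steps):
    """Norm of a vector after it is multiplied n_steps times by (scale * Q)^T."""
    W = scale * Q
    v = np.ones(4)  # starting vector with norm 2
    for _ in range(n_steps):
        v = W.T @ v
    return np.linalg.norm(v)

for scale in (0.9, 1.0, 1.1):
    print(scale, norm_after(scale, 50))
```

With a scale below 1 the norm shrinks towards zero (vanishing), with a scale above 1 it grows quickly (exploding). In a real cell the activation derivatives also enter the product, so this is only an intuition, not a faithful model of training.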

### Training

The operations used in RNNs are very similar to the operations in an MLP. We use matrix multiplication, addition and activation functions, and that is basically all there is to it. The training routine is therefore also quite similar to that of an MLP. Again, we use backpropagation to calculate the derivatives of the loss function w.r.t. each parameter (here the gradient flows backwards through all the time steps) and _stochastic gradient descent_ to update the parameters.

### Recurrent architectures

There are multiple ways of using recurrent layers depending on the nature of the task we want to solve. In all the following examples the recurrent layer is the same; we only work differently with the inputs and outputs of this layer to get it to do what we want.

#### Many to one

We feed the recurrent layer until we process the whole input. Then we use the result of this pass to get a label. We use this type of RNN to do:

- Sequence classification - We want to assign a label to a sequence (e.g. text classification, event detection).
- Single-hop prediction - We want to predict a single next value.

For the final computation we can either:

- Discard all the outputs except the last one, $\mathbf{y}_N$, and use only this output.
- Pool all the outputs $\mathbf{y}_i$ using max-pooling or, more often in this case, mean-pooling.
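The two options can be sketched on a small array of outputs. The numbers below are made up; they stand in for the outputs $\mathbf{y}_1, ..., \mathbf{y}_N$ of a recurrent layer with $N = 4$ steps and 3 features per step.

```python
import numpy as np

# Hypothetical outputs y_1..y_N of a recurrent layer, shape (time_steps, features)
ys = np.array([
    [0.1, 0.5, 0.3],
    [0.4, 0.2, 0.9],
    [0.7, 0.1, 0.2],
    [0.2, 0.6, 0.4],
])

last_output = ys[-1]           # keep only y_N
mean_pooled = ys.mean(axis=0)  # average over the time axis
max_pooled = ys.max(axis=0)    # element-wise maximum over the time axis

print(last_output)
print(mean_pooled)
print(max_pooled)
```

All three results have the same fixed size regardless of the sequence length, so any of them can be fed to a dense layer that produces the label.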

#### One to many

We feed the recurrent layer with one value and we expect it to produce multiple values. We use this type for:

- Generation tasks - We want to generate a series of values based on a prompt (e.g. image captioning, music generation).

The cells after the first one also expect some input $\mathbf{x}_{i>1}$, otherwise they cannot compute anything. We can either reuse the input of the first cell $\mathbf{x}_1$, or feed them the output from the previous step $\mathbf{y}_{i-1}$.
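Both ways of supplying the later inputs can be sketched with one loop. The function `generate` and the toy cell `toy_step` are hypothetical helpers for this illustration; the toy cell is chosen so that its output has the same size as its input, which is needed when the output is fed back.

```python
import numpy as np

def generate(step, x1, s0, n_steps, feed_back=True):
    """Unroll a cell for n_steps starting from a single input x1.

    step(x, s) -> (y, s_new) is any recurrent cell.
    feed_back=True uses y_{i-1} as the next input, otherwise x1 is repeated.
    """
    x, s, ys = x1, s0, []
    for _ in range(n_steps):
        y, s = step(x, s)
        ys.append(y)
        x = y if feed_back else x1
    return np.stack(ys)

# Toy cell (an assumption): output and state have the same size as the input
rng = np.random.default_rng(0)
W_in, W_hid = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))

def toy_step(x, s):
    h = np.tanh(W_in @ x + W_hid @ s)
    return h, h

ys = generate(toy_step, x1=np.ones(3), s0=np.zeros(3), n_steps=6)
print(ys.shape)  # (6, 3)
```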

#### Many to many

We feed the recurrent layer with multiple values and we expect an output for each of them. We use this type for:

- Input tagging - We want to assign each input to a class (e.g. part-of-speech tagging, event scope detection).

#### Sequence to sequence

We feed the recurrent layer a series of inputs and we expect a series as an output. We use this type for:

- Multi-hop prediction - We want to generate multiple values as a prediction.

### Advanced recurrent architectures

The architectures mentioned above show how a single recurrent layer can be used. In this section we show some examples of how to combine multiple layers for various use cases.

#### Bi-directional recurrent network

We can combine two recurrent layers: one processes the data from start to end, while the other goes backwards from end to start. We simply combine the outputs of these two layers for each time step. The advantage of this combination is that for each time step the following layer "sees" all the inputs, not only the previous ones. This is depicted in the figure by the red outline.
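A minimal NumPy sketch of this idea follows. The helpers `run_layer`, `bidirectional` and `make_tanh_cell` are hypothetical, and combining the two directions by concatenation is an assumption (summing or averaging are other options).

```python
import numpy as np

def run_layer(step, xs, s0):
    """Run a cell over the sequence xs and return the output for every step."""
    s, ys = s0, []
    for x in xs:
        y, s = step(x, s)
        ys.append(y)
    return np.stack(ys)

def bidirectional(step_fwd, step_bwd, xs, s0_fwd, s0_bwd):
    ys_fwd = run_layer(step_fwd, xs, s0_fwd)
    # Process the reversed sequence, then flip the outputs back so that
    # row i of both arrays belongs to the same time step
    ys_bwd = run_layer(step_bwd, xs[::-1], s0_bwd)[::-1]
    return np.concatenate([ys_fwd, ys_bwd], axis=-1)

def make_tanh_cell(rng, in_dim, hid_dim):
    """Toy recurrent cell with random weights (for illustration only)."""
    W_in = rng.normal(size=(hid_dim, in_dim))
    W_hid = rng.normal(size=(hid_dim, hid_dim))
    def step(x, s):
        h = np.tanh(W_in @ x + W_hid @ s)
        return h, h
    return step

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))
fwd, bwd = make_tanh_cell(rng, 3, 4), make_tanh_cell(rng, 3, 4)
ys = bidirectional(fwd, bwd, xs, np.zeros(4), np.zeros(4))
print(ys.shape)  # (5, 8)
```

The first four features of each row come from the forward layer and depend only on earlier inputs; the last four come from the backward layer and depend only on later inputs. Together they cover the whole sequence at every step.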

#### Multilayer recurrent network

We can also simply stack multiple recurrent layers on top of each other. This is mainly used to increase the capacity of the model. Usually RNNs are not as deep as CNNs and we use up to 5 layers. One layer is usually enough as a starting point.

#### Hierarchical recurrent network

For sequences of sequences (e.g. a sentence is a sequence of words and a word is a sequence of characters) hierarchical recurrent networks can be used. We again combine two networks. In this case, the first one processes each word character by character. The outputs of this network for each word are then fed to another RNN.

#### Encoder-decoder architecture

We can combine two recurrent layers for sequence to sequence tasks as well. We then have one layer that encodes the input into a representation and the other that decodes this representation into a series of outputs. The main use case for this architecture is machine translation.
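The following sketch shows one common way to connect the two layers: the final state of the encoder becomes the initial state of the decoder, and the decoder feeds its own output back as the next input. These choices, the toy cells and the function names are assumptions made for this illustration.

```python
import numpy as np

def make_tanh_cell(rng, in_dim, hid_dim):
    """Toy recurrent cell with random weights (for illustration only)."""
    W_in = rng.normal(size=(hid_dim, in_dim))
    W_hid = rng.normal(size=(hid_dim, hid_dim))
    def step(x, s):
        h = np.tanh(W_in @ x + W_hid @ s)
        return h, h
    return step

def encode(step, xs, s0):
    """Read the whole input and keep only the final state."""
    s = s0
    for x in xs:
        _, s = step(x, s)
    return s  # fixed-size representation of the input sequence

def decode(step, s_enc, start, n_steps):
    """Generate n_steps outputs, starting from the encoder's representation."""
    x, s, ys = start, s_enc, []
    for _ in range(n_steps):
        y, s = step(x, s)
        ys.append(y)
        x = y  # feed the previous output back as the next input
    return np.stack(ys)

rng = np.random.default_rng(0)
encoder = make_tanh_cell(rng, in_dim=3, hid_dim=4)
decoder = make_tanh_cell(rng, in_dim=4, hid_dim=4)

xs = rng.normal(size=(5, 3))                 # input sequence of length 5
representation = encode(encoder, xs, np.zeros(4))
ys = decode(decoder, representation, start=np.zeros(4), n_steps=7)
print(ys.shape)  # (7, 4)
```

The output length (7) is independent of the input length (5), which is what makes this setup suitable for tasks such as translation.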

### Time series prediction example

Show task

Exercise: Which of the previously mentioned RNN architectures would you use?



## Architecture comparison

We have discussed three major neural architectures during our labs. Each is well suited for different kinds of data:

1. _Multilayer Perceptron:_ Used for fixed-size feature vectors.
2. _Convolutional Neural Networks:_ Used for fixed-size 1D, 2D or 3D data with strong spatial relations.
3. _Recurrent Neural Networks:_ Used for sequences.

We can combine these architectures, e.g. we have seen convolutional neural networks with dense layers at the top. We can also combine convolutional and recurrent networks for video processing. (We process each video frame with convolutional layers. Then we take these frame representations and feed them into a recurrent network.)

Some of these combinations are considered specific architectures on their own, e.g.:

- _Autoencoder:_ Used for learning compact representations.
- _Generative Adversarial Networks:_ Used to generate images and sounds.
- _Graph Neural Networks:_ Used to process graphs.

There are also many additional architectures that are different from the three we mentioned. These include recursive networks, self-organizing maps, attention-based networks (e.g. the Transformer architecture) and many others. Most of them are used only occasionally.

## Word Embeddings

RNNs are often used for _natural language processing_ (NLP). Words are sequences of letters, sentences are sequences of words, documents are sequences of paragraphs. The written and spoken language are both very sequential in nature. Another fine feature of RNNs is that they are quite versatile. Their different forms, as depicted above, can be used for structurally different tasks:

- many-to-one - text classification
- many-to-many - part-of-speech tagging
- sequence-to-sequence - machine translation
- one-to-many - image captioning

But, how should we feed text into neural models? The most common way is to feed the networks word by word, while each word is represented by its `id` - an integer identifying each particular word form:

```
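# Toy sentences standing in for the lab's data loader (hypothetical example)
sentences = [['the', 'cat', 'sat'], ['the', 'dog', 'barked']]
print(sentences)
# Build the word -> id dictionary; id 0 is left free for padding (an assumption)
vocab = sorted({word for sentence in sentences for word in sentence})
word2id = {word: i + 1 for i, word in enumerate(vocab)}
print(word2id)
print([[word2id[word] for word in sentence] for sentence in sentences])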
```

Then we can use a so-called _embedding_ layer to get a vector representation for each `id`. The embedding layer stores one vector for each `id`; these vectors are the rows of a matrix W.

```
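import tensorflow as tf
vectors = tf.keras.layers.Embedding(input_dim=10000, output_dim=50)(tf.constant([[1, 2, 3]]))  # shape (1, 3, 50); both sizes are placeholder values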
```

Note that this is logically the same as representing each word with a one-hot vector and multiplying it by W. The embedding layer is used because simply selecting the n-th row is more efficient than multiplying matrices.
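This equivalence is easy to check in NumPy. The vocabulary size, the embedding dimension and the random matrix `W` below are made-up values for the illustration.

```python
import numpy as np

vocab_size, emb_dim = 5, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, emb_dim))  # one row per word id

ids = np.array([2, 0, 4])

looked_up = W[ids]                 # embedding lookup: select rows of W
one_hot = np.eye(vocab_size)[ids]  # one-hot vectors, shape (3, 5)
multiplied = one_hot @ W           # the same result via matrix multiplication

print(np.allclose(looked_up, multiplied))  # True
```

The lookup touches only `len(ids)` rows, while the multiplication goes through the whole matrix for every word, which matters when the vocabulary has tens of thousands of entries.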


Data loaded this way have the shape `(batch_size, time_size, emb_dim)` and can be used for further calculations, e.g. in a simple text classification model:

```
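import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=50),  # ids -> (batch_size, time_size, emb_dim); sizes are placeholders
    tf.keras.layers.LSTM(20),                                   # many-to-one: returns only the last output
    tf.keras.layers.Dense(2, activation='softmax'),             # class probabilities; 2 classes assumed
])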
```

### Pre-trained embeddings

The embeddings are essentially representations of words, i.e. they should encode semantic information about the meaning of the words. Because many NLP tasks need the same information, we can take the embeddings trained for one task and reuse them for another task. In practice this means that we can simply take the matrix W and use it later.

- Large portion of parameters
- Rare words
- Time efficiency

There are multiple libraries used to generate word embeddings. Some of the earliest were word2vec and GloVe. However, I would recommend using fastText, which should perform better. Often you can simply download pre-trained embedding matrices from the internet, e.g. the list of fastText embeddings for various languages.

Then we can load the vectors as our embedding matrix initializer:
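A minimal sketch of building such a matrix is shown here. It assumes the pre-trained vectors have already been read into a dictionary from word to vector; the function `build_embedding_matrix`, the tiny 2-dimensional vectors and the convention that id 0 is reserved for padding are all assumptions made for this example.

```python
import numpy as np

def build_embedding_matrix(word2id, pretrained, emb_dim, seed=0):
    """Row i holds the vector of the word with id i.

    Words missing from `pretrained` keep a small random vector.
    """
    rng = np.random.default_rng(seed)
    matrix = rng.normal(scale=0.1, size=(len(word2id) + 1, emb_dim))
    matrix[0] = 0.0  # id 0 is reserved for padding (an assumption)
    for word, idx in word2id.items():
        if word in pretrained:
            matrix[idx] = pretrained[word]
    return matrix

# Hypothetical 2-dimensional vectors standing in for downloaded embeddings
pretrained = {'cat': np.array([0.1, 0.2]), 'dog': np.array([0.3, 0.4])}
word2id = {'cat': 1, 'dog': 2, 'axolotl': 3}

W = build_embedding_matrix(word2id, pretrained, emb_dim=2)
print(W.shape)  # (4, 2)
```

In Keras, a matrix like this can be passed to the `Embedding` layer through `embeddings_initializer=tf.keras.initializers.Constant(W)`, and setting `trainable=False` on the layer keeps the vectors fixed during training. The rows follow our own `word2id`, which is why the vocabulary of the model and the vocabulary of the pre-trained vectors have to be matched word by word.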


train or do not train embedding layer?

make sure your vocabularies match

__Exercise 8.X:__ TensorBoard can also visualize word embeddings so we can explore them. Play with TensorBoard and try to find which words are similar.

__Exercise 8.X:__ Maybe word2vec online to see what are the most similar words?

### PA: 8.X LSTM-based POS tagger [2 pts]

Part-of-speech tagging (_POS tagging_) is a classical NLP task. We want to mark each word in a sentence with a correct POS tag. POS tags are grammatical categories of words, such as _verb_, _noun_, etc. The data for this task consist of sentences and their respective POS tags:

POS data example

We are essentially trying to perform a classification for each word.

We can solve this task with the RNN architecture depicted below:

softmax softmax softmax

dense   dense    dense

bi-dir lstm

wordem  wordem   wordem

Implement this architecture in Keras. There are several gotchas:

- masking
- bi-directional LSTM
- dense over multiple time steps
- loss is mean-pooled
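To make the masking and loss gotchas more concrete, here is one possible reading of them in plain NumPy: sentences in a batch are padded to the same length, and the loss is averaged only over the real words. The function name and the array layout are assumptions made for this sketch; it is not the Keras solution of the assignment.

```python
import numpy as np

def masked_mean_cross_entropy(probs, labels, mask):
    """Mean cross-entropy over real (non-padded) words only.

    probs:  (batch, time, n_tags) softmax outputs
    labels: (batch, time) correct tag ids
    mask:   (batch, time) 1 for real words, 0 for padding
    """
    batch_idx, time_idx = np.indices(labels.shape)
    token_loss = -np.log(probs[batch_idx, time_idx, labels])
    return (token_loss * mask).sum() / mask.sum()

# Two sentences padded to 3 steps, 4 possible tags, uniform predictions
probs = np.full((2, 3, 4), 0.25)
labels = np.array([[0, 1, 2], [3, 0, 0]])
mask = np.array([[1, 1, 1], [1, 0, 0]])  # the second sentence has one real word

loss = masked_mean_cross_entropy(probs, labels, mask)
print(np.isclose(loss, np.log(4)))  # True
```

Without the mask, the predictions at padded positions would contribute to the average and the model would be rewarded or punished for tags that do not exist.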

### Other forms

char, subword based representation
LM pre-training

## Further Reading

C.Olah LSTM
BPTT?

In [57]:
from week_8.data.sin import sin_dataset

In [61]:
import numpy as np
import matplotlib.pyplot as plt
    
import tensorflow as tf

class LSTModel(tf.keras.Model):
    
    def __init__(self):
        super(LSTModel, self).__init__()
        self.lstm = tf.keras.layers.LSTM(20)
        self.dense = tf.keras.layers.Dense(1)
        
    def call(self, x):
        x = self.lstm(x)  # <- expects (batch_size, time_steps, sample_dim) shape
        x = self.dense(x)
        return x
    
model = LSTModel()

model.compile(
    optimizer='adam',
    loss='mse')

x, y = sin_dataset(1000, 900, 10)
model.fit(
    x=x,
    y=y,
    epochs=50,
    batch_size=10
    )

Train on 1000 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50

KeyboardInterrupt: 