In [1]:
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


if torch.cuda.is_available():
    from torch.cuda import FloatTensor, LongTensor
else:
    from torch import FloatTensor, LongTensor

np.random.seed(42)


When working with sequential data (time-series, sentences, etc.) the order of the inputs is crucial for the task at hand. Recurrent neural networks (RNNs) process sequential data by accounting for the current input and also what has been learned from previous inputs. In this notebook, we'll learn how to create and train RNNs on sequential data.

<img src="https://raw.githubusercontent.com/GokuMohandas/practicalAI/master/images/rnn.png" width=550>


* **Objective:**  Process sequential data by accounting for the current input and also what has been learned from previous inputs.
* **Advantages:** 
    * Account for order and previous inputs in a meaningful way.
    * Conditioned generation for generating sequences.
* **Disadvantages:** 
    * Each time step's prediction depends on the previous prediction so it's difficult to parallelize RNN operations. 
    * Processing long sequences can yield memory and computation issues.
    * Interpretability is difficult, but there are a few [techniques](https://arxiv.org/abs/1506.02078) that use the activations from RNNs to see what parts of the inputs are processed. 
* **Miscellaneous:** 
    * Architectural tweaks to make RNNs faster and more interpretable are an ongoing area of research.

<img src="https://raw.githubusercontent.com/GokuMohandas/practicalAI/master/images/rnn2.png" width=650>

RNN forward pass for a single time step $X_t$:

$h_t = \tanh(h_{t-1}W_{hh} + X_tW_{xh} + b_h)$

$y_t = h_tW_{hy} + b_y$

$P(y_t) = \mathrm{softmax}(y_t) = \frac{e^{y_t}}{\sum_{c} e^{y_{t,c}}}$

*where*:
* $X_t$ = input at time step $t$ | $\in \mathbb{R}^{N \times E}$ ($N$ is the batch size, $E$ is the embedding dim)
* $W_{hh}$ = hidden units weights | $\in \mathbb{R}^{H \times H}$ ($H$ is the hidden dim)
* $h_{t-1}$ = previous time step's hidden state | $\in \mathbb{R}^{N \times H}$
* $W_{xh}$ = input weights | $\in \mathbb{R}^{E \times H}$
* $b_h$ = hidden units bias | $\in \mathbb{R}^{1 \times H}$ (broadcast over the batch)
* $W_{hy}$ = output weights | $\in \mathbb{R}^{H \times C}$ ($C$ is the number of classes)
* $b_y$ = output bias | $\in \mathbb{R}^{1 \times C}$ (broadcast over the batch)

You repeat this for every time step's input ($X_{t+1}, X_{t+2}, ..., X_{T}$, where $T$ is the sequence length) to get the predicted outputs at each time step.
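As a sanity check on the update rule above, here is a minimal sketch that recomputes one time step by hand and compares it with `nn.RNNCell`. The variable names are illustrative. Note that PyTorch stores its weight matrices transposed relative to the shapes listed above and keeps two separate bias vectors (`bias_ih` and `bias_hh`), so the manual formula is written accordingly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

N, E, H = 5, 100, 256            # batch size, embedding dim, hidden dim
x_t = torch.randn(N, E)          # input at a single time step
h_prev = torch.zeros(N, H)       # previous hidden state (zeros = unconditioned)

cell = nn.RNNCell(E, H)          # uses tanh by default

# Manual update: weight_ih is (H, E) and weight_hh is (H, H), hence the transposes
h_manual = torch.tanh(
    x_t @ cell.weight_ih.T + cell.bias_ih
    + h_prev @ cell.weight_hh.T + cell.bias_hh
)
h_cell = cell(x_t, h_prev)

max_abs_diff = (h_manual - h_cell).abs().max().item()
print(h_cell.shape, max_abs_diff)
```

The difference should be at the level of floating point noise, which suggests the cell implements exactly this tanh recurrence.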

**Note**: At the first time step, the previous hidden state $h_{t-1}$ can either be a zero vector (unconditioned) or be initialized from some condition (conditioned). If we are conditioning the RNN, the first hidden state $h_0$ can encode a specific condition, or we can concatenate the condition to the randomly initialized hidden vectors at each time step. More on this in the subsequent notebooks on RNNs.


Let's see what the forward pass looks like with an RNN for a synthetic task such as processing reviews (a sequence of words) to predict the sentiment at the end of processing the review.

In [2]:
batch_size = 5
seq_size = 10 # max length per input (masking will be used for sequences that aren't this max length)
x_lengths = [8, 5, 4, 10, 5] # lengths of each input sequence
embedding_dim = 100
rnn_hidden_dim = 256
output_dim = 4

In [3]:
# Initialize synthetic inputs
x_in = torch.randn(batch_size, seq_size, embedding_dim)
x_lengths = torch.tensor(x_lengths)
print (x_in.size())

torch.Size([5, 10, 100])


In [4]:
# Initialize hidden state
hidden_t = torch.zeros((batch_size, rnn_hidden_dim))
print (hidden_t.size())

torch.Size([5, 256])


In [5]:
# Initialize RNN cell
rnn_cell = nn.RNNCell(embedding_dim, rnn_hidden_dim)
print (rnn_cell)

RNNCell(100, 256)


In [6]:
# Forward pass through RNN
x_in = x_in.permute(1, 0, 2) # RNN needs batch_size to be at dim 1

# Loop through the inputs time steps
hiddens = []
for t in range(seq_size):
    hidden_t = rnn_cell(x_in[t], hidden_t)
    hiddens.append(hidden_t)
hiddens = torch.stack(hiddens)
hiddens = hiddens.permute(1, 0, 2) # bring batch_size back to dim 0
print (hiddens.size())

torch.Size([5, 10, 256])


In [7]:
# We also could've used a more abstracted layer
x_in = torch.randn(batch_size, seq_size, embedding_dim)
rnn = nn.RNN(embedding_dim, rnn_hidden_dim, batch_first=True)
out, h_n = rnn(x_in) #h_n is the last hidden state
print ("out: ", out.size())
print ("h_n: ", h_n.size())

out:  torch.Size([5, 10, 256])
h_n:  torch.Size([1, 5, 256])


In [8]:
def gather_last_relevant_hidden(hiddens, x_lengths):
    x_lengths = x_lengths.long().detach().cpu().numpy() - 1
    out = []
    for batch_index, column_index in enumerate(x_lengths):
        out.append(hiddens[batch_index, column_index])
    return torch.stack(out)

In [9]:
# Gather the last relevant hidden state
z = gather_last_relevant_hidden(hiddens, x_lengths)
print (z.size())

torch.Size([5, 256])
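The Python loop in `gather_last_relevant_hidden` can also be written with advanced indexing. The sketch below is one possible vectorized variant; the function name and the small demo tensors are made up for illustration.

```python
import torch

def gather_last_relevant_hidden_vectorized(hiddens, x_lengths):
    """hiddens: (batch, seq, hidden); x_lengths: (batch,) true sequence lengths."""
    batch_indices = torch.arange(hiddens.size(0), device=hiddens.device)
    last_indices = x_lengths.long().to(hiddens.device) - 1
    # Pick, for every sequence in the batch, the hidden state at its last real step
    return hiddens[batch_indices, last_indices]

# Tiny demo: 2 sequences, 3 time steps, hidden size 4
demo_hiddens = torch.arange(2 * 3 * 4, dtype=torch.float32).reshape(2, 3, 4)
demo_lengths = torch.tensor([3, 1])
demo_z = gather_last_relevant_hidden_vectorized(demo_hiddens, demo_lengths)
print(demo_z.size())
```

This avoids the round trip through NumPy and keeps the indexing on the same device as `hiddens`.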


In [10]:
# Forward pass through FC layer
fc1 = nn.Linear(rnn_hidden_dim, output_dim)
y_pred = fc1(z)
y_pred = F.softmax(y_pred, dim=1)
print (y_pred.size())
print (y_pred)

torch.Size([5, 4])
tensor([[0.2263, 0.2619, 0.2048, 0.3070],
        [0.2010, 0.3693, 0.2140, 0.2156],
        [0.2148, 0.3652, 0.2394, 0.1807],
        [0.2711, 0.2492, 0.2606, 0.2190],
        [0.2911, 0.2015, 0.3370, 0.1704]], grad_fn=<SoftmaxBackward>)


### Sequential data

There are a variety of different sequential tasks that RNNs can help with.

1. **One to one**: one input produces one output. 
    * Ex. Given a word, predict its class (verb, noun, etc.).
2. **One to many**: one input generates many outputs.
    * Ex. Given a sentiment (positive, negative, etc.) generate a review.
3. **Many to one**: Many inputs are sequentially processed to generate one output.
    * Ex. Process the words in a review to predict the sentiment.
4. **Many to many**: Many inputs are sequentially processed to generate many outputs.
    * Ex. Given a sentence in French, process the entire sentence and then generate the English translation.
    * Ex. Given a sequence of time-series data, predict the probability of an event (risk of disease) at each time step.

<img src="https://raw.githubusercontent.com/GokuMohandas/practicalAI/master/images/seq2seq.jpeg" width=700>

*From [(The Unreasonable Effectiveness of Recurrent Neural Networks)](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)*

The first example is an ordinary fully connected network. Each of the following ones shows the processing of an input sequence of arbitrary length (red rectangles) and the generation of an output sequence, also of arbitrary length (blue rectangles).

In this case, the green rectangles in each figure share the same weights. So, on the one hand, we are training a very deep network (if you turn the picture on its side), and on the other, one with a strictly limited number of parameters.

---
### Writing a simple RNN

As a reminder, it works roughly like this:

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png" width="70%">


Generally speaking, you can come up with many variations on this implementation. In our case, the processing will be as follows:

$$ h_t = tanh (W_h [h_ {t-1}; x_t] + b_h) $$

$ h_ {t-1} $ is the hidden state obtained in the previous step, $ x_t $ is the input vector. $ [h_ {t-1}; x_t] $ is a simple concatenation of vectors. Just like in the picture!

Let's check our network on a very simple task: make it output the first element of the sequence.

That is, for the sequence `[1, 2, 1, 3]` the network must predict `1`.

Let's start with the generation of the batch.

In [11]:
def generate_data(batch_size=128, seq_len=5):
    data = torch.randint(0, 10, size=(seq_len, batch_size), dtype=torch.long)
    return data, data[0]

X_val, y_val = generate_data()
X_val, y_val

(tensor([[9, 8, 5, 2, 0, 5, 0, 7, 1, 1, 4, 3, 9, 7, 8, 1, 3, 7, 8, 4, 1, 4, 5, 4,
          9, 2, 7, 6, 9, 9, 6, 5, 4, 7, 3, 3, 5, 7, 5, 2, 2, 4, 1, 7, 0, 9, 4, 4,
          9, 4, 0, 9, 3, 6, 1, 5, 1, 2, 6, 2, 4, 1, 5, 4, 2, 6, 1, 0, 2, 6, 0, 5,
          8, 7, 2, 5, 0, 4, 5, 7, 0, 7, 4, 5, 8, 4, 9, 2, 1, 2, 5, 3, 1, 1, 4, 6,
          3, 5, 1, 8, 7, 6, 6, 8, 0, 1, 3, 4, 8, 6, 3, 6, 5, 5, 5, 5, 6, 9, 4, 1,
          2, 0, 9, 6, 6, 8, 2, 8],
         [4, 3, 0, 8, 9, 5, 7, 6, 1, 6, 2, 0, 8, 5, 7, 6, 4, 8, 9, 0, 6, 0, 6, 0,
          9, 0, 8, 9, 5, 1, 3, 8, 1, 3, 9, 6, 7, 8, 7, 4, 0, 7, 4, 3, 5, 9, 5, 7,
          8, 6, 9, 7, 7, 5, 6, 0, 5, 6, 2, 6, 7, 3, 5, 8, 2, 1, 7, 4, 0, 7, 5, 9,
          0, 7, 9, 3, 8, 6, 3, 8, 8, 4, 2, 7, 0, 6, 7, 6, 1, 0, 4, 1, 1, 1, 4, 0,
          5, 8, 0, 9, 7, 8, 7, 9, 3, 6, 3, 8, 7, 4, 0, 8, 4, 5, 1, 8, 7, 6, 6, 3,
          4, 8, 4, 1, 2, 7, 1, 4],
         [2, 1, 5, 4, 3, 6, 7, 0, 0, 8, 7, 7, 0, 5, 5, 1, 1, 2, 8, 5, 4, 6, 4, 7,
          7, 9, 5, 2, 4, 9, 

Please note that, once embedded, the batch has the shape `(sequence_length, batch_size, input_size)` (the index tensor above is `(sequence_length, batch_size)`). All recurrent layers in PyTorch use this layout by default.

This is done for performance reasons, but you can change this behavior with the help of the `batch_first` argument if you wish.
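To make the two layouts concrete, here is a small sketch that runs the same weights through a time-first and a batch-first `nn.RNN`. The sizes are arbitrary and chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch, input_size, hidden_size = 7, 3, 4, 8

rnn_time_first = nn.RNN(input_size, hidden_size)                      # default layout
rnn_batch_first = nn.RNN(input_size, hidden_size, batch_first=True)
rnn_batch_first.load_state_dict(rnn_time_first.state_dict())          # share the weights

x = torch.randn(seq_len, batch, input_size)
out_tf, h_tf = rnn_time_first(x)                   # out_tf: (seq_len, batch, hidden)
out_bf, h_bf = rnn_batch_first(x.transpose(0, 1))  # out_bf: (batch, seq_len, hidden)
print(out_tf.shape, out_bf.shape, h_tf.shape, h_bf.shape)
```

Only the layout of the input and of `out` changes. The final hidden state keeps the shape `(num_layers * num_directions, batch, hidden)` in both cases.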

**Task** Implement the `SimpleRNN` class, which performs the calculation using the formula above.

In [None]:
class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()

        self._hidden_size = hidden_size
        <create Linear layer>

    def forward(self, inputs, hidden=None):
        seq_len, batch_size = inputs.shape[:2]
        if hidden is None:
            hidden = inputs.new_zeros((batch_size, self._hidden_size))
         
        for i in range(seq_len):
            <apply linear layer to concatenation of current input (inputs[i]) and hidden>

        return hidden
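If you get stuck, here is one possible way to fill in the template, offered as a sketch rather than the reference solution. It assumes the concatenation order $[h_{t-1}; x_t]$ from the formula above; the attribute name `_linear` is made up.

```python
import torch
import torch.nn as nn

class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self._hidden_size = hidden_size
        # A single linear layer implements W_h [h_{t-1}; x_t] + b_h
        self._linear = nn.Linear(hidden_size + input_size, hidden_size)

    def forward(self, inputs, hidden=None):
        # inputs: (seq_len, batch_size, input_size)
        seq_len, batch_size = inputs.shape[:2]
        if hidden is None:
            hidden = inputs.new_zeros((batch_size, self._hidden_size))

        for i in range(seq_len):
            step_input = torch.cat((hidden, inputs[i]), dim=-1)
            hidden = torch.tanh(self._linear(step_input))

        return hidden  # last hidden state, (batch_size, hidden_size)

simple_rnn = SimpleRNN(input_size=10, hidden_size=32)
last_hidden = simple_rnn(torch.randn(5, 128, 10))
print(last_hidden.shape)
```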


It should be clear why it is useful to have `seq_len` as the first dimension: you need to be able to take `inputs[i]`, the sub-batch for time step `i`. If the data were laid out differently, this slice would not be contiguous in memory and the operation would be more expensive.

**Task** Implement the `MemorizerModel` class as the sequence `Embedding -> SimpleRNN -> Linear`. You can use `nn.Sequential`.

To create the embeddings, you can use `nn.Embedding.from_pretrained`. For simplicity, we will use a one-hot encoding: to do this, simply initialize the embedding layer with the identity matrix `torch.eye(N)`.

In [None]:
# You can use nn.Sequential too
class MemorizerModel(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()

        <create layers>

    def forward(self, inputs):
        <apply 'em>
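One possible shape for this model is sketched below. To keep the block self-contained it uses `nn.RNN` as a stand-in for `SimpleRNN` (an assumption: both apply a tanh recurrence, but `nn.RNN` returns a pair `(out, h_n)`), and it adds a hypothetical `vocab_size=10` argument matching the digits produced by `generate_data`.

```python
import torch
import torch.nn as nn

class MemorizerModel(nn.Module):
    def __init__(self, hidden_size, vocab_size=10):
        super().__init__()
        # Frozen one-hot "embeddings": row i of the identity matrix encodes digit i
        self._embedding = nn.Embedding.from_pretrained(torch.eye(vocab_size), freeze=True)
        # Stand-in for SimpleRNN (assumption), swap in your own implementation
        self._rnn = nn.RNN(vocab_size, hidden_size)
        self._out = nn.Linear(hidden_size, vocab_size)

    def forward(self, inputs):
        # inputs: (seq_len, batch_size) tensor of digit indices
        embedded = self._embedding(inputs)        # (seq_len, batch_size, vocab_size)
        _, last_hidden = self._rnn(embedded)      # (1, batch_size, hidden_size)
        return self._out(last_hidden[-1])         # (batch_size, vocab_size) logits

memorizer = MemorizerModel(hidden_size=32)
logits = memorizer(torch.randint(0, 10, size=(25, 128)))
print(logits.shape)
```

With your own `SimpleRNN`, which returns only the last hidden state, the forward pass reduces to applying the three layers in order.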

Run the training:

In [None]:
rnn = MemorizerModel(hidden_size=32)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(rnn.parameters())

total_loss = 0
epochs_count = 1000
for epoch_ind in range(epochs_count):
    X_train, y_train = generate_data(seq_len=25)
    
    optimizer.zero_grad()
    rnn.train()
    
    logits = rnn(X_train)

    loss = criterion(logits, y_train)
    loss.backward()
    optimizer.step()
    
    total_loss += loss.item()
    
    if (epoch_ind + 1) % 100 == 0:
        rnn.eval()
        
        with torch.no_grad():
            logits = rnn(X_val)
            val_loss = criterion(logits, y_val)
            print('[{}/{}] Train: {:.3f} Val: {:.3f}'.format(epoch_ind + 1, epochs_count, 
                                                             total_loss / 100, val_loss.item()))
            total_loss = 0

**Task** Look at how sequence length affects network performance.

First, find the longest sequence length on which the network is still able to learn. Second, try training a network on short sequences and then applying it to longer ones.

**Task** It is claimed that `relu` suits RNNs better than `tanh`. Try it too.

## Training RNNs


<img src="https://image.ibb.co/cEYkw9/rnn_bptt_with_gradients.png">

*From [Recurrent Neural Networks Tutorial, Part 3 – Backpropagation Through Time and Vanishing Gradients](http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/)*


If everything went according to plan, we should have seen how RNNs forget.

To understand why, it helps to recall exactly how an RNN is trained; see, for example, [Backpropagation Through Time and Vanishing Gradients](http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/) or [Vanishing Gradients & LSTMs](http://harinisuresh.com/2016/10/09/lstms/).

In short, one of the problems with training recurrent networks is *exploding gradients*. It occurs when the weight matrix is such that it increases the norm of the gradient vector during the backward pass. As a result, the gradient norm grows exponentially and "explodes".

This problem can be addressed with gradient clipping: `nn.utils.clip_grad_norm_(rnn.parameters(), 1.)`.
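The sketch below shows where the clipping call sits in a training step: after `backward()` and before `optimizer.step()`. The tiny model and the artificially scaled loss are made up purely to produce large gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.RNN(input_size=4, hidden_size=8)
optimizer = torch.optim.Adam(model.parameters())

x = torch.randn(50, 3, 4)               # (seq_len, batch, input_size)
out, _ = model(x)
loss = (out ** 2).sum() * 1000.0        # artificially large loss

optimizer.zero_grad()
loss.backward()
# Rescales all gradients in place so their joint norm is at most max_norm;
# the return value is the total norm measured before clipping
norm_before = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))
optimizer.step()

print(float(norm_before), float(norm_after))
```

Clipping changes only the length of the gradient vector, not its direction, so the update still points the same way.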

# Issues with vanilla RNNs

There are several issues with the vanilla RNN that we've seen so far. 

1. When an input sequence has many time steps, it becomes difficult for the model to retain information seen earlier as it processes more and more of the later time steps. The goal of the model is to retain the useful components of previously seen time steps, but this becomes difficult when there are so many time steps to process. 

2. During backpropagation, the gradient from the loss has to travel all the way back to the first time step. If the per-step gradient factor is larger than 1 (${1.01}^{1000} \approx 20959$) or less than 1 (${0.99}^{1000} \approx 4.32 \times 10^{-5}$) and we have lots of time steps, the gradient can quickly explode or vanish.
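The arithmetic in the second point is easy to reproduce. This is a toy sketch that treats the gradient as a scalar multiplied by the same factor at every time step, which is a simplification of what happens with real Jacobians.

```python
steps = 1000

def backpropagated_scale(per_step_factor, steps):
    """Scale of a gradient after passing through `steps` identical factors."""
    scale = 1.0
    for _ in range(steps):
        scale *= per_step_factor
    return scale

exploding = backpropagated_scale(1.01, steps)
vanishing = backpropagated_scale(0.99, steps)
print(f"{exploding:.1f} {vanishing:.3e}")
```

Even a 1% deviation from 1 per step changes the gradient scale by several orders of magnitude over a thousand steps.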

To address both of these issues, the concept of gating was introduced to RNNs. Gating allows RNNs to control the information flow between time steps to optimize for the task. Selectively allowing information to pass through lets the model process inputs with many time steps. The most common gated RNN variants are long short-term memory ([LSTM](https://pytorch.org/docs/stable/nn.html#torch.nn.LSTM)) units and gated recurrent units ([GRUs](https://pytorch.org/docs/stable/nn.html#torch.nn.GRU)). You can read more about how these units work [here](http://colah.github.io/posts/2015-08-Understanding-LSTMs/).

<img src="https://raw.githubusercontent.com/GokuMohandas/practicalAI/master/images/gates.png" width=900>

In [12]:
# GRU in PyTorch
gru = nn.GRU(input_size=embedding_dim, hidden_size=rnn_hidden_dim, batch_first=True)

In [13]:
# Initialize synthetic input
x_in = torch.randn(batch_size, seq_size, embedding_dim)
print (x_in.size())

torch.Size([5, 10, 100])


In [14]:
# Forward pass
out, h_n = gru(x_in)
print ("out:", out.size())
print ("h_n:", h_n.size())

out: torch.Size([5, 10, 256])
h_n: torch.Size([1, 5, 256])


**Note:** Choosing whether to use a GRU or an LSTM really depends on the data and empirical performance. GRUs offer comparable performance with a reduced number of parameters, while LSTMs have more gates and a separate cell state, which may make the difference in performance for your particular task.

## LSTM vs GRU

Another problem is *vanishing gradients*. It is the opposite effect: the gradients decay exponentially. It is solved in more elaborate ways.

Namely, by using gated architectures.

The idea of a gate is simple but important, and gates are used not only in recurrent networks.

If you look at how our SimpleRNN works, you will notice that the memory (i.e., $h_t$) is overwritten at every step. We would like to control this overwrite so that important information in the vector is not discarded.

To do this, let's introduce a vector $g \in \{0,1\}^n$ that says which cells of $h_{t-1}$ should be kept ($g = 0$) and which should be replaced with new values ($g = 1$):

$$ h_t = g \odot f (x_t, h_{t-1}) + (1 - g) \odot h_{t-1}. $$

For example:
$$
 \begin{bmatrix}
  8 \\
  11 \\
  3 \\
  7
 \end{bmatrix} =
 \begin{bmatrix}
  0 \\
  1 \\
  0 \\
  0
 \end{bmatrix}
 \odot
  \begin{bmatrix}
  7 \\
  11 \\
  6 \\
  5
 \end{bmatrix}
 +
  \begin{bmatrix}
  1 \\
  0 \\
  1 \\
  1
 \end{bmatrix}
 \odot
  \begin{bmatrix}
  8 \\
  5 \\
  3 \\
  7
 \end{bmatrix}
$$

To make this differentiable, we replace the hard 0/1 choice with a sigmoid: $ \sigma(f (x_t, h_ {t-1})) $.

As a result, the network itself decides, based on its inputs, which memory cells to overwrite and by how much.
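The worked example above can be reproduced in a few lines. The hard gate uses the exact numbers from the matrices; the soft gate uses made-up pre-activations (`gate_logits`) just to show what the sigmoid version looks like.

```python
import torch

h_prev = torch.tensor([8., 5., 3., 7.])        # old memory h_{t-1}
candidate = torch.tensor([7., 11., 6., 5.])    # new values f(x_t, h_{t-1})

# Hard gate: 1 = overwrite the cell, 0 = keep the old value
hard_gate = torch.tensor([0., 1., 0., 0.])
h_hard = hard_gate * candidate + (1 - hard_gate) * h_prev

# Soft gate: hypothetical pre-activations squashed into (0, 1)
gate_logits = torch.tensor([-4., 4., -4., -4.])
soft_gate = torch.sigmoid(gate_logits)
h_soft = soft_gate * candidate + (1 - soft_gate) * h_prev

print(h_hard, h_soft)
```

With the soft gate every cell becomes a weighted average of the old and new values, so gradients can flow through the gate itself.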

### LSTM

It seems that the first architecture that applied this mechanism was LSTM (Long Short-Term Memory).

In addition to $ h_ {t-1} $, it also maintains $ c_ {t-1} $: $ h_ {t-1} $ is still the hidden state obtained at the previous step, and $ c_ {t -1} $ is a memory vector.

Schematically, it looks something like this:

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png" width="50%">

*From [(Understanding LSTM Networks)](http://colah.github.io/posts/2015-08-Understanding-LSTMs)*


To start, we can compute a candidate state the same way as before (we denote it by $ \tilde c_{t} $):

$$ \tilde c_{t} = tanh(W_h [h_ {t-1}; x_t] + b_h) $$

In a plain RNN, we would simply overwrite the hidden state with this value. Now we want to decide how much information to take from $ c_ {t-1} $ and how much from $ \tilde c_ {t} $.

We estimate this with sigmoids:
$$f = \sigma(W_f [h_{t-1}; x_t] + b_f),$$
$$i = \sigma(W_i [h_{t-1}; x_t] + b_i).$$

The first (the forget gate) controls how much of the old information is kept. The second (the input gate) controls how much of the new information is let in. Then

$$ c_t = f \odot c_ {t-1} + i \odot \tilde c_t. $$

We will also weigh the new hidden state:

$$ o = \sigma (W_o [h_ {t-1}; x_t] + b_o), $$
$$ h_t = o \odot tanh (c_t). $$
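The equations above translate almost line by line into code. The sketch below is a hypothetical `LSTMStep` module, not PyTorch's `nn.LSTMCell`; it stacks the four weight matrices $W_h, W_f, W_i, W_o$ into a single linear layer, which is equivalent to four separate ones.

```python
import torch
import torch.nn as nn

class LSTMStep(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # One linear map over [h_{t-1}; x_t] produces all four pre-activations:
        # candidate, forget gate, input gate, output gate
        self._gates = nn.Linear(hidden_size + input_size, 4 * hidden_size)

    def forward(self, x_t, h_prev, c_prev):
        pre = self._gates(torch.cat((h_prev, x_t), dim=-1))
        c_tilde, f, i, o = pre.chunk(4, dim=-1)
        c_tilde = torch.tanh(c_tilde)                       # candidate memory
        f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)
        c_t = f * c_prev + i * c_tilde                      # gated memory update
        h_t = o * torch.tanh(c_t)                           # gated hidden state
        return h_t, c_t

lstm_step = LSTMStep(input_size=100, hidden_size=256)
h_t, c_t = lstm_step(torch.randn(5, 100), torch.zeros(5, 256), torch.zeros(5, 256))
print(h_t.shape, c_t.shape)
```

To process a whole sequence you would call this step in a loop over time, carrying `h_t` and `c_t` forward, just as in `SimpleRNN`.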

Another picture:

<img src="https://image.ibb.co/e6HQUU/details.png">
 
*From [Vanishing Gradients & LSTMs](http://harinisuresh.com/2016/10/09/lstms/)*

Why does this address vanishing gradients? Look at the derivative $ \frac {\partial c_t} {\partial c_ {t-1}} $. Along the direct path through the memory vector it is proportional to the $ f $ gate. If $ f = 1 $, gradients flow unchanged. Otherwise, the network itself learns when it wants to forget something.

It is highly recommended to read the article [Understanding LSTM Networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) for more information and fun pictures.

Why write out these formulas? Mainly to show how many more parameters an LSTM has to learn compared to a regular RNN: four times as many!
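The parameter ratio can be checked directly on PyTorch's built-in layers. This is a small sketch using the sizes from earlier in the notebook; `count_parameters` is a helper defined here for illustration.

```python
import torch.nn as nn

def count_parameters(module):
    """Total number of scalar parameters in a module."""
    return sum(p.numel() for p in module.parameters())

input_size, hidden_size = 100, 256
n_rnn = count_parameters(nn.RNN(input_size, hidden_size))
n_gru = count_parameters(nn.GRU(input_size, hidden_size))
n_lstm = count_parameters(nn.LSTM(input_size, hidden_size))
print(n_rnn, n_gru, n_lstm)
```

The LSTM has four sets of weights (candidate plus three gates) and the GRU has three, so for a single layer their parameter counts are exactly four and three times that of the plain RNN.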

For those who dozed off: [a video of how an RNN forgets (bottom)](https://www.youtube.com/watch?v=mLxsbWAYIpw)

# Bidirectional RNNs
There have been many advancements with RNNs ([attention](https://www.oreilly.com/ideas/interpretability-via-attentional-and-memory-based-interfaces-using-tensorflow), Quasi RNNs, etc.) that we will cover in later lessons, but one of the most basic and widely used is the bidirectional RNN (Bi-RNN). The motivation behind bidirectional RNNs is to process an input sequence in both directions. Accounting for context from both sides can improve performance when the entire input sequence is known at inference time. A common application of Bi-RNNs is translation, where it is advantageous to look at an entire sentence from both sides when translating to another language (e.g., Japanese → English).

<img src="https://raw.githubusercontent.com/GokuMohandas/practicalAI/master/images/birnn.png" width=700>

In [15]:
# BiGRU in PyTorch
bi_gru = nn.GRU(input_size=embedding_dim, hidden_size=rnn_hidden_dim, batch_first=True, bidirectional=True)

In [16]:
# Forward pass
out, h_n = bi_gru(x_in)
print ("out:", out.size()) # collection of all hidden states from the RNN for each time step
print ("h_n:", h_n.size()) # last hidden state from the RNN

out: torch.Size([5, 10, 512])
h_n: torch.Size([2, 5, 256])
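The doubled last dimension of `out` holds the forward and backward states side by side. The sketch below (with its own freshly created layer and made-up variable names) splits them apart and shows how they relate to `h_n`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = 256
bi = nn.GRU(input_size=100, hidden_size=hidden, batch_first=True, bidirectional=True)
bi_out, bi_h_n = bi(torch.randn(5, 10, 100))     # (5, 10, 512), (2, 5, 256)

# First half of the last dim = forward direction, second half = backward direction
forward_out, backward_out = bi_out[..., :hidden], bi_out[..., hidden:]

# The forward pass finishes at the last time step, the backward pass at the first
forward_last = forward_out[:, -1]
backward_last = backward_out[:, 0]

z = torch.cat((forward_last, backward_last), dim=-1)   # (5, 512) sequence summary
print(z.shape)
```

Here `forward_last` matches `bi_h_n[0]` and `backward_last` matches `bi_h_n[1]`. Note that with padded batches the forward state at the last time step may correspond to padding, so the last relevant step would need to be gathered per sequence.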


## Data preprocessing

In [None]:
symbols = set(symb for word in data_train for symb in word)
char2ind = {symb: ind + 1 for ind, symb in enumerate(symbols)}
char2ind[''] = 0

lang2ind = {lang: ind for ind, lang in enumerate(set(labels_train))}

Convert dataset.

**Task** Write a batch generator that will select a random set of words on the fly and convert them into matrices

In [None]:
def iterate_batches(data, labels, char2ind, lang2ind, batch_size):
    # let's do the conversion part first
    labels = np.array([lang2ind[label] for label in labels])
    data = [[char2ind.get(symb, 0) for symb in word] for word in data]
    
    indices = np.arange(len(data))
    np.random.shuffle(indices)
    
    for start in range(0, len(data), batch_size):
        end = min(start + batch_size, len(data))
        
        batch_indices = indices[start: end]
        
        max_word_len = max(len(data[ind]) for ind in batch_indices)
        X = np.zeros((max_word_len, len(batch_indices)))
        <fill X>
            
        yield X, labels[batch_indices]
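The `<fill X>` step is a padding operation. Here is a standalone sketch of that idea with a hypothetical helper `pad_batch`; it assumes index 0 is the padding symbol, as in `char2ind[''] = 0`, and uses an integer dtype so the result can be turned into a `LongTensor`.

```python
import numpy as np

def pad_batch(sequences, pad_index=0):
    """Stack variable-length index lists into a (max_len, batch) integer matrix."""
    max_len = max(len(seq) for seq in sequences)
    X = np.full((max_len, len(sequences)), pad_index, dtype=np.int64)
    for column, seq in enumerate(sequences):
        X[:len(seq), column] = seq      # each word fills one column, top to bottom
    return X

X_demo = pad_batch([[3, 1, 2], [4], [5, 6]])
print(X_demo)
```

Each word occupies one column, so the result follows the `(sequence_length, batch_size)` layout used throughout this notebook.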

To avoid passing `char2ind, lang2ind` every time:

In [None]:
from functools import partial

iterate_batches = partial(iterate_batches, char2ind=char2ind, lang2ind=lang2ind)

In [None]:
next(iterate_batches(data_train, labels_train, batch_size=8))

**Task** Implement a simple model based on `SimpleRNN`.

In [None]:
class SurnamesClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim, lstm_hidden_dim, classes_count):
        super().__init__()
        
        <set layers>
            
    def forward(self, inputs):
        'embed(inputs) -> prediction'
        <implement it>
    
    def embed(self, inputs):
        'inputs -> word embedding'
        <and it> 

In [None]:
import math
import time

def do_epoch(model, criterion, data, batch_size, optimizer=None):  
    epoch_loss = 0.
    
    is_train = optimizer is not None
    model.train(is_train)
    
    data, labels = data
    batchs_count = math.ceil(len(data) / batch_size)
    
    with torch.autograd.set_grad_enabled(is_train):
        for i, (X_batch, y_batch) in enumerate(iterate_batches(data, labels, batch_size=batch_size)):
            X_batch, y_batch = LongTensor(X_batch), LongTensor(y_batch)

            logits = model(X_batch)
            loss = criterion(logits, y_batch)
            epoch_loss += loss.item()

            if is_train:
                optimizer.zero_grad()
                loss.backward()
                nn.utils.clip_grad_norm_(model.parameters(), 1.)
                optimizer.step()

            print('\r[{} / {}]: Loss = {:.4f}'.format(i, batchs_count, loss.item()), end='')
                
    return epoch_loss / batchs_count

def fit(model, criterion, optimizer, train_data, epochs_count=1, 
        batch_size=32, val_data=None, val_batch_size=None):
    if val_data is not None and val_batch_size is None:
        val_batch_size = batch_size
        
    for epoch in range(epochs_count):
        start_time = time.time()
        train_loss = do_epoch(model, criterion, train_data, batch_size, optimizer)
        
        output_info = '\rEpoch {} / {}, Epoch Time = {:.2f}s: Train Loss = {:.4f}'
        if val_data is not None:
            val_loss = do_epoch(model, criterion, val_data, val_batch_size, None)
            
            epoch_time = time.time() - start_time
            output_info += ', Val Loss = {:.4f}'
            print(output_info.format(epoch+1, epochs_count, epoch_time, train_loss, val_loss))
        else:
            epoch_time = time.time() - start_time
            print(output_info.format(epoch+1, epochs_count, epoch_time, train_loss))

In [None]:
model = SurnamesClassifier(vocab_size=len(char2ind), emb_dim=16, lstm_hidden_dim=64, classes_count=len(lang2ind)).cuda()

criterion = nn.CrossEntropyLoss().cuda()
optimizer = optim.Adam(model.parameters())

fit(model, criterion, optimizer, epochs_count=50, batch_size=128, train_data=(data_train, labels_train),
    val_data=(data_test, labels_test), val_batch_size=512)

**Task** Write a function to test the trained network: given a word, it should report, for each language, the probability that the word is a surname in that language.

**Task** Evaluate the quality of the model.

In [None]:
from sklearn.metrics import accuracy_score, classification_report

model.eval()

y_test, y_pred = [], []
<calc 'em>

print('Accuracy = {:.2%}'.format(accuracy_score(y_test, y_pred)))
print('Classification report:')
print(classification_report(y_test, y_pred, 
                            target_names=[lang for lang, _ in sorted(lang2ind.items(), key=lambda x: x[1])]))

## Embedding visualization

In [None]:
import bokeh.models as bm, bokeh.plotting as pl
from bokeh.colors import RGB
from bokeh.io import output_notebook

from sklearn.manifold import TSNE
from sklearn.preprocessing import scale


def draw_vectors(x, y, radius=10, alpha=0.25, color='blue',
                 width=600, height=400, show=True, **kwargs):
    """ draws an interactive plot for data points with auxiliary info on hover """
    output_notebook()
    
    if isinstance(color, str): 
        color = [color] * len(x)
    if isinstance(color, np.ndarray):
        color = [RGB(*x[:3]) for x in color]
    print(color)
    data_source = bm.ColumnDataSource({ 'x' : x, 'y' : y, 'color': color, **kwargs })

    fig = pl.figure(active_scroll='wheel_zoom', width=width, height=height)
    fig.scatter('x', 'y', size=radius, color='color', alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, "@" + key) for key in kwargs.keys()]))
    if show: 
        pl.show(fig)
    return fig


def get_tsne_projection(word_vectors):
    tsne = TSNE(n_components=2, verbose=100)
    return scale(tsne.fit_transform(word_vectors))
    
    
def visualize_embeddings(embeddings, token, colors):
    tsne = get_tsne_projection(embeddings)
    draw_vectors(tsne[:, 0], tsne[:, 1], color=colors, token=token)

We have obtained embeddings again, this time at the character level.

Let's take a look at them.

**Task** Compute the vectors for random words and plot them.

In [None]:
import matplotlib.pyplot as plt  # needed for the plt.cm.tab20 colormap below

word_indices = np.random.choice(np.arange(len(data_test)), 1000, replace=False)
words = [data_test[ind] for ind in word_indices]
word_labels = [labels_test[ind] for ind in word_indices]

model.eval()
X_batch, y_batch = next(iterate_batches(words, word_labels, batch_size=1000))
embeddings = <calc me>

colors = plt.cm.tab20(y_batch) * 255

visualize_embeddings(embeddings, words, colors)

## Network visualization

At each step, the RNN produces a vector. The fully connected layer is applied only to the last output. But you can also look at the intermediate states to see how the network's opinion about which language the word belongs to changes as it reads more characters.

**Task** Write your visualizer.

## Network improvement

**Task** Replace SimpleRNN with LSTM. Compare quality.

**Task** Add Dropout to LSTM (or later). A value of about 0.3 will be adequate.

**Task** An important RNN variant is the bidirectional RNN. It is in fact two RNNs: one traverses the sequence from left to right, the other from right to left.

As a result, at each time step we have the vector $ h_t = [f_t; b_t] $, the concatenation (or some other function) of the states $ f_t $ and $ b_t $ of the forward and backward passes over the sequence. Together they cover the entire context.


In our task, the bidirectional variant can help the network forget less about how the sequence began. That is, we need to take the states $ f_N $ and $ b_N $: the first is the last state of the left-to-right pass, i.e. the output for the last character; the second is the last state of the backward pass, i.e. the output for the first character.

Implement the bidirectional classifier. To do this, `LSTM` has a `bidirectional` option.

# References

* [The Unreasonable Effectiveness of Recurrent Neural Networks, Andrej Karpathy](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
* [Understanding LSTM Networks, Christopher Olah](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
* [Recurrent Neural Networks Tutorial, Denny Britz](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/)
* [Vanishing Gradients & LSTMs, Harini Suresh](http://harinisuresh.com/2016/10/09/lstms/)
* [Non-Zero Initial States for Recurrent Neural Networks](https://r2rt.com/non-zero-initial-states-for-recurrent-neural-networks.html)
* [Explaining and illustrating orthogonal initialization for recurrent neural networks, Stephen Merity](http://smerity.com/articles/2016/orthogonal_init.html)
* [Comparative Study of CNN and RNN for Natural Language Processing, Yin, 2017](https://arxiv.org/abs/1702.01923)
* [cs224n "Lecture 8: Recurrent Neural Networks and Language Models"](https://www.youtube.com/watch?v=Keqep_PKrY8)