In [0]:
!pip3 -qq install torch==0.4.1
!pip install -qq bokeh==0.13.0
!wget -O surnames.txt -qq --no-check-certificate "https://drive.google.com/uc?export=download&id=1ji7dhr9FojPeV51dDlKRERIqr3vdZfhu"

In [0]:
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


if torch.cuda.is_available():
    from torch.cuda import FloatTensor, LongTensor
else:
    from torch import FloatTensor, LongTensor

np.random.seed(42)

# Recurrent neural networks, part 1

In [0]:
data, labels = [], []
with open('surnames.txt') as f:
    for line in f:
        surname, lang = line.strip().split('\t')
        data.append(surname)
        labels.append(lang)

for i in np.random.randint(0, len(data), 10):
    print(data[i], labels[i])

Test your knowledge: try to guess for yourself which language each surname belongs to :)

In [0]:
from sklearn.utils.class_weight import compute_class_weight

def test_generator():
    classes = np.unique(labels)
    weights = compute_class_weight('balanced', classes=classes, y=labels)
    classes = {label: ind for ind, label in enumerate(classes)}

    probs = np.array([weights[classes[label]] for label in labels])
    probs /= probs.sum()

    ind = np.random.choice(np.arange(len(data)), p=probs)
    yield data[ind]
    
    while True:
        new_ind = np.random.choice(np.arange(len(data)), p=probs)
        yield labels[ind], data[new_ind]
        ind = new_ind
        
gen = test_generator()
question = next(gen)

Run the cell, look at the surname that appears, and select its language in the drop-down list.

In [0]:
#@title Test yourself (or the sanity of the data) { run: "auto" }
answer = "Vietnamese" #@param ["Arabic", "Chinese", "Czech", "Dutch", "English", "French", "German", "Greek", "Irish", "Italian", "Japanese", "Korean", "Polish", "Portuguese", "Russian", "Scottish", "Spanish", "Vietnamese"]

correct_answer, question = next(gen)

if 'correct_count' not in globals():
    correct_count = 0
    total_count = 0
else:
    if answer == correct_answer:
        print('You are correct', end=' ')
        correct_count += 1
    else:
        print("No, it's", correct_answer, end=' ')

    total_count += 1
    print('({} / {})'.format(correct_count, total_count))
    
print('Next surname:', question)

### Data partitioning

First, split the data into train and test sets. The difficulty is that the classes are unevenly distributed, so each class should contribute a proportional share of its examples to the test set. To do this, use the `stratify` parameter of `train_test_split` (or `StratifiedShuffleSplit`, or, if you so desire, `GroupShuffleSplit`).

In [0]:
from sklearn.model_selection import train_test_split

data_train, data_test, labels_train, labels_test = train_test_split(
    data, labels, test_size=0.3, stratify=labels, random_state=42
)

In [0]:
import matplotlib.pyplot as plt
%matplotlib inline

from collections import Counter

langs = set(labels)

train_distribution = Counter(labels_train)
train_distribution = [train_distribution[lang] for lang in langs]

test_distribution = Counter(labels_test)
test_distribution = [test_distribution[lang] for lang in langs]

plt.figure(figsize=(17, 5))

bar_width = 0.35
plt.bar(np.arange(len(langs)), train_distribution, bar_width, align='center', alpha=0.5, label='train')
plt.bar(np.arange(len(langs)) + bar_width, test_distribution, bar_width, align='center', alpha=0.5, label='test')
plt.xticks(np.arange(len(langs)) + bar_width / 2, langs)
plt.legend()
    
plt.show()

You should always start with a baseline. Let's use our favorite pair, a vectorizer plus logistic regression:

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

model = Pipeline([
    ('vectorizer', CountVectorizer(analyzer='char', ngram_range=(1, 4))),
    ('log_regression', LogisticRegression())
])

model.fit(data_train, labels_train)


Which metrics should we compute? This is multi-class classification, so the choice is not obvious.

It makes sense to look at accuracy and the per-class F1-scores:

In [0]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

preds = model.predict(data_test)

print('Accuracy = {:.2%}'.format(accuracy_score(labels_test, preds)))
print('Classification report:')
print(classification_report(labels_test, preds))

F1-scores can be aggregated in different ways:

- weighted: the per-class F1-scores averaged with weights proportional to class support, as reported by `classification_report`; use it if predicting the more frequent surnames well matters more
- macro: a simple unweighted average; use it if every class matters equally, regardless of how many examples it has in the test sample
- micro: the usual F1-score computed from the true positives, false positives and false negatives pooled over all classes

Weighted and micro are the two averages that take class imbalance into account. But in our case the choice is not obvious: there is an imbalance, isn't there?
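To make the three averaging modes concrete, here is a small hand-rolled sketch in plain Python. The helper `f1_scores` and the toy labels are made up for illustration; in the notebook itself you would more likely call `sklearn.metrics.f1_score` with the `average` argument.

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Return per-class F1 plus the macro, weighted and micro averages."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted class gets a false positive
            fn[t] += 1   # true class gets a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in classes}
    support = Counter(y_true)

    macro = sum(per_class.values()) / len(classes)
    weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return per_class, macro, weighted, micro


# Toy imbalanced example (made-up labels, not the surnames data).
y_true = ['ru', 'ru', 'ru', 'ru', 'en', 'en']
y_pred = ['ru', 'ru', 'ru', 'ru', 'en', 'ru']
per_class, macro, weighted, micro = f1_scores(y_true, y_pred)
```

On this toy data the macro average is about 0.778, the weighted one about 0.815 and the micro one about 0.833. The micro value coincides with accuracy (5 of 6 correct), which always holds for single-label multi-class classification.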

In [0]:
import matplotlib.ticker as ticker

label_names = list(set(labels_test))
confusion = confusion_matrix(labels_test, preds, labels=label_names).astype(float)
confusion /= confusion.sum(axis=-1, keepdims=True)

fig = plt.figure(figsize=(9, 9))
ax = fig.add_subplot(111)
cax = ax.matshow(confusion, cmap='Reds')
fig.colorbar(cax)

ax.set_xticklabels([''] + label_names, rotation=45)
ax.set_yticklabels([''] + label_names)

ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

plt.show()

## Simple RNN


The main appeal of RNNs is parameter sharing. Look at the picture:

<img src="http://karpathy.github.io/assets/rnn/diags.jpeg">

*From [(The Unreasonable Effectiveness of Recurrent Neural Networks)](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)*

The first example is an ordinary fully connected network. Each of the following ones shows the processing of an input sequence of arbitrary length (red rectangles) and the generation of an output sequence, also of arbitrary length (blue rectangles).

In each diagram, the green rectangles share the same weights. So, on the one hand, we are training a very, very deep network (if you look at it unrolled over time), and on the other, it has a strictly limited number of parameters.

---
Let's write a simple RNN right away!

As a reminder, it does something like this:

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png">


Generally speaking, you can come up with many variations on this implementation. In our case, the processing will be as follows:

$$ h_t = \tanh(W_h [h_{t-1}; x_t] + b_h) $$

$h_{t-1}$ is the hidden state obtained at the previous step, $x_t$ is the input vector, and $[h_{t-1}; x_t]$ is a simple concatenation of the two vectors. Just like in the picture!
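As an illustration of the formula, here is a minimal NumPy sketch of the recurrence. The sizes, the random initialization and the helper name `rnn_step` are arbitrary assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4

# A single weight matrix acts on the concatenation [h_{t-1}; x_t].
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
b_h = np.zeros(hidden_size)

def rnn_step(h_prev, x_t):
    """One application of h_t = tanh(W_h [h_{t-1}; x_t] + b_h)."""
    return np.tanh(W_h @ np.concatenate([h_prev, x_t]) + b_h)

h = np.zeros(hidden_size)                      # initial hidden state
for x_t in rng.normal(size=(5, input_size)):   # a sequence of 5 input vectors
    h = rnn_step(h, x_t)                       # the same W_h and b_h at every step
```

Note that the loop reuses the same `W_h` and `b_h` at every step, which is exactly the parameter sharing discussed above.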

Let's check our network on a very simple task: make it output the first element of the sequence.

I.e., for the sequence `[1, 2, 1, 3]` the network must predict `1`.

Let's start with batch generation.

In [0]:
def generate_data(batch_size=128, seq_len=5):
    data = torch.randint(0, 10, size=(seq_len, batch_size), dtype=torch.long)
    return data, data[0]

X_val, y_val = generate_data()
X_val, y_val

Note that an RNN batch has shape `(sequence_length, batch_size, input_size)`: here `X_val` is `(sequence_length, batch_size)`, and the `input_size` dimension appears after the embedding layer. All RNNs in PyTorch use this layout by default.

This is done for performance reasons, but you can change this behavior with the `batch_first` argument if you wish.

**Task** Implement the `SimpleRNN` class, which performs the calculation using the formula above.

In [0]:
class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()

        self._hidden_size = hidden_size
        <create Linear layer>

    def forward(self, inputs, hidden=None):
        seq_len, batch_size = inputs.shape[:2]
        if hidden is None:
            hidden = inputs.new_zeros((batch_size, self._hidden_size))
         
        for i in range(seq_len):
            <apply linear layer to concatenation of current input (inputs[i]) and hidden>

        return hidden


It should now be clear why it is useful to have `seq_len` as the first dimension: we need to take `inputs[i]`, the sub-batch for the current timestep. With a different layout this operation would be much more expensive.
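One possible way to fill in the `SimpleRNN` skeleton is sketched below. The class name `SimpleRNNSketch` and the attribute `_linear` are illustrative choices, not part of the assignment, and other solutions are equally valid.

```python
import torch
import torch.nn as nn

class SimpleRNNSketch(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self._hidden_size = hidden_size
        # W_h and b_h from the formula: they act on [h_{t-1}; x_t].
        self._linear = nn.Linear(hidden_size + input_size, hidden_size)

    def forward(self, inputs, hidden=None):
        # inputs: (seq_len, batch_size, input_size)
        seq_len, batch_size = inputs.shape[:2]
        if hidden is None:
            hidden = inputs.new_zeros((batch_size, self._hidden_size))

        for i in range(seq_len):
            # Concatenate along the feature dimension, then apply tanh(W_h [...] + b_h).
            hidden = torch.tanh(self._linear(torch.cat((hidden, inputs[i]), dim=1)))

        return hidden  # last hidden state, (batch_size, hidden_size)

out = SimpleRNNSketch(input_size=10, hidden_size=32)(torch.randn(5, 8, 10))
```

Here `inputs[i]` slices out one timestep for the whole batch, which is the operation the `(seq_len, batch_size, ...)` layout makes cheap.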

**Task** Implement the `MemorizerModel` class as the sequence `Embedding -> SimpleRNN -> Linear`. You can use `nn.Sequential`.

To build the embeddings, you can use `nn.Embedding.from_pretrained`. For simplicity, we will use a one-hot encoding: just initialize the embedding layer with the identity matrix `torch.eye(N)`.
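A rough sketch of such a model is shown below. To keep the example self-contained it uses the built-in `nn.RNN` as a stand-in for `SimpleRNN` (it also applies a `tanh` recurrence, but with separate input and hidden weight matrices); the class name `MemorizerSketch` and the `vocab_size` argument are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MemorizerSketch(nn.Module):
    def __init__(self, hidden_size, vocab_size=10):
        super().__init__()
        # One-hot "embeddings": row i of the identity matrix encodes symbol i.
        # from_pretrained freezes the weights by default.
        self._embedding = nn.Embedding.from_pretrained(torch.eye(vocab_size))
        # Stand-in for SimpleRNN (assumption made for this sketch).
        self._rnn = nn.RNN(vocab_size, hidden_size)
        self._out = nn.Linear(hidden_size, vocab_size)

    def forward(self, inputs):
        embedded = self._embedding(inputs)       # (seq_len, batch, vocab_size)
        _, last_hidden = self._rnn(embedded)     # (1, batch, hidden_size)
        return self._out(last_hidden.squeeze(0)) # (batch, vocab_size) logits

tokens = torch.randint(0, 10, (5, 128), dtype=torch.long)
logits = MemorizerSketch(hidden_size=32)(tokens)
```

With your own `SimpleRNN`, which returns the last hidden state directly, the `forward` method gets simpler and the three layers fit into an `nn.Sequential`.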

In [0]:
# you can use nn.Sequential too
class MemorizerModel(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()

        <create layers>

    def forward(self, inputs):
        <apply 'em>

Run the training:

In [0]:
rnn = MemorizerModel(hidden_size=32)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(rnn.parameters())

total_loss = 0
epochs_count = 1000
for epoch_ind in range(epochs_count):
    X_train, y_train = generate_data(seq_len=25)
    
    optimizer.zero_grad()
    rnn.train()
    
    logits = rnn(X_train)

    loss = criterion(logits, y_train)
    loss.backward()
    optimizer.step()
    
    total_loss += loss.item()
    
    if (epoch_ind + 1) % 100 == 0:
        rnn.eval()
        
        with torch.no_grad():
            logits = rnn(X_val)
            val_loss = criterion(logits, y_val)
            print('[{}/{}] Train: {:.3f} Val: {:.3f}'.format(epoch_ind + 1, epochs_count, 
                                                             total_loss / 100, val_loss.item()))
            total_loss = 0

**Task** Look at how the sequence length affects the network's performance.

First, see how long a sequence the network is able to learn. Second, try training the network on short sequences and then applying it to longer ones.

**Task** It is claimed that `relu` suits RNNs better. Try it too.

## Training RNN


<img src="https://image.ibb.co/cEYkw9/rnn_bptt_with_gradients.png">

*From [Recurrent Neural Networks Tutorial, Part 3 – Backpropagation Through Time and Vanishing Gradients](http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/)*


If everything went according to plan, you should have seen how RNNs forget.

To understand why, it is worth recalling exactly how an RNN is trained, for example here: [Backpropagation Through Time and Vanishing Gradients](http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/) or here: [Vanishing Gradients & LSTMs](http://harinisuresh.com/2016/10/09/lstms/).

In short, one of the problems in training recurrent networks is *exploding gradients*. It shows up when the weight matrix is such that it increases the norm of the gradient vector during the backward pass. As a result, the gradient norm grows exponentially and it "explodes."
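The exponential growth is easy to see on a toy example. The sketch below is a deliberate simplification: it ignores the activation derivative and uses a scaled orthogonal matrix as the recurrent weights, so each backward step multiplies the gradient norm by exactly `scale`. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, steps = 8, 50

# Random orthogonal matrix Q: multiplying by it preserves vector norms.
Q, _ = np.linalg.qr(rng.normal(size=(hidden_size, hidden_size)))

def backprop_norms(scale):
    """Norm of the gradient after each step back in time for W = scale * Q."""
    W = scale * Q
    grad = np.ones(hidden_size) / np.sqrt(hidden_size)  # unit-norm gradient at the last step
    norms = []
    for _ in range(steps):
        grad = W.T @ grad                # one step back (activation derivative ignored)
        norms.append(np.linalg.norm(grad))
    return norms

exploding = backprop_norms(1.2)          # norm after k steps equals 1.2 ** k
```

After 50 steps the norm is `1.2 ** 50`, which is already in the thousands; calling `backprop_norms` with a scale below 1 shows the opposite effect.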

This problem can be addressed with gradient clipping: `nn.utils.clip_grad_norm_(rnn.parameters(), 1.)`.
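As a sketch of where the clipping call goes in a training step, consider the toy example below. The model, the random data and the artificial loss are assumptions chosen only to produce large gradients; the important part is that clipping happens after `backward()` and before `optimizer.step()`.

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=4, hidden_size=8)   # any nn.Module would do here
optimizer = torch.optim.Adam(model.parameters())

inputs = torch.randn(25, 16, 4)               # (seq_len, batch, input_size)
_, h_n = model(inputs)
loss = (100 * h_n).pow(2).sum()               # artificial loss, scaled up on purpose

optimizer.zero_grad()
loss.backward()
# Rescale all gradients together so that their global L2 norm is at most 1.
nn.utils.clip_grad_norm_(model.parameters(), 1.)
clipped_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
optimizer.step()
```

Clipping by norm keeps the direction of the gradient and only shrinks its length, so it is a fairly gentle fix.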

## LSTM vs GRU

Another problem is *vanishing gradients*. It is the opposite situation: the gradients decay exponentially. This one is solved in more sophisticated ways.

Namely, by using gated architectures.

The idea of a gate is simple but important, and gates are used not only in recurrent networks.

If you look at how our SimpleRNN works, you will notice that the memory (i.e., $h_t$) is overwritten at every step. We would like to make this rewriting controlled, so that important information in the vector is not discarded.

To do this, let's introduce a vector $g \in \{0,1\}^n$ that says which cells of $h_{t-1}$ should be kept and which should be replaced with new values:

$$ h_t = g \odot f (x_t, h_{t-1}) + (1 - g) \odot h_{t-1}. $$

For example:
$$
 \begin{bmatrix}
  8 \\
  11 \\
  3 \\
  7
 \end{bmatrix} =
 \begin{bmatrix}
  0 \\
  1 \\
  0 \\
  0
 \end{bmatrix}
 \odot
  \begin{bmatrix}
  7 \\
  11 \\
  6 \\
  5
 \end{bmatrix}
 +
  \begin{bmatrix}
  1 \\
  0 \\
  1 \\
  1
 \end{bmatrix}
 \odot
  \begin{bmatrix}
  8 \\
  5 \\
  3 \\
  7
 \end{bmatrix}
$$

To make the gate differentiable, we relax it to values in $(0, 1)$ by passing a learned function of $x_t$ and $h_{t-1}$ through a sigmoid: $\sigma(f(x_t, h_{t-1}))$.

As a result, the network itself decides, looking at its inputs, which cells of its memory to rewrite and by how much.
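The worked example above can be reproduced in a few lines of NumPy, together with a soft (sigmoid) gate. The `gate_logits` values are made up for illustration; in a real network they would be computed from $x_t$ and $h_{t-1}$.

```python
import numpy as np

h_prev = np.array([8., 5., 3., 7.])        # old memory h_{t-1}
candidate = np.array([7., 11., 6., 5.])    # new values f(x_t, h_{t-1})
g = np.array([0., 1., 0., 0.])             # hard gate: overwrite only the second cell

h_hard = g * candidate + (1 - g) * h_prev  # [8, 11, 3, 7], as in the example

# Soft gate: a sigmoid gives values in (0, 1), so cells are blended, not switched.
gate_logits = np.array([-4., 4., 0., -4.])
g_soft = 1 / (1 + np.exp(-gate_logits))
h_soft = g_soft * candidate + (1 - g_soft) * h_prev
```

With a logit of 0 the gate equals 0.5, so the third cell of `h_soft` is the plain average of the old and new values.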

### LSTM

It seems the first architecture to use this mechanism was the LSTM (Long Short-Term Memory).

In it, alongside $h_{t-1}$ we also keep $c_{t-1}$: $h_{t-1}$ is still the hidden state obtained at the previous step, and $c_{t-1}$ is the memory vector.

Schematically, it looks something like this:

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png">

*From [(Understanding LSTM Networks)](http://colah.github.io/posts/2015-08-Understanding-LSTMs)*


To begin with, we can compute a candidate for the new memory exactly as before (we denote it by $\tilde c_t$):

$$ \tilde c_t = \tanh(W_h [h_{t-1}; x_t] + b_h) $$

In a plain RNN we would simply overwrite the hidden state with this value. Now we want to decide how much information to take from $c_{t-1}$ and how much from $\tilde c_t$.

We estimate this with sigmoids:
$$f = \sigma(W_f [h_{t-1}; x_t] + b_f),$$
$$i = \sigma(W_i [h_{t-1}; x_t] + b_i).$$

The first, the forget gate, controls how much of the old information is kept ($f = 0$ forgets it completely). The second says how interesting the new information is. Then

$$ c_t = f \odot c_{t-1} + i \odot \tilde c_t. $$

The new hidden state is gated as well:

$$ o = \sigma(W_o [h_{t-1}; x_t] + b_o), $$
$$ h_t = o \odot \tanh(c_t). $$
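Put together, the formulas above amount to the following single step. This is a minimal NumPy sketch with randomly initialized parameters; the `params` dictionary and the function names are illustrative assumptions, not how PyTorch stores LSTM weights.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step, following the formulas term by term."""
    z = np.concatenate([h_prev, x_t])                      # [h_{t-1}; x_t]
    c_tilde = np.tanh(params['W_h'] @ z + params['b_h'])   # candidate memory
    f = sigmoid(params['W_f'] @ z + params['b_f'])         # forget gate
    i = sigmoid(params['W_i'] @ z + params['b_i'])         # input gate
    o = sigmoid(params['W_o'] @ z + params['b_o'])         # output gate
    c_t = f * c_prev + i * c_tilde
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
params = {}
for name in 'hfio':
    params['W_' + name] = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
    params['b_' + name] = np.zeros(hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):               # a sequence of 5 inputs
    h, c = lstm_step(x_t, h, c, params)
```

Each of the four blocks has the same shape as the single block of the simple RNN, which is where the parameter count comes from.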

Another picture:

<img src="https://image.ibb.co/e6HQUU/details.png">
 
*From [Vanishing Gradients & LSTMs](http://harinisuresh.com/2016/10/09/lstms/)*

Why does this solve the vanishing gradient problem? Look at the derivative $\frac{\partial c_t}{\partial c_{t-1}}$: along the direct path it is proportional to the gate $f$. If $f = 1$, the gradients flow unchanged. Otherwise, well, the network itself learns when it wants to forget something.

It is highly recommended to read [Understanding LSTM Networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) for more details and fun pictures.

Why did I write out all these formulas? Mainly to show how many more parameters an LSTM has to learn compared to a regular RNN. Four times as many!
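The factor of four can be checked with simple arithmetic, counting parameters exactly as in the formulas above (one weight matrix and one bias per block). The function names and the sizes 16 and 64 are just an example.

```python
def rnn_param_count(input_size, hidden_size):
    # W_h has shape (hidden, hidden + input), b_h has shape (hidden,).
    return hidden_size * (hidden_size + input_size) + hidden_size

def lstm_param_count(input_size, hidden_size):
    # Four blocks of the same shape: candidate memory, forget, input and output gates.
    return 4 * rnn_param_count(input_size, hidden_size)

rnn_params = rnn_param_count(16, 64)     # 64 * 80 + 64 = 5184
lstm_params = lstm_param_count(16, 64)   # 4 * 5184 = 20736
```

PyTorch's `nn.RNN` and `nn.LSTM` keep two bias vectors per block, so their absolute counts are slightly higher, but the ratio is still four.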

For those who dozed off: [a video of how an RNN forgets (bottom)](https://www.youtube.com/watch?v=mLxsbWAYIpw)

## Data preprocessing

In [0]:
symbols = set(symb for word in data_train for symb in word)
char2ind = {symb: ind + 1 for ind, symb in enumerate(symbols)}
char2ind[''] = 0

lang2ind = {lang: ind for ind, lang in enumerate(set(labels_train))}

Convert the dataset.

**Task** Write a batch generator that samples a random set of words on the fly and converts them into matrices.
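The core of such a generator is padding words of different lengths into one matrix. Below is a small sketch of that step alone; the helper `pad_batch` and the toy vocabulary are hypothetical, and index 0 is assumed to be the padding symbol, as in `char2ind` above.

```python
import numpy as np

def pad_batch(encoded_words, pad_index=0):
    """Stack index sequences into a (max_len, batch_size) matrix, padding at the end."""
    max_len = max(len(word) for word in encoded_words)
    # Integer dtype: the matrix holds symbol indices for an embedding layer.
    X = np.full((max_len, len(encoded_words)), pad_index, dtype=np.int64)
    for col, word in enumerate(encoded_words):
        X[:len(word), col] = word          # each word occupies one column
    return X

toy_char2ind = {'': 0, 'a': 1, 'b': 2, 'c': 3}
words = ['abc', 'ca', 'b']
encoded = [[toy_char2ind.get(ch, 0) for ch in word] for word in words]
X = pad_batch(encoded)
# X == [[1, 3, 2],
#       [2, 1, 0],
#       [3, 0, 0]]
```

Words run along the first axis, so the result already has the `(sequence_length, batch_size)` layout that the RNN expects.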

In [0]:
def iterate_batches(data, labels, char2ind, lang2ind, batch_size):
    # let's do the conversion part first
    labels = np.array([lang2ind[label] for label in labels])
    data = [[char2ind.get(symb, 0) for symb in word] for word in data]
    
    indices = np.arange(len(data))
    np.random.shuffle(indices)
    
    for start in range(0, len(data), batch_size):
        end = min(start + batch_size, len(data))
        
        batch_indices = indices[start: end]
        
        max_word_len = max(len(data[ind]) for ind in batch_indices)
        X = np.zeros((max_word_len, len(batch_indices)))
        <fill X>
            
        yield X, labels[batch_indices]

To avoid passing `char2ind, lang2ind` every time:

In [0]:
from functools import partial

iterate_batches = partial(iterate_batches, char2ind=char2ind, lang2ind=lang2ind)

In [0]:
next(iterate_batches(data, labels, batch_size=8))

**Task** Implement a simple model based on `SimpleRNN`.
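One possible shape for such a classifier is sketched below. It uses the built-in `nn.RNN` as a stand-in for `SimpleRNN` so that the example runs on its own; the class name, attribute names and the toy sizes in the usage lines are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SurnamesClassifierSketch(nn.Module):
    def __init__(self, vocab_size, emb_dim, lstm_hidden_dim, classes_count):
        super().__init__()
        # Index 0 is the padding symbol, so its embedding stays zero.
        self._embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self._rnn = nn.RNN(emb_dim, lstm_hidden_dim)   # stand-in for SimpleRNN
        self._out = nn.Linear(lstm_hidden_dim, classes_count)

    def forward(self, inputs):
        'embed(inputs) -> prediction'
        return self._out(self.embed(inputs))

    def embed(self, inputs):
        'inputs -> word embedding'
        # inputs: (seq_len, batch) of symbol indices
        _, last_hidden = self._rnn(self._embedding(inputs))
        return last_hidden.squeeze(0)                   # (batch, lstm_hidden_dim)

model_sketch = SurnamesClassifierSketch(vocab_size=30, emb_dim=16,
                                        lstm_hidden_dim=64, classes_count=18)
batch = torch.randint(0, 30, (12, 8), dtype=torch.long)
logits = model_sketch(batch)
```

A simplification to keep in mind: with padding at the end, the last hidden state is computed after the pad symbols, and this sketch makes no attempt to mask them.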

In [0]:
class SurnamesClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim, lstm_hidden_dim, classes_count):
        super().__init__()
        
        <set layers>
            
    def forward(self, inputs):
        'embed(inputs) -> prediction'
        <implement it>
    
    def embed(self, inputs):
        'inputs -> word embedding'
        <and it> 

In [0]:
import math
import time

def do_epoch(model, criterion, data, batch_size, optimizer=None):  
    epoch_loss = 0.
    
    is_train = optimizer is not None
    model.train(is_train)
    
    data, labels = data
    batchs_count = math.ceil(len(data) / batch_size)
    
    with torch.autograd.set_grad_enabled(is_train):
        for i, (X_batch, y_batch) in enumerate(iterate_batches(data, labels, batch_size=batch_size)):
            X_batch, y_batch = LongTensor(X_batch), LongTensor(y_batch)

            logits = model(X_batch)
            loss = criterion(logits, y_batch)
            epoch_loss += loss.item()

            if is_train:
                optimizer.zero_grad()
                loss.backward()
                nn.utils.clip_grad_norm_(model.parameters(), 1.)
                optimizer.step()

            print('\r[{} / {}]: Loss = {:.4f}'.format(i, batchs_count, loss.item()), end='')
                
    return epoch_loss / batchs_count

def fit(model, criterion, optimizer, train_data, epochs_count=1, 
        batch_size=32, val_data=None, val_batch_size=None):
    if val_data is not None and val_batch_size is None:
        val_batch_size = batch_size
        
    for epoch in range(epochs_count):
        start_time = time.time()
        train_loss = do_epoch(model, criterion, train_data, batch_size, optimizer)
        
        output_info = '\rEpoch {} / {}, Epoch Time = {:.2f}s: Train Loss = {:.4f}'
        if val_data is not None:
            val_loss = do_epoch(model, criterion, val_data, val_batch_size, None)
            
            epoch_time = time.time() - start_time
            output_info += ', Val Loss = {:.4f}'
            print(output_info.format(epoch+1, epochs_count, epoch_time, train_loss, val_loss))
        else:
            epoch_time = time.time() - start_time
            print(output_info.format(epoch+1, epochs_count, epoch_time, train_loss))

In [0]:
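# Fall back to the CPU when CUDA is unavailable, matching the FloatTensor/LongTensor import above.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SurnamesClassifier(vocab_size=len(char2ind), emb_dim=16, lstm_hidden_dim=64, classes_count=len(lang2ind)).to(device)

criterion = nn.CrossEntropyLoss().to(device)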
optimizer = optim.Adam(model.parameters())

fit(model, criterion, optimizer, epochs_count=50, batch_size=128, train_data=(data_train, labels_train),
    val_data=(data_test, labels_test), val_batch_size=512)

**Task** Write a function for testing the trained network: it should take a word and report the languages it could be a surname in, along with their probabilities.
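A hedged sketch of such a function is given below. The name `predict_languages`, its arguments and the tiny `_ToyModel` used to exercise it are all hypothetical; with the real notebook objects you would pass `model`, `char2ind` and an inverse of `lang2ind` instead.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def predict_languages(model, word, char2ind, ind2lang, top_k=3):
    """Return the top_k (language, probability) pairs for a single word."""
    device = next(model.parameters()).device
    indices = [char2ind.get(symb, 0) for symb in word]   # unknown symbols map to 0
    inputs = torch.tensor(indices, dtype=torch.long, device=device).unsqueeze(1)  # (seq_len, 1)

    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(inputs), dim=-1).squeeze(0)  # (classes_count,)

    top_probs, top_inds = probs.topk(min(top_k, len(ind2lang)))
    return [(ind2lang[i], p) for p, i in zip(top_probs.tolist(), top_inds.tolist())]


class _ToyModel(nn.Module):
    """Untrained stand-in with the same input/output shapes as the classifier."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(5, 4)
        self.out = nn.Linear(4, 3)

    def forward(self, inputs):
        return self.out(self.emb(inputs).mean(dim=0))    # (batch, 3) logits

toy_char2ind = {'': 0, 'a': 1, 'b': 2, 'c': 3, 'd': 4}
toy_ind2lang = ['English', 'Russian', 'Czech']
result = predict_languages(_ToyModel(), 'abcd', toy_char2ind, toy_ind2lang)
```

The softmax turns the logits into a probability distribution over languages, and `topk` keeps only the most likely ones, sorted in descending order.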

**Task** Evaluate the quality of the model.

In [0]:
model.eval()

y_test, y_pred = [], []
<calc 'em>

print('Accuracy = {:.2%}'.format(accuracy_score(y_test, y_pred)))
print('Classification report:')
print(classification_report(y_test, y_pred, 
                            target_names=[lang for lang, _ in sorted(lang2ind.items(), key=lambda x: x[1])]))

## Embedding visualization

In [0]:
import bokeh.models as bm, bokeh.plotting as pl
from bokeh.colors import RGB
from bokeh.io import output_notebook

from sklearn.manifold import TSNE
from sklearn.preprocessing import scale


def draw_vectors(x, y, radius=10, alpha=0.25, color='blue',
                 width=600, height=400, show=True, **kwargs):
    """ draws an interactive plot for data points with auxiliary info on hover """
    output_notebook()
    
    if isinstance(color, str): 
        color = [color] * len(x)
    if isinstance(color, np.ndarray):
        color = [RGB(*rgba[:3]) for rgba in color]
    data_source = bm.ColumnDataSource({ 'x' : x, 'y' : y, 'color': color, **kwargs })

    fig = pl.figure(active_scroll='wheel_zoom', width=width, height=height)
    fig.scatter('x', 'y', size=radius, color='color', alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, "@" + key) for key in kwargs.keys()]))
    if show: 
        pl.show(fig)
    return fig


def get_tsne_projection(word_vectors):
    tsne = TSNE(n_components=2, verbose=100)
    return scale(tsne.fit_transform(word_vectors))
    
    
def visualize_embeddings(embeddings, token, colors):
    tsne = get_tsne_projection(embeddings)
    draw_vectors(tsne[:, 0], tsne[:, 1], color=colors, token=token)

We have obtained embeddings once again, this time at the character level.

Let's take a look at them.

**Task** Compute the vectors for random words and display them.

In [0]:
word_indices = np.random.choice(np.arange(len(data_test)), 1000, replace=False)
words = [data_test[ind] for ind in word_indices]
word_labels = [labels_test[ind] for ind in word_indices]

model.eval()
X_batch, y_batch = next(iterate_batches(words, word_labels, batch_size=1000))
embeddings = <calc me>

colors = plt.cm.tab20(y_batch) * 255

visualize_embeddings(embeddings, words, colors)

## Network visualization

At each step, the RNN produces some vector. The fully connected layer is applied only to the last output. But you can also look at the intermediate states to see how the network's opinion about the word's language changed along the way.

**Task** Write your own visualizer.

## Network improvement

**Task** Replace SimpleRNN with LSTM. Compare quality.

**Task** Add Dropout to LSTM (or later). A value of about 0.3 will be adequate.

**Task** An important RNN variant is the bidirectional RNN. In fact, it consists of two RNNs: one traverses the sequence from left to right, the other from right to left.

As a result, at each time step we have the vector $h_t = [f_t; b_t]$: the concatenation (or some other function) of the states $f_t$ and $b_t$ of the forward and backward passes over the sequence. Together they cover the entire context.


In our task, the bidirectional variant can help because the network will forget less about how the sequence began. We need to take the states $f_N$ and $b_N$: the first is the last state of the left-to-right pass, i.e. the output for the last character; the second is the last state of the backward pass, i.e. the output for the first character.

Implement a bidirectional classifier. For this, `LSTM` has the `bidirectional` option.
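As a starting point, the sketch below shows where the two final states live in the output of a single-layer bidirectional `nn.LSTM`. The sizes and the random input are arbitrary, and the variable names are illustrative.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=16, hidden_size=64, bidirectional=True)
embedded = torch.randn(12, 8, 16)            # (seq_len, batch, emb_dim)

output, (h_n, c_n) = lstm(embedded)
# output: (seq_len, batch, 2 * hidden), forward and backward states concatenated
# h_n:    (2, batch, hidden) for a single layer: [forward, backward]

forward_last = h_n[0]    # f_N: forward pass after the last character
backward_last = h_n[1]   # b_N: backward pass after it has reached the first character
word_vector = torch.cat((forward_last, backward_last), dim=-1)   # (batch, 2 * hidden)
```

Note that taking `output[-1]` would not work here: its backward half is the state after seeing only the last character. The final backward state sits in `output[0, :, 64:]`, which is the same tensor as `h_n[1]`.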

# References

[The Unreasonable Effectiveness of Recurrent Neural Networks, Andrej Karpathy](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)  
[Understanding LSTM Networks, Christopher Olah](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)  
[Recurrent Neural Networks Tutorial, Denny Britz](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/)  
[Vanishing Gradients & LSTMs, Harini Suresh](http://harinisuresh.com/2016/10/09/lstms/)  
[Non-Zero Initial States for Recurrent Neural Networks](https://r2rt.com/non-zero-initial-states-for-recurrent-neural-networks.html)  
[Explaining and illustrating orthogonal initialization for recurrent neural networks, Stephen Merity](http://smerity.com/articles/2016/orthogonal_init.html)  
[Comparative Study of CNN and RNN for Natural Language Processing, Yin, 2017](https://arxiv.org/abs/1702.01923)  
[cs224n "Lecture 8: Recurrent Neural Networks and Language Models"](https://www.youtube.com/watch?v=Keqep_PKrY8)