In [None]:
!pip3 -qq install torch==0.4.1
!pip -qq install torchtext==0.3.1
!pip -qq install spacy==2.0.16
!pip -qq install gensim==3.6.0
!pip -qq install allennlp==0.7.2
!pip -qq install pytorch-pretrained-bert==0.1.2
!python -m spacy download en

!git clone https://github.com/rowanz/swagaf.git
!wget -O conll_2003.zip -qq --no-check-certificate "https://drive.google.com/uc?export=download&id=1z7swIvfWs97lKkTpHKIeV7Cgv4NKxGC0"
!unzip conll_2003.zip
!wget -qq https://raw.githubusercontent.com/yandexdataschool/nlp_course/master/week08_multitask/conlleval.py
!wget -O fintech.zip -qq --no-check-certificate "https://drive.google.com/uc?export=download&id=110Mi9nF0J_FTv1MHhf1G-FsZIWEErMzK"
!unzip fintech.zip

In [None]:
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


if torch.cuda.is_available():
    from torch.cuda import FloatTensor, LongTensor
    DEVICE = torch.device('cuda')
else:
    from torch import FloatTensor, LongTensor
    DEVICE = torch.device('cpu')

np.random.seed(42)

# Pretrained Models

One of the most notable trends in NLP in 2018 is the use of pre-trained language models as components in models that are trained for specific tasks.

In fact, we are already familiar with this approach: word2vec is also a language model, just a very simple one. The difference is that we now move from embeddings of words without context to embeddings of words in context.

A rather grandiose explanation of why this is important: [NLP's ImageNet moment has arrived](http://ruder.io/nlp-imagenet/).

The quality goes up, while the speed goes down.

## Universal Sentence Encoder

Let's start with something that is not quite a language model: [Universal Sentence Encoder](https://arxiv.org/pdf/1803.11175.pdf). This model was pre-trained in a multi-task setting: the encoder learned to produce "universal" sentence representations, on top of which task-specific decoders learned things like predicting the previous and next sentences (a Skip-Thoughts-like model) or ordinary classification on labeled data.

The main point is that the representations are quite interpretable on their own, even without any special training for the target task.

*Unfortunately, I do not know how to use this model in PyTorch, so we have to live with TensorFlow...*

In [None]:
import tensorflow as tf
import tensorflow_hub as hub


tf.reset_default_graph()
sess = tf.InteractiveSession()

universal_sentence_encoder = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2", trainable=False)
sess.run([tf.global_variables_initializer(), tf.tables_initializer()])

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

inputs = tf.placeholder(tf.string, shape=[None])
outputs = universal_sentence_encoder(inputs)

lines = [
    "How old are you?",                                                                 # 0
    "Attempting to light a cigarette, someone fumbles with the lighter and drops it.",  # 1
    "This is the story of a man named Neil Fisk, and how he came to love God.",         # 2
    "What is your age?",                                                                # 3
    "Do you have a moment to talk about our Lord?",                                     # 4
]

result = sess.run(outputs, {
    inputs: lines
})

plt.title('phrase similarity')
plt.imshow(result.dot(result.T), interpolation='none', cmap='gray')

For example, the phrases "How old are you?" and "What is your age?" share no words, yet their cosine similarity is quite high. The same holds for phrases 2 and 4.

Note that the vectors at the model output are already normalized, so the cosine similarity is computed as a plain dot product.
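To see why the dot product is enough, here is a minimal sketch on random toy vectors (not the encoder outputs; the names `vectors`, `unit`, `dot` and `cosine` are made up for the illustration). It checks that, once rows are scaled to unit length, `unit @ unit.T` matches the full cosine formula.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for sentence vectors: 4 vectors of dimension 8
vectors = rng.normal(size=(4, 8))

# Scale every row to unit L2 norm
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Dot products between unit vectors
dot = unit @ unit.T

# Cosine similarity computed from the unnormalized vectors
norms = np.linalg.norm(vectors, axis=1)
cosine = (vectors @ vectors.T) / np.outer(norms, norms)

print(np.allclose(dot, cosine))
```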

As an example of using these ideas almost for free, let's solve this task, borrowed from another course: https://www.kaggle.com/c/fintech-tinkoff

In [None]:
import pandas as pd

train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

We are given 60 thousand pairs of similar questions plus a bunch of questions from the "other" category. The goal is to determine which pair a given question is closest to, or to say that it belongs to "other".

In [None]:
train_data.iloc[:10]

The model is quite capable of telling that questions are similar!

In [None]:
result = sess.run(outputs, {
    inputs: train_data.iloc[:10].text
})

plt.title('phrase similarity')
plt.imshow(result.dot(result.T), interpolation='none', cmap='gray')

Let's write a function that computes embeddings for all sentences in the dataset:

In [None]:
def calc_vectors(data):
    BATCH_SIZE = 1024

    vectors = []
    for batch_begin in range(0, len(data), BATCH_SIZE):
        batch_end = min(len(data), batch_begin + BATCH_SIZE)

        vectors.append(
            sess.run(outputs, {
                inputs: data.iloc[batch_begin: batch_end].text
            })
        )

    return np.concatenate(vectors, 0)

In [None]:
train_vectors = calc_vectors(train_data)
train_labels = train_data['labels'].values

test_vectors = calc_vectors(test_data)

In [None]:
for test_ind in range(10):
    print(test_data.iloc[test_ind].text, train_data.iloc[(train_vectors * test_vectors[test_ind]).sum(-1).argmax()].text, sep='\t')

**Task** We have a database of vectors and their corresponding labels. Implement 1NN: a nearest-neighbor search for the queries from the test set.

The label of a query can then be taken from its nearest neighbor.

In general, fast search over a large database relies on approximate algorithms such as [HNSW](https://github.com/nmslib/hnswlib). That is not necessary here, but you can still try replacing your 1NN with 2NN from that library.
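One possible shape of a brute-force 1NN is sketched below on toy data. It assumes unit-length vectors, so the dot product ranks neighbors the same way as cosine similarity. The function name `nearest_neighbor_labels` and the toy arrays are hypothetical; queries are processed in batches so the similarity matrix stays small.

```python
import numpy as np


def nearest_neighbor_labels(base_vectors, base_labels, query_vectors, batch_size=1024):
    """For each query, return the label of the base vector with the largest dot product."""
    predictions = []
    for begin in range(0, len(query_vectors), batch_size):
        batch = query_vectors[begin: begin + batch_size]
        similarities = batch @ base_vectors.T   # shape: (batch, base_size)
        nearest = similarities.argmax(axis=1)   # index of the closest base vector
        predictions.append(base_labels[nearest])
    return np.concatenate(predictions)


# Toy base of three unit vectors with made-up labels
toy_base = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
toy_labels = np.array([10, 20, 30])

toy_queries = np.array([[0.9, 0.1], [-0.8, 0.2], [0.1, 0.95]])
toy_queries = toy_queries / np.linalg.norm(toy_queries, axis=1, keepdims=True)

toy_predictions = nearest_neighbor_labels(toy_base, toy_labels, toy_queries, batch_size=2)
print(toy_predictions)  # [10 30 20]
```

With the notebook's data the call would presumably look like `nearest_neighbor_labels(train_vectors, train_labels, test_vectors)`.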

## ELMo

[ELMo](https://arxiv.org/pdf/1802.05365.pdf) is a different story. It is an ordinary language model:
<img src="https://i.ibb.co/dpp00wG/elmo.png" width="50%">

*From [Improving a Sentiment Analyzer using ELMo — Word Embeddings on Steroids](http://www.realworldnlpbook.com/blog/improving-sentiment-analyzer-using-elmo.html)*

Well, almost ordinary. First, word embeddings are built by a convolutional network over characters.

Second, two language models are trained at once, a forward one and a backward one, and their outputs are then concatenated.

Third, and most importantly, the point of the model is that it produces a word embedding that depends on the context. This could simply be the output of the last LSTM layer, but the authors did something smarter: for each word we get several embeddings at once, namely the outputs of each LSTM layer plus the output of the character-level convolutional network. When the final model is trained for the target task, the word embedding is computed as a weighted sum of these embeddings. The weights are learned for the task.

As a result, when we want to use ELMo in our model, we take this language model and plug it in instead of our usual embeddings. That's it: a couple of lines changed, and we get better but much slower embeddings.

You can fine-tune the language model for the task, but then you have to retrain all those millions of parameters, which is slow. Or you can leave it frozen, in which case the only thing learned inside ELMo is the `(num_layers + 1)` parameters: the weights of the embedding mix.
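The weighted sum of per-layer embeddings can be sketched as a tiny PyTorch module. This is an illustration under assumptions, not the allennlp implementation: the class name `ScalarMixSketch` is made up, the weights are softmax-normalized, and the extra learned scale factor that the published ELMo applies to the sum is omitted so that the parameter count matches the `(num_layers + 1)` mentioned above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScalarMixSketch(nn.Module):
    """Learned weighted sum of several same-shaped representations (hypothetical helper)."""

    def __init__(self, num_mixed):
        super().__init__()
        # One scalar weight per representation; zeros give a uniform mix at the start
        self.weights = nn.Parameter(torch.zeros(num_mixed))

    def forward(self, layer_outputs):
        # layer_outputs: list of tensors, each of shape (batch, seq_len, dim)
        stacked = torch.stack(layer_outputs, dim=0)
        normalized = F.softmax(self.weights, dim=0)
        return (normalized.view(-1, 1, 1, 1) * stacked).sum(dim=0)


num_layers = 2  # two LSTM layers plus the character CNN output give three representations
layer_outputs = [torch.full((2, 5, 4), float(i)) for i in range(num_layers + 1)]

mix = ScalarMixSketch(num_layers + 1)
mixed = mix(layer_outputs)

print(mixed.shape)
print(sum(p.numel() for p in mix.parameters()))
```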

### NER

Such embeddings can be used anywhere, but they work best in sequence labeling tasks, where it is more important to get embeddings that take context into account. For classification tasks, Universal Sentence Encoder looks like a more logical choice than ELMo.

In [None]:
def read_dataset(path):
    data = []
    with open(path) as f:
        words, tags = [], []
        for line in f:
            line = line.strip()
            if not line:
                # A blank line ends the current sentence (if there is one)
                if words:
                    data.append((words, tags))
                    words, tags = [], []
                continue
            word, pos_tag, synt_tag, ner_tag = line.split()
            words.append(word)
            tags.append(ner_tag)
        if words:
            data.append((words, tags))
    return data

In [None]:
train_data = read_dataset('train.txt')
val_data = read_dataset('valid.txt')
test_data = read_dataset('test.txt')

We have already looked at NER, but in general it looks like this:

In [None]:
train_data[:3]

Let's build the dataset:

In [None]:
from torchtext.data import Field, Example, Dataset, BucketIterator

tokens_field = Field(unk_token=None, batch_first=True)
tags_field = Field(unk_token=None, batch_first=True)

fields = [('tokens', tokens_field), ('tags', tags_field)]

train_dataset = Dataset([Example.fromlist(example, fields) for example in train_data], fields)
val_dataset = Dataset([Example.fromlist(example, fields) for example in val_data], fields)
test_dataset = Dataset([Example.fromlist(example, fields) for example in test_data], fields)

tokens_field.build_vocab(train_dataset, val_dataset, test_dataset)
tags_field.build_vocab(train_dataset)

print('Vocab size =', len(tokens_field.vocab))
print('Tags count =', len(tags_field.vocab))

train_iter, val_iter, test_iter = BucketIterator.splits(
    datasets=(train_dataset, val_dataset, test_dataset), batch_sizes=(32, 128, 128), 
    shuffle=True, device=DEVICE, sort=False
)

*Please note: the vocabulary is built on all datasets. This is not entirely fair, but we are not going to train any embeddings, so nothing dishonest is going on.*

### Baseline

To begin with, we will train a model with pre-trained word embeddings:

In [None]:
import gensim.downloader as api

w2v_model = api.load('glove-wiki-gigaword-100')

In [None]:
embeddings = np.zeros((len(tokens_field.vocab), w2v_model.vectors.shape[1]))

for i, token in enumerate(tokens_field.vocab.itos):
    if token.lower() in w2v_model.vocab:
        embeddings[i] = w2v_model.get_vector(token.lower())

**Task** Complete the simple tagger model.  
Note `batch_first=True` in `fields` (ELMo will need it).

In [None]:
class BaselineTagger(nn.Module):
    def __init__(self, embeddings, tags_count, emb_dim=100, rnn_dim=256, num_layers=1):
        super().__init__()
        
        <init layers>
        
    def forward(self, inputs):
        <apply layers>

**Task** Complete the model trainer.

In [None]:
class ModelTrainer():
    def __init__(self, model, criterion, optimizer):
        self._model = model
        self._criterion = criterion
        self._optimizer = optimizer
        
    def on_epoch_begin(self, is_train, name, batches_count):
        """
        Initializes metrics
        """
        self._epoch_loss = 0
        self._correct_count, self._total_count = 0, 0
        self._is_train = is_train
        self._name = name
        self._batches_count = batches_count
        
        self._model.train(is_train)
        
    def on_epoch_end(self):
        """
        Outputs final metrics
        """
        return '{:>5s} Loss = {:.5f}, Accuracy = {:.2%}'.format(
            self._name, self._epoch_loss / self._batches_count, self._correct_count / self._total_count
        )
        
    def on_batch(self, batch):
        """
        Performs forward and (if is_train) backward pass with optimization, updates metrics
        """        
        loss = <calc loss>
        correct_count, total_count = <and this stuff>
        
        self._correct_count += correct_count
        self._total_count += total_count
        self._epoch_loss += loss.item()
        
        if self._is_train:
            self._optimizer.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(self._model.parameters(), 1.)
            self._optimizer.step()

        return '{:>5s} Loss = {:.5f}, Accuracy = {:.2%}'.format(
            self._name, loss.item(), correct_count / total_count
        )

Let's use the already familiar function for evaluating the tagger:

In [None]:
from conlleval import evaluate

def eval_tagger(model, test_iter):
    true_seqs, pred_seqs = [], []

    model.eval()
    with torch.no_grad():
        for batch in test_iter:
            logits = model(batch.tokens)
            preds = logits.argmax(-1)

            seq_lengths = (batch.tags != 0).sum(-1)

            for i, seq_len in enumerate(seq_lengths):
                true_seqs.append(' '.join(tags_field.vocab.itos[ind] for ind in batch.tags[i, :seq_len]))
                pred_seqs.append(' '.join(tags_field.vocab.itos[ind] for ind in preds[i, :seq_len]))

    print('Precision = {:.2f}%, Recall = {:.2f}%, F1 = {:.2f}%'.format(*evaluate(true_seqs, pred_seqs, verbose=False)))

In [None]:
import math
from tqdm import tqdm
tqdm.get_lock().locks = []


def do_epoch(trainer, data_iter, is_train, name=None):
    trainer.on_epoch_begin(is_train, name, batches_count=len(data_iter))
    
    with torch.autograd.set_grad_enabled(is_train):
        with tqdm(total=len(data_iter)) as progress_bar:
            for i, batch in enumerate(data_iter):
                batch_progress = trainer.on_batch(batch)

                progress_bar.update()
                progress_bar.set_description(batch_progress)
                
            epoch_progress = trainer.on_epoch_end()
            progress_bar.set_description(epoch_progress)
            progress_bar.refresh()

            
def fit(trainer, train_iter, epochs_count=1, val_iter=None):
    best_val_loss = None
    for epoch in range(epochs_count):
        name_prefix = '[{} / {}] '.format(epoch + 1, epochs_count)
        do_epoch(trainer, train_iter, is_train=True, name=name_prefix + 'Train:')
        
        if val_iter is not None:
            do_epoch(trainer, val_iter, is_train=False, name=name_prefix + '  Val:')
            eval_tagger(trainer._model, val_iter)
            print(flush=True)

Let's start training the model:

In [None]:
model = BaselineTagger(embeddings, tags_count=len(tags_field.vocab)).to(DEVICE)
criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(model.parameters())

trainer = ModelTrainer(model, criterion, optimizer)

fit(trainer, train_iter, epochs_count=32, val_iter=val_iter)

### ELMo Model

Take the pre-trained model from the authors (how to work with it is described in [ELMo how to](https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md)).

In [None]:
from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json"
weight_file = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"

elmo = Elmo(options_file, weight_file, num_output_representations=1,
            dropout=0, vocab_to_cache=tokens_field.vocab.itos).to(DEVICE)

In general, you first need to convert the sentence into a character representation:

In [None]:
sentences = [['First', 'sentence', '.'], ['Another', '.']]
character_ids = batch_to_ids(sentences).to(DEVICE)

character_ids

And then feed it into the elmo model:

In [None]:
elmo(character_ids)

We are interested in `elmo_representations`.

This interface is not very convenient: you need to pass strings to `batch_to_ids`, which means writing a batch generator from scratch. In addition, transferring large batches to the GPU is not great, and with the character representation a batch of shape `(batch_size, seq_len)` becomes `max_word_len` times larger. Finally, computing word embeddings from characters is not free (compared with a lookup in an embedding table).

Therefore, when creating the model, we cached all embeddings: that is what the `vocab_to_cache=tokens_field.vocab.itos` parameter is for.

To use the cached values, we pass some garbage in `inputs` and the embedding indices in `word_inputs`.

In [None]:
batch = next(iter(train_iter))

elmo(inputs=batch.tokens.new_empty((batch.tokens.shape[0], batch.tokens.shape[1], 50)), 
     word_inputs=batch.tokens)

**Task** Update the model, replacing the embeddings with ELMo.

In [None]:
class ELMoTagger(nn.Module):
    def __init__(self, elmo, tags_count, emb_dim=1024, rnn_dim=256, num_layers=1):
        super().__init__()
        
        <init layers>
        
    def forward(self, inputs):
        <apply layers>

In [None]:
model = ELMoTagger(elmo, tags_count=len(tags_field.vocab)).to(DEVICE)
criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(model.parameters())

trainer = ModelTrainer(model, criterion, optimizer)

fit(trainer, train_iter, epochs_count=32, val_iter=val_iter)

### CRF

Strictly speaking, this is unrelated to pre-trained models, but it is useful to know: to do better at predicting sequence tags, instead of predicting each tag independently you can predict a tag conditioned on the previous tag.

It looks like a decoder in Seq2Seq models, only the previous tag is fed as input instead of the previous word.

This prediction can be implemented in different ways. One of them is a CRF (Conditional Random Field); a good description is [here](http://www.davidsbatista.net/blog/2017/11/13/Conditional_Random_Fields/).

Right now our model ends with a fully connected output layer. At each step, its output estimates the probability of each tag for the given word: the scores are simply normalized with softmax, $p[i] = \frac{e^{s[i]}}{\sum_{j=1}^{9} e^{s[j]}}$.

Instead of this local normalization, let's normalize globally, over the entire sequence. In addition, let's learn the transition scores from a tag at one step to a tag at the next, $T[y_t, y_{t+1}]$. For example, the model should learn that the probability of seeing `I-LOC` right after `O` is zero (`I-LOC` can only follow `B-LOC` or another `I-LOC`).

Then each sequence will be evaluated using the following formula:
$$\begin{align*}
C(y_1, \ldots, y_m) &= b[y_1] &+ \sum_{t=1}^{m} s_t [y_t] &+ \sum_{t=1}^{m-1} T[y_{t}, y_{t+1}] &+ e[y_m]\\
                    &= \text{begin} &+ \text{scores} &+ \text{transitions} &+ \text{end}
\end{align*}$$

For example, there may be two sequence variants:
<img src="https://guillaumegenthial.github.io/assets/crf1.png" width="50%">
<img src="https://guillaumegenthial.github.io/assets/crf2.png" width="50%">

*From [Sequence Tagging with Tensorflow](https://guillaumegenthial.github.io/sequence-tagging-with-tensorflow.html)*

The better of the two is the one with the higher total: the sum of the tag scores plus the sum of the transitions between tags.
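The scoring formula $C(y_1, \ldots, y_m)$ can be checked on a toy example. The sketch below uses made-up numbers for `begin`, `scores`, `transitions` and `end` (two tags, three steps) and finds the highest-scoring sequence by brute-force enumeration. Enumeration is only for illustration; a real CRF finds the same maximum with Viterbi decoding.

```python
import itertools

import numpy as np


def sequence_score(begin, scores, transitions, end, tags):
    """C(y) = b[y_1] + sum_t s_t[y_t] + sum_t T[y_t, y_{t+1}] + e[y_m]."""
    total = begin[tags[0]] + end[tags[-1]]
    total += sum(scores[t, tag] for t, tag in enumerate(tags))
    total += sum(transitions[prev, nxt] for prev, nxt in zip(tags[:-1], tags[1:]))
    return total


# Made-up parameters: 2 tags, 3 time steps
begin = np.array([1.0, 0.0])
end = np.array([0.0, 0.5])
scores = np.array([[2.0, 1.0],
                   [0.5, 1.5],
                   [1.0, 3.0]])
transitions = np.array([[0.0, 1.0],
                        [-2.0, 0.5]])

all_sequences = list(itertools.product(range(2), repeat=3))
best_sequence = max(
    all_sequences,
    key=lambda tags: sequence_score(begin, scores, transitions, end, tags),
)
best_score = sequence_score(begin, scores, transitions, end, best_sequence)

print(best_sequence, best_score)  # (0, 1, 1) 9.5
```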

In practice, you just need to add the `ConditionalRandomField` module: it learns the transitions $T$, computes the global loss for the entire sequence, and also performs decoding.

The loss is computed like this:

```python
crf = ConditionalRandomField(tags_count)
loss = -crf(output, tags, mask)
```
where `output` is the output of the fully connected layer that was previously the last.

Decoding is done like this:
```python
decoded_sequences = crf.viterbi_tags(output, mask)
```

**Task** Update the tagger, the training functions, and the tagger quality evaluation.

In [None]:
from allennlp.modules import ConditionalRandomField

class CRFTagger(nn.Module):
    def __init__(self, embeddings, tags_count, emb_dim=100, rnn_dim=256, num_layers=1):
        super().__init__()
        <init layers (embeddings can be from either glove or elmo)>
        
    def forward(self, inputs, mask, tags=None):
        <apply layers like in previous models>
        
        if tags is not None:
            return <crf loss>
        return <viterbi decoding>

## BERT

Finally, the third kind of model (something in between, with a bunch of extras of its own) is BERT (yes, these people know how to name their models).

Like ELMo, it is a language model. The differences:
1. This is a Transformer, not BiLSTM.
<img src="https://i.ibb.co/PQ4qXtr/2018-12-22-21-13-14.png" width="50%">

To train a language model with a transformer, the authors randomly dropped some words from the sentence and trained the network to predict them. Thus the whole context is available when predicting a token, not just the left or just the right part as in ELMo.

2. An additional task in the style of Skip-Thoughts was used: predicting the next sentence:
<img src="https://i.ibb.co/WWdwmPD/2018-12-22-21-12-59.png" width="50%">


A pair of sentences was written one after another (note the SEP token) and the model learned to predict whether they actually follow each other (note the CLS token).

As a result, the CLS embedding is trained to contain information about the whole context, so it can easily be used for classification, as with the first model. At the same time, contextual embeddings of all words in the sentence are learned too, so the model can be used for tagging just like ELMo.
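The word-dropping objective from the first point can be sketched in plain Python. This is a simplified illustration, not the exact BERT recipe: each token is independently replaced by `[MASK]` with probability `mask_prob`. The default of 0.15 follows the BERT paper but is not stated in this notebook, and the paper additionally keeps or randomly replaces some of the selected tokens. The function name `mask_tokens` is made up.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Replace random tokens with mask_token; return the masked sequence and the targets."""
    rng = random.Random(seed)
    masked, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets.append(token)   # the model has to predict the original token here
        else:
            masked.append(token)
            targets.append(None)    # no prediction loss at this position
    return masked, targets


toy_tokens = "the man drops the lighter on the floor".split()
masked_tokens, targets = mask_tokens(toy_tokens, mask_prob=0.15, seed=0)

print(masked_tokens)
print(targets)
```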

BERT is notable for beating the results of all previously existing models, as well as human performance on some tasks ([for example, SQuAD 1](https://rajpurkar.github.io/SQuAD-explorer/)).

We will use it for the [Swag](https://rowanzellers.com/swag/) task: choosing the correct continuation of a text. As the name implies, the dataset was constructed to be difficult for a soulless machine (but a few months later BERT came out and it was all for nothing :().

In [None]:
import pandas as pd

train_data = pd.read_csv('swagaf/data/train.csv')
val_data = pd.read_csv('swagaf/data/val.csv')

In [None]:
train_data.sample(10)[['startphrase', 'sent1', 'sent2', 'ending0', 'ending1', 'ending2', 'ending3', 'label']]

Note the tokenizer: the model works with subwords, as in BPE.

In [None]:
from pytorch_pretrained_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.vocab)

tokens = tokenizer.tokenize('Why do I do this stuff...')
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokens))

Let's build the dataset.

In [None]:
def collect_samples(data, tokenizer):
    choices_list, labels_list, segment_ids_list = [], [], []
    
    for _, row in tqdm(data.iterrows(), total=len(data)):
        first_phrase = tokenizer.tokenize(row.sent1)
        second_phrase = tokenizer.tokenize(row.sent2)
        choices, segment_ids = [], []
        for i, ending in enumerate(row[['ending0', 'ending1', 'ending2', 'ending3']]):
            ending = tokenizer.tokenize(ending)
            tokens = ["[CLS]"] + first_phrase + ["[SEP]"] + second_phrase + ending + ["[SEP]"]
            tokens = tokenizer.convert_tokens_to_ids(tokens)
            choices.append(tokens)
            
            segment_ids.append([0] * (len(first_phrase) + 2) + [1] * (len(tokens) - len(first_phrase) - 2))
            
        choices_list.append(choices)
        segment_ids_list.append(segment_ids)
        labels_list.append(row.label)

    return np.array(choices_list), np.array(labels_list), np.array(segment_ids_list)


train_data, train_labels, train_segment_ids = collect_samples(train_data, tokenizer)
val_data, val_labels, val_segment_ids = collect_samples(val_data, tokenizer)

We will form the samples the same way as when the model was pre-trained: `[CLS]<first text>[SEP]<next possible text>[SEP]`.

So that the model knows where the first segment ends and the second begins, `train_segment_ids` is passed as well: 0 corresponds to a token from the first segment and 1 to a token from the second.
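A tiny worked example of this layout, using hand-written toy tokens instead of the real tokenizer output (the lists `first` and `second` are made up): `[CLS]`, the first text and the first `[SEP]` get segment id 0, and everything after that gets 1.

```python
first = ["a", "man", "sits"]   # toy tokens of the first text
second = ["he", "smiles"]      # toy tokens of the candidate continuation

tokens = ["[CLS]"] + first + ["[SEP]"] + second + ["[SEP]"]

# [CLS] + first text + first [SEP] belong to segment 0, the rest to segment 1
segment_ids = [0] * (len(first) + 2) + [1] * (len(second) + 1)

for token, segment_id in zip(tokens, segment_ids):
    print(segment_id, token)
```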

In [None]:
train_data[:2], train_labels[:2], train_segment_ids[:2]

We will have to write our own batch iterator after all...

In [None]:
import math


def to_matrix(choices_list):
    batch_size = len(choices_list)
    num_options = len(choices_list[0])
    seq_len = max(len(choice) for choices in choices_list for choice in choices)
    
    matrix = np.zeros((batch_size, num_options, seq_len))
    for i, choices in enumerate(choices_list):
        for j, choice in enumerate(choices):
            matrix[i, j, :len(choice)] = choice

    return matrix
    

class BatchIterator():
    def __init__(self, data, labels, segment_ids, batch_size, shuffle=True):
        self._data = data
        self._labels = labels
        self._segment_ids = segment_ids
        self._num_samples = len(data)
        self._batch_size = batch_size
        self._shuffle = shuffle
        self._batches_count = int(math.ceil(len(data) / batch_size))
        
    def __len__(self):
        return self._batches_count
    
    def __iter__(self):
        return self._iterate_batches()

    def _iterate_batches(self):
        indices = np.arange(self._num_samples)
        if self._shuffle:
            np.random.shuffle(indices)

        for start in range(0, self._num_samples, self._batch_size):
            end = min(start + self._batch_size, self._num_samples)

            batch_indices = indices[start: end]
            
            choices = to_matrix(self._data[batch_indices])
            mask = (choices != 0).astype(np.int64)  # the np.int alias was removed in newer NumPy
            yield {
                'choices': choices,
                'segment_ids': to_matrix(self._segment_ids[batch_indices]),
                'mask': mask,
                'label': self._labels[batch_indices]
            }

Training on Colab is only possible with a very small batch:

In [None]:
train_iter = BatchIterator(train_data, train_labels, train_segment_ids, 8)
val_iter = BatchIterator(val_data, val_labels, val_segment_ids, 8)

Load the pre-trained BERT:

In [None]:
from pytorch_pretrained_bert import BertModel

bert = BertModel.from_pretrained('bert-base-uncased').to(DEVICE)

We build the following model:
<img src = "https://i.ibb.co/JFcwW6D/2018-12-22-21-23-23-20.png" width = "50%">

BERT builds four representations $C_0, \ldots, C_3$ for the four candidate continuations of the sentence. We will learn a parameter vector $V$ whose dot product with $C_i$ should be largest for the correct continuation.

For this we simply use the cross-entropy loss over a softmax:
$P_i = \frac{e^{V \cdot C_i}}{\sum_j e^{V \cdot C_j}}$.

That is, the trick is that the only task-specific thing we learn is the vector $V$!
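The softmax over the four dot products can be traced on toy numbers. In the sketch below `C` stands in for the four BERT representations (here only 2-dimensional instead of 768), and `V`, `label` and the function name `choice_probabilities` are made-up values for illustration.

```python
import numpy as np


def choice_probabilities(C, V):
    """P_i = exp(V . C_i) / sum_j exp(V . C_j), computed in a numerically stable way."""
    logits = C @ V
    logits = logits - logits.max()   # shifting the logits does not change the softmax
    exp = np.exp(logits)
    return exp / exp.sum()


C = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [-1.0, 0.0]])   # one row per candidate continuation
V = np.array([2.0, 1.0])

P = choice_probabilities(C, V)
label = 2                     # index of the correct continuation (made up)
loss = -np.log(P[label])      # cross-entropy for this single example

print(P, P.argmax(), loss)
```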

In [None]:
class MultipleChoiceModel(nn.Module):
    def __init__(self, bert, num_choices):
        super().__init__()
        
        self._bert = bert
        self._num_choices = num_choices
        self._dropout = nn.Dropout(0.1)
        self._classifier = nn.Linear(768, 1)
        
    def forward(self, choices, segment_ids, mask):
        """
        choices: LongTensor of shape [batch_size, num_choices, seq_len] with token ids
        segment_ids: LongTensor of shape [batch_size, num_choices, seq_len] with token types (0 for first segment, 1 for second)
        mask: LongTensor of shape [batch_size, num_choices, seq_len] with mask for padding tokens
        returns logits - FloatTensor of shape [batch_size, num_choices]
        """
        choices = choices.view(-1, choices.size(-1))
        segment_ids = segment_ids.view(-1, segment_ids.size(-1))
        mask = mask.view(-1, mask.size(-1))
        
        _, pooled_output = self._bert(choices, segment_ids, mask, output_all_encoded_layers=False)
        
        pooled_output = self._dropout(pooled_output)
        logits = self._classifier(pooled_output)
        return logits.view(-1, self._num_choices)

**Task** Finish the trainer for the model.

In [None]:
class ModelTrainer():
    def __init__(self, model, criterion, optimizer):
        self._model = model
        self._criterion = criterion
        self._optimizer = optimizer
        
    def on_epoch_begin(self, is_train, name, batches_count):
        """
        Initializes metrics
        """
        self._epoch_loss = 0
        self._correct_count, self._total_count = 0, 0
        self._is_train = is_train
        self._name = name
        self._batches_count = batches_count
        
        self._model.train(is_train)
        
    def on_epoch_end(self):
        """
        Outputs final metrics
        """
        return '{:>5s} Loss = {:.5f}, Accuracy = {:.2%}'.format(
            self._name, self._epoch_loss / self._batches_count, self._correct_count / self._total_count
        )
        
    def on_batch(self, batch):
        """
        Performs forward and (if is_train) backward pass with optimization, updates metrics
        """
        
        loss = <calc loss>
        correct_count, total_count = <and this stuff>
        
        self._correct_count += correct_count
        self._total_count += total_count
        self._epoch_loss += loss.item()
        
        if self._is_train:
            self._optimizer.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(self._model.parameters(), 1.)
            self._optimizer.step()

        return '{:>5s} Loss = {:.5f}, Accuracy = {:.2%}'.format(
            self._name, loss.item(), correct_count / total_count
        )

We also need a bit of magic to initialize the optimizer:

In [None]:
from pytorch_pretrained_bert.optimization import BertAdam

model = MultipleChoiceModel(bert, 4).to(DEVICE)
criterion = nn.CrossEntropyLoss()

params = [(name, param) for name, param in model.named_parameters() if 'pooler' not in name]

no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [param for name, param in params if not any(nd in name for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [param for name, param in params if any(nd in name for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = BertAdam(optimizer_grouped_parameters, lr=5e-5, warmup=0.1, t_total=len(train_iter) * 2)

trainer = ModelTrainer(model, criterion, optimizer)

fit(trainer, train_iter, 2, val_iter)

**Task** A larger batch does not fit into memory, yet we would like to train with a big batch (Google used 16).

There is a simple trick for this (simple in PyTorch, at least, not in TensorFlow): gradient accumulation. Instead of running the optimizer at every training step, you run it once every few steps:

[Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups] -gpu-distributed-setups-ec88c3e51255)

Implement this approach.
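A minimal sketch of the idea on a toy linear model (not the BERT trainer; `accumulation_steps`, `micro_batches` and the model are made up for illustration). Each micro-batch loss is divided by the number of accumulated steps so that the summed gradient matches the average over the large batch, and the optimizer runs only once per `accumulation_steps` micro-batches.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

accumulation_steps = 4  # effective batch = 4 micro-batches of size 2
micro_batches = [(torch.randn(2, 4), torch.randint(0, 2, (2,))) for _ in range(8)]

optimizer_steps = 0
optimizer.zero_grad()
for step, (inputs, labels) in enumerate(micro_batches, start=1):
    # Scale the loss so the accumulated gradient is an average, not a sum
    loss = criterion(model(inputs), labels) / accumulation_steps
    loss.backward()  # gradients add up in .grad between optimizer steps

    if step % accumulation_steps == 0:
        nn.utils.clip_grad_norm_(model.parameters(), 1.)
        optimizer.step()
        optimizer.zero_grad()
        optimizer_steps += 1

print(optimizer_steps)  # 2
```

In the notebook's `ModelTrainer`, the same pattern would presumably go into `on_batch`, with a step counter stored on the trainer.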

# References
Universal Language Model Fine-tuning for Text Classification [[pdf]](https://arxiv.org/pdf/1801.06146)  
Deep contextualized word representations [[pdf]](https://arxiv.org/pdf/1802.05365)  
Improving Language Understanding by Generative Pre-Training [[pdf]](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)  
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [[pdf]](https://arxiv.org/pdf/1810.04805.pdf)

Dissecting Contextual Word Embeddings: Architecture and Representation [[pdf]](http://aclweb.org/anthology/D18-1179)

[NLP's ImageNet moment has arrived](http://ruder.io/nlp-imagenet/)  
[The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)](https://jalammar.github.io/illustrated-bert/)  
[Improving a Sentiment Analyzer using ELMo — Word Embeddings on Steroids](http://www.realworldnlpbook.com/blog/improving-sentiment-analyzer-using-elmo.html)

[Sequence Tagging with Tensorflow](https://guillaumegenthial.github.io/sequence-tagging-with-tensorflow.html)  
[Conditional Random Field Tutorial in PyTorch](https://towardsdatascience.com/conditional-random-field-tutorial-in-pytorch-ca0d04499463)  
[Conditional Random Fields for Sequence Prediction](http://www.davidsbatista.net/blog/2017/11/13/Conditional_Random_Fields/)