In [None]:
!pip3 -qq install torch==0.4.1
!pip -qq install torchtext==0.3.1
!wget -qq --no-check-certificate 'https://drive.google.com/uc?export=download&id=1Pq4aklVdj-sOnQw68e1ZZ_ImMiC8IR1V' -O tweets.csv.zip
!wget -qq --no-check-certificate "https://drive.google.com/uc?export=download&id=1ji7dhr9FojPeV51dDlKRERIqr3vdZfhu" -O surnames.txt
!unzip tweets.csv.zip

In [None]:
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


if torch.cuda.is_available():
    from torch.cuda import FloatTensor, LongTensor
    DEVICE = torch.device('cuda')
else:
    from torch import FloatTensor, LongTensor
    DEVICE = torch.device('cpu')

np.random.seed(42)

# Language model


A *language model* is something that can estimate the probability of encountering a sequence of words $ w_1, \ldots, w_n $:

$$ \mathbf {P} (w_1, \ldots, w_n) = \prod_k \mathbf {P} (w_k | w_ {k-1}, \ldots, w_ {1}). $$

The conditional probabilities are the interpretable and interesting part here: which word does the language model expect to follow the given ones? Each of us has such a language model, by the way. For example, in this context

<center>
<img src="https://hsto.org/web/956/239/601/95623960157b4e15a1b3f599aed62ed2.png" width="20%">
</center>

my language model says that *my* is unlikely to follow *honest*, while *and* or, of course, *rules* are quite likely.
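The factorization above can be sketched on a toy example. The probability table below is made up, and it conditions only on the previous word (a bigram simplification) to keep the sketch short; a real language model may condition on the whole prefix.

```python
import math

# Hypothetical conditional probabilities P(next | previous) for a toy model.
COND_PROBS = {
    ('<s>', 'the'): 0.5,
    ('the', 'cat'): 0.2,
    ('cat', 'sleeps'): 0.1,
}


def sequence_log_prob(words):
    """Sum log P(w_k | w_{k-1}) over the sequence, starting from '<s>'."""
    log_prob = 0.0
    prev = '<s>'
    for word in words:
        log_prob += math.log(COND_PROBS[(prev, word)])
        prev = word
    return log_prob


# The sequence probability is the product of the conditional probabilities.
prob = math.exp(sequence_log_prob(['the', 'cat', 'sleeps']))
print(prob)
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow on long sequences.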

Our task is to learn to generate political tweets in the image and likeness of `Russian Troll Tweets`. The dataset is taken from here: https://www.kaggle.com/vikasg/russian-troll-tweets

In [None]:
import pandas as pd

data = pd.read_csv('tweets.csv')

data.text.sample(15).tolist()

Yes, the results will be weird, I warn you right away.

## Reading data

Is anyone already tired of writing all this batch building, dictionaries, and so on? Personally, I am!

PyTorch has a special class for generating batches, `Dataset`. Instead of writing a function like `iterate_batches`, you can inherit from it and override the methods `__len__` and `__getitem__`... and implement in them almost everything that was in `iterate_batches`. Not impressive yet, is it?

There is also a `DataLoader` that works with a dataset. It can shuffle batches and generate them in separate processes, which is especially important when building a batch is a slow operation, for example with images. You can read about all this here: [Data Loading and Processing Tutorial](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html).
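As a rough illustration of the two methods, here is a minimal sketch of an indexable dataset and a naive batch loop over it. The names `CharDataset` and `iterate_batches` are hypothetical. In PyTorch the class would normally inherit from `torch.utils.data.Dataset`, and the loop would be replaced by a `DataLoader`; both are left out so that the sketch runs without PyTorch.

```python
class CharDataset:
    """Map-style dataset sketch: indexable and with a known length."""

    def __init__(self, lines):
        self._lines = lines

    def __len__(self):
        return len(self._lines)

    def __getitem__(self, index):
        # Per-sample processing goes here; we just split into characters.
        return list(self._lines[index])


def iterate_batches(dataset, batch_size):
    """Roughly what a loader does without shuffling or worker processes."""
    for start in range(0, len(dataset), batch_size):
        end = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, end)]


toy_dataset = CharDataset(['maga', 'vote', 'news'])
batches = list(iterate_batches(toy_dataset, batch_size=2))
```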

But so far this is still not very cool, it seems to me. Something else is more interesting: PyTorch has a separate library in its repository, [torchtext](https://github.com/pytorch/text). It provides special implementations of `Dataset` for working with text, plus all sorts of tools that make life a little easier.

The library, in my opinion, lacks tutorials that show how to work with it, but you can read the source code, which is nice.

The plan is to build a `torchtext.data.Dataset`, create an iterator for it, and train the model.

This dataset is initialized with two parameters:
```
            examples: List of Examples.
            fields (List(tuple(str, Field))): The Fields to use in this tuple. The
                string is a field name, and the Field is the associated field.
```
Let's deal with the second one first.

`Field` is a kind of meta-information for the dataset plus a sample handler.
It has a bunch of options that are easier to look at [here](https://github.com/pytorch/text/blob/master/torchtext/data/field.py). In short, it can preprocess sentences (for example, tokenize them), build a vocabulary (a mapping from word to index), and build batches: add padding and convert to tensors. What more could you want in life?

We will build a character-level language model, so for us tokenization means turning a string into a list of characters. We also ask the field to add the special tokens `<s>` and `</s>` to the beginning and the end.

In [None]:
from torchtext.data import Field

text_field = Field(init_token='<s>', eos_token='</s>', lower=True, tokenize=lambda line: list(line))


Preprocessing will look like this:

In [None]:
text_field.preprocess(data.text.iloc[0])
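To make the division of labour clearer, here is a plain-Python sketch of what the `Field` configured above does. As far as I can tell from the torchtext source, `preprocess` only tokenizes and lowercases, while `<s>`, `</s>` and `<pad>` are added later, when a batch is assembled. The function names and the `'<pad>'` default below are assumptions of this sketch, not torchtext's API.

```python
def preprocess(line):
    """Lowercase and split into characters, like the Field configured above."""
    return list(line.lower())


def pad_batch(lines, init_token='<s>', eos_token='</s>', pad_token='<pad>'):
    """Wrap each line in start/end tokens and right-pad to the longest one."""
    max_len = max(len(line) for line in lines)
    padded = []
    for line in lines:
        padding = [pad_token] * (max_len - len(line))
        padded.append([init_token] + line + [eos_token] + padding)
    return padded


toy_batch = pad_batch([preprocess('Hi!'), preprocess('OK')])
print(toy_batch)
```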


Convert everything and look at the length distribution:

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

data['text'] = data['text'].fillna('')
lines = data.apply(lambda row: text_field.preprocess(row['text']), axis=1).tolist()

lengths = [len(line) for line in lines]

plt.hist(lengths, bins=30)[-1]

Drop the lines that are too short and convert the remaining ones to `Example`:

In [None]:
from torchtext.data import Example

lines = [line for line in lines if len(line) >= 50]

fields = [('text', text_field)]
examples = [Example.fromlist([line], fields) for line in lines]

From an `Example` you can get back all the fields we put into it. For example, we have just created a single `text` field:

In [None]:
examples[0].text

Let's finally build the dataset:

In [None]:
from torchtext.data import Dataset

dataset = Dataset(examples, fields)

The dataset can be split into parts:

In [None]:
train_dataset, test_dataset = dataset.split(split_ratio=0.75)

A vocabulary can be built from it:

In [None]:
text_field.build_vocab(train_dataset, min_freq=30)

print('Vocab size =', len(text_field.vocab))
print(text_field.vocab.itos)
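For intuition, here is a simplified stand-in for what building a vocabulary with `min_freq` amounts to. The function `build_vocab` below is a hypothetical sketch, and the order of the special tokens is an assumption chosen to match the `unk_idx=0, pad_idx=1` defaults used later in this notebook.

```python
from collections import Counter


def build_vocab(lines, min_freq=1, specials=('<unk>', '<pad>', '<s>', '</s>')):
    """Return (itos, stoi): specials first, then tokens seen at least min_freq times."""
    counts = Counter(token for line in lines for token in line)
    # Most frequent first; ties are broken alphabetically to keep the order deterministic.
    frequent = sorted(
        (token for token, count in counts.items() if count >= min_freq),
        key=lambda token: (-counts[token], token),
    )
    itos = list(specials) + frequent
    stoi = {token: index for index, token in enumerate(itos)}
    return itos, stoi


itos, stoi = build_vocab([list('aab'), list('abc')], min_freq=2)
# 'c' occurs only once, so it is not in the vocabulary and maps to '<unk>'.
unk_index = stoi['<unk>']
encoded = [stoi.get(token, unk_index) for token in 'abc']
```

Rare characters (emoji, odd Unicode) fall below `min_freq` and all collapse into `<unk>`, which keeps the output layer small.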

Finally, you can iterate over it:

In [None]:
from torchtext.data import BucketIterator

train_iter, test_iter = BucketIterator.splits(datasets=(train_dataset, test_dataset), batch_sizes=(32, 128), 
                                              shuffle=True, device=DEVICE, sort=False)

In [None]:
batch = next(iter(train_iter))

batch

In [None]:
batch.text

## Perplexity

As always, we should start with two questions: which metric do we optimize, and what is the baseline?

With the metric, everything is simple: we want the model to approximate the distribution of words in the language as well as possible. We don't have the whole language, so we make do with a test sample.

We can compute the cross-entropy loss on it:

$$H(w_1, \ldots, w_n) = - \frac 1n \sum_k \log\mathbf{P}(w_k | w_{k-1}, \ldots, w_1).$$

Here the probability $ \mathbf {P} $ is the probability estimated by our language model. An ideal model would assign probability 1 to the words in the text, and the loss would be zero. This is of course impossible: even you cannot predict the next word, let alone a soulless machine.

Thus, as always, we optimize cross-entropy and strive to make it as low as possible.

Well, almost everything. There is also a separate metric for language models, *perplexity*. It is simply the exponentiated cross-entropy loss:

$$PP(w_1, \ldots, w_n) = e^{H(w_1, \ldots, w_n)} = e^{- \frac 1n \sum_k \log\mathbf{P}(w_k | w_{k-1}, \ldots, w_1)} = \left(\mathbf{P}(w_1, \ldots, w_n) \right)^{-\frac 1n}.$$

Measuring it has a deeper meaning beyond mere interpretability. Imagine a model that predicts every word in the vocabulary with equal probability, regardless of context. For it, $ \mathbf {P} (w) = \frac 1 N $, where $ N $ is the size of the vocabulary, and its perplexity equals the vocabulary size $ N $. Of course, this is a completely stupid model, but with it in mind, the perplexity of a real model can be interpreted as the level of ambiguity in word generation.

For example, for a model with perplexity 100, choosing the next word is as ambiguous as sampling from a uniform distribution over 100 words. And if that perplexity was achieved with a vocabulary of 100,000 words, we have reduced the ambiguity by three orders of magnitude compared with blind randomness.
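The uniform-model argument is easy to check numerically. The sketch below computes perplexity from the probabilities a model assigned to the tokens that actually occurred; the probability lists are made up for illustration.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the observed tokens."""
    cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(cross_entropy)


vocab_size = 100
# A uniform model gives every observed token probability 1 / N.
uniform_ppx = perplexity([1 / vocab_size] * 25)
# A hypothetical, more confident model.
confident_ppx = perplexity([0.5, 0.25, 0.5, 0.25])
print(uniform_ppx, confident_ppx)
```

The first value matches the vocabulary size (up to floating-point error), and the second is much lower because the model concentrates probability on the right tokens.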

## Baseline

The baseline is also quite simple here. In fact, we have already looked at it in the concepts course: [N-gram language model](https://colab.research.google.com/drive/1lz9vO6Ue5zOiowEx0-koXNiejBrrnbj0). We can estimate the probabilities of word N-grams from their frequencies in the training set, and then use the approximation $ \mathbf {P} (w_k | w_1, \ldots, w_{k-1}) \approx \mathbf {P} (w_k | w_{k-1}, \ldots, w_{k-N+1}) $.
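A count-based N-gram model of this kind might look roughly like the sketch below. It works on characters, uses maximum-likelihood estimates without smoothing, and the function names are hypothetical; it is not the implementation from the linked notebook.

```python
from collections import Counter, defaultdict


def train_ngram(lines, n=3):
    """Count how often each token follows each (n-1)-token context."""
    counts = defaultdict(Counter)
    for line in lines:
        tokens = ['<s>'] * (n - 1) + list(line) + ['</s>']
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            counts[context][tokens[i]] += 1
    return counts


def next_token_prob(counts, context, token):
    """Maximum-likelihood estimate; unseen contexts get probability 0 (no smoothing)."""
    total = sum(counts[context].values())
    return counts[context][token] / total if total else 0.0


ngram_counts = train_ngram(['abab', 'abba'], n=3)
p_a_after_ab = next_token_prob(ngram_counts, ('a', 'b'), 'a')
print(p_a_after_ab)
```

In the toy corpus the context `('a', 'b')` is followed once each by `a`, `b` and `</s>`, so each continuation gets one third of the probability mass.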

Let's instead use neural networks to implement the same thing.

<center>
<img src="https://image.ibb.co/buMnLf/2018-10-22-00-22-56.png" width="20%">
</center>

*From cs224n, Lecture 8 [pdf](http://web.stanford.edu/class/cs224n/lectures/lecture8.pdf)*

A sequence of words comes in, the words are embedded, and then the output layer computes the most likely next word.

Wait... But we have already implemented this! In the Word2vec CBoW model, we predicted the central word from its context; the only difference is that now we have only the left context. So that's it, on to the next model?

No! There is still some fun to be had here. In Word2vec, we formed batches like this:

<center>
<img src="https://image.ibb.co/bs3wgV/training-data.png" width="20%">
</center>
    
*From [Word2Vec Tutorial - The Skip-Gram Model, Chris McCormick](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)*

That is, a set of <context, word> pairs was cut from the text (and used in one way or another depending on the method).

This is wasteful: every word is repeated many times. But we can use convolutional networks instead; they will apply the multiplication by $ W $ to every window for us. As a result, the input batch will be much smaller.

To process everything correctly, left-pad the sequence with `window_size - 1` padding tokens; then the first word is predicted from `<pad> ... <pad> <s>`.
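To see which windows the convolution effectively slides over, here is a small plain-Python sketch. The helper `left_context_windows` is hypothetical; in the model itself the same effect comes from padding the input tensor and applying a 1D convolution with kernel size `window_size`.

```python
def left_context_windows(tokens, window_size, pad_token='<pad>'):
    """Windows a 1D convolution sees after left-padding with window_size - 1 pads."""
    padded = [pad_token] * (window_size - 1) + tokens
    return [padded[i:i + window_size] for i in range(len(tokens))]


tokens = ['<s>', 'm', 'a', 'g', 'a', '</s>']
windows = left_context_windows(tokens, window_size=3)
# Window k ends at tokens[k] and is used to predict tokens[k + 1].
pairs = list(zip(windows[:-1], tokens[1:]))
for window, target in pairs:
    print(window, '->', target)
```

The number of windows equals the sequence length, so the output lines up with the input position by position, and no window ever looks at tokens to its right.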

**Task** Implement a language model with a fixed window.

In [None]:
class ConvLM(nn.Module):
    def __init__(self, vocab_size, window_size=5, emb_dim=16, filters_count=128):
        super().__init__()
        
        self._window_size = window_size
        
        <init layers>
        
    def forward(self, inputs):
        <apply>
        
        return output, None  # hacky way to use training cycle for RNN and Conv simultaneously

Check that it works:

In [None]:
model = ConvLM(vocab_size=len(train_iter.dataset.fields['text'].vocab)).to(DEVICE)

model(batch.text)

**Task** Implement a function to sample a sequence from a language model.

In [None]:
def sample(probs, temp):
    probs = F.log_softmax(probs.squeeze(), dim=0)
    probs = (probs / temp).exp()
    probs /= probs.sum()
    probs = probs.cpu().numpy()

    return np.random.choice(np.arange(len(probs)), p=probs)


def generate(model, temp=0.7):
    model.eval()
    
    history = [train_dataset.fields['text'].vocab.stoi['<s>']]
    
    with torch.no_grad():
        for _ in range(150):
            <sample next character and print it (use end='' in print function)>

generate(model)
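A note on the `temp` argument: dividing log-probabilities by the temperature and renormalizing, as `sample` does, is equivalent to taking a softmax of the logits divided by the temperature. The plain-Python sketch below (with made-up logits and a hypothetical helper name) shows how the temperature reshapes the distribution.

```python
import math


def softmax_with_temperature(logits, temp):
    """Softmax of logits / temp; the maximum is subtracted for numerical stability."""
    scaled = [value / temp for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three characters
cold = softmax_with_temperature(logits, temp=0.5)
neutral = softmax_with_temperature(logits, temp=1.0)
hot = softmax_with_temperature(logits, temp=2.0)
print(cold, neutral, hot)
```

Temperatures below 1 sharpen the distribution towards the most likely character (safer but more repetitive text), while temperatures above 1 flatten it (more diverse but noisier text).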

**Task** We have not defined a target yet. We need to predict the next tokens, so the target is just the input tensor shifted by 1. Implement target building and loss calculation.

In [None]:
import math
from tqdm import tqdm


def do_epoch(model, criterion, data_iter, unk_idx, pad_idx, optimizer=None, name=None):
    epoch_loss = 0
    
    is_train = optimizer is not None
    name = name or ''
    model.train(is_train)
    
    batches_count = len(data_iter)
    
    with torch.autograd.set_grad_enabled(is_train):
        with tqdm(total=batches_count) as progress_bar:
            for i, batch in enumerate(data_iter):                
                logits, _ = model(batch.text)

                <implement loss calc>
                
                epoch_loss += loss.item()

                if optimizer:
                    optimizer.zero_grad()
                    loss.backward()
                    nn.utils.clip_grad_norm_(model.parameters(), 1.)
                    optimizer.step()

                progress_bar.update()
                progress_bar.set_description('{:>5s} Loss = {:.5f}, PPX = {:.2f}'.format(name, loss.item(), 
                                                                                         math.exp(loss.item())))
                
            progress_bar.set_description('{:>5s} Loss = {:.5f}, PPX = {:.2f}'.format(
                name, epoch_loss / batches_count, math.exp(epoch_loss / batches_count))
            )

    return epoch_loss / batches_count


def fit(model, criterion, optimizer, train_iter, epochs_count=1, unk_idx=0, pad_idx=1, val_iter=None):
    for epoch in range(epochs_count):
        name_prefix = '[{} / {}] '.format(epoch + 1, epochs_count)
        train_loss = do_epoch(model, criterion, train_iter, unk_idx, pad_idx, optimizer, name_prefix + 'Train:')
        
        if val_iter is not None:
            val_loss = do_epoch(model, criterion, val_iter, unk_idx, pad_idx, None, name_prefix + '  Val:')

        generate(model)

In [None]:
model = ConvLM(vocab_size=len(train_iter.dataset.fields['text'].vocab)).to(DEVICE)

pad_idx = train_iter.dataset.fields['text'].vocab.stoi['<pad>']
unk_idx = train_iter.dataset.fields['text'].vocab.stoi['<unk>']
criterion = nn.CrossEntropyLoss(reduction='none').to(DEVICE)

optimizer = optim.Adam(model.parameters())

fit(model, criterion, optimizer, train_iter, epochs_count=30, unk_idx=unk_idx, pad_idx=pad_idx, val_iter=test_iter)

**Task** To stop the model from sampling `<unk>`, you can explicitly forbid it in the sampling function, or you can simply not train the model on such tokens. Implement masking for both padding and unknown tokens.
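The idea behind the masking can be sketched on plain lists before moving to tensors. The per-position losses below are made up; in the notebook they would come from `criterion` with `reduction='none'`, applied to the logits for all positions except the last one. The helper name `masked_mean_loss` is hypothetical, and this is only one possible way to do it.

```python
def masked_mean_loss(token_losses, targets, ignore_indices):
    """Average per-token losses over positions whose target is not ignored."""
    mask = [0.0 if target in ignore_indices else 1.0 for target in targets]
    kept = sum(mask)
    total = sum(loss * m for loss, m in zip(token_losses, mask))
    return total / kept if kept else 0.0


unk_idx, pad_idx = 0, 1            # same defaults as in fit(...)
inputs = [2, 5, 0, 6, 3, 1, 1]     # <s> a <unk> b </s> <pad> <pad>
targets = inputs[1:]               # the input shifted by one position
token_losses = [0.5, 9.0, 1.5, 1.0, 7.0, 7.0]  # hypothetical per-position losses
loss = masked_mean_loss(token_losses, targets, {unk_idx, pad_idx})
print(loss)
```

Dividing by the number of kept positions rather than by the full length keeps the loss (and therefore the perplexity) comparable between batches with different amounts of padding.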

## Recurrent language model

Obviously, we would like to use not a fixed history window but all the information about what has already been generated. At the very least, we want to know when we are about to hit the tweet's character limit.
Recurrent language models are used for this:

<center>
<img src="https://hsto.org/web/dc1/7c2/c4e/dc17c2c4e9ac434eb5346ada2c412c9a.png" width="20%">
</center>
    
The network receives the previous token together with the previous RNN state. The state encodes the entire history (or at least it should), while the previous token is needed to know which token was actually sampled from the distribution predicted at the last step.

**Task** We have done this several times already: once again, implement a network that does language modeling.

In [None]:
class RnnLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=16, lstm_hidden_dim=128, num_layers=1):
        super().__init__()

        self._emb = nn.Embedding(vocab_size, emb_dim)
        self._rnn = nn.LSTM(input_size=emb_dim, hidden_size=lstm_hidden_dim, num_layers=num_layers)
        self._out_layer = nn.Linear(lstm_hidden_dim, vocab_size)

    def forward(self, inputs, hidden=None):
        <implement me>
        return output, hidden

**Task** Implement a function to sample sentences from a model.

In [None]:
def generate(model, temp=0.8):
    model.eval()
    with torch.no_grad():
        prev_token = train_iter.dataset.fields['text'].vocab.stoi['<s>']
        end_token = train_iter.dataset.fields['text'].vocab.stoi['</s>']
        
        hidden = None
        for _ in range(150):
            <print sampled character>

generate(model)

In [None]:
model = RnnLM(vocab_size=len(train_iter.dataset.fields['text'].vocab)).to(DEVICE)

pad_idx = train_iter.dataset.fields['text'].vocab.stoi['<pad>']
unk_idx = train_iter.dataset.fields['text'].vocab.stoi['<unk>']
criterion = nn.CrossEntropyLoss(reduction='none').to(DEVICE)

optimizer = optim.Adam(model.parameters())

fit(model, criterion, optimizer, train_iter, epochs_count=30, unk_idx=unk_idx, pad_idx=pad_idx, val_iter=test_iter)

## Model improvement


We have only used Adam so far. In general, you can achieve better results with plain `SGD`, if you really try.

**Task** Replace the optimizer with, for example, `optim.SGD(model.parameters(), lr=20., weight_decay=1e-6)`, or try other options of your choice.

### Dropout

Recall what dropout is.

In essence, it multiplies the input vector by a randomly generated mask of zeros and ones (plus normalization).

For example, for a `Dropout(p)` layer:

$$m = \frac1{1-p} \cdot \text{Bernoulli}(1 - p)$$
$$\tilde h = m \odot h $$


For a long time nobody could get dropout to work in recurrent networks. The attempts generated a fresh random mask at every step:

*From [A Theoretically Grounded Application of Dropout in Recurrent Neural Networks](https://arxiv.org/abs/1512.05287)*


It turned out that the right thing to do is to keep the mask fixed: the same elements should be zeroed at every step.

PyTorch has no proper built-in variational dropout for LSTM, but there is [AWD-LSTM](https://github.com/salesforce/awd-lstm-lm).

I recommend looking at this review of the different ways to apply dropout in recurrent networks: [Dropout in Recurrent Networks - Part 1](https://becominghuman.ai/learning-note-dropout-in-recurrent-networks-part-1-57a9c19a2307) (links to Parts 2 and 3 are at the end).

**Task** Implement variational dropout. To do this, sample a mask of size `(1, batch_size, inp_dim)` for an input tensor of size `(seq_len, batch_size, inp_dim)` from the distribution $ \text {Bernoulli} (1 - p) $, multiply it by $ \frac1 {1 -p} $, and multiply the input tensor by it.

Thanks to broadcasting, every timestep of the input tensor is multiplied by the same mask, and all should be well.

Still, it is worth comparing against the usual `nn.Dropout`: the difference may turn out to be unnoticeable.
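To make the "same mask at every timestep" idea concrete, here is a sketch on nested Python lists instead of tensors. It is not a drop-in solution for the cell below: the function name, the explicit `rng` argument and the list layout are assumptions made so that the example runs without PyTorch.

```python
import random


def locked_dropout(inputs, p, rng):
    """inputs is a nested list of shape (seq_len, batch_size, inp_dim)."""
    batch_size, inp_dim = len(inputs[0]), len(inputs[0][0])
    scale = 1.0 / (1.0 - p)
    # One mask per sequence in the batch, sampled once and shared by all timesteps.
    mask = [[scale if rng.random() >= p else 0.0 for _ in range(inp_dim)]
            for _ in range(batch_size)]
    output = [[[value * mask[b][d] for d, value in enumerate(vector)]
               for b, vector in enumerate(step)]
              for step in inputs]
    return output, mask


rng = random.Random(42)
ones = [[[1.0] * 4 for _ in range(2)] for _ in range(3)]  # seq_len=3, batch=2, dim=4
dropped, mask = locked_dropout(ones, p=0.5, rng=rng)
```

With an all-ones input, every timestep of `dropped` is exactly the mask, which is the property that distinguishes this scheme from ordinary dropout, where each timestep would get its own mask.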

In [None]:
class LockedDropout(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, inputs, dropout=0.5):
        if not self.training or not dropout:
            return inputs
        
        <implement me>

## Conditional generation

We have already classified surnames by language. Now let's learn to generate a surname for a given language.

Let's use a subclass of `Dataset`, `TabularDataset`:

In [None]:
from torchtext.data import TabularDataset

name_field = Field(init_token='<s>', eos_token='</s>', lower=True, tokenize=lambda line: list(line))
lang_field = Field(sequential=False)

dataset = TabularDataset(
    path='surnames.txt', format='tsv', 
    skip_header=True,
    fields=[
        ('name', name_field),
        ('lang', lang_field)
    ]
)

name_field.build_vocab(dataset)
lang_field.build_vocab(dataset)

print(name_field.vocab.itos)
print(lang_field.vocab.itos)

Let's split the dataset:

In [None]:
train_dataset, val_dataset = dataset.split(split_ratio=0.25, stratified=True, strata_field='lang')

**Task** Build a language model that takes both the previously generated character and the index of the language the word belongs to. Build embeddings for the character and for the language, concatenate them, and then proceed as before.

You need to train this model and write a function that generates surnames for a given language.
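The concatenation step can be pictured as follows. The tiny embedding tables are hypothetical stand-ins for learned `nn.Embedding` layers, and plain lists stand in for tensors.

```python
# Hypothetical embedding tables; in the real model these would be learned.
CHAR_EMB = {'<s>': [0.0, 0.0], 'a': [0.1, 0.2], 'b': [0.3, 0.4]}
LANG_EMB = {'english': [1.0, 0.0, 0.0], 'russian': [0.0, 1.0, 0.0]}


def build_rnn_inputs(chars, lang):
    """One input vector per timestep: [char embedding ; language embedding]."""
    lang_vector = LANG_EMB[lang]
    return [CHAR_EMB[char] + lang_vector for char in chars]


rnn_inputs = build_rnn_inputs(['<s>', 'a', 'b'], 'russian')
for vector in rnn_inputs:
    print(vector)
```

The language vector is repeated at every timestep, so the RNN input size becomes the sum of the two embedding sizes, and the condition stays visible to the model throughout generation.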

# In the wild

Let's apply our knowledge to a real-world task: [Kaggle Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/).

The task is to classify messages into several categories. The network architecture should be as follows: some encoder (for example, an LSTM) builds an embedding of the sequence. Then the output layer predicts 6 categories, trained not with the cross-entropy loss but with `nn.BCEWithLogitsLoss`, because the categories are not mutually exclusive.
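To illustrate why a per-label loss fits here, the sketch below computes binary cross-entropy from raw logits with each category treated as an independent yes/no decision. It mirrors the idea behind `nn.BCEWithLogitsLoss` with mean reduction, but it is a plain-Python approximation with made-up numbers, not the library code.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over independent labels, computed from raw logits."""
    losses = []
    for x, y in zip(logits, targets):
        # Stable form of -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))].
        losses.append(max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x))))
    return sum(losses) / len(losses)


# Hypothetical logits for the six categories of a single comment.
logits = [3.0, -2.0, 2.5, -4.0, -3.0, 1.0]
targets = [1.0, 0.0, 1.0, 0.0, 0.0, 0.0]   # two labels are active at the same time
loss = bce_with_logits(logits, targets)
zero_logit_loss = bce_with_logits([0.0] * 6, targets)
print(loss, zero_logit_loss)
```

A softmax with cross-entropy would force the six probabilities to sum to one, so it could not express a comment that belongs to two categories at once; independent sigmoids can.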

Tip: figure out the tokenization that `Field` can do. Load pre-trained word embeddings, as we did before. Build a network and write a training loop for it.

**Task** Download the data from Kaggle, train something, and make a submission.

# References

[A Friendly Introduction to Cross-Entropy Loss, Rob DiPietro](https://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/)

[A Tutorial on Torchtext, Allen Nie](http://anie.me/On-Torchtext/)

[Dropout in Recurrent Networks, Ceshine Lee](https://becominghuman.ai/learning-note-dropout-in-recurrent-networks-part-1-57a9c19a2307)

[The Unreasonable Effectiveness of Recurrent Neural Networks, Andrej Karpathy](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

[The unreasonable effectiveness of Character-level Language Models, Yoav Goldberg](http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139)

[Unsupervised Sentiment Neuron, OpenAI](https://blog.openai.com/unsupervised-sentiment-neuron/)

[How to teach your neural network to generate poetry (in Russian)](https://habr.com/post/334046/)


[cs224n, "Lecture 8: Recurrent Neural Networks and Language Models"](https://www.youtube.com/watch?v=Keqep_PKrY8)

[Oxford Deep NLP, "Language Modelling and RNNs"](https://github.com/oxford-cs-deepnlp-2017/lectures#5-lecture-3---language-modelling-and-rnns-part-1-phil-blunsom)