In [None]:
!pip3 -qq install torch==0.4.1
!pip -qq install torchtext==0.3.1
!pip install sacremoses==0.0.5
!wget -O news.zip -qq --no-check-certificate "https://drive.google.com/uc?export=download&id=1hIVVpBqM6VU4n3ERkKq4tFaH4sKN0Hab"
!unzip news.zip

In [None]:
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import matplotlib.pyplot as plt
%matplotlib inline


if torch.cuda.is_available():
    from torch.cuda import FloatTensor, LongTensor
    DEVICE = torch.device('cuda')
else:
    from torch import FloatTensor, LongTensor
    DEVICE = torch.device('cpu')

np.random.seed(42)

# Abstractive Summarization

The task is to generate a short summary of a text.

For example, let's try to generate headlines for news articles:

In [None]:
!shuf -n 10 news.csv

Tokenize them. We will use a single vocabulary for the texts and the headlines.

In [None]:
from torchtext.data import Field, Example, Dataset, BucketIterator

BOS_TOKEN = '<s>'
EOS_TOKEN = '</s>'

word_field = Field(tokenize='moses', init_token=BOS_TOKEN, eos_token=EOS_TOKEN, lower=True)
fields = [('source', word_field), ('target', word_field)]

In [None]:
import pandas as pd
from tqdm import tqdm

data = pd.read_csv('news.csv', delimiter=',')

examples = []
for _, row in tqdm(data.iterrows(), total=len(data)):
    source_text = word_field.preprocess(row.text)
    target_text = word_field.preprocess(row.title)
    examples.append(Example.fromlist([source_text, target_text], fields))

Let's build the datasets:

In [None]:
dataset = Dataset(examples, fields)

train_dataset, test_dataset = dataset.split(split_ratio=0.85)

print('Train size =', len(train_dataset))
print('Test size =', len(test_dataset))

word_field.build_vocab(train_dataset, min_freq=7)
print('Vocab size =', len(word_field.vocab))

train_iter, test_iter = BucketIterator.splits(
    datasets=(train_dataset, test_dataset), batch_sizes=(16, 32), shuffle=True, device=DEVICE, sort=False
)

## Seq2seq for Abstractive Summarization

In general, the task is not much different from machine translation:
<center>
<img src = "https://image.ibb.co/jAf3S0/2018-11-20-9-42-17.png" width = "25%">
</center>
    
*From [Get To The Point: Summarization with Pointer-Generator Networks](https://arxiv.org/pdf/1704.04368.pdf)*

Here, at each step the decoder attends to all source tokens, or more precisely to their embeddings after the BiRNN.

This raises a question: why do we need an RNN at all if we end up looking at every token anyway?

# Transformer

This idea, dropping the RNN entirely, is what led to the Transformer.
<center>
<img src = "https://hsto.org/webt/59/f0/44/59f04410c0e56192990801.png" width="20%">
</center>
*From Attention Is All You Need*

In an RNN, the same operation (an LSTM cell) is applied to the current input at each step. The same is true here, except that there are no connections between timesteps, so all of them can be processed almost in parallel.

*The code below relies heavily on the excellent article [The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html).*

## Encoder

Let's start with the encoder:
<center>
<img src="http://jalammar.github.io/images/t/transformer_resideual_layer_norm.png" width = "20%">
</center>
*From [Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)*

It is a sequence of identical blocks with self-attention + fully connected layers.

You can think of a block as an analogue of an LSTM cell: it is also applied to every input with the same weights. The main difference is the absence of recurrent connections, which lets the encoder process all positions of a sequence simultaneously.

### Positional Encoding

We need to encode information about where in the sentence each token is located. The authors proposed the following:

$$ PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i / d_{\text{model}}}\right) $$
$$ PE_{(pos, 2i + 1)} = \cos\left(pos / 10000^{2i / d_{\text{model}}}\right) $$

where $pos$ is the position in the sentence and $i$ indexes the components of the hidden vector of dimension $d_{\text{model}}$.

In [None]:
import math 

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float) * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)
        
    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)

In [None]:
plt.figure(figsize=(15, 5))
pe = PositionalEncoding(20, 0)
y = pe(torch.zeros(1, 100, 20))
plt.plot(np.arange(100), y[0, :, 4:8].data.numpy())
plt.legend(["dim %d"%p for p in [4,5,6,7]])

As a result, token embeddings are obtained as the sum of the ordinary embedding and the positional embedding:
<center>
<img src="http://jalammar.github.io/images/t/transformer_positional_encoding_vectors.png" width="20%">
</center>
    
*From [Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)*

### Residual Connection

Let us look at the encoder block, the combination of operations that is repeated N times in the first figure.

The simplest part is the residual connection. Instead of using only the output of an arbitrary function $F$, we add its input to that output.

$$ y = F(x) \quad \to \quad y = F (x) + x $$

The idea is that ordinary networks are hard to make very deep because gradients vanish. Through the residual input $x$, gradients flow unimpeded. In image models, such blocks made it possible to stack many more layers and improve quality (see ResNet).
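
The effect on gradients follows from the derivative: $\frac{dy}{dx} = F'(x) + 1$, so even when $F'(x)$ is close to zero, the gradient through the block stays close to one. A minimal numerical check in plain Python, where the toy function `F` and the helper `derivative` are arbitrary choices made for illustration:

```python
import math

def F(x):
    """Toy sublayer with a very small derivative everywhere."""
    return 1e-3 * math.tanh(x)

def derivative(fn, x, h=1e-6):
    """Central finite-difference approximation of fn'(x)."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

x = 0.7
plain = derivative(F, x)                      # gradient through F alone
residual = derivative(lambda v: F(v) + v, x)  # gradient through F(x) + x

print(plain, residual)
```

The first number is tiny, while the second differs from it by almost exactly one.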

Nothing prevents us from doing the same.

In [None]:
class ResidualBlock(nn.Module):
    def __init__(self, size, dropout_rate):
        super().__init__()
        self._norm = LayerNorm(size)
        self._dropout = nn.Dropout(dropout_rate)

    def forward(self, inputs, sublayer):
        return inputs + self._dropout(sublayer(self._norm(inputs)))

### Layer Norm

In addition, layer normalization (LayerNorm) is applied.

**Batch normalization**
We have not covered it in detail, but BatchNorm works like this:
$$\mu_j = \frac{1}{m}\sum_{i=1}^{m}x_{ij} \\    \sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m}(x_{ij} - \mu_j)^2 \\    \hat{x}_{ij} = \frac{x_{ij} - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}}$$
$$y_{ij} = \gamma \ \hat{x}_{ij} + \beta$$

On each batch, $\mu$ and $\sigma$ are recomputed and the running statistics are updated. Inference uses the accumulated statistics.

Its main drawback is that it does not work well with recurrent networks. Layer normalization was proposed to overcome this:

**Layer normalization**
Now we use slightly different formulas, averaging over the features of each entry (so here $m$ denotes the number of features rather than the batch size):
$$\mu_i = \frac{1}{m}\sum_{j=1}^{m}x_{ij} \\    \sigma_i^2 = \frac{1}{m}\sum_{j=1}^{m}(x_{ij} - \mu_i)^2 \\    \hat{x}_{ij} = \frac{x_{ij} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}$$
$$y_{ij} = \gamma \ \hat{x}_{ij} + \beta$$

<center>
<img src="https://image.ibb.co/hjtuX0/layernorm.png" width="20%">
</center>
  
*From [Weight Normalization and Layer Normalization Explained ](http://mlexplained.com/2018/01/13/weight-normalization-and-layer-normalization-explained-normalization-in-deep-learning-part-2/)*

In BatchNorm, the statistics are computed for each feature by averaging over the batch; in LayerNorm, they are computed for each entry by averaging over its features.
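
To make the difference between the two axes concrete, here is a minimal dependency-free sketch. It uses plain Python lists rather than PyTorch tensors, omits the learned $\gamma$ and $\beta$, and the helper names `normalize`, `batch_norm` and `layer_norm` are illustrative, not part of the notebook.

```python
import math

def normalize(values, eps=1e-6):
    """Shift to zero mean and scale to unit variance (biased variance, as in the formulas above)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def batch_norm(x):
    """Normalize each feature (column) over the batch dimension."""
    columns = [normalize(list(col)) for col in zip(*x)]
    return [list(row) for row in zip(*columns)]

def layer_norm(x):
    """Normalize each entry (row) over its own features."""
    return [normalize(row) for row in x]

# Toy batch: 2 entries with 3 features each.
x = [[1.0, 2.0, 3.0],
     [3.0, 6.0, 9.0]]

bn = batch_norm(x)
ln = layer_norm(x)
print(bn)  # every column has zero mean
print(ln)  # every row has zero mean
```

Note that the layer-normalized rows come out nearly identical: the second entry is just the first one scaled by three, and per-entry normalization removes that scale.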

In [None]:
class LayerNorm(nn.Module):
    def __init__(self, features, eps=1e-6):
        super().__init__()
        
        self._gamma = nn.Parameter(torch.ones(features))
        self._beta = nn.Parameter(torch.zeros(features))
        self._eps = eps

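    def forward(self, inputs):
        # One possible solution: normalize over the last (feature) dimension,
        # following the layer normalization formulas above.
        mean = inputs.mean(-1, keepdim=True)
        var = ((inputs - mean) ** 2).mean(-1, keepdim=True)
        return self._gamma * (inputs - mean) / torch.sqrt(var + self._eps) + self._beta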

### Attention

The whole Transformer relies on the idea of self-attention. It looks like this:
<center>
<img src="http://jalammar.github.io/images/t/transformer_self-attention_visualization.png" width="20%">
</center>

*From [Tensor2Tensor Tutorial](https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb)*

The embedding of the word *it* is built as a combination of the embeddings of all words in the sentence.

The paper proposes the following attention:

$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V $$

This is roughly the dot-product attention from the last lesson: the query (**Q**uery) is multiplied by the keys (**K**ey) via a dot product, then a softmax is taken. The result is a set of scores of how relevant each timestep is, and these scores weight the values (**V**alue).

For example, $\mathrm{emb}(\text{it}) = \mathrm{Attention}(\text{it}, \ldots \text{because it was too tired}, \ldots \text{because it was too tired})$.

The only difference is the scaling factor $\frac{1}{\sqrt{d_k}}$, where $d_k$ is the dimension of the key. The paper argues that this works better for large key sizes $d_k$.
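
One way to see why the scaling helps: if the components of the query and the key are independent with zero mean and unit variance, their dot product has variance $d_k$, so unscaled scores grow with the key size and push the softmax towards saturation. The sketch below checks this empirically using only the standard library; the helper `score_std` and the sample sizes are arbitrary choices for illustration.

```python
import math
import random

def score_std(d_k, scale, trials=2000, seed=0):
    """Empirical standard deviation of q.k for random unit-variance q and k."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dot = sum(a * b for a, b in zip(q, k))
        scores.append(dot / math.sqrt(d_k) if scale else dot)
    mean = sum(scores) / trials
    return math.sqrt(sum((s - mean) ** 2 for s in scores) / trials)

for d_k in (4, 64):
    print(d_k, round(score_std(d_k, scale=False), 2), round(score_std(d_k, scale=True), 2))
# Unscaled: the spread grows roughly like sqrt(d_k). Scaled: it stays close to 1.
```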

In [None]:
class ScaledDotProductAttention(nn.Module):
    def __init__(self, dropout_rate):
        super().__init__()
        
        self._dropout = nn.Dropout(dropout_rate)
        
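    def forward(self, query, key, value, mask):
        # One possible solution: softmax(Q K^T / sqrt(d_k)) V, as in the formula above.
        scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(query.size(-1))
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)  # hide masked positions
        attn_probs = self._dropout(F.softmax(scores, dim=-1))
        return torch.matmul(attn_probs, value), attn_probs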

### Multi-Head Attention

<center>
<img src="https://hsto.org/webt/59/f0/44/59f0440f1109b864893781.png" width="20%">
</center>

An important reason why attention (and especially self-attention) works so well is the use of several heads (multi-head).

In general, attention measures the similarity between a key and a query. Having several heads helps (or at least should help) to measure this similarity by different criteria: syntactic, semantic, and so on.

For example, the picture shows two heads: when generating *it*, one head looks at *the animal* and the other at *tired*:

<center>
<img src="https://hsto.org/webt/59/f0/44/59f0440f1109b864893781.png" width="20%">
</center>
    
*From [Tensor2Tensor Tutorial](https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb)*


It is applied this way:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head_1}, ...,
\mathrm{head_h})W^O    \\
    \mathrm{head_i} = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i)$$
    
Where $W^Q_i \in \mathbb{R}^{d_{model} \times d_k}, W_i^K \in \mathbb{R}^{d_{model} \times d_k}, W^V_i \in \mathbb{R}^{d_{model} \times d_v}, W^O \in \mathbb{R}^{hd_v \times d_{model}}$.

The original paper used $h = 8$, $d_k = d_v = d_{\text{model}} / h = 64$.
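
As a shape-level illustration of these numbers, the sketch below cuts one $d_{\text{model}}$-dimensional vector into $h$ chunks of size $d_k$ and concatenates them back. It ignores the learned projections $W^Q_i$, $W^K_i$, $W^V_i$, $W^O$ and uses plain lists; `split_heads` and `merge_heads` are hypothetical helper names.

```python
def split_heads(vector, heads_count):
    """Split a d_model-sized vector into heads_count chunks of size d_k."""
    d_model = len(vector)
    assert d_model % heads_count == 0, "d_model must be divisible by the number of heads"
    d_k = d_model // heads_count
    return [vector[i * d_k:(i + 1) * d_k] for i in range(heads_count)]

def merge_heads(heads):
    """Concatenate per-head chunks back into a single d_model-sized vector."""
    return [value for head in heads for value in head]

d_model, h = 512, 8
vector = list(range(d_model))
heads = split_heads(vector, h)

print(len(heads), len(heads[0]))  # 8 64
assert merge_heads(heads) == vector
```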

The whole process looks like this:
<center>
<img src="http://jalammar.github.io/images/t/transformer_multi-headed_self-attention-recap.png" width="20%">
</center>
    
*From Illustrated Transformer*

In [None]:
class MultiHeadedAttention(nn.Module):
    def __init__(self, heads_count, d_model, dropout_rate=0.1):
        super().__init__()
        
        assert d_model % heads_count == 0

        self._d_k = d_model // heads_count
        self._heads_count = heads_count
        self._attention = ScaledDotProductAttention(dropout_rate)
        self._attn_probs = None
        
        self._w_q = nn.Linear(d_model, d_model)
        self._w_k = nn.Linear(d_model, d_model)
        self._w_v = nn.Linear(d_model, d_model)
        self._w_o = nn.Linear(d_model, d_model)
        
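    def forward(self, query, key, value, mask=None):
        # One possible solution: project, split into heads, attend, merge the heads back.
        if mask is not None:
            mask = mask.unsqueeze(1)  # broadcast the same mask over all heads
        batch_size = query.size(0)

        def split_heads(x):
            # (batch, seq_len, d_model) -> (batch, heads, seq_len, d_k)
            return x.view(batch_size, -1, self._heads_count, self._d_k).transpose(1, 2)

        query = split_heads(self._w_q(query))
        key = split_heads(self._w_k(key))
        value = split_heads(self._w_v(value))

        outputs, self._attn_probs = self._attention(query, key, value, mask)
        # (batch, heads, seq_len, d_k) -> (batch, seq_len, d_model)
        outputs = outputs.transpose(1, 2).contiguous().view(batch_size, -1, self._heads_count * self._d_k)
        return self._w_o(outputs)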

### Position-wise Feed-Forward Networks

The feed-forward block in the encoder looks like this:
$$\mathrm{FFN}(x)=\max(0, xW_1 + b_1) W_2 + b_2$$

In [None]:
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, inputs):
        return self.w_2(self.dropout(F.relu(self.w_1(inputs))))

### Encoder block

Let's assemble everything into a block:

In [None]:
class EncoderBlock(nn.Module):
    def __init__(self, size, self_attn, feed_forward, dropout_rate):
        super().__init__()
        
        self._self_attn = self_attn
        self._feed_forward = feed_forward
        self._self_attention_block = ResidualBlock(size, dropout_rate)
        self._feed_forward_block = ResidualBlock(size, dropout_rate)

    def forward(self, inputs, mask):
        outputs = self._self_attention_block(inputs, lambda inputs: self._self_attn(inputs, inputs, inputs, mask))
        return self._feed_forward_block(outputs, self._feed_forward)

In [None]:
class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, d_ff, blocks_count, heads_count, dropout_rate):
        super().__init__()
        
        self._emb = nn.Sequential(
            nn.Embedding(vocab_size, d_model),
            PositionalEncoding(d_model, dropout_rate)
        )
        
        block = lambda: EncoderBlock(
            size=d_model, 
            self_attn=MultiHeadedAttention(heads_count, d_model, dropout_rate), 
            feed_forward=PositionwiseFeedForward(d_model, d_ff, dropout_rate),
            dropout_rate=dropout_rate
        )
        self._blocks = nn.ModuleList([block() for _ in range(blocks_count)])
        self._norm = LayerNorm(d_model)
        
    def forward(self, inputs, mask):
        inputs = self._emb(inputs)
        
        for block in self._blocks:
            inputs = block(inputs, mask)
        return self._norm(inputs)

## Decoder

<center>
<img src="https://hsto.org/webt/59/f0/44/59f0440f7d88f805415140.png" width="10%">
</center>

The decoder block (gray part) consists of three parts:
1. First, the same self-attention as in the encoder
2. Then, standard attention over the encoder outputs, queried by the current decoder state (the same as in seq2seq with attention)
3. Finally, a feed-forward block

All of this, of course, with residual connections.

In [None]:
class DecoderLayer(nn.Module):
    def __init__(self, size, self_attn, encoder_attn, feed_forward, dropout_rate):
        super().__init__()
                
        self._self_attn = self_attn
        self._encoder_attn = encoder_attn
        self._feed_forward = feed_forward
        self._self_attention_block = ResidualBlock(size, dropout_rate)
        self._attention_block = ResidualBlock(size, dropout_rate)
        self._feed_forward_block = ResidualBlock(size, dropout_rate)
 
    def forward(self, inputs, encoder_output, source_mask, target_mask):
        outputs = self._self_attention_block(
            inputs, lambda inputs: self._self_attn(inputs, inputs, inputs, target_mask)
        )
        outputs = self._attention_block(
            outputs, lambda inputs: self._encoder_attn(inputs, encoder_output, encoder_output, source_mask)
        )
        return self._feed_forward_block(outputs, self._feed_forward)

In [None]:
class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, d_ff, blocks_count, heads_count, dropout_rate):
        super().__init__()
        
        self._emb = nn.Sequential(
            nn.Embedding(vocab_size, d_model),
            PositionalEncoding(d_model, dropout_rate)
        )
        
        block = lambda: DecoderLayer(
            size=d_model, 
            self_attn=MultiHeadedAttention(heads_count, d_model, dropout_rate),
            encoder_attn=MultiHeadedAttention(heads_count, d_model, dropout_rate),
            feed_forward=PositionwiseFeedForward(d_model, d_ff, dropout_rate),
            dropout_rate=dropout_rate
        )
        self._blocks = nn.ModuleList([block() for _ in range(blocks_count)])
        self._norm = LayerNorm(d_model)
        self._out_layer = nn.Linear(d_model, vocab_size)
        
    def forward(self, inputs, encoder_output, source_mask, target_mask):
        inputs = self._emb(inputs)
        for block in self._blocks:
            inputs = block(inputs, encoder_output, source_mask, target_mask)
        return self._out_layer(self._norm(inputs))

In the decoder, each position must attend only to previous tokens, so let's generate a mask for that:

In [None]:
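def subsequent_mask(size):
    # Ones strictly above the diagonal mark future positions; `== 0` keeps the diagonal
    # and everything below it, so each token can attend to itself and to earlier tokens.
    mask = torch.ones(size, size, device=DEVICE).triu_(diagonal=1)
    return mask.unsqueeze(0) == 0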

In [None]:
plt.figure(figsize=(5,5))
plt.imshow(subsequent_mask(20)[0].cpu().numpy())  # move to CPU first so this also works on GPU
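
In the usual Transformer decoder, position $i$ may attend to positions $j \le i$, itself included. For four tokens, that pattern looks as follows. This is a plain-Python sketch with a hypothetical helper `causal_mask`; it only illustrates the intended pattern and does not depend on the tensor code above.

```python
def causal_mask(size):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(size)] for i in range(size)]

for row in causal_mask(4):
    print(''.join('x' if allowed else '.' for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```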

## Full model

In [None]:
class FullModel(nn.Module):
    def __init__(self, source_vocab_size, target_vocab_size, d_model=256, d_ff=1024, 
                 blocks_count=4, heads_count=8, dropout_rate=0.1):
        
        super().__init__()
        
        self.d_model = d_model
        self.encoder = Encoder(source_vocab_size, d_model, d_ff, blocks_count, heads_count, dropout_rate)
        self.decoder = Decoder(target_vocab_size, d_model, d_ff, blocks_count, heads_count, dropout_rate)
        
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)
        
    def forward(self, source_inputs, target_inputs, source_mask, target_mask):
        encoder_output = self.encoder(source_inputs, source_mask)
        return self.decoder(target_inputs, encoder_output, source_mask, target_mask)

In [None]:
def make_mask(source_inputs, target_inputs, pad_idx):
    source_mask = (source_inputs != pad_idx).unsqueeze(-2)
    target_mask = (target_inputs != pad_idx).unsqueeze(-2)
    target_mask = target_mask & subsequent_mask(target_inputs.size(-1)).type_as(target_mask)
    return source_mask, target_mask


def convert_batch(batch, pad_idx=1):
    source_inputs, target_inputs = batch.source.transpose(0, 1), batch.target.transpose(0, 1)
    source_mask, target_mask = make_mask(source_inputs, target_inputs, pad_idx)
    
    return source_inputs, target_inputs, source_mask, target_mask

In [None]:
batch = next(iter(train_iter))

In [None]:
model = FullModel(source_vocab_size=len(word_field.vocab), target_vocab_size=len(word_field.vocab)).to(DEVICE)

model(*convert_batch(batch))

## Optimizer

It is also very important for this model to use the right optimizer.

In [None]:
class NoamOpt(object):
    def __init__(self, model_size, factor=2, warmup=4000, optimizer=None):
        if optimizer is not None:
            self.optimizer = optimizer
        else:
            # NOTE: this fallback relies on the global `model` defined earlier in the notebook.
            self.optimizer = optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9)
        self._step = 0
        self.warmup = warmup
        self.factor = factor
        self.model_size = model_size
        self._rate = 0
        
    def step(self):
        self._step += 1
        rate = self.rate()
        for p in self.optimizer.param_groups:
            p['lr'] = rate
        self._rate = rate
        self.optimizer.step()
        
    def rate(self, step = None):
        if step is None:
            step = self._step
        return self.factor * (self.model_size ** (-0.5) * min(step ** (-0.5), step * self.warmup ** (-1.5)))

The idea is to increase the learning rate linearly during the first warmup steps and then decrease it according to the following formula:

$$lrate = d_{\text{model}}^{-0.5} \cdot
  \min({step\_num}^{-0.5},
    {step\_num} \cdot {warmup\_steps}^{-1.5})$$
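
A quick way to get a feel for the formula is to evaluate it directly. The sketch below is a standalone function that mirrors the formula without the extra `factor` multiplier; the name `noam_rate` and the default values are illustrative choices. The two branches of the `min` meet at `step == warmup`, which is where the rate peaks.

```python
def noam_rate(step, d_model=512, warmup=4000):
    """Learning rate from the formula above; step is expected to start at 1."""
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup ** -1.5)

for step in (1, 1000, 4000, 16000):
    print(step, noam_rate(step))

# Linear growth during warmup, inverse-square-root decay afterwards.
assert noam_rate(2000) < noam_rate(4000) > noam_rate(8000)
```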

In [None]:
opts = [NoamOpt(512, 1, 4000, None), 
        NoamOpt(512, 1, 8000, None),
        NoamOpt(256, 1, 4000, None)]
plt.plot(np.arange(1, 20000), [[opt.rate(i) for opt in opts] for i in range(1, 20000)])
plt.legend(["512:4000", "512:8000", "256:4000"])

## Training the model

In [None]:
import math
from tqdm import tqdm
tqdm.get_lock().locks = []


def do_epoch(model, criterion, data_iter, optimizer=None, name=None):
    epoch_loss = 0
    
    is_train = not optimizer is None
    name = name or ''
    model.train(is_train)
    
    batches_count = len(data_iter)
    
    with torch.autograd.set_grad_enabled(is_train):
        with tqdm(total=batches_count) as progress_bar:
            for i, batch in enumerate(data_iter):
                source_inputs, target_inputs, source_mask, target_mask = convert_batch(batch)                                
                logits = model.forward(source_inputs, target_inputs[:, :-1], source_mask, target_mask[:, :-1, :-1])
                
                logits = logits.contiguous().view(-1, logits.shape[-1])
                target = target_inputs[:, 1:].contiguous().view(-1)
                loss = criterion(logits, target)

                epoch_loss += loss.item()

                if optimizer:
                    optimizer.optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()

                progress_bar.update()
                progress_bar.set_description('{:>5s} Loss = {:.5f}, PPX = {:.2f}'.format(name, loss.item(), 
                                                                                         math.exp(loss.item())))
                
            progress_bar.set_description('{:>5s} Loss = {:.5f}, PPX = {:.2f}'.format(
                name, epoch_loss / batches_count, math.exp(epoch_loss / batches_count))
            )
            progress_bar.refresh()

    return epoch_loss / batches_count


def fit(model, criterion, optimizer, train_iter, epochs_count=1, val_iter=None):
    best_val_loss = None
    for epoch in range(epochs_count):
        name_prefix = '[{} / {}] '.format(epoch + 1, epochs_count)
        train_loss = do_epoch(model, criterion, train_iter, optimizer, name_prefix + 'Train:')
        
        if not val_iter is None:
            val_loss = do_epoch(model, criterion, val_iter, None, name_prefix + '  Val:')

In [None]:
model = FullModel(source_vocab_size=len(word_field.vocab), target_vocab_size=len(word_field.vocab)).to(DEVICE)

pad_idx = word_field.vocab.stoi['<pad>']
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx).to(DEVICE)

optimizer = NoamOpt(model.d_model)

fit(model, criterion, optimizer, train_iter, epochs_count=30, val_iter=test_iter)

**Task** Add a generator for the model.
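
One possible starting point is greedy decoding: feed the tokens generated so far to the decoder, take the most probable next token, and stop at the end-of-sequence token. The sketch below shows only this control flow in plain Python. `next_token_scores` is a hypothetical callable standing in for a forward pass of the model (it is not part of the notebook), and the toy scorer exists just to make the example runnable.

```python
def greedy_decode(next_token_scores, bos_id, eos_id, max_len=20):
    """Generate token ids one at a time, always picking the highest-scoring token."""
    tokens = [bos_id]
    for _ in range(max_len):
        scores = next_token_scores(tokens)  # one score per vocabulary entry
        next_id = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

# Toy stand-in for the model: over a vocabulary of 5 ids it always prefers
# the id following the last token, so generation counts up to eos_id = 4.
def toy_scores(tokens):
    target = min(tokens[-1] + 1, 4)
    return [1.0 if i == target else 0.0 for i in range(5)]

print(greedy_decode(toy_scores, bos_id=0, eos_id=4))  # [0, 1, 2, 3, 4]
```

With the real model, the scoring function would presumably encode the source once and then run the decoder on the current prefix with the appropriate masks.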

**Task** Add an evaluation of the model using the ROUGE metric (for example, with the package https://pypi.org/project/pyrouge/0.1.3/)
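
Before wiring in a package, it can help to see what the simplest variant of the metric measures. The sketch below computes a simplified ROUGE-1 (unigram overlap) on whitespace tokens. It is not the `pyrouge` API and it skips the stemming and other preprocessing that real implementations apply; the function name `rouge_1` is an illustrative choice.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Return (precision, recall, f1) of unigram overlap between two strings."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((cand_counts & ref_counts).values())  # clipped matches
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(rouge_1("the cat sat on the mat", "the cat lay on a mat"))
```

Here four of the six candidate tokens are matched (the second "the" is clipped), so precision and recall are both 4/6.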

**Task** Add visualization (see the code in the links for ideas).

## Model improvements

**Task** Try sharing the embedding matrices: there are three of them (the encoder input, the decoder input, and the decoder output).

**Task** Replace the loss with label smoothing.
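
As a reminder of what the change amounts to in one common formulation: instead of a one-hot target, the correct class receives probability $1 - \varepsilon$ and the remaining $\varepsilon$ is spread uniformly over the other classes. The sketch below builds such a target and evaluates the cross-entropy against it in plain Python. The function names and the value $\varepsilon = 0.1$ are illustrative assumptions, and padding handling is left out.

```python
import math

def smoothed_target(true_idx, num_classes, eps=0.1):
    """Target distribution with 1 - eps on the true class and eps spread over the rest."""
    off_value = eps / (num_classes - 1)
    return [1.0 - eps if i == true_idx else off_value for i in range(num_classes)]

def log_softmax(logits):
    """Numerically stable log-softmax."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return [l - log_norm for l in logits]

def smoothed_cross_entropy(logits, true_idx, eps=0.1):
    """Cross-entropy between the smoothed target and softmax(logits)."""
    target = smoothed_target(true_idx, len(logits), eps)
    return -sum(t * lp for t, lp in zip(target, log_softmax(logits)))

logits = [2.0, 0.5, -1.0, 0.0]
print(smoothed_target(0, 4))
print(smoothed_cross_entropy(logits, 0, eps=0.0))  # ordinary cross-entropy
print(smoothed_cross_entropy(logits, 0, eps=0.1))  # penalizes over-confident predictions
```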

# Pointer-Generator Networks

A cool idea specific to summarization:
<center>
<img src = "https://image.ibb.co/eijTc0/2018-11-20-10-18-52.png" width = "25%">
</center>

**Task** Try to implement it.

# References
Attention Is All You Need, 2017 [[pdf]](https://arxiv.org/pdf/1706.03762.pdf)  
Get To The Point: Summarization with Pointer-Generator Networks, 2017 [[pdf]](https://arxiv.org/pdf/1704.04368.pdf)  
Universal Transformers, 2018 [[arxiv]](https://arxiv.org/abs/1807.03819)

[Transformer — новая архитектура нейросетей для работы с последовательностями](https://habr.com/post/341240/)  
[The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)  
[The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html)  
[Weighted Transformer](https://einstein.ai/research/blog/weighted-transformer)  
[Your tldr by an ai: a deep reinforced model for abstractive summarization](https://einstein.ai/research/blog/your-tldr-by-an-ai-a-deep-reinforced-model-for-abstractive-summarization)