In [None]:
!pip3 -qq install torch==0.4.1
!pip -qq install torchtext==0.3.1
!pip -qq install torchvision==0.2.1
!pip -qq install spacy==2.0.16
!python -m spacy download en
!pip install sacremoses==0.0.5
!pip install subword_nmt==0.3.5
!wget -qq http://www.manythings.org/anki/rus-eng.zip 
!unzip rus-eng.zip

In [None]:
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


if torch.cuda.is_available():
    from torch.cuda import FloatTensor, LongTensor
    DEVICE = torch.device('cuda')
else:
    from torch import FloatTensor, LongTensor
    DEVICE = torch.device('cpu')

np.random.seed(42)

# Machine Translation

We have already looked at this picture several times:
<img src="http://karpathy.github.io/assets/rnn/diags.jpeg" width="50%">

*From [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)*

In POS tagging we (or at least some of us) relied on an important idea: first, some function over the characters was used to build an embedding of each word (for example, the many-to-one RNN in the picture). Then another RNN built word embeddings that take the context into account. Finally, each of them was classified with logistic regression.

What matters here is that the encoder learns to build the embeddings end-to-end, right inside the network. This ability to train end-to-end is the main difference between neural networks and classical approaches.

The other thing we built was language models, like this one:
<img src="https://hsto.org/web/dc1/7c2/c4e/dc17c2c4e9ac434eb5346ada2c412c9a.png" width="50%">


Pay attention to the red arrow: it shows the hidden state being passed on, which serves as the network's memory.

Now let's combine these two ideas:
<img src="https://raw.githubusercontent.com/tensorflow/nmt/master/nmt/g3doc/img/seq2seq.jpg" width="50%">

*From [tensorflow/nmt](https://github.com/tensorflow/nmt)*

Everything looks almost like a language model, but no predictions are made in the blue part; only its last hidden state is used.

The blue part of the network is called the encoder; it builds an embedding of the whole sequence. The red part is the decoder; it works like an ordinary language model but takes the encoder's output into account.

As a result, the encoder learns to extract the meaning of a word sequence efficiently, and the decoder has to build a new sequence from it. That can be the words of a translation, the words of a chatbot reply, or something else, depending on your imagination.

## Data preparation

Let's start by reading the data. It comes from Anki, so it is somewhat specific:

In [None]:
!shuf -n 10 rus.txt

Let's tokenize it:

In [None]:
from torchtext.data import Field, Example, Dataset, BucketIterator

BOS_TOKEN = '<s>'
EOS_TOKEN = '</s>'

source_field = Field(tokenize='spacy', init_token=None, eos_token=EOS_TOKEN)
target_field = Field(tokenize='moses', init_token=BOS_TOKEN, eos_token=EOS_TOKEN)
fields = [('source', source_field), ('target', target_field)]

In [None]:
source_field.preprocess("It's surprising that you haven't heard anything about her wedding.")

In [None]:
target_field.preprocess('Удивительно, что ты ничего не слышал о её свадьбе.')

In [None]:
from tqdm import tqdm

MAX_TOKENS_COUNT = 16
SUBSET_SIZE = .3

examples = []
with open('rus.txt') as f:
    for line in tqdm(f, total=328190):
        source_text, target_text = line.split('\t')
        source_text = source_field.preprocess(source_text)
        target_text = target_field.preprocess(target_text)
        if len(source_text) <= MAX_TOKENS_COUNT and len(target_text) <= MAX_TOKENS_COUNT:
            if np.random.rand() < SUBSET_SIZE:
                examples.append(Example.fromlist([source_text, target_text], fields))

Let's build the datasets:

In [None]:
dataset = Dataset(examples, fields)

train_dataset, test_dataset = dataset.split(split_ratio=0.85)

print('Train size =', len(train_dataset))
print('Test size =', len(test_dataset))

source_field.build_vocab(train_dataset, min_freq=3)
print('Source vocab size =', len(source_field.vocab))

target_field.build_vocab(train_dataset, min_freq=3)
print('Target vocab size =', len(target_field.vocab))

train_iter, test_iter = BucketIterator.splits(
    datasets=(train_dataset, test_dataset), batch_sizes=(32, 256), shuffle=True, device=DEVICE, sort=False
)

In [None]:
source_field.process([source_field.preprocess("It's surprising that you haven't heard anything about her wedding.")])

In [None]:
source_field.vocab.itos

In [None]:
target_field.vocab.itos

## Seq2seq model

It's time to write a simple seq2seq model. We split it into several modules: an Encoder, a Decoder, and their combination.

The Encoder should resemble the character-level network from POS tagging: embed the tokens, run an RNN over them (a GRU in this case), and return the last hidden state.

The Decoder is almost the same, except that it predicts a token at every step.

**Task** Implement the models.

In [None]:
batch = next(iter(train_iter))

In [None]:
class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, rnn_hidden_dim=256, num_layers=1, bidirectional=False):
        super().__init__()

        self._emb = nn.Embedding(vocab_size, emb_dim)
        self._rnn = nn.GRU(input_size=emb_dim, hidden_size=rnn_hidden_dim, 
                           num_layers=num_layers, bidirectional=bidirectional)

    def forward(self, inputs, hidden=None):
        <implement me>

In [None]:
class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, rnn_hidden_dim=256, num_layers=1):
        super().__init__()

        self._emb = nn.Embedding(vocab_size, emb_dim)
        self._rnn = nn.GRU(input_size=emb_dim, hidden_size=rnn_hidden_dim, num_layers=num_layers)
        self._out = nn.Linear(rnn_hidden_dim, vocab_size)

    def forward(self, inputs, encoder_output, hidden=None):
        <implement me>
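
One possible way to fill in the two `forward` methods, given as a standalone sketch rather than a reference solution. The class names `SketchEncoder` and `SketchDecoder` are hypothetical, the tensors are assumed to be time-major `(seq_len, batch)`, and the decoder simply ignores `encoder_output` because the plain seq2seq model only needs the hidden state.

```python
import torch
import torch.nn as nn


class SketchEncoder(nn.Module):
    """Embeds source tokens, runs a GRU and returns only the final hidden state."""

    def __init__(self, vocab_size, emb_dim=128, rnn_hidden_dim=256, num_layers=1):
        super().__init__()
        self._emb = nn.Embedding(vocab_size, emb_dim)
        self._rnn = nn.GRU(input_size=emb_dim, hidden_size=rnn_hidden_dim, num_layers=num_layers)

    def forward(self, inputs, hidden=None):
        # inputs: (src_len, batch) tensor of token indices
        embedded = self._emb(inputs)
        _, hidden = self._rnn(embedded, hidden)
        return hidden  # (num_layers, batch, rnn_hidden_dim)


class SketchDecoder(nn.Module):
    """Same structure as the encoder, plus a linear layer that scores every target token."""

    def __init__(self, vocab_size, emb_dim=128, rnn_hidden_dim=256, num_layers=1):
        super().__init__()
        self._emb = nn.Embedding(vocab_size, emb_dim)
        self._rnn = nn.GRU(input_size=emb_dim, hidden_size=rnn_hidden_dim, num_layers=num_layers)
        self._out = nn.Linear(rnn_hidden_dim, vocab_size)

    def forward(self, inputs, encoder_output, hidden=None):
        # encoder_output is unused in this sketch; only `hidden` carries source information
        embedded = self._emb(inputs)
        outputs, hidden = self._rnn(embedded, hidden)
        return self._out(outputs), hidden  # logits: (tgt_len, batch, vocab_size)


# Smoke test on random indices (vocabulary sizes are arbitrary)
torch.manual_seed(0)
encoder = SketchEncoder(vocab_size=50)
decoder = SketchDecoder(vocab_size=60)
source = torch.randint(0, 50, (7, 4))
target = torch.randint(0, 60, (5, 4))
encoder_hidden = encoder(source)
logits, decoder_hidden = decoder(target, encoder_hidden, encoder_hidden)
print(logits.shape)
```

Returning `(logits, hidden)` from the decoder matches how the training loop in this notebook unpacks the model output.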

The translation model simply calls the Encoder first and then passes its hidden state to the Decoder as the initial state.

In [None]:
class TranslationModel(nn.Module):
    def __init__(self, source_vocab_size, target_vocab_size, emb_dim=128, 
                 rnn_hidden_dim=256, num_layers=1, bidirectional_encoder=False):
        
        super().__init__()
        
        self.encoder = Encoder(source_vocab_size, emb_dim, rnn_hidden_dim, num_layers, bidirectional_encoder)
        self.decoder = Decoder(target_vocab_size, emb_dim, rnn_hidden_dim, num_layers)
        
    def forward(self, source_inputs, target_inputs):
        encoder_hidden = self.encoder(source_inputs)
        
        return self.decoder(target_inputs, encoder_hidden, encoder_hidden)

In [None]:
model = TranslationModel(source_vocab_size=len(source_field.vocab), target_vocab_size=len(target_field.vocab)).to(DEVICE)

model(batch.source, batch.target)

Let's implement the simplest way to translate, greedy decoding: at each step we output the most likely of the predicted tokens:
<img src="https://github.com/tensorflow/nmt/raw/master/nmt/g3doc/img/greedy_dec.jpg" width="50%">
*From [tensorflow/nmt](https://github.com/tensorflow/nmt)*

**Task** Implement the function.

In [None]:
def greedy_decode(model, source_text, source_field, target_field):
    bos_index = target_field.vocab.stoi[BOS_TOKEN]
    eos_index = target_field.vocab.stoi[EOS_TOKEN]
    
    model.eval()
    with torch.no_grad():
        result = [] # list of predicted tokens indices
        <implement me>
            
        return ' '.join(target_field.vocab.itos[ind.squeeze().item()] for ind in result)

In [None]:
greedy_decode(model, "Do you believe?", source_field, target_field)
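
The control flow of greedy decoding can be sketched without any neural network. In the sketch below, `step_fn`, `toy_step` and the score table are hypothetical stand-ins for one decoder step; in the notebook that step would be a call to `model.decoder` followed by an argmax over the logits.

```python
def greedy_decode_sketch(step_fn, initial_state, bos_index, eos_index, max_len=30):
    """Feed the best-scoring token back into the decoder until EOS or max_len."""
    result = []
    prev_token, state = bos_index, initial_state
    for _ in range(max_len):
        scores, state = step_fn(prev_token, state)
        # argmax over the vocabulary
        prev_token = max(range(len(scores)), key=scores.__getitem__)
        if prev_token == eos_index:
            break
        result.append(prev_token)
    return result


# Toy decoder: scores depend only on the previous token.
# Vocabulary: 0 = <s>, 1 = </s>, 2 = "a", 3 = "b"
TOY_SCORES = {
    0: [0.0, 0.1, 0.7, 0.2],
    2: [0.0, 0.2, 0.1, 0.7],
    3: [0.0, 0.8, 0.1, 0.1],
}


def toy_step(prev_token, state):
    return TOY_SCORES[prev_token], state


decoded = greedy_decode_sketch(toy_step, None, bos_index=0, eos_index=1)
print(decoded)  # [2, 3]
```

The `max_len` cap matters in practice: an untrained model may never emit `</s>`.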


We need some way to evaluate the model.

The usual choice is the [BLEU score](https://en.wikipedia.org/wiki/BLEU), which is roughly the precision with which the n-grams of the correct (reference) translation are guessed.
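
To make "precision of guessing n-grams" concrete, here is a minimal sketch of the clipped (modified) n-gram precision that BLEU is built on. The full metric additionally combines the precisions for several `n` with a geometric mean and applies a brevity penalty, which this sketch leaves out. The function names and the example sentences are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, hypothesis, n):
    """Share of hypothesis n-grams that also occur in the reference, with clipped counts."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # An n-gram is credited at most as many times as it appears in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


reference = 'the cat is on the mat'.split()
hypothesis = 'the the cat mat'.split()
p1 = modified_precision(reference, hypothesis, 1)
p2 = modified_precision(reference, hypothesis, 2)
print(p1, p2)
```

Here every unigram of the hypothesis is found in the reference, but only one of its three bigrams is, which is why higher-order n-grams are needed to reward correct word order.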

**Task** Implement the evaluation function: for the batches from `iterator`, predict their translations, trim them at `</s>`, and add the correct and predicted variants to `refs` and `hyps`, respectively.

In [None]:
from nltk.translate.bleu_score import corpus_bleu

def evaluate_model(model, iterator):
    model.eval()
    refs, hyps = [], []
    eos_index = iterator.dataset.fields['target'].vocab.stoi[EOS_TOKEN]
    with torch.no_grad():
        for i, batch in enumerate(iterator):
            <implement me>
            
    return corpus_bleu([[ref] for ref in refs], hyps) * 100
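
The trimming step of the task can be sketched on plain lists. The helper name `trim_at_eos` and the index values are hypothetical; with the time-major batches used in this notebook, a tensor would first have to be converted to per-sentence lists, for example with `.t().tolist()`.

```python
def trim_at_eos(indices, eos_index):
    """Return the tokens before the first eos_index (all of them if it never occurs)."""
    if eos_index in indices:
        return indices[:indices.index(eos_index)]
    return list(indices)


eos_index = 3
batch_of_predictions = [[5, 6, 3, 1, 1], [7, 8, 9, 4, 2]]
hyps = [trim_at_eos(seq, eos_index) for seq in batch_of_predictions]
print(hyps)  # [[5, 6], [7, 8, 9, 4, 2]]
```

Trimming also drops the padding that follows `</s>`, so the padded positions do not count as n-gram matches.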

In [None]:
import math
from tqdm import tqdm
tqdm.get_lock().locks = []


def do_epoch(model, criterion, data_iter, optimizer=None, name=None):
    epoch_loss = 0
    
    is_train = optimizer is not None
    name = name or ''
    model.train(is_train)
    
    batches_count = len(data_iter)
    
    with torch.autograd.set_grad_enabled(is_train):
        with tqdm(total=batches_count) as progress_bar:
            for i, batch in enumerate(data_iter):                
                logits, _ = model(batch.source, batch.target)
                
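                # Shift the targets left by one step; the appended row of ones is <pad>
                # (index 1 in torchtext's default vocab), which the criterion ignores.
                target = torch.cat((batch.target[1:], batch.target.new_ones((1, batch.target.shape[1]))))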
                loss = criterion(logits.view(-1, logits.shape[-1]), target.view(-1))

                epoch_loss += loss.item()

                if optimizer:
                    optimizer.zero_grad()
                    loss.backward()
                    nn.utils.clip_grad_norm_(model.parameters(), 1.)
                    optimizer.step()

                progress_bar.update()
                progress_bar.set_description('{:>5s} Loss = {:.5f}, PPX = {:.2f}'.format(name, loss.item(), 
                                                                                         math.exp(loss.item())))
                
            progress_bar.set_description('{:>5s} Loss = {:.5f}, PPX = {:.2f}'.format(
                name, epoch_loss / batches_count, math.exp(epoch_loss / batches_count))
            )
            progress_bar.refresh()

    return epoch_loss / batches_count


def fit(model, criterion, optimizer, train_iter, epochs_count=1, val_iter=None):
    best_val_loss = None
    for epoch in range(epochs_count):
        name_prefix = '[{} / {}] '.format(epoch + 1, epochs_count)
        train_loss = do_epoch(model, criterion, train_iter, optimizer, name_prefix + 'Train:')
        
        if val_iter is not None:
            val_loss = do_epoch(model, criterion, val_iter, None, name_prefix + '  Val:')
            print('\nVal BLEU = {:.2f}'.format(evaluate_model(model, val_iter)))

In [None]:
model = TranslationModel(source_vocab_size=len(source_field.vocab), target_vocab_size=len(target_field.vocab)).to(DEVICE)

pad_idx = target_field.vocab.stoi['<pad>']
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx).to(DEVICE)

optimizer = optim.Adam(model.parameters())

fit(model, criterion, optimizer, train_iter, epochs_count=30, val_iter=test_iter)

## Scheduled Sampling

Until now we have trained the translation model with so-called *teacher forcing*: at every step the decoder received the correct previous token as input. The problem with this approach is that at inference time the model will most likely produce a wrong token at some step. The network was trained on good inputs but is used on bad ones, so it can easily go off the rails.

An alternative approach is to sample a token at the current step during training and feed it to the next step.

This approach is not well grounded mathematically (gradients do not pass through the sampling), but it is interesting to implement and often improves quality.


**Task** Update `Decoder`: replace the RNN call over the whole sequence with a loop. At each step, feed the decoder the correct input with probability `p`, and otherwise the argmax of the previous output (the loop should resemble the ones in `greedy_decode` and `evaluate_model`). When feeding the argmax, call `detach` so that gradients do not flow through it. Collect all outputs in a list and `torch.cat` them at the end.

As a result, with `p = 1` everything should work as before, only slower. During training you can pass `p = 0.5`, and at inference `p = 1`.
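
The per-step choice between the correct token and the model's own guess can be sketched without PyTorch. Everything below is illustrative: `predict_fn` is a hypothetical stand-in for one decoder step followed by an argmax (in the real decoder that value would also be detached), and the function only records which tokens were fed in.

```python
import random


def scheduled_sampling_inputs(gold_tokens, predict_fn, p, rng):
    """Run a decoder-style loop and return the tokens that were actually fed in.

    With probability p the gold token is used (teacher forcing); otherwise the
    prediction made at the previous step is fed back.
    """
    fed = []
    prev_prediction = None
    for step, gold in enumerate(gold_tokens):
        # The first input is always the gold <s> token: there is no prediction yet.
        use_gold = step == 0 or rng.random() < p
        current_input = gold if use_gold else prev_prediction
        fed.append(current_input)
        prev_prediction = predict_fn(current_input)
    return fed


def toy_predict(token):
    # Toy "model": always predicts the next integer.
    return (token + 1) % 10


gold = [0, 5, 6, 7]
always_teacher = scheduled_sampling_inputs(gold, toy_predict, p=1.0, rng=random.Random(0))
never_teacher = scheduled_sampling_inputs(gold, toy_predict, p=0.0, rng=random.Random(0))
print(always_teacher)  # [0, 5, 6, 7]
print(never_teacher)   # [0, 1, 2, 3]
```

With `p = 1.0` the loop reproduces teacher forcing exactly, and with `p = 0.0` the decoder sees only its own predictions after the first step.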

## Beam Search

Another way to deal with decoding errors at inference time is beam search. In essence, it is a breadth-first search with very aggressive pruning at each step:
<img src="https://image.ibb.co/dBRKkA/2018-11-06-13-53-40.png" width="50%">
  
*From [cs224n, Machine Translation, Seq2Seq and Attention](http://web.stanford.edu/class/cs224n/lectures/lecture10.pdf)*

In the picture, at each step the two best (according to the network's predictions) of the four possible continuations are kept.

Beams are compared by the sum of the log-probabilities of the tokens they contain. To get log-probabilities, simply apply `F.log_softmax` to the logits. The advantage of adding logarithms over multiplying probabilities should be clear: it avoids numerical underflow, since a product of probabilities close to zero very quickly becomes exactly zero.
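
The underflow is easy to reproduce. The numbers below are made up for illustration: 200 tokens, each with probability 0.01.

```python
import math

token_probs = [0.01] * 200

# Multiplying probabilities: the true value 1e-400 is not representable as a float.
product = 1.0
for prob in token_probs:
    product *= prob

# Adding log-probabilities: the same quantity on a log scale stays well-behaved.
log_sum = sum(math.log(prob) for prob in token_probs)

print(product)  # 0.0
print(log_sum)  # about -921.03
```

Once the product has collapsed to zero, all such beams compare equal, while their log-probability sums remain distinguishable.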

As a result, you need to implement an analogue of `greedy_decode`.

A beam will consist of a sequence of token indices (initially `[bos_index]`), a total score (initially 0), and the last `hidden` (initially `encoder_hidden`).


The interactive visualization is borrowed from https://github.com/yandexdataschool/nlp_course:

In [None]:
!wget https://raw.githubusercontent.com/yandexdataschool/nlp_course/master/resources/beam_search.html 2> log
from IPython.display import HTML
# source: parliament does not support the amendment freeing tymoshenko
HTML('./beam_search.html')

In [None]:
def beam_search_decode(model, source_text, source_field, target_field, beam_size=5):
    bos_index = target_field.vocab.stoi[BOS_TOKEN]
    eos_index = target_field.vocab.stoi[EOS_TOKEN]
    
    model.eval()
    with torch.no_grad():
        encoder_hidden = model.encoder(...source_text...)
        beams = [([bos_index], 0, encoder_hidden)]
        
        # 1. make next step from each beam
        # 2. create new beams from top beam_size of each continuation (best next token variants for the given token)
        # 3. leave only top beam_size beams
        # 4. repeat
            
        return ' '.join(target_field.vocab.itos[ind.squeeze().item()] for ind in result)
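
The four steps listed in the skeleton can be sketched on a toy model, independent of PyTorch. `step_fn`, `toy_step` and the probability table are hypothetical stand-ins for a decoder step followed by `F.log_softmax`; the sketch also uses no length normalization, which real implementations often add.

```python
import math


def beam_search_sketch(step_fn, initial_state, bos_index, eos_index, beam_size=2, max_len=10):
    """Keep the beam_size best partial sequences, scored by summed log-probabilities."""
    beams = [([bos_index], 0.0, initial_state)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score, state in beams:
            log_probs, new_state = step_fn(tokens[-1], state)
            # Only the beam_size best continuations of one beam can survive the pruning.
            order = sorted(range(len(log_probs)), key=log_probs.__getitem__, reverse=True)
            for token in order[:beam_size]:
                candidates.append((tokens + [token], score + log_probs[token], new_state))
        candidates.sort(key=lambda beam: beam[1], reverse=True)
        beams = []
        for beam in candidates[:beam_size]:
            if beam[0][-1] == eos_index:
                finished.append(beam)  # complete hypotheses are not expanded further
            else:
                beams.append(beam)
        if not beams:
            break
    best_tokens, best_score, _ = max(finished + beams, key=lambda beam: beam[1])
    return best_tokens, best_score


# Toy next-token probabilities. Vocabulary: 0 = <s>, 1 = </s>, 2 = "a", 3 = "b"
TOY_PROBS = {
    0: [0.0, 0.0, 0.6, 0.4],
    2: [0.0, 0.4, 0.3, 0.3],
    3: [0.0, 0.9, 0.05, 0.05],
}


def toy_step(prev_token, state):
    row = TOY_PROBS[prev_token]
    return [math.log(p) if p > 0 else float('-inf') for p in row], state


tokens, score = beam_search_sketch(toy_step, None, bos_index=0, eos_index=1, beam_size=2)
greedy_tokens, _ = beam_search_sketch(toy_step, None, bos_index=0, eos_index=1, beam_size=1)
print(tokens, greedy_tokens)  # [0, 3, 1] [0, 2, 1]
```

With `beam_size=1` the search degenerates into greedy decoding and commits to "a" (total probability 0.24), while a beam of two also keeps "b" alive and finds the more probable sequence (0.36).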

## Model improvements


**Task** Try to improve the quality of the model. Try:
- Bidirectional encoder
- Dropout
- Stack moar layers

## Byte-Pair Encoding

We can represent each word by a single index and use rows of the embedding matrix as word embeddings.
We can also treat a word as a sequence of characters and get the word embedding with some function over character embeddings.

Finally, we can use an intermediate representation: a sequence of subwords.

A few years ago, subwords were proposed for machine translation in [Neural Machine Translation of Rare Words with Subword Units](https://arxiv.org/abs/1508.07909). The paper used byte-pair encoding.

In essence, this is a process of repeatedly merging the most frequent pair of symbols of the alphabet into a new super-symbol. Suppose we have a dictionary consisting of the following words:
`'low·', 'lowest·', 'newer·', 'wider·'`
(`·` marks the end of a word)

Then the first new symbol learned may be `r·`, after which `l o` turns into `lo`. Next, `w` is attached to this new symbol: `lo w` $\to$ `low`. And so on.

It is claimed that this way we learn, first, all frequent and short words, and second, all meaningful subwords. For example, the resulting alphabet should contain `ly·` and `tion·`.

A word can then be split into a sequence of subwords and handled the same way as with characters.
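
The merge procedure can be sketched in a few lines of plain Python. This is a simplified illustration, not the `subword_nmt` implementation: the function names are hypothetical, the word frequencies are invented, and ties between equally frequent pairs are broken by whichever pair was seen first.

```python
from collections import Counter

END = '\u00b7'  # the end-of-word marker


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the fused symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_sketch(word_freqs, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent symbol pair."""
    # Every word starts as a tuple of characters followed by the end-of-word marker.
    vocab = {tuple(word) + (END,): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best_pair = max(pair_counts, key=pair_counts.get)
        merges.append(best_pair)
        vocab = {merge_pair(symbols, best_pair): freq for symbols, freq in vocab.items()}
    return merges


def apply_bpe_sketch(word, merges):
    """Split a word into subwords by replaying the learned merges in order."""
    symbols = tuple(word) + (END,)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


word_freqs = {'low': 5, 'lowest': 2, 'newer': 6, 'wider': 3}  # hypothetical frequencies
merges = learn_bpe_sketch(word_freqs, num_merges=4)
pieces = apply_bpe_sketch('lower', merges)
print(merges)
print(pieces)
```

Note that `lower` never appears in the training dictionary, yet it is segmented into two known subwords. This is the property that makes subwords attractive for rare words.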

Pre-trained embeddings can be found here: [BPEmb](https://github.com/bheinzerling/bpemb).

We will train our own BPE model instead:

In [None]:
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

with open('data.en', 'w') as f_src, open('data.ru', 'w') as f_dst:
    for example in examples:
        f_src.write(' '.join(example.source) + '\n')
        f_dst.write(' '.join(example.target) + '\n')

bpe = {}
for lang in ['en', 'ru']:
    with open('./data.' + lang) as f_data, open('bpe_rules.' + lang, 'w') as f_rules:
        learn_bpe(f_data, f_rules, num_symbols=3000)
    with open('bpe_rules.' + lang) as f_rules:
        bpe[lang] = BPE(f_rules)

In [None]:
bpe['en'].process_line(' '.join(examples[10000].source))

In [None]:
bpe['ru'].process_line(' '.join(examples[10000].target))

**Task** Retrain the model with subwords instead of words. You may also want to change their number (`num_symbols`).

# Image Captioning

The encoder does not have to process a sequence of words. For example, you can use a convolutional network as an image encoder and generate a caption for the image:

<img src="https://image.ibb.co/fpYdkL/image-captioning.png" width="50%">

*From [Image Captioning Tutorial](https://github.com/yunjey/pytorch-tutorial/tree/master/tutorials/03-advanced/image_captioning)*

The resulting captions are very cool: [https://cs.stanford.edu/people/karpathy/deepimagesent/](http://cs.stanford.edu/people/karpathy/deepimagesent/).

Let's download the training data:

In [None]:
# Install the PyDrive wrapper & import libraries.
# This only needs to be done once per notebook.
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
# This only needs to be done once per notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

downloaded = drive.CreateFile({'id': '13BP-6Xd6ymhGallRppYfJBO6UUjFCtbB'})
downloaded.GetContentFile('image_codes.npy')

downloaded = drive.CreateFile({'id': '1O7_3lyTyBMXsBBIt1PwUXwLdkyRQzZML'})
downloaded.GetContentFile('sources.txt')

downloaded = drive.CreateFile({'id': '1t-Dy8TzoRuTMoM7N9NJZKgWXfaw3b6KF'})
downloaded.GetContentFile('texts.txt')

!wget http://nlp.cs.illinois.edu/HockenmaierGroup/Framing_Image_Description/Flickr8k_Dataset.zip
!unzip Flickr8k_Dataset.zip

Let's download a pretrained model:

In [None]:
from torchvision.models.inception import Inception3

class BeheadedInception3(Inception3):
    """ Like torchvision.models.inception.Inception3 but the head goes separately """
    
    def forward(self, x):
        x = x.clone()
        x[:, 0] = x[:, 0] * (0.229 / 0.5) + (0.485 - 0.5) / 0.5
        x[:, 1] = x[:, 1] * (0.224 / 0.5) + (0.456 - 0.5) / 0.5
        x[:, 2] = x[:, 2] * (0.225 / 0.5) + (0.406 - 0.5) / 0.5
        x = self.Conv2d_1a_3x3(x)
        x = self.Conv2d_2a_3x3(x)
        x = self.Conv2d_2b_3x3(x)
        x = F.max_pool2d(x, kernel_size=3, stride=2)
        x = self.Conv2d_3b_1x1(x)
        x = self.Conv2d_4a_3x3(x)
        x = F.max_pool2d(x, kernel_size=3, stride=2)
        x = self.Mixed_5b(x)
        x = self.Mixed_5c(x)
        x = self.Mixed_5d(x)
        x = self.Mixed_6a(x)
        x = self.Mixed_6b(x)
        x = self.Mixed_6c(x)
        x = self.Mixed_6d(x)
        x = self.Mixed_6e(x)
        x = self.Mixed_7a(x)
        x = self.Mixed_7b(x)
        x_for_attn = x = self.Mixed_7c(x)
        # 8 x 8 x 2048
        x = F.avg_pool2d(x, kernel_size=8)
        # 1 x 1 x 2048
        x_for_capt = x = x.view(x.size(0), -1)
        # 2048
        x = self.fc(x)
        # 1000 (num_classes)
        return x_for_attn, x_for_capt, x

In [None]:
from torch.utils.model_zoo import load_url

inception_model = BeheadedInception3()

inception_url = 'https://download.pytorch.org/models/inception_v3_google-1a9a5a14.pth'
inception_model.load_state_dict(load_url(inception_url))

inception_model.eval()

Why does this work at all? Let's run the model on an image:

In [None]:
from matplotlib import pyplot as plt
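from PIL import Image


def imresize(img, size):
    # scipy.misc.imresize was removed in SciPy 1.3; this is a small PIL-based stand-in.
    # `size` is (height, width), while PIL expects (width, height).
    return np.array(Image.fromarray(img).resize(size[::-1], Image.BILINEAR))
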
%matplotlib inline
    
img = plt.imread('Flicker8k_Dataset/1000268201_693b08cb0e.jpg')
img = imresize(img, (299, 299)).astype('float32') / 255.
plt.imshow(img)

In [None]:
import requests
LABELS_URL = 'https://s3.amazonaws.com/outcome-blog/imagenet/labels.json'
labels = {int(key): value for (key, value) in requests.get(LABELS_URL).json().items()}

with torch.no_grad():
    img_tensor = torch.tensor(img.transpose([2, 0, 1]), dtype=torch.float32).unsqueeze(0)
    _, _, logits = inception_model(img_tensor)
    _, top_classes = logits.topk(5)

    print('; '.join(labels[ind.item()] for ind in top_classes.squeeze()))

It outputs these classes.

The captions for the image, in turn, are:

In [None]:
with open('texts.txt') as f:
    text = f.readline().strip().split('\t')
print('\n'.join(text))

Let's load the data:

In [None]:
source_field = Field(sequential=False, use_vocab=False, dtype=torch.float)
target_field = Field(init_token=BOS_TOKEN, eos_token=EOS_TOKEN)
path_field = Field(sequential=False, use_vocab=True)

fields = [('source', source_field), ('target', target_field), ('path', path_field)]

In [None]:
img_vectors = np.load('image_codes.npy')

examples = []
with open('texts.txt') as f_texts, open('sources.txt') as f_sources:
    for img, texts, source in zip(img_vectors, f_texts, f_sources):
        for text in texts.split('\t'):
            examples.append(Example.fromlist([img, target_field.preprocess(text), source.rstrip()], fields))

In [None]:
dataset = Dataset(examples, fields)

train_dataset, test_dataset = dataset.split(split_ratio=0.85)

print('Train size =', len(train_dataset))
print('Test size =', len(test_dataset))

target_field.build_vocab(train_dataset, min_freq=2)
path_field.build_vocab(dataset)
print('Target vocab size =', len(target_field.vocab))

train_iter, test_iter = BucketIterator.splits(
    datasets=(train_dataset, test_dataset), batch_sizes=(32, 512), shuffle=True, device=DEVICE, sort=False
)

**Task** Implement the decoder for the model:

In [None]:
class Decoder(nn.Module):
    def __init__(self, vocab_size, cnn_feature_size, emb_dim=128, rnn_hidden_dim=256, num_layers=1):
        super().__init__()

        self._emb = nn.Embedding(vocab_size, emb_dim)
        self._cnn_to_h0 = nn.Linear(cnn_feature_size, rnn_hidden_dim)
        self._cnn_to_c0 = nn.Linear(cnn_feature_size, rnn_hidden_dim)
        self._rnn = nn.LSTM(input_size=emb_dim, hidden_size=rnn_hidden_dim, num_layers=num_layers)
        self._out = nn.Linear(rnn_hidden_dim, vocab_size)

    def forward(self, encoder_output, inputs, hidden=None):
        ...
    
    def init_hidden(self, encoder_output):
        encoder_output = encoder_output.unsqueeze(0)
        return self._cnn_to_h0(encoder_output), self._cnn_to_c0(encoder_output)

A hack to make everything work with the old training loop:

In [None]:
def evaluate_model(model, iterator):
    return 0.

In [None]:
model = Decoder(vocab_size=len(target_field.vocab), cnn_feature_size=img_vectors.shape[1]).to(DEVICE)

pad_idx = target_field.vocab.stoi['<pad>']
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx).to(DEVICE)

optimizer = optim.Adam(model.parameters())

fit(model, criterion, optimizer, train_iter, epochs_count=30, val_iter=test_iter)

Let's check that generation works:

In [None]:
batch = next(iter(test_iter))

img = path_field.vocab.itos[batch.path[0].item()]

img = plt.imread('Flicker8k_Dataset/' + img)
img = imresize(img, (299, 299)).astype('float32') / 255.
plt.imshow(img)

**Task** Write the generation loop for the model:

# References

Sequence to Sequence Learning with Neural Networks, Ilya Sutskever, et al, 2014 [[pdf]](https://arxiv.org/pdf/1409.3215.pdf)  
Show and Tell: A Neural Image Caption Generator, Oriol Vinyals et al, 2014 [[arxiv]](https://arxiv.org/abs/1411.4555)  
Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks, Samy Bengio, et al, 2015 [[arxiv]](https://arxiv.org/abs/1506.03099)  
Neural Machine Translation of Rare Words with Subword Units, Rico Sennrich, et al, 2015 [[arxiv]](https://arxiv.org/abs/1508.07909)  
Massive Exploration of Neural Machine Translation Architectures, Denny Britz, et al, 2017 [[pdf]](https://arxiv.org/pdf/1703.03906.pdf)

Neural Machine Translation (seq2seq) Tutorial [tensorflow/nmt](https://github.com/tensorflow/nmt)  
[A Word of Caution on Scheduled Sampling for Training RNNs](https://www.inference.vc/scheduled-sampling-for-rnns-scoring-rule-interpretation/)

cs224n, [Machine Translation, Seq2Seq and Attention](https://www.youtube.com/watch?v=IxQtK2SjWWM)

[The Annotated Encoder Decoder](https://bastings.github.io/annotated_encoder_decoder/)  
[Seq2Seq-Vis: A Visual Debugging Tool for Sequence-to-Sequence Models](http://seq2seq-vis.io)