In [None]:
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


if torch.cuda.is_available():
    from torch.cuda import FloatTensor, LongTensor
else:
    from torch import FloatTensor, LongTensor

np.random.seed(42)

# RNN, part 2

## POS Tagging

We have already looked at the use of recurrent networks for classification.

<img src="http://karpathy.github.io/assets/rnn/diags.jpeg">


*From [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)*

Let's move on to another option - sequence labeling (the last picture).

The most popular examples for such a problem setting are Part-of-Speech Tagging and Named Entity Recognition.

We will now solve POS Tagging for English.

We will work with the following tags:
- ADJ - adjective (new, good, high, ...)
- ADP - adposition (on, of, at, ...)
- ADV - adverb (really, already, still, ...)
- CONJ - conjunction (and, or, but, ...)
- DET - determiner, article (the, a, some, ...)
- NOUN - noun (year, home, costs, ...)
- NUM - numeral (twenty-four, fourth, 1991, ...)
- PRT - particle (at, on, out, ...)
- PRON - pronoun (he, their, her, ...)
- VERB - verb (is, say, told, ...)
- . - punctuation marks (. , ;)
- X - other (ersatz, esprit, dunno, ...)

In [None]:
import nltk
from sklearn.model_selection import train_test_split  # `sklearn.cross_validation` was removed in newer scikit-learn

nltk.download('brown')
nltk.download('universal_tagset')

data = nltk.corpus.brown.tagged_sents(tagset='universal')

In [None]:
for word, tag in data[0]:
    print('{:15}\t{}'.format(word, tag))

Split the data into train / val / test sets, as is standard practice.

We will train on the train set, use the val set to select hyperparameters and for things like early stopping, and use the test set to measure the final quality of the model.

In [None]:
train_data, test_data = train_test_split(data, test_size=0.25, random_state=42)
train_data, val_data = train_test_split(train_data, test_size=0.15, random_state=42)

print('Words count in train set:', sum(len(sent) for sent in train_data))
print('Words count in val set:', sum(len(sent) for sent in val_data))
print('Words count in test set:', sum(len(sent) for sent in test_data))

Construct mappings from words to an index and from a tag to an index:

In [None]:
words = {word for sample in train_data for word, tag in sample}
word2ind = {word: ind + 1 for ind, word in enumerate(words)}
word2ind['<pad>'] = 0

tags = {tag for sample in train_data for word, tag in sample}
tag2ind = {tag: ind + 1 for ind, tag in enumerate(tags)}
tag2ind['<pad>'] = 0

print('Unique words in train = {}. Tags = {}'.format(len(word2ind), tags))

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

from collections import Counter

tag_distribution = Counter(tag for sample in train_data for _, tag in sample)
tag_distribution = [tag_distribution[tag] for tag in tags]

plt.figure(figsize=(10, 5))

bar_width = 0.35
plt.bar(np.arange(len(tags)), tag_distribution, bar_width, align='center', alpha=0.5)
plt.xticks(np.arange(len(tags)), tags)
    
plt.show()

What is the easiest tagger you can think of? Let's just memorize which tags are most likely for a word (or for a sequence):

<img src="https://www.nltk.org/images/tag-context.png">

*From [Categorizing and Tagging Words, nltk](https://www.nltk.org/book/ch05.html)*

The picture shows that the two previously predicted tags and the current word are used to predict $t_n$. The probability $P(t_n | w_n, t_{n-1}, t_{n-2})$ is estimated from the corpus, and the tag with the maximum probability is selected.

Hidden Markov Models implement this idea more rigorously: the probabilities $P(w_n | t_n)$ and $P(t_n | t_{n-1}, t_{n-2})$ are estimated from the training corpus, and their product is maximized.
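
Written out for a whole sentence of $N$ words, the trigram HMM objective can be sketched as follows. The start-of-sentence tags $t_0$ and $t_{-1}$ are an assumption made here to keep the product well defined for the first two words:

$$
\hat t_{1:N} = \arg\max_{t_1, \ldots, t_N} \prod_{n=1}^{N} P(w_n | t_n) \, P(t_n | t_{n-1}, t_{n-2})
$$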

The simplest option is a unigram model that takes into account only the word:

In [None]:
import nltk

default_tagger = nltk.DefaultTagger('NOUN')  # 'NN' is not in the universal tagset; 'NOUN' is its counterpart

unigram_tagger = nltk.UnigramTagger(train_data, backoff=default_tagger)
print('Accuracy of unigram tagger = {:.2%}'.format(unigram_tagger.evaluate(test_data)))

Add the previous tag as context:

In [None]:
bigram_tagger = nltk.BigramTagger(train_data, backoff=unigram_tagger)
print('Accuracy of bigram tagger = {:.2%}'.format(bigram_tagger.evaluate(test_data)))

Note that the backoff is important:

In [None]:
trigram_tagger = nltk.TrigramTagger(train_data)
print('Accuracy of trigram tagger = {:.2%}'.format(trigram_tagger.evaluate(test_data)))

## Increasing context with recurrent networks

The unigram model works surprisingly well, but we are here to train neural networks.

Homonymy is the main weakness of the unigram model:
*“He cashed a check at the **bank**”*
vs
*“He sat on the **bank** of the river”*

Therefore, it is very useful to take the context into account when predicting the tag.

Let's use an LSTM, which handles context very well:

<img src="https://image.ibb.co/kgmoff/Baseline-Tagger.png" width="50%">

The blue block extracts features from the word, the orange LSTM builds context-aware word embeddings, and the green logistic regression then predicts the tags.
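
The three blocks of the diagram map naturally onto three PyTorch layers. Below is a minimal sketch of one possible baseline, not the reference solution: the class name `BaselineTaggerSketch` and the toy sizes are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class BaselineTaggerSketch(nn.Module):
    """Embedding -> LSTM -> Linear, mirroring the blue / orange / green blocks."""

    def __init__(self, vocab_size, tagset_size, word_emb_dim=100, lstm_hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, word_emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(word_emb_dim, lstm_hidden_dim)
        self.out = nn.Linear(lstm_hidden_dim, tagset_size)

    def forward(self, inputs):
        # inputs: (seq_len, batch_size) tensor of word indices
        embedded = self.embedding(inputs)    # (seq_len, batch_size, word_emb_dim)
        contextual, _ = self.lstm(embedded)  # (seq_len, batch_size, lstm_hidden_dim)
        return self.out(contextual)          # (seq_len, batch_size, tagset_size)


# Toy check: 7 tokens per sentence, 4 sentences, 50 words, 13 tags (assumed sizes).
toy_batch = torch.randint(1, 50, (7, 4))
toy_logits = BaselineTaggerSketch(vocab_size=50, tagset_size=13)(toy_batch)
print(toy_logits.shape)
```

The model returns one vector of tag scores per token, so the output keeps the `(seq_len, batch_size)` layout of the input batch and adds a tag dimension.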

In [None]:
def convert_data(data, word2ind, tag2ind):
    X = [[word2ind.get(word, 0) for word, _ in sample] for sample in data]
    y = [[tag2ind[tag] for _, tag in sample] for sample in data]
    
    return X, y

X_train, y_train = convert_data(train_data, word2ind, tag2ind)
X_val, y_val = convert_data(val_data, word2ind, tag2ind)
X_test, y_test = convert_data(test_data, word2ind, tag2ind)

In [None]:
def iterate_batches(data, batch_size):
    X, y = data
    n_samples = len(X)

    indices = np.arange(n_samples)
    np.random.shuffle(indices)
    
    for start in range(0, n_samples, batch_size):
        end = min(start + batch_size, n_samples)
        
        batch_indices = indices[start:end]
        
        max_sent_len = max(len(X[ind]) for ind in batch_indices)
        X_batch = np.zeros((max_sent_len, len(batch_indices)))
        y_batch = np.zeros((max_sent_len, len(batch_indices)))
        
        for batch_ind, sample_ind in enumerate(batch_indices):
            X_batch[:len(X[sample_ind]), batch_ind] = X[sample_ind]
            y_batch[:len(y[sample_ind]), batch_ind] = y[sample_ind]
            
        yield X_batch, y_batch

In [None]:
X_batch, y_batch = next(iterate_batches((X_train, y_train), 4))

X_batch, y_batch

**Task** Implement `LSTMTagger`:

In [None]:
class LSTMTagger(nn.Module):
    def __init__(self, vocab_size, tagset_size, word_emb_dim=100, lstm_hidden_dim=128, lstm_layers_count=1):
        super().__init__()
        
        <create layers>

    def forward(self, inputs):
        <apply them>

**Task** Learn how to compute accuracy and loss (and check that the model works at the same time).

In [None]:
model = LSTMTagger(
    vocab_size=len(word2ind),
    tagset_size=len(tag2ind)
)

X_batch, y_batch = torch.LongTensor(X_batch), torch.LongTensor(y_batch)

logits = model(X_batch)

<calc accuracy>

In [None]:
criterion = nn.CrossEntropyLoss()
<calc loss>

**Task** Insert this calculation in the function:

In [None]:
import math
from tqdm import tqdm


def do_epoch(model, criterion, data, batch_size, optimizer=None, name=None):
    epoch_loss = 0
    correct_count = 0
    sum_count = 0
    
    is_train = optimizer is not None
    name = name or ''
    model.train(is_train)
    
    batches_count = math.ceil(len(data[0]) / batch_size)
    
    with torch.autograd.set_grad_enabled(is_train):
        with tqdm(total=batches_count) as progress_bar:
            for i, (X_batch, y_batch) in enumerate(iterate_batches(data, batch_size)):
                X_batch, y_batch = LongTensor(X_batch), LongTensor(y_batch)
                logits = model(X_batch)

                loss = <calc loss>

                epoch_loss += loss.item()

                if optimizer:
                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()

                cur_correct_count, cur_sum_count = <calc accuracy>

                correct_count += cur_correct_count
                sum_count += cur_sum_count

                progress_bar.update()
                progress_bar.set_description('{:>5s} Loss = {:.5f}, Accuracy = {:.2%}'.format(
                    name, loss.item(), cur_correct_count / cur_sum_count)
                )
                
            progress_bar.set_description('{:>5s} Loss = {:.5f}, Accuracy = {:.2%}'.format(
                name, epoch_loss / batches_count, correct_count / sum_count)
            )

    return epoch_loss / batches_count, correct_count / sum_count


def fit(model, criterion, optimizer, train_data, epochs_count=1, batch_size=32,
        val_data=None, val_batch_size=None):
        
    if val_data is not None and val_batch_size is None:
        val_batch_size = batch_size
        
    for epoch in range(epochs_count):
        name_prefix = '[{} / {}] '.format(epoch + 1, epochs_count)
        train_loss, train_acc = do_epoch(model, criterion, train_data, batch_size, optimizer, name_prefix + 'Train:')
        
        if val_data is not None:
            val_loss, val_acc = do_epoch(model, criterion, val_data, val_batch_size, None, name_prefix + '  Val:')

In [None]:
model = LSTMTagger(
    vocab_size=len(word2ind),
    tagset_size=len(tag2ind)
).cuda()

criterion = nn.CrossEntropyLoss().cuda()
optimizer = optim.Adam(model.parameters())

fit(model, criterion, optimizer, train_data=(X_train, y_train), epochs_count=50,
    batch_size=64, val_data=(X_val, y_val), val_batch_size=512)

### Masking

**Task** Check yourself: are you counting loss and accuracy on padding positions? It is very easy to get inflated quality because of this.

The loss function has an `ignore_index` parameter for this purpose. For accuracy, use masking: multiply by a mask of zeros and ones, where zeros mark padding positions, and then average over the non-zero positions of the mask.
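
A minimal sketch of both ideas on a toy batch. The helper name `masked_accuracy_counts` and the toy tensors are assumptions made for illustration; padding is assumed to have index 0, as in `tag2ind`.

```python
import torch
import torch.nn as nn


def masked_accuracy_counts(logits, targets, pad_index=0):
    """Return (correct, total) counted over non-padding positions only."""
    mask = (targets != pad_index).float()
    predictions = logits.argmax(dim=-1)
    correct = ((predictions == targets).float() * mask).sum().item()
    return correct, mask.sum().item()


# Toy batch: seq_len=3, batch_size=2, 4 classes; the second sentence has two padded positions.
targets = torch.tensor([[1, 2], [3, 0], [2, 0]])
logits = torch.zeros(3, 2, 4)
logits[0, 0, 1] = 5.0  # predicts 1, target is 1
logits[1, 0, 3] = 5.0  # predicts 3, target is 3
logits[2, 0, 1] = 5.0  # predicts 1, target is 2 (an error)
logits[0, 1, 2] = 5.0  # predicts 2, target is 2

correct, total = masked_accuracy_counts(logits, targets)
print(correct / total)

# ignore_index makes the loss skip padding positions in the same way.
criterion = nn.CrossEntropyLoss(ignore_index=0)
loss = criterion(logits.view(-1, logits.shape[-1]), targets.view(-1))
```

Only the four real tokens are counted, so the accuracy here is 3 out of 4 regardless of what the model predicts on the padded positions.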

**Task** Compute the quality of the model on the test set.

### Bidirectional LSTM

Thanks to BiLSTM, you can use both contexts at once when predicting the tag of a word. That is, for each token $w_i$, the forward LSTM produces the representation $\mathbf{f_i} \sim (w_1, \ldots, w_i)$, built over the entire left context, and the backward LSTM produces $\mathbf{b_i} \sim (w_n, \ldots, w_i)$, the representation of the right context. Their concatenation captures the entire available context of the word: $\mathbf{h_i} = [\mathbf{f_i}, \mathbf{b_i}] \sim (w_1, \ldots, w_n)$.

<img src="https://www.researchgate.net/profile/Wang_Ling/publication/280912217/figure/fig2/AS:391505383575555@1470353565299/Illustration-of-our-neural-network-for-POS-tagging.png" width="50%">

*From [Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation](https://arxiv.org/abs/1508.02096)*
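
In PyTorch the concatenation $[\mathbf{f_i}, \mathbf{b_i}]$ is what `nn.LSTM` returns when `bidirectional=True`. The sketch below only illustrates the shapes; the sizes and the 13-way classifier are toy assumptions.

```python
import torch
import torch.nn as nn

emb_dim, hidden_dim = 8, 16
bilstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True)

embedded = torch.randn(5, 3, emb_dim)        # (seq_len, batch_size, emb_dim)
outputs, _ = bilstm(embedded)                # (seq_len, batch_size, 2 * hidden_dim)

forward_states = outputs[..., :hidden_dim]   # f_i: left-to-right states
backward_states = outputs[..., hidden_dim:]  # b_i: right-to-left states

# The classifier on top must therefore accept 2 * hidden_dim features.
classifier = nn.Linear(2 * hidden_dim, 13)
bi_logits = classifier(outputs)
print(outputs.shape, bi_logits.shape)
```

The only change needed in the layers above the LSTM is the doubled input size.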

**Task** Add Bidirectional LSTM.

### Pre-trained embeddings

We already know how useful pre-trained embeddings are. With a training set of the current size it was still possible to learn them from scratch; with a smaller one the results would be much worse.

Therefore, the standard pipeline is to download pre-trained embeddings and plug them into the network. Run this:

In [None]:
import gensim.downloader as api

w2v_model = api.load('glove-wiki-gigaword-100')


Let's build a submatrix for words from our training sample:

In [None]:
known_count = 0
embeddings = np.zeros((len(word2ind), w2v_model.vectors.shape[1]))
for word, ind in word2ind.items():
    word = word.lower()
    if word in w2v_model:  # membership test works in gensim 3 and 4 (`.vocab` was removed in gensim 4)
        embeddings[ind] = w2v_model.get_vector(word)
        known_count += 1
        
print('Know {} out of {} word embeddings'.format(known_count, len(word2ind)))

**Task** Make a model with a pre-trained matrix. Use `nn.Embedding.from_pretrained`.
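
A short sketch of how `nn.Embedding.from_pretrained` behaves, using a random toy matrix as a stand-in for the `embeddings` matrix built above (the toy names and sizes are assumptions):

```python
import numpy as np
import torch
import torch.nn as nn

toy_matrix = np.random.rand(6, 4)                      # 6 "words", 4-dimensional vectors
weights = torch.tensor(toy_matrix, dtype=torch.float32)  # the layer expects a float tensor

frozen = nn.Embedding.from_pretrained(weights)               # freeze=True by default
trainable = nn.Embedding.from_pretrained(weights, freeze=False)

word_ids = torch.tensor([[1, 2], [3, 0]])              # (seq_len, batch_size)
vectors = frozen(word_ids)                             # (seq_len, batch_size, 4)
print(vectors.shape, frozen.weight.requires_grad, trainable.weight.requires_grad)
```

With the default `freeze=True` the vectors receive no gradient updates, so the optimizer leaves them untouched.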

In [None]:
class LSTMTaggerWithPretrainedEmbs(nn.Module):
    def __init__(self, embeddings, tagset_size, lstm_hidden_dim=64, lstm_layers_count=1):
        super().__init__()
        
        <create me>

    def forward(self, inputs):
        <use me>

In [None]:
model = LSTMTaggerWithPretrainedEmbs(
    embeddings=embeddings,
    tagset_size=len(tag2ind)
).cuda()

criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(model.parameters())

fit(model, criterion, optimizer, train_data=(X_train, y_train), epochs_count=50,
    batch_size=64, val_data=(X_val, y_val), val_batch_size=512)

**Task** Evaluate the quality of the model on the test set. Note that you do not have to limit yourself to vectors from the trimmed matrix: the test set may well contain words that did not occur in train but for which pre-trained embeddings exist.

In [None]:
<calc test accuracy>

### Fine-tuning pre-trained vectors

**Task** Why not try fine-tuning the vectors? To do this, simply pass `freeze=False` to the `from_pretrained` method. Try it.

**Task** In fact, it is clear why this is bad: after fine-tuning, the original pre-trained vectors of words that did not occur in train can no longer be used. Check what quality you get on the test set with the original vectors.

To deal with this, you can use the following technique: impose $l_2$ regularization on the pre-trained vectors so that they do not drift far from the original vectors, and for words whose embeddings are unknown, create random vectors and train them as usual.

You can read a little about it here: [Pseudo-rehearsal: A simple solution to catastrophic forgetting for NLP](https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting) or in the Goldberg book.

**Task** Try to implement it.
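
One way the regularization idea could be sketched is shown below. This is an illustration under stated assumptions, not the method from the linked post: the class name `AnchoredEmbedding`, the `known_mask` argument, and the toy tensors are all hypothetical.

```python
import torch
import torch.nn as nn


class AnchoredEmbedding(nn.Module):
    """Trainable embeddings with an L2 penalty that pulls known rows toward their original values."""

    def __init__(self, pretrained, known_mask):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)
        # Buffers follow the model across devices but are not updated by the optimizer.
        self.register_buffer('original', pretrained.clone())
        self.register_buffer('known_mask', known_mask.float().unsqueeze(1))

    def forward(self, inputs):
        return self.embedding(inputs)

    def penalty(self):
        # Only rows that had a pre-trained vector are penalized; the rest train freely.
        diff = (self.embedding.weight - self.original) * self.known_mask
        return diff.pow(2).sum()


pretrained = torch.randn(5, 4)                                # random rows stand in for real vectors
known_mask = torch.tensor([False, True, True, False, True])   # which rows are pre-trained
emb = AnchoredEmbedding(pretrained, known_mask)
print(emb.penalty().item())  # zero before any training step
```

During training the penalty would be added to the tagging loss with some weight, for example `loss = criterion(...) + reg_weight * emb.penalty()`, where `reg_weight` is a hyperparameter to tune.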

## We need to go deeper: character-level networks



Recall that in the last lesson we built an LSTM network that processed character sequences and predicted which language a word belongs to.

The LSTM acted as a feature extractor that works with character sequences of arbitrary length (well, almost arbitrary: we were limited by the maximum word length). A batch for the network had shape `(max_word_len, batch_size)`.

Now we want to use the same idea to extract features from a sequence of characters, because the characters of a word should be useful for predicting its part of speech, right?

The network will have to learn, for example, that `-ly` often indicates an adverb and `-tion` a noun.

<img src="https://image.ibb.co/kzbh6L/Char-Bi-LSTM.png" width="50%">

The rest of the network will be the same.

Find a cutoff for the word length:

In [None]:
from collections import Counter 
    
def find_max_len(counter, threshold):
    sum_count = sum(counter.values())
    cum_count = 0
    for i in range(max(counter)):
        cum_count += counter[i]
        if cum_count > sum_count * threshold:
            return i
    return max(counter)

word_len_counter = Counter()
for sent in data:
    for word, _ in sent:
        word_len_counter[len(word)] += 1
    
threshold = 0.99
MAX_WORD_LEN = find_max_len(word_len_counter, threshold)

print('Max word len for {:.0%} of words is {}'.format(threshold, MAX_WORD_LEN))

Let's build the alphabet:

In [None]:
from string import punctuation

def get_range(first_symb, last_symb):
    return set(chr(c) for c in range(ord(first_symb), ord(last_symb) + 1))

chars = get_range('a', 'z') | get_range('A', 'Z') | get_range('0', '9') | set(punctuation)
char2ind = {c : i + 1 for i, c in enumerate(chars)}
char2ind['<pad>'] = 0

**Task** Convert the data as in the function above, only now each word should be mapped not to a single index but to a sequence of character indices.

Truncate words to `MAX_WORD_LEN`.
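
The per-word part of this conversion might look like the sketch below. The helper name `word_to_char_ids` and the toy dictionaries are assumptions; mapping unknown characters to index 0 (the same as padding) is also an assumption, and a dedicated unknown-character index is a reasonable alternative.

```python
def word_to_char_ids(word, char2ind, max_word_len, unk_index=0):
    """Map a word to at most `max_word_len` character indices."""
    return [char2ind.get(ch, unk_index) for ch in word[:max_word_len]]


toy_char2ind = {'<pad>': 0, 'c': 1, 'a': 2, 't': 3, 's': 4}
toy_tag2ind = {'<pad>': 0, 'NOUN': 1, 'VERB': 2}
toy_sentence = [('cats', 'NOUN'), ('sat', 'VERB')]

# One list of character indices per word, one tag index per word.
X_toy = [word_to_char_ids(word, toy_char2ind, max_word_len=3) for word, _ in toy_sentence]
y_toy = [toy_tag2ind[tag] for _, tag in toy_sentence]
print(X_toy, y_toy)  # [[1, 2, 3], [4, 2, 3]] [1, 2]
```

Each sentence becomes a list of lists, which is the layout the character-level batch generator expects.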

In [None]:
def convert_data(data, char2ind, tag2ind):
    X, y = <calc it>
    return X, y
  
X_train, y_train = convert_data(train_data, char2ind, tag2ind)
X_val, y_val = convert_data(val_data, char2ind, tag2ind)
X_test, y_test = convert_data(test_data, char2ind, tag2ind)

Let's write a batch generator:

In [None]:
def iterate_batches(data, batch_size):
    X, y = data
    n_samples = len(X)

    indices = np.arange(n_samples)
    np.random.shuffle(indices)
    
    for start in range(0, n_samples, batch_size):
        end = min(start + batch_size, n_samples)
        
        batch_indices = indices[start: end]
        
        sent_len = max(len(X[ind]) for ind in batch_indices)
        word_len = max(len(word) for ind in batch_indices for word in X[ind])
            
        X_batch = np.zeros((sent_len, len(batch_indices), word_len))
        y_batch = np.zeros((sent_len, len(batch_indices)))
        
        for batch_ind, sample_ind in enumerate(batch_indices):
            for word_ind, word in enumerate(X[sample_ind]):
                X_batch[word_ind, batch_ind, :len(word)] = word
            y_batch[:len(y[sample_ind]), batch_ind] = y[sample_ind]
            
        yield X_batch, y_batch

In [None]:
X_batch, y_batch = next(iterate_batches((X_train, y_train), 4))

X_batch.shape, y_batch.shape

**Task** Implement a network that accepts a batch of shape `(seq_len, batch_size, word_len)` and returns `(seq_len, batch_size, word_emb_dim)`. It can be any function that can handle a sequence of arbitrary length. We have already looked at convolutional and recurrent networks for this kind of task, so try both.
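
The main trick is the reshaping: all words of the batch are flattened into one long list, encoded independently, and then folded back. Below is a sketch of a convolutional variant; the class name `ConvCharEncoderSketch` and the kernel size are assumptions, and max pooling here also sees padded positions, which is a simplification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvCharEncoderSketch(nn.Module):
    def __init__(self, vocab_size, char_emb_dim=24, word_emb_dim=100, kernel_size=3):
        super().__init__()
        self.char_embedding = nn.Embedding(vocab_size, char_emb_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_emb_dim, word_emb_dim, kernel_size, padding=kernel_size // 2)

    def forward(self, inputs):
        # inputs: (seq_len, batch_size, word_len) tensor of character indices
        seq_len, batch_size, word_len = inputs.shape
        flat = inputs.reshape(seq_len * batch_size, word_len)  # every word becomes one row
        embedded = self.char_embedding(flat).transpose(1, 2)   # (N, char_emb_dim, word_len)
        features = F.relu(self.conv(embedded))                 # (N, word_emb_dim, word_len)
        pooled = features.max(dim=2)[0]                        # max over character positions
        return pooled.reshape(seq_len, batch_size, -1)         # (seq_len, batch_size, word_emb_dim)


toy_chars = torch.randint(0, 30, (6, 4, 9))  # 6 tokens, 4 sentences, words of up to 9 characters
toy_word_embs = ConvCharEncoderSketch(vocab_size=30)(toy_chars)
print(toy_word_embs.shape)
```

A recurrent variant would keep the same flatten / unflatten steps and replace the convolution and pooling with an LSTM whose final hidden state is used as the word vector.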

In [None]:
class CharsEmbedding(nn.Module):
    def __init__(self, vocab_size, char_emb_dim=24, word_emb_dim=100):
        super().__init__()
        
        <create Conv or LSTM encoder>
        
    def forward(self, inputs):
        <apply>

**Task** Implement a tagger with character-level embeddings.

In [None]:
class LSTMTagger(nn.Module):
    def __init__(self, char_vocab_size, tagset_size, char_emb_dim=24, 
                 word_emb_dim=128, lstm_hidden_dim=128, lstm_layers_count=1):
        super().__init__()
        
        <create it>

    def forward(self, inputs):
        <apply>

In [None]:
model = LSTMTagger(char_vocab_size=len(char2ind), tagset_size=len(tag2ind)).cuda()

criterion = nn.CrossEntropyLoss(ignore_index=0).cuda()
optimizer = optim.Adam(model.parameters())

fit(model, criterion, optimizer, train_data=(X_train, y_train), epochs_count=20, 
    batch_size=24, val_data=(X_val, y_val), val_batch_size=32)

**Task** Evaluate its quality.

In [None]:
_, test_accuracy = do_epoch(model, criterion, (X_test, y_test), batch_size=32)

### Visualization

**Task** Compute character-level embeddings (using the model trained above) for 1000 random words from `word2ind`.

In [None]:
embeddings, index2word = <calc me>

In [None]:
import bokeh.models as bm, bokeh.plotting as pl
from bokeh.io import output_notebook

from sklearn.manifold import TSNE
from sklearn.preprocessing import scale


def draw_vectors(x, y, radius=10, alpha=0.25, color='blue',
                 width=600, height=400, show=True, **kwargs):
    """ draws an interactive plot for data points with auxiliary info on hover """
    output_notebook()
    
    if isinstance(color, str): 
        color = [color] * len(x)
    data_source = bm.ColumnDataSource({ 'x' : x, 'y' : y, 'color': color, **kwargs })

    fig = pl.figure(active_scroll='wheel_zoom', width=width, height=height)
    fig.scatter('x', 'y', size=radius, color='color', alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, "@" + key) for key in kwargs.keys()]))
    if show: 
        pl.show(fig)
    return fig


def get_tsne_projection(word_vectors):
    tsne = TSNE(n_components=2, verbose=100)
    return scale(tsne.fit_transform(word_vectors))
    
    
def visualize_embeddings(embeddings, token):
    tsne = get_tsne_projection(embeddings)
    draw_vectors(tsne[:, 0], tsne[:, 1], token=token)
    

visualize_embeddings(embeddings, index2word)

**Task** Compute embeddings for all words from train and for several random test words that do not occur in train, and find their nearest neighbors in the space of character-level embeddings.

### Word embeddings

**Task** Character-level embeddings alone may not be enough. Add word embeddings as well. Words should be converted to lower case: case-related features should be captured by the character-level LSTM.

These embeddings can simply be concatenated, summed, or combined with a gate (as in an LSTM). For example, from the word embedding $w$, predict $o = \sigma(w)$, which says how reliable it is, and mix it with the character-level embedding in this proportion: $o \odot w + (1 - o) \odot \tilde w$, where $\tilde w$ is the embedding of the word obtained at the character level. Try the different options.
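
The gated variant could be sketched as follows. The linear layer inside the sigmoid is an assumption (the text only writes $o = \sigma(w)$), and the class name `GatedCombination` is hypothetical; both embeddings are assumed to have the same dimension.

```python
import torch
import torch.nn as nn


class GatedCombination(nn.Module):
    """Mix word-level and character-level embeddings with a learned gate."""

    def __init__(self, emb_dim):
        super().__init__()
        self.gate = nn.Linear(emb_dim, emb_dim)

    def forward(self, word_emb, char_emb):
        o = torch.sigmoid(self.gate(word_emb))    # per-dimension trust in the word embedding
        return o * word_emb + (1 - o) * char_emb  # o ⊙ w + (1 - o) ⊙ w~


word_emb = torch.randn(5, 3, 16)  # (seq_len, batch_size, emb_dim)
char_emb = torch.randn(5, 3, 16)
combined = GatedCombination(16)(word_emb, char_emb)
print(combined.shape)
```

Because the gate values lie between 0 and 1, each output coordinate is a convex combination of the two inputs.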

### Connecting word embeddings and character-level embeddings

With word embeddings, we build a mapping from a word to an index. As a result, the input batch is quite small, which is good for training (faster transfer to the GPU). With character-level embeddings this is a problem, but it can be fixed.

For each word in `word2ind`, compute its sequence of character indices. This gives a matrix, which can be moved to the GPU together with the model. Then a batch only needs to contain word indices: use it to do a lookup in the matrix (with `F.embedding`) and get a three-dimensional tensor of character indices.

The advantage is that a single batch gives you both the word embeddings and the character-level embeddings. It is convenient and efficient.

Another idea: once the model is trained, you can precompute the character-level embeddings of words, since a lookup in an embedding table is much cheaper than running a convolutional or recurrent network over characters. This is how embeddings are obtained in [FastText](https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md), for example: they are also initially computed at the character (n-gram) level.
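
A toy sketch of this lookup is given below. The dictionaries and the word length limit are illustrative assumptions; in the notebook the matrix would be built from `word2ind`, `char2ind` and `MAX_WORD_LEN`.

```python
import torch
import torch.nn.functional as F

toy_word2ind = {'<pad>': 0, 'cat': 1, 'sat': 2}
toy_char2ind = {'<pad>': 0, 'c': 1, 'a': 2, 't': 3, 's': 4}
max_word_len = 4

# One row of character indices per word index; rows are zero-padded.
word_chars = torch.zeros(len(toy_word2ind), max_word_len, dtype=torch.long)
for word, ind in toy_word2ind.items():
    if word == '<pad>':
        continue
    chars = [toy_char2ind[c] for c in word[:max_word_len]]
    word_chars[ind, :len(chars)] = torch.tensor(chars)

word_batch = torch.tensor([[1, 2], [2, 0]])       # (seq_len, batch_size) of word indices
char_batch = F.embedding(word_batch, word_chars)  # (seq_len, batch_size, max_word_len)
print(char_batch.shape)
```

Plain tensor indexing, `word_chars[word_batch]`, gives the same result. Either way, only the small matrix of word indices has to be transferred for each batch.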

## Encoder-decoder

You can make the model more complex by adding another recurrent layer. The first layer serves as the sequence encoder, and the second, lighter one decodes the sequence. Decoding means that it takes as input the encoder state for the given token and the previously predicted tag.

Everything will look something like this:

<img src="https://image.ibb.co/jOrfT0/Encoder-Decoder.png" width="50%">


The green block is now an `LSTM` rather than a `Linear` layer, and it receives the hidden state from the previous token (green arrow), the previously predicted tag (dotted arrow), and the state from the BiLSTM, that is, the contextual representation of the word.

This model should be trained with teacher forcing: the correct labels are fed in along the dotted arrows. At prediction time, you need to implement beam search, that is, keep several best paths (tag sequences) for the decoded sequence at once.

**Task** Try your hand at implementing it.

(We will cover this in more detail when we get to machine translation, so you can come back here afterwards :))
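
The beam search part can be sketched independently of the network. In the sketch below, `step_log_probs(position, prefix)` is a hypothetical callback standing in for one decoder step: it returns log-probabilities over tags given the tags chosen so far. The toy model `toy_step` is invented to show a case where greedy decoding is suboptimal.

```python
import math


def beam_search(step_log_probs, seq_len, beam_size=3):
    """Keep the `beam_size` best tag sequences while decoding left to right."""
    beams = [(0.0, [])]  # (cumulative log-probability, tag sequence)
    for position in range(seq_len):
        candidates = []
        for score, prefix in beams:
            for tag, log_prob in enumerate(step_log_probs(position, prefix)):
                candidates.append((score + log_prob, prefix + [tag]))
        candidates.sort(key=lambda item: item[0], reverse=True)
        beams = candidates[:beam_size]
    return beams


def toy_step(position, prefix):
    # Two tags. Tag 1 looks better at the first step, but starting with tag 0
    # makes the second step much more confident.
    if not prefix:
        return [math.log(0.4), math.log(0.6)]
    if prefix[-1] == 0:
        return [math.log(0.9), math.log(0.1)]
    return [math.log(0.5), math.log(0.5)]


best_score, best_tags = beam_search(toy_step, seq_len=2, beam_size=2)[0]
greedy_score, greedy_tags = beam_search(toy_step, seq_len=2, beam_size=1)[0]
print(best_tags, math.exp(best_score))
print(greedy_tags, math.exp(greedy_score))
```

With a beam of size 2 the search recovers the sequence `[0, 0]` with probability 0.36, while a beam of size 1 (greedy decoding) commits to tag 1 at the first step and ends with probability 0.3.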

# References

Speech and Language Processing, Chapter 8, Part-of-speech Tagging. Daniel Jurafsky [pdf](https://web.stanford.edu/~jurafsky/slp3/8.pdf)

Learning Character-level Representations for Part-of-Speech Tagging, dos Santos et al, 2014 [pdf](http://proceedings.mlr.press/v32/santos14.pdf)  
Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation, Wang Ling et al, 2015 [arxiv](https://arxiv.org/abs/1508.02096)  
Bidirectional LSTM-CRF Models for Sequence Tagging, Zhiheng Huang et al, 2015 [arxiv](https://arxiv.org/abs/1508.01991)  
End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF, Xuezhe Ma et al, 2016 [arxiv](https://arxiv.org/abs/1603.01354)  
Improving Part-of-speech Tagging via Multi-task Learning and Character-level Word Representations, Daniil Anastasyev et al, 2018 [pdf](http://www.dialog-21.ru/media/4282/anastasyevdg.pdf) :)  