# ELMO
ELMO stands for Embeddings from Language Models. For details checkout the paper from Allen Institute https://arxiv.org/pdf/1802.05365.pdf

The main idea behind the ELMO is to get contextualized word embeddings. In contrast to regular word2vec, which uses bag-of-words (order of words doesn't matter in this case) it builds dense represetations of the words using recurrent neural network that highly sensitive to the order of words.

To make the representations of the words meaningful the network is optimised to solve language modelling task. The task of the Language Model is to predict next word given all previous words. The loss function for this task is the following:

$L_i(x_i) = -log(p(x_i | x_0, .. , x_{i-1}))$

For the backward direction and inversed word order the following formula is used:

$L_i(x_i) = -log(p(x_i | x_{i+1}, .. , x_{N}))$

### Download the dataset

We will use the Text8 dataset. This dataset consists of lowercased sequence of words from Wikipedia. All punctuation is replaced with whitespaces

In [1]:
import requests
from tqdm import tqdm


download_link = 'http://lnsigo.mipt.ru/export/datasets/text8'

def download(file_name, source_url):
    CHUNK = 16 * 1024

    r = requests.get(source_url, stream=True)
    total_length = int(r.headers.get('content-length', 0))

    with open(file_name, 'wb') as f:
        pbar = tqdm(total=total_length, unit='B', unit_scale=True)
        for chunk in r.iter_content(chunk_size=CHUNK):
            if chunk:  # filter out keep-alive new chunks
                pbar.update(len(chunk))
                f.write(chunk)
        f.close()
        
download('text8', download_link)

100%|██████████| 100M/100M [00:01<00:00, 75.1MB/s] 


Read the dataset and split it into words

In [2]:
with open('text8') as f:
    words = f.read().split()
n_words = len(words)
print('Number of words in the dataset: {}'.format(n_words))

Number of words in the dataset: 17005207


### Prepare train validation and test sets
Split the data to train validation and test sets with parts 0.8, 0.1 0.1 respectively

In [3]:
train_part = 0.8
validation_part = 0.1
test_part = 0.1
n_train = int(n_words * train_part)
n_valid = int(n_words * validation_part)
n_test = n_words - n_train - n_valid
train_set = words[:n_train]
valid_set = words[n_train: n_train + n_valid]
test_set = words[n_train + n_valid:]

### Vocabulary

Now we will create the vocabulary to perform conversion from tokens to indices and vice versa. This structure is needed to perform embedding lookup. We have a matrix with embeddings. Each row of this matrix corresponds to embedding of certain word. For example the row that corresponds to the word 'scientist' has index 7 in the embedding matrix. Then to perform lookup from this matrix we need to pass to embedding lookup table index 7. This Vocabulary will convert the word 'scientist' to number 7. We also restrict the maximum number of words to 10000 to make learning easier.

In [4]:
from collections import Counter
from collections import defaultdict
import random
import numpy as np
from itertools import chain

# Number of tokens in the vocabulary
VOCAB_SIZE = 10000

# Dictionary class. Each instance holds tags or tokens or characters and provides
# dictionary like functionality like indices to tokens and tokens to indices.
class Vocabulary:
    def __init__(self, tokens, max_tokens=10000):
        # We set default ind to position of <UNK>
        self._t2i = defaultdict(lambda: 0)
        self._i2t = list()
        self.frequencies = Counter(tokens)
        self.counter = 0
        # The token with 0 index will be unknow token (or out of vocabulary token)
        self._add_token('<UNK>', 2**30)
        for token, freq in self.frequencies.most_common(max_tokens - 1):
            self._add_token(token, freq)
    
    def _add_token(self, token, frequency):
        self._t2i[token] = self.counter
        self.frequencies[token] += 0
        self._i2t.append(token)
        self.counter += 1

    def idx2tok(self, idx):
        return self._i2t[idx]

    def idxs2toks(self, idxs, filter_paddings=False):
        toks = []
        for idx in idxs:
            if not filter_paddings or idx != self.tok2idx('<PAD>'):
                toks.append(self._i2t[idx])
        return toks

    def tok2idx(self, tok):
        return self._t2i[tok]

    def toks2idxs(self, toks):
        return [self._t2i[tok] for tok in toks]

    def __getitem__(self, key):
        return self._t2i[key]

    def __len__(self):
        return self.counter

    def __contains__(self, item):
        return item in self._t2i
    
vocab = Vocabulary(train_set, VOCAB_SIZE)

### Batch generator

To train the network we have to pass batches of examples to it. In our case we have to pass number of 'sentences' to the network. Since the dataset has no punctuation we will split the total sequence of tokens into subsequences of given length and pass the them to the network. Typical length of 'sentence' (subsequence) is around 35 words.

In [5]:
def batch_generator(batch_size,
                    sentence_len,
                    data_type='train',
                    shuffle=True,
                    allow_smaller_last_batch=True):
    if data_type == 'train':
        tokens = train_set
    elif data_type == 'valid':
        tokens = valid_set
    else:
        tokens = test_set
    sentences = []
    for n in range(len(tokens) // sentence_len):
        sentences.append(tokens[n * sentence_len: (n + 1) * sentence_len])
    n_samples = len(sentences)
    
    if shuffle and data_type == 'train':
        order = np.random.permutation(n_samples)
    else:
        order = np.arange(n_samples)
    n_batches = n_samples // batch_size
    if allow_smaller_last_batch and n_samples % batch_size:
        n_batches += 1
    for k in range(n_batches):
        batch_start = k * batch_size
        batch_end = min((k + 1) * batch_size, n_samples)
        x_batch = np.zeros([batch_end - batch_start, sentence_len], dtype=np.int32)
        for ind, n in enumerate(range(batch_start, batch_end)):
            sentence = sentences[order[n]]
            sentence_idxs = vocab.toks2idxs(sentence)
            x_batch[ind] = sentence_idxs
        yield x_batch
    

Now try it out

In [6]:
batch = next(batch_generator(2, 10, 'test'))

for sentence_idxs in batch:
    print('Sententence indices:')
    print(sentence_idxs)
    print('Sententence words:')
    print(vocab.idxs2toks(sentence_idxs))

Sententence indices:
[402   0  11 108   1 130 215  12  45  51]
Sententence words:
['white', '<UNK>', 'is', 'made', 'the', 'same', 'way', 'as', 'its', 'more']
Sententence indices:
[ 426 5196  402    0    1    0   26 8128    3   61]
Sententence words:
['famous', 'cousin', 'white', '<UNK>', 'the', '<UNK>', 'are', 'crushed', 'and', 'after']


### Network
Now we will assemble the ELMO network. As a basis we will use GRU units as a faster an less prone to overfitting solution.
Following the paper the network will contain: embedding layer, two recurrent layers, and output prediction layer. The embedding layer will convert indices of words to corresponding vectors. Two recurrent layers will contextualize the word representations. The output layer will project the hidden states produced by recurrent neural network to probability scores of next token. The dimensionality of the output is equal to vocabulary size. So, for each token we get the probobality (how likely to see this token as the next token). 

In [7]:
import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.functional as F


class ELMO(nn.Module):
    def __init__(self,
                 vocab_size,
                 n_hidden,
                 embedding_dim):
        super(ELMO, self).__init__()


        self.embed = nn.Embedding(vocab_size, embedding_dim)
        
        # Two forward layers
        self.gru_fw_1 = torch.nn.GRU(embedding_dim, n_hidden)
        self.gru_fw_2 = torch.nn.GRU(n_hidden, n_hidden)
        
        # Two backward layers
        self.gru_bw_1 = torch.nn.GRU(embedding_dim, n_hidden)
        self.gru_bw_2 = torch.nn.GRU(n_hidden, n_hidden)
        
        self.output_projection = nn.Linear(n_hidden, embedding_dim)
        self.output_layer = nn.Linear(embedding_dim, vocab_size)
        
    
    def forward(self, batch, direction):
        """ Compute forward pass in given direction
        Args:
            batch: is a torch Variable [batch_size x sentence_length] (int64)
            direction: either 'forward' of 'backward'
        Returns:
            logits: log probabilities for the given direction
        """
        # Sequences for backward direction assumed to be reversed
        batch_size, seq_len = batch.size()
        # We drop last token in order to predict every following token
        emb = self.embed(batch[:, :-1])
        output_embeddings = emb
        
        # Following dicumentation input to GRU has the following order (seq_len, batch, input_size)
        emb = emb.permute([1, 0, 2])
        
        # Forward direction
        if direction == 'forward':
            units, _ = self.gru_fw_1(emb)
            units, _ = self.gru_fw_2(units)
            
        # Backward direction
        else:
            units, _ = self.gru_bw_1(emb)
            units, _ = self.gru_bw_2(units)
        units = self.output_projection(units)
        logits = self.output_layer(units)
        
        return logits.permute([1, 0, 2])
    
    # For using as a pretrained model
    def embed_batch(self, batch_fw, batch_bw):
        """This method should be used to pass embeddings of pre-trained model another task
        Args:
            batch_fw - tensor of indices of words with size [batch_size, seq_len] (int64)
            batch_fw - tensor of indices of words with reverse order, size [batch_size, seq_len] (int64)
        
        Returns:
            units_0, units_1, units_2 - units from token level, first RNN layer, and second rnn layer
        """
        
        batch_size, seq_len = batch_fw.size()
        # We drop last token in order to predict every following token
        emb_fw = self.embed(batch_fw)
        emb_bw = self.embed(batch_bw)
        
        # Following dicumentation input to GRU has the following order (seq_len, batch, input_size)
        emb_fw_perm = emb_fw.permute([1, 0, 2])
        emb_bw_perm = emb_bw.permute([1, 0, 2])
        
        # Forward direction
        units_fw_1, _ = self.gru_fw_1(emb_fw_perm)
        units_fw_2, _ = self.gru_fw_2(units_fw_1)
        units_fw_1 = units_fw_1.permute([1, 0, 2])
        units_fw_2 = units_fw_2.permute([1, 0, 2])

        # Backward direction
        units_bw_1, _ = self.gru_bw_1(emb_bw_perm)
        units_bw_2, _ = self.gru_fw_2(units_bw_1)
        units_bw_1 = units_bw_1.permute([1, 0, 2])
        units_bw_2 = units_bw_2.permute([1, 0, 2])

        # Create representaions for each layer
        units_0 = torch.cat([emb_fw, torch.zeros_like(units_fw_1), torch.zeros_like(units_bw_1)], 2)
        units_1 = torch.cat([emb_fw, units_fw_1, units_bw_1], 2)
        units_2 = torch.cat([emb_fw, units_fw_2, units_bw_2], 2)
        
        return units_0, units_1, units_2
        
    

Now create the instance of the network:

In [13]:
hidden_size = 256
embedding_dim = 256

net = ELMO(VOCAB_SIZE, hidden_size, embedding_dim)
net.cuda()

ELMO(
  (embed): Embedding(10000, 256)
  (gru_fw_1): GRU(256, 256)
  (gru_fw_2): GRU(256, 256)
  (gru_bw_1): GRU(256, 256)
  (gru_bw_2): GRU(256, 256)
  (output_projection): Linear(in_features=256, out_features=256, bias=True)
  (output_layer): Linear(in_features=256, out_features=10000, bias=True)
)

Create supplimentary function to feed pytorch Variables into the network. It will prepare forward and backward directions inputs and outputs. The backward direction the input is just the reversed sequence.

In [14]:
def prepare_batch(x):
    x = x.astype(np.int64)
    np_x_fw = x
    np_x_bw = np.array(x[:, ::-1])
    x_fw = Variable(torch.from_numpy(np_x_fw)).cuda()
    x_bw = Variable(torch.from_numpy(np.array(np_x_bw))).cuda()

    y_fw = Variable(torch.from_numpy(np_x_fw[:, 1:])).cuda()
    y_bw = Variable(torch.from_numpy(np_x_bw[:, 1:])).cuda()
    return x_fw, x_bw, y_fw, y_bw

### Train
We will train the network for 10 epochs. After each epoch we will perform validation. And at the end of the training we will perform testing. The log loss will be displayed during the training.

In [15]:
n_epochs = 3
sequence_len = 35
batch_size = 64
print_loss_every = 1000
learning_rate = 0.001

optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate)
loss_function = torch.nn.NLLLoss()
for k in range(n_epochs):
    sum_loss = 0
    print('Epoch: {}'.format(k))
    for n, x in enumerate(batch_generator(batch_size, sequence_len, 'train')):
        x_fw, x_bw, y_fw, y_bw = prepare_batch(x)
        net.zero_grad()
        
        # Forward direction
        logits_fw = net.forward(x_fw, 'forward')
        logits_fw = nn.functional.log_softmax(logits_fw, 2)
        loss_fw = loss_function(logits_fw.permute([0, 2, 1]), y_fw)
        
        # Backward direction
        logits_bw = net.forward(x_bw, 'backward')
        logits_bw = nn.functional.log_softmax(logits_bw, 2)
        loss_bw = loss_function(logits_bw.permute([0, 2, 1]), y_bw)
        
        loss = (loss_fw + loss_bw) / 2
        loss_val = loss.cpu().data.numpy()
        sum_loss += loss_val
        loss.backward()
        optimizer.step()
        if n % print_loss_every == print_loss_every - 1:
            print('Train loss: {}'.format(sum_loss / print_loss_every))
            sum_loss = 0
            
    # Affects only batch_norm and dropout layers
    net.eval()        
        
    sum_loss = 0
    for n, x in enumerate(batch_generator(batch_size, sequence_len, 'valid')):
        x_fw, x_bw, y_fw, y_bw = prepare_batch(x)
        # Forward direction
        logits_fw = net.forward(x_fw, 'forward')
        logits_fw = nn.functional.log_softmax(logits_fw, 2)
        loss_fw = loss_function(logits_fw.permute([0, 2, 1]), y_fw)
        
        # Backward direction
        logits_bw = net.forward(x_bw, 'backward')
        logits_bw = nn.functional.log_softmax(logits_bw, 2)
        loss_bw = loss_function(logits_bw.permute([0, 2, 1]), y_bw)
        
        loss = (loss_fw + loss_bw) / 2
        loss_val = loss.cpu().data.numpy()
        sum_loss += loss_val
    print('Validation loss: {}'.format(sum_loss / n))
    
    # Affects only batch_norm and dropout layers
    net.train()

Epoch: 0
Train loss: [5.716759]
Train loss: [5.249713]
Train loss: [5.0771194]
Train loss: [4.9769306]
Train loss: [4.9078603]
Train loss: [4.857866]
Validation loss: [4.841934]
Epoch: 1
Train loss: [4.7844253]
Train loss: [4.7626805]
Train loss: [4.7430506]
Train loss: [4.7212906]
Train loss: [4.7104015]
Train loss: [4.690979]
Validation loss: [4.725791]
Epoch: 2
Train loss: [4.6198497]
Train loss: [4.6301255]
Train loss: [4.630003]
Train loss: [4.624859]
Train loss: [4.619018]
Train loss: [4.6117063]
Validation loss: [4.675135]


Now try to sample from the model forward direction network given short initial phrase like "the most"

In [12]:
x = [vocab.toks2idxs('the most'.split())]
for n in range(10):
    x_fw = Variable(torch.from_numpy(np.array(x))).cuda()
    logits_fw = net.forward(x_fw, 'forward')
    logits = nn.functional.log_softmax(logits_fw, 2).cpu().data.numpy()
    logits = logits[0][-1]
    # Softmax
    p = np.exp(logits) / np.sum(np.exp(logits))
    # Sample from obtained probability distribution
    new_tok_ind = np.argmax(np.random.multinomial(1, p - 1e-9))
    x[0].append(new_tok_ind)

print(vocab.idxs2toks(x[0]))

['the', 'most', 'lowest', 'prominent', 'and', 'rugby', 'popular', 'league', 'team', '<UNK>', 'are', 'ranked']


Evaluate the model on the test set

In [23]:
sum_loss = 0
for n, x in enumerate(batch_generator(batch_size, sequence_len, 'test')):
    x_fw, x_bw, y_fw, y_bw = prepare_batch(x)
    # Forward direction
    logits_fw = net.forward(x_fw, 'forward')
    logits_fw = nn.functional.log_softmax(logits_fw, 2)
    loss_fw = loss_function(logits_fw.permute([0, 2, 1]), y_fw)

    # Backward direction
    logits_bw = net.forward(x_bw, 'backward')
    logits_bw = nn.functional.log_softmax(logits_bw, 2)
    loss_bw = loss_function(logits_bw.permute([0, 2, 1]), y_bw)

    loss = (loss_fw + loss_bw) / 2
    loss_val = loss.cpu().data.numpy()
    sum_loss += loss_val
print('Test loss: {}'.format(sum_loss / n))

Test loss: [4.580204]


### Homework

#### Extend the model
Try this additional features to get loss on test lower than 4.0:
- Dropout (typically added to the embeddings and before output layer) https://arxiv.org/abs/1708.02182
- Gradient clipping to prevent exploding gradients
- Tied output weights (input embedding matrix can be used for output layer) https://arxiv.org/abs/1611.01462
- larger hidden sizes for rnn and hidden sizes
- LSTM instead of GRU
- Inialize the rucurrent units with trainable states
- Pre-trained word embeddings (for instance [GloVe](https://nlp.stanford.edu/projects/glove/) or [FastText](https://github.com/facebookresearch/fastText))
- Compute [perplexity](https://en.wikipedia.org/wiki/Perplexity) of the model


#### Try it on real data

Pretrain the language model on the large dataset, for example [Amazon reviews](https://www.kaggle.com/bittlingmayer/amazonreviews/data), and implement the pre-trained model for  [IMDB sentiment analisis task](http://ai.stanford.edu/~amaas/data/sentiment/). For this task build the network that takes contextualized embeddings poduced by ELMO and pass them to another network. Use embed_batch method of the network class to obtain ELMO embeddings.

