# Language Models

**Notebook created by [Daniel Fojo](https://www.linkedin.com/in/daniel-fojo/) for the [Postgraduate course in artificial intelligence with deep learning](https://www.talent.upc.edu/ing/estudis/formacio/curs/310400/postgrau-artificial-intelligence-deep-learning/) in [UPC School](https://www.talent.upc.edu/ing/) (2020).**


In [None]:
import torch
import math
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import os
import time
import torchtext
from torchtext.data.utils import get_tokenizer
if not torch.cuda.is_available():
    raise RuntimeError("You should enable GPU runtime.")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


A language model is a probability distribution over sequences of words. To train a Deep Learning language model, we will task the model to, given a sequence of words, predict the following one.
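
Next-word prediction is enough to define such a distribution because the chain rule factorises the probability of a sequence into a product of conditional probabilities, one per word. The sketch below illustrates this with a toy table of next-word probabilities. The table `next_word_probs` and all its values are invented for illustration, and it only conditions on the previous word, whereas the models in this lab condition on the whole preceding context.

```python
import math

# Hypothetical next-word probabilities P(word | previous word); the values are invented.
next_word_probs = {
    "<sos>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sleeps": 0.7, "<eos>": 0.3},
    "sleeps": {"<eos>": 1.0},
}


def sequence_log_prob(words):
    """Sum log P(w_t | w_{t-1}) over the sequence (chain rule with a one-word context)."""
    log_prob = 0.0
    for prev, cur in zip(words[:-1], words[1:]):
        log_prob += math.log(next_word_probs[prev][cur])
    return log_prob


sentence = ["<sos>", "the", "cat", "sleeps", "<eos>"]
log_p = sequence_log_prob(sentence)
probability = math.exp(log_p)  # 0.6 * 0.5 * 0.7 * 1.0

# Perplexity is the exponential of the average negative log-probability per prediction.
n_predictions = len(sentence) - 1
perplexity = math.exp(-log_p / n_predictions)
print(f"P(sentence) = {probability:.2f}, perplexity = {perplexity:.3f}")
```

The same quantity, the average negative log-probability of the correct next word, is the loss that is minimised and reported as perplexity during training.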

For this lab, we will train two different models. The first one will be a simple RNN model with an encoder-decoder structure. The second one will be a Transformer model.

# RNN Model

First we will declare the model that we will use. We will start with a simple RNN model made of an encoder, a recurrent module, and a decoder.
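
As a reminder of the tensor shapes involved, here is a minimal sketch of how `nn.LSTM` consumes and produces data when `batch_first` is left at its default. The sizes are arbitrary toy values, not the hyperparameters of this lab.

```python
import torch
import torch.nn as nn

seq_len, batch, emb, nhid, nlayers = 5, 3, 8, 16, 2  # toy sizes, chosen arbitrarily

rnn = nn.LSTM(emb, nhid, nlayers)
x = torch.randn(seq_len, batch, emb)  # [sequence length, batch size, features]
hidden = (torch.zeros(nlayers, batch, nhid),   # h_0
          torch.zeros(nlayers, batch, nhid))   # c_0

output, (h_n, c_n) = rnn(x, hidden)
print(output.shape)  # one output per time step: [seq_len, batch, nhid]
print(h_n.shape)     # final hidden state per layer: [nlayers, batch, nhid]
```

The last time step of `output` matches the final hidden state of the top layer, which is why the hidden tuple can be passed on to the next chunk of text.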

### Model

#### **Exercise 1**
Write the forward method of the model, using the encoder layer, the rnn, and the decoder. Note that the forward method should return both the decoded output and the hidden state from the RNN. Use the dropout layer after the encoder layer.

In [None]:
class RNNModel(nn.Module):
    """Container module with an encoder, a recurrent module, and a decoder."""

    def __init__(self, ntoken, embedding_size, nhid, nlayers, dropout=0.5, pretrained_embeddings=None):
        super().__init__()

        self.pretrained_embeddings = pretrained_embeddings
        if pretrained_embeddings is None:
            self.encoder = nn.Embedding(ntoken, embedding_size)
        self.drop = nn.Dropout(dropout)
        self.rnn = nn.LSTM(embedding_size, nhid, nlayers, dropout=dropout)
        self.decoder = nn.Linear(nhid, ntoken)

        self.nhid = nhid
        self.nlayers = nlayers

    def forward(self, x, hidden):
        ...
        return decoded, hidden

    def init_hidden(self, bsz):
        weight = next(self.parameters())
        return (weight.new_zeros(self.nlayers, bsz, self.nhid),
                weight.new_zeros(self.nlayers, bsz, self.nhid))
    

### Hyperparameters

In [None]:
batch_size = 20
bptt = 35  # Back Propagation Through Time
embedding_size = 300  # 650 gives better results, but is much slower
hidden_size = 300  # 650 gives better results, but is much slower
n_layers = 2
lr = 1e-2


### Data loading

For our task we will use the WikiText-2 dataset, a language modeling dataset of roughly 2 million tokens extracted from the set of verified Good and Featured articles on Wikipedia (the larger WikiText-103 variant contains over 100 million tokens).

We start from sequential data and arrange it into columns. For instance, with the alphabet as the sequence and batch size 4, we'd get
```
┌ a g m s ┐
│ b h n t │
│ c i o u │
│ d j p v │
│ e k q w │
└ f l r x ┘
```
The model treats these columns as independent, which means that the dependence of e.g. 'g' on 'f' cannot be learned, but it allows more efficient batch processing.
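
The column layout can be reproduced with a few tensor operations. The helper `batchify` below is a hypothetical name used for illustration. It sketches the idea only and is not the torchtext implementation.

```python
import string
import torch


def batchify(token_ids, batch_size):
    """Arrange a flat sequence into `batch_size` columns, dropping the leftover tokens."""
    n_rows = len(token_ids) // batch_size
    trimmed = torch.tensor(token_ids[:n_rows * batch_size])
    return trimmed.view(batch_size, n_rows).t().contiguous()  # [n_rows, batch_size]


token_ids = list(range(26))  # a=0, b=1, ..., z=25
data = batchify(token_ids, batch_size=4)
rows = ["".join(string.ascii_lowercase[i] for i in row) for row in data.tolist()]
print(rows)  # ['agms', 'bhnt', 'ciou', 'djpv', 'ekqw', 'flrx']

# One chunk for truncated backpropagation: the target is the input shifted by one step.
bptt_len = 2
inputs, targets = data[0:bptt_len], data[1:1 + bptt_len]
```

The letters 'y' and 'z' are dropped because they do not fill a complete row, and each target row is simply the next input row.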

In [None]:
dataset = torchtext.data.Field(tokenize=get_tokenizer("spacy"),
                            init_token='<sos>',
                            eos_token='<eos>',
                            lower=True)
train_txt, val_txt, test_txt = torchtext.datasets.WikiText2.splits(dataset)

# build the vocabulary
dataset.build_vocab(train_txt)
ntokens = len(dataset.vocab.stoi)  # stoi = string to int


In [None]:
# make iterator for splits
train_iter, valid_iter, test_iter = torchtext.data.BPTTIterator.splits(
    (train_txt, val_txt, test_txt), batch_size=batch_size, bptt_len=bptt, device=device)


### Instantiate model

Here, we instantiate the model and the optimizer. We will also use a LR scheduler to decrease the LR after every epoch.
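
To see what a per-epoch decay looks like, the sketch below attaches a `StepLR` scheduler to a throwaway linear layer and records the learning rate over three epochs. The toy model and the use of SGD are assumptions made only to keep the example small.

```python
import torch.nn as nn
import torch.optim as optim

toy_model = nn.Linear(2, 1)  # stand-in model, only here to give the optimizer parameters
toy_optimizer = optim.SGD(toy_model.parameters(), lr=1e-2)
toy_scheduler = optim.lr_scheduler.StepLR(toy_optimizer, step_size=1, gamma=0.95)

learning_rates = []
for epoch in range(3):
    learning_rates.append(toy_scheduler.get_last_lr()[0])
    toy_optimizer.step()   # normally preceded by the forward and backward passes
    toy_scheduler.step()   # multiply the learning rate by gamma once per epoch

print(learning_rates)  # 0.01, then 0.01 * 0.95, then 0.01 * 0.95 ** 2
```

Calling `scheduler.step()` once per epoch, after the optimizer has stepped, is the pattern used in the training loop of this lab.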

#### **Exercise 2**

Instantiate the model with the correct hyperparameters. Then instantiate the appropriate loss function for a language model, the Adam optimizer, and a [StepLR](https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.StepLR) learning rate scheduler with step size 1 and gamma 0.95.

In [None]:
model = ...

criterion = ...
optimizer = ...
scheduler = ...

### Train function

Now we define the train function that trains the model for an epoch.

#### **Exercise 3**

Complete the training function with help of the code comments. You can check the documentation of [clip_grad_norm_](https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html#torch.nn.utils.clip_grad_norm_). 
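
Gradient norm clipping rescales all gradients by a common factor whenever their combined L2 norm exceeds a threshold, so the direction of the update is preserved. A minimal sketch on a single hand-made parameter (the gradient values are chosen so that the norm is exactly 5):

```python
import torch

w = torch.nn.Parameter(torch.zeros(2))
w.grad = torch.tensor([3.0, 4.0])  # hand-set gradient with L2 norm 5

# clip_grad_norm_ works in place and returns the total norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_([w], max_norm=0.25)
norm_after = w.grad.norm()
print(float(norm_before), float(norm_after))  # roughly 5.0 and 0.25
```

The clipped gradient still points along (0.6, 0.8). Only its length changes.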

In [None]:
def repackage_hidden(h):
    """Wraps hidden states in new Tensors, to detach them from their history."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)  # For LSTMs


def train():
    model.train()
    total_loss = 0.
    start_time = time.time()
    hidden = model.init_hidden(batch_size)
    for i, batch in enumerate(train_iter):
        data, target = batch.text, batch.target

        # Set gradients to zero
        ...
        # Starting each batch, we detach the hidden state from how it was previously produced.
        # If we didn't, the model would try backpropagating all the way to start of the dataset.
        hidden = repackage_hidden(hidden)

        # Compute the output and the new hidden state
        output, hidden = ...

        output = output.permute(0, 2, 1)
        loss = criterion(output, target)
        loss.backward()

        # Use `clip_grad_norm_` to clip the norm of the gradients to 0.25. This helps stabilize the training of the RNN
        ...
        optimizer.step()

        total_loss += loss.item()
        log_interval = 100
        if i % log_interval == 0 and i > 0:
            cur_loss = total_loss / log_interval
            elapsed = time.time() - start_time
            print(f'| epoch {epoch:3d} | {i:5d}/{len(train_iter):5d} batches | lr {lr:.4f} | ms/batch {elapsed * 1000 / log_interval:5.2f} | loss {cur_loss:5.2f} | ppl {math.exp(cur_loss):8.2f}')
            total_loss = 0
            start_time = time.time()

### Validation function

Now we will define the validation function, which, given a dataset (validation or test), evaluates the loss of the model's predictions on that dataset.

#### **Exercise 4**

Complete the validation function to compute the loss. `data_source` corresponds to `valid_iter` or `test_iter` (depending on which phase of the training code we are in).
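
Two details of the loss computation are easy to miss: `nn.CrossEntropyLoss` expects the class dimension in second position, which is why the model output is permuted, and perplexity is the exponential of the mean cross-entropy. The sketch below checks both on toy tensors. The sizes are arbitrary, and the all-zero logits stand in for a model that predicts a uniform distribution.

```python
import math
import torch
import torch.nn as nn

seq_len, batch, vocab = 4, 3, 10  # toy sizes
criterion = nn.CrossEntropyLoss()

logits = torch.zeros(seq_len, batch, vocab)       # uniform prediction over the vocabulary
target = torch.randint(vocab, (seq_len, batch))   # [seq_len, batch]

# Move the vocabulary dimension to position 1: [seq_len, vocab, batch].
loss = criterion(logits.permute(0, 2, 1), target).item()
perplexity = math.exp(loss)
print(f"loss = {loss:.4f}, perplexity = {perplexity:.2f}")  # a uniform guess gives perplexity = vocab
```

Because the criterion returns a mean over tokens, `evaluate` multiplies each batch loss by `target.numel()` before summing, so that a shorter final batch does not distort the average.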



In [None]:
@torch.no_grad()
def evaluate(data_source):
    model.eval()
    total_loss = 0.
    n = 0
    hidden = ...
    for i, batch in enumerate(data_source):
        data, target = batch.text, batch.target
        output, hidden = ...
        output = output.permute(0, 2, 1)
        total_loss += target.numel() * criterion(output, target).item()
        n += target.numel()
    return total_loss / n

### Training loop

This is the training loop code. At any point you can hit stop to get out of training early.

In [None]:
best_val_loss = float("inf")

# At any point you can hit stop to get out of training early.
epochs = 4
try:
    for epoch in range(1, epochs+1):
        epoch_start_time = time.time()
        train()
        val_loss = evaluate(valid_iter)
        print('-' * 89)
        print(f'| end of epoch {epoch} | time: {(time.time() - epoch_start_time):.2f}s | valid loss {val_loss:.2f} | valid ppl {math.exp(val_loss):.2f}')
        print('-' * 89)
        # Save the model if the validation loss is the best we've seen so far.
        if val_loss < best_val_loss:
            with open("best_checkpoint.pth", 'wb') as f:
                torch.save(model, f)
            best_val_loss = val_loss
        scheduler.step()
        lr = scheduler.get_last_lr()[0]
except KeyboardInterrupt:
    print('-' * 89)
    print('Exiting from training early')

# Load the best saved model.
with open("best_checkpoint.pth", 'rb') as f:
    model = torch.load(f)
    # after loading, the RNN parameters are not a contiguous chunk of memory
    # this makes them contiguous again, which speeds up the forward pass
    model.rnn.flatten_parameters()

# Run on test data. 
with torch.no_grad():
    test_loss = evaluate(test_iter)
print('=' * 89)
print(f'| End of training | test loss {test_loss:5.2f} | test ppl {math.exp(test_loss):8.2f}')
print('=' * 89)


### Text generation

Now we can test the performance of our language model: we first feed a random word to the model, generate a new word by sampling from the model's output distribution, and then feed the generated word back into the model iteratively.
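
The generation code draws the next word with `torch.multinomial`, which accepts unnormalized weights and normalizes them internally, so exponentiating the scaled logits is equivalent to sampling from a softmax. The sketch below uses made-up logits for a three-word vocabulary to show how the temperature reshapes that distribution. The helper `sampling_probs` is a hypothetical name.

```python
import torch

torch.manual_seed(0)
logits = torch.tensor([2.0, 1.0, 0.0])  # made-up scores for a 3-word vocabulary


def sampling_probs(logits, temperature):
    weights = (logits / temperature).exp()  # unnormalized, as in the generation loop
    return weights / weights.sum()          # the distribution multinomial samples from


for temperature in (0.5, 1.0, 2.0):
    print(temperature, sampling_probs(logits, temperature))

next_word = torch.multinomial((logits / 1.0).exp(), 1)[0]  # a sampled index, not the argmax
```

A temperature below 1 concentrates probability on the highest-scoring word, while a temperature above 1 flattens the distribution and increases diversity.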

In [None]:
model.eval()

hidden = model.init_hidden(bsz=1)
x = torch.randint(ntokens, (1, 1), dtype=torch.long).to(device)
temperature = 1  # Higher will increase diversity
text = ""
with torch.no_grad():
    for i in range(1000):
        output, hidden = model(x, hidden)
        word_weights = (output / temperature).exp().cpu().squeeze()  # Softmax without normalizing
        word_idx = torch.multinomial(word_weights, 1)[0]
        x = torch.tensor([[word_idx]], dtype=torch.long).to(device)
        word = dataset.vocab.itos[word_idx]
        text += word + ('\n' if i % 20 == 19  or word == '<eos>' else ' ')

print(text)


# Use pretrained embeddings

In [None]:
dataset = torchtext.data.Field(tokenize=get_tokenizer("spacy"),
                            init_token='<sos>',
                            eos_token='<eos>',
                            lower=True)
train_txt, val_txt, test_txt = torchtext.datasets.WikiText2.splits(dataset)

# build the vocabulary
embeddings = torchtext.vocab.GloVe(name='6B', dim=300)
dataset.build_vocab(train_txt, vectors=embeddings)  # Specify the embedding https://nlp.stanford.edu/projects/glove/
ntokens = len(dataset.vocab.stoi)  # stoi = string to int
print(dataset.vocab.vectors.shape)

In [None]:
def get_embedding(word):
    index = dataset.vocab.stoi[word]
    return dataset.vocab.vectors[index]

print(get_embedding("queen"))

#### **Exercise 5**

Complete the code to find the closest embeddings to `queen` - `woman` + `man`. You can use `torch.dist` to compute the distance between 2 vectors.
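
As a warm-up, the same nearest-neighbour search can be tried on a handful of invented 2-D vectors. The words and coordinates below are hypothetical and were chosen so that the analogy holds by construction. They are not GloVe embeddings.

```python
import torch

# Hypothetical 2-D "embeddings", invented so that the analogy works by construction.
words = ["paris", "france", "rome", "italy", "banana"]
vectors = torch.tensor([[1.0, 1.0], [1.0, 0.0], [3.0, 1.0], [3.0, 0.0], [-2.0, 5.0]])

query = vectors[0] - vectors[1] + vectors[3]  # paris - france + italy
distances = torch.stack([torch.dist(query, v) for v in vectors])  # Euclidean distances
closest = torch.topk(-distances, k=2)[1]  # negate so that topk returns the smallest distances
nearest_words = [words[i] for i in closest.tolist()]
print(nearest_words)  # ['rome', 'italy']
```

With real embeddings the query word itself often ranks among the nearest neighbours, which is why the exercise code assigns it an infinite distance.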


In [None]:
embedding = ... - ... + ...

distances = []
for i, vector in enumerate(dataset.vocab.vectors):
    if dataset.vocab.stoi["queen"] != i:
        distance = ...
        distances.append(distance)
    else:
        distances.append(torch.tensor(float("inf")))
distances = torch.stack(distances)
indices = torch.topk(-distances, k=5)[1]
print([dataset.vocab.itos[ind] for ind in indices])

#### **Exercise 6**

Feel free to experiment on your own to see which embeddings are close to each other.

## Model

In [None]:
model = RNNModel(ntokens, embedding_size, hidden_size, n_layers, pretrained_embeddings=dataset.vocab.vectors.to(device)).to(device)
lr = 1e-3
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.95)

## Training loop

In [None]:
best_val_loss = float("inf")

# At any point you can hit stop to get out of training early.
epochs = 4
try:
    for epoch in range(1, epochs+1):
        epoch_start_time = time.time()
        train()
        val_loss = evaluate(valid_iter)
        print('-' * 89)
        print(f'| end of epoch {epoch} | time: {(time.time() - epoch_start_time):.2f}s | valid loss {val_loss:.2f} | valid ppl {math.exp(val_loss):.2f}')
        print('-' * 89)
        # Save the model if the validation loss is the best we've seen so far.
        if val_loss < best_val_loss:
            with open("best_checkpoint.pth", 'wb') as f:
                torch.save(model, f)
            best_val_loss = val_loss
        scheduler.step()
        lr = scheduler.get_last_lr()[0]
except KeyboardInterrupt:
    print('-' * 89)
    print('Exiting from training early')

# Load the best saved model.
with open("best_checkpoint.pth", 'rb') as f:
    model = torch.load(f)
    # after loading, the RNN parameters are not a contiguous chunk of memory
    # this makes them contiguous again, which speeds up the forward pass
    model.rnn.flatten_parameters()

# Run on test data. 
with torch.no_grad():
    test_loss = evaluate(test_iter)
print('=' * 89)
print(f'| End of training | test loss {test_loss:5.2f} | test ppl {math.exp(test_loss):8.2f}')
print('=' * 89)


### Text generation

In [None]:
model.eval()

hidden = model.init_hidden(bsz=1)
x = torch.randint(ntokens, (1, 1), dtype=torch.long).to(device)
temperature = 1  # Higher will increase diversity
text = ""
with torch.no_grad():
    for i in range(1000):
        output, hidden = model(x, hidden)
        word_weights = (output / temperature).exp().cpu().squeeze()  # Softmax without normalizing
        word_idx = torch.multinomial(word_weights, 1)[0]
        x = torch.tensor([[word_idx]], dtype=torch.long).to(device)
        word = dataset.vocab.itos[word_idx]
        text += word + ('\n' if i % 20 == 19  or word == '<eos>' else ' ')

print(text)


# Extra: Transformer Model

Now we will train a Transformer to solve the language modelling task. The architecture of the model is the following:

![alt text](https://pytorch.org/tutorials/_images/transformer_architecture.jpg)

Even though it seems complicated, with PyTorch and the `nn.TransformerEncoder` module this can be implemented in an easy (or at least, easier) way.

### Positional Encoder

First, we will use a Positional Encoding module. Positional Encoding injects some information about the relative or absolute position of the tokens in the sequence. The positional encodings have the same dimension as the embeddings, so that the two can be summed. Here, we use sine and cosine functions of different frequencies.
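
Written out, the sinusoidal encoding used in the code of this section is the following, where $pos$ is the position in the sequence, $i$ indexes pairs of embedding dimensions, and $d_{\text{model}}$ is the embedding size. The symbol names follow the usual convention and are an assumption about notation; the formula itself matches `div_term` in the implementation.

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$

Even dimensions therefore hold sines and odd dimensions hold cosines, with wavelengths that grow geometrically with $i$.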

In [None]:
class PositionalEncoding(nn.Module):

    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        r"""Inputs of forward function
        Args:
            x: the sequence fed to the positional encoder model (required).
        Shape:
            x: [sequence length, batch size, embed dim]
            output: [sequence length, batch size, embed dim]
        Examples:
            >>> output = pos_encoder(x)
        """

        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)

### Model
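
Because the model is trained to predict the next word, each position must only attend to itself and to earlier positions. This is enforced with an additive mask that is 0 where attention is allowed and `-inf` on future positions. The standalone sketch below builds such a mask for three positions. The helper name `causal_mask` is hypothetical, and the all-zero attention scores are placeholders.

```python
import torch


def causal_mask(sz):
    """Additive attention mask: 0 where attention is allowed, -inf on future positions."""
    allowed = torch.tril(torch.ones(sz, sz)) == 1  # row i may attend to columns 0..i
    return torch.zeros(sz, sz).masked_fill(~allowed, float("-inf"))


mask = causal_mask(3)
print(mask)  # row i holds 0 in columns 0..i and -inf afterwards

scores = torch.zeros(3, 3) + mask        # pretend every raw attention score is 0
weights = torch.softmax(scores, dim=-1)  # each row spreads its weight over the past only
print(weights)
```

After the softmax, the masked entries receive exactly zero weight, so no information flows from future tokens.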

In [None]:
from torch.nn import TransformerEncoder, TransformerEncoderLayer


class TransformerModel(nn.Module):
    """Container module with an embedding encoder, a Transformer encoder, and a decoder."""

    def __init__(self, ntoken, ninp, nhead, nhid, nlayers, dropout=0.5):
        super().__init__()
        self.ninp = ninp
        self.src_mask = None

        self.encoder = nn.Embedding(ntoken, ninp)

        self.pos_encoder = PositionalEncoding(ninp, dropout)

        encoder_layers = TransformerEncoderLayer(ninp, nhead, nhid, dropout)
        self.transformer_encoder = TransformerEncoder(encoder_layers, nlayers)
        
        self.decoder = nn.Linear(ninp, ntoken)

    def _generate_square_subsequent_mask(self, sz):
        mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)  # Lower triangular matrix with ones.
        mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
        return mask

    def forward(self, x, has_mask=True):
        if has_mask:
            device = x.device
            if self.src_mask is None or self.src_mask.size(0) != len(x):
                mask = self._generate_square_subsequent_mask(len(x)).to(device)
                self.src_mask = mask
        else:
            self.src_mask = None

        embeddings = self.encoder(x)
        embeddings = embeddings * math.sqrt(self.ninp)
        embeddings = self.pos_encoder(embeddings)

        encoded = self.transformer_encoder(embeddings, self.src_mask)
        decoded = self.decoder(encoded)
        return decoded


### Hyperparameters

In [None]:
batch_size = 20
bptt = 35  # Back Propagation Through Time
embedding_size = 300  # 650 gives better results, but is much slower
hidden_size = 300  # 650 gives better results, but is much slower
n_layers = 2
n_heads = 2  # Transformer heads
lr = 1e-3


### Model
Here, we instantiate the model and the optimizer. We will also use a LR scheduler to decrease the LR after every epoch.

In [None]:
model = TransformerModel(ntokens, embedding_size, n_heads, hidden_size, n_layers).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.95)

### Train function
Now we define the train function that trains the model for an epoch.

In [None]:
def train():
    model.train()
    total_loss = 0.
    start_time = time.time()
    for i, batch in enumerate(train_iter):
        data, target = batch.text, batch.target
        model.zero_grad()
        output = model(data)
        output = output.permute(0, 2, 1)
        loss = criterion(output, target)
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()

        total_loss += loss.item()
        log_interval = 100
        if i % log_interval == 0 and i > 0:
            cur_loss = total_loss / log_interval
            elapsed = time.time() - start_time
            print(f'| epoch {epoch:3d} | {i:5d}/{len(train_iter):5d} batches | lr {lr:.4f} | ms/batch {elapsed * 1000 / log_interval:5.2f} | loss {cur_loss:5.2f} | ppl {math.exp(cur_loss):8.2f}')
            total_loss = 0
            start_time = time.time()

### Validation function

Now we will define the validation function, which, given a dataset (validation or test), evaluates the loss of the model's predictions on that dataset.

In [None]:
@torch.no_grad()
def evaluate(data_source):
    model.eval()
    total_loss = 0.
    n = 0
    for i, batch in enumerate(data_source):
        data, target = batch.text, batch.target
        output = model(data)
        output = output.permute(0, 2, 1)
        total_loss += target.numel() * criterion(output, target).item()
        n += target.numel()
    return total_loss / n

### Training loop

This is the training loop code. At any point you can hit stop to get out of training early.

In [None]:
best_val_loss = float("inf")

# At any point you can hit stop to get out of training early.
epochs = 4
try:
    for epoch in range(1, epochs+1):
        epoch_start_time = time.time()
        train()
        val_loss = evaluate(valid_iter)
        print('-' * 89)
        print(f'| end of epoch {epoch} | time: {(time.time() - epoch_start_time):.2f}s | valid loss {val_loss:.2f} | valid ppl {math.exp(val_loss):.2f}')
        print('-' * 89)
        # Save the model if the validation loss is the best we've seen so far.
        if val_loss < best_val_loss:
            with open("best_checkpoint_transformer.pth", 'wb') as f:
                torch.save(model, f)
            best_val_loss = val_loss
        scheduler.step()
        lr = scheduler.get_last_lr()[0]
except KeyboardInterrupt:
    print('-' * 89)
    print('Exiting from training early')

# Load the best saved model.
with open("best_checkpoint_transformer.pth", 'rb') as f:
    model = torch.load(f)

# Run on test data. 
with torch.no_grad():
    test_loss = evaluate(test_iter)
print('=' * 89)
print(f'| End of training | test loss {test_loss:5.2f} | test ppl {math.exp(test_loss):8.2f}')
print('=' * 89)


### Text generation

In [None]:
model.eval()

x = torch.randint(ntokens, (1, 1), dtype=torch.long).to(device)  # [sequence length, batch size]
temperature = 1  # Higher will increase diversity
text = ""
with torch.no_grad():
    for i in range(1000):
        output = model(x)  # [sequence length, 1, ntokens]
        # Softmax without normalizing, using only the prediction for the last position
        word_weights = (output[-1] / temperature).exp().cpu().squeeze()
        word_idx = torch.multinomial(word_weights, 1)[0]
        word_tensor = torch.tensor([[word_idx]], dtype=torch.long).to(device)
        x = torch.cat([x, word_tensor], dim=0)  # Append along the sequence dimension
        x = x[-35:]  # Keep at most the last 35 tokens (the bptt length) as context
        word = dataset.vocab.itos[word_idx]
        text += word + ('\n' if i % 20 == 19  or word == '<eos>' else ' ')

print(text)
