<a href="https://colab.research.google.com/github/soutrik71/pytorch_classics/blob/main/notebook/LanguageModel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Language Modelling with LSTMs in PyTorch

In [None]:
try:
  import google.colab
  !nvidia-smi -L
  !pip install datasets
  google.colab.drive.mount('/content/drive/')
  %cd ../content/drive/Othercomputers/MyMacBookPro/Pytorch/Seq2Seq/LanguageModels
except:
  IN_COLAB = False

GPU 0: Tesla T4 (UUID: GPU-d395cf08-6f98-fc76-dada-cdfaf9c0c14c)
Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
[Errno 2] No such file or directory: '../content/drive/Othercomputers/MyMacBookPro/Pytorch/Seq2Seq/LanguageModels'
/content/drive/Othercomputers/MyMacBookPro/Pytorch/Seq2Seq/LanguageModels


In this notebook, we will implement a language model using LSTMs in PyTorch! This will be our pipeline:

![alt text](https://i.imgur.com/bRn9ztA.png "Title")


### Imports
We will use the datasets library from HuggingFace to load and map over the dataset, Torchtext to tokenize the dataset and construct the vocabulary and PyTorch to define, train and evaluate the model. The purpose of tqdm is just to show progress bars during training and evaluation.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

import math

import torchtext

import datasets

from tqdm import tqdm

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.manual_seed(0)

<torch._C.Generator at 0x128b72e50>

Besides the imports, we set a device variable that we will use later to ensure that computation takes place on the GPU when one is available, and we set a seed value so that we can reproduce the results whenever we need to.

### Load the dataset

In [None]:
### Load the Dataset
dataset = datasets.load_dataset('wikitext', 'wikitext-2-raw-v1')
print(dataset)
print(dataset['train'][88]['text'])

Reusing dataset wikitext (/Users/essam/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
100%|██████████| 3/3 [00:00<00:00, 681.23it/s]

DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['text'],
        num_rows: 36718
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})
 This ammunition , and that which I brought with me , was rapidly prepared for use at the Laboratory established at the Little Rock Arsenal for that purpose . As illustrating as the pitiful scarcity of material in the country , the fact may be stated that it was found necessary to use public documents of the State Library for cartridge paper . Gunsmiths were employed or conscripted , tools purchased or impressed , and the repair of the damaged guns I brought with me and about an equal number found at Little Rock commenced at once . But , after inspecting the work and observing the spirit of the men I decided that a garrison 500 strong could hold out against Fitch and that I would lead the remainder - about 1500 - to Gen 'l Rust as 




To load the dataset, we use the load_dataset() function from datasets. There are two WikiText datasets: the larger WikiText-103 and the smaller WikiText-2. For each there is a raw version and a slightly preprocessed version. We have chosen the raw version of the smaller dataset because we will take care of preprocessing ourselves later.

The output from load_dataset has the train, test and validation sets already split for us. To print an example we first choose one of the three sets, then the row that corresponds to the example, and then the name of the feature (column) that we would like to print. Here, the dataset always has one column, 'text', which corresponds to a paragraph or piece of text from Wikipedia. If you change the index you might notice that sometimes there is no paragraph but rather an empty string, so we will have to take care of that later.

### Tokenize the dataset

In [None]:
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')
tokenize_data = lambda example, tokenizer: {'tokens': tokenizer(example['text'])}
tokenized_dataset = dataset.map(tokenize_data, remove_columns=['text'], fn_kwargs={'tokenizer': tokenizer})
print(tokenized_dataset['train'][88]['tokens'])

Loading cached processed dataset at /Users/essam/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-20adb10e6b5bb205.arrow
Loading cached processed dataset at /Users/essam/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-33194e090c587500.arrow
Loading cached processed dataset at /Users/essam/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-fe8841d3cc3d3254.arrow


['this', 'ammunition', ',', 'and', 'that', 'which', 'i', 'brought', 'with', 'me', ',', 'was', 'rapidly', 'prepared', 'for', 'use', 'at', 'the', 'laboratory', 'established', 'at', 'the', 'little', 'rock', 'arsenal', 'for', 'that', 'purpose', '.', 'as', 'illustrating', 'as', 'the', 'pitiful', 'scarcity', 'of', 'material', 'in', 'the', 'country', ',', 'the', 'fact', 'may', 'be', 'stated', 'that', 'it', 'was', 'found', 'necessary', 'to', 'use', 'public', 'documents', 'of', 'the', 'state', 'library', 'for', 'cartridge', 'paper', '.', 'gunsmiths', 'were', 'employed', 'or', 'conscripted', ',', 'tools', 'purchased', 'or', 'impressed', ',', 'and', 'the', 'repair', 'of', 'the', 'damaged', 'guns', 'i', 'brought', 'with', 'me', 'and', 'about', 'an', 'equal', 'number', 'found', 'at', 'little', 'rock', 'commenced', 'at', 'once', '.', 'but', ',', 'after', 'inspecting', 'the', 'work', 'and', 'observing', 'the', 'spirit', 'of', 'the', 'men', 'i', 'decided', 'that', 'a', 'garrison', '500', 'strong', 'co

The next step is to tokenize every sequence in the dataset. To do this we get a tokenizer from torchtext, as in line 1. We then define a function that, given an example with feature 'text', returns an example with feature 'tokens' that contains the tokenization of the text. So if the text was "It makes sense." then it will return ["it", "makes", "sense", "."].

We didn't implement our own tokenizer because built-in ones like this are usually carefully designed to deal with special cases.

In the third line, we use the map function from the datasets library to apply the tokenize_data function on each example. map will need to pass the example along with the tokenizer to tokenize_data so we pass the tokenizer in fn_kwargs as well. At this point, we no longer need the text column so we drop it.

This step is essential because the LSTM/RNN considers the sequence token by token. So it has to be broken down into tokens that we can iterate over.
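
To make the tokenization step concrete, here is a rough pure-Python stand-in for a lowercase word-and-punctuation tokenizer. It is only an illustration: `simple_tokenize` is a hypothetical helper, and its regular expression does not reproduce every rule of torchtext's basic_english tokenizer.

```python
import re

def simple_tokenize(text):
    """Hypothetical, simplified tokenizer: lowercase, then split into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = simple_tokenize("It makes sense.")
print(tokens)
```

The result is a plain list of strings, which is exactly what the LSTM pipeline needs before the tokens are turned into indices.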

### Construct the Vocabulary

In [None]:
vocab = torchtext.vocab.build_vocab_from_iterator(tokenized_dataset['train']['tokens'], min_freq=3)
vocab.insert_token('<unk>', 0)            # insert (push) unknown token
vocab.insert_token('<eos>', 1)            # insert end of sentence token (eop)
vocab.set_default_index(vocab['<unk>'])   # So that when a token isn't found, it returns the index of the unk token
print(len(vocab))                         # total number words in the vocabulary
print(vocab.get_itos()[:10])              # first 10 tokens converted to strings from tokens

29473
['<unk>', '<eos>', 'the', ',', '.', 'of', 'and', 'in', 'to', 'a']


In the first line we tell torchtext to add to the vocabulary any word that occurs at least three times in the dataset. Otherwise the vocabulary would be too big (remember that its length is the number of neurons in the output classification layer), and some words occur only rarely. We then manually add an unk token and set it as the default index, so that whenever we request the index of a word that the vocabulary doesn't have, we get unk.

In line 3, we also insert an eos token. We will later insert it at the end of each sequence so the model learns to produce it when the sequence it's generating should end.

As shown in the cell output, the vocabulary length is about 30K and we notice that indeed unk and eos are in the vocabulary by printing the first 10 elements in it. 'itos' refers to 'index to string'.
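
The behaviour of the vocabulary (a frequency threshold, special tokens at the front, and a default index for unknown words) can be sketched in plain Python. `ToyVocab` is a hypothetical class written only for illustration; it is not how torchtext implements its vocabulary, and the tiny corpus is made up.

```python
from collections import Counter

class ToyVocab:
    """Hypothetical minimal vocabulary: specials first, then tokens by descending frequency."""
    def __init__(self, token_lists, min_freq, specials=("<unk>", "<eos>")):
        counts = Counter(tok for toks in token_lists for tok in toks)
        kept = [tok for tok, c in counts.most_common() if c >= min_freq]
        self.itos = list(specials) + kept                      # index -> string
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}  # string -> index
        self.default_index = self.stoi["<unk>"]

    def __getitem__(self, token):
        # Unknown or too-rare tokens fall back to the index of <unk>
        return self.stoi.get(token, self.default_index)

    def __len__(self):
        return len(self.itos)

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
toy_vocab = ToyVocab(corpus, min_freq=2)
print(toy_vocab.itos)
print(toy_vocab["the"], toy_vocab["dog"])
```

Here "dog" and "ran" occur only once, so they are left out of the vocabulary and looking them up returns the index of the unk token.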

### Construct the Data Loaders

A dataloader in PyTorch is an object that, given a dataset, gives you a way to iterate over batches of it. In a batch, all the examples are processed in parallel.

In [None]:
def get_data(dataset, vocab, batch_size):
    data = []                                                       # Merge everything into one gigantic document that we wish to model (all the tokens)
    for example in dataset:
        if example['tokens']:                                       # if the example has tokens (not empty)
            tokens = example['tokens'] + ['<eos>']                  # add <eos> at the end of the piece of text
            tokens = [vocab[token] for token in tokens]             # convert tokens to indices
            data.extend(tokens)                                     # append the indices to data
    data = torch.LongTensor(data)                                   # convert data to tensor
    num_batches = data.shape[0] // batch_size
    data = data[:num_batches * batch_size]                         # We only need the first num_batches * batch_size elements
    data = data.view(batch_size, num_batches)            # Perceive the data as a matrix of batch_size rows and num_batches columns
    return data

#Notice that train_data[:, i] is the batch of next tokens for train_data[:, i - 1]

Here we define a function that does 4 things:

1- It appends each sequence of tokenized text with an eos token to mark its end.

2- It encodes each token to a numerical value equal to its index in the vocabulary. Note that tokens that occurred fewer than three times in the dataset will map to the unknown token.

3- It combines all the numerical sequences into a list (1D Tensor).

4- It reshapes it into a 2D tensor of dimensions [batch_size, num_batches]

To clarify further, suppose the dataset involved only three pieces of text from Wiki that are:

"the more you read, the more things you will know and understand",
"curiosity is the wick in the candle of learning",
"eventually things start making sense"

then this function given a batch size of 5 returns a 2D tensor data of the form

![alt text](https://i.imgur.com/Hxg89f5.png "Title")

But with many more columns, and with numbers instead of words. We will see in a little bit how we will use this to train our model. For now let's apply the function on the train, test and validation sets.

In [None]:
batch_size = 128
train_data = get_data(tokenized_dataset['train'], vocab, batch_size)
valid_data = get_data(tokenized_dataset['validation'], vocab, batch_size)
test_data = get_data(tokenized_dataset['test'], vocab, batch_size)
# We have 16214 batches, each of 128 words
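
The reshaping done by get_data can be checked by hand on the three toy sentences from earlier. The sketch below is a list-based approximation under two simplifying assumptions: it splits on whitespace instead of using the torchtext tokenizer, and it keeps the words as strings instead of encoding them as indices.

```python
sentences = [
    "the more you read , the more things you will know and understand",
    "curiosity is the wick in the candle of learning",
    "eventually things start making sense",
]
batch_size = 5

# Append <eos> to each piece of text and merge everything into one long list
stream = []
for sentence in sentences:
    stream.extend(sentence.split() + ["<eos>"])

# Drop the remainder and view the stream as batch_size rows of num_batches columns
num_batches = len(stream) // batch_size
stream = stream[:num_batches * batch_size]
rows = [stream[r * num_batches:(r + 1) * num_batches] for r in range(batch_size)]

for row in rows:
    print(row)
```

Each row continues where the previous one stopped, so reading a row from left to right follows the original text. That is why column i + 1 holds the next tokens for column i.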

### Define the Model

![alt text](https://i.imgur.com/ThXeq5j.png "Title")


The model we'll build will correspond to the diagram above. The three key components are an embedding layer, the LSTM layers, and the classification layer. We already know the purpose of the LSTM and classification layers. The purpose of the embedding layer is to map each word (given as an index) into a vector of E dimensions that further layers can learn from. Indices, or equivalently one-hot vectors, are considered poor representations because they assume that words have no relation to each other. This mapping is also learnt during training.

As a form of regularization, we apply dropout to the output of the embedding layer, between the LSTM layers, and to the LSTM output before the classification layer.
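
Before looking at the full class, it may help to trace the tensor shapes through the three components. The following is a minimal sketch with small, made-up sizes; dropout is left out because it does not change any shape.

```python
import torch
import torch.nn as nn

# Hypothetical toy sizes, chosen only for illustration
vocab_size, embedding_dim, hidden_dim, num_layers = 100, 8, 16, 2
batch_size, seq_len = 4, 5

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, batch_first=True)
fc = nn.Linear(hidden_dim, vocab_size)

src = torch.randint(0, vocab_size, (batch_size, seq_len))   # [N, L] token indices
embedded = embedding(src)                                   # [N, L, E]
output, (hidden, cell) = lstm(embedded)                     # output: [N, L, H]
logits = fc(output)                                         # [N, L, vocab_size]

print(embedded.shape, output.shape, hidden.shape, logits.shape)
```

Note that the hidden and cell states have shape [num_layers, N, H] regardless of batch_first, which is the shape an all-zeros initial state has to match.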

In [None]:
class LSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers, dropout_rate, tie_weights):
        super().__init__()
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim
        self.embedding_dim = embedding_dim

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, dropout=dropout_rate, batch_first=True)
        self.dropout = nn.Dropout(dropout_rate)
        self.fc = nn.Linear(hidden_dim, vocab_size)

        if tie_weights:
            assert embedding_dim == hidden_dim, 'If tying weights then embedding_dim must equal hidden_dim'
            self.embedding.weight = self.fc.weight
        self.init_weights()

    def forward(self, src, hidden):
        embedding = self.dropout(self.embedding(src))
        output, hidden = self.lstm(embedding, hidden)
        output = self.dropout(output)
        prediction = self.fc(output)
        return prediction, hidden

    def init_weights(self):
        init_range_emb = 0.1
        init_range_other = 1/math.sqrt(self.hidden_dim)
        self.embedding.weight.data.uniform_(-init_range_emb, init_range_emb)
        self.fc.weight.data.uniform_(-init_range_other, init_range_other)
        self.fc.bias.data.zero_()
        # Initialize every LSTM weight matrix (input-to-hidden and hidden-to-hidden,
        # for each layer) in place; the biases keep their default initialization
        for name, param in self.lstm.named_parameters():
            if 'weight' in name:
                nn.init.uniform_(param, -init_range_other, init_range_other)

    def init_hidden(self, batch_size, device):
        hidden = torch.zeros(self.num_layers, batch_size, self.hidden_dim).to(device)
        cell = torch.zeros(self.num_layers, batch_size, self.hidden_dim).to(device)
        return hidden, cell

    # We don't learn the hidden state so we can detach it from the computation graph
    def detach_hidden(self, hidden):
        hidden, cell = hidden
        hidden = hidden.detach()
        cell = cell.detach()
        return hidden, cell



There are a few things to highlight about the implementation of the model above:

1- The tie_weights argument. The purpose of this is to make the embedding layer share weights with the output layer. This helps reduce the number of parameters because it has been shown that the output weights also learn word embeddings in some sense. Note that for this to work the hidden and embedding layers must be of the same size.

2- The self.init_weights() call. We will initialize the weights as in this paper. It suggests initializing the embedding weights uniformly in the range [-0.1, 0.1] and all other weights uniformly in the range [-1/sqrt(H), 1/sqrt(H)], where H is the hidden dimension. To apply this to the LSTM, we go through its layers and initialize their input-to-hidden and hidden-to-hidden weights.

3- We also implement a function to set the LSTM's hidden and cell state to zero.

4- Finally, the last function we implement under the LSTM class is detach_hidden. We will need it during training to tell PyTorch explicitly not to backpropagate through hidden states carried over from previous batches. Don't worry about it for now.
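
The effect of weight tying from point 1 can be seen on a small example. `TiedHead` is a hypothetical toy module with made-up sizes, containing only the embedding and output layers; it is a sketch of the idea, not the model above.

```python
import torch.nn as nn

class TiedHead(nn.Module):
    """Hypothetical toy module: an embedding and an output layer that can share one weight matrix."""
    def __init__(self, vocab_size, dim, tie_weights):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)   # weight shape: [vocab_size, dim]
        self.fc = nn.Linear(dim, vocab_size)             # weight shape: [vocab_size, dim]
        if tie_weights:
            self.embedding.weight = self.fc.weight       # both layers now use the same Parameter

def count_params(module):
    # parameters() yields a shared Parameter only once
    return sum(p.numel() for p in module.parameters())

untied = TiedHead(vocab_size=1000, dim=32, tie_weights=False)
tied = TiedHead(vocab_size=1000, dim=32, tie_weights=True)
print(count_params(untied), count_params(tied))
```

The two weight matrices can only be shared because they have the same shape, which is why the hidden and embedding dimensions must match when the output layer sits directly on top of the LSTM.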

### Tune Hyperparameters

In [None]:
vocab_size = len(vocab)
embedding_dim = 1024             # 400 in the paper
hidden_dim = 1024                # 1150 in the paper
num_layers = 2                   # 3 in the paper
dropout_rate = 0.65
tie_weights = True
lr = 1e-3                        # They used 30 and a different optimizer

Notice that we set the embedding and hidden dimensions as the same value because we will use weight tying.


### Initialize the Model

Here we initialize the model, optimizer and loss criterion. We also count the trainable parameters, which come to about 47M.

In [None]:
model = LSTM(vocab_size, embedding_dim, hidden_dim, num_layers, dropout_rate, tie_weights).to(device)
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'The model has {num_params:,} trainable parameters')

The model has 47,003,425 trainable parameters


Now we are ready to start training. But before we do let's go back and remember the structure of our data.
Recall, we had a [batch_size=128, num_batches=16214] tensor that looks something like

![alt text](https://i.imgur.com/ldrE2ff.png "Title")

Recall as well that the LSTM takes as input a tensor of shape [N, L, E], where N is the batch size, L is the sequence length and E is the length of each element in the sequence (the embedding length). Thus, we need to decide on a sequence length (L), break the dataset into chunks of that length and feed them one by one. If we take L=4 for the table above, then the model is trained on all the data in 6 iterations, because each color corresponds to a "batch of sequences", which is one forward pass through the model. Essentially, we need to go from batches of tokens to batches of sequences of tokens.

![alt text](https://i.imgur.com/D5eXqql.png "Title")

Yes, this means that some of the sequences fed to the model may involve parts from different sequences in the original dataset, or be a subset of one (depending on the sequence length L). For this reason we will later reset the hidden state only once per epoch. This is like assuming that the next batch of sequences is always a continuation of the previous one in the original dataset.

What we did is called "using a fixed backpropagation through time window" and it's just one of the ways to deal with the problem that we can't have a batch of sequences with unequal lengths.

Because we haven't yet broken the dataset into "batches of L-sequences", we will define a function that, given the index of the first batch of tokens in a sequence, returns the corresponding batch of sequences.


In [None]:
def get_batch(data, seq_len, num_batches, idx):
    src = data[:, idx:idx+seq_len]
    target = data[:, idx+1:idx+seq_len+1]             # The target is the src shifted by one batch
    return src, target
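
To see what this slicing produces, here is a list-based analogue of get_batch applied to a single row. `get_chunk` and the row of 25 token ids are hypothetical and used only for illustration.

```python
def get_chunk(row, seq_len, idx):
    """List-based analogue of get_batch for a single row of token ids."""
    src = row[idx:idx + seq_len]
    target = row[idx + 1:idx + seq_len + 1]   # same window, shifted one position to the right
    return src, target

row = list(range(25))   # hypothetical row of 25 token ids
seq_len = 4

# The last token can only ever be a target, hence len(row) - 1
chunks = [get_chunk(row, seq_len, idx) for idx in range(0, len(row) - 1, seq_len)]
for src, target in chunks:
    print(src, "->", target)
```

With 25 tokens and a sequence length of 4, the row is covered in 6 chunks, and every target is simply the token that follows the corresponding source token.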

### Training and Evaluation

In [None]:
def train(model, data, optimizer, criterion, batch_size, seq_len, clip, device):

    epoch_loss = 0
    model.train()
    # drop all batches that are not a multiple of seq_len
    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches -1) % seq_len]
    num_batches = data.shape[-1]

    hidden = model.init_hidden(batch_size, device)

    for idx in tqdm(range(0, num_batches - 1, seq_len), desc='Training: ',leave=False):  # The last batch can't be a src
        optimizer.zero_grad()
        hidden = model.detach_hidden(hidden)

        src, target = get_batch(data, seq_len, num_batches, idx)
        src, target = src.to(device), target.to(device)
        batch_size = src.shape[0]
        prediction, hidden = model(src, hidden)                 # model output

        prediction = prediction.reshape(batch_size * seq_len, -1)
        target = target.reshape(-1)
        loss = criterion(prediction, target)

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item() * seq_len
    return epoch_loss / num_batches

Here, line 4 sets the model to training mode (so that dropout is active). Lines 6 through 8 trim the data so that it can be broken down into chunks of length seq_len (L).

In the for loop, we consider the dataset at indices [0, seq_len, 2*seq_len, ...]. Each index is given to the get_batch function, which returns the corresponding batch of sequences for the input (src) and the labels (target), as in line 16. Both have dimensions [batch size, seq_len].
In the first two lines of the for loop we zero the gradients due to the previous batch and detach the hidden state carried over from it.

The prediction we get in line 19 has dimensions [batch size, seq_len, vocab size]. We reshape it into [batch size * seq_len, vocab size] and flatten the target to [batch size * seq_len], because these are the shapes the loss function expects.
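
The reshaping can be tried in isolation. The sketch below uses made-up toy sizes and all-zero logits (a model that guesses uniformly), for which the cross-entropy per token should equal the natural log of the vocabulary size.

```python
import math
import torch
import torch.nn as nn

batch_size, seq_len, vocab_size = 2, 3, 10                     # hypothetical toy sizes
prediction = torch.zeros(batch_size, seq_len, vocab_size)      # all-zero logits: a uniform guess
target = torch.randint(0, vocab_size, (batch_size, seq_len))

criterion = nn.CrossEntropyLoss()
flat_prediction = prediction.reshape(batch_size * seq_len, -1)   # [N * L, vocab_size]
flat_target = target.reshape(-1)                                 # [N * L]
loss = criterion(flat_prediction, flat_target)                   # mean over all N * L tokens

print(loss.item(), math.log(vocab_size))
```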

In line 25 we compute the gradients in the network. We then rescale them so that their overall norm does not exceed 'clip', which sidesteps exploding gradients. In line 27 we update the weights and in line 28 we accumulate the loss. loss.item() is the total loss divided by the batch size and the sequence length, i.e. the average loss per token. We multiply it by seq_len and divide the sum by the number of token columns at the end, so the value returned is still an average loss per token, which is what we later exponentiate to get the perplexity.
If you get an Nvidia T4 on Google Colab, training takes about two and a half hours under the hyperparameter settings above.
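
The gradient clipping step can be checked on a single hand-made gradient. This is a minimal sketch with a made-up two-element parameter: clip_grad_norm_ scales the whole gradient down when its norm exceeds the threshold, rather than cutting off individual entries.

```python
import torch

clip = 0.25
param = torch.nn.Parameter(torch.zeros(2))       # hypothetical stand-in for the model parameters
param.grad = torch.tensor([3.0, 4.0])            # hand-made gradient with L2 norm 5

norm_before = torch.nn.utils.clip_grad_norm_([param], clip)   # returns the norm before clipping
norm_after = param.grad.norm()

print(float(norm_before), float(norm_after), param.grad)
```

The direction of the gradient is preserved; only its length is reduced to (approximately) the clip value.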

In [None]:
def evaluate(model, data, criterion, batch_size, seq_len, device):

    epoch_loss = 0
    model.eval()
    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches -1) % seq_len]
    num_batches = data.shape[-1]

    hidden = model.init_hidden(batch_size, device)

    with torch.no_grad():
        for idx in range(0, num_batches - 1, seq_len):
            hidden = model.detach_hidden(hidden)
            src, target = get_batch(data, seq_len, num_batches, idx)
            src, target = src.to(device), target.to(device)
            batch_size= src.shape[0]

            prediction, hidden = model(src, hidden)
            prediction = prediction.reshape(batch_size * seq_len, -1)
            target = target.reshape(-1)

            loss = criterion(prediction, target)
            epoch_loss += loss.item() * seq_len
    return epoch_loss / num_batches

The evaluation loop is similar to the training loop except that we no longer need to backprop or keep track of gradients.

Now we need to call the two functions.

In [None]:
n_epochs = 50
seq_len = 50
clip = 0.25
saved = True

lr_scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=0)

if saved:
    model.load_state_dict(torch.load('best-val-lstm_lm.pt',  map_location=device))
    test_loss = evaluate(model, test_data, criterion, batch_size, seq_len, device)
    print(f'Test Perplexity: {math.exp(test_loss):.3f}')
else:
    best_valid_loss = float('inf')

    for epoch in range(n_epochs):
        train_loss = train(model, train_data, optimizer, criterion, batch_size, seq_len, clip, device)
        valid_loss = evaluate(model, valid_data, criterion, batch_size, seq_len, device)

        lr_scheduler.step(valid_loss)

        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(model.state_dict(), 'best-val-lstm_lm.pt')

        print(f'\tTrain Perplexity: {math.exp(train_loss):.3f}')
        print(f'\tValid Perplexity: {math.exp(valid_loss):.3f}')



Test Perplexity: 92.855


Here we use ReduceLROnPlateau to halve the learning rate after every epoch in which the validation loss does not improve.
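
The scheduler's behaviour can be observed without training anything. In this sketch the single parameter and the sequence of validation losses are made up; only the scheduler settings match the cell above.

```python
import torch
import torch.optim as optim

param = torch.nn.Parameter(torch.zeros(1))       # hypothetical stand-in for model parameters
optimizer = optim.Adam([param], lr=1e-3)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=0)

lrs = []
for valid_loss in [1.0, 0.9, 0.95, 0.8]:         # made-up validation losses
    scheduler.step(valid_loss)
    lrs.append(optimizer.param_groups[0]['lr'])

print(lrs)
```

The learning rate stays put while the loss improves and is halved right after the third value, the first one that is worse than the best loss seen so far.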



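We also save the model with the lowest validation loss and report the perplexity, which is a direct function of the loss (its exponential) and measures how confident the model is: the lower the perplexity, the more probability the model assigns to the correct next token.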
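
As a quick sanity check on what perplexity means, here is a small sketch that computes it from the probabilities a model assigns to the true tokens. The `perplexity` helper and the numbers are hypothetical.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(average negative log-probability assigned to the true tokens)."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

vocab_size = 1000                                   # hypothetical vocabulary size
uniform = perplexity([1 / vocab_size] * 20)         # a model that guesses uniformly
confident = perplexity([0.5] * 20)                  # a model that gives each true token probability 0.5

print(uniform, confident)
```

A uniform guess over the vocabulary has a perplexity equal to the vocabulary size, so a perplexity can be read loosely as the number of equally likely words the model is choosing between at each step.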

In [None]:
def generate(prompt, max_seq_len, temperature, model, tokenizer, vocab, device, seed=None):
    if seed is not None:
        torch.manual_seed(seed)
    model.eval()
    tokens = tokenizer(prompt)
    indices = [vocab[t] for t in tokens]
    batch_size = 1
    hidden = model.init_hidden(batch_size, device)
    with torch.no_grad():
        for i in range(max_seq_len):
            src = torch.LongTensor([indices]).to(device)
            prediction, hidden = model(src, hidden)
            probs = torch.softmax(prediction[:, -1] / temperature, dim=-1)  # scale the last-step logits by the temperature, then apply softmax
            prediction = torch.multinomial(probs, num_samples=1).item()     # take one sample from the distribution

            while prediction == vocab['<unk>']:
                prediction = torch.multinomial(probs, num_samples=1).item()

            if prediction == vocab['<eos>']:
                break

            indices.append(prediction)

    itos = vocab.get_itos()
    tokens = [itos[i] for i in indices]
    return tokens

This is the last step in our pipeline!

Here we take the prompt, tokenize and encode it, and feed it into the model to get the predictions. These are logits (remember that the Softmax is applied inside the loss function), so we apply Softmax ourselves. We only need the output for the last word in the sequence, which is the current one in the loop, hence the prediction[:, -1] indexing.

We divide the logits by a temperature value to alter the Softmax probability distribution. I recommend checking [this](https://lukesalamone.github.io/posts/what-is-temperature/) to understand the effect.
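
The effect of the temperature is easy to see on a handful of numbers. Below is a minimal pure-Python sketch; `softmax_with_temperature` is a hypothetical helper and the logits are made up.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                  # made-up logits for three candidate tokens
results = {}
for temperature in [0.5, 1.0, 2.0]:
    results[temperature] = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in results[temperature]])
```

Temperatures below 1 sharpen the distribution, so sampling becomes more conservative, while temperatures above 1 flatten it and make unlikely tokens more probable.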

Once we have the Softmax distribution, we randomly sample it to make our prediction on the next word. If we get unk then we give that another try.

Once we get eos we stop predicting.

We decode the prediction back to strings in lines 24 and 25.

In [None]:
prompt = 'Think about'
max_seq_len = 30
seed = 0

# convert the code above into a for loop
temperatures = [0.5, 0.7, 0.75, 0.8, 1.0]
for temperature in temperatures:
    generation = generate(prompt, max_seq_len, temperature, model, tokenizer, vocab, device, seed)
    print(str(temperature)+'\n'+' '.join(generation)+'\n')

0.5
think about the world , but the case of the rise and power is not .

0.7
think about the exact thing .

0.75
think about the exact thing .

0.8
think about the exact thing .

1.0
think about trying by a authorities who had already an animal , very few believe . he suggest , later considering his nickname she was watching constant behavior . as the most



That's it! Congratulations on making it to the end of the story. Hope you learnt more about implementing NLP models in PyTorch. Till next time, au revoir.