<h1 style="color:rgb(0,120,170)">Hands-on AI II</h1>
<h2 style="color:rgb(0,120,170)">Unit 5 – Language Modeling with LSTM </h2>

<b>Authors:</b> N. Rekabsaz, B. Schäfl, S. Lehner, J. Brandstetter, E. Kobler, A. Schörgenhumer<br>
<b>Date:</b> 16-05-2023

This file is part of the "Hands-on AI II" lecture material. The following copyright statement applies to all code within this file.

<b>Copyright statement:</b><br>
This material, no matter whether in printed or electronic form, may be used for personal and non-commercial educational use only. Any reproduction of this material, no matter whether as a whole or in parts, no matter whether in printed or in electronic form, requires explicit prior acceptance of the authors.

<h2>Table of contents</h2>
<ol>
    <a href="#lm"><li style="font-size:large;font-weight:bold">Language Model Training and Evaluation</li></a>
    <ol style="margin-bottom:15px">
        <a href="#lm-parameters"><li style="font-size:medium">Defining Parameters</li></a>
        <a href="#lm-data"><li style="font-size:medium">Data & Dictionary Preparation</li></a>
        <a href="#lm-model"><li style="font-size:medium">Model Definition</li></a>
        <a href="#lm-training"><li style="font-size:medium">Training & Evaluation</li></a>
    </ol>
    <a href="#generation"><li style="font-size:large;font-weight:bold">Language Generation</li></a>
    
</ol>


<h3 style="color:rgb(0,120,170)">How to use this notebook</h3>

This notebook is designed to run from start to finish. There are different tasks (displayed in <span style="color:rgb(248,138,36)">orange boxes</span>) which might require small code modifications. Most of the functions used are imported from the file <code>u5_utils.py</code>, which can be treated as a black box. For a deeper understanding, however, you can look at the implementations of the helper functions. To run this notebook, the packages imported at the beginning of <code>u5_utils.py</code> need to be installed.

In [1]:
# Import pre-defined utilities specific to this notebook.
import u5_utils as u5

# Import additional utilities needed in this notebook.
import numpy as np
import torch
import os
import time
import math
import ipdb

# Setup Jupyter notebook (warning: this may affect all Jupyter notebooks running on the same Jupyter server).
u5.setup_jupyter()

<h3 style="color:rgb(0,120,170)">Module versions</h3>

As mentioned in the introductory slides, specific minimum versions of Python itself as well as of the modules used are recommended.

In [2]:
u5.check_module_versions()

Installed Python version: 3.9 (✓)
Installed numpy version: 1.23.5 (✓)
Installed pandas version: 1.4.4 (✓)
Installed PyTorch version: 2.0.1 (✓)


<a name="lm"></a><h2>Language Model Training and Evaluation</h2>
<p><p>In this section, we create an LSTM-based language model, train it on the words of a text corpus, and evaluate it on a hold-out set.
Specifically, we use the Penn parsed corpus:

<center><cite>
Seth Kulick, Anthony Kroch, and Beatrice Santorini. 2014. The Penn Parsed Corpus of Modern British English: First Parsing Results and Analysis. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 662–667, Baltimore, Maryland. Association for Computational Linguistics.
</cite></center>

</p></p>

<a name="lm-parameters"></a><h3 style="color:rgb(0,120,170)">Defining Parameters</h3>


In [3]:
# Input & output parameters
data_path = os.path.join("resources", "penn")
save_path = "model.pt" # path to save the final model

# Training & evaluation parameters
train_batch_size = 32 # batch size for training
eval_batch_size = 32 # batch size for validation/test
max_seq_len = 40 # sequence length

# Random seed to facilitate reproducibility
torch.manual_seed(42)

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
print("Device:", device)

Device: cuda


<a name="lm-data"></a><h3 style="color:rgb(0,120,170)">Data & Dictionary Preparation</h3>
<p><p>
The train/val/test text corpora are loaded and tokenized. The provided data files are already pre-processed (cleaned). After loading, we create a dictionary based on the <i>training data</i>, which maps every word to a wordID. We then use the dictionary to map all the words in the data files to streams of wordIDs. Finally, each stream is split into a set of parallel sequences, one per batch element, according to the given batch size.</p></p>
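
The word-to-wordID mapping described above can be sketched in plain Python. This is a minimal illustration only: <code>TinyDictionary</code>, <code>tokenize</code> and the <code>&lt;unk&gt;</code> fallback are hypothetical stand-ins, and the actual <code>u5.Dictionary</code> and <code>u5.Corpus</code> classes may work differently.

```python
# Hypothetical stand-in for u5.Dictionary; the real class may differ.
class TinyDictionary:
    def __init__(self):
        self.word2idx = {}   # word -> wordID
        self.idx2word = []   # wordID -> word

    def add_word(self, word):
        if word not in self.word2idx:
            self.word2idx[word] = len(self.idx2word)
            self.idx2word.append(word)
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)


def tokenize(line):
    # One sentence per line; the sentence end is marked explicitly.
    return line.split() + ["<eos>"]


def words_to_ids(lines, vocab):
    # Assumption: words missing from the dictionary fall back to "<unk>".
    unk = vocab.word2idx["<unk>"]
    return [vocab.word2idx.get(w, unk) for line in lines for w in tokenize(line)]


train_lines = ["the cat sat", "the dog sat down"]
vocab = TinyDictionary()
vocab.add_word("<unk>")
for line in train_lines:            # the dictionary is built from training data only
    for word in tokenize(line):
        vocab.add_word(word)

train_ids = words_to_ids(train_lines, vocab)
valid_ids = words_to_ids(["the bird sat"], vocab)
print(len(vocab))    # 7
print(train_ids)     # [1, 2, 3, 4, 1, 5, 3, 6, 4]
print(valid_ids)     # [1, 0, 3, 4]  ("bird" was never seen during training)
```

Because the dictionary is filled from the training corpus alone, the validation and test streams can only contain wordIDs that the model has an embedding for.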

In [4]:
train_corpus = u5.Corpus(os.path.join(data_path, "train.txt"))
valid_corpus = u5.Corpus(os.path.join(data_path, "valid.txt"))
test_corpus = u5.Corpus(os.path.join(data_path, "test.txt"))

dictionary = u5.Dictionary()
train_corpus.fill_dictionary(dictionary)
ntokens = len(dictionary)
print(f"Number of tokens in dictionary {ntokens}")

Number of tokens in dictionary 10001


In [5]:
# Some samples in the dictionary ...
print(f"wordID of a word in the dictionary: {dictionary.word2idx['book']}")
print(f"A word in the dictionary based on its wordID: '{dictionary.idx2word[854]}'")

wordID of a word in the dictionary: 1203
A word in the dictionary based on its wordID: 'says'


In [6]:
train_data = train_corpus.words_to_ids(dictionary)
print(f"Train data: number of tokens {len(train_data)}")

valid_data = valid_corpus.words_to_ids(dictionary)
print(f"Validation data: number of tokens {len(valid_data)}")

test_data = test_corpus.words_to_ids(dictionary)
print(f"Test data: number of tokens {len(test_data)}")

print()
train_data_splits = u5.batchify(train_data, train_batch_size, device)
print(f"Train data split shape: {train_data_splits.shape}")

val_data_splits = u5.batchify(valid_data, eval_batch_size, device)
print(f"Validation data split shape: {val_data_splits.shape}")

test_data_splits = u5.batchify(test_data, eval_batch_size, device)
print(f"Test data batchified shape: {test_data_splits.shape}")

Train data: number of tokens 929589
Validation data: number of tokens 73760
Test data: number of tokens 82430

Train data split shape: torch.Size([29049, 32])
Validation data split shape: torch.Size([2305, 32])
Test data batchified shape: torch.Size([2575, 32])
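
The shapes printed above are consistent with a split that trims the stream to a multiple of the batch size and arranges it as one column per sequence. The sketch below shows one way to do this; <code>batchify_sketch</code> is a hypothetical name, and the column layout is an assumption borrowed from common PyTorch language-modeling examples, not a description of <code>u5.batchify</code> itself.

```python
import torch


def batchify_sketch(ids, batch_size):
    # Drop the tail that does not fill a complete row of `batch_size` streams.
    n_rows = ids.size(0) // batch_size
    ids = ids[: n_rows * batch_size]
    # Cut the stream into `batch_size` contiguous chunks and lay them out as columns.
    return ids.view(batch_size, n_rows).t().contiguous()


stream = torch.arange(26)             # stand-in for a stream of wordIDs
splits = batchify_sketch(stream, 4)
print(splits.shape)                   # torch.Size([6, 4])
print(splits[:, 0].tolist())          # [0, 1, 2, 3, 4, 5]  (first sequence, read downwards)
print(splits[0].tolist())             # [0, 6, 12, 18]      (first wordID of every sequence)
print(929589 // 32)                   # 29049, the number of rows reported for the training data
```

Under this layout, each column is a contiguous piece of the corpus, and the tokens that do not fit (here the last two) are discarded.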


<a name="tasks-one"></a><h3 style="color:rgb(0,120,170)">Tasks</h3>
    <div class="alert alert-warning">
        Execute the notebook until here and try to solve the following tasks:
        <ul>
            <li>Print the first 100 wordIDs of the 3rd sequence in <code>train_data_splits</code>.</li>
            <li>Print the first wordIDs in all sequences in <code>train_data_splits</code>. What should be the shape of the resulting tensor?</li>
        </ul>
</div>

<a name="lm-model"></a><h3 style="color:rgb(0,120,170)">Model Definition</h3>
<p><p>Our language model consists of an encoder (embedding) matrix, an LSTM, and a decoder matrix. The encoder maps every wordID to a low-dimensional embedding vector, and the decoder matrix projects the hidden states of the LSTM from the hidden dimension to the size of the vocabulary. The overall scheme of the model is shown below:
</p></p>

<center>
    <img src="resources/lm_lstm_model.png" alt="Image not found!" style="width: 50%;"/>
</center>
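
To make the data flow concrete, the following sketch pushes a toy batch through the same three building blocks and prints the tensor shape after each stage. The sizes are deliberately tiny and are not the notebook's values; the layout <code>(sequence length, batch size, ...)</code> follows the default of <code>torch.nn.LSTM</code>.

```python
import torch

vocab_size, emb_dim, hid_dim = 50, 8, 16   # toy sizes, not the notebook's values
seq_len, batch_size = 5, 3

encoder = torch.nn.Embedding(vocab_size, emb_dim)   # matrix E
rnn = torch.nn.LSTM(emb_dim, hid_dim)
decoder = torch.nn.Linear(hid_dim, vocab_size)      # matrix U

word_ids = torch.randint(0, vocab_size, (seq_len, batch_size))
emb = encoder(word_ids)                  # (seq_len, batch, emb_dim)
hiddens, (h_n, c_n) = rnn(emb)           # (seq_len, batch, hid_dim)
logits = decoder(hiddens)                # (seq_len, batch, vocab_size)
log_probs = torch.log_softmax(logits, dim=-1)

print(word_ids.shape)    # torch.Size([5, 3])
print(emb.shape)         # torch.Size([5, 3, 8])
print(hiddens.shape)     # torch.Size([5, 3, 16])
print(log_probs.shape)   # torch.Size([5, 3, 50])
print(h_n.shape)         # torch.Size([1, 3, 16]), the state handed to the next batch
```

Every position of every sequence thus receives its own distribution over the whole vocabulary, while only the final LSTM state is carried forward.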


In [7]:
class LM_LSTMModel(torch.nn.Module):
    
    def __init__(self, ntoken, ninp, nhid):
        super().__init__()
        self.ntoken = ntoken
        self.encoder = torch.nn.Embedding(ntoken, ninp) # matrix E in the figure
        self.rnn = torch.nn.LSTM(ninp, nhid)
        self.decoder = torch.nn.Linear(nhid, ntoken) # matrix U in the figure
    
    def forward(self, input, hidden=None, return_logs=True):
        #ipdb.set_trace()
        emb = self.encoder(input)
        hiddens, last_hidden = self.rnn(emb, hidden)
        
        decoded = self.decoder(hiddens)
        if return_logs:
            y_hat = torch.nn.LogSoftmax(dim=-1)(decoded)
        else:
            y_hat = torch.nn.Softmax(dim=-1)(decoded)
        
        return y_hat, last_hidden

In [8]:
# Model parameters
emsize = 200  # size of word embeddings
nhid = 200  # number of hidden units per layer

model = LM_LSTMModel(ntokens, emsize, nhid)
model.to(device)

print(f"Model: {model}")
print(f"Model total trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad)}")

Model: LM_LSTMModel(
  (encoder): Embedding(10001, 200)
  (rnn): LSTM(200, 200)
  (decoder): Linear(in_features=200, out_features=10001, bias=True)
)
Model total trainable parameters: 4332001
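
The reported parameter count can be reproduced by hand. The short calculation below assumes the standard PyTorch parameterization of a single-layer LSTM (four gates, each with input weights, recurrent weights, and two bias vectors).

```python
ntoken, ninp, nhid = 10001, 200, 200

encoder_params = ntoken * ninp                              # embedding matrix E
lstm_params = 4 * (nhid * ninp + nhid * nhid + 2 * nhid)    # four gates: W_ih, W_hh, b_ih, b_hh
decoder_params = nhid * ntoken + ntoken                     # matrix U plus bias

total = encoder_params + lstm_params + decoder_params
print(encoder_params, lstm_params, decoder_params)   # 2000200 321600 2010201
print(total)                                          # 4332001
```

Most of the parameters sit in the encoder and decoder, both of which scale with the vocabulary size, while the LSTM itself is comparatively small.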


<a name="tasks-one"></a><h3 style="color:rgb(0,120,170)">Tasks</h3>
    <div class="alert alert-warning">
        Execute the notebook until here and try to solve the following tasks:
        <ul>
            <li>Considering the provided figure, find the corresponding components of the language model in the <code>\_\_init\_\_</code> function of <code>LM_LSTMModel</code>.</li>
            <li>Read the <code>forward</code> method of <code>LM_LSTMModel</code> and try to follow the data flow of the language model (from input to output) as shown in the figure. </li>
        </ul>
</div>

<a name="lm-training"></a><h3 style="color:rgb(0,120,170)">Training and Evaluation</h3>

This section contains the code for training the model and evaluating it on the validation set. Performance is measured with [perplexity](https://en.wikipedia.org/wiki/Perplexity), the exponential of the average negative log-likelihood per word. Descriptions are provided as comments inside the code. The training process is depicted below:

<center>
    <img src="resources/lm_training.png" alt="Image not found!" style="width: 50%;"/>
</center>
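
As a small worked example of the perplexity measure, the snippet below computes the average negative log-likelihood and its exponential for four made-up probabilities. The numbers are purely illustrative and do not come from the model.

```python
import math

# Probabilities a hypothetical model assigned to the correct next word at four positions.
p_correct = [0.10, 0.02, 0.25, 0.05]

nll = [-math.log(p) for p in p_correct]
mean_nll = sum(nll) / len(nll)        # this is what NLLLoss reports (mean over positions)
perplexity = math.exp(mean_nll)
print(f"loss {mean_nll:.2f} | perplexity {perplexity:.2f}")   # loss 2.65 | perplexity 14.14

# A model that guesses uniformly over V words has perplexity V.
V = 10001
uniform_ppl = math.exp(-math.log(1.0 / V))
print(round(uniform_ppl))             # 10001
```

Perplexity can be read as the effective number of equally likely words the model is choosing between, so lower values are better and the vocabulary size is the uniform-guessing baseline.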


In [9]:
CUT_AFTER_BATCHES = 200  # JUST FOR DEBUGGING: stop the loop after this number of batches. Set to -1 to disable.


def train(model: torch.nn.Module, optimizer: torch.optim.Optimizer, dictionary: u5.Dictionary,
          max_seq_len: int, train_batch_size: int, train_data_splits,
          clipping: float, learning_rate: float, print_interval: int, epoch: int,
          criterion: torch.nn.Module = torch.nn.NLLLoss()):
    """
    Train the model for one epoch. Training mode is turned on, which would enable layers such as dropout (this model has none).
    """
    model.train()
    total_loss = 0.0
    start_time = time.time()
    ntokens = len(dictionary)
    start_hidden = None
    n_batches = (train_data_splits.size(0) - 1) // max_seq_len
    
    for batch_i, i in enumerate(range(0, train_data_splits.size(0) - 1, max_seq_len)):
        batch_data, batch_targets = u5.get_batch(train_data_splits, i, max_seq_len)
        # ipdb.set_trace()
        
        # Reset the gradients; otherwise, they accumulate across batches.
        optimizer.zero_grad()
        
        # Repackaging keeps only the values of start_hidden and detaches them from the computational graph.
        # Without repackaging, the gradients would be backpropagated from the current point to the beginning
        # of the sequence, which becomes computationally too expensive.
        if start_hidden is not None:
            start_hidden = u5.repackage_hidden(start_hidden)
        
        # Forward pass
        y_hat_logprobs, last_hidden = model(batch_data, start_hidden, return_logs=True)
        
        # Loss computation & backward pass
        y_hat_logprobs = y_hat_logprobs.view(-1, ntokens)
        loss = criterion(y_hat_logprobs, batch_targets.view(-1))
        loss.backward()
        
        # The last hidden state of the current step is used as the start hidden state of the next step.
        # This passes the information of the current batch on to the next batch.
        start_hidden = last_hidden
        
        # Clipping gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), clipping)
        
        # Updating parameters using SGD
        optimizer.step()
        
        total_loss += loss.item()
        
        if batch_i % print_interval == 0 and batch_i > 0:
            cur_loss = total_loss / print_interval
            elapsed = time.time() - start_time
            throughput = elapsed * 1000 / print_interval
            print(f"| epoch {epoch:3d} | {batch_i:5d}/{n_batches:5d} batches | lr {learning_rate:02.2f} | ms/batch {throughput:5.2f} "
                  f"| loss {cur_loss:5.2f} | perplexity {math.exp(cur_loss):8.2f}")
            total_loss = 0
            start_time = time.time()
        
        # Cuts the loop (only for debugging)
        if (CUT_AFTER_BATCHES != -1) and (batch_i >= CUT_AFTER_BATCHES):
            print(f"WARNING: Training is interrupted after {batch_i} batches")
            break
            

epochs = 2  # total number of training epochs
print_interval = 25  # print report statistics every x batches
lr = 20  # initial learning rate
clipping = 0.25  # gradient clipping
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)

best_val_loss = None

# Loop over epochs.
for epoch in range(epochs):
    epoch_start_time = time.time()
    train(model, optimizer, dictionary, max_seq_len, train_batch_size, train_data_splits, clipping, lr, print_interval, epoch)
    val_loss = u5.evaluate(model, dictionary, max_seq_len, eval_batch_size, val_data_splits)
    
    print("-" * 100)
    print(f"| end of epoch {epoch:3d} | time: {time.time() - epoch_start_time:5.2f}s"
          f"| valid loss {val_loss:5.2f} | valid perplexity {math.exp(val_loss):8.2f}")
    print("-" * 100)
    
    # Save the model if the validation loss is the best we've seen so far.
    if best_val_loss is None or val_loss < best_val_loss:
        with open(save_path, "wb") as f:
            torch.save(model, f)
        best_val_loss = val_loss
    else:
        # Anneal the learning rate if no improvement has been seen in the validation dataset.
        lr /= 4.0
        for g in optimizer.param_groups:
            g["lr"] = lr

| epoch   0 |    25/  726 batches | lr 20.00 | ms/batch 208.00 | loss  7.43 | perplexity  1681.06
| epoch   0 |    50/  726 batches | lr 20.00 | ms/batch 198.61 | loss  6.22 | perplexity   500.28
| epoch   0 |    75/  726 batches | lr 20.00 | ms/batch 210.68 | loss  6.04 | perplexity   420.46
| epoch   0 |   100/  726 batches | lr 20.00 | ms/batch 215.21 | loss  5.98 | perplexity   394.43
| epoch   0 |   125/  726 batches | lr 20.00 | ms/batch 211.80 | loss  5.88 | perplexity   358.61
| epoch   0 |   150/  726 batches | lr 20.00 | ms/batch 213.63 | loss  5.78 | perplexity   324.09
| epoch   0 |   175/  726 batches | lr 20.00 | ms/batch 235.02 | loss  5.91 | perplexity   368.44
| epoch   0 |   200/  726 batches | lr 20.00 | ms/batch 245.80 | loss  5.86 | perplexity   351.82
----------------------------------------------------------------------------------------------------
| end of epoch   0 | time: 49.03s| valid loss  5.85 | valid perplexity   346.87
-----------------------------------
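
The training loop relies on two helpers from <code>u5_utils.py</code> whose behavior is only described in comments. The sketch below shows how such helpers are commonly written; <code>get_batch_sketch</code> and <code>repackage_hidden_sketch</code> are hypothetical names, and the real implementations may differ in detail.

```python
import torch


def get_batch_sketch(splits, i, max_seq_len):
    # Inputs are rows i .. i+seq_len-1; targets are the same rows shifted down by one,
    # so the target of every word is simply the next word of its sequence.
    seq_len = min(max_seq_len, splits.size(0) - 1 - i)
    data = splits[i : i + seq_len]
    targets = splits[i + 1 : i + 1 + seq_len]
    return data, targets


def repackage_hidden_sketch(hidden):
    # Keep the values but drop the autograd history (an LSTM state is a tuple (h, c)).
    if isinstance(hidden, torch.Tensor):
        return hidden.detach()
    return tuple(repackage_hidden_sketch(h) for h in hidden)


splits = torch.arange(24).view(4, 6).t().contiguous()    # toy data, shape (6, 4)
data, targets = get_batch_sketch(splits, 0, 3)
print(data[:, 0].tolist(), targets[:, 0].tolist())        # [0, 1, 2] [1, 2, 3]

rnn = torch.nn.LSTM(2, 3)
_, hidden = rnn(torch.randn(3, 4, 2))
detached = repackage_hidden_sketch(hidden)
print(hidden[0].requires_grad, detached[0].requires_grad)  # True False

print((29049 - 1) // 40)    # 726, the batch count shown in the training log
```

Detaching the state limits backpropagation to the current window of <code>max_seq_len</code> steps, while the forward pass still sees the context accumulated over earlier batches.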

After training has finished, the best-performing model (according to validation performance) is loaded and evaluated on the test corpus.


In [10]:
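# Load the saved model. The whole model object was pickled, so full unpickling is requested explicitly
# (recent PyTorch versions otherwise default to loading weights only).
with open(save_path, "rb") as f:
    model = torch.load(f, weights_only=False)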

test_loss = u5.evaluate(model, dictionary, max_seq_len, eval_batch_size, test_data_splits)
print("=" * 100)
print(f"| Test loss {test_loss:5.2f} | test perplexity {math.exp(test_loss):5.2f}")
print("=" * 100)

| Test loss  5.71 | test perplexity 301.39
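
The cells above store and restore the complete model object. A common alternative, shown here only as a sketch and not as what this notebook does, is to save just the parameter tensors via <code>state_dict()</code> and rebuild the architecture before loading. The <code>torch.nn.Linear</code> layer and the temporary file path are stand-ins chosen for illustration.

```python
import os
import tempfile

import torch

model = torch.nn.Linear(4, 2)                 # stand-in for the trained language model
path = os.path.join(tempfile.mkdtemp(), "model_state.pt")

torch.save(model.state_dict(), path)          # stores tensors only, not the Python class

restored = torch.nn.Linear(4, 2)              # the architecture must be rebuilt first
restored.load_state_dict(torch.load(path, map_location="cpu"))
restored.eval()

x = torch.randn(3, 4)
print(torch.equal(model(x), restored(x)))     # True
```

This variant keeps the checkpoint independent of where the model class is defined, at the cost of having to construct the model with the same hyperparameters before loading.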


<a name="tasks-one"></a><h3 style="color:rgb(0,120,170)">Tasks</h3>
    <div class="alert alert-warning">
        Execute the notebook until here and try to solve the following tasks:
        <ul>
            <li>Using <code>ipdb</code> at the beginning of the loop in <code>train</code>, look at the <code>batch_data</code> and <code>batch_targets</code>. What are their shapes? What do they contain? How are they related to each other?</li>
        </ul>
</div>

<a name="generation"></a><h2>Language Generation</h2>
<p><p>In this section, the model trained in the previous section is used to generate a word sequence of a given length. At each step, the next word is sampled from the probability distribution predicted by the language model and then fed back as the next input.</p></p>
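
The difference between sampling and always taking the most likely word can be illustrated with a toy distribution. The vocabulary and probabilities below are made up and are not model output.

```python
import torch

torch.manual_seed(0)
vocab = ["<eos>", "the", "market", "rose", "fell"]       # toy vocabulary
probs = torch.tensor([0.05, 0.40, 0.30, 0.15, 0.10])     # hypothetical predicted distribution

greedy_id = int(torch.argmax(probs))                      # deterministic: always the same word
dist = torch.distributions.Categorical(probs=probs)
sampled_ids = dist.sample((1000,))                        # stochastic: proportional to probs
counts = torch.bincount(sampled_ids, minlength=len(vocab))

print("greedy:", vocab[greedy_id])                        # greedy: the
print("sampled frequencies:", (counts / 1000).tolist())
```

Greedy decoding would repeat the same continuation on every run, whereas sampling produces varied text whose word frequencies follow the predicted probabilities.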



In [11]:
GENERATION_LENGTH = 10
START_WORD = "I"

start_hidden = None
START_WORD = START_WORD.lower()
    
generated_text = START_WORD
with torch.no_grad():
    wordid_input = dictionary.word2idx[START_WORD]
    for i in range(0, GENERATION_LENGTH):
        data = u5.batchify(torch.tensor([wordid_input]), 1, device)
        
        y_hat_probs, last_hidden = model(data, start_hidden, return_logs=False)
        
        prob_dist = torch.distributions.Categorical(y_hat_probs.squeeze())
        wordid_input = prob_dist.sample()
        word_generated = dictionary.idx2word[wordid_input]
        
        generated_text += " " + word_generated
        
        start_hidden = last_hidden

print(generated_text)

i finished these days would <eos> and one initial the diplomat


<a name="tasks-one"></a><h3 style="color:rgb(0,120,170)">Tasks</h3>
    <div class="alert alert-warning">
        Execute the notebook until here and try to solve the following tasks:
        <ul>
            <li>For one of the steps, calculate the sum of the generated probability distribution (<code>prob_dist</code>). Is it equal to 1.0?</li>
            <li>Change the length of the generated text. Does the text (still) remain coherent?</li>
        </ul>
</div>