# A2: Language Model

In this assignment, we will focus on building a language model using a text dataset of your choice. The objective is to train a model that can generate coherent and contextually relevant text based on a given input. Additionally, you will develop a simple web application to demonstrate the capabilities of your language model interactively.

## Task 1. Dataset Acquisition - Your first task is to find a suitable text dataset. (1 point)

### 1) Choose your dataset and provide a brief description. Be sure to source this dataset from a reputable public database or repository, and give proper credit to the dataset source in your documentation.

Note: The dataset can be based on any theme such as Harry Potter, Star Wars, jokes, Isaac Asimov’s works, Thai stories, etc. The key requirement is that the dataset should be text-rich and suitable for language modeling.

### 0. Import Libraries

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim

import datasets, math, re
from collections import Counter
from tqdm import tqdm

In [2]:
# minimum required torch version for MPS support: 1.12+
torch.__version__

'2.10.0'

In [3]:
# universal device selection: use gpu if available, else cpu
import torch

def get_device():
    if torch.cuda.is_available():
        return torch.device("cuda")      # NVIDIA GPU
    elif torch.backends.mps.is_available():
        return torch.device("mps")       # Apple Silicon GPU
    else:
        return torch.device("cpu")

device = get_device()

print(f"Using device: {device}")

Using device: mps


In [4]:
def force_cpu_device():
    return torch.device('cpu')

device = force_cpu_device()
print(f"Using device: {device}")

Using device: cpu


### 1. Load data - Theme : Shakespeare 

This project explores the ability of an LSTM language model to predict and generate text in the style of Shakespeare. For this purpose, I use the publicly available text of Shakespeare’s works from the Project Gutenberg website. More information about Project Gutenberg is provided below:

<i>Excerpt from Gutenberg site:</i>

<b>About Project Gutenberg</b>

Project Gutenberg is an online library of more than 75,000 free eBooks.

Michael Hart, founder of Project Gutenberg, invented eBooks in 1971 and his memory continues to inspire the creation of eBooks and related content today.

Since then, thousands of volunteers have digitized and diligently proofread the world’s literature. The entire Project Gutenberg collection is yours to enjoy.

All Project Gutenberg eBooks are completely free and always will be.


Text used for training: [The Project Gutenberg eBook of The Complete Works of William Shakespeare](https://www.gutenberg.org/cache/epub/100/pg100.txt)

<details>
<summary>Contents </summary>

    THE SONNETS
    ALL’S WELL THAT ENDS WELL
    THE TRAGEDY OF ANTONY AND CLEOPATRA
    AS YOU LIKE IT
    THE COMEDY OF ERRORS
    THE TRAGEDY OF CORIOLANUS
    CYMBELINE
    THE TRAGEDY OF HAMLET, PRINCE OF DENMARK
    THE FIRST PART OF KING HENRY THE FOURTH
    THE SECOND PART OF KING HENRY THE FOURTH
    THE LIFE OF KING HENRY THE FIFTH
    THE FIRST PART OF HENRY THE SIXTH
    THE SECOND PART OF KING HENRY THE SIXTH
    THE THIRD PART OF KING HENRY THE SIXTH
    KING HENRY THE EIGHTH
    THE LIFE AND DEATH OF KING JOHN
    THE TRAGEDY OF JULIUS CAESAR
    THE TRAGEDY OF KING LEAR
    LOVE’S LABOUR’S LOST
    THE TRAGEDY OF MACBETH
    MEASURE FOR MEASURE
    THE MERCHANT OF VENICE
    THE MERRY WIVES OF WINDSOR
    A MIDSUMMER NIGHT’S DREAM
    MUCH ADO ABOUT NOTHING
    THE TRAGEDY OF OTHELLO, THE MOOR OF VENICE
    PERICLES, PRINCE OF TYRE
    KING RICHARD THE SECOND
    KING RICHARD THE THIRD
    THE TRAGEDY OF ROMEO AND JULIET
    THE TAMING OF THE SHREW
    THE TEMPEST
    THE LIFE OF TIMON OF ATHENS
    THE TRAGEDY OF TITUS ANDRONICUS
    TROILUS AND CRESSIDA
    TWELFTH NIGHT; OR, WHAT YOU WILL
    THE TWO GENTLEMEN OF VERONA
    THE TWO NOBLE KINSMEN
    THE WINTER’S TALE
    A LOVER’S COMPLAINT
    THE PASSIONATE PILGRIM
    THE PHOENIX AND THE TURTLE
    THE RAPE OF LUCRECE
    VENUS AND ADONIS
</details>

In [5]:
import os
import requests

DATA_LOCAL_PATH = "../data/gutenberg_pg100.txt"

# Download if file doesn't exist locally
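if not os.path.exists(DATA_LOCAL_PATH):
    url = "https://www.gutenberg.org/cache/epub/100/pg100.txt"
    response = requests.get(url, timeout=60)
    response.raise_for_status()  # fail loudly instead of caching an error page
    text = response.text
    # Save to a local file, creating the data directory if needed
    os.makedirs(os.path.dirname(DATA_LOCAL_PATH), exist_ok=True)
    with open(DATA_LOCAL_PATH, "w", encoding="utf-8") as f:
        f.write(text)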
else:
    with open(DATA_LOCAL_PATH, "r", encoding="utf-8") as f:
        text = f.read()

print(text[:1000])  # Print the first 1000 characters

The Project Gutenberg eBook of The Complete Works of William Shakespeare
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: The Complete Works of William Shakespeare

Author: William Shakespeare

Release date: January 1, 1994 [eBook #100]
                Most recently updated: August 24, 2025

Language: English



*** START OF THE PROJECT GUTENBERG EBOOK THE COMPLETE WORKS OF WILLIAM SHAKESPEARE ***




The Complete Works of William Shakespeare

by William Shakespeare




                    Contents

    THE SONNETS
    ALL’S WELL THAT ENDS WELL
    THE TRAGEDY OF ANTONY AND CLEOPATRA
 


In [6]:
shakespeare_content = [
    "THE SONNETS",
    "ALL’S WELL THAT ENDS WELL",
    "THE TRAGEDY OF ANTONY AND CLEOPATRA",
    "AS YOU LIKE IT",
    "THE COMEDY OF ERRORS",
    "THE TRAGEDY OF CORIOLANUS",
    "CYMBELINE",
    "THE TRAGEDY OF HAMLET, PRINCE OF DENMARK",
    "THE FIRST PART OF KING HENRY THE FOURTH",
    "THE SECOND PART OF KING HENRY THE FOURTH",
    "THE LIFE OF KING HENRY THE FIFTH",
    "THE FIRST PART OF HENRY THE SIXTH",
    "THE SECOND PART OF KING HENRY THE SIXTH",
    "THE THIRD PART OF KING HENRY THE SIXTH",
    "KING HENRY THE EIGHTH",
    "THE LIFE AND DEATH OF KING JOHN",
    "THE TRAGEDY OF JULIUS CAESAR",
    "THE TRAGEDY OF KING LEAR",
    "LOVE’S LABOUR’S LOST",
    "THE TRAGEDY OF MACBETH",
    "MEASURE FOR MEASURE",
    "THE MERCHANT OF VENICE",
    "THE MERRY WIVES OF WINDSOR",
    "A MIDSUMMER NIGHT’S DREAM",
    "MUCH ADO ABOUT NOTHING",
    "THE TRAGEDY OF OTHELLO, THE MOOR OF VENICE",
    "PERICLES, PRINCE OF TYRE",
    "KING RICHARD THE SECOND",
    "KING RICHARD THE THIRD",
    "THE TRAGEDY OF ROMEO AND JULIET",
    "THE TAMING OF THE SHREW",
    "THE TEMPEST",
    "THE LIFE OF TIMON OF ATHENS",
    "THE TRAGEDY OF TITUS ANDRONICUS",
    "TROILUS AND CRESSIDA",
    "TWELFTH NIGHT; OR, WHAT YOU WILL",
    "THE TWO GENTLEMEN OF VERONA",
    "THE TWO NOBLE KINSMEN",
    "THE WINTER’S TALE",
    "A LOVER’S COMPLAINT",
    "THE PASSIONATE PILGRIM",
    "THE PHOENIX AND THE TURTLE",
    "THE RAPE OF LUCRECE",
    "VENUS AND ADONIS"

]

print("Number of works:", len(shakespeare_content))

Number of works: 44


In [7]:
def extract_works(text):
    # Keep lines 84 to 196041 (1-indexed) to skip the Project Gutenberg header and footer.
    # These offsets are specific to this download of the file.
    lines = text.splitlines()[83:196041]
    print(f"Total lines after header removal: {len(lines)}")
    # rejoin lines into a single string for easier searching
    text = "\n".join(lines)

    # Split into works by title
    works = []

    for i in range(len(shakespeare_content)):
        title = shakespeare_content[i]
        next_title = shakespeare_content[i + 1] if i + 1 < len(shakespeare_content) else None

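        start_idx = text.find(title)
        # Search for the next title only after the current one, and fall back to the
        # end of the text if it is missing (find() returns -1 in that case)
        end_idx = text.find(next_title, start_idx + len(title)) if next_title else -1
        if end_idx == -1:
            end_idx = len(text)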

        if start_idx != -1:
            work_text = text[start_idx:end_idx].strip()
            works.append(work_text)
            print("-" * 80)

            print(f"Extracted work {i}: {title}, length: {len(work_text)}")
            print(f"Work snippet: {work_text[:50]}...\n")
        else:
            print(f"Title '{title}' not found in text.")

    return works
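The hard-coded line offsets in `extract_works` will shift whenever Project Gutenberg updates the file. A possible alternative, sketched below, is to cut on the `*** START OF ...` marker visible in the printed header and on a matching `*** END OF ...` marker, which I assume closes the book in the same format. `strip_gutenberg_boilerplate` is a hypothetical helper, not part of the notebook, and the contents list at the top of the body would still need to be skipped separately.

```python
import re

# Marker lines as they appear in the printed header; the END form is an assumption.
START_RE = re.compile(r"^\*\*\* START OF THE PROJECT GUTENBERG EBOOK .*\*\*\*\s*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* END OF THE PROJECT GUTENBERG EBOOK .*\*\*\*\s*$", re.MULTILINE)

def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END markers (hypothetical helper)."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    body_start = start.end() if start else 0      # keep everything if no START marker
    body_end = end.start() if end else len(text)  # keep everything if no END marker
    return text[body_start:body_end].strip()

sample = (
    "Header line\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK THE COMPLETE WORKS OF WILLIAM SHAKESPEARE ***\n"
    "\n"
    "THE SONNETS\n"
    "From fairest creatures\n"
    "\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK THE COMPLETE WORKS OF WILLIAM SHAKESPEARE ***\n"
    "License text"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```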



In [8]:
shakespeare_works = extract_works(text)
print("Number of works extracted:", len(shakespeare_works))

Total lines after header removal: 195958
--------------------------------------------------------------------------------
Extracted work 0: THE SONNETS, length: 98328
Work snippet: THE SONNETS

                    1

From fairest c...

--------------------------------------------------------------------------------
Extracted work 1: ALL’S WELL THAT ENDS WELL, length: 134619
Work snippet: ALL’S WELL THAT ENDS WELL



Contents

ACT I
Scene...

--------------------------------------------------------------------------------
Extracted work 2: THE TRAGEDY OF ANTONY AND CLEOPATRA, length: 152395
Work snippet: THE TRAGEDY OF ANTONY AND CLEOPATRA


Contents

AC...

--------------------------------------------------------------------------------
Extracted work 3: AS YOU LIKE IT, length: 127037
Work snippet: AS YOU LIKE IT




Contents

 ACT I
 Scene I. An O...

--------------------------------------------------------------------------------
Extracted work 4: THE COMEDY OF ERRORS, length: 88328


## Task 2. Model Training - Incorporate the chosen dataset into our existing code framework. Train a language model that can understand the context and style of the text. (2 Points)

### 1) Detail the steps taken to preprocess the text data. (1 point)

#### Recheck data - The previous step loads the data as raw text and splits it into 44 separate works by Shakespeare.

In [9]:
def show_work_stats(works):
    print("Total rows extracted: {} \n".format(len(works)))
    print("Length of each work:")
    
    for i, work in enumerate(works):
        print(f"Work {i} length: {len(work)}")
        print(f"Snippet of Work {i}: {work[:80]!r}")

show_work_stats(shakespeare_works)

Total rows extracted: 44 

Length of each work:
Work 0 length: 98328
Snippet of Work 0: 'THE SONNETS\n\n                    1\n\nFrom fairest creatures we desire increase,\nT'
Work 1 length: 134619
Snippet of Work 1: 'ALL’S WELL THAT ENDS WELL\n\n\n\nContents\n\nACT I\nScene I. Rossillon. A room in the C'
Work 2 length: 152395
Snippet of Work 2: 'THE TRAGEDY OF ANTONY AND CLEOPATRA\n\n\nContents\n\nACT I\nScene I.\nAlexandria. A Roo'
Work 3 length: 127037
Snippet of Work 3: 'AS YOU LIKE IT\n\n\n\n\nContents\n\n ACT I\n Scene I. An Orchard near Oliver’s house\n Sc'
Work 4 length: 88328
Snippet of Work 4: 'THE COMEDY OF ERRORS\n\n\n\n\nContents\n\nACT I\nScene I. A hall in the Duke’s palace\nSc'
Work 5 length: 165949
Snippet of Work 5: 'THE TRAGEDY OF CORIOLANUS\n\n\n\n\nContents\n\n ACT I\n Scene I. Rome. A street\n Scene I'
Work 6 length: 161233
Snippet of Work 6: 'CYMBELINE\n\n\n\n\nContents\n\nACT I\nScene I. Britain. The garden of Cymbeline’s palace'
Work 7 length: 177933
Snippet of

### 2. Preprocessing

#### Data cleaning and preparation
After inspecting the data downloaded from Project Gutenberg, the following cleaning steps are applied:

0. The text has already been separated into individual works while loading the data.
1. Lowercase the text and remove non-ASCII characters. Note that this also strips typographic apostrophes (’), so "all’s" becomes "alls".

2. Remove unwanted special characters, keeping only letters, digits, . ! ? : ' , ; and whitespace.

3. Add spaces around sentence-ending punctuation (. ! ?) so that these marks are treated as separate tokens during tokenization. This helps the language model distinguish between words and punctuation, making it easier to learn sentence structure.

            For example, "hello!" becomes "hello !", so "hello" and "!" are separate tokens.

4. Remove page numbers, identified as lines that contain only a number and whitespace. This has to be done before normalizing whitespace, because the check relies on line breaks.

5. Normalize whitespace by replacing any sequence of whitespace characters, including newlines (\n) and tabs (\t), with a single space.

6. Add the special tokens `<START>` and `<END>` around each work to help the model learn document boundaries and avoid bleeding words from one work into the next.

In [10]:
DOC_START_DELIMITER = "<START>"
DOC_END_DELIMITER = "<END>"
SPACE = " "

In [11]:
import re

def clean_data(works):
    cleaned_works = []
    for work in works:
        # Lowercase
        work = work.lower()

        # Remove non-ASCII characters (this also drops typographic apostrophes such as ’)
        work = re.sub(r'[^\x00-\x7F]+', '', work)
        # Remove unwanted special characters except . ! ? : ' , ; and whitespace
        work = re.sub(r"[^a-z0-9\.\!\?\:\'\,\;\s]", '', work)
        # Add spaces around sentence-ending punctuation (. ! ?)
        work = re.sub(r'([\.\!\?])', r' \1 ', work)
        # Remove page numbers, identified as lines containing only a number
        work = re.sub(r'^\s*\d+\s*$', '', work, flags=re.MULTILINE)
        # Normalize whitespace
        work = re.sub(r'\s+', ' ', work).strip()

        

        # Add special tokens to denote <START> and <END> of work to help model learn boundaries 
        # and not bleed words of one work into each other
        cleaned_works.append(DOC_START_DELIMITER + SPACE + work + SPACE + DOC_END_DELIMITER)
    return cleaned_works

shakespeare_works_clean = clean_data(shakespeare_works)

show_work_stats(shakespeare_works_clean)

Total rows extracted: 44 

Length of each work:
Work 0 length: 93664
Snippet of Work 0: '<START> the sonnets from fairest creatures we desire increase, that thereby beau'
Work 1 length: 134647
Snippet of Work 1: '<START> alls well that ends well contents act i scene i . rossillon . a room in '
Work 2 length: 153086
Snippet of Work 2: '<START> the tragedy of antony and cleopatra contents act i scene i . alexandria '
Work 3 length: 126026
Snippet of Work 3: '<START> as you like it contents act i scene i . an orchard near olivers house sc'
Work 4 length: 88663
Snippet of Work 4: '<START> the comedy of errors contents act i scene i . a hall in the dukes palace'
Work 5 length: 166483
Snippet of Work 5: '<START> the tragedy of coriolanus contents act i scene i . rome . a street scene'
Work 6 length: 160794
Snippet of Work 6: '<START> cymbeline contents act i scene i . britain . the garden of cymbelines pa'
Work 7 length: 177755
Snippet of Work 7: '<START> the tragedy of hamlet, prince of den
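To see what each cleaning step does, here is a small sketch that applies the same regular expressions as `clean_data` to one made-up string. `clean_text` and the sample text are illustrative only; the special tokens are left out.

```python
import re

def clean_text(raw):
    """Apply the same regex steps as clean_data to a single string (illustrative)."""
    text = raw.lower()
    text = re.sub(r"[^\x00-\x7F]+", "", text)                  # drop non-ASCII (e.g. ’)
    text = re.sub(r"[^a-z0-9\.\!\?\:\'\,\;\s]", "", text)      # drop other symbols, e.g. ( )
    text = re.sub(r"([\.\!\?])", r" \1 ", text)                # space out . ! ?
    text = re.sub(r"^\s*\d+\s*$", "", text, flags=re.MULTILINE)  # number-only lines
    text = re.sub(r"\s+", " ", text).strip()                   # collapse whitespace
    return text

sample = "ALL’S WELL\n\n   12\n\nHello, world! (Exit)"
cleaned = clean_text(sample)
print(cleaned)  # alls well hello, world ! exit
```

The output shows two side effects worth knowing about: the typographic apostrophe disappears entirely ("alls"), and commas stay attached to the preceding word at this stage because only `. ! ?` are spaced out.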

##### Using Hugging Face Dataset:

Structure:

- A Dataset is like a table (similar to a pandas DataFrame), where each row is a data sample and each column is a feature (e.g., "text", "label").
- It supports multiple columns, various data types, and can be split into train/validation/test sets using a DatasetDict.

Usage: 

```python
# load datasets from the Hugging Face Hub or from local files
from datasets import load_dataset
dataset = load_dataset("imdb")  # Loads the IMDB reviews dataset

# create from python objects eg. list or array
from datasets import Dataset
data = [{"text": "hello", "label": 0}, {"text": "world", "label": 1}]
dataset = Dataset.from_list(data)

# accessing data
print(dataset[0])  # {'text': 'hello', 'label': 0}

# processing - use map functions, filter, shuffle, and split datasets efficiently.
dataset = dataset.map(lambda x: {"text": x["text"].upper()})

```

Benefits:

- Handles large datasets efficiently (memory-mapped, streaming).
- Integrates seamlessly with Hugging Face Transformers for model training.
- Supports easy preprocessing, tokenization, and batching.
- Built-in support for dataset splits, shuffling, and filtering.
- Can load from many formats (CSV, JSON, text, etc.) and the Hugging Face Hub.

In [12]:
from datasets import Dataset

def list_to_dataset(data_list):
    return Dataset.from_list([{"text": item} for item in data_list])

In [13]:
# Convert shakespeare_works_clean to a Hugging Face Dataset
from datasets import Dataset

sp_datasets = Dataset.from_list([{"text": work} for work in shakespeare_works_clean])
print(sp_datasets)
print(sp_datasets[0]['text'][:200])  # Print the first 200 characters of the first entry

Dataset({
    features: ['text'],
    num_rows: 44
})
<START> the sonnets from fairest creatures we desire increase, that thereby beautys rose might never die, but as the riper should by time decease, his tender heir might bear his memory: but thou contr


In [14]:
from datasets import DatasetDict

train_test = sp_datasets.train_test_split(test_size=0.2)

# 10% test set and 10% validation set
train_test_valid = train_test['test'].train_test_split(test_size=0.5)

dataset = DatasetDict({
    'train': train_test['train'],
    'test': train_test_valid['test'],
    'validation': train_test_valid['train']})

dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 35
    })
    test: Dataset({
        features: ['text'],
        num_rows: 5
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 4
    })
})

In [15]:
print(dataset['train'][0]['text'][:80])
print(dataset['validation'][0]['text'][:80]) 
print(dataset['test'][0]['text'][:80]) 

<START> the tragedy of king lear contents act i scene i . a room of state in kin
<START> the tragedy of julius caesar contents act i scene i . rome . a street sc
<START> the life of king henry the fifth contents act i prologue . scene i . lon


#### Tokenizing

In [16]:
# Exact copy of torchtext's basic_english tokenizer
# Source: https://github.com/pytorch/text/blob/main/torchtext/data/utils.py

_patterns = [r"\'", r"\"", r"\.", r"<br \/>", r",", r"\(", r"\)", r"\!", r"\?", r"\;", r"\:", r"\s+"]
_replacements = [" '  ", "", " . ", " ", " , ", " ( ", " ) ", " ! ", " ? ", " ", " ", " "]
_patterns_dict = list((re.compile(p), r) for p, r in zip(_patterns, _replacements))

def _basic_english_normalize(line):
    line = line.lower()
    for pattern_re, replaced_str in _patterns_dict:
        line = pattern_re.sub(replaced_str, line)
    return line.split()

def basic_english_tokenizer(text):
    """Tokenizer matching torchtext's basic_english implementation"""
    return _basic_english_normalize(text)

tokenize_data = lambda example, tokenizer: {'tokens': tokenizer(example['text'])}

tokenized_dataset = dataset.map(tokenize_data, remove_columns=['text'], fn_kwargs={'tokenizer': basic_english_tokenizer})

Map:   0%|          | 0/35 [00:00<?, ? examples/s]

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

Map:   0%|          | 0/4 [00:00<?, ? examples/s]

In [17]:
print(tokenized_dataset['train'][0]['tokens'][:20])
print(tokenized_dataset['validation'][0]['tokens'][:20]) 
print(tokenized_dataset['test'][0]['tokens'][:20]) 

['<start>', 'the', 'tragedy', 'of', 'king', 'lear', 'contents', 'act', 'i', 'scene', 'i', '.', 'a', 'room', 'of', 'state', 'in', 'king', 'lears', 'palace']
['<start>', 'the', 'tragedy', 'of', 'julius', 'caesar', 'contents', 'act', 'i', 'scene', 'i', '.', 'rome', '.', 'a', 'street', 'scene', 'ii', '.', 'the']
['<start>', 'the', 'life', 'of', 'king', 'henry', 'the', 'fifth', 'contents', 'act', 'i', 'prologue', '.', 'scene', 'i', '.', 'london', '.', 'an', 'antechamber']
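A quick way to check how the tokenizer treats punctuation is to run the same pattern list on a single sentence. The sketch below repeats the rules from the cell above under a different name (`tokenize`) so that it runs on its own; the sample sentence is made up.

```python
import re

# Same patterns and replacements as the basic_english rules used above.
_PATTERNS = [r"\'", r"\"", r"\.", r"<br \/>", r",", r"\(", r"\)", r"\!", r"\?", r"\;", r"\:", r"\s+"]
_REPLACEMENTS = [" '  ", "", " . ", " ", " , ", " ( ", " ) ", " ! ", " ? ", " ", " ", " "]
_RULES = [(re.compile(p), r) for p, r in zip(_PATTERNS, _REPLACEMENTS)]

def tokenize(line):
    """Lowercase, apply each substitution in order, then split on whitespace."""
    line = line.lower()
    for pattern, replacement in _RULES:
        line = pattern.sub(replacement, line)
    return line.split()

tokens = tokenize("<START> To be, or not to be: that is the question; ay!")
print(tokens)
# ['<start>', 'to', 'be', ',', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question', 'ay', '!']
```

Note that `:` and `;` are replaced by spaces, so although the cleaning step keeps them, they never reach the vocabulary. The `<START>` marker is also lowercased to `<start>`.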


#### Numericalizing

Build a vocabulary containing every token that occurs at least three times in the training set; keeping rarer tokens would make the vocabulary too large. Also add `<unk>` to handle out-of-vocabulary tokens and `<eos>` to mark the end of a sequence (here, the end of each work). The custom `Vocab` class below stands in for `torchtext.vocab`.

This is common practice in NLP. Limiting the vocabulary to tokens that appear at least a few times (e.g., 2 or 3) reduces memory usage and model complexity, while special tokens such as `<unk>` and `<eos>` are the standard way to handle unknown words and mark sequence boundaries.

In [18]:
UNKNOWN_TOKEN = "<unk>"
END_OF_SENTENCE_TOKEN = "<eos>"

In [19]:
# Custom Vocab class to replace torchtext.vocab
class Vocab:
    def __init__(self, counter, min_freq=1, specials=None):
        self.itos = []  # index to string
        self.stoi = {}  # string to index
        self.default_index = 0
        
        # Add special tokens first
        if specials:
            for token in specials:
                self._add_token(token)
        
        # Add tokens that meet min_freq threshold
        for token, count in counter.most_common():
            if count >= min_freq:
                if token not in self.stoi:
                    self._add_token(token)
    
    def _add_token(self, token):
        if token not in self.stoi:
            self.stoi[token] = len(self.itos)
            self.itos.append(token)
    
    def set_default_index(self, index):
        self.default_index = index
    
    def get_itos(self):
        return self.itos
    
    def __getitem__(self, token):
        return self.stoi.get(token, self.default_index)
    
    def __len__(self):
        return len(self.itos)

# Build vocabulary from tokenized data
counter = Counter()
for tokens in tokenized_dataset['train']['tokens']:
    counter.update(tokens)

vocab = Vocab(counter, min_freq=3, specials=[UNKNOWN_TOKEN, END_OF_SENTENCE_TOKEN])
vocab.set_default_index(vocab[UNKNOWN_TOKEN])

In [20]:
print(f"Length of vocab: {len(vocab)}")                         
print(vocab.get_itos()[:10])    

Length of vocab: 11372
['<unk>', '<eos>', ',', '.', 'the', 'and', 'i', 'to', 'of', 'a']
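The effect of `min_freq` and of the default index is easier to see on a toy corpus. The sketch below mirrors the logic of the `Vocab` class in a few lines; `build_vocab`, `encode` and the two toy documents are illustrative and not part of the notebook.

```python
from collections import Counter

def build_vocab(token_lists, min_freq, specials):
    """Specials first, then tokens with count >= min_freq in descending frequency."""
    counter = Counter(tok for toks in token_lists for tok in toks)
    itos = list(specials)
    itos += [tok for tok, count in counter.most_common()
             if count >= min_freq and tok not in specials]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def encode(tokens, stoi, unk="<unk>"):
    """Map tokens to indices, sending unseen or rare tokens to the <unk> index."""
    return [stoi.get(tok, stoi[unk]) for tok in tokens]

docs = [["to", "be", "or", "not", "to", "be"],
        ["to", "thine", "own", "self", "be", "true"]]
itos, stoi = build_vocab(docs, min_freq=2, specials=["<unk>", "<eos>"])
ids = encode(["to", "be", "true", "<eos>"], stoi)
print(itos)  # ['<unk>', '<eos>', 'to', 'be']
print(ids)   # [2, 3, 0, 1]
```

"true" occurs only once, so it falls below `min_freq` and is encoded as index 0 (`<unk>`), which is the same behaviour `vocab.set_default_index` gives in the notebook.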


### 3. Prepare the batch loader

In [21]:
def get_data(dataset, vocab, batch_size):
    data = []
    for example in dataset:
        if example['tokens']:
            # Append <eos> without relying on the in-place append() return value (None)
            tokens = example['tokens'] + ['<eos>']
            tokens = [vocab[token] for token in tokens]
            data.extend(tokens)
    data = torch.LongTensor(data)
    num_batches = data.shape[0] // batch_size
    data = data[:num_batches * batch_size]
    data = data.view(batch_size, num_batches)  # view() works here because the sliced tensor is contiguous
    return data  # [batch_size, num_batches]

In [22]:
batch_size = 128
train_data = get_data(tokenized_dataset['train'], vocab, batch_size)
valid_data = get_data(tokenized_dataset['validation'], vocab, batch_size)
test_data  = get_data(tokenized_dataset['test'],  vocab, batch_size)

In [23]:
train_data.shape

torch.Size([128, 7373])
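The shape `[128, 7373]` means the training stream was trimmed to 128 × 7373 = 943,744 tokens and laid out as 128 parallel rows of consecutive text. Here is a pure-Python sketch of the same reshape, using lists instead of tensors; `batchify` is a hypothetical name for illustration.

```python
def batchify(ids, batch_size):
    """Trim the stream to a multiple of batch_size and cut it into batch_size rows."""
    num_batches = len(ids) // batch_size
    ids = ids[: num_batches * batch_size]  # leftover tokens at the end are dropped
    # Row r holds tokens [r * num_batches, (r + 1) * num_batches), like view() does
    return [ids[r * num_batches:(r + 1) * num_batches] for r in range(batch_size)]

rows = batchify(list(range(10)), batch_size=3)
print(rows)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
```

Each row is a contiguous slice of the corpus, so reading along a row reads the text in order, while different rows come from different parts of the corpus.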

### 2) Describe the model architecture and the training process. (1 point)

### 4. Modeling

<img src="class/figures/LM.png" width=600>

In [24]:
class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim, num_layers, dropout_rate):
        super().__init__()
        self.num_layers = num_layers
        self.hid_dim    = hid_dim
        self.emb_dim    = emb_dim
        
        self.embedding  = nn.Embedding(vocab_size, emb_dim)
        self.lstm       = nn.LSTM(emb_dim, hid_dim, num_layers=num_layers, dropout=dropout_rate, batch_first=True)
        self.dropout    = nn.Dropout(dropout_rate)
        self.fc         = nn.Linear(hid_dim, vocab_size)
        
        self.init_weights()
    
    def init_weights(self):
        init_range_emb = 0.1
        init_range_other = 1/math.sqrt(self.hid_dim)
        self.embedding.weight.data.uniform_(-init_range_emb, init_range_emb)
        self.fc.weight.data.uniform_(-init_range_other, init_range_other)
        self.fc.bias.data.zero_()
        # Initialize the LSTM parameters in place via .data rather than replacing the tensors
        for name, param in self.lstm.named_parameters():
            if 'weight' in name:
                param.data.uniform_(-init_range_other, init_range_other)
            elif 'bias' in name:
                param.data.zero_()
    
    def init_hidden(self, batch_size, device):
        hidden = torch.zeros(self.num_layers, batch_size, self.hid_dim).to(device)
        cell   = torch.zeros(self.num_layers, batch_size, self.hid_dim).to(device)
        return hidden, cell
        
    def detach_hidden(self, hidden):
        hidden, cell = hidden
        hidden = hidden.detach() #not to be used for gradient computation
        cell   = cell.detach()
        return hidden, cell
        
    def forward(self, src, hidden):
        #src: [batch size, seq len]
        embedding = self.dropout(self.embedding(src))
        #embedding: [batch size, seq len, emb dim]
        output, hidden = self.lstm(embedding, hidden)
        #output: [batch size, seq len, hid dim]
        #hidden: tuple (h, c), each [num layers, batch size, hid dim]
        output = self.dropout(output)
        prediction = self.fc(output)
        #prediction: [batch_size, seq_len, vocab_size]
        return prediction, hidden

### 5. Training

The training loop follows the standard procedure. One caveat: because the whole corpus is concatenated into one token stream and cut into fixed-length chunks, a sequence fed to the model may span parts of two different works or cover only a fragment of one (depending on the decoding length). The hidden state is therefore reset only at the start of each epoch and carried from batch to batch within an epoch (detached from the computation graph), which assumes that each batch continues the text of the previous one.

In [25]:
vocab_size = len(vocab)
emb_dim = 1024                # 400 in the paper
hid_dim = 1024                # 1150 in the paper
num_layers = 2                # 3 in the paper
# A dropout rate of 0.65 means that during training each activation passing through a dropout
# layer is zeroed with probability 0.65, which helps prevent overfitting and improves generalization.
dropout_rate = 0.65              
lr = 1e-3                     

In [26]:
model = LSTMLanguageModel(vocab_size, emb_dim, hid_dim, num_layers, dropout_rate).to(device)
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'The model has {num_params:,} trainable parameters')

The model has 40,094,828 trainable parameters
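The parameter count can be checked by hand from the layer shapes. The sketch below assumes the standard PyTorch LSTM layout (four gates per layer, each with input weights, recurrent weights and two bias vectors) and the vocabulary size of 11,372 printed earlier; `lstm_lm_param_count` is a hypothetical helper.

```python
def lstm_lm_param_count(vocab_size, emb_dim, hid_dim, num_layers):
    """Count trainable parameters of embedding + stacked LSTM + output layer."""
    embedding = vocab_size * emb_dim
    lstm = 0
    for layer in range(num_layers):
        in_dim = emb_dim if layer == 0 else hid_dim
        # 4 gates, each with input weights, recurrent weights and two bias vectors
        lstm += 4 * (in_dim * hid_dim + hid_dim * hid_dim + 2 * hid_dim)
    fc = hid_dim * vocab_size + vocab_size  # weights plus bias
    return embedding + lstm + fc

total = lstm_lm_param_count(vocab_size=11372, emb_dim=1024, hid_dim=1024, num_layers=2)
print(f"{total:,}")  # 40,094,828
```

The embedding (11,644,928) and the output layer (11,656,300) together hold more parameters than the two LSTM layers (16,793,600), so the vocabulary size matters as much as the hidden size here.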


In [27]:
def get_batch(data, seq_len, idx):
    #data #[batch size, bunch of tokens]
    src    = data[:, idx:idx+seq_len]                   
    target = data[:, idx+1:idx+seq_len+1]  #target simply is ahead of src by 1            
    return src, target
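To make the source/target shift concrete, the sketch below applies the same slicing as `get_batch`, together with the tail trimming used in the training loop, to a single row of token ids. It uses plain lists, and `windows` is a hypothetical name.

```python
def windows(row, seq_len):
    """Return (src, target) pairs where target is src shifted one token ahead."""
    n = len(row)
    # Keep 1 + k * seq_len tokens so that every target window is full length
    row = row[: n - (n - 1) % seq_len]
    n = len(row)
    return [(row[i:i + seq_len], row[i + 1:i + seq_len + 1])
            for i in range(0, n - 1, seq_len)]

pairs = windows(list(range(11)), seq_len=3)
for src, target in pairs:
    print(src, target)
# [0, 1, 2] [1, 2, 3]
# [3, 4, 5] [4, 5, 6]
# [6, 7, 8] [7, 8, 9]
```

Token 10 is dropped because it would start a window with no complete target.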

In [28]:
def train(model, data, optimizer, criterion, batch_size, seq_len, clip, device):
    
    epoch_loss = 0
    model.train()
    # trim the tail so that every (src, target) window has exactly seq_len tokens
    # data: [batch size, num tokens per row]
    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches -1) % seq_len]  # -1 because target is shifted one token ahead of src
    num_batches = data.shape[-1]
    
    #reset the hidden every epoch
    hidden = model.init_hidden(batch_size, device)
    
    # tqdm for progress bar
    for idx in tqdm(range(0, num_batches - 1, seq_len), desc='Training: ',leave=False):
        optimizer.zero_grad()
        
        #hidden does not need to be in the computational graph for efficiency
        hidden = model.detach_hidden(hidden)

        src, target = get_batch(data, seq_len, idx) #src, target: [batch size, seq len]
        src, target = src.to(device), target.to(device)
        batch_size = src.shape[0]
        prediction, hidden = model(src, hidden)               

        #need to reshape because criterion expects pred to be 2d and target to be 1d
        prediction = prediction.reshape(batch_size * seq_len, -1)  #prediction: [batch size * seq len, vocab size]  
        target = target.reshape(-1)
        loss = criterion(prediction, target)
        
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item() * seq_len
    return epoch_loss / num_batches

In [29]:
def evaluate(model, data, criterion, batch_size, seq_len, device):

    epoch_loss = 0
    model.eval()
    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches -1) % seq_len]
    num_batches = data.shape[-1]

    hidden = model.init_hidden(batch_size, device)

    with torch.no_grad():
        for idx in range(0, num_batches - 1, seq_len):
            hidden = model.detach_hidden(hidden)
            src, target = get_batch(data, seq_len, idx)
            src, target = src.to(device), target.to(device)
            batch_size= src.shape[0]

            prediction, hidden = model(src, hidden)
            prediction = prediction.reshape(batch_size * seq_len, -1)
            target = target.reshape(-1)

            loss = criterion(prediction, target)
            epoch_loss += loss.item() * seq_len
    return epoch_loss / num_batches

Here we use a `ReduceLROnPlateau` learning-rate scheduler, which multiplies the learning rate by `factor` when the validation loss has not improved for more than `patience` epochs.
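With `factor=0.5` and `patience=0`, the learning rate is halved after every epoch in which the validation loss fails to improve. The sketch below is a simplified simulation of that rule, not the PyTorch implementation: it ignores the real scheduler's `threshold`, `cooldown` and `min_lr` settings, and the loss values are made up.

```python
def simulate_plateau_lr(valid_losses, lr, factor=0.5, patience=0):
    """Simplified plateau rule: cut lr once bad epochs exceed patience."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in valid_losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history

history = simulate_plateau_lr([5.4, 5.1, 5.2, 5.0, 5.0], lr=1e-3)
print(history)  # [0.001, 0.001, 0.0005, 0.0005, 0.00025]
```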

In [30]:
model_path = '../model'
model_filename = device.type + "_a2-lstm_lm.pt"

In [31]:
n_epochs = 10 if device.type == 'cpu' else 50
seq_len  = 50 #<----decoding length
clip    = 0.25

print("Training using device:", device.type, " for ", n_epochs, " epochs")


lr_scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=0)

best_valid_loss = float('inf')

for epoch in range(n_epochs):
    train_loss = train(model, train_data, optimizer, criterion, 
                batch_size, seq_len, clip, device)
    valid_loss = evaluate(model, valid_data, criterion, batch_size, 
                seq_len, device)

    lr_scheduler.step(valid_loss)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), f'{model_path}/{model_filename}')

    print(f'Epoch: {epoch+1:02}')
    print(f'\tTrain Perplexity: {math.exp(train_loss):.3f}')
    print(f'\tValid Perplexity: {math.exp(valid_loss):.3f}')

Training using device: cpu  for  10  epochs

Epoch: 01
	Train Perplexity: 454.039
	Valid Perplexity: 222.943
Epoch: 02
	Train Perplexity: 260.505
	Valid Perplexity: 171.761
Epoch: 03
	Train Perplexity: 199.797
	Valid Perplexity: 146.579
Epoch: 04
	Train Perplexity: 170.912
	Valid Perplexity: 132.285
Epoch: 05
	Train Perplexity: 153.192
	Valid Perplexity: 122.293
Epoch: 06
	Train Perplexity: 141.031
	Valid Perplexity: 117.471
Epoch: 07
	Train Perplexity: 130.901
	Valid Perplexity: 113.629
Epoch: 08
	Train Perplexity: 122.639
	Valid Perplexity: 109.984
Epoch: 09
	Train Perplexity: 115.252
	Valid Perplexity: 106.936
Epoch: 10
	Train Perplexity: 109.006
	Valid Perplexity: 103.996


Comparison of GPU vs CPU performance on training model

GPU training Perplexity:
```sh
Epoch: 50
	Train Perplexity: 79.642
	Valid Perplexity: 147.209
```

CPU training perplexity:

```sh
Epoch: 10
	Train Perplexity: 109.006
	Valid Perplexity: 103.996
```
CPU training completed in 16m 43.3s


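The GPU is more than 5x faster at completing the 50-epoch training run. However, after 50 epochs the GPU run has a higher validation perplexity (147.2) than the CPU run after 10 epochs (104.0), despite a lower training perplexity (79.6 vs. 109.0), which suggests that the longer run overfits. I will therefore use the model trained on the CPU.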

### 6. Testing

In [32]:
model_full_path = f'{model_path}/{model_filename}'
print("Loading best model and evaluating on test set from location " + model_full_path)
model.load_state_dict(torch.load(model_full_path,  map_location=device))
test_loss = evaluate(model, test_data, criterion, batch_size, seq_len, device)
print(f'Test Perplexity: {math.exp(test_loss):.3f}')

Loading best model and evaluating on test set from location ../model/cpu_a2-lstm_lm.pt
Test Perplexity: 140.969
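Perplexity is the exponential of the mean cross-entropy loss (in nats), which is why the loops above report `math.exp(loss)`. A useful reference point, sketched below, is a model that guesses uniformly over the 11,372-token vocabulary: its perplexity equals the vocabulary size. `perplexity` is a hypothetical helper name.

```python
import math

def perplexity(mean_nll):
    """Perplexity from mean negative log-likelihood in nats."""
    return math.exp(mean_nll)

vocab_size = 11372
uniform_loss = math.log(vocab_size)          # loss of a uniform guess over the vocabulary
uniform_ppl = round(perplexity(uniform_loss))
print(uniform_ppl)  # 11372

# Going the other way: the loss value that corresponds to the reported test perplexity
test_loss = math.log(140.969)
print(f"{test_loss:.3f}")
```

A test perplexity of about 141 therefore means the model is, on average, as uncertain as if it were choosing uniformly among roughly 141 tokens rather than 11,372.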


### 7. Save and Load Model (Pickling)

PyTorch provides two ways to save models:
1. **Save state_dict (Recommended)** - Saves only the weights; the model class is needed to load them
2. **Save entire model** - Uses pickle and saves everything, but is less portable
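
As a rough illustration of the first option, the sketch below saves a `state_dict` together with the hyperparameters needed to rebuild the model, then restores it. `TinyLM` is a hypothetical stand-in for `LSTMLanguageModel`, the hyperparameter values are made up, and an in-memory buffer replaces the file on disk; only the dictionary keys mirror the ones the loading cell reads.

```python
import io

import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Hypothetical stand-in for LSTMLanguageModel, used only in this sketch."""
    def __init__(self, vocab_size, emb_dim, hid_dim, num_layers, dropout_rate):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, num_layers=num_layers,
                            dropout=dropout_rate, batch_first=True)
        self.fc = nn.Linear(hid_dim, vocab_size)

# Made-up hyperparameters, small enough to run instantly
hparams = dict(vocab_size=50, emb_dim=8, hid_dim=16, num_layers=2, dropout_rate=0.5)
model = TinyLM(**hparams)

# Save the weights plus everything needed to recreate the architecture
checkpoint = {**hparams, "model_state_dict": model.state_dict()}
buffer = io.BytesIO()  # stands in for a path such as a .pt file
torch.save(checkpoint, buffer)

# Load: rebuild the model from the saved hyperparameters, then load the weights
buffer.seek(0)
restored = torch.load(buffer, map_location="cpu")
rebuilt = TinyLM(restored["vocab_size"], restored["emb_dim"], restored["hid_dim"],
                 restored["num_layers"], restored["dropout_rate"])
rebuilt.load_state_dict(restored["model_state_dict"])
rebuilt.eval()
```

Because the checkpoint holds only tensors and plain Python values, it does not depend on how the model class was importable at save time, which is the main portability advantage over pickling the whole model.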

In [33]:
# Load checkpoint
checkpoint_path = "../model"
checkpoint_filename = "lstm_lm_checkpoint.pt"
checkpoint = torch.load(f'{checkpoint_path}/{checkpoint_filename}', map_location=device)

# Recreate model with saved hyperparameters
loaded_model = LSTMLanguageModel(
    checkpoint['vocab_size'],
    checkpoint['emb_dim'],
    checkpoint['hid_dim'],
    checkpoint['num_layers'],
    checkpoint['dropout_rate']
).to(device)

# Load weights
loaded_model.load_state_dict(checkpoint['model_state_dict'])
loaded_model.eval()

print("Model loaded from checkpoint!")

Model loaded from checkpoint!


In [34]:
# pickle vocab
import pickle

vocab_filename = "a2_vocab_lm.pkl"

with open(f'{model_path}/{vocab_filename}', 'wb') as f:
    pickle.dump(vocab, f)
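
The reverse step, reading the vocabulary back, is what an application would do at startup. The sketch below shows a pickle round trip using a plain dictionary as a hypothetical stand-in for the vocab object and an in-memory buffer instead of the `.pkl` file.

```python
import io
import pickle

# Hypothetical stand-in for the notebook's vocab: token-to-index mapping only
toy_vocab = {"<unk>": 0, "<eos>": 1, "to": 2, "be": 3}

buffer = io.BytesIO()  # in-memory file, stands in for the .pkl on disk
pickle.dump(toy_vocab, buffer)

buffer.seek(0)
restored = pickle.load(buffer)

# Unknown words fall back to the <unk> index
indices = [restored.get(tok, restored["<unk>"]) for tok in ["to", "be", "frankenstein"]]
print(indices)  # [2, 3, 0]
```

Note that pickle stores a reference to an object's class rather than the class definition, so unpickling a custom vocab object requires the same class to be importable in the loading process.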

### 8. Real-world inference

Here we take the prompt, tokenize and encode it, and feed it into the model to get predictions. We keep only the output for the last position in the sequence, which is the model's prediction for the next word. We divide those logits by a temperature value and then apply softmax; the temperature adjusts the model's confidence by sharpening or flattening the resulting probability distribution.

Once we have the softmax distribution, we sample from it randomly to pick the next word. If we get `<unk>`, we sample again. Once we get `<eos>`, we stop predicting.

Finally, in the last lines, we decode the predicted indices back into strings.
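
To make the temperature step concrete, here is a small pure-Python sketch of a temperature-scaled softmax. The logits are made-up numbers and `softmax_with_temperature` is a hypothetical helper; the notebook itself does this with `torch.softmax`.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

A temperature below 1 concentrates probability on the top-scoring token, so samples are more predictable; a temperature above 1 flattens the distribution, so samples are more varied.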

In [35]:
def generate(prompt, max_seq_len, temperature, model, tokenizer, vocab, device, seed=None):
    if seed is not None:
        torch.manual_seed(seed)
    model.eval()
    tokens = tokenizer(prompt)
    indices = [vocab[t] for t in tokens]
    batch_size = 1
    hidden = model.init_hidden(batch_size, device)
    with torch.no_grad():
        for i in range(max_seq_len):
            src = torch.LongTensor([indices]).to(device)
            prediction, hidden = model(src, hidden)
            
            # prediction: [batch size, seq len, vocab size]
            # prediction[:, -1]: [batch size, vocab size] (logits for the last position, i.e. the next token)
            
            probs = torch.softmax(prediction[:, -1] / temperature, dim=-1)  
            prediction = torch.multinomial(probs, num_samples=1).item()    
            
            while prediction == vocab['<unk>']: #if it is unk, we sample again
                prediction = torch.multinomial(probs, num_samples=1).item()

            if prediction == vocab['<eos>']:    #if it is eos, we stop
                break

            indices.append(prediction) #autoregressive, thus output becomes input

    itos = vocab.get_itos()
    tokens = [itos[i] for i in indices]
    return tokens

In [36]:
def prompt(prompt_str):
    max_seq_len = 30
    seed = 0

    # the higher the temperature, the more diverse the tokens, but this comes
    # with a trade-off of less coherent sentences
    temperatures = [0.5, 0.7, 0.75, 0.8, 1.0]
    for temperature in temperatures:
        generation = generate(prompt_str, max_seq_len, temperature, model, basic_english_tokenizer, 
                            vocab, device, seed)
        print(str(temperature)+'\n'+' '.join(generation)+'\n')

In [37]:
# Test 1 : Prompts exactly as Shakespeare's words
prompt("To be, or not to be, that is the question:")

0.5
to be , or not to be , that is the question of your voices . i know you , for the more of a match , and the king , to make my best part to the ground , to say

0.7
to be , or not to be , that is the question of your voices . yet ill fear her . what now , go from me . exit . act v scene i . the same . the same camp near

0.75
to be , or not to be , that is the question of your voices . yet were you gone , being now alive to from me . he doth make his language . but and tell the day o th towns

0.8
to be , or not to be , that is the question of your voices . yet were you gone , being now alive to from me . he doth always . besides , but and tell the day o th towns

1.0
to be , or not to be , that is the question of your voices . yet were you gone for being now alive to from me . he doth always . besides , condition and divorce the dead two maid !



In [38]:
# Test 2 : Another prompt exactly as Shakespeare's words
prompt("The sonnet opens with a prologue that sets the scene:")

0.5
the sonnet opens with a prologue that sets the scene , and the better of her . go , come . exit . enter falstaff . tranio . my lord , i do beseech you , sir , i will

0.7
the sonnet opens with a prologue that sets the scene , and the one of her . what now , go from me . jaquenetta . the king is a word . palamon . o , sir , you will

0.75
the sonnet opens with a prologue that sets the scene queen and the prince of anjou , attended to her . go from me . exit . act v scene i . rome . a room in the countesss palace

0.8
the sonnet opens with a prologue that sets the scene queen and the prince of anjou , attended to her . go from me . exit . act v scene i . the same . the same camp near dover

1.0
the sonnet opens with a prologue that sets the scene dull and the are of one . reenter gentlemen . enter brabantio . posthumus . it is stirrd to the tower . prince . tell the tribunes , sir .



In [39]:
# Test 3 : Prompt with a token missing from the vocabulary
prompt("Frankenstein is")

0.5
<unk> is a man . i know you , for the more of a match , and the other , and by the two of heaven , the next cock , the

0.7
<unk> is a pair of men . he is himself to her . go from me . exit . act v scene i . the same . the same camp near dover

0.75
<unk> is a pair of men . romeo . ill warrant thee more . posthumus . it is a very . besides , but and tell the day o th towns of

0.8
<unk> is a pair of acheron . he kills himself . enter the lieutenant from france . jaquenetta . my lord , i shall be best by the duke of our towns

1.0
<unk> is returned and made him to see . reenter gentlemen . enter the lieutenant from antonio . jaquenetta . sirrah . clarence . trinculo , and tell the tribunes . two



In [40]:
# Test 4 : Prompts for new creation
prompt("The meaning of life is")

0.5
the meaning of life is returned and made him to the king . enter the king . king . he doth the king , and i will see him . king . o , you

0.7
the meaning of life is returned and made him to the better . enter the king . antonio . he doth the king . trinculo . and tell the duke of battle , and ,

0.75
the meaning of life is returned and made him to see . reenter gentlemen . enter the lieutenant . antonio . most wise , ophelia . i do not think it . poins . o

0.8
the meaning of life is returned and made him to see . reenter gentlemen . enter the lieutenant . antonio . most wise , ophelia . i do not think it . poins . o

1.0
the meaning of life is returned and made him to see . reenter gentlemen . enter brabantio . posthumus . it is stirrd to the tower . prince . tell the tribunes , sir .



# Task 3. Text Generation - Web Application Development - Develop a simple web application that demonstrates the capabilities of your language model. (2 points)


1) The application should include an input box where users can type in a text prompt.
2) Based on the input, the model should generate and display a continuation of the text. For example, if the input is "Harry Potter is", the model might generate "a wizard in the world of Hogwarts".
3) Provide documentation on how the web application interfaces with the language model.

#### <font color="green">ANSWER:</font>

The web application is built with Flask, a general-purpose web framework.

Web app folder : `A2/app`

Python file list:
1. `app.py` - main Python code that runs the site. Use `uv run app.py` to start the web app.
2. `lstm.py` - `LSTMLanguageModel` class
3. `tokenizer.py` - tokenizes the text. The `torchtext` tokenizer (`torchtext.data.utils.get_tokenizer`) is deprecated and incompatible with Python v3.13.x, so a copy of the tokenizer code is maintained in the codebase.
4. `vocab.py` - `Vocab` class

As recommended, instead of saving the entire (large) model, a checkpoint is saved and later loaded in the application to rebuild the LSTM model with the saved hyperparameters. The code that loads the model is in `app.py`.
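
The following is a minimal sketch of how a Flask route might pass a prompt to the generator and return the continuation. It is not the actual `app.py`: the `/generate` route, the JSON field names, and `fake_generate` (a stub standing in for the notebook's `generate` function and the loaded model) are all assumptions made for illustration.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def fake_generate(prompt):
    """Hypothetical stub for the notebook's generate(); returns a token list."""
    return prompt.lower().split() + ["<generated>", "text"]

@app.route("/generate", methods=["POST"])
def generate_route():
    # Read the prompt from the JSON body; treat missing or invalid JSON as empty
    payload = request.get_json(silent=True) or {}
    prompt = payload.get("prompt", "").strip()
    if not prompt:
        return jsonify({"error": "prompt is required"}), 400

    # In a real app this would call the model-backed generate function
    tokens = fake_generate(prompt)
    return jsonify({"prompt": prompt, "generated": " ".join(tokens)})
```

In a design like this, the model, vocabulary, and tokenizer would be loaded once at startup rather than per request, so each request only pays for generation.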

Check [README](../README.md) for more details and screenshots
