# A2: Language Model

In this assignment, we will focus on building a language model using a text dataset of your choice. The objective is to train a model that can generate coherent and contextually relevant text based on a given input. Additionally, you will develop a simple web application to demonstrate the capabilities of your language model interactively.

## Task 1. Dataset Acquisition - Your first task is to find a suitable text dataset. (1 point)

### 1) Choose your dataset and provide a brief description. Source the dataset from a reputable public database or repository, and credit the dataset source in your documentation.

Note: The dataset can be based on any theme such as Harry Potter, Star Wars, jokes, Isaac Asimov’s works, Thai stories, etc. The key requirement is that the dataset should be text-rich and suitable for language modeling.

### 0. Import Libraries

In [45]:
import torch
import torch.nn as nn
import torch.optim as optim

import datasets, math, re
from collections import Counter
from tqdm import tqdm

In [46]:
# minimum required torch version for MPS support: 1.12+
torch.__version__

'2.10.0'

In [47]:
# universal device selection: use gpu if available, else cpu
import torch

def get_device():
    if torch.cuda.is_available():
        return torch.device("cuda")      # NVIDIA GPU
    elif torch.backends.mps.is_available():
        return torch.device("mps")       # Apple Silicon GPU
    else:
        return torch.device("cpu")

device = get_device()

print(f"Using device: {device}")

Using device: mps


In [48]:
def force_cpu_device():
    return torch.device('cpu')

### 1. Load data from Gutenberg project

<i>Excerpt from Gutenberg site:</i>

<b>About Project Gutenberg</b>

Project Gutenberg is an online library of more than 75,000 free eBooks.

Michael Hart, founder of Project Gutenberg, invented eBooks in 1971 and his memory continues to inspire the creation of eBooks and related content today.

Since then, thousands of volunteers have digitized and diligently proofread the world’s literature. The entire Project Gutenberg collection is yours to enjoy.

All Project Gutenberg eBooks are completely free and always will be.


Text used for training: [The Project Gutenberg eBook of The Complete Works of William Shakespeare](https://www.gutenberg.org/cache/epub/100/pg100.txt)

<details>
<summary>Contents </summary>

    THE SONNETS
    ALL’S WELL THAT ENDS WELL
    THE TRAGEDY OF ANTONY AND CLEOPATRA
    AS YOU LIKE IT
    THE COMEDY OF ERRORS
    THE TRAGEDY OF CORIOLANUS
    CYMBELINE
    THE TRAGEDY OF HAMLET, PRINCE OF DENMARK
    THE FIRST PART OF KING HENRY THE FOURTH
    THE SECOND PART OF KING HENRY THE FOURTH
    THE LIFE OF KING HENRY THE FIFTH
    THE FIRST PART OF HENRY THE SIXTH
    THE SECOND PART OF KING HENRY THE SIXTH
    THE THIRD PART OF KING HENRY THE SIXTH
    KING HENRY THE EIGHTH
    THE LIFE AND DEATH OF KING JOHN
    THE TRAGEDY OF JULIUS CAESAR
    THE TRAGEDY OF KING LEAR
    LOVE’S LABOUR’S LOST
    THE TRAGEDY OF MACBETH
    MEASURE FOR MEASURE
    THE MERCHANT OF VENICE
    THE MERRY WIVES OF WINDSOR
    A MIDSUMMER NIGHT’S DREAM
    MUCH ADO ABOUT NOTHING
    THE TRAGEDY OF OTHELLO, THE MOOR OF VENICE
    PERICLES, PRINCE OF TYRE
    KING RICHARD THE SECOND
    KING RICHARD THE THIRD
    THE TRAGEDY OF ROMEO AND JULIET
    THE TAMING OF THE SHREW
    THE TEMPEST
    THE LIFE OF TIMON OF ATHENS
    THE TRAGEDY OF TITUS ANDRONICUS
    TROILUS AND CRESSIDA
    TWELFTH NIGHT; OR, WHAT YOU WILL
    THE TWO GENTLEMEN OF VERONA
    THE TWO NOBLE KINSMEN
    THE WINTER’S TALE
    A LOVER’S COMPLAINT
    THE PASSIONATE PILGRIM
    THE PHOENIX AND THE TURTLE
    THE RAPE OF LUCRECE
    VENUS AND ADONIS
</details>

In [49]:
import os
import requests

DATA_LOCAL_PATH = "../data/gutenberg_pg100.txt"

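# Download if file doesn't exist locally
if not os.path.exists(DATA_LOCAL_PATH):
    url = "https://www.gutenberg.org/cache/epub/100/pg100.txt"
    response = requests.get(url, timeout=60)
    response.raise_for_status()  # fail loudly instead of saving an error page
    text = response.text
    # Make sure the target directory exists, then save to a local file
    os.makedirs(os.path.dirname(DATA_LOCAL_PATH), exist_ok=True)
    with open(DATA_LOCAL_PATH, "w", encoding="utf-8") as f:
        f.write(text)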
else:
    with open(DATA_LOCAL_PATH, "r", encoding="utf-8") as f:
        text = f.read()

print(text[:1000])  # Print the first 1000 characters

The Project Gutenberg eBook of The Complete Works of William Shakespeare
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: The Complete Works of William Shakespeare

Author: William Shakespeare

Release date: January 1, 1994 [eBook #100]
                Most recently updated: August 24, 2025

Language: English



*** START OF THE PROJECT GUTENBERG EBOOK THE COMPLETE WORKS OF WILLIAM SHAKESPEARE ***




The Complete Works of William Shakespeare

by William Shakespeare




                    Contents

    THE SONNETS
    ALL’S WELL THAT ENDS WELL
    THE TRAGEDY OF ANTONY AND CLEOPATRA
 


In [50]:
shakespeare_content = [
    "THE SONNETS",
    "ALL’S WELL THAT ENDS WELL",
    "THE TRAGEDY OF ANTONY AND CLEOPATRA",
    "AS YOU LIKE IT",
    "THE COMEDY OF ERRORS",
    "THE TRAGEDY OF CORIOLANUS",
    "CYMBELINE",
    "THE TRAGEDY OF HAMLET, PRINCE OF DENMARK",
    "THE FIRST PART OF KING HENRY THE FOURTH",
    "THE SECOND PART OF KING HENRY THE FOURTH",
    "THE LIFE OF KING HENRY THE FIFTH",
    "THE FIRST PART OF HENRY THE SIXTH",
    "THE SECOND PART OF KING HENRY THE SIXTH",
    "THE THIRD PART OF KING HENRY THE SIXTH",
    "KING HENRY THE EIGHTH",
    "THE LIFE AND DEATH OF KING JOHN",
    "THE TRAGEDY OF JULIUS CAESAR",
    "THE TRAGEDY OF KING LEAR",
    "LOVE’S LABOUR’S LOST",
    "THE TRAGEDY OF MACBETH",
    "MEASURE FOR MEASURE",
    "THE MERCHANT OF VENICE",
    "THE MERRY WIVES OF WINDSOR",
    "A MIDSUMMER NIGHT’S DREAM",
    "MUCH ADO ABOUT NOTHING",
    "THE TRAGEDY OF OTHELLO, THE MOOR OF VENICE",
    "PERICLES, PRINCE OF TYRE",
    "KING RICHARD THE SECOND",
    "KING RICHARD THE THIRD",
    "THE TRAGEDY OF ROMEO AND JULIET",
    "THE TAMING OF THE SHREW",
    "THE TEMPEST",
    "THE LIFE OF TIMON OF ATHENS",
    "THE TRAGEDY OF TITUS ANDRONICUS",
    "TROILUS AND CRESSIDA",
    "TWELFTH NIGHT; OR, WHAT YOU WILL",
    "THE TWO GENTLEMEN OF VERONA",
    "THE TWO NOBLE KINSMEN",
    "THE WINTER’S TALE",
    "A LOVER’S COMPLAINT",
    "THE PASSIONATE PILGRIM",
    "THE PHOENIX AND THE TURTLE",
    "THE RAPE OF LUCRECE",
    "VENUS AND ADONIS"

]

print("Number of works:", len(shakespeare_content))

Number of works: 44


In [51]:
def extract_works(text):
    # Read from line 84 to skip the header
    lines = text.splitlines()[83:]
    print(f"Total lines after header removal: {len(lines)}")
    # rejoin lines into a single string for easier searching
    text = "\n".join(lines)

    # Split into works by title
    works = []

    for i in range(len(shakespeare_content)):
        title = shakespeare_content[i]
        next_title = shakespeare_content[i + 1] if i + 1 < len(shakespeare_content) else None

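        start_idx = text.find(title)
        # Look for the next title only after the current one, so an earlier mention cannot end the work early
        end_idx = text.find(next_title, max(start_idx, 0) + len(title)) if next_title else len(text)
        if end_idx == -1:
            end_idx = len(text)  # next title missing: take the rest of the text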

        if start_idx != -1:
            work_text = text[start_idx:end_idx].strip()
            works.append(work_text)
            print(f"Extracted work {i}: {title}, length: {len(work_text)}")
            print(f"Work snippet: {work_text[:100]}...\n")
        else:
            print(f"Title '{title}' not found in text.")

    return works
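
The title-based split can be illustrated on a toy string. The following is a minimal sketch, not the notebook's own function: `split_by_titles`, `toy_text`, and `toy_titles` are hypothetical names, and the sketch assumes the titles appear in the text in the same order as in the list. Each title is searched for only after the previous one, so a title that is mentioned again later in another section does not disturb the boundaries.

```python
def split_by_titles(text, titles):
    """Split text into sections that start at each title, searching forward only."""
    starts = []
    cursor = 0
    for title in titles:
        idx = text.find(title, cursor)
        if idx == -1:
            continue  # title not present: skip it
        starts.append((title, idx))
        cursor = idx + len(title)

    sections = {}
    for n, (title, start) in enumerate(starts):
        # A section ends where the next found title begins, or at the end of the text
        end = starts[n + 1][1] if n + 1 < len(starts) else len(text)
        sections[title] = text[start:end].strip()
    return sections


toy_text = "ALPHA\nfirst body\nBETA\nsecond body mentions ALPHA\nGAMMA\nthird body"
toy_titles = ["ALPHA", "BETA", "GAMMA"]
sections = split_by_titles(toy_text, toy_titles)
for title, body in sections.items():
    print(repr(title), "->", repr(body))
```

Whether this ordering assumption holds for every title in the Gutenberg file is worth checking against the extraction log before relying on it.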



In [52]:
shakespeare_works = extract_works(text)
print("Number of works extracted:", len(shakespeare_works))

Total lines after header removal: 196312
Extracted work 0: THE SONNETS, length: 98328
Work snippet: THE SONNETS

                    1

From fairest creatures we desire increase,
That thereby beauty’s...

Extracted work 1: ALL’S WELL THAT ENDS WELL, length: 134619
Work snippet: ALL’S WELL THAT ENDS WELL



Contents

ACT I
Scene I. Rossillon. A room in the Countess’s palace.
Sc...

Extracted work 2: THE TRAGEDY OF ANTONY AND CLEOPATRA, length: 152395
Work snippet: THE TRAGEDY OF ANTONY AND CLEOPATRA


Contents

ACT I
Scene I.
Alexandria. A Room in Cleopatra’s pal...

Extracted work 3: AS YOU LIKE IT, length: 127037
Work snippet: AS YOU LIKE IT




Contents

 ACT I
 Scene I. An Orchard near Oliver’s house
 Scene II. A Lawn befor...

Extracted work 4: THE COMEDY OF ERRORS, length: 88328
Work snippet: THE COMEDY OF ERRORS




Contents

ACT I
Scene I. A hall in the Duke’s palace
Scene II. A public pla...

Extracted work 5: THE TRAGEDY OF CORIOLANUS, length: 165949
Work snippet: THE TRAGEDY 

## Task 2. Model Training - Incorporate the chosen dataset into our existing code framework. Train a language model that can understand the context and style of the text. (2 points)

### 1) Detail the steps taken to preprocess the text data. (1 point)

#### Recheck data - The previous step loads the raw text and splits it into 44 works by Shakespeare.

In [53]:
def show_work_stats(works):
    # Use the function argument rather than the global list
    print("Total rows extracted: {} \n".format(len(works)))
    print("Length of each work:")
    
    for i, work in enumerate(works):
        print(f"Work {i} length: {len(work)}")
        print(f"Snippet of Work {i}: {work[:80]!r}")

show_work_stats(shakespeare_works)

Total rows extracted: 44 

Length of each work:
Work 0 length: 98328
Snippet of Work 0: 'THE SONNETS\n\n                    1\n\nFrom fairest creatures we desire increase,\nT'
Work 1 length: 134619
Snippet of Work 1: 'ALL’S WELL THAT ENDS WELL\n\n\n\nContents\n\nACT I\nScene I. Rossillon. A room in the C'
Work 2 length: 152395
Snippet of Work 2: 'THE TRAGEDY OF ANTONY AND CLEOPATRA\n\n\nContents\n\nACT I\nScene I.\nAlexandria. A Roo'
Work 3 length: 127037
Snippet of Work 3: 'AS YOU LIKE IT\n\n\n\n\nContents\n\n ACT I\n Scene I. An Orchard near Oliver’s house\n Sc'
Work 4 length: 88328
Snippet of Work 4: 'THE COMEDY OF ERRORS\n\n\n\n\nContents\n\nACT I\nScene I. A hall in the Duke’s palace\nSc'
Work 5 length: 165949
Snippet of Work 5: 'THE TRAGEDY OF CORIOLANUS\n\n\n\n\nContents\n\n ACT I\n Scene I. Rome. A street\n Scene I'
Work 6 length: 161233
Snippet of Work 6: 'CYMBELINE\n\n\n\n\nContents\n\nACT I\nScene I. Britain. The garden of Cymbeline’s palace'
Work 7 length: 177933
Snippet of

### 2. Preprocessing

#### Data cleaning and preparation
After inspecting the data downloaded from Gutenberg, each work is lowercased and then cleaned in the following steps:

1. Remove non-ASCII characters. This also drops the typographic apostrophe (’), so "all’s" becomes "alls".

2. Remove unwanted special characters, keeping only letters, digits, `. ! ? : ' , ;` and whitespace.

3. Add spaces around sentence-ending punctuation (`. ! ?`) so that these marks are treated as separate tokens during tokenization. This helps the language model distinguish words from punctuation, which makes sentence structure easier to learn. For example, "hello!" becomes "hello !", so "hello" and "!" are separate tokens.

4. Remove page numbers, identified as lines that contain only a number and optional whitespace. This has to be done before whitespace is normalized, because the pattern relies on line breaks.

5. Normalize whitespace by replacing any run of whitespace characters, including newlines (\n), tabs (\t), and repeated spaces, with a single space.

6. Wrap each work in the special tokens `<START>` and `<END>` so the model can learn work boundaries and text from one work does not bleed into the next.
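
To make the effect of the cleaning steps concrete, here is a minimal sketch that applies the same regular expressions to a short made-up string. `clean_text` and `sample` are illustrative names, not part of the notebook, and the start/end tokens are left out to keep the example small.

```python
import re


def clean_text(raw):
    """Apply the cleaning steps to a single string."""
    text = raw.lower()
    # Drop non-ASCII characters (including the typographic apostrophe)
    text = re.sub(r"[^\x00-\x7F]+", "", text)
    # Keep only letters, digits, . ! ? : ' , ; and whitespace
    text = re.sub(r"[^a-z0-9.!?:',;\s]", "", text)
    # Put spaces around sentence-ending punctuation
    text = re.sub(r"([.!?])", r" \1 ", text)
    # Remove lines that contain only a number
    text = re.sub(r"^\s*\d+\s*$", "", text, flags=re.MULTILINE)
    # Collapse every run of whitespace into one space
    return re.sub(r"\s+", " ", text).strip()


sample = "THE SONNETS\n\n    1\n\nThat thereby beauty’s rose might never die.\n"
cleaned = clean_text(sample)
print(cleaned)  # the sonnets that thereby beautys rose might never die .
```

Note that the standalone `1` disappears and `beauty’s` loses its apostrophe, which matches the snippets printed for the cleaned works.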

In [54]:
DOC_START_DELIMITER = "<START>"
DOC_END_DELIMITER = "<END>"
SPACE = " "

In [55]:
import re

def clean_data(works):
    cleaned_works = []
    for work in works:
        # Lowercase
        work = work.lower()

        # Remove non-ASCII characters (this also drops typographic apostrophes)
        work = re.sub(r'[^\x00-\x7F]+', '', work)
        # Remove unwanted special characters except . ! ? : ' , ; and whitespace
        work = re.sub(r"[^a-z0-9\.\!\?\:\'\,\;\s]", '', work)
        # Add spaces around sentence-ending punctuation (. ! ?)
        work = re.sub(r'([\.\!\?])', r' \1 ', work)
        # Remove page numbers, identified as standalone numbers on a line
        work = re.sub(r'^\s*\d+\s*$', '', work, flags=re.MULTILINE)
        # Normalize whitespace
        work = re.sub(r'\s+', ' ', work).strip()

        

        # Add special tokens to denote <START> and <END> of work to help model learn boundaries 
        # and not bleed words of one work into each other
        cleaned_works.append(DOC_START_DELIMITER + SPACE + work + SPACE + DOC_END_DELIMITER)
    return cleaned_works

shakespeare_works_clean = clean_data(shakespeare_works)

show_work_stats(shakespeare_works_clean)

Total rows extracted: 44 

Length of each work:
Work 0 length: 93664
Snippet of Work 0: '<START> the sonnets from fairest creatures we desire increase, that thereby beau'
Work 1 length: 134647
Snippet of Work 1: '<START> alls well that ends well contents act i scene i . rossillon . a room in '
Work 2 length: 153086
Snippet of Work 2: '<START> the tragedy of antony and cleopatra contents act i scene i . alexandria '
Work 3 length: 126026
Snippet of Work 3: '<START> as you like it contents act i scene i . an orchard near olivers house sc'
Work 4 length: 88663
Snippet of Work 4: '<START> the comedy of errors contents act i scene i . a hall in the dukes palace'
Work 5 length: 166483
Snippet of Work 5: '<START> the tragedy of coriolanus contents act i scene i . rome . a street scene'
Work 6 length: 160794
Snippet of Work 6: '<START> cymbeline contents act i scene i . britain . the garden of cymbelines pa'
Work 7 length: 177755
Snippet of Work 7: '<START> the tragedy of hamlet, prince of den

##### Using Hugging Face Dataset:

Structure:

- A Dataset is like a table (similar to a pandas DataFrame), where each row is a data sample and each column is a feature (e.g., "text", "label").
- It supports multiple columns, various data types, and can be split into train/validation/test sets using a DatasetDict.

Usage: 

```python
# load datasets from the Hugging Face Hub or from local files
from datasets import load_dataset
dataset = load_dataset("imdb")  # Loads the IMDB reviews dataset

# create from python objects eg. list or array
from datasets import Dataset
data = [{"text": "hello", "label": 0}, {"text": "world", "label": 1}]
dataset = Dataset.from_list(data)

# accessing data
print(dataset[0])  # {'text': 'hello', 'label': 0}

# processing - use map functions, filter, shuffle, and split datasets efficiently.
dataset = dataset.map(lambda x: {"text": x["text"].upper()})

```

Benefits:

- Handles large datasets efficiently (memory-mapped, streaming).
- Integrates seamlessly with Hugging Face Transformers for model training.
- Supports easy preprocessing, tokenization, and batching.
- Built-in support for dataset splits, shuffling, and filtering.
- Can load from many formats (CSV, JSON, text, etc.) and the Hugging Face Hub.

In [56]:
from datasets import Dataset

def list_to_dataset(data_list):
    return Dataset.from_list([{"text": item} for item in data_list])

In [57]:
# Convert shakespeare_works_clean to a Hugging Face Dataset
from datasets import Dataset

sp_datasets = Dataset.from_list([{"text": work} for work in shakespeare_works_clean])
print(sp_datasets)
print(sp_datasets[0])  # Print the first entry to verify

Dataset({
    features: ['text'],
    num_rows: 44
})


In [58]:
from datasets import DatasetDict

train_test = sp_datasets.train_test_split(test_size=0.2)

# 10% test set and 10% validation set
train_test_valid = train_test['test'].train_test_split(test_size=0.5)

dataset = DatasetDict({
    'train': train_test['train'],
    'test': train_test_valid['test'],
    'validation': train_test_valid['train']})

dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 35
    })
    test: Dataset({
        features: ['text'],
        num_rows: 5
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 4
    })
})
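
The row counts above (35 / 5 / 4) follow from applying the two fractions in sequence to the 44 works. The arithmetic can be reproduced with a small sketch. It assumes the test share is rounded up to a whole number of rows, which is consistent with the counts shown; `split_sizes` is a hypothetical helper, not a `datasets` function.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test


# First split: 44 works, 20% held out
n_train, n_holdout = split_sizes(44, 0.2)
# Second split: the held-out works are halved; the 'train' half becomes validation
n_validation, n_test = split_sizes(n_holdout, 0.5)

print(n_train, n_test, n_validation)  # 35 5 4
```

With only 44 rows, the validation and test sets each hold a handful of works, so metrics computed on them will depend noticeably on which works happen to land in each split.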

In [59]:
print(dataset['train'][0]['text'])
print(dataset['validation'][0]['text']) 
print(dataset['test'][0]['text']) 



#### Tokenizing

In [65]:
# Exact copy of torchtext's basic_english tokenizer
# Source: https://github.com/pytorch/text/blob/main/torchtext/data/utils.py

_patterns = [r"\'", r"\"", r"\.", r"<br \/>", r",", r"\(", r"\)", r"\!", r"\?", r"\;", r"\:", r"\s+"]
_replacements = [" '  ", "", " . ", " ", " , ", " ( ", " ) ", " ! ", " ? ", " ", " ", " "]
_patterns_dict = list((re.compile(p), r) for p, r in zip(_patterns, _replacements))

def _basic_english_normalize(line):
    line = line.lower()
    for pattern_re, replaced_str in _patterns_dict:
        line = pattern_re.sub(replaced_str, line)
    return line.split()

def basic_english_tokenizer(text):
    """Tokenizer matching torchtext's basic_english implementation"""
    return _basic_english_normalize(text)

tokenize_data = lambda example, tokenizer: {'tokens': tokenizer(example['text'])}

tokenized_dataset = dataset.map(tokenize_data, remove_columns=['text'], fn_kwargs={'tokenizer': basic_english_tokenizer})

Map:   0%|          | 0/35 [00:00<?, ? examples/s]

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

Map:   0%|          | 0/4 [00:00<?, ? examples/s]
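
To see what the tokenizer does to a single line, here is a small self-contained sketch that repeats the pattern list from the cell above and runs it on a made-up sentence. `tokenize` and `sample` are illustrative names only.

```python
import re

# Same pattern/replacement pairs as in the tokenizer cell
_PATTERNS = [r"\'", r"\"", r"\.", r"<br \/>", r",", r"\(", r"\)", r"\!", r"\?", r"\;", r"\:", r"\s+"]
_REPLACEMENTS = [" '  ", "", " . ", " ", " , ", " ( ", " ) ", " ! ", " ? ", " ", " ", " "]
_RULES = [(re.compile(p), r) for p, r in zip(_PATTERNS, _REPLACEMENTS)]


def tokenize(line):
    """Lowercase, apply each substitution in order, then split on whitespace."""
    line = line.lower()
    for pattern, replacement in _RULES:
        line = pattern.sub(replacement, line)
    return line.split()


sample = "<START> 'tis well: that ends well ! <END>"
tokens = tokenize(sample)
print(tokens)
# ['<start>', "'", 'tis', 'well', 'that', 'ends', 'well', '!', '<end>']
```

Two side effects are worth keeping in mind: the tokenizer lowercases its input, so the `<START>` and `<END>` markers end up in the token stream as `<start>` and `<end>`, and colons and semicolons are replaced by spaces, so they never become tokens.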

In [61]:
# tokenized_dataset is a DatasetDict, so pick a split before indexing a row
print(tokenized_dataset['train'][0])



#### Numericalizing

Add every word that occurs at least three times in the training split to the vocabulary; without this threshold the vocabulary would be too large. The custom `Vocab` class below stands in for `torchtext.vocab`. Also add `<unk>` to handle out-of-vocabulary words and `<eos>` to mark the end of a sentence.

Limiting the vocabulary to words that appear at least a few times (e.g., 2 or 3) is common practice in NLP because it reduces memory usage and model complexity. Special tokens such as `<unk>` and `<eos>` are the standard way to handle unknown words and mark sequence boundaries in language modeling.
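
The frequency threshold and the unknown-token fallback can be sketched with plain dictionaries. This is a simplified illustration under stated assumptions, not the `Vocab` class used in the notebook: `build_vocab`, `numericalize`, and the toy `docs` are hypothetical names and data.

```python
from collections import Counter


def build_vocab(token_lists, min_freq=3, specials=("<unk>", "<eos>")):
    """Return (stoi, itos) with the special tokens first, then frequent tokens."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = list(specials)
    for tok, count in counts.most_common():
        if count >= min_freq and tok not in specials:
            itos.append(tok)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos


def numericalize(tokens, stoi, unk="<unk>"):
    """Map tokens to indices; anything outside the vocabulary maps to <unk>."""
    return [stoi.get(tok, stoi[unk]) for tok in tokens]


docs = [["to", "be", "or", "not", "to", "be"], ["to", "be", "king"]]
stoi, itos = build_vocab(docs, min_freq=3)
ids = numericalize(["to", "be", "king"], stoi)

print(itos)  # ['<unk>', '<eos>', 'to', 'be']
print(ids)   # [2, 3, 0]
```

Here "king" occurs only once, so it falls below the threshold and is mapped to index 0, the position of `<unk>`.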

In [62]:
UNKNOWN_TOKEN = "<unk>"
END_OF_SENTENCE_TOKEN = "<eos>"

In [63]:
# Custom Vocab class to replace torchtext.vocab
class Vocab:
    def __init__(self, counter, min_freq=1, specials=None):
        self.itos = []  # index to string
        self.stoi = {}  # string to index
        self.default_index = 0
        
        # Add special tokens first
        if specials:
            for token in specials:
                self._add_token(token)
        
        # Add tokens that meet min_freq threshold
        for token, count in counter.most_common():
            if count >= min_freq:
                if token not in self.stoi:
                    self._add_token(token)
    
    def _add_token(self, token):
        if token not in self.stoi:
            self.stoi[token] = len(self.itos)
            self.itos.append(token)
    
    def set_default_index(self, index):
        self.default_index = index
    
    def get_itos(self):
        return self.itos
    
    def __getitem__(self, token):
        return self.stoi.get(token, self.default_index)
    
    def __len__(self):
        return len(self.itos)

# Build vocabulary from tokenized data
counter = Counter()
for tokens in tokenized_dataset['train']['tokens']:
    counter.update(tokens)

vocab = Vocab(counter, min_freq=3, specials=[UNKNOWN_TOKEN, END_OF_SENTENCE_TOKEN])
vocab.set_default_index(vocab[UNKNOWN_TOKEN])

ValueError: Column 'train' doesn't exist.

### 2) Describe the model architecture and the training process. (1 point)