# A2: Language Model

In this assignment, we will focus on building a language model using a text dataset of your choice. The objective is to train a model that can generate coherent and contextually relevant text based on a given input. Additionally, you will develop a simple web application to demonstrate the capabilities of your language model interactively.

## Task 1. Dataset Acquisition - Your first task is to find a suitable text dataset. (1 point)

### 1) Choose your dataset and provide a brief description. Source the dataset from a reputable public database or repository, and credit the dataset source in your documentation.

Note: The dataset can be based on any theme such as Harry Potter, Star Wars, jokes, Isaac Asimov’s works, Thai stories, etc. The key requirement is that the dataset should be text-rich and suitable for language modeling.

### 0. Import Libraries

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim

import datasets, math, re
from collections import Counter
from tqdm import tqdm

In [3]:
# minimum required torch version for MPS support: 1.12+
torch.__version__

'2.10.0'

In [4]:
# device selection: prefer an NVIDIA GPU (CUDA), then an Apple Silicon GPU (MPS), else fall back to CPU
import torch

def get_device():
    if torch.cuda.is_available():
        return torch.device("cuda")      # NVIDIA GPU
    elif torch.backends.mps.is_available():
        return torch.device("mps")       # Apple Silicon GPU
    else:
        return torch.device("cpu")

device = get_device()

print(f"Using device: {device}")

Using device: mps


In [7]:
def force_cpu_device():
    # Always return the CPU device, regardless of which accelerators are available
    return torch.device('cpu')

### 1. Load data from Project Gutenberg

<i>Excerpt from the Project Gutenberg site:</i>

<b>About Project Gutenberg</b>

Project Gutenberg is an online library of more than 75,000 free eBooks.

Michael Hart, founder of Project Gutenberg, invented eBooks in 1971 and his memory continues to inspire the creation of eBooks and related content today.

Since then, thousands of volunteers have digitized and diligently proofread the world’s literature. The entire Project Gutenberg collection is yours to enjoy.

All Project Gutenberg eBooks are completely free and always will be.


Text used for training: [The Project Gutenberg eBook of The Complete Works of William Shakespeare](https://www.gutenberg.org/cache/epub/100/pg100.txt)

<details>
<summary>Contents</summary>

    THE SONNETS
    ALL’S WELL THAT ENDS WELL
    THE TRAGEDY OF ANTONY AND CLEOPATRA
    AS YOU LIKE IT
    THE COMEDY OF ERRORS
    THE TRAGEDY OF CORIOLANUS
    CYMBELINE
    THE TRAGEDY OF HAMLET, PRINCE OF DENMARK
    THE FIRST PART OF KING HENRY THE FOURTH
    THE SECOND PART OF KING HENRY THE FOURTH
    THE LIFE OF KING HENRY THE FIFTH
    THE FIRST PART OF HENRY THE SIXTH
    THE SECOND PART OF KING HENRY THE SIXTH
    THE THIRD PART OF KING HENRY THE SIXTH
    KING HENRY THE EIGHTH
    THE LIFE AND DEATH OF KING JOHN
    THE TRAGEDY OF JULIUS CAESAR
    THE TRAGEDY OF KING LEAR
    LOVE’S LABOUR’S LOST
    THE TRAGEDY OF MACBETH
    MEASURE FOR MEASURE
    THE MERCHANT OF VENICE
    THE MERRY WIVES OF WINDSOR
    A MIDSUMMER NIGHT’S DREAM
    MUCH ADO ABOUT NOTHING
    THE TRAGEDY OF OTHELLO, THE MOOR OF VENICE
    PERICLES, PRINCE OF TYRE
    KING RICHARD THE SECOND
    KING RICHARD THE THIRD
    THE TRAGEDY OF ROMEO AND JULIET
    THE TAMING OF THE SHREW
    THE TEMPEST
    THE LIFE OF TIMON OF ATHENS
    THE TRAGEDY OF TITUS ANDRONICUS
    TROILUS AND CRESSIDA
    TWELFTH NIGHT; OR, WHAT YOU WILL
    THE TWO GENTLEMEN OF VERONA
    THE TWO NOBLE KINSMEN
    THE WINTER’S TALE
    A LOVER’S COMPLAINT
    THE PASSIONATE PILGRIM
    THE PHOENIX AND THE TURTLE
    THE RAPE OF LUCRECE
    VENUS AND ADONIS
</details>

In [11]:
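import os
import requests

DATA_LOCAL_PATH = "../data/gutenberg_pg100.txt"
DATA_URL = "https://www.gutenberg.org/cache/epub/100/pg100.txt"

if not os.path.exists(DATA_LOCAL_PATH):
    # Download only if the file doesn't exist locally
    response = requests.get(DATA_URL, timeout=60)
    response.raise_for_status()  # stop on HTTP errors instead of saving an error page
    text = response.text
    # Create the target directory if needed, then save to a local file
    os.makedirs(os.path.dirname(DATA_LOCAL_PATH), exist_ok=True)
    with open(DATA_LOCAL_PATH, "w", encoding="utf-8") as f:
        f.write(text)
else:
    # Reuse the cached local copy
    with open(DATA_LOCAL_PATH, "r", encoding="utf-8") as f:
        text = f.read()

print(text[:1000])  # Print the first 1000 characters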

The Project Gutenberg eBook of The Complete Works of William Shakespeare
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: The Complete Works of William Shakespeare

Author: William Shakespeare

Release date: January 1, 1994 [eBook #100]
                Most recently updated: August 24, 2025

Language: English



*** START OF THE PROJECT GUTENBERG EBOOK THE COMPLETE WORKS OF WILLIAM SHAKESPEARE ***




The Complete Works of William Shakespeare

by William Shakespeare




                    Contents

    THE SONNETS
    ALL’S WELL THAT ENDS WELL
    THE TRAGEDY OF ANTONY AND CLEOPATRA
 


In [None]:
import re

def clean_text(text):
    # Trim the Gutenberg header/footer: keep the span from the "*** START" marker
    # up to (but not including) the "*** END" marker.
    # Note: the START marker line itself stays in the result.
    start = text.find("*** START")
    end = text.find("*** END")
    if start != -1 and end != -1:
        text = text[start:end]
    # Normalize whitespace: collapse each run (including newlines) into one space
    text = re.sub(r'\s+', ' ', text)
    # Lowercase
    text = text.lower()
    # Remove non-ASCII (optional): r'[^\x00-\x7F]+' matches runs of characters
    # outside the ASCII range (hex 00 to 7F). Printable ASCII (codes 32-126) covers
    # space, letters (A-Z, a-z), digits (0-9), punctuation, and basic symbols.
    # Note: typographic apostrophes are non-ASCII, so "all’s" becomes "alls".
    text = re.sub(r'[^\x00-\x7F]+', '', text)
    return text

cleaned = clean_text(text)
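
The cleaning step above has two side effects: slicing from the `*** START` marker keeps the marker text in the result, and the non-ASCII filter deletes typographic apostrophes instead of converting them. The following is a minimal sketch of one possible stricter variant, not the cleaning used in this notebook. The names `clean_text_strict`, `START_RE`, `END_RE`, and `PUNCT_MAP` are hypothetical, and the marker patterns assume the wording seen in the pg100.txt header.

```python
import re

# Hypothetical helpers for illustration; not part of the original notebook.
# The patterns assume the marker wording used in the pg100.txt header/footer.
START_RE = re.compile(r"\*\*\* START OF[^\n]*\*\*\*")
END_RE = re.compile(r"\*\*\* END OF")

# Map common typographic characters to ASCII before the ASCII filter runs.
PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes / apostrophe
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
})

def clean_text_strict(text):
    start = START_RE.search(text)
    end = END_RE.search(text)
    # Slice from the end of the START marker so the marker itself is dropped.
    if start and end and start.end() <= end.start():
        text = text[start.end():end.start()]
    text = text.translate(PUNCT_MAP)             # keep apostrophes as ASCII
    text = re.sub(r"[^\x00-\x7F]+", "", text)    # drop remaining non-ASCII
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace
    return text.lower()

sample = (
    "Header text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK THE COMPLETE WORKS ***\n\n"
    "ALL\u2019S WELL THAT ENDS WELL\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK THE COMPLETE WORKS ***\n"
    "License text"
)
print(clean_text_strict(sample))
```

Whether to keep apostrophes depends on the tokenizer used later: with them, `all's` and `alls` stay distinct tokens, at the cost of a slightly larger vocabulary.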



In [18]:
print(cleaned[:1000])  # Print the first 1000 characters of cleaned text    

len(cleaned)

*** start of the project gutenberg ebook the complete works of william shakespeare *** the complete works of william shakespeare by william shakespeare contents the sonnets alls well that ends well the tragedy of antony and cleopatra as you like it the comedy of errors the tragedy of coriolanus cymbeline the tragedy of hamlet, prince of denmark the first part of king henry the fourth the second part of king henry the fourth the life of king henry the fifth the first part of henry the sixth the second part of king henry the sixth the third part of king henry the sixth king henry the eighth the life and death of king john the tragedy of julius caesar the tragedy of king lear loves labours lost the tragedy of macbeth measure for measure the merchant of venice the merry wives of windsor a midsummer nights dream much ado about nothing the tragedy of othello, the moor of venice pericles, prince of tyre king richard the second king richard the third the tragedy of romeo and juliet the taming 

5265793
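
Beyond the raw character count, a quick summary of the cleaned text can confirm that the cleaning behaved as expected before any model is trained. The sketch below is an illustrative check only; `summarize_text` is a hypothetical helper, and it splits on whitespace, which is an assumption and may differ from the tokenizer used later.

```python
from collections import Counter

def summarize_text(text, top_k=5):
    """Return basic statistics for an already-cleaned string."""
    tokens = text.split()  # whitespace tokens only; punctuation stays attached
    return {
        "n_chars": len(text),
        "n_tokens": len(tokens),
        "n_unique_tokens": len(set(tokens)),
        # Should be 0 if the non-ASCII filter ran
        "non_ascii_chars": sum(1 for ch in text if ord(ch) > 127),
        "top_tokens": Counter(tokens).most_common(top_k),
    }

sample = "to be or not to be that is the question"
stats = summarize_text(sample, top_k=2)
print(stats)
```

In the notebook, the same call would be `summarize_text(cleaned)`.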