# LSTM Language Models

You guys probably very excited about ChatGPT.  In today class, we will be implementing a very simple language model, which is basically what ChatGPT is, but with a simple LSTM.  You will be surprised that it is not so difficult at all.

Paper that we base on is *Regularizing and Optimizing LSTM Language Models*, https://arxiv.org/abs/1708.02182

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
import torchtext, datasets, math, torchtext
from tqdm import tqdm
from datasets import Dataset, DatasetDict


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.2.0 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/opt/tljh/user/lib/python3.12/site-packages/ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "/opt/tljh/user/lib/python3.12/site-packages/traitlets/config/application.py", line 1075, in launch_instance
    app.start()
  File "/opt/tljh/user/lib/python3.12/site-packages/ipykernel/kernelapp.py", line 739, in start
    self.io_loop.start()
  File "/opt/tljh/user/lib/python3.12/site-

In [2]:
import torch

# Check if CUDA is available
if torch.cuda.is_available():
    # Get the number of GPUs
    num_gpus = torch.cuda.device_count()
    print(f"Number of GPUs available: {num_gpus}")
    
    # List all GPU devices
    for i in range(num_gpus):
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
else:
    print("CUDA is not available. No GPUs detected.")

Number of GPUs available: 4
GPU 0: NVIDIA GeForce RTX 2080 Ti
GPU 1: NVIDIA GeForce RTX 2080 Ti
GPU 2: NVIDIA GeForce RTX 2080 Ti
GPU 3: NVIDIA GeForce RTX 2080 Ti


In [3]:
device = torch.device(f"cuda:{3}" if torch.cuda.is_available() else "cpu")

In [4]:
SEED = 1234
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

## 1. Load data - Cleaned Harry Potter Story

In [5]:
file_path = 'dataset/cleaned_story.csv'
data = pd.read_csv(file_path)

hf_dataset = Dataset.from_pandas(data)

hf_dataset = hf_dataset.shuffle(seed=42)

train_size = int(0.8 * len(hf_dataset))
validation_size = int(0.1 * len(hf_dataset))

train_dataset = hf_dataset.select(range(train_size))
validation_dataset = hf_dataset.select(range(train_size, train_size + validation_size))
test_dataset = hf_dataset.select(range(train_size + validation_size, len(hf_dataset)))

final_dataset = DatasetDict({
    'train': train_dataset,
    'validation': validation_dataset,
    'test': test_dataset
})

print(final_dataset['train'].shape)
print(final_dataset['test'].shape)
print(final_dataset['validation'].shape)

(66614, 1)
(8328, 1)
(8326, 1)


## 2. Tokenization

In [6]:
# Tokenize the dataset using the torchtext tokenizer
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')

# Define the tokenize function
def tokenize_data(example):
    return {'tokens': tokenizer(example['text'])}

# Apply the tokenize function to all splits
tokenized_dataset = final_dataset.map(
    tokenize_data,
    remove_columns=['text']
)

# Save the tokenized dataset to files (optional)
tokenized_dataset.save_to_disk('tokenized_dataset')

# Inspect the processed dataset
print(tokenized_dataset)

Map:   0%|          | 0/66614 [00:00<?, ? examples/s]

Map:   0%|          | 0/8326 [00:00<?, ? examples/s]

Map:   0%|          | 0/8328 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/66614 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/8326 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/8328 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['tokens'],
        num_rows: 66614
    })
    validation: Dataset({
        features: ['tokens'],
        num_rows: 8326
    })
    test: Dataset({
        features: ['tokens'],
        num_rows: 8328
    })
})


In [7]:
print(tokenized_dataset['train'][1]['tokens'])

['it', 'was', 'eerie', ',', 'spine-tingling', ',', 'unearthly', 'it', 'lifted', 'the', 'hair', 'on', 'harry’s', 'scalp', 'and', 'made', 'his', 'heart', 'feel', 'as', 'though', 'it', 'was', 'swelling', 'to', 'twice', 'its', 'normal', 'size', '.']


In [8]:
print(tokenized_dataset['test'][1]['tokens'])

['just', 're-', 'member', 'tread', 'carefully', 'around', 'dolores', 'umbridge', '.', '”', '“but', 'i', 'was', 'telling', 'the', 'truth', '!', '”', 'said', 'harry', ',', 'outraged', '.']


In [9]:
print(tokenized_dataset['validation'][1]['tokens'])

['chuck', 'us', 'the', 'hair', 'and', 'the', 'potion', ',', 'then', '.', '”', 'within', 'two', 'minutes', ',', 'ron', 'stood', 'before', 'them', ',', 'as', 'small', 'and', 'ferrety', 'as', 'the', 'sick', 'wizard', ',', 'and', 'wearing', 'the', 'navy', 'blue', 'robes', 'that', 'had', 'been', 'folded', 'in', 'his', 'bag', '.']


### Numericalizing

We will tell torchtext to add any word that has occurred at least three times in the dataset to the vocabulary because otherwise it would be too big.  Also we shall make sure to add `unk` and `eos`.

In [10]:
vocab = torchtext.vocab.build_vocab_from_iterator(tokenized_dataset['train']['tokens'], min_freq=3)
vocab.insert_token('<unk>', 0)
vocab.insert_token('<eos>', 1)
vocab.set_default_index(vocab['<unk>'])

In [11]:
print(len(vocab))

13388


In [12]:
print(vocab.get_itos()[:10])

['<unk>', '<eos>', '.', ',', 'the', '”', 'and', 'to', 'of', 'a']


## 3. Prepare the batch loader

### Prepare data

Given "Chaky loves eating at AIT", and "I really love deep learning", and given batch size = 3, we will get three batches of data "Chaky loves eating at", "AIT `<eos>` I really", "love deep learning `<eos>`".  

In [13]:
def get_data(dataset, vocab, batch_size):
    data = []
    for example in dataset:
        if example['tokens']:
            tokens = example['tokens'].append('<eos>')
            tokens = [vocab[token] for token in example['tokens']]
            data.extend(tokens)
    data = torch.LongTensor(data)
    num_batches = data.shape[0] // batch_size
    data = data[:num_batches * batch_size]
    data = data.view(batch_size, num_batches) #view vs. reshape (whether data is contiguous)
    return data #[batch size, seq len]

In [14]:
batch_size = 256
train_data = get_data(tokenized_dataset['train'], vocab, batch_size)
valid_data = get_data(tokenized_dataset['validation'], vocab, batch_size)
test_data  = get_data(tokenized_dataset['test'],  vocab, batch_size)

In [15]:
train_data.shape

torch.Size([256, 4527])

## 4. Modeling 

<img src="figures/LM.png" width=600>

In [16]:
class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim, num_layers, dropout_rate):
        super().__init__()
        self.num_layers = num_layers
        self.hid_dim    = hid_dim
        self.emb_dim    = emb_dim
        
        self.embedding  = nn.Embedding(vocab_size, emb_dim)
        self.lstm       = nn.LSTM(emb_dim, hid_dim, num_layers=num_layers, dropout=dropout_rate, batch_first=True)
        self.dropout    = nn.Dropout(dropout_rate)
        self.fc         = nn.Linear(hid_dim, vocab_size)
        
        self.init_weights()
    
    def init_weights(self):
        init_range_emb = 0.1
        init_range_other = 1/math.sqrt(self.hid_dim)
        self.embedding.weight.data.uniform_(-init_range_emb, init_range_other)
        self.fc.weight.data.uniform_(-init_range_other, init_range_other)
        self.fc.bias.data.zero_()
        for i in range(self.num_layers):
            self.lstm.all_weights[i][0] = torch.FloatTensor(self.emb_dim,
                self.hid_dim).uniform_(-init_range_other, init_range_other) #We
            self.lstm.all_weights[i][1] = torch.FloatTensor(self.hid_dim,   
                self.hid_dim).uniform_(-init_range_other, init_range_other) #Wh
    
    def init_hidden(self, batch_size, device):
        hidden = torch.zeros(self.num_layers, batch_size, self.hid_dim).to(device)
        cell   = torch.zeros(self.num_layers, batch_size, self.hid_dim).to(device)
        return hidden, cell
        
    def detach_hidden(self, hidden):
        hidden, cell = hidden
        hidden = hidden.detach() #not to be used for gradient computation
        cell   = cell.detach()
        return hidden, cell
        
    def forward(self, src, hidden):
        #src: [batch_size, seq len]
        embedding = self.dropout(self.embedding(src)) #harry potter is
        #embedding: [batch-size, seq len, emb dim]
        output, hidden = self.lstm(embedding, hidden)
        #ouput: [batch size, seq len, hid dim]
        #hidden: [num_layers * direction, seq len, hid_dim]
        output = self.dropout(output)
        prediction =self.fc(output)
        #prediction: [batch_size, seq_len, vocab_size]
        return prediction, hidden

## 5. Training 

Follows very basic procedure.  One note is that some of the sequences that will be fed to the model may involve parts from different sequences in the original dataset or be a subset of one (depending on the decoding length). For this reason we will reset the hidden state every epoch, this is like assuming that the next batch of sequences is probably always a follow up on the previous in the original dataset.

In [17]:
vocab_size = len(vocab)
emb_dim = 1024                # 400 in the paper
hid_dim = 1024                # 1150 in the paper
num_layers = 2                # 3 in the paper
dropout_rate = 0.65              
lr = 1e-3                     

In [18]:
model      = LSTMLanguageModel(vocab_size, emb_dim, hid_dim, num_layers, dropout_rate).to(device)
optimizer  = optim.Adam(model.parameters(), lr=lr)
criterion  = nn.CrossEntropyLoss()
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'The model has {num_params:,} trainable parameters')

The model has 44,225,612 trainable parameters


In [19]:
def get_batch(data, seq_len, idx):
    #data #[batch size, bunch of tokens]
    src    = data[:, idx:idx+seq_len]                   
    target = data[:, idx+1:idx+seq_len+1]  #target simply is ahead of src by 1            
    return src, target

In [20]:
def train(model, data, optimizer, criterion, batch_size, seq_len, clip, device):
    
    epoch_loss = 0
    model.train()
    # drop all batches that are not a multiple of seq_len
    # data #[batch size, seq len]
    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches -1) % seq_len]  #we need to -1 because we start at 0
    num_batches = data.shape[-1]
    
    #reset the hidden every epoch
    hidden = model.init_hidden(batch_size, device)
    
    for idx in tqdm(range(0, num_batches - 1, seq_len), desc='Training: ',leave=False):
        optimizer.zero_grad()
        
        #hidden does not need to be in the computational graph for efficiency
        hidden = model.detach_hidden(hidden)

        src, target = get_batch(data, seq_len, idx) #src, target: [batch size, seq len]
        src, target = src.to(device), target.to(device)
        batch_size = src.shape[0]
        prediction, hidden = model(src, hidden)               

        #need to reshape because criterion expects pred to be 2d and target to be 1d
        prediction = prediction.reshape(batch_size * seq_len, -1)  #prediction: [batch size * seq len, vocab size]  
        target = target.reshape(-1)
        loss = criterion(prediction, target)
        
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item() * seq_len
    return epoch_loss / num_batches

In [21]:
def evaluate(model, data, criterion, batch_size, seq_len, device):

    epoch_loss = 0
    model.eval()
    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches -1) % seq_len]
    num_batches = data.shape[-1]

    hidden = model.init_hidden(batch_size, device)

    with torch.no_grad():
        for idx in range(0, num_batches - 1, seq_len):
            hidden = model.detach_hidden(hidden)
            src, target = get_batch(data, seq_len, idx)
            src, target = src.to(device), target.to(device)
            batch_size= src.shape[0]

            prediction, hidden = model(src, hidden)
            prediction = prediction.reshape(batch_size * seq_len, -1)
            target = target.reshape(-1)

            loss = criterion(prediction, target)
            epoch_loss += loss.item() * seq_len
    return epoch_loss / num_batches

Here we will be using a `ReduceLROnPlateau` learning scheduler which decreases the learning rate by a factor, if the loss don't improve by a certain epoch.

In [23]:
n_epochs = 100
seq_len  = 50 #<----decoding length
clip    = 0.25

lr_scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=0)

best_valid_loss = float('inf')

for epoch in range(n_epochs):
    train_loss = train(model, train_data, optimizer, criterion, 
                batch_size, seq_len, clip, device)
    valid_loss = evaluate(model, valid_data, criterion, batch_size, 
                seq_len, device)

    lr_scheduler.step(valid_loss)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'st125338_best-val-lstm_lm.pt')
        
    if epoch % 10 == 0:
        print(f'Stamp at: {epoch} Epoch')
        print(f'\tTrain Perplexity: {math.exp(train_loss):.3f}')
        print(f'\tValid Perplexity: {math.exp(valid_loss):.3f}')

                                                         

Stamp at: 0 Epoch
	Train Perplexity: 555.006
	Valid Perplexity: 469.764


                                                         

Stamp at: 10 Epoch
	Train Perplexity: 71.514
	Valid Perplexity: 68.975


                                                         

Stamp at: 20 Epoch
	Train Perplexity: 51.991
	Valid Perplexity: 59.737


                                                         

Stamp at: 30 Epoch
	Train Perplexity: 42.327
	Valid Perplexity: 57.420


IOPub message rate exceeded.7/90 [00:20<00:03,  3.84it/s]
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_msg_rate_limit`.

Current values:
ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
ServerApp.rate_limit_window=3.0 (secs)

                                                         

Stamp at: 60 Epoch
	Train Perplexity: 39.128
	Valid Perplexity: 56.972


                                                         

Stamp at: 70 Epoch
	Train Perplexity: 39.098
	Valid Perplexity: 56.972


                                                         

Stamp at: 80 Epoch
	Train Perplexity: 39.088
	Valid Perplexity: 56.972


                                                         

Stamp at: 90 Epoch
	Train Perplexity: 39.080
	Valid Perplexity: 56.972


                                                         

## 6. Testing

In [24]:
model.load_state_dict(torch.load('st125338_best-val-lstm_lm.pt',  map_location=device))
test_loss = evaluate(model, test_data, criterion, batch_size, seq_len, device)
print(f'Test Perplexity: {math.exp(test_loss):.3f}')

Test Perplexity: 56.889


## 7. Real-world inference

Here we take the prompt, tokenize, encode and feed it into the model to get the predictions.  We then apply softmax while specifying that we want the output due to the last word in the sequence which represents the prediction for the next word.  We divide the logits by a temperature value to alter the model’s confidence by adjusting the softmax probability distribution.

Once we have the Softmax distribution, we randomly sample it to make our prediction on the next word. If we get <unk> then we give that another try.  Once we get <eos> we stop predicting.
    
We decode the prediction back to strings last lines.

In [25]:
def generate(prompt, max_seq_len, temperature, model, tokenizer, vocab, device, seed=None):
    if seed is not None:
        torch.manual_seed(seed)
    model.eval()
    tokens = tokenizer(prompt)
    indices = [vocab[t] for t in tokens]
    batch_size = 1
    hidden = model.init_hidden(batch_size, device)
    with torch.no_grad():
        for i in range(max_seq_len):
            src = torch.LongTensor([indices]).to(device)
            prediction, hidden = model(src, hidden)
            
            #prediction: [batch size, seq len, vocab size]
            #prediction[:, -1]: [batch size, vocab size] #probability of last vocab
            
            probs = torch.softmax(prediction[:, -1] / temperature, dim=-1)  
            prediction = torch.multinomial(probs, num_samples=1).item()    
            
            while prediction == vocab['<unk>']: #if it is unk, we sample again
                prediction = torch.multinomial(probs, num_samples=1).item()

            if prediction == vocab['<eos>']:    #if it is eos, we stop
                break

            indices.append(prediction) #autoregressive, thus output becomes input

    itos = vocab.get_itos()
    tokens = [itos[i] for i in indices]
    return tokens

In [27]:
prompt = 'The Hogwarts is '
max_seq_len = 30
seed = 0

#smaller the temperature, more diverse tokens but comes 
#with a tradeoff of less-make-sense sentence
temperatures = [0.5, 0.7, 0.75, 0.8, 1.0]
for temperature in temperatures:
    generation = generate(prompt, max_seq_len, temperature, model, tokenizer, 
                          vocab, device, seed)
    print(str(temperature)+'\n'+' '.join(generation)+'\n')

0.5
the hogwarts is a witch , and a week ago , and the great half-breed of the family .

0.7
the hogwarts is a witch , the greatest wizard of the dark world .

0.75
the hogwarts is a old map , and at least off a fire .

0.8
the hogwarts is a old map , and at least off a fire .

1.0
the hogwarts is — hic — usually — those — i ’ave a plain deal , keeping it in the end — what seemed to get up to a feast and we know

