# A2: Text Generation- LSTM Language Model

### Library Imports for NLP and Deep Learning

This code segment imports a comprehensive set of libraries essential for conducting deep learning and natural language processing (NLP) tasks:

- **Core Libraries**: Includes `os` for operating system interactions, and `math` for mathematical operations. Additionally `time` for timing and scheduling tasks and `json` for handling JSON operations
- **Deep Learning**: Utilizes `torch` along with its neural network (`nn`) and optimization (`optim`) submodules for building and training deep learning models.
- **Natural Language Processing**: Employs `torchtext` for text processing and `datasets` for easy access to a wide range of datasets. It specifically uses `get_tokenizer` for tokenizing text and `vocab` for managing vocabulary and embeddings.
- **Progress Tracking**: Integrates `tqdm` for progress bars, enhancing visibility and tracking of long-running operations.

These imports are foundational for tackling complex tasks in machine learning, offering tools for data handling, model creation, and algorithm optimization.



In [1]:
# Import the os module for operating system interactions, such as file path operations
import os

# Import core PyTorch modules for building and training neural network models
import torch
import torch.nn as nn  # Neural network module
import torch.optim as optim  # Optimization algorithms

# Import torchtext for natural language processing (NLP) tasks, providing utilities for text data preprocessing and handling
import torchtext
import datasets  # Import the datasets module for accessing and using a variety of datasets
import math  # Import math for mathematical operations, often used in setting hyperparameters or processing data

# Import specific utilities from torchtext
from torchtext.data.utils import get_tokenizer  # Utility for tokenizing text data
import torchtext.vocab as vocab  # Module for handling vocabularies and embeddings

# Import tqdm for providing progress bars during loops, enhancing visibility of long-running operations
from tqdm import tqdm

# Import time for timing operations, useful in performance measurement or scheduling tasks
import time
import json  # Import the JSON module to handle JSON operations

Note that to actually see the version numbers, we would need to run this code in a Python environment where the libraries are installed.

In [None]:
## Checking Library Versions
print(torch.__version__)
print(torchtext.__version__)
print(datasets.__version__)

2.1.0+cu121
0.16.0+cpu
2.16.1


This code snippet is responsible for setting up the device configuration in PyTorch, ensuring that tensor computations are performed on a GPU if one is available, otherwise on a CPU. It performs the following:

- **Check CUDA Availability**: Checks if CUDA (GPU support) is available.
- **Set Device**: Sets the device to 'cuda' if a GPU is available, otherwise to 'cpu'.
- **Print Device**: Outputs the device being used, providing clear feedback on whether the computations will leverage GPU acceleration or not.

This setup is crucial for optimizing performance in deep learning tasks, enabling faster computations and efficient resource utilization when a GPU is available.


In [2]:
# Determine if a CUDA (NVIDIA GPU) is available, set the device to 'cuda'. Otherwise, use 'cpu'.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


# Print the device that will be used for tensor operations ('cuda' or 'cpu')
print(device)

cpu


This code segment is crucial for ensuring reproducibility in PyTorch models. It performs the following actions:

- **Set a Fixed Seed**: Initializes a constant seed value (`SEED`) for any random number generation.
- **Seed PyTorch**: Applies the seed to PyTorch's random number generator, ensuring that model initialization, random number generation, and other stochastic processes are consistent across runs.
- **Ensure Deterministic Behavior**: Configures CuDNN (a neural network backend used by PyTorch when running on CUDA) to behave deterministically, further ensuring that results are consistent and reproducible, particul


In [None]:
# Set a seed value to ensure reproducibility of results
SEED = 1234
# Seed the random number generator for PyTorch to ensure consistent results during training
torch.manual_seed(SEED)
# Make CuDNN backend (used by CUDA for neural network computations) behavior deterministic
torch.backends.cudnn.deterministic = True

## 1. Load data - Harry Potter Books Dataset

### 1.1. Dataset Source
- **Website**: [HPD Dataset Download](https://nuochenpku.github.io/HPD.github.io/download)
- **Research Background of the dataset**: Utilized in the study for enhancing dialogue agents and character alignment in AI models, detailed in the paper [Dialogue-style Large Language Models for Character Alignment in Open-domain Dialogue Agents](https://arxiv.org/abs/2211.06869).

#### Dataset Description
- Contains text from the Harry Potter novels (7 books) in txt format, structured for training and evaluating Language Models (LLMs) like ChatGPT and GPT4.

#### Institutions
- **Developed by**: Tencent AI Lab & Department of Systems Analysis (DSA), Hong Kong University of Science and Technology (Guangzhou), and Hong Kong University of Science and Technology.

---

For additional details or to access the dataset, visit the [HPD Dataset Download page](https://nuochenpku.github.io/HPD.github.io/download)

### 1.2. Text Aggregation

This codesegment concatenates text from structured files into a single string, preserving the natural order of chapters and books. It's ideal for comprehensive text analysis or streamlined reading of large text datasets.

In [3]:
# Define the root directory where the book chapters are stored
root_directory = 'harry-potter-book-chapters'

# Initialize a string to store the concatenated text of all files
all_text = ""

# List all directories (each representing a book) within the root directory
book_directories = [dir_name for dir_name in os.listdir(root_directory) if os.path.isdir(os.path.join(root_directory, dir_name))]

# Iterate through each book directory in alphabetical order
for book_dir in sorted(book_directories):
    # Construct the path to the current book directory
    book_path = os.path.join(root_directory, book_dir)

    # List all text files (representing chapters) in the current book directory, ensuring they end with '.txt'
    chapter_files = [file_name for file_name in os.listdir(book_path) if file_name.endswith('.txt')]

    # Sort the chapter files numerically based on the chapter number to maintain the correct order
    sorted_chapter_files = sorted(chapter_files, key=lambda x: int(x.split('_')[1].split('.')[0]))

    # Iterate through each chapter file in sorted order
    for chapter_file in sorted_chapter_files:
        # Construct the path to the current chapter file
        chapter_path = os.path.join(book_path, chapter_file)

        # Open and read the chapter file, ensuring the correct encoding
        with open(chapter_path, 'r', encoding='utf-8') as file:
            # Read the content of the chapter file and append it to the all_text string, adding a space after each chapter
            all_text += file.read() + " "

# Output the first 100 characters of the concatenated text to verify the content
print(all_text[:100])


CHAPTER ONE
THE BOY WHO LIVED
Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say 


## 2. Preprocessing

### 2.1. Text Tokenization
This snippet utilizes the `torchtext` library's `get_tokenizer` function to tokenize a large body of text. It performs the following steps:

- **Initialize Tokenizer**: Sets up a basic English tokenizer, suitable for processing English text.
- **Tokenize Text**: Applies the tokenizer to the pre-compiled large text (`all_text`), effectively splitting it into individual words or tokens.

The output, `all_tokens`, is a list of words derived from the original text, ready for further natural language processing or text analysis tasks.

In [4]:
# Initialize a tokenizer for English language, using the 'basic_english' model from torchtext
tokenizer = get_tokenizer('basic_english')

# Use the tokenizer to split the concatenated text into tokens (words)
all_tokens = tokenizer(all_text)

In [5]:
all_tokens[:10]

['chapter', 'one', 'the', 'boy', 'who', 'lived', 'mr', '.', 'and', 'mrs']

In [6]:
len(all_tokens)

1323714

### 2.2. Numericalization
This code snippet focuses on constructing a vocabulary from the tokenized text and managing special tokens. It performs the following steps:

1. **Vocabulary Construction**: Builds a vocabulary (`vocab_obj`) from the tokenized text (`all_tokens`), considering only tokens that appear at least 3 times (`min_freq=3`).
2. **Special Tokens**: Inserts special tokens `<unk>` (unknown) and `<eos>` (end of sentence) at the beginning of the vocabulary.
3. **Unknown Token Handling**: Sets the default index for unknown tokens to that of `<unk>`, ensuring any out-of-vocabulary token is mapped to `<unk>`.

This setup is crucial for NLP models, providing a robust framework for handling the text data, including rare or unknown tokens, and marking the end of sequences.


In [7]:
# Build a vocabulary from the list of tokens, considering only those that appear at least 3 times (min_freq=3)
vocab_obj = vocab.build_vocab_from_iterator([all_tokens], min_freq=3)

# Insert special tokens <unk> (unknown) and <eos> (end of sentence) into the vocabulary
vocab_obj.insert_token('<unk>', 0)  # <unk> token is inserted at index 0
vocab_obj.insert_token('<eos>', 1)  # <eos> token is inserted at index 1

# Set the default index for unknown tokens to the index of <unk>
# This means that any token not found in the vocabulary will be represented as <unk>
vocab_obj.set_default_index(vocab_obj['<unk>'])

In [8]:
vocab_obj.get_itos()[:10]

['<unk>', '<eos>', '.', ',', 'the', '”', 'to', 'and', 'of', 'a']

### 2.3. Train, Validataion, Test Data Split
This snippet of code is used to partition a tokenized text dataset into training, validation, and testing segments. It does so by:

1. **Defining Ratios**: Setting the respective proportions of the dataset to be used for training (70%), validation (15%), and testing (15%).
2. **Calculating Indices**: Determining the indices at which the dataset should be split based on the defined ratios:
    - `train_end_idx`: Marks the end of the training set and the beginning of the validation set.
    - `val_end_idx`: Marks the end of the validation set and the beginning of the testing set.

By doing so, the code ensures that the dataset is divided according to specified proportions, facilitating unbiased model training, fine-tuning, and evaluation.


In [9]:
# Define the ratio of the dataset to be used for training, validation, and testing
train_ratio = 0.7
val_ratio = 0.15
test_ratio = 0.15  # Note: It's good practice to ensure that train_ratio + val_ratio + test_ratio == 1

# Calculate the total number of tokens in the dataset
total_tokens = len(all_tokens)

# Calculate the index at which the training dataset ends and validation dataset begins
train_end_idx = int(total_tokens * train_ratio)

# Calculate the index at which the validation dataset ends (and testing dataset begins)
val_end_idx = train_end_idx + int(total_tokens * val_ratio)

This code segment divides the entire tokenized text data into three distinct datasets: training, validation, and testing. It performs the partitioning based on predefined indices (`train_end_idx`, `val_end_idx`) derived from specified ratios. The resulting datasets are:

- **Training Dataset (`train_tokens`)**: Contains the initial segment of the data, used for training the model.
- **Validation Dataset (`val_tokens`)**: Consists of the middle segment, used for tuning model parameters and preventing overfitting.
- **Testing Dataset (`test_tokens`)**: Comprises the final segment, used for evaluating the model's performance on unseen data.

This structured approach to data partitioning is fundamental for supervised learning, enabling the model to learn effectively, validate its learning, and finally, test its generalization capabilities.


In [10]:
# Slice the list of all tokens to get only the tokens for the training dataset
# This includes all tokens from the beginning up to the training end index
train_tokens = all_tokens[:train_end_idx]

# Slice the list of all tokens to get only the tokens for the validation dataset
# This includes tokens from the training end index up to the validation end index
val_tokens = all_tokens[train_end_idx:val_end_idx]

# Slice the list of all tokens to get only the tokens for the testing dataset
# This includes all tokens from the validation end index to the end of the list
test_tokens = all_tokens[val_end_idx:]

In [11]:
len(train_tokens), len(val_tokens), len(test_tokens)

(926599, 198557, 198558)

##### Dataset Token Distribution

The dataset is split into training, validation, and testing sets. Below is a table summarizing the distribution of tokens across these sets:

| Dataset       | Number of Tokens |
|---------------|------------------|
| Training      | 926,599          |
| Validation    | 198,557          |
| Test          | 198,558          |
| **Total**     | **1,323,714**    |

This distribution ensures a balanced approach to model training, validation, and testing, providing ample data for each phase of model development.


## 3. Prepare the batch loader

This function, `get_data`, prepares the tokenized text data for input into a neural network model. The steps are:

1. **Append Special Token**: Adds an `<eos>` (end of sentence) token to the end of the tokens list.
2. **Token to Index Conversion**: Converts each token into its corresponding index based on the provided vocabulary. It handles unknown tokens by mapping them to a default index (usually for `<unk>`).
3. **Tensor Conversion**: Transforms the list of indices into a PyTorch tensor, ensuring compatibility with PyTorch models.
4. **Batching**: Reshapes the data into batches, each of a specified size (`batch_size`), to facilitate efficient processing during model training or inference.

The result is a tensor suitable for feeding into batched neural network computations, aligning with the requirements of many deep learning frameworks.


In [12]:
def get_data(tokens, vocab, batch_size):
    # Append '<eos>' token at the end of tokens
    tokens.append('<eos>')

    # Convert tokens to indices based on the vocabulary
    indices = [vocab[token] if token in vocab else vocab.get_default_index() for token in tokens]

    # Convert list of indices to PyTorch tensor
    data = torch.LongTensor(indices)

    # Calculate the number of full batches
    num_batches = data.shape[0] // batch_size

    # Trim data to a whole number of batches
    data = data[:num_batches * batch_size]

    # Reshape data to [batch_size, num_batches]
    data = data.view(batch_size, num_batches)

    return data  # [batch size, seq len]

This code snippet demonstrates the usage of the `get_data` function to convert tokenized text data into a format suitable for model training, validation, and testing. It performs the following:

- **Batch Size Definition**: Sets the batch size, which determines how many data points (tokens) are processed together in one iteration.
- **Training Data Preparation**: Converts the `train_tokens` into a batched tensor format, creating an organized structure that facilitates efficient model training.
- **Validation Data Preparation**: Similarly processes `val_tokens` to prepare the validation data, which is crucial for tuning model parameters and preventing overfitting.
- **Testing Data Preparation**: Processes `test_tokens` to prepare the testing data, ensuring the model's performance is evaluated on unseen data.

The resulting data structures (`train_data`, `val_data`, `test_data`) are optimally prepared for feeding into a neural network in a batched manner, promoting computational efficiency and model performance.


In [13]:
# Define the batch size for creating batches of data
batch_size = 128  # Batch size can be adjusted based on the requirement or hardware capability

# Prepare the training data using the get_data function
# This converts the train_tokens into a tensor of indices and organizes it into batches
train_data = get_data(train_tokens, vocab_obj, batch_size)

# Similarly, prepare the validation data
# This converts the val_tokens into a tensor of indices and organizes it into batches
val_data = get_data(val_tokens, vocab_obj, batch_size)

# Finally, prepare the testing data
# This converts the test_tokens into a tensor of indices and organizes it into batches
test_data = get_data(test_tokens, vocab_obj, batch_size)


In [14]:
train_data.shape, val_data.shape, test_data.shape

(torch.Size([128, 7239]), torch.Size([128, 1551]), torch.Size([128, 1551]))

## 4. Modeling

<img src="figures/LM.png" width=600>

### LSTM Language Model

This `LSTMLanguageModel` class, a PyTorch `nn.Module`, defines a recurrent neural network for language modeling using LSTM (Long Short-Term Memory) cells. Key components and functionalities include:

1. **Layer Initialization**: Defines an embedding layer, an LSTM layer, a dropout layer, and a fully connected layer.
2. **Weight Initialization**: Initializes weights of the embedding and fully connected layers uniformly. Also, sets up LSTM weights for both input-hidden and hidden-hidden connections.
3. **Hidden State Management**: Provides functions to initialize and detach hidden states, ensuring proper management across training iterations.
4. **Forward Pass**: Outlines the data flow from input tokens through the embedding layer, LSTM cells, dropout, and finally the fully connected layer to produce predictions.


In [15]:
class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim, num_layers, dropout_rate):
        super().__init__()
        # Initialize the hyperparameters
        self.num_layers = num_layers
        self.hid_dim    = hid_dim
        self.emb_dim    = emb_dim

        # Define the layers
        self.embedding  = nn.Embedding(vocab_size, emb_dim)
        self.lstm       = nn.LSTM(emb_dim, hid_dim, num_layers=num_layers, dropout=dropout_rate, batch_first=True)
        self.dropout    = nn.Dropout(dropout_rate)
        self.fc         = nn.Linear(hid_dim, vocab_size)

        # Initialize weights
        self.init_weights()

    def init_weights(self):

        # Set the range for uniform distribution
        init_range_emb = 0.1
        init_range_other = 1/math.sqrt(self.hid_dim)
        self.embedding.weight.data.uniform_(-init_range_emb, init_range_other)
        self.fc.weight.data.uniform_(-init_range_other, init_range_other)
        self.fc.bias.data.zero_()

        # Initialize LSTM weights uniformly
        for i in range(self.num_layers):
            self.lstm.all_weights[i][0] = torch.FloatTensor(self.emb_dim,
                self.hid_dim).uniform_(-init_range_other, init_range_other) #We
            self.lstm.all_weights[i][1] = torch.FloatTensor(self.hid_dim,
                self.hid_dim).uniform_(-init_range_other, init_range_other) #Wh

    def init_hidden(self, batch_size, device):
        # Initialize hidden and cell states for LSTM
        hidden = torch.zeros(self.num_layers, batch_size, self.hid_dim).to(device)
        cell   = torch.zeros(self.num_layers, batch_size, self.hid_dim).to(device)
        return hidden, cell

    def detach_hidden(self, hidden):
        # Detach hidden states from the graph to prevent backpropagating through the entire history
        hidden, cell = hidden
        hidden = hidden.detach() #not to be used for gradient computation
        cell   = cell.detach()
        return hidden, cell

    def forward(self, src, hidden):
        # Embedding layer
        #src: [batch_size, seq len]
        embedding = self.dropout(self.embedding(src)) # Transform token IDs to embeddings
        #embedding: [batch-size, seq len, emb dim]
        # LSTM layer
        output, hidden = self.lstm(embedding, hidden) # Get output and new hidden state from LSTM
        #ouput: [batch size, seq len, hid dim]
        #hidden: [num_layers * direction, seq len, hid_dim]
        # Apply dropout to the output of LSTM
        output = self.dropout(output)
        # Fully connected layer
        prediction =self.fc(output) # Transform LSTM output to prediction scores for each token in the vocabulary
        #prediction: [batch_size, seq_len, vocab_size]
        return prediction, hidden # Return prediction scores and hidden states

## 5. Training

Follows very basic procedure.  One note is that some of the sequences that will be fed to the model may involve parts from different sequences in the original dataset or be a subset of one (depending on the decoding length). For this reason we will reset the hidden state every epoch, this is like assuming that the next batch of sequences is probably always a follow up on the previous in the original dataset.

This code snippet initializes the LSTM language model along with its hyperparameters, optimizer, and loss function:

- **Model Parameters**: Sets up the model with specific dimensions for the embedding and hidden layers, the number of LSTM layers, and the dropout rate. These parameters are adjustable and are currently based on values either chosen empirically or referenced from a specific paper.
- **Model Instantiation**: Creates an instance of the `LSTMLanguageModel` class on the specified `device` (GPU or CPU).
- **Optimizer**: Utilizes the Adam optimizer with a predefined learning rate for model training.
- **Loss Function**: Adopts the Cross-Entropy Loss, suitable for multi-class classification tasks common in language modeling.
- **Trainable Parameters**: Calculates and prints the total number of trainable parameters in the model, providing insight into the model's complexity and computational demands.




In [16]:
# Set the vocabulary size to the size of the vocabulary object
vocab_size = len(vocab_obj)

# Define the dimensions for the embedding and hidden layers, number of LSTM layers, dropout rate, and learning rate
emb_dim = 1024  # Embedding dimension, set to 1024 (400 in the referenced paper)
hid_dim = 1024  # Hidden dimension, set to 1024 (1150 in the referenced paper)
num_layers = 2  # Number of LSTM layers, set to 2 (3 in the referenced paper)
dropout_rate = 0.65  # Dropout rate
lr = 1e-3  # Learning rate

# Initialize the LSTM language model with the specified parameters
model = LSTMLanguageModel(vocab_size, emb_dim, hid_dim, num_layers, dropout_rate).to(device)

# Define the optimizer (Adam) with the learning rate
optimizer = optim.Adam(model.parameters(), lr=lr)

# Define the loss function (CrossEntropyLoss) for multi-class classification
criterion = nn.CrossEntropyLoss()

# Calculate the number of trainable parameters in the model
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

# Print the total number of trainable parameters
print(f'The model has {num_params:,} trainable parameters')

The model has 43,705,166 trainable parameters


The `get_batch` function is designed to prepare batches of data for training the language model. It performs the following steps:

1. **Source Data Extraction**: Retrieves a sequence of tokens (`src`) of length `seq_len` from the data tensor, starting at position `idx`.
2. **Target Data Extraction**: Retrieves the target sequence (`target`), which is the `src` sequence shifted by one position. This offset allows the model to predict the next token in the sequence.



In [17]:
def get_batch(data, seq_len, idx):
    # Extract a sequence of length 'seq_len' from the data starting from 'idx' for source (input)
    src = data[:, idx:idx + seq_len]

    # Extract the target sequence which is offset by one from the source sequence
    # The target for language modeling is typically the next token in the sequence
    target = data[:, idx + 1:idx + seq_len + 1]  # Target is shifted by one token to predict the next token

    return src, target

The `train` function orchestrates the training process of the LSTM Language Model for one epoch. It includes:

1. **Batch Preparation**: Adjusts the data to ensure each batch is complete and a multiple of `seq_len`.
2. **Hidden State Management**: Initializes and detaches hidden states to manage memory and computational graph efficiently.
3. **Training Loop**: Iterates over the data, fetching batches, and performing forward and backward passes.
    - **Batch Fetching**: Retrieves source and target sequences.
    - **Model Forward Pass**: Computes predictions based on source sequences and hidden states.
    - **Loss Calculation**: Uses the criterion to compute loss between predictions and targets.
    - **Backpropagation**: Computes gradients and updates model parameters using the optimizer.
    - **Gradient Clipping**: Prevents exploding gradients by clipping the gradients to a maximum value (`clip`).
4. **Loss Tracking**: Accumulates and returns the average loss over the epoch, providing insight into model performance.


In [None]:
def train(model, data, optimizer, criterion, batch_size, seq_len, clip, device):

    epoch_loss = 0
    model.train()
    # drop all batches that are not a multiple of seq_len
    # data #[batch size, seq len]
    num_batches = data.shape[-1]
    # print(data.shape)
    data = data[:, :num_batches - (num_batches -1) % seq_len]  #we need to -1 because we start at 0
    # print(data.shape)
    num_batches = data.shape[-1]

    #reset the hidden every epoch
    hidden = model.init_hidden(batch_size, device)

    for idx in tqdm(range(0, num_batches - 1, seq_len), desc='Training: ',leave=False):
        optimizer.zero_grad()

        #hidden does not need to be in the computational graph for efficiency
        hidden = model.detach_hidden(hidden)

        src, target = get_batch(data, seq_len, idx) #src, target: [batch size, seq len]
        src, target = src.to(device), target.to(device)
        batch_size = src.shape[0]
        prediction, hidden = model(src, hidden)

        #need to reshape because criterion expects pred to be 2d and target to be 1d
        prediction = prediction.reshape(batch_size * seq_len, -1)  #prediction: [batch size * seq len, vocab size]
        target = target.reshape(-1)
        loss = criterion(prediction, target)

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item() * seq_len
    return epoch_loss / num_batches

The `evaluate` function is designed to assess the LSTM Language Model's performance on validation or test data. Its key steps include:

1. **Model Setup**: Sets the model to evaluation mode and prepares the data in batches.
2. **Loss Computation**: Iterates over the data without calculating gradients (`torch.no_grad()`) to optimize memory and computation during evaluation.
3. **Batch Processing**: For each batch, it fetches source and target sequences, computes the model's predictions, and calculates the loss using the provided criterion.
4. **Result Aggregation**: Accumulates the loss over all batches and returns the average loss per batch.

This function provides a mechanism to evaluate the model's performance without impacting its parameters, ensuring an unbiased assessment of its ability to generalize to new data.


In [19]:
def evaluate(model, data, criterion, batch_size, seq_len, device):
    epoch_loss = 0  # Initialize loss
    model.eval()  # Set the model to evaluation mode

    # Adjust data to ensure each batch is a multiple of seq_len
    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches - 1) % seq_len]
    num_batches = data.shape[-1]

    # Initialize hidden state of LSTM
    hidden = model.init_hidden(batch_size, device)

    # Disable gradient calculation for evaluation to save memory and computation
    with torch.no_grad():
        # Loop through the data in steps of seq_len
        for idx in range(0, num_batches - 1, seq_len):
            # Detach hidden states from the computational graph
            hidden = model.detach_hidden(hidden)

            # Get the batch of data
            src, target = get_batch(data, seq_len, idx)
            src, target = src.to(device), target.to(device)
            batch_size = src.shape[0]

            # Forward pass through the model
            prediction, hidden = model(src, hidden)

            # Reshape prediction and target to fit the criterion format
            prediction = prediction.reshape(batch_size * seq_len, -1)
            target = target.reshape(-1)

            # Compute the loss
            loss = criterion(prediction, target)
            # Accumulate loss
            epoch_loss += loss.item() * seq_len

    # Return average loss per batch
    return epoch_loss / num_batches

Here we will be using a `ReduceLROnPlateau` learning scheduler which decreases the learning rate by a factor, if the loss don't improve by a certain epoch.

This code snippet orchestrates the complete training and evaluation process of the LSTM Language Model for a predefined number of epochs. Key steps include:

1. **Training and Evaluation Loop**: For each epoch, the model is trained on the training data and then evaluated on the validation data.
2. **Learning Rate Adjustment**: Uses a learning rate scheduler to reduce the learning rate if the validation loss doesn't improve, aiding in model convergence.
3. **Model Saving**: Stores the state of the model with the lowest validation loss, ensuring the best model is retained.
4. **Performance Metrics**: Calculates and logs the perplexity for both training and validation sets for each epoch, offering insight into the model's performance.
5. **Time Tracking**: Measures and logs the time taken for each epoch and the total training time, helping in evaluating the efficiency of the training process.

This structured approach ensures systematic training, performance monitoring, and model optimization, leading to a robust and well-performing language model.


In [None]:
# Define the number of epochs, sequence length for training, and the gradient clipping value
n_epochs = 50
seq_len = 50  # Decoding length
clip = 0.25

# Initialize a learning rate scheduler to reduce the learning rate based on validation loss
lr_scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=0)

# Initialize the best validation loss to a high number
best_valid_loss = float('inf')

# Start measuring the total time for the training process
total_start_time = time.time()

# Start the training process
for epoch in range(n_epochs):
    start_time = time.time()  # Start time for the current epoch

    # Train the model for one epoch and receive the training loss
    train_loss = train(model, train_data, optimizer, criterion, batch_size, seq_len, clip, device)

    # Evaluate the model on the validation data and receive the validation loss
    valid_loss = evaluate(model, val_data, criterion, batch_size, seq_len, device)

    # Update the learning rate based on the validation loss
    lr_scheduler.step(valid_loss)

    # Save the model if the validation loss is the best we've seen so far
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), './app/models/best-val-lstm_lm.pt')

    # Calculate and print the time taken for the epoch
    end_time = time.time()
    epoch_mins, epoch_secs = divmod(end_time - start_time, 60)
    print(f'Epoch: {epoch+1:02} | Epoch Time: {int(epoch_mins)}m {int(epoch_secs)}s')
    # Calculate and print the perplexity for the training and validation sets
    print(f'\tTrain Perplexity: {math.exp(train_loss):.3f}')
    print(f'\tValid Perplexity: {math.exp(valid_loss):.3f}')

# End measuring the total time for the training process
total_end_time = time.time()
# Calculate and print the total time taken for training
total_mins, total_secs = divmod(total_end_time - total_start_time, 60)
print(f'Total Time: {int(total_mins)}m {int(total_secs)}s')



Epoch: 01 | Epoch Time: 0m 50s
	Train Perplexity: 619.018
	Valid Perplexity: 418.736




Epoch: 02 | Epoch Time: 0m 52s
	Train Perplexity: 286.095
	Valid Perplexity: 204.392




Epoch: 03 | Epoch Time: 0m 57s
	Train Perplexity: 168.001
	Valid Perplexity: 148.258




Epoch: 04 | Epoch Time: 0m 59s
	Train Perplexity: 130.812
	Valid Perplexity: 127.548




Epoch: 05 | Epoch Time: 0m 58s
	Train Perplexity: 112.063
	Valid Perplexity: 115.571




Epoch: 06 | Epoch Time: 0m 58s
	Train Perplexity: 100.080
	Valid Perplexity: 108.630




Epoch: 07 | Epoch Time: 0m 58s
	Train Perplexity: 91.434
	Valid Perplexity: 104.062




Epoch: 08 | Epoch Time: 0m 58s
	Train Perplexity: 84.656
	Valid Perplexity: 100.506




Epoch: 09 | Epoch Time: 0m 58s
	Train Perplexity: 79.161
	Valid Perplexity: 97.807




Epoch: 10 | Epoch Time: 0m 58s
	Train Perplexity: 74.636
	Valid Perplexity: 96.321




Epoch: 11 | Epoch Time: 0m 58s
	Train Perplexity: 70.607
	Valid Perplexity: 95.552




Epoch: 12 | Epoch Time: 0m 58s
	Train Perplexity: 67.087
	Valid Perplexity: 93.911




Epoch: 13 | Epoch Time: 0m 58s
	Train Perplexity: 64.236
	Valid Perplexity: 92.922




Epoch: 14 | Epoch Time: 0m 58s
	Train Perplexity: 61.586
	Valid Perplexity: 91.568




Epoch: 15 | Epoch Time: 0m 58s
	Train Perplexity: 59.156
	Valid Perplexity: 90.906




Epoch: 16 | Epoch Time: 0m 57s
	Train Perplexity: 56.918
	Valid Perplexity: 91.088




Epoch: 17 | Epoch Time: 0m 58s
	Train Perplexity: 53.641
	Valid Perplexity: 90.134




Epoch: 18 | Epoch Time: 0m 58s
	Train Perplexity: 52.074
	Valid Perplexity: 89.870




Epoch: 19 | Epoch Time: 0m 57s
	Train Perplexity: 50.974
	Valid Perplexity: 89.875




Epoch: 20 | Epoch Time: 0m 58s
	Train Perplexity: 49.541
	Valid Perplexity: 89.126




Epoch: 21 | Epoch Time: 0m 58s
	Train Perplexity: 48.779
	Valid Perplexity: 89.052




Epoch: 22 | Epoch Time: 0m 58s
	Train Perplexity: 48.144
	Valid Perplexity: 88.993




Epoch: 23 | Epoch Time: 0m 58s
	Train Perplexity: 47.650
	Valid Perplexity: 88.922




Epoch: 24 | Epoch Time: 0m 57s
	Train Perplexity: 47.042
	Valid Perplexity: 89.083




Epoch: 25 | Epoch Time: 0m 58s
	Train Perplexity: 46.432
	Valid Perplexity: 88.725




Epoch: 26 | Epoch Time: 0m 58s
	Train Perplexity: 46.040
	Valid Perplexity: 88.701




Epoch: 27 | Epoch Time: 0m 58s
	Train Perplexity: 45.597
	Valid Perplexity: 88.520




Epoch: 28 | Epoch Time: 0m 57s
	Train Perplexity: 45.442
	Valid Perplexity: 88.526




Epoch: 29 | Epoch Time: 0m 58s
	Train Perplexity: 45.168
	Valid Perplexity: 88.500




Epoch: 30 | Epoch Time: 0m 58s
	Train Perplexity: 45.028
	Valid Perplexity: 88.476




Epoch: 31 | Epoch Time: 0m 57s
	Train Perplexity: 45.008
	Valid Perplexity: 88.486




Epoch: 32 | Epoch Time: 0m 57s
	Train Perplexity: 44.845
	Valid Perplexity: 88.497




Epoch: 33 | Epoch Time: 0m 57s
	Train Perplexity: 44.855
	Valid Perplexity: 88.513




Epoch: 34 | Epoch Time: 0m 57s
	Train Perplexity: 44.808
	Valid Perplexity: 88.514




Epoch: 35 | Epoch Time: 0m 57s
	Train Perplexity: 44.843
	Valid Perplexity: 88.516




Epoch: 36 | Epoch Time: 0m 57s
	Train Perplexity: 44.857
	Valid Perplexity: 88.518




Epoch: 37 | Epoch Time: 0m 57s
	Train Perplexity: 44.750
	Valid Perplexity: 88.517




Epoch: 38 | Epoch Time: 0m 57s
	Train Perplexity: 44.798
	Valid Perplexity: 88.517




Epoch: 39 | Epoch Time: 0m 57s
	Train Perplexity: 44.763
	Valid Perplexity: 88.517




Epoch: 40 | Epoch Time: 0m 57s
	Train Perplexity: 44.804
	Valid Perplexity: 88.517




Epoch: 41 | Epoch Time: 0m 57s
	Train Perplexity: 44.805
	Valid Perplexity: 88.518




Epoch: 42 | Epoch Time: 0m 57s
	Train Perplexity: 44.838
	Valid Perplexity: 88.518




Epoch: 43 | Epoch Time: 0m 57s
	Train Perplexity: 44.836
	Valid Perplexity: 88.518




Epoch: 44 | Epoch Time: 0m 57s
	Train Perplexity: 44.848
	Valid Perplexity: 88.518




Epoch: 45 | Epoch Time: 0m 57s
	Train Perplexity: 44.788
	Valid Perplexity: 88.518




Epoch: 46 | Epoch Time: 0m 57s
	Train Perplexity: 44.792
	Valid Perplexity: 88.518




Epoch: 47 | Epoch Time: 0m 57s
	Train Perplexity: 44.828
	Valid Perplexity: 88.518




Epoch: 48 | Epoch Time: 0m 57s
	Train Perplexity: 44.845
	Valid Perplexity: 88.518




Epoch: 49 | Epoch Time: 0m 57s
	Train Perplexity: 44.800
	Valid Perplexity: 88.518




Epoch: 50 | Epoch Time: 0m 57s
	Train Perplexity: 44.859
	Valid Perplexity: 88.518
Total Time: 48m 12s


## 6. Testing

The best performing LSTM language model, determined based on the validation loss, was evaluated on the test dataset. The evaluation involves:

- **Model Loading**: The model state with the lowest validation loss is loaded, ensuring that the best performing model is used for evaluation.
- **Testing**: The model is evaluated on the test data using the `evaluate` function, which computes the loss on this unseen data.
- **Perplexity Calculation**: The perplexity, a measure of how well the model predicts the next word, is computed based on the test loss.

The final output, Test Perplexity, quantifies the model's performance on unseen data, providing an indication of how well the model can generalize and predict new sequences.



In [None]:
# Load the best model state (with the lowest validation loss) from the saved file
model.load_state_dict(torch.load('./app/models/best-val-lstm_lm.pt', map_location=device))

# Evaluate the model on the test data to get the test loss
test_loss = evaluate(model, test_data, criterion, batch_size, seq_len, device)

# Calculate and print the perplexity for the test set
# Perplexity is a common metric in language modeling, representing how well the model predicts a sample
print(f'Test Perplexity: {math.exp(test_loss):.3f}')

Test Perplexity: 106.331


#### Model Performance Report

This report outlines the training performance of a language model over 50 epochs, with a focus on minimizing Train and Valid Perplexity, and provides an assessment based on the Test Perplexity.

##### Results
- **Total Training Time:** 48 minutes and 12 seconds.
- **Optimal Epoch:** 30
  - **Train Perplexity:** 45.028
  - **Valid Perplexity:** 88.476
  - **Test Perplexity:** 106.331

##### Analysis
The model consistently improved across epochs, with Epoch 30 providing the best balance in performance. While the Test Perplexity is higher than the Validation Perplexity, it remains relatively close, indicating that the model generalizes well to unseen data. However, the gap between the Test and Validation Perplexity suggests room for further improvement.

##### Recommendations for Further Improvement
1. **Early Stopping:** Implement early stopping to prevent overfitting and unnecessary computation, especially as Validation Perplexity plateaus.
2. **Hyperparameter Tuning:** Experiment with different sets of hyperparameters to find a more optimal model configuration.
3. **Extended Dataset:** Incorporate more diverse or extensive training data to improve the model's learning capability and generalization.
4. **Advanced Architectures:** Explore more sophisticated model architectures that might capture the complexities of the dataset more effectively.



## 7. Saving Vocabulary and Configuration for Future Use

To ensure consistency and reproducibility in future text processing and model training, the model's vocabulary and configuration settings are serialized and saved.

#### Vocabulary Storage:
- **String-to-Index (stoi)**: The vocabulary's token-to-index mapping is serialized and saved as `vocab.json`. This facilitates consistent text processing by providing a standardized way to convert tokens to indices.

#### Model Configuration Storage:
- **Configuration Parameters**: The model's parameters, including sequence length, batch size, embedding dimension, hidden dimension, number of LSTM layers, and dropout rate, are stored in a configuration dictionary.
- **Configuration Serialization**: This configuration is serialized and saved as `config.json`, ensuring that the model's architectural and operational settings can be easily accessed and replicated in the future.

By storing the vocabulary and configuration, the setup can be consistently replicated or adjusted, promoting transparency and flexibility in model management and experimentation.


In [None]:
# Convert index-to-string (itos) vocabulary to string-to-index (stoi) for easier lookup
itos = vocab_obj.get_itos()  # Retrieve the list of tokens (index to string mapping)
stoi = {token: idx for idx, token in enumerate(itos)}  # Create the reverse mapping (string to index)

# Save the stoi dictionary to a JSON file for future use
with open('./app/models/vocab.json', 'w') as f:
    json.dump(stoi, f)

# Prepare a configuration dictionary with model and training parameters
config = {
    'seq_len': seq_len,  # Sequence length
    'batch_size': batch_size,  # Batch size
    'emb_dim': emb_dim,  # Embedding dimension
    'hid_dim': hid_dim,  # Hidden dimension
    'num_layers': num_layers,  # Number of LSTM layers
    'dropout_rate': dropout_rate  # Dropout rate
}

# Save the configuration dictionary to a JSON file for reproducibility and future reference
with open('./app/models/config.json', 'w') as f:
    json.dump(config, f)

In [25]:
vocab_obj['still'], stoi['still'] #Check equivalency of vocab_pbj and stoi dictionary 

(92, 92)

## 8. Real-world inference

Here we take the prompt, tokenize, encode and feed it into the model to get the predictions.  We then apply softmax while specifying that we want the output due to the last word in the sequence which represents the prediction for the next word.  We divide the logits by a temperature value to alter the model’s confidence by adjusting the softmax probability distribution.

Once we have the Softmax distribution, we randomly sample it to make our prediction on the next word. If we get <unk> then we give that another try.  Once we get <eos> we stop predicting.
    
We decode the prediction back to strings last lines.

The `generate` function performs text generation based on a given prompt using the trained LSTM language model. Key steps include:

- **Initialization**: Sets the seed for reproducibility, tokenizes the prompt, and initializes the hidden state.
- **Generation Loop**: Iterates up to `max_seq_len`:
  - **Model Prediction**: Generates the next word based on the current sequence of indices.
  - **Temperature Scaling**: Applies temperature scaling to control the randomness in word selection.
  - **Word Selection**: Samples the next word from the probability distribution, handling special tokens like `<unk>` and `<eos>`.
- **Token Conversion**: Converts the generated indices back to tokens, forming the generated text.

This function allows for controlled, auto-regressive generation of text, creating coherent and contextually relevant extensions to the input prompt.


In [30]:

def generate(prompt, max_seq_len, temperature, model, tokenizer, vocab, device, seed=None):
    if seed is not None:
        torch.manual_seed(seed)
    model.eval()
    tokens = tokenizer(prompt)
    indices = [vocab[t] for t in tokens]
    batch_size = 1
    hidden = model.init_hidden(batch_size, device)
    with torch.no_grad():
        for i in range(max_seq_len):
            src = torch.LongTensor([indices]).to(device)
            prediction, hidden = model(src, hidden)

            #prediction: [batch size, seq len, vocab size]
            #prediction[:, -1]: [batch size, vocab size] #probability of last vocab

            probs = torch.softmax(prediction[:, -1] / temperature, dim=-1)
            prediction = torch.multinomial(probs, num_samples=1).item()

            while prediction == vocab['<unk>']: #if it is unk, we sample again
                prediction = torch.multinomial(probs, num_samples=1).item()

            if prediction == vocab['<eos>']:    #if it is eos, we stop
                break

            indices.append(prediction) #autoregressive, thus output becomes input

    itos = vocab.get_itos()
    tokens = [itos[i] for i in indices]
    return tokens

### Generation Results:
Each temperature setting produces a unique extension to the prompt, demonstrating the model's ability to generate contextually relevant text.
- Lower temperatures tend to result in less diverse but more coherent text,
- while higher temperatures increase diversity at the cost of coherence.

The generated text for each temperature setting is shown below, illustrating the trade-offs and capabilities of the model in generating text under different levels of randomness.

In [None]:
# prompt = 'Write a story of war'
prompt = 'No one loves Harry'
max_seq_len = 30
seed = 0

#smaller the temperature, more diverse tokens but comes

#with a tradeoff of less-make-sense sentence
temperatures = [0.5, 0.7, 0.75, 0.8, 1.0]

for temperature in temperatures:
    generation = generate(prompt, max_seq_len, temperature, model, tokenizer,
                          stoi, device, seed)
    print(str(temperature)+'\n'+' '.join(generation)+'\n')

0.5
no one loves harry , who had been working to the potters . when harry had found it , he had been trying to get out of the hospital wing , but the rest

0.7
no one loves harry , who had put the invisibility cloak over off his feet . harry had never been able to get past the snitch . even he thought he would have thought

0.75
no one loves harry , who had put the invisibility cloak over off his feet . harry had never been able to stare at the ministry of magic who had been given the rest

0.8
no one loves harry , who had put the invisibility cloak over off his feet . harry had never been able to stare at the ministry of magic who had been given the rest

1.0
no one loves harry , however , and the other thing was off a quibbler when they had been in the air with statues of hogwarts . even who was coming to ward when




The model's output varies with temperature, affecting coherence and creativity. Lower temperatures tend to produce more predictable and coherent text, while higher temperatures yield more diverse and creative outputs but risk losing coherence. The choice of temperature should align with the desired balance between creativity and coherence in the generated text.
