In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchtext.datasets import PennTreebank
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
import math

# Introduction to Sequence Modeling

Sequence modeling is a core task in natural language processing, speech recognition, and time series analysis. It involves predicting or generating the elements of a sequence from the elements that came before them. In this notebook, we will explore Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, two architectures designed for sequential data.

## Importance of Sequence Modeling

Sequence modeling enables us to:
- Predict the next element in a sequence (e.g., predicting the next word in a sentence)
- Generate new sequences (e.g., generating text or music)
- Classify sequences (e.g., sentiment analysis of text)
- Translate sequences (e.g., machine translation between languages)

Understanding and applying sequence modeling techniques is essential for building intelligent systems that can process and generate sequential data effectively.

## Recurrent Neural Networks (RNNs)

RNNs are a class of neural networks designed to handle sequential data. They maintain a hidden state that summarizes information from previous time steps, which lets them model dependencies and patterns across a sequence.

### RNN Architecture

The basic architecture of an RNN consists of:
- Input layer: Receives the input at each time step
- Hidden layer: Maintains a hidden state that captures information from previous time steps
- Output layer: Produces the output at each time step

The hidden state is updated at each time step based on the current input and the previous hidden state. This allows RNNs to maintain a memory of past information and use it to make predictions or generate outputs.
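The update rule described above can be sketched in a few lines. This is a minimal illustration of a basic (Elman-style) RNN step, `h_t = tanh(x_t W_xh + h_{t-1} W_hh + b_h)`. The function name `rnn_step`, the weight names, and the toy sizes are assumptions made for illustration and are not part of the model built later in this notebook.

```python
import torch

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step of a basic RNN: mix the current input with the previous hidden state."""
    return torch.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

torch.manual_seed(0)
input_size, hidden_size, seq_len = 4, 3, 5

# Hypothetical toy parameters (a real model would learn these)
W_xh = torch.randn(input_size, hidden_size) * 0.1
W_hh = torch.randn(hidden_size, hidden_size) * 0.1
b_h = torch.zeros(hidden_size)

xs = torch.randn(seq_len, input_size)  # toy input sequence
h = torch.zeros(hidden_size)           # initial hidden state

hidden_states = []
for x_t in xs:
    # The same weights are reused at every time step; only h changes
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
    hidden_states.append(h)

hidden_states = torch.stack(hidden_states)
print(hidden_states.shape)  # torch.Size([5, 3])
```

Each row of `hidden_states` depends on every earlier input through the chain of `h` updates, which is what gives the network its memory.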

### Challenges with RNNs

Despite their ability to handle sequential data, RNNs suffer from the vanishing and exploding gradient problems. These problems arise when training RNNs on long sequences, as the gradients can become extremely small (vanishing) or large (exploding) during backpropagation. This makes it difficult for RNNs to capture long-term dependencies effectively.
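To see why gradients shrink or blow up, consider a deliberately simplified recurrence, `h_t = w * h_{t-1}`, with a single scalar weight. The gradient of the final state with respect to the initial state is then `w` multiplied by itself once per time step. The sketch below is an assumption-laden toy (scalar state, no nonlinearity, hypothetical helper name `gradient_through_time`), not a full RNN, but it shows the same multiplicative effect.

```python
import torch

def gradient_through_time(w, steps):
    """Gradient of h_T with respect to h_0 for the toy recurrence h_t = w * h_{t-1}."""
    h0 = torch.tensor(1.0, requires_grad=True)
    h = h0
    for _ in range(steps):
        h = w * h
    h.backward()
    return h0.grad.item()

for w in (0.5, 1.0, 1.5):
    grad = gradient_through_time(w, 50)
    print(f"w={w}: gradient after 50 steps = {grad:.3e}")
```

With `w` below 1 the gradient collapses toward zero, and with `w` above 1 it grows very large. In a real RNN the scalar is replaced by a product of Jacobian matrices, but the behaviour over long sequences is similar.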

## Long Short-Term Memory (LSTM) Networks

LSTMs are a type of RNN architecture designed to address the limitations of traditional RNNs. They introduce a memory cell and gating mechanisms to regulate the flow of information, enabling them to capture long-term dependencies more effectively.

### LSTM Architecture

The key components of an LSTM cell are:
- Input gate: Controls the flow of new information into the memory cell
- Forget gate: Determines what information to discard from the memory cell
- Output gate: Controls the output of the memory cell
- Memory cell: Stores the long-term information

The gating mechanisms allow LSTMs to selectively update, forget, and output information, enabling them to capture complex patterns and dependencies in sequences.
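The gates listed above can be made concrete by recomputing one step of PyTorch's `nn.LSTMCell` by hand. This is a sketch for illustration: the helper name `lstm_step` and the toy sizes are assumptions, and the code relies on PyTorch storing the gate weights in the order input, forget, candidate, output.

```python
import torch
import torch.nn as nn

def lstm_step(x, h_prev, c_prev, cell):
    """Recompute one nn.LSTMCell step manually to expose the individual gates."""
    gates = (x @ cell.weight_ih.T + cell.bias_ih
             + h_prev @ cell.weight_hh.T + cell.bias_hh)
    i, f, g, o = gates.chunk(4, dim=1)
    i = torch.sigmoid(i)  # input gate: how much new information to write
    f = torch.sigmoid(f)  # forget gate: how much of the old cell state to keep
    g = torch.tanh(g)     # candidate values to write into the cell
    o = torch.sigmoid(o)  # output gate: how much of the cell to expose
    c = f * c_prev + i * g     # updated memory cell
    h = o * torch.tanh(c)      # updated hidden state
    return h, c

torch.manual_seed(0)
cell = nn.LSTMCell(input_size=4, hidden_size=3)
x = torch.randn(2, 4)
h0 = torch.randn(2, 3)
c0 = torch.randn(2, 3)

with torch.no_grad():
    h_manual, c_manual = lstm_step(x, h0, c0, cell)
    h_ref, c_ref = cell(x, (h0, c0))

print(torch.allclose(h_manual, h_ref, atol=1e-5),
      torch.allclose(c_manual, c_ref, atol=1e-5))
```

Note that the cell state is updated additively (`f * c_prev + i * g`), which is the main reason gradients can survive over more time steps than in a basic RNN.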

### Advantages of LSTMs

LSTMs have several advantages over traditional RNNs:
- Ability to capture long-term dependencies
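- Mitigation of the vanishing gradient problem (exploding gradients can still occur and are usually handled with gradient clipping, as in the training loop below)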
- Improved performance on tasks requiring long-range context

LSTMs have been widely adopted and have shown remarkable success in various sequence modeling tasks, such as language modeling, sentiment analysis, and speech recognition.

Now, let's dive into the implementation of RNNs and LSTMs using PyTorch and explore their application to a real-world dataset.

# Preparing the Penn Treebank Dataset

To demonstrate the application of RNNs and LSTMs, we will use the Penn Treebank dataset, which is a widely used dataset for language modeling tasks. The dataset consists of text from Wall Street Journal articles and is commonly used to evaluate the performance of language models.

We will preprocess the dataset by tokenizing the text and building a vocabulary from the training data. This will allow us to convert the text into numerical representations that can be fed into our models.
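Before using the torchtext utilities, it may help to see the idea in plain Python. The sketch below uses a toy corpus and a whitespace tokenizer. The helper names `tokenize`, `build_vocab`, and `encode` are hypothetical and only approximate what `get_tokenizer` and `build_vocab_from_iterator` do.

```python
from collections import Counter

def tokenize(line):
    """Very simple lowercase whitespace tokenizer (a stand-in for 'basic_english')."""
    return line.lower().split()

def build_vocab(lines, specials=("<unk>",)):
    """Map each token to an integer id; special tokens get the lowest ids."""
    counts = Counter(tok for line in lines for tok in tokenize(line))
    itos = list(specials) + [tok for tok, _ in counts.most_common() if tok not in specials]
    return {tok: idx for idx, tok in enumerate(itos)}

def encode(line, stoi):
    """Convert a line of text to token ids, mapping unknown words to <unk>."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokenize(line)]

corpus = ["the cat sat", "the dog sat down"]
stoi = build_vocab(corpus)
print(encode("the bird sat", stoi))  # [1, 0, 2]
```

The word "bird" never appears in the toy corpus, so it is mapped to id 0, the `<unk>` token. This is the same role that `vocab.set_default_index` plays in the cell below.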


In [None]:
# Tokenization
tokenizer = get_tokenizer('basic_english')

# Load the Penn Treebank dataset
train_iter = PennTreebank(split='train')

# Build the vocabulary
vocab = build_vocab_from_iterator(map(tokenizer, train_iter), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])

In [None]:
# Preprocess the data
def data_process(raw_text_iter):
    """
    Preprocesses the raw text data by tokenizing and converting to numerical representations.
    """
    data = [torch.tensor(vocab(tokenizer(item)), dtype=torch.long) for item in raw_text_iter]
    return torch.cat(tuple(filter(lambda t: t.numel() > 0, data)))

In [None]:
# Load and preprocess the train, validation, and test sets
train_iter, val_iter, test_iter = PennTreebank()
train_data = data_process(train_iter)
val_data = data_process(val_iter)
test_data = data_process(test_iter)

## Implementing the Language Model

Now that we have preprocessed the dataset, let's implement a language model using an LSTM network. The language model will learn to predict the next word in a sequence based on the previous words.

The model architecture consists of:
- Embedding layer: Converts the input words into dense vector representations
- LSTM layer: Processes the sequence of word embeddings and captures the long-term dependencies
- Linear layer: Transforms the LSTM outputs into probability distributions over the vocabulary

We will define the model using PyTorch's `nn.Module` class and specify the forward pass of the model.

In [None]:
class LanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super(LanguageModel, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, h):
        """
        Forward pass of the language model.

        Args:
            x: Input tensor of shape (batch_size, sequence_length)
            h: Tuple (h_0, c_0) of initial hidden and cell state tensors, each of shape (num_layers, batch_size, hidden_size)

        Returns:
            out: Output tensor of shape (batch_size * sequence_length, vocab_size)
            (h, c): Tuple of hidden state and cell state tensors of shape (num_layers, batch_size, hidden_size)
        """
        x = self.embed(x)
        out, (h, c) = self.lstm(x, h)
        out = out.reshape(out.size(0) * out.size(1), out.size(2))
        out = self.linear(out)
        return out, (h, c)
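To make the tensor shapes in the forward pass concrete, the sketch below pushes a random batch through the same three layer types with small, made-up sizes. The sizes and variable names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Toy sizes, chosen only to keep the shapes easy to read
vocab_size, embed_size, hidden_size, num_layers = 50, 8, 16, 2
batch_size, seq_length = 4, 5

embed = nn.Embedding(vocab_size, embed_size)
lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
linear = nn.Linear(hidden_size, vocab_size)

tokens = torch.randint(0, vocab_size, (batch_size, seq_length))
h0 = torch.zeros(num_layers, batch_size, hidden_size)
c0 = torch.zeros(num_layers, batch_size, hidden_size)

embedded = embed(tokens)                             # (batch, seq, embed_size)
lstm_out, (h_n, c_n) = lstm(embedded, (h0, c0))      # (batch, seq, hidden_size)
logits = linear(lstm_out.reshape(-1, hidden_size))   # (batch * seq, vocab_size)

print(embedded.shape, lstm_out.shape, logits.shape, h_n.shape)
```

The linear layer produces unnormalized scores (logits) rather than probabilities; `nn.CrossEntropyLoss` applies the softmax internally during training.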

## Training the Language Model

With the language model implemented, let's define the training loop. We will use the cross-entropy loss as the criterion and the Adam optimizer to update the model parameters.

The training loop involves:
1. Initializing the hidden and cell states
2. Iterating over the training data in batches
3. Performing forward pass to get the model outputs
4. Computing the loss between the predicted outputs and the target words
5. Backpropagating the gradients and updating the model parameters
6. Printing the loss of the last batch in each epoch to monitor the training progress

In [None]:
def train_model(model, data, epochs, batch_size, seq_length, lr, clip):
    """
    Trains the language model on the given data.

    Args:
        model: Language model to be trained
        data: Training data tensor
        epochs: Number of training epochs
        batch_size: Batch size for training
        seq_length: Sequence length for truncated backpropagation through time
        lr: Learning rate for the optimizer
        clip: Gradient clipping threshold
    """
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)

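    # Read the LSTM dimensions from the model instead of relying on global variables
    num_layers = model.lstm.num_layers
    hidden_size = model.lstm.hidden_size

    model.train()  # make sure the model is in training mode
    for epoch in range(epochs):
        states = (torch.zeros(num_layers, batch_size, hidden_size),
                  torch.zeros(num_layers, batch_size, hidden_size))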

        for i in range(0, data.size(0) - seq_length, seq_length * batch_size):
            # Reserve one token so the shifted targets never run past the end of the data
            batch_size_i = min(batch_size, (data.size(0) - i - 1) // seq_length)
            inputs = data[i:i+seq_length * batch_size_i].view(batch_size_i, seq_length)
            targets = data[i+1:i+seq_length * batch_size_i+1].view(batch_size_i, seq_length)

            states = detach(states)
            states = (states[0][:, :batch_size_i, :], states[1][:, :batch_size_i, :])
            outputs, states = model(inputs, states)
            loss = criterion(outputs, targets.reshape(-1))

            model.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), clip)
            optimizer.step()

        print(f"Epoch: {epoch+1}, Loss: {loss.item():.4f}")

def detach(states):
    return [state.detach() for state in states]
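The slicing in the training loop is easier to follow on a tiny example. The sketch below applies the same indexing to a made-up stream of token ids, showing that the targets are the inputs shifted by one position. The toy `data` tensor and sizes are assumptions for illustration.

```python
import torch

data = torch.arange(13)        # stand-in for a stream of token ids
batch_size, seq_length = 2, 3
i = 0                          # start offset of the first batch

inputs = data[i:i + seq_length * batch_size].view(batch_size, seq_length)
targets = data[i + 1:i + seq_length * batch_size + 1].view(batch_size, seq_length)

print(inputs)   # tensor([[0, 1, 2], [3, 4, 5]])
print(targets)  # tensor([[1, 2, 3], [4, 5, 6]])
```

At every position the model sees the token in `inputs` and is trained to predict the token at the same position in `targets`, which is the next token in the stream.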

## Evaluating the Language Model

After training the language model, it's important to evaluate its performance on unseen data. We will evaluate the model on the training, validation, and test sets using the cross-entropy loss metric.

The evaluation process involves:
1. Setting the model to evaluation mode
2. Initializing the hidden and cell states
3. Iterating over the evaluation data in batches
4. Performing forward pass to get the model outputs
5. Computing the loss between the predicted outputs and the target words
6. Accumulating the total loss
7. Returning the average loss over the evaluation data

In [None]:
def evaluate_model(model, data, batch_size, seq_length):
    """
    Evaluates the language model on the given data.

    Args:
        model: Language model to be evaluated
        data: Evaluation data tensor
        batch_size: Batch size for evaluation
        seq_length: Length of each evaluation sequence

    Returns:
        average_loss: Mean per-token cross-entropy loss over the evaluation data
        perplexity: exp(average_loss), a measure of how well the model predicts the target words
    """
    # Read the LSTM dimensions from the model instead of relying on global variables
    num_layers = model.lstm.num_layers
    hidden_size = model.lstm.hidden_size

    model.eval()  # switch to evaluation mode
    with torch.no_grad():
        states = (torch.zeros(num_layers, batch_size, hidden_size),
                  torch.zeros(num_layers, batch_size, hidden_size))
        total_loss = 0.0
        total_tokens = 0
        # Sum the loss so that batches of different sizes are weighted by their token count
        criterion = nn.CrossEntropyLoss(reduction='sum')

        for i in range(0, data.size(0) - seq_length, seq_length * batch_size):
            # Reserve one token so the shifted targets never run past the end of the data
            batch_size_i = min(batch_size, (data.size(0) - i - 1) // seq_length)
            inputs = data[i:i+seq_length * batch_size_i].view(batch_size_i, seq_length)
            targets = data[i+1:i+seq_length * batch_size_i+1].view(batch_size_i, seq_length)

            states = (states[0][:, :batch_size_i, :], states[1][:, :batch_size_i, :])
            outputs, states = model(inputs, states)
            total_loss += criterion(outputs, targets.reshape(-1)).item()
            total_tokens += targets.numel()

        # Average over the number of predicted tokens, not the number of sequences
        average_loss = total_loss / total_tokens
        # Calculate perplexity
        perplexity = math.exp(average_loss)

        return average_loss, perplexity

## Putting It All Together

Now that we have defined the language model, training loop, and evaluation function, let's put everything together and train our model on the Penn Treebank dataset.

We will:
1. Set the hyperparameters for the model and training
2. Instantiate the language model
3. Train the model using the `train_model` function
4. Evaluate the trained model on the training, validation, and test sets using the `evaluate_model` function
5. Print the evaluation results

In [None]:
# Hyperparameters
vocab_size = len(vocab)
embed_size = 128
hidden_size = 256
num_layers = 2
epochs = 10
batch_size = 32
seq_length = 35
lr = 0.001
clip = 1

In [None]:
# Instantiate the language model
model = LanguageModel(vocab_size, embed_size, hidden_size, num_layers)

In [None]:
# Train the model
train_model(model, train_data, epochs, batch_size, seq_length, lr, clip)

Epoch: 1, Loss: 5.5414
Epoch: 2, Loss: 5.1255
Epoch: 3, Loss: 4.8867
Epoch: 4, Loss: 4.6955
Epoch: 5, Loss: 4.5191
Epoch: 6, Loss: 4.3592
Epoch: 7, Loss: 4.2185
Epoch: 8, Loss: 4.0856
Epoch: 9, Loss: 3.9541
Epoch: 10, Loss: 3.8305


Results:
* Epoch: 1, Loss: 5.5389
* Epoch: 2, Loss: 5.1305
* Epoch: 3, Loss: 4.8541
* Epoch: 4, Loss: 4.6449
* Epoch: 5, Loss: 4.4670
* Epoch: 6, Loss: 4.3003
* Epoch: 7, Loss: 4.1513
* Epoch: 8, Loss: 4.0137
* Epoch: 9, Loss: 3.8938
* Epoch: 10, Loss: 3.7792

In [None]:
# Save the trained model
torch.save(model.state_dict(), 'model.pth')

In [None]:
model = LanguageModel(vocab_size, embed_size, hidden_size, num_layers)
model.load_state_dict(torch.load('model.pth'))

<All keys matched successfully>

In [None]:
# Evaluate the model
train_loss, train_perplexity = evaluate_model(model, train_data, batch_size, seq_length)
val_loss, val_perplexity = evaluate_model(model, val_data, batch_size, seq_length)
test_loss, test_perplexity = evaluate_model(model, test_data, batch_size, seq_length)

Perplexity is a commonly used metric to evaluate language models. It measures how well the model predicts the target words in a sequence. A lower perplexity indicates better performance, as it means the model assigns higher probability to the words that actually occur.

Perplexity is calculated as the exponential of the average cross-entropy loss over the evaluation data. It can be interpreted as the average number of equally likely words that the model considers at each step of the sequence.

For example, a perplexity of 10 means that, on average, the model is considering 10 equally likely words at each step. A lower perplexity indicates that the model is more certain about its predictions and narrows down the choices more effectively.
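The "10 equally likely words" interpretation can be checked directly. In the sketch below, a hypothetical model gives identical scores to all 10 words in a toy vocabulary, so its predicted distribution is uniform. The cross-entropy loss is then `ln(10)`, and its exponential is exactly the vocabulary size.

```python
import math
import torch
import torch.nn as nn

vocab_size = 10
logits = torch.zeros(1, vocab_size)  # equal scores give a uniform distribution over 10 words
target = torch.tensor([3])           # with uniform predictions, any target gives the same loss

loss = nn.CrossEntropyLoss()(logits, target).item()
perplexity = math.exp(loss)

print(f"loss = {loss:.4f}, perplexity = {perplexity:.2f}")  # loss = 2.3026, perplexity = 10.00
```

A model that has learned nothing about a vocabulary of size `V` therefore has a perplexity of about `V`, while a perplexity close to 1 would mean the model predicts almost every next word correctly.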

In [None]:
# Print the evaluation results
print(f"Train Loss: {train_loss:.4f} | Train Perplexity: {train_perplexity:.4f}")
print(f"Validation Loss: {val_loss:.4f} | Validation Perplexity: {val_perplexity:.4f}")
print(f"Test Loss: {test_loss:.4f} | Test Perplexity: {test_perplexity:.4f}")

Train Loss: 0.1421 | Train Perplexity: 1.1527
Validation Loss: 0.1617 | Validation Perplexity: 1.1755
Test Loss: 0.1587 | Test Perplexity: 1.1719


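Keep in mind that this model is still quite weak, and the evaluation figures above should not be taken at face value. A perplexity close to 1 would mean near-perfect next-word prediction, which is inconsistent with the final training loss of about 3.8 (a perplexity of roughly 44 to 46). The figures were produced by dividing a sum of per-batch mean losses by the number of sequences rather than the number of batches, which understates the loss by a factor of about `batch_size`.

To make a reasonably good model you need to train it for much longer (ideally on a GPU, which is faster). For comparison, you can try a much larger pretrained language model such as [BERT](https://huggingface.co/docs/transformers/en/model_doc/bert) on Hugging Face [here](https://huggingface.co/google-bert/bert-large-uncased). Note that BERT is a transformer-based model, not an LSTM.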

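## Why are RNNs and LSTMs important to you?

RNNs and LSTMs are the direct predecessors of the transformer architecture, which we'll look at in a different notebook. Transformers replace recurrence with attention, but they reuse many of the ideas introduced here, such as token embeddings, next-word prediction, and training with cross-entropy loss. The transformer architecture underlies the GPT models and other cutting-edge models used today.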