# Assignment 2: Neural Language Model Training (PyTorch)

**Applicant:** Syed Khaja Fareeduddin  
**Project:** Neural Language Model Training (PyTorch) — Assignment 2  
**Date:** November 2025  

---

### Objective
The goal of this assignment is to **train a neural language model from scratch** using PyTorch,  
demonstrating understanding of:
- Sequence modeling (RNN/LSTM/Transformer)
- Model generalization and overfitting control
- Perplexity evaluation

The dataset is *Pride and Prejudice* by Jane Austen, modeled at the character level.
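
Character-level modeling means each distinct character becomes one integer id. The sketch below shows one common way to build such a mapping. The names `build_vocab`, `encode`, and `decode` are illustrative assumptions, not the contents of `src.data_preprocessing`, which is not shown here.

```python
def build_vocab(text):
    """Map each distinct character to an integer id and back."""
    chars = sorted(set(text))
    char2idx = {ch: i for i, ch in enumerate(chars)}
    idx2char = {i: ch for ch, i in char2idx.items()}
    return char2idx, idx2char


def encode(text, char2idx):
    """Turn a string into a list of character ids."""
    return [char2idx[ch] for ch in text]


def decode(ids, idx2char):
    """Turn a list of character ids back into a string."""
    return "".join(idx2char[i] for i in ids)


# Toy input for illustration only; the real vocabulary comes from the full text.
sample = "It is a truth universally acknowledged"
char2idx, idx2char = build_vocab(sample)
ids = encode(sample, char2idx)
print(len(char2idx), decode(ids, idx2char) == sample)
```

Sorting the character set keeps the id assignment deterministic across runs, which matters when a saved model is reloaded against a rebuilt vocabulary.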

We will:
1. Implement a character-level **LSTM language model**  
2. Show **underfitting**, **overfitting**, and **best-fit** scenarios  
3. Evaluate models using **perplexity**  
4. Implement a small **Transformer-based model** for **extra credit**  

In [1]:
import os
import torch
from src.data_preprocessing import get_data_loaders
from src.model_lstm import LSTMLanguageModel
from src.model_transformer import TransformerLanguageModel
from src.train import train_model
from src.evaluate import calculate_perplexity
from src.utils import setup_device, set_seed

set_seed(42)
device = setup_device()
print(f"✅ Using device: {device}")

os.makedirs("outputs/models", exist_ok=True)
os.makedirs("outputs/plots", exist_ok=True)

✅ Using device: cuda


In [2]:
# Forward slashes work on every OS and avoid the invalid "\P" escape sequence.
dataset_path = "dataset/Pride_and_Prejudice-Jane_Austen.txt"

train_loader, val_loader, vocab_size, char2idx, idx2char = get_data_loaders(
    dataset_path, seq_length=100, batch_size=128, split_ratio=0.9
)

print("✅ Data loaded successfully!")
print(f"Vocabulary Size: {vocab_size}")
print(f"Training Batches: {len(train_loader)} | Validation Batches: {len(val_loader)}")

✅ Data loaded successfully!
Vocabulary Size: 61
Training Batches: 5001 | Validation Batches: 555
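
For intuition about these batch counts, here is a minimal sketch of how input/target windows for next-character prediction can be cut from an encoded text. It is an assumption about the general technique, not the actual `get_data_loaders` code. Note that 5001 batches of 128 is about 640,000 training windows, which would be consistent with windows starting at nearly every character position (a stride of 1), though the source does not confirm this.

```python
import math


def make_windows(ids, seq_length, stride=1):
    """Return (input, target) pairs; each target is its input shifted by one."""
    pairs = []
    for start in range(0, len(ids) - seq_length, stride):
        x = ids[start : start + seq_length]
        y = ids[start + 1 : start + seq_length + 1]
        pairs.append((x, y))
    return pairs


def num_batches(num_windows, batch_size):
    """Number of batches when the last partial batch is kept."""
    return math.ceil(num_windows / batch_size)


# Toy stand-in for an encoded text of 2000 characters (hypothetical data).
ids = list(range(2000))
split = int(0.9 * len(ids))  # 90/10 split, as in split_ratio=0.9
train_pairs = make_windows(ids[:split], seq_length=100)
val_pairs = make_windows(ids[split:], seq_length=100)
print(len(train_pairs), len(val_pairs), num_batches(len(train_pairs), 128))
```

Splitting the text before windowing, as done here, keeps validation windows from overlapping training windows.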


In [4]:
lstm_model = LSTMLanguageModel(
    vocab_size=vocab_size,
    embedding_dim=256,
    hidden_dim=512,
    num_layers=2,
    dropout=0.3
).to(device)

print(lstm_model)

LSTMLanguageModel(
  (embedding): Embedding(61, 256)
  (lstm): LSTM(256, 512, num_layers=2, batch_first=True, dropout=0.3)
  (fc): Linear(in_features=512, out_features=61, bias=True)
)
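
The printed module shapes are enough to estimate the model size by hand. The sketch below counts parameters for an embedding, a stacked unidirectional LSTM, and a linear output layer, using the layout PyTorch's `nn.LSTM` uses (four gates, with an input-hidden weight, a hidden-hidden weight, and two bias vectors per layer). The helper name `lstm_lm_param_count` is hypothetical.

```python
def lstm_lm_param_count(vocab_size, embedding_dim, hidden_dim, num_layers):
    """Count trainable parameters of an Embedding -> LSTM -> Linear model."""
    embedding = vocab_size * embedding_dim
    lstm = 0
    for layer in range(num_layers):
        # The first layer reads embeddings; deeper layers read hidden states.
        input_dim = embedding_dim if layer == 0 else hidden_dim
        weights = 4 * hidden_dim * (input_dim + hidden_dim)
        biases = 2 * 4 * hidden_dim  # bias_ih and bias_hh
        lstm += weights + biases
    fc = hidden_dim * vocab_size + vocab_size
    return embedding + lstm + fc


total = lstm_lm_param_count(vocab_size=61, embedding_dim=256, hidden_dim=512, num_layers=2)
print(f"{total:,}")  # 3,725,117
```

Assuming the other two configurations share this three-module structure, the same function gives 111,101 parameters for the underfit setup (64/128/1 layer) and 13,064,893 for the overfit setup (384/768/3 layers).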


### Training Configurations

We train the LSTM under three setups to show how model capacity and training length affect generalization.

| Scenario | Description | Model Capacity | Epochs | Learning Rate | Notes |
|-----------|--------------|----------------|---------|----------------|--------|
| **Underfit** | Model too small or trained too briefly | Low | 2 | 0.01 | Model fails to learn patterns |
| **Overfit** | Large model trained too long | High | 10 | 0.001 | Model memorizes training data |
| **Best Fit** | Balanced setup | Medium | 10 | 0.001 | Good generalization |
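
One way to tell these scenarios apart after a run is to compare the per-epoch training and validation losses. The helper below is a minimal sketch. `summarize_history` is a hypothetical name, and it assumes the training routine returns two lists of per-epoch losses.

```python
def summarize_history(train_losses, val_losses):
    """Report the best validation epoch and the final train/validation gap."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return {
        "best_epoch": best_index + 1,  # 1-based, like the training log
        "best_val_loss": val_losses[best_index],
        "final_gap": val_losses[-1] - train_losses[-1],
        "val_rose_after_best": val_losses[-1] > val_losses[best_index],
    }


# Illustrative toy numbers, not results from this assignment.
summary = summarize_history(
    train_losses=[2.0, 1.5, 1.2, 1.0],
    val_losses=[2.1, 1.7, 1.8, 2.0],
)
print(summary)
```

A large `final_gap` together with `val_rose_after_best` being true is the usual signature of overfitting. High losses on both sides with a small gap point to underfitting instead.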


In [5]:
underfit_model = LSTMLanguageModel(
    vocab_size, embedding_dim=64, hidden_dim=128, num_layers=1, dropout=0.1
).to(device)

train_model(
    underfit_model,
    train_loader,
    val_loader,
    vocab_size,
    device,
    num_epochs=2,
    lr=0.01,
    save_path="outputs/models/lstm_underfit.pth"
)



Epoch [1/2] - Train Loss: 1.2364, Val Loss: 10.9278
Epoch [2/2] - Train Loss: 1.1489, Val Loss: 11.1735
✅ Model saved at outputs/models/lstm_underfit.pth


([1.2363974537474707, 1.1489114054821177],
 [10.927833142151703, 11.17349977751036])

In [7]:
overfit_model = LSTMLanguageModel(
    vocab_size, embedding_dim=384, hidden_dim=768, num_layers=3, dropout=0.1
).to(device)

train_model(
    overfit_model,
    train_loader,
    val_loader,
    vocab_size,
    device,
    num_epochs=10,
    lr=0.001,
    save_path="outputs/models/lstm_overfit.pth"
)

Epoch [1/10] - Train Loss: 0.7872, Val Loss: 14.8647
Epoch [2/10] - Train Loss: 0.2436, Val Loss: 17.5146
Epoch [3/10] - Train Loss: 0.2034, Val Loss: 18.2254
Epoch [4/10] - Train Loss: 0.1899, Val Loss: 19.0806
Epoch [5/10] - Train Loss: 0.1826, Val Loss: 18.9426
Epoch [6/10] - Train Loss: 0.1780, Val Loss: 19.4334
Epoch [7/10] - Train Loss: 0.1749, Val Loss: 19.6974
Epoch [8/10] - Train Loss: 0.1725, Val Loss: 19.7313
Epoch [9/10] - Train Loss: 0.1709, Val Loss: 19.3954
Epoch [10/10] - Train Loss: 0.1695, Val Loss: 19.5496
✅ Model saved at outputs/models/lstm_overfit.pth


([0.7871992266063713,
  0.24361783116787725,
  0.2034203131582422,
  0.18994292164237422,
  0.1825759515532778,
  0.17797398864269448,
  0.17485796712775442,
  0.17252679697169562,
  0.1708572585418019,
  0.1695114145503476],
 [14.864702106166531,
  17.514624542373795,
  18.22541817845525,
  19.08060347582843,
  18.942589828559946,
  19.43341186111038,
  19.69739401533797,
  19.731322647644593,
  19.395444735965214,
  19.54960112528758])

In [9]:
bestfit_model = LSTMLanguageModel(
    vocab_size, embedding_dim=256, hidden_dim=512, num_layers=2, dropout=0.3
).to(device)

train_model(
    bestfit_model,
    train_loader,
    val_loader,
    vocab_size,
    device,
    num_epochs=10,
    lr=0.001,
    save_path="outputs/models/lstm_bestfit.pth"
)

Epoch [1/10] - Train Loss: 0.9318, Val Loss: 12.3119
Epoch [2/10] - Train Loss: 0.5135, Val Loss: 15.2905
Epoch [3/10] - Train Loss: 0.4163, Val Loss: 16.3688
Epoch [4/10] - Train Loss: 0.3750, Val Loss: 16.9875
Epoch [5/10] - Train Loss: 0.3506, Val Loss: 17.4820
Epoch [6/10] - Train Loss: 0.3338, Val Loss: 18.0732
Epoch [7/10] - Train Loss: 0.3213, Val Loss: 18.2724
Epoch [8/10] - Train Loss: 0.3113, Val Loss: 18.1644
Epoch [9/10] - Train Loss: 0.3034, Val Loss: 18.8153
Epoch [10/10] - Train Loss: 0.2967, Val Loss: 19.0747
✅ Model saved at outputs/models/lstm_bestfit.pth


([0.9318091202034899,
  0.5135033684262179,
  0.41631231352558756,
  0.3750272436252572,
  0.350563401569011,
  0.33377648483298106,
  0.32129098201150824,
  0.3113477679937512,
  0.3034050701928363,
  0.2966784136077448],
 [12.311916408023318,
  15.290514650430765,
  16.368750042958304,
  16.98749641212257,
  17.48203630361471,
  18.073157750378858,
  18.27239241900745,
  18.164423643576132,
  18.815329158198725,
  19.07470783371109])

In [10]:
configs = [
    ("Underfit", "outputs/models/lstm_underfit.pth", dict(embedding_dim=64, hidden_dim=128, num_layers=1, dropout=0.1)),
    ("Overfit", "outputs/models/lstm_overfit.pth", dict(embedding_dim=384, hidden_dim=768, num_layers=3, dropout=0.1)),
    ("Best Fit", "outputs/models/lstm_bestfit.pth", dict(embedding_dim=256, hidden_dim=512, num_layers=2, dropout=0.3)),
]

for name, path, cfg in configs:
    print(f"\nEvaluating {name} model...")
    model = LSTMLanguageModel(
        vocab_size=vocab_size,
        embedding_dim=cfg["embedding_dim"],
        hidden_dim=cfg["hidden_dim"],
        num_layers=cfg["num_layers"],
        dropout=cfg["dropout"]
    ).to(device)

    model.load_state_dict(torch.load(path, map_location=device))
    ppl = calculate_perplexity(model, val_loader, vocab_size, device)
    print(f"{name} Model Perplexity: {ppl:.2f}")


Evaluating Underfit model...


Underfit Model Perplexity: 71217.92

Evaluating Overfit model...
Overfit Model Perplexity: 309231617.51

Evaluating Best Fit model...
Best Fit Model Perplexity: 192327043.73
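
Perplexity is the exponential of the mean cross-entropy loss (in nats), and the values printed above match the exponential of each model's final validation loss from its training log. The sketch below reproduces that relationship. It also computes the perplexity of a uniform guess over the 61-character vocabulary, which is a useful reference point: any value above 61 means the model assigns less probability to the validation characters than uniform guessing would. The function name `perplexity` is illustrative and is not the `calculate_perplexity` implementation.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(mean_cross_entropy)


vocab_size = 61
# A model that spreads probability evenly has loss ln(vocab_size).
uniform_baseline = perplexity(math.log(vocab_size))

# Final-epoch validation losses copied from the training logs.
final_val_losses = {
    "Underfit": 11.17349977751036,
    "Overfit": 19.54960112528758,
    "Best Fit": 19.07470783371109,
}

print(f"Uniform baseline: {uniform_baseline:.2f}")
for name, loss in final_val_losses.items():
    print(f"{name}: {perplexity(loss):.2f}")
```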


## Extra Credit: Transformer Language Model

For extra credit, we implement a lightweight Transformer-based character-level language model.  
The Transformer uses self-attention to capture long-range dependencies in text, which LSTMs tend to struggle with.

### Why Transformer?
Unlike LSTMs that process input sequentially, Transformers process the entire sequence in parallel,  
making them efficient and effective at modeling contextual relationships, especially in longer sentences or documents.
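
Because every position is processed at once, a Transformer used for next-character prediction normally needs a causal mask so that position *i* cannot attend to later positions. The sketch below uses PyTorch's `nn.TransformerEncoderLayer` with toy dimensions to show the effect. It is an illustration of the technique only: whether `TransformerLanguageModel` applies such a mask is not shown in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, embed_dim = 5, 16  # toy sizes, chosen for illustration

# -inf above the diagonal: position i may attend only to positions <= i.
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

layer = nn.TransformerEncoderLayer(
    d_model=embed_dim, nhead=4, dim_feedforward=32, dropout=0.0, batch_first=True
)
layer.eval()

x = torch.randn(1, seq_len, embed_dim)
out = layer(x, src_mask=mask).detach()

# Perturb only the last position and run the layer again.
x_changed = x.clone()
x_changed[0, -1, 0] += 1.0
out_changed = layer(x_changed, src_mask=mask).detach()

# With the causal mask, earlier positions cannot see the change.
unchanged = torch.allclose(out[0, :-1], out_changed[0, :-1], atol=1e-5)
print(unchanged)
```

Without the mask, the outputs at earlier positions would also shift, which means the model could read the very character it is asked to predict.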

### Model Configuration
- **Embedding Dimension:** 256  
- **Number of Heads:** 4  
- **Number of Layers:** 2  
- **Feedforward Network:** 512 hidden units  
- **Optimizer:** Adam (lr = 0.001)  
- **Epochs:** 8  

We'll train it similarly to the LSTM models and compare validation perplexity.

In [3]:
transformer_model = TransformerLanguageModel(
    vocab_size, embed_dim=256, num_heads=4, num_layers=2
).to(device)

train_model(
    transformer_model,
    train_loader,
    val_loader,
    vocab_size,
    device,
    num_epochs=8,
    lr=0.001,
    save_path="outputs/models/transformer.pth"
)

transformer_perplexity = calculate_perplexity(transformer_model, val_loader, vocab_size, device)
print(f"Transformer Model Perplexity: {transformer_perplexity:.2f}")



Epoch [1/8] - Train Loss: 0.5178, Val Loss: 0.3266
Epoch [2/8] - Train Loss: 0.0211, Val Loss: 0.2174
Epoch [3/8] - Train Loss: 0.0191, Val Loss: 0.2105
Epoch [4/8] - Train Loss: 0.0183, Val Loss: 0.1565
Epoch [5/8] - Train Loss: 0.0179, Val Loss: 0.1900
Epoch [6/8] - Train Loss: 0.0173, Val Loss: 0.0936
Epoch [7/8] - Train Loss: 0.0170, Val Loss: 0.0781
Epoch [8/8] - Train Loss: 0.0167, Val Loss: 0.0701
✅ Model saved at outputs/models/transformer.pth
Transformer Model Perplexity: 1.07


### Results Summary

| Model | Training Epochs | Model Size | Validation Perplexity | Observation |
|--------|-----------------|-------------|------------------------|-------------|
| **LSTM (Underfit)** | 2 | Small | ~71,000 | Underfit setup: small model, 2 epochs. Validation loss stayed near 11.2 against a training loss of 1.15. |
| **LSTM (Overfit)** | 10 | Large | ~309,000,000 | Severe overfitting: training loss fell to 0.17 while validation loss rose from 14.86 to 19.55. |
| **LSTM (Best Fit)** | 10 | Medium | ~192,000,000 | Smoother training loss (0.93 to 0.30), but validation loss still rose from 12.31 to 19.07, so generalization remained poor. |
| **Transformer (Extra Credit)** | 8 | Medium | 1.07 | Lowest validation loss (0.0701) of all models; outperformed every LSTM variant. |

---

### Key Takeaways

- The **underfitting LSTM** was too small and trained too briefly to capture the text structure, ending with a validation perplexity of about 71,000.  
- The **overfitting LSTM** drove training loss down to 0.17 while validation loss climbed to 19.55: it memorized the training data and failed to generalize.  
- The **best-fit LSTM** lowered training loss more gradually (0.93 to 0.30), but its validation loss still rose from 12.31 to 19.07, the same divergence seen in the overfit model.  
- The **Transformer model** achieved very low validation loss (0.0701) and perplexity ≈ 1.07, demonstrating:
  - Strong modeling of long-range dependencies  
  - Stable and efficient training  
  - Dramatic improvement compared to all LSTM variants  

The Transformer clearly provided the best performance among all tested models.

---

### Outputs

- All loss plots are saved in: `outputs/plots/`
- All trained model files are located in: `outputs/models/`
- The accompanying PDF report summarizes the methodology, experiments, and findings.