# Exercise: Training an RNN/LSTM/GRU model

## Task 1: Train RNN/LSTM/GRU model

*datasets* is a lightweight library from Hugging Face. It provides ready-to-use NLP datasets. You can load, process, and share datasets with just a few lines of code.

In [None]:
# Load the ag_news dataset
from datasets import load_dataset
import numpy as np

dataset = load_dataset("ag_news")

AG's corpus of news articles is a collection of more than 1 million news articles. The AG's news topic classification dataset [1] is constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples, for a total of 120,000 training samples and 7,600 testing samples.

[1] Xiang Zhang, Junbo Zhao, Yann LeCun. 2015. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

An example from the dataset looks as follows:

{
    "label": 3,
    "text": "New iPad released Just like every other September, this one is no different. Apple is planning to release a bigger, heavier, fatter iPad that..."
}

- `text`: a string feature.
- `label`: a classification label, one of World (0), Sports (1), Business (2), Sci/Tech (3).
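For readable output, the integer labels can be mapped back to their category names. A minimal sketch, assuming the ID order listed above; `LABEL_NAMES` and `label_name` are illustrative names and not part of the `datasets` API:

```python
# Category names in ID order, as listed above (World=0 ... Sci/Tech=3).
LABEL_NAMES = ["World", "Sports", "Business", "Sci/Tech"]

def label_name(label_id):
    """Return the category name for an integer label (hypothetical helper)."""
    return LABEL_NAMES[label_id]

example = {"label": 3, "text": "New iPad released ..."}
print(label_name(example["label"]))  # Sci/Tech
```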

We analyze the training split of the dataset:

- **Shape**: `(num_rows, num_features)`, the number of samples and the number of features per sample.  
- **Average text length**: the mean number of whitespace-separated words per sample.  
- **Label distribution**: the number of samples per label as `{label: count, ...}`, which shows whether the classes are balanced.  

In [None]:

train_data = dataset["train"]

# Number of rows and columns (features)
shape = (len(train_data), len(train_data.features))

# Calculate average text length (in words)
avg_len = np.mean([len(x.split()) for x in train_data["text"]])

# Count label distribution
from collections import Counter
label_counts = Counter(train_data["label"])

print(shape, avg_len, label_counts)

(120000, 2) 37.84745 Counter({2: 30000, 3: 30000, 1: 30000, 0: 30000})


### A simple tokenizer

- **Step 1: Load dataset and tokenizer**  
  The AG News dataset is loaded, which contains short news articles classified into four categories.  
  At the same time, a pretrained tokenizer from the Hugging Face *transformers* library is loaded (any other tokenizer would also work). This tokenizer breaks text into subword tokens and maps them to numerical IDs.  

- **Step 2: Tokenization and preprocessing**  
  Each piece of text is tokenized using the tokenizer.  
  The text is truncated or padded so that every sequence has exactly 50 tokens.  

- **Step 3: Convert to tensors**  
  The tokenized sequences (`input_ids`) and their labels are converted into PyTorch tensors.  
  These tensors form the inputs (news text as IDs) and outputs (labels 0–3) for the model.  

- **Step 4: Build DataLoaders**  
  The training and test sets are wrapped in PyTorch DataLoader objects.  
  Each DataLoader provides mini-batches of 64 samples at a time.  
  The training DataLoader also shuffles the data so that the model sees the samples in random order each epoch.  
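The padding, truncation, and batching steps above can be illustrated without any external library. The following is only a toy sketch: it uses a whitespace tokenizer rather than the Hugging Face subword tokenizer, and `build_vocab`, `encode_fixed_length`, `batches`, `PAD_ID`, and `UNK_ID` are hypothetical names chosen for this example.

```python
# Toy stand-in for Steps 2-4: whitespace tokenization, fixed-length padding,
# and mini-batching. This is NOT the Hugging Face tokenizer.
PAD_ID = 0  # ID used to fill short sequences
UNK_ID = 1  # ID used for words missing from the vocabulary

def build_vocab(texts):
    """Assign an integer ID to every distinct lowercase word (IDs start at 2)."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab) + 2)
    return vocab

def encode_fixed_length(text, vocab, max_length):
    """Map words to IDs, then truncate or right-pad to exactly max_length."""
    ids = [vocab.get(word, UNK_ID) for word in text.lower().split()]
    ids = ids[:max_length]
    return ids + [PAD_ID] * (max_length - len(ids))

def batches(items, batch_size):
    """Yield consecutive mini-batches (no shuffling in this sketch)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = ["Stocks rally on earnings", "Team wins the final", "New chip announced"]
vocab = build_vocab(texts)
encoded = [encode_fixed_length(t, vocab, max_length=5) for t in texts]
for batch in batches(encoded, batch_size=2):
    print(batch)
```

Every encoded sequence has the same length, which is what allows the sequences to be stacked into a single tensor per batch.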

In [None]:
# A simple Tokenizer
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer
import torch
import torch.nn as nn
import torch.optim as optim

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def encode(batch):
    return tokenizer(
        batch["text"],
        padding="max_length",
        truncation=True,
        max_length=50
    )

dataset = dataset.map(encode, batched=True)


def to_torch_dataset(split):
    return TensorDataset(
        torch.tensor(split["input_ids"], dtype=torch.long),
        torch.tensor(split["label"], dtype=torch.long)
    )

train_dataset = to_torch_dataset(dataset["train"])
test_dataset = to_torch_dataset(dataset["test"])

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64)

## Task 2: Train RNN Model

- **Class definition (`SimpleRNNClassifier`)**  
  A simple neural network model for text classification using a vanilla RNN.

- **Embedding layer (`nn.Embedding`)**  
  Converts token IDs into dense vectors of fixed size (`embed_dim`).  
  `padding_idx=0` keeps the embedding of the token with ID 0 (used here as padding) fixed at zero and excludes it from gradient updates.  

- **RNN layer (`nn.RNN`)**  
  Processes the sequence of word embeddings one step at a time.  
  `batch_first=True` means the input has shape `[batch, seq_len, embed_dim]` and the output has shape `[batch, seq_len, hidden_dim]`.  

- **Fully connected layer (`nn.Linear`)**  
  Maps the final hidden state of the RNN to the output classes (`num_classes`).  

- **Forward pass (`forward`)**  
  1. Convert token IDs → embeddings.  
  2. Pass embeddings through the RNN → get hidden states.  
  3. Take the final hidden state, of shape `[1, batch, hidden_dim]`, and drop its leading dimension (`hidden.squeeze(0)`).  
  4. Pass it through the linear layer to get class scores.  

- **Model instantiation**  
  Creates the model with:
  - `vocab_size`: size of the tokenizer vocabulary.  
  - `embed_dim=128`: embedding dimension.  
  - `hidden_dim=128`: RNN hidden state size.  
  - `num_classes=4`: four news categories in AG News.  
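To make "one step at a time" concrete, here is a minimal pure-Python sketch of the recurrence a vanilla RNN computes for a single sequence, $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$. It is an illustration under simplifying assumptions (a single bias vector, no batching), not how `nn.RNN` is implemented; `rnn_forward` and the weight values are made up for this example.

```python
import math

def rnn_forward(inputs, w_xh, w_hh, b_h):
    """Run a vanilla RNN over one sequence and return the final hidden state.

    inputs: list of input vectors (one per time step), each of length E
    w_xh:   H x E input-to-hidden weights
    w_hh:   H x H hidden-to-hidden weights
    b_h:    bias vector of length H
    """
    hidden_dim = len(b_h)
    h = [0.0] * hidden_dim  # initial hidden state h_0
    for x in inputs:
        # The new state depends on the current input and the previous state.
        h = [
            math.tanh(
                sum(w_xh[j][k] * x[k] for k in range(len(x)))
                + sum(w_hh[j][k] * h[k] for k in range(hidden_dim))
                + b_h[j]
            )
            for j in range(hidden_dim)
        ]
    return h

# Tiny example: E = 2, H = 2, sequence length 3 (weights chosen arbitrarily).
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_xh = [[0.5, -0.5], [0.25, 0.75]]
w_hh = [[0.1, 0.0], [0.0, 0.1]]
b_h = [0.0, 0.0]
final_hidden = rnn_forward(sequence, w_xh, w_hh, b_h)
print(final_hidden)  # two values in (-1, 1), one per hidden unit
```

The classifier feeds exactly this kind of final hidden state into the linear layer.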

In [5]:
class SimpleRNNClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        x = self.embedding(x)              # [B, L, E]
        output, hidden = self.rnn(x)       # hidden: [1, B, H]
        out = self.fc(hidden.squeeze(0))   # [B, num_classes]
        return out

vocab_size = tokenizer.vocab_size
model = SimpleRNNClassifier(vocab_size, embed_dim=128, hidden_dim=128, num_classes=4)

Train and evaluate the model

You need to complete the code, including both the training and evaluation process.
The code evaluates model performance after each training epoch:

- **Per-class F1**: with `average=None`, `f1_score` computes the F1 score separately for each class, showing how well the model performs on each label.
- **Macro-F1**: with `average="macro"`, `f1_score` returns the unweighted mean of the per-class F1 scores, treating all classes equally regardless of their frequency.
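To see what the two metrics measure, the same quantities can be computed by hand on a toy example. This is a sketch for intuition only; `per_class_f1` is a hypothetical helper, and its handling of classes that never appear (F1 set to 0) is an assumption of this example.

```python
def per_class_f1(y_true, y_pred, num_classes):
    """Compute F1 for each class from true and predicted label lists."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); set to 0 here if the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
f1s = per_class_f1(y_true, y_pred, num_classes=3)
macro_f1 = sum(f1s) / len(f1s)  # unweighted mean over classes
print(f1s, macro_f1)
```

Here class 1 is predicted best (F1 of 0.8), while classes 0 and 2 each lose one sample to another class; the macro-F1 averages the three scores with equal weight.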

In [6]:
from tqdm import tqdm
import torch
from sklearn.metrics import f1_score

def train_and_evaluate(model, train_loader, test_loader, num_epochs=3, lr=1e-3, device="cpu"):
    """
    Train a text classification model and evaluate with F1 per class + macro-F1.

    Args:
        model: PyTorch model (nn.Module)
        train_loader: DataLoader for training set
        test_loader: DataLoader for test/validation set
        num_epochs: number of epochs
        lr: learning rate
        device: "cpu" or "cuda"
    """
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)

    for epoch in range(num_epochs):
        # Training phase
        model.train()
        total_loss = 0.0

        for X, y in tqdm(train_loader, desc=f"Epoch {epoch+1}", leave=False):
            X, y = X.to(device), y.to(device)
            optimizer.zero_grad()
            preds = model(X)
            loss = criterion(preds, y)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        avg_loss = total_loss / len(train_loader)

        # Evaluation phase
        model.eval()
        all_preds, all_labels = [], []
        with torch.no_grad():
            for X, y in test_loader:
                X, y = X.to(device), y.to(device)
                preds = model(X)
                predicted = preds.argmax(dim=1)
                all_preds.extend(predicted.cpu().tolist())
                all_labels.extend(y.cpu().tolist())

        # F1 per class
        f1_per_class = f1_score(all_labels, all_preds, average=None)

        # Macro-F1
        macro_f1 = f1_score(all_labels, all_preds, average="macro")

        print(f"\nEpoch {epoch+1}/{num_epochs}")
        print(f"  Train Loss: {avg_loss:.4f}")
        for i, f1 in enumerate(f1_per_class):
            print(f"  Class {i} F1: {f1:.4f}")
        print(f"  Macro-F1: {macro_f1:.4f}")

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device is", device)
train_and_evaluate(model, train_loader, test_loader, num_epochs=5, device=device)

Following the structure of `SimpleRNNClassifier`, implement `SimpleLSTMClassifier` and `SimpleGRUClassifier`, and check whether they outperform the vanilla RNN.

In [None]:
class SimpleLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        x = self.embedding(x)
        output, (hidden, cell) = self.lstm(x)
        out = self.fc(hidden.squeeze(0))
        return out


vocab_size = tokenizer.vocab_size
model = SimpleLSTMClassifier(vocab_size, embed_dim=128, hidden_dim=128, num_classes=4)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device is", device)
train_and_evaluate(model, train_loader, test_loader, num_epochs=5, device=device)

In [None]:
class SimpleGRUClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        x = self.embedding(x)
        output, hidden = self.gru(x)
        out = self.fc(hidden.squeeze(0))
        return out


vocab_size = tokenizer.vocab_size
model = SimpleGRUClassifier(vocab_size, embed_dim=128, hidden_dim=128, num_classes=4)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device is", device)
train_and_evaluate(model, train_loader, test_loader, num_epochs=5, device=device)

# Task 2: Theoretical MCQs about LSTM / RNN / GRU

In each question, there may be multiple correct options.

### Q1. Consider an LSTM. If the forget gate value is set close to **0** for many time steps, what will likely happen to the cell state?
- A. It will grow unboundedly  
- B. It will be preserved across time steps  
- C. It will gradually reset/lose long-term memory  
- D. It will force the output gate to remain closed

**Answer:** C
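The effect can be seen in a scalar simplification of the LSTM cell update $c_t = f_t \cdot c_{t-1} + i_t \cdot g_t$. The sketch below assumes the input-gate term $i_t \cdot g_t$ is zero so that only the forget gate acts; `cell_state_trace` is a hypothetical helper.

```python
def cell_state_trace(c0, forget_gate, steps):
    """Track c_t = f * c_{t-1}, with the input-gate contribution set to zero."""
    trace = [c0]
    for _ in range(steps):
        trace.append(forget_gate * trace[-1])
    return trace

print(cell_state_trace(1.0, forget_gate=0.1, steps=3))   # shrinks 10x per step
print(cell_state_trace(1.0, forget_gate=0.99, steps=3))  # stays close to 1.0
```

With a forget gate near 0, the stored value is almost gone after a few steps; with a forget gate near 1, it is largely preserved.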

### Q2. Which of the following statements correctly describes the difference between GRU and LSTM?
- A. GRU merges the input and forget gates into a single update gate, while LSTM keeps them separate  
- B. GRU maintains both a hidden state and a distinct cell state, just like LSTM  
- C. LSTM uses a reset gate to directly control candidate hidden states, whereas GRU does not  
- D. GRU completely eliminates the use of nonlinear activations in its gating mechanisms  

**Answer:** A  

### Q3. In Backpropagation Through Time (BPTT) for RNNs, which specific factor is primarily responsible for the vanishing gradient problem?
- A. The gradient repeatedly involves products of Jacobians ($\frac{\partial h_t}{\partial h_{t-1}}$) of the recurrent transition function across many time steps  
- B. The gradient becomes dominated by the output layer’s softmax cross-entropy derivative when sequence length grows  
- C. The gradient decays mainly due to the bias terms accumulating through recurrent connections  
- D. The gradient is suppressed because the input-to-hidden transformation has fewer parameters compared to the hidden-to-hidden recurrence  

**Answer:** A
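A scalar example makes the product of Jacobians tangible. The sketch below assumes the toy recurrence $h_t = \tanh(w \, h_{t-1})$ with no input, so each Jacobian factor is $\frac{\partial h_t}{\partial h_{t-1}} = w \, (1 - h_t^2)$; `gradient_through_time` is a hypothetical helper.

```python
import math

def gradient_through_time(w, h0, steps):
    """Return d h_T / d h_0 for the scalar recurrence h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)  # one Jacobian factor: d h_t / d h_{t-1}
    return grad

for steps in (1, 5, 20):
    print(steps, gradient_through_time(w=0.5, h0=0.5, steps=steps))
```

With $w = 0.5$, every factor is at most 0.5, so the gradient after 20 steps is bounded by $0.5^{20}$, which is below $10^{-6}$.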

In practical RNN training, Truncated Backpropagation Through Time (Truncated BPTT) is often used. Compared with standard Backpropagation Through Time (BPTT), Truncated BPTT unrolls the RNN for only a fixed number of time steps (e.g., 20 or 50). During training, the sequence is processed in chunks, and gradients are backpropagated only within these truncated windows.
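The chunking part of this procedure can be sketched in a few lines. This illustrates only how a long sequence is split into windows; the gradient handling is described in a comment and not implemented, and `bptt_windows` is a hypothetical helper.

```python
def bptt_windows(sequence, window):
    """Split a sequence into consecutive chunks of at most `window` steps."""
    return [sequence[i:i + window] for i in range(0, len(sequence), window)]

tokens = list(range(10))  # stand-in for a 10-step sequence
for chunk in bptt_windows(tokens, window=4):
    # In real training, the hidden state would be carried into the next chunk
    # but detached from the graph, so gradients stop at the chunk boundary.
    print(chunk)
```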

### Q4. In practice, truncated BPTT is often used when training RNNs. The main reason is:
- A. To avoid overfitting by randomly dropping parts of the sequence  
- B. To limit computational cost and mitigate gradient explosion/vanishing over very long sequences  
- C. To enforce fixed-length sequences for input  
- D. To increase parallelism in recurrent connections  

**Answer:** B  

### Q5. You are comparing RNN, LSTM, and GRU on a sequence modeling task with very long dependencies. Which of the following best explains why LSTMs often outperform vanilla RNNs?

- A. LSTMs completely eliminate the need for backpropagation through time, avoiding vanishing gradients  
- B. LSTMs introduce a separate memory cell with gates that regulate information flow, mitigating vanishing gradients  
- C. LSTMs use fewer parameters than RNNs, which prevents overfitting on long sequences  
- D. LSTMs replace nonlinear activations with linear updates, ensuring perfect gradient preservation  

**Answer:** B
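Option C can be checked by counting parameters. A rough sketch, assuming PyTorch's layout for a single-layer, unidirectional recurrent module (one input-to-hidden matrix, one hidden-to-hidden matrix, and two bias vectors per gate block); `GATE_BLOCKS` and `recurrent_param_count` are illustrative names:

```python
# Number of gate blocks per cell type: RNN has 1, GRU has 3, LSTM has 4.
GATE_BLOCKS = {"rnn": 1, "gru": 3, "lstm": 4}

def recurrent_param_count(kind, embed_dim, hidden_dim):
    """Parameters in one single-layer, unidirectional recurrent layer."""
    per_block = hidden_dim * embed_dim + hidden_dim * hidden_dim + 2 * hidden_dim
    return GATE_BLOCKS[kind] * per_block

for kind in ("rnn", "gru", "lstm"):
    print(kind, recurrent_param_count(kind, embed_dim=128, hidden_dim=128))
```

For the sizes used in this exercise, the LSTM layer has four times as many recurrent parameters as the vanilla RNN, so its advantage on long dependencies comes from the gated memory cell and not from a smaller parameter count.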