# Explanation

## R-LSTM Text Classification (Final, No Optuna)

This notebook trains a Residual LSTM (R-LSTM) text classifier on df_file.csv, a five-class dataset. It lowercases the text, builds a frequency-based vocabulary, encodes each text as a fixed-length sequence of token IDs, trains with dropout, weight decay, label smoothing, and early stopping, evaluates on a held-out test set, and saves plots and a PyTorch checkpoint.

### Data and labels

- File: df_file.csv
- Columns: Text (string), Label (int)
- Example mapping (update if needed): 0: Politics, 1: Sport, 2: Technology, 3: Entertainment, 4: Business

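### Pipeline

1. Lowercase the text and drop empty rows.
2. Build a vocabulary of size 5000: the 4,998 most frequent whitespace-separated tokens plus the reserved entries <PAD>=0 and <UNK>=1.
3. Convert each document to a sequence of 50 token IDs (longer texts are truncated, shorter ones are right-padded with <PAD>).
4. Stratified split: 80% train, 10% validation, 10% test (1,780 / 222 / 223 of the 2,225 rows in the logged run).
5. Train the R-LSTM with Adam, label smoothing, ReduceLROnPlateau, gradient clipping, and early stopping on validation loss.
6. Evaluate on the test set, print a classification report, and save the artifacts.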
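
The vocabulary and encoding steps above can be sketched on toy data. This is a minimal illustration, not the notebook's exact code: the helper names `build_vocab` and `encode` and the toy sentences are assumptions made for the example, and the vocabulary is shrunk to 4 entries so the effect of <UNK> and <PAD> is visible.

```python
from collections import Counter

PAD, UNK = "<PAD>", "<UNK>"

def build_vocab(texts, vocab_size):
    """Map the most frequent whitespace tokens to IDs; IDs 0 and 1 are reserved."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    vocab = {PAD: 0, UNK: 1}
    for token, _ in counts.most_common(vocab_size - 2):
        vocab[token] = len(vocab)
    return vocab

def encode(text, vocab, seq_len):
    """Truncate to seq_len tokens, map unseen tokens to <UNK>, right-pad with <PAD>."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in text.lower().split()[:seq_len]]
    return ids + [vocab[PAD]] * (seq_len - len(ids))

toy_texts = ["the match ended in a draw", "the market fell", "the match was delayed"]
toy_vocab = build_vocab(toy_texts, vocab_size=4)  # room for only the 2 most frequent tokens
print(toy_vocab)                                  # {'<PAD>': 0, '<UNK>': 1, 'the': 2, 'match': 3}
print(encode("the match ended", toy_vocab, seq_len=5))  # [2, 3, 1, 0, 0]
```

In the toy output, "ended" falls outside the vocabulary and becomes ID 1, and the two trailing zeros are padding.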

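### Model: Residual LSTM

- Stacked single-layer LSTMs. Each layer adds a bias-free linear projection of its input to the LSTM output (the residual connection); the projection maps the input width to the output width.
- Optional bidirectionality, applied to every layer when enabled.
- The output at the last time step is used as the sequence representation, followed by dropout and a linear classifier.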
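
To make the shapes concrete, here is a minimal sketch of a single residual layer, assuming PyTorch is installed. The class name `ResidualLSTMLayer` is hypothetical and the batch size of 4 is arbitrary; the widths (128-dimensional input, hidden size 128, bidirectional, 50 time steps) follow the configuration described in this notebook.

```python
import torch
import torch.nn as nn

class ResidualLSTMLayer(nn.Module):
    """One layer of the stack: LSTM output plus a linear projection of the input."""
    def __init__(self, in_dim, hidden_size, bidirectional=True):
        super().__init__()
        out_dim = hidden_size * (2 if bidirectional else 1)
        self.lstm = nn.LSTM(in_dim, hidden_size, batch_first=True,
                            bidirectional=bidirectional)
        # The projection maps the input width to the LSTM output width
        # so that the two tensors can be summed.
        self.proj = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x):            # x: (batch, time, in_dim)
        y, _ = self.lstm(x)          # y: (batch, time, out_dim)
        return y + self.proj(x)      # residual sum, same shape as y

layer = ResidualLSTMLayer(in_dim=128, hidden_size=128, bidirectional=True)
x = torch.randn(4, 50, 128)          # 4 sequences of 50 embedded tokens
out = layer(x)
print(out.shape)                     # torch.Size([4, 50, 256])
print(out[:, -1, :].shape)           # torch.Size([4, 256])
```

The second shape is the last-time-step slice that would be passed to the classifier head.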

### Training configuration (from the best Optuna trial)

- embedding_dim=128, hidden_size=128, num_layers=2, bidirectional=True
- dropout≈0.336, batch_size=24, learning_rate≈3.11e-3, weight_decay≈9.93e-4
- label_smoothing≈0.14, max_epochs=150, patience=20
- Optimizer: Adam. Scheduler: ReduceLROnPlateau on validation loss (factor 0.5, patience 5). Gradient-norm clipping: 5.0.
- Reproducibility: NumPy and PyTorch seeds set to 42; cuDNN set to deterministic mode.
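
As a worked example of what label_smoothing≈0.14 does with five classes, the sketch below computes the smoothed target distribution in plain Python. It assumes the usual formulation (a mix of the one-hot target and a uniform distribution over the classes), which is how the PyTorch documentation describes the `label_smoothing` argument of `CrossEntropyLoss`; the helper names and the example logits are hypothetical.

```python
import math

def smoothed_targets(true_class, num_classes, eps):
    """Mix the one-hot target with a uniform distribution over all classes."""
    uniform = eps / num_classes
    return [(1.0 - eps) + uniform if c == true_class else uniform
            for c in range(num_classes)]

def cross_entropy(logits, targets):
    """Cross-entropy between a target distribution and softmax(logits)."""
    log_z = math.log(sum(math.exp(v) for v in logits))
    return -sum(t * (v - log_z) for v, t in zip(logits, targets))

targets = smoothed_targets(true_class=1, num_classes=5, eps=0.14)
print([round(t, 3) for t in targets])   # [0.028, 0.888, 0.028, 0.028, 0.028]

# A confident, correct prediction is penalised more under smoothing
# than under the hard one-hot target (eps=0.0).
logits = [0.0, 4.0, 0.0, 0.0, 0.0]
smooth_loss = cross_entropy(logits, targets)
hard_loss = cross_entropy(logits, smoothed_targets(1, 5, 0.0))
print(smooth_loss > hard_loss)          # True
```

Because the smoothed loss never reaches zero, the training and validation losses reported with this criterion have a nonzero floor even when accuracy is perfect.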

### Outputs saved

- rlstm_training_curves.svg (loss and accuracy vs. epoch)
- rlstm_confusion_matrix.svg (test confusion matrix)
- rlstm_per_class_accuracy.svg (per-class test accuracy)
- rlstm_lr_schedule.svg (learning rate per epoch)
- rlstm_pytorch_model.pt (model state dict, vocab, class names, hyperparameters)
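
The per-class accuracy plot and the confusion matrix are built from the same counts: the accuracy for class i is the diagonal entry of row i divided by the row total. A small sketch in plain Python, using made-up labels and hypothetical helper names (the notebook itself uses scikit-learn and NumPy for this):

```python
def toy_confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def per_class_accuracy(cm):
    """Fraction of each true class that was predicted correctly (0.0 for empty rows)."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]

y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0, 2]
cm = toy_confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)                       # [[2, 1, 0], [0, 2, 0], [1, 0, 3]]
print(per_class_accuracy(cm))   # class 0: 2/3, class 1: 1.0, class 2: 0.75
```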

### How to run

1. Place df_file.csv next to the notebook.
2. Install the dependencies: pip install torch pandas scikit-learn matplotlib seaborn
3. Run the cells in order. The script picks MPS, CUDA, or CPU automatically, checking in that order.

### Tips and troubleshooting

- On Windows, if DataLoader workers fail, set num_workers=0.
- If the GPU runs out of memory, reduce batch_size or hidden_size.
- Ensure labels are consecutive integers starting at 0 and match the keys of class_names.
- If seaborn is missing, install it or adapt the confusion matrix plot to pure matplotlib.
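
The label requirement in the tips above can be checked before training. The sketch below is one possible check, with a hypothetical helper name `check_labels`; it matters because the loss expects class indices in the range 0 to K-1 and the notebook looks up class_names[i] for every i in that range.

```python
def check_labels(labels, class_names):
    """Raise ValueError unless labels are exactly 0..K-1 and class_names has the same keys."""
    found = sorted(set(labels))
    expected = list(range(len(found)))
    if found != expected:
        raise ValueError(f"labels must be consecutive integers from 0, got {found}")
    if sorted(class_names) != expected:
        raise ValueError(
            f"class_names keys {sorted(class_names)} do not match labels {found}"
        )

class_names = {0: "Politics", 1: "Sport", 2: "Technology", 3: "Entertainment", 4: "Business"}
check_labels([0, 1, 2, 3, 4, 4, 1], class_names)   # passes silently

try:
    check_labels([1, 2, 3, 4, 5], class_names)     # labels start at 1
except ValueError as err:
    print(f"Rejected: {err}")
```

In the notebook this could be called as check_labels(df['Label'].tolist(), class_names) right after loading the CSV.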

# Code

# -*- coding: utf-8 -*-
"""
PyTorch Residual LSTM (R-LSTM) Text Classification — Final (No Optuna)
Dataset: df_file.csv with columns ['Text', 'Label'] and 5 classes

Pipeline:
- Preprocess (lowercase, drop empties)
- Build vocabulary (size=5000) -> indexify -> pad to seq_len=50
- Model: Residual LSTM stack with per-layer linear projections (for residuals)
- Train with best hyperparameters (from your top trial)
- Early stopping on validation loss
- Save:
    * rlstm_training_curves.svg
    * rlstm_confusion_matrix.svg
    * rlstm_per_class_accuracy.svg
    * rlstm_lr_schedule.svg
    * rlstm_pytorch_model.pt
"""

import os
import json
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib; matplotlib.use("Agg")  # headless back-end for servers/CI
import matplotlib.pyplot as plt
from collections import Counter

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns

# -----------------------------
# Global config & reproducibility
# -----------------------------
GLOBAL_SEED   = 42
SEQ_LEN       = 50
VOCAB_SIZE    = 5000
PRINT_LINE    = "=" * 70

np.random.seed(GLOBAL_SEED)
torch.manual_seed(GLOBAL_SEED)

# (Optional) make cuDNN deterministic; slightly slower but reproducible on GPU
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

def hr(msg: str):
    print("\n" + PRINT_LINE)
    print(msg)
    print(PRINT_LINE)

# -----------------------------
# Best hyperparameters (from your top Optuna trial)
# -----------------------------
BEST_HP = {
    "embedding_dim":   128,
    "hidden_size":     128,
    "num_layers":      2,
    "bidirectional":   True,
    "dropout":         0.3363570033,      # ~0.34
    "batch_size":      24,
    "learning_rate":   0.00311292936,     # ~3.11e-3
    "weight_decay":    0.00099272647,     # ~9.93e-4
    "label_smoothing": 0.1405728207,      # ~0.14
    "max_epochs":      150,
    "patience":        20,
}

# -----------------------------
# Device selection
# -----------------------------
def get_device():
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        print("Using Mac GPU (MPS)")
        return torch.device("mps")
    if torch.cuda.is_available():
        print(f"Using CUDA GPU: {torch.cuda.get_device_name(0)}")
        return torch.device("cuda")
    print("Using CPU")
    return torch.device("cpu")

device = get_device()
print(f"Device: {device}\n")

# -----------------------------
# Load data
# -----------------------------
hr("LOADING DATASET")
# Expecting df_file.csv with columns ['Text', 'Label']
df = pd.read_csv("df_file.csv")

print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"Number of classes: {df['Label'].nunique()}")
print("\nLabel distribution:")
print(df['Label'].value_counts().sort_index())

# Change this mapping if your labels differ
class_names = {0: 'Politics', 1: 'Sport', 2: 'Technology', 3: 'Entertainment', 4: 'Business'}
print("\nClass mapping:")
for label, name in class_names.items():
    print(f"  {label}: {name} ({len(df[df['Label'] == label])} samples)")
print(PRINT_LINE)

# -----------------------------
# Preprocess
# -----------------------------
hr("PREPROCESSING")
print("[Step] Lowercasing and dropping empty rows ...")
df['Text'] = df['Text'].astype(str).str.lower()
df = df[df['Text'].str.len() > 0].reset_index(drop=True)
print(f"[Done] Dataset shape after preprocessing: {df.shape}")

# -----------------------------
# Vocabulary
# -----------------------------
hr("VOCABULARY")
print("[Step] Counting token frequencies ...")
all_words = []
for text in df['Text'].values:
    all_words.extend(text.split())

word_counts = Counter(all_words)
print(f"[Info] Total unique tokens: {len(word_counts)}")

print(f"[Step] Building vocab size={VOCAB_SIZE} with <PAD>=0, <UNK>=1 ...")
vocab = {'<PAD>': 0, '<UNK>': 1}
for w, _ in word_counts.most_common(VOCAB_SIZE - 2):
    vocab[w] = len(vocab)
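# Note: this is type coverage (share of distinct tokens kept in the vocab),
# not the share of running tokens in the corpus that map to a non-<UNK> ID.
coverage = (len(vocab) / max(1, len(word_counts))) * 100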
print(f"[Done] Vocab size: {len(vocab)} | Coverage: {coverage:.2f}%")

# (Optional) inverse vocab for readable previews
inverse_vocab = {idx: word for word, idx in vocab.items()}

# -----------------------------
# Sequences
# -----------------------------
hr("SEQUENCE ENCODING")
print(f"[Info] Sequence length: {SEQ_LEN}")
print("[Step] Converting texts to index sequences ...")

X_sequences = []
for text in df['Text'].values:
    words = text.split()[:SEQ_LEN]  # truncate to SEQ_LEN
    seq = [vocab.get(w, vocab['<UNK>']) for w in words]
    if len(seq) < SEQ_LEN:          # right-pad with <PAD>
        seq += [vocab['<PAD>']] * (SEQ_LEN - len(seq))
    X_sequences.append(seq)

X_sequences = np.array(X_sequences, dtype=np.int64)
y_labels = df['Label'].astype(int).values

print(f"[Done] X_sequences shape: {X_sequences.shape}")
print(f"[Done] y_labels shape:  {y_labels.shape}")

# -----------------------------
# Split
# -----------------------------
hr("DATA SPLIT")
X_train, X_temp, y_train, y_temp = train_test_split(
    X_sequences, y_labels, test_size=0.2, random_state=GLOBAL_SEED, stratify=y_labels
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=GLOBAL_SEED, stratify=y_temp
)

print(f"[Info] Training set:   {X_train.shape[0]} samples ({X_train.shape[0]/len(X_sequences)*100:.1f}%)")
print(f"[Info] Validation set: {X_val.shape[0]} samples ({X_val.shape[0]/len(X_sequences)*100:.1f}%)")
print(f"[Info] Test set:       {X_test.shape[0]} samples ({X_test.shape[0]/len(X_sequences)*100:.1f}%)")

# -----------------------------
# Dataset class
# -----------------------------
class TextDataset(Dataset):
    """Simple tensorized dataset of (padded_int_sequence, label)."""
    def __init__(self, sequences, labels):
        self.sequences = torch.LongTensor(sequences)
        self.labels = torch.LongTensor(labels)
    def __len__(self):
        return len(self.sequences)
    def __getitem__(self, idx):
        return self.sequences[idx], self.labels[idx]

train_dataset = TextDataset(X_train, y_train)
val_dataset   = TextDataset(X_val,   y_val)
test_dataset  = TextDataset(X_test,  y_test)

# -----------------------------
# R-LSTM model (Residual LSTM)
# -----------------------------
class ResLSTM(nn.Module):
    """
    Residual LSTM stack:
      For each layer i:
        y_i = LSTM_i(x_i)
        y_i = y_i + Proj_i(x_i)        # residual projection to match dims
        x_{i+1} = Dropout(y_i)
    After the last layer, we take the last time-step features and classify.
    Supports bidirectionality at each layer.
    """
    def __init__(self, vocab_size, embedding_dim, hidden_size, output_size,
                 num_layers=2, dropout=0.5, bidirectional=False, pad_idx=0):
        super().__init__()
        self.hidden_size   = hidden_size
        self.num_layers    = num_layers
        self.bidirectional = bidirectional
        self.num_dirs      = 2 if bidirectional else 1
        self.out_dim       = hidden_size * self.num_dirs

        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)

        # Build per-layer LSTMs and residual projections
        self.lstm_layers = nn.ModuleList()
        self.proj_layers = nn.ModuleList()
        self.drop_layers = nn.ModuleList()

        in_dim = embedding_dim
        for _ in range(num_layers):
            lstm = nn.LSTM(
                input_size=in_dim,
                hidden_size=hidden_size,
                num_layers=1,
                batch_first=True,
                bidirectional=bidirectional
            )
            self.lstm_layers.append(lstm)
            self.proj_layers.append(nn.Linear(in_dim, self.out_dim, bias=False))
            self.drop_layers.append(nn.Dropout(dropout))  # dropout between layers
            in_dim = self.out_dim  # next layer input

        self.final_dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(self.out_dim, output_size)

    def forward(self, x):
        # x: (B, T)
        emb = self.embedding(x)  # (B, T, E)
        h = emb
        for li in range(self.num_layers):
            lstm = self.lstm_layers[li]
            proj = self.proj_layers[li]
            drop = self.drop_layers[li]

            y, _ = lstm(h)        # (B, T, H * dirs)
            res  = proj(h)        # (B, T, H * dirs)
            y    = y + res        # residual connection
            if li < self.num_layers - 1:
                y = drop(y)       # dropout only between layers
            h = y

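        # Last time-step representation. Caveats: texts shorter than SEQ_LEN are
        # right-padded, so index -1 is a <PAD> position for them, and with
        # bidirectional=True the backward direction of the last layer has
        # processed only that final position when it emits this output.
        last = h[:, -1, :]        # (B, H * dirs)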
        last = self.final_dropout(last)
        logits = self.fc(last)    # (B, C)
        return logits

# -----------------------------
# Train / Eval helpers
# -----------------------------
def current_lr(optimizer):
    return float(optimizer.param_groups[0]['lr'])

def train_epoch(model, dataloader, criterion, optimizer, device):
    model.train()
    total_loss, correct, total = 0.0, 0, 0
    for sequences, labels in dataloader:
        sequences, labels = sequences.to(device), labels.to(device)
        optimizer.zero_grad()
        logits = model(sequences)
        loss = criterion(logits, labels)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        optimizer.step()

        total_loss += loss.item() * sequences.size(0)
        _, pred = torch.max(logits, 1)
        total += labels.size(0)
        correct += (pred == labels).sum().item()
    return total_loss / total, correct / total

def evaluate(model, dataloader, criterion, device):
    model.eval()
    total_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():
        for sequences, labels in dataloader:
            sequences, labels = sequences.to(device), labels.to(device)
            logits = model(sequences)
            loss = criterion(logits, labels)
            total_loss += loss.item() * sequences.size(0)
            _, pred = torch.max(logits, 1)
            total += labels.size(0)
            correct += (pred == labels).sum().item()
    return total_loss / total, correct / total

def make_criterion(label_smoothing_value: float):
    """
    CrossEntropyLoss w/ label smoothing if supported by your PyTorch version.
    Falls back to standard CrossEntropyLoss for older versions.
    """
    try:
        return nn.CrossEntropyLoss(label_smoothing=float(label_smoothing_value))
    except TypeError:
        return nn.CrossEntropyLoss()

def train_model(model, train_loader, val_loader, criterion, optimizer,
                scheduler, num_epochs, device, patience=10, log_prefix="",
                record_lr=False):
    """
    Standard training loop with:
      - gradient clipping
      - ReduceLROnPlateau scheduler on val_loss
      - early stopping on val_loss
      - best-state restore
    Returns:
      history: dict of lists (train/val loss & acc)
      lr_history: list of LR values (after scheduler step) per epoch; empty unless record_lr=True
    """
    history = {'train_loss': [], 'train_acc': [], 'val_loss': [], 'val_acc': []}
    lr_history = []
    best_val_loss = float('inf')
    patience_counter = 0
    best_state = None

    print(f"{log_prefix}[Train] epochs={num_epochs}, batch_size={train_loader.batch_size}, "
          f"lr={current_lr(optimizer):.6f}, patience={patience}")

    for epoch in range(1, num_epochs + 1):
        tr_loss, tr_acc = train_epoch(model, train_loader, criterion, optimizer, device)
        val_loss, val_acc = evaluate(model, val_loader, criterion, device)

        # Scheduler step
        lr_before = current_lr(optimizer)
        scheduler.step(val_loss)
        lr_after = current_lr(optimizer)
        if record_lr:
            lr_history.append(lr_after)
        lr_note = ""
        if lr_after < lr_before:
            lr_note = f" (LR reduced from {lr_before:.6f} to {lr_after:.6f})"

        history['train_loss'].append(tr_loss)
        history['train_acc'].append(tr_acc)
        history['val_loss'].append(val_loss)
        history['val_acc'].append(val_acc)

        print(f"{log_prefix}Epoch {epoch:3d}/{num_epochs} | "
              f"Train Loss: {tr_loss:.4f} | Train Acc: {tr_acc:.4f} | "
              f"Val Loss: {val_loss:.4f} | Val Acc: {val_acc:.4f} | "
              f"LR: {current_lr(optimizer):.6f}{lr_note}")

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            patience_counter = 0
            best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
        else:
            patience_counter += 1

        if patience_counter >= patience:
            print(f"{log_prefix}[EarlyStopping] triggered at epoch {epoch}")
            break

    if best_state is not None:
        model.load_state_dict(best_state)
        print(f"{log_prefix}[Info] Loaded best model weights")

    return history, lr_history

# -----------------------------
# Final training with the best hyperparameters
# -----------------------------
hr("FINAL TRAINING WITH BEST HYPERPARAMETERS")

embedding_dim = int(BEST_HP["embedding_dim"])
hidden_size   = int(BEST_HP["hidden_size"])
num_layers    = int(BEST_HP["num_layers"])
bidirectional = bool(BEST_HP["bidirectional"])
dropout       = float(BEST_HP["dropout"])
batch_size    = int(BEST_HP["batch_size"])
learning_rate = float(BEST_HP["learning_rate"])
weight_decay  = float(BEST_HP["weight_decay"])
label_smooth  = float(BEST_HP["label_smoothing"])
num_epochs    = int(BEST_HP["max_epochs"])
patience      = int(BEST_HP["patience"])

print("[Final Config]")
print(f"  Vocab size:       {len(vocab)}")
print(f"  Embedding dim:    {embedding_dim}")
print(f"  Hidden size:      {hidden_size}")
print(f"  Num layers:       {num_layers}")
print(f"  Bidirectional:    {bidirectional}")
print(f"  Dropout:          {dropout}")
print(f"  Output size:      {len(class_names)}")
print(f"  Batch size:       {batch_size}")
print(f"  Learning rate:    {learning_rate}")
print(f"  Weight decay:     {weight_decay}")
print(f"  Label smoothing:  {label_smooth}")
print(f"  Max epochs:       {num_epochs}")
print(f"  Sequence length:  {SEQ_LEN}")

# NOTE:
# On Windows + Jupyter, num_workers>0 may require running as a script (__main__ guard).
# If you face DataLoader issues, set num_workers=0.
num_workers = 2
pin_memory  = (device.type == "cuda")

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True,
                          num_workers=num_workers, pin_memory=pin_memory)
val_loader   = DataLoader(val_dataset,   batch_size=batch_size, shuffle=False,
                          num_workers=num_workers, pin_memory=pin_memory)
test_loader  = DataLoader(test_dataset,  batch_size=batch_size, shuffle=False,
                          num_workers=num_workers, pin_memory=pin_memory)

model = ResLSTM(
    vocab_size=len(vocab),
    embedding_dim=embedding_dim,
    hidden_size=hidden_size,
    output_size=len(class_names),
    num_layers=num_layers,
    dropout=dropout,
    bidirectional=bidirectional,
    pad_idx=vocab['<PAD>']
).to(device)

print("\nModel architecture:")
print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

criterion = make_criterion(label_smooth)
optimizer = optim.Adam(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=5)

history, lr_hist = train_model(
    model, train_loader, val_loader, criterion, optimizer,
    scheduler, num_epochs, device, patience=patience, log_prefix="[Final] ",
    record_lr=True
)

# -----------------------------
# Evaluation
# -----------------------------
hr("MODEL EVALUATION")
train_loss, train_acc = evaluate(model, train_loader, criterion, device)
val_loss, val_acc     = evaluate(model, val_loader,   criterion, device)
test_loss, test_acc   = evaluate(model, test_loader,  criterion, device)

print("\nFinal Accuracy Scores:")
print(f"  Training Accuracy:   {train_acc:.4f} ({train_acc*100:.2f}%)")
print(f"  Validation Accuracy: {val_acc:.4f} ({val_acc*100:.2f}%)")
print(f"  Test Accuracy:       {test_acc:.4f} ({test_acc*100:.2f}%)")

# Collect test predictions for reports/plots
model.eval()
all_predictions, all_labels = [], []
with torch.no_grad():
    for sequences, labels in test_loader:
        sequences = sequences.to(device)
        logits = model(sequences)
        _, pred = torch.max(logits, 1)
        all_predictions.extend(pred.cpu().numpy())
        all_labels.extend(labels.numpy())

y_test_pred = np.array(all_predictions)
y_test_true = np.array(all_labels)
target_names = [class_names[i] for i in range(len(class_names))]

print("\n" + PRINT_LINE)
print("DETAILED CLASSIFICATION REPORT (Test Set)")
print(PRINT_LINE + "\n")
print(classification_report(y_test_true, y_test_pred, target_names=target_names))

# -----------------------------
# Visualizations (SVG)
# -----------------------------
hr("PLOTTING & SAVING FIGURES (SVG)")

# 1) Training/Validation loss & accuracy curves
print("[Plot] Training/Validation curves -> rlstm_training_curves.svg")
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(history['train_loss'], label='Training Loss', linewidth=2)
plt.plot(history['val_loss'],   label='Validation Loss', linewidth=2)
plt.xlabel('Epoch'); plt.ylabel('Loss'); plt.title('Training and Validation Loss')
plt.legend(); plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.plot(history['train_acc'], label='Training Accuracy', linewidth=2)
plt.plot(history['val_acc'],   label='Validation Accuracy', linewidth=2)
plt.xlabel('Epoch'); plt.ylabel('Accuracy'); plt.title('Training and Validation Accuracy')
plt.legend(); plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('rlstm_training_curves.svg', format='svg')
plt.close()

# 2) Confusion Matrix
print("[Plot] Confusion Matrix -> rlstm_confusion_matrix.svg")
cm = confusion_matrix(y_test_true, y_test_pred)
plt.figure(figsize=(7, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=target_names, yticklabels=target_names)
plt.xlabel('Predicted'); plt.ylabel('True'); plt.title('Confusion Matrix (Test Set)')
plt.xticks(rotation=45, ha='right'); plt.yticks(rotation=0)
plt.tight_layout()
plt.savefig('rlstm_confusion_matrix.svg', format='svg')
plt.close()

# 3) Per-class accuracy
print("[Plot] Per-class accuracy -> rlstm_per_class_accuracy.svg")
per_class_acc = []
for i in range(len(class_names)):
    mask = (y_test_true == i)
    acc = np.mean(y_test_pred[mask] == y_test_true[mask]) if np.sum(mask) > 0 else 0.0
    per_class_acc.append(acc)
plt.figure(figsize=(9, 5))
bars = plt.bar(range(len(class_names)), per_class_acc)
plt.xlabel('Class'); plt.ylabel('Accuracy'); plt.title('Per-Class Accuracy on Test Set')
plt.xticks(range(len(class_names)), target_names, rotation=45, ha='right')
plt.ylim([0, 1.1]); plt.grid(True, alpha=0.3, axis='y')
for bar, acc in zip(bars, per_class_acc):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
             f'{acc:.3f}', ha='center', va='bottom', fontsize=10, fontweight='bold')
plt.tight_layout()
plt.savefig('rlstm_per_class_accuracy.svg', format='svg')
plt.close()

# 4) Final LR schedule
print("[Plot] Learning Rate schedule -> rlstm_lr_schedule.svg")
if len(lr_hist) > 0:
    plt.figure(figsize=(8, 4))
    plt.plot(range(1, len(lr_hist)+1), lr_hist, marker='o', linewidth=2)
    plt.xlabel('Epoch'); plt.ylabel('Learning rate'); plt.title('Final Training LR Schedule')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig('rlstm_lr_schedule.svg', format='svg')
    plt.close()
else:
    print("[Info] No LR history recorded.")

print("[Output] Saved SVGs: rlstm_training_curves.svg, rlstm_confusion_matrix.svg, rlstm_per_class_accuracy.svg, rlstm_lr_schedule.svg")

# -----------------------------
# Sample predictions (quick sanity check)
# -----------------------------
hr("SAMPLE PREDICTIONS")
np.random.seed(GLOBAL_SEED)
k = min(5, len(X_test))
sample_indices = np.random.choice(len(X_test), size=k, replace=False)
model.eval()
with torch.no_grad():
    for idx in sample_indices:
        seq_tensor = torch.LongTensor(X_test[idx]).unsqueeze(0).to(device)
        logits = model(seq_tensor)
        probs = torch.softmax(logits, dim=1).cpu().numpy()[0]
        pred_label = int(np.argmax(probs))
        true_label = int(y_test[idx])

        tokens = [inverse_vocab.get(int(tok), "<UNK>") for tok in X_test[idx] if int(tok) != vocab['<PAD>']]
        text_preview = " ".join(tokens[:20]) if tokens else "N/A"

        print(f"Text preview: {text_preview}...")
        print(f"True Label: {class_names[true_label]}")
        print(f"Predicted:  {class_names[pred_label]}")
        print(f"Confidence: {probs[pred_label]*100:.2f}%")
        print("Result: " + ("Correct" if true_label == pred_label else "Incorrect"))
        print("-" * 70)

# -----------------------------
# Save model & metadata
# -----------------------------
print("\nSaving model and metadata ...")
torch.save({
    'model_state_dict': model.state_dict(),
    'vocab': vocab,
    'class_names': class_names,
    'hyperparameters': {
        'embedding_dim': embedding_dim,
        'hidden_size': hidden_size,
        'num_layers': num_layers,
        'bidirectional': bidirectional,
        'dropout': dropout,
        'batch_size': batch_size,
        'learning_rate': learning_rate,
        'weight_decay': weight_decay,
        'label_smoothing': label_smooth,
        'seq_len': SEQ_LEN,
        'vocab_size': len(vocab),
        'architecture': 'ResLSTM'
    }
}, 'rlstm_pytorch_model.pt')
print("Saved model as: rlstm_pytorch_model.pt")

print("\n" + PRINT_LINE)
print("TRAINING COMPLETED SUCCESSFULLY!")
print(PRINT_LINE)
print(f"\nFinal Test Accuracy: {test_acc*100:.2f}%")
print(f"Device used: {device}")


Using CUDA GPU: Tesla T4
Device: cuda


LOADING DATASET
Dataset shape: (2225, 2)
Columns: ['Text', 'Label']
Number of classes: 5

Label distribution:
Label
0    417
1    511
2    401
3    386
4    510
Name: count, dtype: int64

Class mapping:
  0: Politics (417 samples)
  1: Sport (511 samples)
  2: Technology (401 samples)
  3: Entertainment (386 samples)
  4: Business (510 samples)

PREPROCESSING
[Step] Lowercasing and dropping empty rows ...
[Done] Dataset shape after preprocessing: (2225, 2)

VOCABULARY
[Step] Counting token frequencies ...
[Info] Total unique tokens: 60616
[Step] Building vocab size=5000 with <PAD>=0, <UNK>=1 ...
[Done] Vocab size: 5000 | Coverage: 8.25%

SEQUENCE ENCODING
[Info] Sequence length: 50
[Step] Converting texts to index sequences ...
[Done] X_sequences shape: (2225, 50)
[Done] y_labels shape:  (2225,)

DATA SPLIT
[Info] Training set:   1780 samples (80.0%)
[Info] Validation set: 222 samples (10.0%)
[Info] Test set:       223 samples (10.0%)

FINAL TRAI