
# Deep Learning Text Classification with Pretrained Embeddings

This notebook extends the feedback classification example with a deep learning workflow that uses modern word embeddings and a multi-task neural model to predict:

- Whether a comment refers to a **teacher** or **course**
- The **sentiment** of the comment
- The **aspect** category (e.g., teaching skills, behaviour, relevancy)

It loads the text-based dataset (`data/data_feedback.csv`), optionally recreates an Excel file for compatibility, and trains a BiLSTM classifier using pretrained word vectors when available.



## Environment and Dependencies

This notebook relies on PyTorch and TorchText for sequence modeling. To use pretrained GloVe embeddings, TorchText downloads them the first time they are requested and caches them locally. If the download fails (for example, because the environment has no internet access), the notebook falls back to randomly initialized embeddings and prints a message saying so.

```
pip install torch torchtext scikit-learn pandas matplotlib seaborn
# Optional for gradient-based explanations
pip install captum
```


In [None]:

import os
import random
import math
import string
from collections import Counter

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

try:
    from captum.attr import IntegratedGradients
    CAPTUM_AVAILABLE = True
except Exception:
    CAPTUM_AVAILABLE = False
    print("Captum is not installed; gradient-based token attributions will be skipped.")

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import Vocab, GloVe


In [None]:

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
DEVICE



## Load the Dataset

The dataset is stored as a text-based CSV (`data/data_feedback.csv`) to avoid binary artifacts. If you need an Excel version, uncomment the export cell to generate `data/data_feedback.xlsx` dynamically.


In [None]:

DATA_PATH = os.path.join('..', 'data', 'data_feedback.csv')
df = pd.read_csv(DATA_PATH)
df.head()


In [None]:

# Uncomment to create an Excel file if you prefer working with XLSX in other tools.
# Writing .xlsx with pandas needs an Excel engine such as openpyxl (pip install openpyxl).
# excel_path = os.path.join('..', 'data', 'data_feedback.xlsx')
# df.to_excel(excel_path, index=False)
# print(f"Wrote Excel export to {excel_path}")



## Label Encoding

We will map each target column to integer IDs for modeling. Dictionaries are kept to decode predictions back to readable strings.
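
As a toy illustration of this mapping, independent of the notebook's data, the sketch below encodes a handful of made-up sentiment labels and decodes them again. The label values and variable names are assumptions chosen for the example only.

```python
# Hypothetical labels; the real values come from the CSV columns.
labels = ["positive", "negative", "neutral", "positive"]

# Sorting makes the ID assignment deterministic across runs.
uniques = sorted(set(labels))  # ['negative', 'neutral', 'positive']
label_to_id = {label: idx for idx, label in enumerate(uniques)}
id_to_label = {idx: label for label, idx in label_to_id.items()}

encoded = [label_to_id[label] for label in labels]
decoded = [id_to_label[idx] for idx in encoded]

print(encoded)  # [2, 0, 1, 2]
print(decoded == labels)  # True
```

Sorting the unique values before enumerating them keeps the IDs stable even if the row order of the dataset changes.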


In [None]:

label_columns = ['teacher/course', 'sentiment', 'aspect']
label_maps = {}
inv_label_maps = {}
for col in label_columns:
    uniques = sorted(df[col].unique())
    label_maps[col] = {label: idx for idx, label in enumerate(uniques)}
    inv_label_maps[col] = {v: k for k, v in label_maps[col].items()}

encoded_df = df.copy()
for col in label_columns:
    encoded_df[col] = encoded_df[col].map(label_maps[col])

encoded_df.head()



## Train/Validation Split

We create a stratified split on the sentiment column (the most balanced target) to keep distributions reasonable across splits.


In [None]:

train_df, val_df = train_test_split(
    encoded_df,
    test_size=0.2,
    random_state=SEED,
    stratify=encoded_df['sentiment']
)

len(train_df), len(val_df)



## Tokenization and Vocabulary

We tokenize with TorchText's `basic_english` tokenizer, build a vocabulary from the training split only (so validation tokens that never occur in training map to `<unk>`), and attempt to load pretrained GloVe vectors. If the vectors cannot be downloaded, the notebook falls back to a randomly initialized embedding matrix and prints a message.
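
To make the lookup behaviour concrete, here is a plain-Python sketch of a vocabulary with an `<unk>` fallback. It is not TorchText's implementation: `build_toy_vocab` and `lookup` are hypothetical helpers, and the index order TorchText assigns may differ.

```python
from collections import Counter


def build_toy_vocab(tokenized_texts, specials=("<unk>", "<pad>"), min_freq=1):
    """Return (stoi, itos) with special tokens first, then tokens by frequency."""
    counts = Counter(tok for toks in tokenized_texts for tok in toks)
    itos = list(specials)
    # Sort by descending frequency, then alphabetically, for a deterministic order.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in specials:
            itos.append(tok)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos


def lookup(stoi, token):
    """Map a token to its ID, falling back to <unk> for unseen tokens."""
    return stoi.get(token, stoi["<unk>"])


# Made-up training tokens.
train_tokens_toy = [["good", "teacher"], ["good", "course", "content"]]
stoi, itos = build_toy_vocab(train_tokens_toy)

print(itos)  # ['<unk>', '<pad>', 'good', 'content', 'course', 'teacher']
print([lookup(stoi, tok) for tok in ["good", "lecturer"]])  # [2, 0]
```

The unseen word `lecturer` maps to index 0 (`<unk>`), which is what setting a default index achieves in the notebook's vocabulary.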


In [None]:

tokenizer = get_tokenizer('basic_english')

train_tokens = [tokenizer(text) for text in train_df['comments']]

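# Build the vocabulary from the training tokens; min_freq=1 keeps every token (small dataset).
# The legacy Vocab(counter, specials=...) constructor has no set_default_index()/get_itos(),
# so use the build_vocab_from_iterator factory from the newer TorchText vocabulary API.
from torchtext.vocab import build_vocab_from_iterator

vocab = build_vocab_from_iterator(train_tokens, min_freq=1, specials=['<unk>', '<pad>'])
vocab.set_default_index(vocab['<unk>'])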

# Try to load pretrained vectors
embedding_dim = 100
vectors = None
try:
    print("Attempting to load GloVe embeddings (6B, 100d)...")
    glove = GloVe(name='6B', dim=embedding_dim)
    vectors = glove.get_vecs_by_tokens(vocab.get_itos())
    print("GloVe embeddings loaded.")
except Exception as exc:
    print(f"Could not load GloVe embeddings: {exc}\nUsing randomly initialized embeddings instead.")



## Sequence Encoding and DataLoaders

We convert tokens to integer IDs, pad batches dynamically, and create PyTorch `DataLoader` objects for training and validation.
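
The following sketch shows what dynamic padding produces for one small batch. The token IDs are made up and `PAD_IDX = 1` is an assumption for the example; in the notebook the padding index comes from the vocabulary.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 1  # assumed padding index for this toy example

# Three already-encoded comments of different lengths (made-up token IDs).
sequences = [
    torch.tensor([5, 9, 2]),
    torch.tensor([7]),
    torch.tensor([4, 3]),
]

# True lengths are recorded before padding so the encoder can skip padded steps.
lengths = torch.tensor([len(seq) for seq in sequences])

# Pad every sequence to the longest one in this batch only.
padded = pad_sequence(sequences, batch_first=True, padding_value=PAD_IDX)

# True wherever a real token is present.
mask = padded != PAD_IDX

print(padded.tolist())           # [[5, 9, 2], [7, 1, 1], [4, 3, 1]]
print(lengths.tolist())          # [3, 1, 2]
print(mask.sum(dim=1).tolist())  # [3, 1, 2]
```

Because padding is applied per batch, a batch of short comments does not pay for the longest comment in the whole dataset.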


In [None]:

def encode_text(text: str):
    return [vocab[token] for token in tokenizer(text)]


class FeedbackDataset(torch.utils.data.Dataset):
    def __init__(self, frame):
        self.comments = frame['comments'].tolist()
        self.labels = frame[label_columns].values.astype('int64')

    def __len__(self):
        return len(self.comments)

    def __getitem__(self, idx):
        tokens = encode_text(self.comments[idx])
        targets = torch.tensor(self.labels[idx])
        return torch.tensor(tokens, dtype=torch.long), targets


def collate_batch(batch):
    token_lists, targets = zip(*batch)
    lengths = torch.tensor([len(tokens) for tokens in token_lists], dtype=torch.long)
    padded_tokens = nn.utils.rnn.pad_sequence(token_lists, batch_first=True, padding_value=vocab['<pad>'])
    targets = torch.stack(targets)
    return padded_tokens, lengths, targets


BATCH_SIZE = 8
train_dataset = FeedbackDataset(train_df)
val_dataset = FeedbackDataset(val_df)

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_batch)



## Model Definition

We use a simple BiLSTM encoder with a shared embedding layer and three classification heads (one per task). The loss is the sum of the cross-entropy losses across heads, encouraging the model to learn a shared representation for all targets.
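
In symbols (the notation is introduced here only for illustration and does not appear in the code): let $h_t$ be the BiLSTM output at position $t$, $m_t \in \{0, 1\}$ the padding mask, $T$ the padded sequence length, and $y_k$ the label for task $k$. The forward pass in the cell below mean-pools the unmasked positions, applies dropout, and feeds the result to each linear head with weights $W_k$ and bias $b_k$:

$$
\bar{h} = \frac{\sum_{t=1}^{T} m_t\, h_t}{\max\left(1, \sum_{t=1}^{T} m_t\right)},
\qquad
\mathcal{L} = \sum_{k=1}^{K} \mathrm{CE}\left(W_k\, \mathrm{dropout}(\bar{h}) + b_k,\; y_k\right),
\quad K = 3.
$$

Every task contributes to $\mathcal{L}$ with equal weight, so a head with more classes (and typically a larger cross-entropy) can dominate the gradient of the shared encoder.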


In [None]:

class MultiTaskBiLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers, label_sizes, pad_idx, pretrained_vectors=None, freeze_embeddings=False):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        if pretrained_vectors is not None:
            with torch.no_grad():
                self.embedding.weight.data.copy_(pretrained_vectors)
            if freeze_embeddings:
                self.embedding.weight.requires_grad = False

        self.encoder = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
        )

        self.dropout = nn.Dropout(0.3)
        self.heads = nn.ModuleList([
            nn.Linear(hidden_dim * 2, out_dim) for out_dim in label_sizes
        ])

    def forward(self, tokens, lengths):
        embedded = self.embedding(tokens)
        packed = nn.utils.rnn.pack_padded_sequence(embedded, lengths.cpu(), batch_first=True, enforce_sorted=False)
        packed_out, _ = self.encoder(packed)
        enc_out, _ = nn.utils.rnn.pad_packed_sequence(packed_out, batch_first=True)
        # Use the padding index stored on the embedding layer rather than the global vocab.
        mask = (tokens != self.embedding.padding_idx).unsqueeze(-1)
        enc_out = enc_out * mask
        summed = enc_out.sum(dim=1)
        denom = mask.sum(dim=1).clamp(min=1)
        pooled = summed / denom
        pooled = self.dropout(pooled)
        return [head(pooled) for head in self.heads]


In [None]:

label_sizes = [len(label_maps[col]) for col in label_columns]

pretrained_vectors = vectors  # None when the GloVe vectors could not be loaded
model = MultiTaskBiLSTM(
    vocab_size=len(vocab),
    embed_dim=embedding_dim,
    hidden_dim=128,
    num_layers=1,
    label_sizes=label_sizes,
    pad_idx=vocab['<pad>'],
    pretrained_vectors=pretrained_vectors,
    freeze_embeddings=False,
).to(DEVICE)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)



## Training Loop

The training step sums the cross-entropy losses for each head. Because the dataset is small, we keep the number of epochs modest; adjust as needed for experimentation.


In [None]:

EPOCHS = 20


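def run_epoch(loader, train=True):
    """Run one pass over `loader`; update weights only when train=True."""
    epoch_loss = 0.0
    model.train(train)  # train=False switches to eval mode (disables dropout)

    # Track gradients only while training; validation needs no autograd graph.
    with torch.set_grad_enabled(train):
        for tokens, lengths, targets in loader:
            tokens = tokens.to(DEVICE)
            lengths = lengths.to(DEVICE)
            targets = targets.to(DEVICE)

            if train:
                optimizer.zero_grad()

            logits_list = model(tokens, lengths)
            # One cross-entropy term per head, summed into a single joint loss.
            losses = [criterion(logits, targets[:, idx]) for idx, logits in enumerate(logits_list)]
            loss = sum(losses)

            if train:
                loss.backward()
                optimizer.step()

            # Weight the batch-mean loss by batch size so the epoch average is per sample.
            epoch_loss += loss.item() * tokens.size(0)

    return epoch_loss / len(loader.dataset)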


train_history = []
val_history = []
for epoch in range(1, EPOCHS + 1):
    train_loss = run_epoch(train_loader, train=True)
    val_loss = run_epoch(val_loader, train=False)
    train_history.append(train_loss)
    val_history.append(val_loss)
    if epoch % 5 == 0 or epoch == 1:
        print(f"Epoch {epoch:02d}: train_loss={train_loss:.4f} val_loss={val_loss:.4f}")



### Loss Curves

Plot the training and validation loss to see if the model is converging.
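
One simple way to read the curve numerically is to find the epoch with the lowest validation loss; a validation loss that keeps rising after that point suggests overfitting. The sketch below uses a hypothetical helper, `best_epoch`, and made-up loss values. In the notebook you would pass `val_history` once training has run.

```python
def best_epoch(val_losses):
    """Return (1-based epoch, loss) for the lowest validation loss."""
    if not val_losses:
        raise ValueError("val_losses must contain at least one value")
    idx = min(range(len(val_losses)), key=val_losses.__getitem__)
    return idx + 1, val_losses[idx]


# Made-up values: the loss improves, then starts rising again.
toy_val_history = [1.90, 1.42, 1.31, 1.35, 1.48]
epoch, loss = best_epoch(toy_val_history)

print(epoch, loss)  # 3 1.31
```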


In [None]:

plt.figure(figsize=(8, 4))
plt.plot(train_history, label='train')
plt.plot(val_history, label='val')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training vs Validation Loss')
plt.legend()
plt.show()



## Evaluation

We compute accuracy per task on the validation split and show confusion matrices to highlight common confusions.
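
To show how the matrices should be read, here is a plain-Python sketch that builds one by hand with rows as true classes and columns as predicted classes, the same layout scikit-learn's `confusion_matrix` returns. The helper names and the label lists are made up for the example.

```python
def confusion_counts(y_true, y_pred, num_classes):
    """Count (true, predicted) pairs: rows are true classes, columns predictions."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_id, pred_id in zip(y_true, y_pred):
        matrix[true_id][pred_id] += 1
    return matrix


def accuracy_from_matrix(matrix):
    """Accuracy is the diagonal (correct predictions) over all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


# Made-up labels for a three-class task.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm_toy = confusion_counts(y_true, y_pred, num_classes=3)
print(cm_toy)  # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(round(accuracy_from_matrix(cm_toy), 3))  # 0.667
```

Off-diagonal entries show which classes are confused: here one class-0 example was predicted as class 1, and one class-2 example as class 0.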


In [None]:

def predict(loader):
    model.eval()
    all_targets = [[] for _ in label_columns]
    all_preds = [[] for _ in label_columns]
    with torch.no_grad():
        for tokens, lengths, targets in loader:
            tokens = tokens.to(DEVICE)
            lengths = lengths.to(DEVICE)
            logits_list = model(tokens, lengths)
            for idx, logits in enumerate(logits_list):
                preds = logits.argmax(dim=1).cpu().tolist()
                all_preds[idx].extend(preds)
                all_targets[idx].extend(targets[:, idx].tolist())
    return all_targets, all_preds


targets, preds = predict(val_loader)

for i, col in enumerate(label_columns):
    acc = accuracy_score(targets[i], preds[i])
    print(f"{col} accuracy: {acc:.3f}")


In [None]:

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for i, col in enumerate(label_columns):
    cm = confusion_matrix(targets[i], preds[i], labels=list(range(len(label_maps[col]))))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[i])
    axes[i].set_title(f"Confusion Matrix: {col}")
    axes[i].set_xlabel('Predicted')
    axes[i].set_ylabel('True')
    class_names = list(label_maps[col].keys())  # insertion order matches the integer IDs
    axes[i].set_xticklabels(class_names, rotation=45, ha='right')
    axes[i].set_yticklabels(class_names, rotation=0)
plt.tight_layout()
plt.show()



### Inspect Misclassifications

To better understand errors, list examples where predictions differ from the ground truth for each head.


In [None]:

val_df_with_preds = val_df.copy().reset_index(drop=True)
for i, col in enumerate(label_columns):
    val_df_with_preds[f"pred_{col}"] = preds[i]
    val_df_with_preds[f"pred_{col}_label"] = val_df_with_preds[f"pred_{col}"].map(inv_label_maps[col])
    val_df_with_preds[f"true_{col}_label"] = val_df_with_preds[col].map(inv_label_maps[col])

for col in label_columns:
    mismatches = val_df_with_preds[val_df_with_preds[col] != val_df_with_preds[f"pred_{col}"]]
    display_cols = ['comments', f"true_{col}_label", f"pred_{col}_label"]
    print(f"\nMisclassifications for {col} (showing up to 5):")
    display(mismatches[display_cols].head())



## Explainability (Optional)

If Captum is installed, we can compute Integrated Gradients over tokens for a sample prediction to highlight which words influenced the teacher/course decision. This provides qualitative insight into what the model is focusing on.
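
For intuition about what Integrated Gradients computes, the sketch below applies it by hand to a small analytic function instead of the neural model. The function `f`, its gradient, and the helper `integrated_gradients` are assumptions made for the example; Captum's implementation differs in its details.

```python
def f(x):
    """Toy model: f(x) = x0^2 + 3 * x1."""
    return x[0] ** 2 + 3.0 * x[1]


def grad_f(x):
    """Analytic gradient of f."""
    return [2.0 * x[0], 3.0]


def integrated_gradients(x, baseline, grad_fn, steps=50):
    """Approximate IG with a midpoint Riemann sum along the straight path."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grads = grad_fn(point)
        for i, g in enumerate(grads):
            avg_grad[i] += g / steps
    # Scale the average gradient by the distance from the baseline.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


x = [2.0, 1.0]
baseline = [0.0, 0.0]
attributions_toy = integrated_gradients(x, baseline, grad_f)

print([round(a, 6) for a in attributions_toy])  # [4.0, 3.0]
# Completeness: attributions sum to f(x) - f(baseline).
print(round(sum(attributions_toy), 6), f(x) - f(baseline))  # 7.0 7.0
```

The attributions add up to the difference between the model output at the input and at the baseline, which is why the choice of baseline matters when interpreting token scores.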


In [None]:

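if CAPTUM_AVAILABLE:
    # Token IDs are discrete, so gradients cannot be taken with respect to them.
    # Attribute to the embedding layer's output instead, via LayerIntegratedGradients.
    from captum.attr import LayerIntegratedGradients

    sample_text = val_df.iloc[0]['comments']
    token_ids = torch.tensor([encode_text(sample_text)], dtype=torch.long, device=DEVICE)
    lengths = torch.tensor([token_ids.size(1)], dtype=torch.long)
    # Baseline: a same-length sequence of <pad> tokens. torch.zeros_like would
    # select index 0, which is <unk> here rather than padding.
    baseline_ids = torch.full_like(token_ids, vocab['<pad>'])

    def forward_teacher(input_tokens, input_lengths):
        logits_list = model(input_tokens, input_lengths)
        return logits_list[0]  # logits of the teacher/course head

    lig = LayerIntegratedGradients(forward_teacher, model.embedding)

    model.eval()
    # cuDNN cannot backpropagate through an LSTM in eval mode, so disable it for this call.
    with torch.backends.cudnn.flags(enabled=False):
        attributions, _ = lig.attribute(
            inputs=token_ids,
            baselines=baseline_ids,
            additional_forward_args=(lengths,),
            target=int(val_df.iloc[0]['teacher/course']),  # Captum expects a plain int
            return_convergence_delta=True,
        )

    token_list = tokenizer(sample_text)
    # Sum over the embedding dimension to get one score per token, then normalize.
    token_attr = attributions.sum(dim=-1).squeeze(0).detach().cpu().numpy()
    token_attr = token_attr / (np.abs(token_attr).max() + 1e-8)

    print(f"Text: {sample_text}")
    for token, score in zip(token_list, token_attr):
        print(f"{token:15s} -> {score:+.3f}")
else:
    print("Captum not available; install captum to run Integrated Gradients.")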
