# Spam Email Classifier

End-to-end pipeline: download → preprocess → train → evaluate → compare.

| Model | Architecture |
|---|---|
| **Baseline** | TF-IDF (uni + bigrams) → Logistic Regression |
| **Advanced** | Learned Embedding → Bi-directional LSTM |

**Dataset:** [abdallahwagih/spam-emails](https://www.kaggle.com/datasets/abdallahwagih/spam-emails) (Kaggle)

**Libraries:** scikit-learn · PyTorch · pandas · matplotlib · seaborn


In [None]:
%matplotlib inline
import sys, os

# Ensure project root is on the path so config, preprocess, etc. are importable
sys.path.insert(0, os.getcwd())

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams.update({"figure.dpi": 100, "figure.figsize": (8, 5)})


## 1  Download & Explore the Dataset

The dataset is fetched from Kaggle via `kagglehub`. It contains SMS messages
labelled *ham* (legitimate) or *spam*. After de-duplication we get **5 572 messages**.

The data is split with **stratification** so every subset keeps the same
≈ 13 % spam / 87 % ham ratio.

> **First run:** requires a Kaggle account and an active `kagglehub` session.
> Subsequent runs use the cached copy.
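
The stratified split described above can be sketched without any dependencies. This is an illustration only, not the code in `download_data.py`: the helper name `stratified_split`, the seed, and the 70 / 15 / 15 fractions are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists that preserve the class ratio.

    Hypothetical helper; the fractions are assumed values for illustration.
    """
    rng = random.Random(seed)

    # Group example indices by class so each class is split separately
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]   # remainder goes to the test split
    return train, val, test

# Toy labels with a 13 % spam / 87 % ham ratio (1 = spam, 0 = ham)
labels = [1] * 130 + [0] * 870
train_idx, val_idx, test_idx = stratified_split(labels)
for name, idx in [("train", train_idx), ("val", val_idx), ("test", test_idx)]:
    spam_share = sum(labels[i] for i in idx) / len(idx)
    print(f"{name:<5} n={len(idx):>4}  spam={spam_share:.1%}")
```

Splitting each class separately is what keeps the spam share nearly identical across the three subsets. A plain random split could leave the small validation and test sets with noticeably more or less spam than the training set.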


In [None]:
from download_data import download_and_extract, load_and_clean, split_and_save

dataset_path = download_and_extract()
df           = load_and_clean(dataset_path)
split_and_save(df)


In [None]:
from config import DATA_DIR

train_df = pd.read_csv(os.path.join(DATA_DIR, "train.csv"))
val_df   = pd.read_csv(os.path.join(DATA_DIR, "validation.csv"))
test_df  = pd.read_csv(os.path.join(DATA_DIR, "test.csv"))

# Summary
print(f"{'Split':<14} {'Rows':>6}   {'Spam %':>7}")
print("-" * 32)
for name, split in [("Train", train_df), ("Validation", val_df), ("Test", test_df)]:
    print(f"{name:<14} {len(split):>6}   {split['label'].mean()*100:>6.1f} %")

# Sample rows rendered as an HTML table
sample = train_df.sample(4, random_state=42).copy()
sample["label"] = sample["label"].map({0: "ham", 1: "spam"})
sample[["label", "text"]]


In [None]:
labels  = ["Ham", "Spam"]
colors  = ["steelblue", "coral"]
splits  = [("Train", train_df), ("Validation", val_df), ("Test", test_df)]

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
for ax, (name, split) in zip(axes, splits):
    counts = split["label"].value_counts().sort_index()
    ax.bar(labels, counts.values, color=colors, edgecolor="white")
    ax.set_title(name, fontsize=12)
    ax.set_ylabel("Count")
    for i, v in enumerate(counts.values):
        ax.text(i, v + 15, str(v), ha="center", fontsize=10)
    ax.grid(axis="y", alpha=0.3)

plt.suptitle("Class Distribution Across Splits", fontsize=13, y=1.02)
plt.tight_layout()
plt.show()


## 2  Text Preprocessing

Every message goes through the same six-step pipeline **before** any model sees it.
The cleaned text is what both models operate on, so the comparison is fair.

| # | Step | Example effect |
|---|------|----------------|
| 1 | Lower-case | `FREE iPhone` → `free iphone` |
| 2 | Token substitution | URLs → `url` · phones → `phone` · `$500` → `money` |
| 3 | Strip punctuation | `click here!` → `click here` |
| 4 | Tokenise | Split on whitespace |
| 5 | Remove stop-words | Drop *the, is, a, …* and single-char tokens |
| 6 | Porter stem | `running` → `run` · `prizes` → `prize` |
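
Steps 1 to 5 of the table can be sketched as below. This is not the project's `clean_text`: the name `clean_text_sketch`, the regular expressions, and the tiny stop-word list are assumptions, and Porter stemming (step 6) is left out to keep the example dependency-free.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption, not the project's list)
STOP_WORDS = {"the", "is", "a", "an", "at", "in", "to", "you", "have",
              "are", "we", "me", "if", "or", "it", "by", "for"}

# Assumed patterns; the real pipeline may match URLs, phones and money differently
URL_RE = re.compile(r"https?://\S+|www\.\S+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")
MONEY_RE = re.compile(r"\$\s?\d+(?:[.,]\d+)*")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text_sketch(text):
    """Apply steps 1-5 of the table; stemming (step 6) is omitted."""
    text = text.lower()                     # 1. lower-case
    text = URL_RE.sub(" url ", text)        # 2. token substitution
    text = PHONE_RE.sub(" phone ", text)
    text = MONEY_RE.sub(" money ", text)
    text = text.translate(PUNCT_TABLE)      # 3. strip punctuation
    tokens = text.split()                   # 4. tokenise on whitespace
    tokens = [t for t in tokens             # 5. drop stop-words and single chars
              if t not in STOP_WORDS and len(t) > 1]
    return " ".join(tokens)

print(clean_text_sketch("URGENT: You owe $500 in back taxes. Call 555-123-4567 now!"))
# prints: urgent owe money back taxes call phone now
```

Substitution runs before punctuation stripping on purpose: once `:`, `/`, `-` and `$` are removed, URLs, phone numbers and amounts can no longer be recognised.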


In [None]:
from preprocess import clean_text

examples = [
    "Congratulations! You have WON a FREE iPhone! Click https://scam.com/prize now!",
    "Hey, are we still meeting at 3pm? Call me at 555-123-4567 if not.",
    "URGENT: You owe $500 in back taxes. Pay immediately or face penalties!",
    "Thanks for sending the Q3 report. I'll review it by Friday.",
]

pd.DataFrame({
    "Original": examples,
    "Cleaned" : [clean_text(t) for t in examples],
})


## 3  Baseline — TF-IDF + Logistic Regression

**TfidfVectorizer** converts cleaned text into a sparse numeric matrix
(up to 50 000 uni + bigram features, sub-linear TF scaling).
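
As a small illustration of sub-linear TF scaling: scikit-learn replaces a raw term count `tf` with `1 + ln(tf)`, so a token repeated many times does not dominate the vector. The helper name `sublinear_tf` below is made up for this sketch.

```python
import math

def sublinear_tf(count):
    """Dampened term frequency: 1 + ln(count) for count > 0, else 0."""
    return 1.0 + math.log(count) if count > 0 else 0.0

for count in (1, 2, 10, 100):
    print(f"raw tf={count:>3}  ->  sub-linear tf={sublinear_tf(count):.2f}")
```

A word that appears 100 times ends up weighted roughly 5.6 times a word that appears once, rather than 100 times.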

**LogisticRegression** with `class_weight='balanced'` compensates for the
13 / 87 class imbalance by up-weighting the minority (spam) class internally.
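
The `balanced` heuristic sets each class weight to `n_samples / (n_classes * class_count)`. The sketch below reproduces that arithmetic on toy labels with a 13 / 87 split; the function name is hypothetical and the exact counts in the real training split will differ.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Toy labels with a 13 / 87 split (1 = spam, 0 = ham)
labels = [1] * 13 + [0] * 87
weights = balanced_class_weights(labels)
print({cls: round(w, 3) for cls, w in sorted(weights.items())})
# prints: {0: 0.575, 1: 3.846}
```

With these weights a misclassified spam message costs about 6.7 times as much as a misclassified ham message, which is the inverse of the class ratio.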


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from preprocess import preprocess_df
from config import (SEED, TFIDF_MAX_FEATURES, TFIDF_MIN_DF,
                    TFIDF_NGRAM_RANGE, TFIDF_SUBLINEAR_TF, LR_C, LR_MAX_ITER)

# Preprocess all three splits
train_p = preprocess_df(train_df)
val_p   = preprocess_df(val_df)
test_p  = preprocess_df(test_df)

# TF-IDF — fit on training data only
tfidf = TfidfVectorizer(
    max_features=TFIDF_MAX_FEATURES,
    ngram_range=TFIDF_NGRAM_RANGE,
    sublinear_tf=TFIDF_SUBLINEAR_TF,
    min_df=TFIDF_MIN_DF,
)
X_train = tfidf.fit_transform(train_p["cleaned_text"])
X_val   = tfidf.transform(val_p["cleaned_text"])
X_test  = tfidf.transform(test_p["cleaned_text"])

y_train = train_p["label"].values
y_val   = val_p["label"].values
y_test  = test_p["label"].values

print(f"Feature matrix: {X_train.shape[0]} samples x {X_train.shape[1]} features")

# Logistic Regression
lr_model = LogisticRegression(
    C=LR_C, max_iter=LR_MAX_ITER,
    class_weight="balanced", solver="lbfgs", random_state=SEED,
)
lr_model.fit(X_train, y_train)
print("Logistic Regression trained.")


In [None]:
from sklearn.metrics import classification_report
from evaluate import compute_metrics

# Validation
print("=" * 50)
print("  Validation")
print("=" * 50)
print(classification_report(y_val, lr_model.predict(X_val), target_names=["Ham", "Spam"]))

# Test
test_preds_lr  = lr_model.predict(X_test)
test_probs_lr  = lr_model.predict_proba(X_test)[:, 1]
baseline_metrics = compute_metrics(y_test, test_preds_lr, test_probs_lr)

print("=" * 50)
print("  Test Set")
print("=" * 50)
print(classification_report(y_test, test_preds_lr, target_names=["Ham", "Spam"]))

pd.DataFrame(baseline_metrics, index=[0]).T.rename(columns={0: "Score"}).round(4)


In [None]:
from sklearn.metrics import confusion_matrix, roc_curve, auc

fig, axes = plt.subplots(1, 2, figsize=(12, 4.5))

# Confusion matrix
sns.heatmap(
    confusion_matrix(y_test, test_preds_lr),
    annot=True, fmt="d", cmap="Blues", ax=axes[0],
    xticklabels=["Ham", "Spam"], yticklabels=["Ham", "Spam"],
    cbar=False, linewidths=1, linecolor="black",
)
axes[0].set_xlabel("Predicted")
axes[0].set_ylabel("Actual")
axes[0].set_title("Confusion Matrix – Baseline", fontsize=13)

# ROC
fpr, tpr, _ = roc_curve(y_test, test_probs_lr)
axes[1].plot(fpr, tpr, color="darkorange", lw=2, label=f"AUC = {auc(fpr, tpr):.3f}")
axes[1].plot([0, 1], [0, 1], color="navy", lw=1, linestyle="--")
axes[1].set_xlabel("False Positive Rate")
axes[1].set_ylabel("True Positive Rate")
axes[1].set_title("ROC Curve – Baseline", fontsize=13)
axes[1].legend(loc="lower right", fontsize=11)
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()


## 4  Advanced — Bi-directional LSTM

Unlike the bag-of-words baseline, the LSTM **learns its own word embeddings** and
processes tokens **sequentially**, letting it capture word-order patterns.

| Component | Details |
|-----------|---------|
| Embedding | 128-dim, learned, PAD-masked |
| BiLSTM | 1 layer · 128 hidden units per direction |
| Dropout | 0.3 — after embedding & before the FC layer |
| Output | Linear(256 → 1) raw logit; sigmoid → P(spam) |
| Loss | `BCEWithLogitsLoss` with `pos_weight = ham_count / spam_count` |
| Optimiser | Adam (lr = 0.001), gradient clipping @ 1.0 |

**Best checkpoint:** the model state with the highest validation macro-F1
is kept and loaded before the final test evaluation.
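
To make the effect of `pos_weight` concrete, here is a pure-Python sketch of the per-example loss that `BCEWithLogitsLoss` computes, using a numerically stable softplus. The function names and the 13 / 87 split are assumptions for illustration; training itself uses the PyTorch loss.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def weighted_bce_with_logits(logit, target, pos_weight):
    """-[pos_weight * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]"""
    log_sig = -softplus(-logit)        # log(sigmoid(x))
    log_one_minus = -softplus(logit)   # log(1 - sigmoid(x))
    return -(pos_weight * target * log_sig + (1 - target) * log_one_minus)

pos_weight = 87 / 13  # ham_count / spam_count for an assumed 13 / 87 split
for target in (1, 0):
    loss = weighted_bce_with_logits(logit=0.0, target=target, pos_weight=pos_weight)
    print(f"target={target}  loss={loss:.3f}")
```

At a logit of 0 the model is undecided, yet the spam example contributes `pos_weight` times the loss of the ham example. That is how the minority class gets comparable influence on the gradients.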


In [None]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from train_lstm import SpamLSTM, Vocabulary, SpamDataset, train_one_epoch, eval_epoch
from config import VOCAB_SIZE, BATCH_SIZE, EPOCHS, LSTM_LR

torch.manual_seed(SEED)
np.random.seed(SEED)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")

# Vocabulary — built on training text only
vocab = Vocabulary(max_size=VOCAB_SIZE)
vocab.build(train_p["cleaned_text"].tolist())
print(f"Vocabulary size: {len(vocab):,}")

# DataLoaders
def _ds(df):
    return SpamDataset(df["cleaned_text"].tolist(), df["label"].tolist(), vocab)

train_loader = DataLoader(_ds(train_p), batch_size=BATCH_SIZE, shuffle=True)
val_loader   = DataLoader(_ds(val_p),   batch_size=BATCH_SIZE)
test_loader  = DataLoader(_ds(test_p),  batch_size=BATCH_SIZE)

# pos_weight to compensate for class imbalance
n_spam     = int(train_p["label"].sum())
pos_weight = torch.tensor([(len(train_p) - n_spam) / n_spam],
                          dtype=torch.float, device=device)
print(f"pos_weight = {pos_weight.item():.2f}")

# Model / optimiser / loss
lstm_model = SpamLSTM(vocab_size=len(vocab)).to(device)
optimizer  = torch.optim.Adam(lstm_model.parameters(), lr=LSTM_LR)
criterion  = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# Training loop with best-checkpoint tracking
history            = {"train_loss": [], "val_loss": [], "val_f1": []}
# Start below any reachable F1 so the first epoch always stores a checkpoint
best_f1, best_state = -1.0, None

print(f"\nTraining for {EPOCHS} epochs ...")
for epoch in range(1, EPOCHS + 1):
    t_loss = train_one_epoch(lstm_model, train_loader, optimizer, criterion, device)
    v_loss, v_preds, v_probs, v_labels = eval_epoch(
        lstm_model, val_loader, criterion, device
    )
    v_f1 = compute_metrics(v_labels, v_preds, v_probs)["f1_macro"]

    history["train_loss"].append(t_loss)
    history["val_loss"].append(v_loss)
    history["val_f1"].append(v_f1)

    marker = ""
    if v_f1 > best_f1:
        best_f1    = v_f1
        best_state = {k: v.clone() for k, v in lstm_model.state_dict().items()}
        marker     = "  <- best"

    print(f"  Epoch {epoch:>2}/{EPOCHS}  "
          f"train_loss={t_loss:.4f}  "
          f"val_loss={v_loss:.4f}  "
          f"val_f1={v_f1:.4f}{marker}")

# Restore best checkpoint
lstm_model.load_state_dict(best_state)
print(f"\nBest validation F1: {best_f1:.4f}")


In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
ep = range(1, len(history["train_loss"]) + 1)

axes[0].plot(ep, history["train_loss"], "b-o", markersize=4, label="Train")
axes[0].plot(ep, history["val_loss"],   "r-o", markersize=4, label="Val")
axes[0].set_xlabel("Epoch")
axes[0].set_ylabel("Loss")
axes[0].set_title("Training & Validation Loss", fontsize=13)
axes[0].legend()
axes[0].grid(alpha=0.3)

axes[1].plot(ep, history["val_f1"], "g-o", markersize=4)
axes[1].axhline(best_f1, color="green", ls="--", alpha=0.5,
                label=f"best = {best_f1:.4f}")
axes[1].set_xlabel("Epoch")
axes[1].set_ylabel("F1 (macro)")
axes[1].set_title("Validation F1 Score", fontsize=13)
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()


In [None]:
_, test_preds_lstm, test_probs_lstm, _ = eval_epoch(
    lstm_model, test_loader, criterion, device
)

print("=" * 50)
print("  Test Set - Bi-LSTM")
print("=" * 50)
print(classification_report(y_test, test_preds_lstm, target_names=["Ham", "Spam"]))

lstm_metrics = compute_metrics(y_test, test_preds_lstm, test_probs_lstm)
pd.DataFrame(lstm_metrics, index=[0]).T.rename(columns={0: "Score"}).round(4)


In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 4.5))

# Confusion matrix
sns.heatmap(
    confusion_matrix(y_test, test_preds_lstm),
    annot=True, fmt="d", cmap="Blues", ax=axes[0],
    xticklabels=["Ham", "Spam"], yticklabels=["Ham", "Spam"],
    cbar=False, linewidths=1, linecolor="black",
)
axes[0].set_xlabel("Predicted")
axes[0].set_ylabel("Actual")
axes[0].set_title("Confusion Matrix – Bi-LSTM", fontsize=13)

# ROC
fpr, tpr, _ = roc_curve(y_test, test_probs_lstm)
axes[1].plot(fpr, tpr, color="darkorange", lw=2, label=f"AUC = {auc(fpr, tpr):.3f}")
axes[1].plot([0, 1], [0, 1], color="navy", lw=1, linestyle="--")
axes[1].set_xlabel("False Positive Rate")
axes[1].set_ylabel("True Positive Rate")
axes[1].set_title("ROC Curve – Bi-LSTM", fontsize=13)
axes[1].legend(loc="lower right", fontsize=11)
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()


## 5  Model Comparison

Both models were evaluated on the **same held-out test set** (15 % of the data).
The chart and table below give a side-by-side view of every metric.
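
The macro-averaged scores weight ham and spam equally, which matters under class imbalance. The sketch below computes macro-F1 by hand on made-up predictions; the helper names are hypothetical and the numbers are not results from this notebook.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, treating `cls` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# A degenerate classifier that always predicts ham on 13 spam / 87 ham
y_true = [1] * 13 + [0] * 87
y_pred = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.3f}  macro-F1={macro_f1(y_true, y_pred):.3f}")
# prints: accuracy=0.870  macro-F1=0.465
```

Accuracy rewards the always-ham classifier with 87 %, while macro-F1 drops below 0.5 because the spam class scores zero. This is why macro-F1 is a more telling number here than accuracy.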


In [None]:
KEYS   = ["accuracy", "precision_macro", "recall_macro", "f1_macro", "auc"]
CHART  = ["Accuracy", "Precision\n(macro)", "Recall\n(macro)", "F1\n(macro)", "AUC-ROC"]
TABLE  = ["Accuracy", "Precision (macro)", "Recall (macro)", "F1 (macro)", "AUC-ROC"]

x, w = np.arange(len(KEYS)), 0.33

fig, ax = plt.subplots(figsize=(10, 5))
bars_b = ax.bar(x - w/2, [baseline_metrics[k] for k in KEYS], w,
                label="TF-IDF + LR", color="steelblue", edgecolor="white")
bars_l = ax.bar(x + w/2, [lstm_metrics[k]     for k in KEYS], w,
                label="Bi-LSTM",     color="coral",    edgecolor="white")

for bars in (bars_b, bars_l):
    for bar in bars:
        ax.text(bar.get_x() + bar.get_width() / 2,
                bar.get_height() + 0.004,
                f"{bar.get_height():.3f}",
                ha="center", va="bottom", fontsize=8.5)

ax.set_xticks(x)
ax.set_xticklabels(CHART, fontsize=10)
ax.set_ylabel("Score", fontsize=11)
ax.set_title("Model Comparison – Test Set", fontsize=13)
ax.legend(fontsize=10)
ax.set_ylim(0.80, 1.04)
ax.grid(axis="y", alpha=0.3)
plt.tight_layout()
plt.show()

# Summary table
pd.DataFrame({
    "Metric":     TABLE,
    "TF-IDF+LR":  [round(baseline_metrics[k], 4) for k in KEYS],
    "Bi-LSTM":    [round(lstm_metrics[k],     4) for k in KEYS],
}).set_index("Metric")


## 6  Live Predictions

Classify hand-written sample emails with both trained models.
*Confidence* is the probability the model assigns to its chosen label.


In [None]:
sample_emails = [
    "Congratulations! You have won a free iPhone! Click here to claim your prize immediately.",
    "Hey, are we still meeting for dinner tonight? Let me know what time works.",
    "URGENT: Your account has been suspended. Click the link below to verify now.",
    "Thanks for sending the meeting notes. I will review them by end of day.",
    "You owe $500 in back taxes. Call 555-123-4567 immediately to avoid penalties.",
    "Just checking in – hope your weekend was great! See you Monday.",
    "Win a FREE vacation to the Bahamas! Reply YES now to claim your reward.",
    "Can you send me the quarterly report when you get a chance? No rush.",
]

cleaned = [clean_text(t) for t in sample_emails]

# Baseline
X_sample  = tfidf.transform(cleaned)
bl_preds  = lr_model.predict(X_sample)
bl_probs  = lr_model.predict_proba(X_sample)[:, 1]

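# Bi-LSTM (assumes vocab.encode pads/truncates every message to the same length)
ids_t = torch.tensor([vocab.encode(t) for t in cleaned], dtype=torch.long)
lstm_model.eval()  # make sure dropout is disabled for inference
with torch.no_grad():
    logits = lstm_model(ids_t.to(device))
    # Flatten to shape (N,) so each entry is a scalar probability
    lstm_probs_s = torch.sigmoid(logits).reshape(-1).cpu().numpy()
lstm_preds_s = (lstm_probs_s > 0.5).astype(int)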

# Results table
def _lbl(p):  return "SPAM" if p else "HAM"
def _conf(p): return f"{max(float(p), 1 - float(p)):.1%}"

pd.DataFrame({
    "Email":           [t[:58] + ("..." if len(t) > 58 else "") for t in sample_emails],
    "Baseline":        [_lbl(p) for p in bl_preds],
    "BL Confidence":   [_conf(p) for p in bl_probs],
    "Bi-LSTM":         [_lbl(p) for p in lstm_preds_s],
    "LSTM Confidence": [_conf(p) for p in lstm_probs_s],
})


## Summary

| | TF-IDF + LR | Bi-LSTM |
|---|---|---|
| **Approach** | Bag-of-words features + linear classifier | Learned embeddings + sequential model |
| **Strengths** | Fast, interpretable, strong on keyword-heavy spam | Captures word order and context |
| **Best for** | Short messages where key words dominate | Longer or more nuanced text |

Both models perform well on this dataset. The TF-IDF baseline is competitive because
spam tends to rely on distinctive keywords, which is exactly what TF-IDF captures.
The Bi-LSTM is expected to help mainly when word order and context matter, which is
more likely on longer or subtler messages than on short, keyword-heavy ones.

---

*All hyper-parameters live in `config.py`. To retrain with different settings,
change a value there and re-run the relevant cell.*
