Logistic Regression Classifier
--------
This script trains a Logistic Regression classifier for SMS spam detection.

Pipeline steps:
1. TF-IDF Vectorizer (unigrams + bigrams)
   - Converts raw text messages into numerical features: each term is weighted
     by how often it occurs in a message, discounted by how common it is
     across the dataset.
   - Includes both single words (e.g., "free") and pairs of words (e.g., "free entry").

2. RandomOverSampler
   - The dataset is imbalanced (many more ham than spam messages).
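   - Oversampling balances the training data by randomly duplicating spam examples,
     so the model is less biased toward always predicting ham.
   - Because it is a step in the imblearn Pipeline, resampling is applied only
     when fitting, never when predicting on validation data.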

3. Logistic Regression
   - A linear classifier that assigns weights to each TF-IDF feature.
   - Features common in spam (like "win prize" or "free entry") get positive weights,
     while features common in ham may get negative weights.

4. Grid Search with 5-Fold Cross-Validation
   - Instead of manually tuning hyperparameters, the script searches over:
       * min_df (ignore terms that appear in fewer than min_df messages)
       * C (inverse of regularization strength in Logistic Regression;
         smaller values regularize more strongly)
   - The penalty ("l2") and solver ("lbfgs") are held fixed in the grid.
   - Uses 5-fold CV on the training data, optimizing for F1 score on the spam
     class (the harmonic mean of precision and recall).

5. Validation Evaluation
   - After training with the best parameters, the model is evaluated on the
     validation split (15% of data not seen in training).
   - Reports precision, recall, F1, and ROC-AUC.

6. Saving Results
   - The trained pipeline (TF-IDF + ROS + Logistic Regression) is saved to:
       ../OUTPUT/04_Training_Results/logreg.joblib
     This allows reusing the trained model later without retraining.
   - A JSON report with best parameters and validation metrics is saved to:
       ../OUTPUT/04_Training_Results/train_report.json
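
To make step 1 concrete, here is a minimal standard-library sketch of unigram + bigram extraction with a min_df cutoff. It is a simplification, not what TfidfVectorizer does internally: tokens are split on whitespace (the real vectorizer uses a regex token pattern), no TF-IDF weights are computed, and the three toy messages and the helper name `ngrams` are invented for illustration.

```python
from collections import Counter

def ngrams(text, n_max=2):
    """Return all n-grams up to n_max from a lowercased, whitespace-split message."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Toy messages, invented for illustration (not taken from the dataset).
messages = [
    "free entry to win",
    "free entry today",
    "see you today",
]

# Document frequency: the number of messages each n-gram appears in.
doc_freq = Counter()
for msg in messages:
    doc_freq.update(set(ngrams(msg)))

# min_df=2 keeps only n-grams that occur in at least two messages.
vocab = sorted(g for g, df in doc_freq.items() if df >= 2)
print(vocab)  # ['entry', 'free', 'free entry', 'today']
```

Raising min_df shrinks the vocabulary by dropping rare n-grams, which is the trade-off the grid search explores.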

In [1]:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from pathlib import Path
import json
import pandas as pd
import joblib

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

# Paths
splits_dir = Path("../DATA/splits")
out_dir = Path("../OUTPUT/04_Training_Results")
out_dir.mkdir(parents=True, exist_ok=True)

model_path = out_dir / "logreg.joblib"
report_path = out_dir / "train_report.json"

In [2]:
# Load data
train = pd.read_csv(splits_dir / "train.csv")
val = pd.read_csv(splits_dir / "val.csv")

X_train = train["SMS_Message"].tolist()
y_train = (train["Label"].str.lower() == "spam").astype(int).values

X_val = val["SMS_Message"].tolist()
y_val = (val["Label"].str.lower() == "spam").astype(int).values


In [3]:
# Pipeline: TF-IDF -> ROS -> Logistic Regression
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("ros", RandomOverSampler(random_state=42)),
    ("clf", LogisticRegression(max_iter=1000, random_state=42))
])
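
For intuition about the "ros" step, the sketch below shows random oversampling by duplication using only the standard library. It mirrors the idea behind RandomOverSampler (sampling minority rows with replacement until classes are balanced) but is not imblearn's implementation; the function name `random_oversample` and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen minority rows until every class matches the majority count."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        # Rows of this class in the original data; duplicates are drawn from here.
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out

# Toy data, invented for illustration: 4 ham (0) and 1 spam (1).
X = ["hi", "ok", "see you", "call me", "win prize"]
y = [0, 0, 0, 0, 1]

X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))  # both classes now have 4 rows
```

No new information is created: the spam rows are exact copies. That is why the resampling belongs inside the pipeline, so that during cross-validation only the training portion of each fold is resampled and the held-out portion keeps its original class distribution.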

In [4]:
# Small grid
param_grid = {
    "tfidf__min_df": [1, 2, 3],
    "clf__C": [0.5, 1.0, 2.0, 4.0],
    "clf__penalty": ["l2"],
    "clf__solver": ["lbfgs"],
}

In [5]:
# Optimize F1 on spam (positive class)
gs = GridSearchCV(pipe, param_grid, scoring="f1", cv=5, n_jobs=-1, verbose=1)
gs.fit(X_train, y_train)
best_model = gs.best_estimator_


Fitting 5 folds for each of 12 candidates, totalling 60 fits
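
The "12 candidates, totalling 60 fits" in the log line follows directly from the grid. A small sketch that enumerates the combinations with the standard library (the grid is copied from the cell above; the enumeration is only an illustration of the arithmetic, not how GridSearchCV is implemented):

```python
from itertools import product

# Same grid as the notebook's param_grid.
param_grid = {
    "tfidf__min_df": [1, 2, 3],
    "clf__C": [0.5, 1.0, 2.0, 4.0],
    "clf__penalty": ["l2"],
    "clf__solver": ["lbfgs"],
}

# Cartesian product of all value lists: one dict per candidate setting.
keys = list(param_grid)
candidates = [dict(zip(keys, values)) for values in product(*param_grid.values())]

n_folds = 5
print(len(candidates), len(candidates) * n_folds)  # 12 60
```

Only min_df and C actually vary (3 x 4 = 12); penalty and solver each contribute a single value.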


In [6]:
# Validate on VAL (original distribution)
y_pred = best_model.predict(X_val)
y_prob = best_model.predict_proba(X_val)[:, 1]

prec, rec, f1, _ = precision_recall_fscore_support(
    y_val, y_pred, average="binary", zero_division=0
)
auc = roc_auc_score(y_val, y_prob)

report = {
    "best_params": gs.best_params_,
    "val_metrics": {
        "precision": float(prec),
        "recall": float(rec),
        "f1": float(f1),
        "roc_auc": float(auc),
    },
}
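
As a rough illustration of how `predict_proba` and `predict` relate for a binary logistic regression: the spam probability is the sigmoid of a weighted sum of the features, and the predicted label corresponds to thresholding that probability at 0.5. The weights, intercept, and TF-IDF values below are hypothetical numbers chosen for the example, not values learned by this model.

```python
import math

def predict_proba_spam(features, weights, intercept):
    """Sigmoid of the intercept plus the weighted feature sum."""
    z = intercept + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and intercept, invented for illustration.
weights = {"free entry": 2.0, "win prize": 1.5, "see you": -1.0}
intercept = -0.5

# Hypothetical TF-IDF values for two messages.
spam_like = {"free entry": 0.6, "win prize": 0.5}
ham_like = {"see you": 0.7}

for name, feats in [("spam_like", spam_like), ("ham_like", ham_like)]:
    p = predict_proba_spam(feats, weights, intercept)
    # The label is 1 (spam) when the probability exceeds 0.5.
    print(name, round(p, 3), int(p > 0.5))
```

Positive weights push the sum, and therefore the probability, toward spam; negative weights push it toward ham. ROC-AUC is computed from the probabilities, while precision, recall, and F1 use the thresholded labels.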

In [7]:
# Save artifacts
joblib.dump(best_model, model_path)
with open(report_path, "w", encoding="utf-8") as f:
    json.dump(report, f, indent=2)

print("Saved model:", model_path)
print("Saved report:", report_path)
print("Best params:", gs.best_params_)
print("VAL metrics:", report["val_metrics"])

Saved model: ../OUTPUT/04_Training_Results/logreg.joblib
Saved report: ../OUTPUT/04_Training_Results/train_report.json
Best params: {'clf__C': 4.0, 'clf__penalty': 'l2', 'clf__solver': 'lbfgs', 'tfidf__min_df': 2}
VAL metrics: {'precision': 0.9191919191919192, 'recall': 0.9381443298969072, 'f1': 0.9285714285714286, 'roc_auc': 0.9933910977782515}
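
As a sanity check on the printed metrics, F1 can be recomputed from precision and recall as their harmonic mean. The sketch below also recovers small-denominator fractions matching the reported floats. The floats are copied from the output above; reading the fractions as confusion-matrix counts is an inference, since the notebook does not print the confusion matrix.

```python
from fractions import Fraction

# Values copied from the VAL metrics output above.
precision = 0.9191919191919192
recall = 0.9381443298969072
reported_f1 = 0.9285714285714286

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
assert abs(f1 - reported_f1) < 1e-12

# Closest fractions with denominators up to 1000.
p_frac = Fraction(precision).limit_denominator(1000)
r_frac = Fraction(recall).limit_denominator(1000)
print(p_frac, r_frac)  # 91/99 91/97
```

The shared numerator is consistent with 91 true positives, 8 false positives, and 6 false negatives on the validation split, though multiples of those counts would give the same ratios.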
