# Spam Detection – Logistic Regression, Random Forest & Naive Bayes

**Objective:**
In this notebook, we implement spam detection using **Logistic Regression**, **Random Forest**, and **Naive Bayes (MultinomialNB)**. We evaluate all three models with accuracy, precision, recall, F1-score, and a confusion matrix, and then run sanity checks on a few sample messages.

In [1]:
# --------------------------------
# 0) Imports
# --------------------------------
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
)

RANDOM_STATE = 42  # reproducibility

## 1. Load Dataset

We use the dataset provided in class (`mail_l7_dataset.csv`). We replace missing values with empty strings and encode the labels as integers (spam = 0, ham = 1).

In [2]:
df = pd.read_csv("mail_l7_dataset.csv")

# Basic cleaning: replace NaNs with empty strings (text models can't handle NaN)
df = df.fillna("")

# Encode labels: spam -> 0, ham -> 1
df["Category"] = df["Category"].str.lower().str.strip().map({"spam": 0, "ham": 1})

print(df.head())

   Category                                            Message
0         1  Go until jurong point, crazy.. Available only ...
1         1                      Ok lar... Joking wif u oni...
2         0  Free entry in 2 a wkly comp to win FA Cup fina...
3         1  U dun say so early hor... U c already then say...
4         1  Nah I don't think he goes to usf, he lives aro...
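
`Series.map` returns `NaN` for any label that is not a key of the mapping, and the later `astype(int)` would then fail. The following is a minimal sketch of how one could surface unmapped labels and check class balance before moving on. The `toy` frame and its rows are made up for illustration and stand in for `mail_l7_dataset.csv`.

```python
import pandas as pd

# Toy stand-in for the real CSV (hypothetical rows, for illustration only)
toy = pd.DataFrame({
    "Category": ["ham", " Spam ", "HAM", "junk"],
    "Message": ["see you soon", "win cash now", "ok", "???"],
})

label_map = {"spam": 0, "ham": 1}
encoded = toy["Category"].str.lower().str.strip().map(label_map)

# Labels outside the mapping become NaN; list them before casting to int
unmapped = toy.loc[encoded.isna(), "Category"].tolist()
print("Unmapped labels:", unmapped)

# Class balance of the rows that mapped cleanly
clean = encoded.dropna().astype(int)
counts = clean.value_counts().to_dict()
print("Class counts:", counts)
```

If `unmapped` is non-empty on the real data, those rows would need to be dropped or relabelled before the `astype(int)` cast in the next section.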


## 2. Split Features & Target

We separate the messages (X) from the category labels (y).

In [3]:
X = df["Message"].astype(str)
y = df["Category"].astype(int)

## 3. Split Data

We split the data into 80% training and 20% testing sets, fixing `random_state` so the split is reproducible.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)

print("=== SPLIT SIZES ===")
print("Train:", X_train.shape[0], " | Test:", X_test.shape[0])

=== SPLIT SIZES ===
Train: 4457  | Test: 1115
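
The split above does not pass `stratify`, so the share of spam in each split depends on the random seed. As an optional variation (not what this notebook does, and one that would change the numbers reported later), `train_test_split` accepts `stratify=y` to keep the class ratio the same in both splits. A small sketch on made-up labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 20 spam (0) and 80 ham (1)
y_toy = np.array([0] * 20 + [1] * 80)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both splits keep the 20% spam share
train_spam_share = float(np.mean(y_tr == 0))
test_spam_share = float(np.mean(y_te == 0))
print("Train spam share:", train_spam_share, "| Test spam share:", test_spam_share)
```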


## 4. Text Feature Extraction (TF-IDF)

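We use **TfidfVectorizer** to convert text messages into numeric feature vectors. The vectorizer is fitted on the training messages only and then applied to the test messages, so the vocabulary and IDF weights are not influenced by the test set.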

In [5]:
tfidf = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
X_train_features = tfidf.fit_transform(X_train)
X_test_features  = tfidf.transform(X_test)

print("\n=== TF-IDF SHAPES ===")
print("X_train:", X_train_features.shape, " | X_test:", X_test_features.shape)


=== TF-IDF SHAPES ===
X_train: (4457, 7440)  | X_test: (1115, 7440)
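
Both matrices have 7440 columns because `transform` reuses the vocabulary learned by `fit_transform`; words that never appeared in training are simply dropped. The sketch below illustrates this on a tiny made-up corpus (the documents and variable names are assumptions for illustration, not part of the notebook's data).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training corpus
train_docs = ["free prize waiting", "meeting at noon", "free lunch at noon"]
vec = TfidfVectorizer(stop_words="english", lowercase=True)
train_mat = vec.fit_transform(train_docs)

# First message has only unseen words; second mixes known and unseen words
new_docs = ["zebra xylophone", "Free PRIZE today"]
new_mat = vec.transform(new_docs)

n_train_cols = train_mat.shape[1]
n_new_cols = new_mat.shape[1]

# Out-of-vocabulary words contribute nothing, so the row is all zeros
oov_nonzeros = new_mat[0].nnz

# Rows with at least one known word are L2-normalised by default
known_row_norm = float(np.sqrt(new_mat[1].multiply(new_mat[1]).sum()))

print("Columns (train, new):", n_train_cols, n_new_cols)
print("Non-zeros in unseen-only row:", oov_nonzeros)
print("Norm of known-word row:", round(known_row_norm, 6))
```

A message made up entirely of unseen words becomes an all-zero vector, so a classifier can only fall back on its intercept or class prior for it.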


## 5. Train Models

We train three classifiers: **Logistic Regression**, **Random Forest**, and **Naive Bayes (MultinomialNB)**.

In [6]:
# Logistic Regression
lr = LogisticRegression(max_iter=1000, random_state=RANDOM_STATE)
lr.fit(X_train_features, y_train)
lr_pred = lr.predict(X_test_features)

# Random Forest
rf = RandomForestClassifier(n_estimators=200, random_state=RANDOM_STATE)
rf.fit(X_train_features, y_train)
rf_pred = rf.predict(X_test_features)  # sparse input is accepted, as in fit(); no dense copy needed

# Naive Bayes (MultinomialNB)
nb = MultinomialNB()
nb.fit(X_train_features, y_train)
nb_pred = nb.predict(X_test_features)
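
The cells above fit the vectorizer and each classifier as separate objects. One common alternative is to bundle them in an `sklearn.pipeline.Pipeline`, so raw strings can be passed straight to `fit` and `predict`. This is a sketch under assumptions: the training messages are made up, and the step names `"tfidf"` and `"clf"` are arbitrary.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy messages; 0 = spam, 1 = ham (same encoding as the notebook)
train_msgs = [
    "win a free prize today", "free cash prize waiting",
    "see you at lunch", "meeting moved to monday",
]
train_labels = [0, 0, 1, 1]

spam_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", lowercase=True)),
    ("clf", MultinomialNB()),
])

# fit() runs fit_transform on the vectorizer, then fits the classifier
spam_pipe.fit(train_msgs, train_labels)

# predict() applies transform and then the classifier to raw text
preds = spam_pipe.predict(["free prize", "lunch meeting"]).tolist()
print(preds)
```

Swapping `MultinomialNB()` for `LogisticRegression(...)` or `RandomForestClassifier(...)` would give the other two models the same interface.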

## 6. Evaluate Performance

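We use helper functions to print evaluation metrics and confusion matrices for all three models. Spam (label 0) is treated as the positive class, so precision and recall describe how reliably spam is flagged and how much of it is caught.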

In [7]:
def print_clf_metrics(name, y_true, y_pred, pos_label=0):
    """Print Accuracy, Precision, Recall, F1. pos_label=0 means 'spam' is positive."""
    acc  = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred, pos_label=pos_label)
    rec  = recall_score(y_true, y_pred, pos_label=pos_label)
    f1   = f1_score(y_true, y_pred, pos_label=pos_label)
    print(f"\n{name} Performance:")
    print(f"  Accuracy : {acc:.3f}")
    print(f"  Precision: {prec:.3f}")
    print(f"  Recall   : {rec:.3f}")
    print(f"  F1-Score : {f1:.3f}")

def print_confmat(name, y_true, y_pred):
    """
    Confusion matrix with readable labels.
    Rows = Actual, Cols = Predicted
    """
    cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
    cm_df = pd.DataFrame(
        cm,
        index   = ["Actual: Ham (1)",  "Actual: Spam (0)"],
        columns = ["Pred: Ham (1)",    "Pred: Spam (0)"]
    )
    print(f"\n{name} – Confusion Matrix:\n{cm_df}")

In [8]:
# Show results for all three models
print_clf_metrics("Logistic Regression", y_test, lr_pred, pos_label=0)
print_confmat("Logistic Regression", y_test, lr_pred)

print_clf_metrics("Random Forest", y_test, rf_pred, pos_label=0)
print_confmat("Random Forest", y_test, rf_pred)

print_clf_metrics("Naive Bayes", y_test, nb_pred, pos_label=0)
print_confmat("Naive Bayes", y_test, nb_pred)


Logistic Regression Performance:
  Accuracy : 0.968
  Precision: 1.000
  Recall   : 0.758
  F1-Score : 0.863

Logistic Regression – Confusion Matrix:
                  Pred: Ham (1)  Pred: Spam (0)
Actual: Ham (1)             966               0
Actual: Spam (0)             36             113

Random Forest Performance:
  Accuracy : 0.983
  Precision: 1.000
  Recall   : 0.872
  F1-Score : 0.932

Random Forest – Confusion Matrix:
                  Pred: Ham (1)  Pred: Spam (0)
Actual: Ham (1)             966               0
Actual: Spam (0)             19             130

Naive Bayes Performance:
  Accuracy : 0.977
  Precision: 1.000
  Recall   : 0.826
  F1-Score : 0.904

Naive Bayes – Confusion Matrix:
                  Pred: Ham (1)  Pred: Spam (0)
Actual: Ham (1)             966               0
Actual: Spam (0)             26             123
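
The reported metrics can be re-derived from the confusion matrices above. All three models flag no ham as spam (precision 1.000), so they differ only in how much spam they miss. A minimal sketch of the arithmetic, using the counts printed above (the helper name `metrics_from_counts` is introduced here for illustration):

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, F1) for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Spam (0) is the positive class:
#   TP = spam caught, FN = spam missed,
#   FP = ham flagged as spam, TN = ham passed through.
reported = {
    "Logistic Regression": dict(tp=113, fp=0, fn=36, tn=966),
    "Random Forest":       dict(tp=130, fp=0, fn=19, tn=966),
    "Naive Bayes":         dict(tp=123, fp=0, fn=26, tn=966),
}

results = {name: metrics_from_counts(**c) for name, c in reported.items()}
for name, (acc, prec, rec, f1) in results.items():
    print(f"{name}: acc={acc:.3f} prec={prec:.3f} rec={rec:.3f} f1={f1:.3f}")
```

For example, Logistic Regression catches 113 of the 149 spam messages in the test set, which gives the recall of 0.758.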


## 7. Sanity Checks

We test all three models on **three example messages** and compare their predictions.

In [9]:
def lab2str(v):
    return "Spam (0)" if v == 0 else "Ham (1)"

test_messages = [
    "Free entry in 2 a weekly competition!",
    "I will meet you at the cafe tomorrow",
    "Congratulations, you won a free ticket"
]

for msg in test_messages:
    msg_features = tfidf.transform([msg])

    lr_pred_one = int(lr.predict(msg_features)[0])
    rf_pred_one = int(rf.predict(msg_features)[0])  # sparse input works; no .toarray() needed
    nb_pred_one = int(nb.predict(msg_features)[0])

    print("\n=== SANITY CHECK ===")
    print(f"Message: \"{msg}\"")
    print(f"  LR Prediction : {lab2str(lr_pred_one)}")
    print(f"  RF Prediction : {lab2str(rf_pred_one)}")
    print(f"  NB Prediction : {lab2str(nb_pred_one)}")


=== SANITY CHECK ===
Message: "Free entry in 2 a weekly competition!"
  LR Prediction : Ham (1)
  RF Prediction : Ham (1)
  NB Prediction : Spam (0)

=== SANITY CHECK ===
Message: "I will meet you at the cafe tomorrow"
  LR Prediction : Ham (1)
  RF Prediction : Ham (1)
  NB Prediction : Ham (1)

=== SANITY CHECK ===
Message: "Congratulations, you won a free ticket"
  LR Prediction : Ham (1)
  RF Prediction : Ham (1)
  NB Prediction : Ham (1)
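
In the output above, all three models label "Congratulations, you won a free ticket" as ham, and only Naive Bayes flags the first message. That is in line with the recall figures in Section 6: the models never flag ham, but they miss a share of spam. `predict` returns the more probable class, so one way to probe borderline cases is to look at `predict_proba` and try a lower cut-off for spam. The sketch below is an illustration on made-up training messages, not a result from the notebook's data, and the 0.3 threshold is an arbitrary assumption.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training set; 0 = spam, 1 = ham
msgs = [
    "win a free prize now", "free entry to win cash",
    "are we still on for lunch", "call me when you get home",
    "see you at the meeting tomorrow", "thanks for the update",
]
labels = [0, 0, 1, 1, 1, 1]

vec = TfidfVectorizer(stop_words="english")
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(msgs), labels)

# predict_proba columns follow clf.classes_, so look up the spam column
spam_col = list(clf.classes_).index(0)

new_msgs = [
    "Congratulations, you won a free ticket",
    "I will meet you at the cafe tomorrow",
]
p_spam = clf.predict_proba(vec.transform(new_msgs))[:, spam_col]

def flag_spam(p, threshold):
    """Return 0 (spam) where P(spam) >= threshold, else 1 (ham)."""
    return np.where(p >= threshold, 0, 1)

default_flags = flag_spam(p_spam, 0.5)   # same cut-off that predict() implies
lenient_flags = flag_spam(p_spam, 0.3)   # arbitrary lower cut-off

for m, p in zip(new_msgs, p_spam):
    print(f"P(spam)={p:.3f}  {m}")
```

Lowering the threshold can only increase the number of messages flagged as spam, so it trades some precision for recall; whether that trade is worthwhile would have to be checked on the real test set.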
