🌟 Week 3 — Day 7 Mini Project: Spam Email Classifier

Why this is perfect now:

It’s different from Titanic & Housing (new dataset, text instead of tabular).

Uses Naive Bayes, which shines in text classification.

You’ll combine:

Pipelines

Preprocessing (text → features with TF-IDF)

Cross-validation

Model comparison

📝 Notes

This project introduces text data, which is new compared to Titanic/Housing.

You practice pipelines + CV in a different domain.

It’s closer to real-world ML tasks (spam filters, sentiment analysis, etc.).


TfidfVectorizer turns text into numerical vectors.

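TF (term frequency) = how often a word appears in a document.

IDF (inverse document frequency) = downweights words that appear in many documents, upweights words that appear in few.

Produces feature vectors in which words distinctive to a message get high weights; the classifier then learns that words like “win”, “prize”, and “free” signal spam.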

Without vectorization, models couldn’t process text at all.

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

import zipfile, io, requests

url = "http://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"
r = requests.get(url, timeout=30)
r.raise_for_status()  # stop early if the download failed
z = zipfile.ZipFile(io.BytesIO(r.content))
with z.open("SMSSpamCollection") as f:
    # Tab-separated file with no header row: label, then message text
    df = pd.read_csv(f, sep="\t", header=None, names=["label", "message"], encoding="utf-8")

X = df["message"]
y = df["label"].map({"ham": 0, "spam": 1})

# Naive Bayes
nb_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", MultinomialNB())
])

# Logistic Regression
logreg_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000))
])

#Evaluate with cross validation

scores_nb = cross_val_score(nb_pipeline, X, y, cv=5, scoring="accuracy")
scores_lr = cross_val_score(logreg_pipeline, X, y, cv=5, scoring="accuracy")

print("Naive Bayes mean accuracy:", scores_nb.mean())
print("Logistic Regression mean accuracy:", scores_lr.mean())
print("---------------------------")

#Train and Inspect

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

nb_pipeline.fit(X_train, y_train)
y_pred = nb_pipeline.predict(X_test)

print("Final Naive Bayes Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

logreg_pipeline.fit(X_train, y_train)
y_pred = logreg_pipeline.predict(X_test)

print("Final Logistic Regression Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


Naive Bayes mean accuracy: 0.96859183164132
Logistic Regression mean accuracy: 0.9644643389071821
---------------------------
Final Naive Bayes Accuracy: 0.97847533632287
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       966
           1       1.00      0.84      0.91       149

    accuracy                           0.98      1115
   macro avg       0.99      0.92      0.95      1115
weighted avg       0.98      0.98      0.98      1115

Final Logistic Regression Accuracy: 0.9695067264573991
              precision    recall  f1-score   support

           0       0.97      1.00      0.98       966
           1       1.00      0.77      0.87       149

    accuracy                           0.97      1115
   macro avg       0.98      0.89      0.93      1115
weighted avg       0.97      0.97      0.97      1115




📊 Exercise of the Day
What mean accuracy did you get for Naive Bayes vs Logistic Regression with 5-fold CV?

Which one worked better? Why might that be?

Look at the precision vs recall in the classification report → which model catches more spam, and which avoids false alarms better?

1) Naive Bayes mean accuracy: 0.96859183164132
Logistic Regression mean accuracy: 0.9644643389071821

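2) Naive Bayes worked slightly better, though the gap is small (about 0.004). One possible reason is that short messages with sparse word features suit its simple per-word probability model.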

3) Naive Bayes catches more spam: its spam recall is 0.84, compared with 0.77 for Logistic Regression. Both are good at avoiding false alarms: each has a precision of 1.00 for spam, meaning that on this test split every message they flagged as spam really was spam.
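
As a quick check on what precision and recall measure, the sketch below computes both from a confusion matrix. The ten labels are hypothetical and chosen by hand; they are not the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for ten messages (1 = spam, 0 = ham); not the notebook's data.
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # one spam message is missed

# For binary labels, ravel() returns the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam messages, how many were flagged

print("precision:", precision, "recall:", recall)

# The same values via sklearn's helpers.
print(precision_score(y_true_toy, y_pred_toy), recall_score(y_true_toy, y_pred_toy))
```

A model with no false positives has perfect precision even if it misses some spam, which is the pattern both classifiers show in the reports above.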

🌟 Mini-Challenge

Change the TfidfVectorizer:

Use ngram_range=(1,2) (unigrams + bigrams).

Limit features with max_features=3000.

Compare accuracy with the default.
👉 Did richer text features improve performance?

1. What’s the “Mini-Challenge” asking?

You’re supposed to change how the TfidfVectorizer builds features:

Instead of just using unigrams (single words), also include bigrams (two consecutive words).

Example: "win money now" → unigrams = [win, money, now]; bigrams = [win money, money now].

This gives richer context — e.g., "free" alone vs. "free entry" or "free money".

Limit the vocabulary to the 3,000 most frequent terms across the corpus (max_features=3000).

This keeps the feature matrix from growing too large and keeps training fast.

Then compare the accuracy with the default setup (which only uses unigrams and no feature limit).
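
The two settings described above can be seen on small inputs. This sketch uses CountVectorizer, which TfidfVectorizer inherits from, so tokenization and max_features behave the same way. Stop-word removal is left off here so the output matches the "win money now" example; the toy corpus is made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The analyzer is the function a vectorizer uses to split text into features.
analyze = CountVectorizer(ngram_range=(1, 2)).build_analyzer()
features = analyze("win money now")
print(features)  # unigrams and bigrams for the example message

# Hypothetical toy corpus to show max_features; not from the SMS dataset.
toy_docs = ["free entry free prize", "free money", "call me later"]
capped = CountVectorizer(ngram_range=(1, 2), max_features=3).fit(toy_docs)
print(sorted(capped.vocabulary_))  # only 3 terms survive, ranked by corpus frequency
```

In the toy corpus "free" is the most frequent term, so it is always kept; the remaining slots go to terms that are tied on frequency.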

In [8]:
# Naive Bayes
nb_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1,2), max_features=3000)),
    ("clf", MultinomialNB())
])

# Logistic Regression
logreg_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1,2), max_features=3000)),
    ("clf", LogisticRegression(max_iter=1000))
])

#Evaluate with cross validation

scores_nb = cross_val_score(nb_pipeline, X, y, cv=5, scoring="accuracy")
scores_lr = cross_val_score(logreg_pipeline, X, y, cv=5, scoring="accuracy")

print("Naive Bayes mean accuracy:", scores_nb.mean())
print("Logistic Regression mean accuracy:", scores_lr.mean())
print("---------------------------")

#Train and Inspect

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

nb_pipeline.fit(X_train, y_train)
y_pred = nb_pipeline.predict(X_test)

print("Final Naive Bayes Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

logreg_pipeline.fit(X_train, y_train)
y_pred = logreg_pipeline.predict(X_test)

print("Final Logistic Regression Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


Naive Bayes mean accuracy: 0.9761298113693634
Logistic Regression mean accuracy: 0.9703871637777652
---------------------------
Final Naive Bayes Accuracy: 0.9838565022421525
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       966
           1       0.99      0.89      0.94       149

    accuracy                           0.98      1115
   macro avg       0.99      0.94      0.96      1115
weighted avg       0.98      0.98      0.98      1115

Final Logistic Regression Accuracy: 0.9802690582959641
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       966
           1       1.00      0.85      0.92       149

    accuracy                           0.98      1115
   macro avg       0.99      0.93      0.95      1115
weighted avg       0.98      0.98      0.98      1115



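Performance improved with the richer text features: Naive Bayes mean CV accuracy rose from 0.969 to 0.976, and Logistic Regression rose from 0.964 to 0.970. Spam recall on the test split also rose (0.84 to 0.89 for Naive Bayes, 0.77 to 0.85 for Logistic Regression). Because ngram_range and max_features were changed together, these runs cannot show which of the two changes helped.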
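
One way to see which change matters is to vary the two settings separately. The helper below is a hypothetical sketch, not part of the original notebook: compare_vectorizer_settings is a name introduced here, and it only covers Naive Bayes. It assumes X and y are the message texts and 0/1 labels defined earlier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def compare_vectorizer_settings(X, y, cv=5):
    """Return mean CV accuracy of Naive Bayes for several TF-IDF settings."""
    settings = {
        "unigrams (default)": {},
        "unigrams + bigrams": {"ngram_range": (1, 2)},
        "max_features=3000": {"max_features": 3000},
        "bigrams + max_features": {"ngram_range": (1, 2), "max_features": 3000},
    }
    results = {}
    for name, params in settings.items():
        pipeline = Pipeline([
            ("tfidf", TfidfVectorizer(stop_words="english", **params)),
            ("clf", MultinomialNB()),
        ])
        scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
        results[name] = scores.mean()
    return results


# Usage with the notebook's X and y:
# for name, acc in compare_vectorizer_settings(X, y).items():
#     print(f"{name}: {acc:.4f}")
```

Differences of a few tenths of a percent between settings may be within fold-to-fold noise, so it is worth looking at the spread of the individual fold scores as well as the mean.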