# Reproduced Baseline — Classical ML on Enron Spam Dataset

This notebook reproduces a classical machine learning spam classifier using the
**Enron spam dataset** (`enron_spam_data.csv`).

Models used:
- TF–IDF + Logistic Regression (with class weighting)
- TF–IDF + Multinomial Naive Bayes

The goal is to:
1. Load and inspect the dataset.
2. Build a simple text representation from the `Subject` and `Message` fields.
3. Train/evaluate classical models on a train/test split.
4. Report precision, recall, F1, and confusion matrices.


In [6]:
%pip install pandas scikit-learn numpy


# Imports
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np



## 1. Load the Enron Spam Dataset

Adjust `DATA_PATH` if needed depending on where you place the CSV in your repo.
For example, you might set it to `"data/enron_spam_data.csv"`.


In [None]:
DATA_PATH = "data/enron_spam_data.csv"  # update if your CSV lives somewhere else

df = pd.read_csv(DATA_PATH)
print("Shape:", df.shape)
df.head()


Shape: (33716, 5)


| | Unnamed: 0 | Subject | Message | Spam/Ham | Date |
|---|---|---|---|---|---|
| 0 | 0 | christmas tree farm pictures | | ham | 1999-12-10 |
| 1 | 1 | vastar resources , inc . | gary , production from the high island larger ... | ham | 1999-12-13 |
| 2 | 2 | calpine daily gas nomination | - calpine daily gas nomination 1 . doc | ham | 1999-12-14 |
| 3 | 3 | re : issue | fyi - see note below - already done .\nstella\... | ham | 1999-12-14 |
| 4 | 4 | meter 7268 nov allocation | fyi .\n- - - - - - - - - - - - - - - - - - - -... | ham | 1999-12-14 |


Columns in the dataset:
- `Unnamed: 0`: row index carried over from the CSV (not used here)
- `Subject`: email subject line
- `Message`: email body text
- `Spam/Ham`: textual label (`"spam"` or `"ham"`)
- `Date`: timestamp (not used directly here)

We will:
- Combine `Subject` and `Message` into a single `text` field.
- Map `Spam/Ham` to a binary label: `1` = spam, `0` = ham.


In [9]:
# Basic cleaning and feature construction
df['Subject'] = df['Subject'].fillna("")
df['Message'] = df['Message'].fillna("")

df['text'] = df['Subject'].astype(str) + " " + df['Message'].astype(str)

# Map Spam/Ham labels to binary 0/1
label_map = {"ham": 0, "spam": 1}
df['label'] = df['Spam/Ham'].map(label_map)

print(df[['Spam/Ham', 'label']].value_counts())
df[['text', 'Spam/Ham', 'label']].head()


Spam/Ham  label
spam      1        17171
ham       0        16545
Name: count, dtype: int64


| | text | Spam/Ham | label |
|---|---|---|---|
| 0 | christmas tree farm pictures | ham | 0 |
| 1 | vastar resources , inc . gary , production fro... | ham | 0 |
| 2 | calpine daily gas nomination - calpine daily g... | ham | 0 |
| 3 | re : issue fyi - see note below - already done... | ham | 0 |
| 4 | meter 7268 nov allocation fyi .\n- - - - - - -... | ham | 0 |
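
`Series.map` with a dictionary silently produces `NaN` for any value that is not a key, so it can be worth confirming that every row received a label. The sketch below uses a small hypothetical frame (`toy`, not the Enron data) with one oddly cased label and one missing label to show the check; the normalization step is an assumption about what cleanup might be needed, not something the dataset is known to require.

```python
import pandas as pd

# Hypothetical toy data: one label has stray case/whitespace, one is missing.
toy = pd.DataFrame({"Spam/Ham": ["ham", "spam", "Spam ", None]})
label_map = {"ham": 0, "spam": 1}

# Mapping the raw strings leaves anything outside the dictionary as NaN.
raw_unmapped = int(toy["Spam/Ham"].map(label_map).isna().sum())

# Normalizing case and whitespace first recovers the "Spam " row.
normalized = toy["Spam/Ham"].str.strip().str.lower()
clean_unmapped = int(normalized.map(label_map).isna().sum())

print("unmapped before normalizing:", raw_unmapped)
print("unmapped after normalizing:", clean_unmapped)
```

On the real data, the `value_counts()` output above already accounts for all 33,716 rows (17,171 + 16,545), which suggests no labels were left unmapped.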


## 2. Train/test split

We will use an 80/20 split, stratified by the label to preserve the spam/ham ratio.


In [10]:
X = df['text']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train size:", X_train.shape[0])
print("Test size:", X_test.shape[0])


Train size: 26972
Test size: 6744
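
The split sizes can be checked by hand. This is a minimal sketch that assumes scikit-learn rounds a fractional `test_size` up and gives the remainder to the training set, which is consistent with the sizes printed above. The spam counts come from the `value_counts()` output and from the `support` column of the classification reports later in the notebook.

```python
import math

n_samples = 33716   # rows reported by df.shape
test_size = 0.2

# 0.2 * 33716 = 6743.2, which is rounded up for the test split.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Stratification should keep the spam share nearly identical in both sets.
spam_share_full = 17171 / n_samples
spam_share_test = 3435 / n_test

print("train:", n_train, "test:", n_test)
print("spam share (full):", round(spam_share_full, 4))
print("spam share (test):", round(spam_share_test, 4))
```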


## 3. TF–IDF Vectorization

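We convert the raw text into TF–IDF features, keeping at most 20,000 unigram and bigram terms.
The vectorizer is fit on the training set only and then applied to both the train and test sets,
so no test-set vocabulary or document frequencies leak into the features.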


In [11]:
vectorizer = TfidfVectorizer(max_features=20000, ngram_range=(1, 2))
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

X_train_tfidf.shape, X_test_tfidf.shape


((26972, 20000), (6744, 20000))
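
To illustrate why the vectorizer is fit on the training data and only applied to the test data, here is a small sketch on hypothetical toy documents (not taken from the Enron data). A word that appears only in the test document never enters the vocabulary, so it contributes nothing to the test features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only.
train_docs = [
    "cheap meds online",
    "meeting agenda attached",
    "cheap flights online",
]
test_docs = ["cheap crypto online"]  # "crypto" never occurs in training

vec = TfidfVectorizer()
X_train_toy = vec.fit_transform(train_docs)  # learns vocabulary and IDF weights
X_test_toy = vec.transform(test_docs)        # reuses them unchanged

print("vocabulary size:", len(vec.vocabulary_))
print("'crypto' in vocabulary:", "crypto" in vec.vocabulary_)
print("non-zero test features:", X_test_toy.nnz)
```

Calling `fit_transform` on the test set instead would build a different vocabulary and different IDF weights, so the resulting columns would no longer line up with what the model was trained on.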

## 4. Logistic Regression Baseline

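We set `class_weight='balanced'`, which weights each class inversely to its frequency.
The dataset is already close to balanced (17,171 spam vs. 16,545 ham), so the effect here should be small.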


In [12]:
log_reg = LogisticRegression(max_iter=1000, class_weight='balanced')
log_reg.fit(X_train_tfidf, y_train)

y_pred_lr = log_reg.predict(X_test_tfidf)
print("Logistic Regression results:\n")
print(classification_report(y_test, y_pred_lr, target_names=['ham', 'spam']))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred_lr))


Logistic Regression results:

              precision    recall  f1-score   support

         ham       1.00      1.00      1.00      3309
        spam       1.00      1.00      1.00      3435

    accuracy                           1.00      6744
   macro avg       1.00      1.00      1.00      6744
weighted avg       1.00      1.00      1.00      6744

Confusion matrix:
 [[3305    4]
 [   0 3435]]
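
For reference, scikit-learn documents the `'balanced'` heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula by hand. The training-split class counts are not printed in this notebook; they are derived here by subtracting the test `support` values from the full-dataset counts.

```python
# Training-split class counts, derived from numbers shown in the notebook:
# full-dataset counts minus the test-split "support" values.
n_ham = 16545 - 3309
n_spam = 17171 - 3435
n_samples = n_ham + n_spam
n_classes = 2

# 'balanced' weight for each class: n_samples / (n_classes * class_count)
w_ham = n_samples / (n_classes * n_ham)
w_spam = n_samples / (n_classes * n_spam)

print(f"ham weight:  {w_ham:.4f}")
print(f"spam weight: {w_spam:.4f}")
```

Both weights land within about 2% of 1.0, so the weighted and unweighted models would be expected to behave similarly on this data.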


## 5. Multinomial Naive Bayes Baseline


In [13]:
nb = MultinomialNB()
nb.fit(X_train_tfidf, y_train)

y_pred_nb = nb.predict(X_test_tfidf)
print("Naive Bayes results:\n")
print(classification_report(y_test, y_pred_nb, target_names=['ham', 'spam']))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred_nb))


Naive Bayes results:

              precision    recall  f1-score   support

         ham       1.00      0.98      0.99      3309
        spam       0.98      1.00      0.99      3435

    accuracy                           0.99      6744
   macro avg       0.99      0.99      0.99      6744
weighted avg       0.99      0.99      0.99      6744

Confusion matrix:
 [[3242   67]
 [   0 3435]]
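
Because the classification reports round to two decimals, the two models look closer than they are. The sketch below recomputes a few metrics directly from the two confusion matrices printed above; `spam_metrics` is a hypothetical helper written for this illustration, and it assumes scikit-learn's layout of true labels in rows and predictions in columns, ordered `[ham, spam]`.

```python
def spam_metrics(cm):
    """Compute metrics from a 2x2 confusion matrix.

    Rows are true labels and columns are predictions, ordered [ham, spam].
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "spam_precision": tp / (tp + fp),
        "spam_recall": tp / (tp + fn),
        "ham_recall": tn / (tn + fp),
    }

# Matrices copied from the outputs above.
lr = spam_metrics([[3305, 4], [0, 3435]])
nb = spam_metrics([[3242, 67], [0, 3435]])

for name, metrics in [("Logistic Regression", lr), ("Naive Bayes", nb)]:
    print(name, {k: round(v, 4) for k, v in metrics.items()})
```

Both models catch every spam email in the test split; the difference lies in how many ham emails are wrongly flagged (4 vs. 67).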


## 6. Summary

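This notebook provides a reproducible classical baseline on the Enron spam dataset.
On the 6,744-email test split, Logistic Regression misclassified 4 ham emails as spam and
missed no spam, while Multinomial Naive Bayes misclassified 67 ham emails as spam and also
missed no spam. Use these metrics (precision, recall, F1, confusion matrices) as the baseline
for your capstone project and compare future models against them.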
