## Email Spam Detector Notebook

This notebook trains and evaluates a TF‑IDF + Linear SVM model for email spam detection.

- Core Project: [`viraj-sh/email-spam-detector`](https://github.com/viraj-sh/email-spam-detector)
- Browser integration: [`viraj-sh/email-spam-extension`](https://github.com/viraj-sh/email-spam-extension)

### Details
- Data handling: pandas
- Model persistence: joblib
- ML library: scikit-learn
- Text preprocessing: custom `clean_text` helper from `app.core.utils`
- Feature extraction: TF‑IDF (Term Frequency–Inverse Document Frequency)
- Classifier: Linear Support Vector Machine (Linear SVM)
- Hyperparameter tuning: Grid Search with Cross‑Validation



### Import

In [None]:
import os
import sys
import joblib
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, confusion_matrix

sys.path.append(os.path.abspath(".."))  # add the project root to sys.path so `app` is importable
from app.core.utils import clean_text

### Load

In [None]:
DATA_PATH = "../dataset/email_text.csv"

df = pd.read_csv(DATA_PATH)
df.head()

### Pre-Process

In [None]:
df["clean_text"] = df["text"].apply(clean_text)

In [None]:
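X = df["clean_text"]
y = df["label"]

# Hold out 15% of the data as the final test set, preserving the class ratio.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)

# 0.1765 is approximately 0.15 / 0.85, so the validation set is also about 15%
# of the full dataset, giving roughly a 70/15/15 train/validation/test split.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.1765, stratify=y_temp, random_state=42
)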
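The two-stage split can be sanity-checked on synthetic data. The sketch below uses a hypothetical 1,000-row toy dataset (not the real `email_text.csv`) and only confirms that the same `test_size` values produce roughly 70/15/15 proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1,000 samples with 30% positives (not the real dataset).
X_toy = np.arange(1000)
y_toy = np.array([1] * 300 + [0] * 700)

# Same two-stage split as in the notebook cell.
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_toy, y_toy, test_size=0.15, stratify=y_toy, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.1765, stratify=y_tmp, random_state=42
)

split_sizes = (len(X_tr), len(X_va), len(X_te))
fractions = tuple(round(n / len(X_toy), 3) for n in split_sizes)
print(split_sizes, fractions)
```

Stratifying both splits keeps the spam/ham ratio similar across all three subsets, which matters when the classes are imbalanced.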

### Train

In [None]:
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        stop_words="english",
        max_df=0.95,
        min_df=5,
        ngram_range=(1, 2),
        sublinear_tf=True
    )),
    ("clf", LinearSVC(class_weight="balanced"))
])

In [None]:
param_grid = {
    "tfidf__max_df": [0.9, 0.95],
    "tfidf__min_df": [3, 5],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.5, 1, 2],
}

In [None]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid = GridSearchCV(
    pipeline,
    param_grid,
    cv=cv,
    scoring="f1",  # F1 of the positive class (label 1)
    n_jobs=-1,
    verbose=2
)

grid.fit(X_train, y_train)

best_model = grid.best_estimator_
print("Best Params:", grid.best_params_)
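To estimate how much work the grid search does, the number of candidate settings can be counted with scikit-learn's `ParameterGrid`. This is a small sketch that repeats the notebook's `param_grid` so it runs on its own; the variable names `n_candidates`, `n_folds`, and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook cell above.
param_grid = {
    "tfidf__max_df": [0.9, 0.95],
    "tfidf__min_df": [3, 5],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.5, 1, 2],
}

n_candidates = len(ParameterGrid(param_grid))  # 2 * 2 * 2 * 3 combinations
n_folds = 5                                    # matches StratifiedKFold(n_splits=5)
n_fits = n_candidates * n_folds                # cross-validation fits only

print(n_candidates, n_fits)
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on all of `X_train`, which is the model exposed as `grid.best_estimator_`.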

### Metrics

In [None]:
val_preds = best_model.predict(X_val)
print(classification_report(y_val, val_preds))
print(confusion_matrix(y_val, val_preds))

In [None]:
# Refit the best pipeline on train + validation data before the final test evaluation.
X_final = pd.concat([X_train, X_val])
y_final = pd.concat([y_train, y_val])

best_model.fit(X_final, y_final)

In [None]:
# Final evaluation on the held-out test set, which was not used for tuning.
test_preds = best_model.predict(X_test)
print(classification_report(y_test, test_preds))
print(confusion_matrix(y_test, test_preds))
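For binary labels, the printed confusion matrix has true classes as rows and predicted classes as columns, so it can be unpacked into the four counts directly. The sketch below uses hypothetical toy labels (0 = ham, 1 = spam), not the notebook's results, to show how precision and recall for the spam class follow from those counts.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels, not taken from the notebook's dataset.
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_toy = [0, 1, 0, 1, 0, 1, 1, 0]

# With labels=[0, 1], ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1]).ravel()

spam_precision = tp / (tp + fp)  # share of predicted spam that really is spam
spam_recall = tp / (tp + fn)     # share of actual spam that was caught

print(tn, fp, fn, tp, spam_precision, spam_recall)
```

In a spam filter, `fp` counts legitimate emails flagged as spam, which is often the more costly error to watch.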

### Save Model & Label Mapping

In [None]:
os.makedirs("../app/model", exist_ok=True)

MODEL_PATH = "../app/model/spam_classifier_test.joblib"
joblib.dump(best_model, MODEL_PATH)

print("Model saved to:", MODEL_PATH)

In [None]:
label_mapping = {0: "ham", 1: "spam"}
joblib.dump(label_mapping, "../app/model/label_mapping_test.joblib")

### Test

In [None]:
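# Note: this loads the `*_v1` artifacts, not the `*_test` files saved above.
# Point both paths at the `*_test` files to check the model trained in this notebook.
loaded_model = joblib.load("../app/model/spam_classifier_v1.joblib")
label_mapping = joblib.load("../app/model/label_mapping_v1.joblib")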

sample_emails = [
    "Congratulations! You have won a $1000 Walmart gift card. Click here.",
    "Hey John, can you review the attached report before Monday?",
    "Comet Just Landed On Myntra ☄️"
]

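# Apply the same cleaning used at training time before predicting.
cleaned_emails = [clean_text(email) for email in sample_emails]

preds = loaded_model.predict(cleaned_emails)
labels = [label_mapping[p] for p in preds]
# Signed distance to the decision boundary; positive values favor class 1 (spam).
decision_score = loaded_model.decision_function(cleaned_emails)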

print(preds)
print(labels)
print(decision_score)
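The decision scores relate to the predicted labels in a simple way: for a binary `LinearSVC`, `predict` is equivalent to thresholding `decision_function` at zero. The sketch below illustrates this with a hypothetical toy corpus and a small default pipeline. It is not the saved model, and the scores it prints are not comparable to the notebook's.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus (not the notebook's dataset): 0 = ham, 1 = spam.
toy_texts = [
    "win a free prize now",
    "claim your free gift card",
    "free cash click here",
    "meeting moved to monday",
    "please review the attached report",
    "lunch at noon tomorrow",
]
toy_labels = [1, 1, 1, 0, 0, 0]

toy_model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
])
toy_model.fit(toy_texts, toy_labels)

toy_samples = ["free prize inside", "report for the monday meeting"]
toy_scores = toy_model.decision_function(toy_samples)
toy_preds = toy_model.predict(toy_samples)

# Positive score -> class 1, otherwise class 0.
thresholded = (toy_scores > 0).astype(int)
print(toy_scores, toy_preds, thresholded)
```

The magnitude of a score indicates how far a sample lies from the decision boundary. It is not a probability, so a score near zero is best read as a low-confidence prediction rather than as a calibrated likelihood.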