# **Spam Email Detection — Naive Bayes (Machine Learning)**

This notebook contains a simple, reproducible implementation of an email spam detection pipeline using the Naive Bayes classifier. The project uses the "Spam Email Classification" dataset from Kaggle and demonstrates data loading, preprocessing, model training, evaluation, and basic model export.

## **Step 00**: Install Necessary Packages

In [96]:
! pip install -r requirements.txt



## **Step 01**: Data Loading and Processing

In [97]:

import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")

stop_words = set(stopwords.words("english"))

# Load Data from CSV file
df = pd.read_csv("data/email.csv")


# Split into training (80%) and testing (20%)
train_size = int(0.8 * len(df))
train_data = df[:train_size]
test_data = df[train_size:]

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ezzoubair/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
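
The split above takes the first 80% of rows in file order. If the CSV happens to be sorted in any way, the two parts may not have the same class balance. Below is a minimal sketch of a shuffled split that also returns independent copies, so later column assignments do not operate on slices of `df`. The function name `shuffled_split`, the seed, and the toy frame are assumptions for illustration. Using it would change the numbers reported later in this notebook.

```python
import pandas as pd


def shuffled_split(df, train_frac=0.8, seed=42):
    """Shuffle the rows, then split them; both parts are independent copies."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    train_size = int(train_frac * len(shuffled))
    train = shuffled.iloc[:train_size].copy()
    test = shuffled.iloc[train_size:].copy()
    return train, test


# Toy data standing in for data/email.csv (hypothetical values)
toy = pd.DataFrame({
    "category": ["ham"] * 8 + ["spam"] * 2,
    "message": [f"message {i}" for i in range(10)],
})

train_toy, test_toy = shuffled_split(toy)

# Assigning to an explicit copy does not touch the original frame
train_toy["category"] = train_toy["category"].map({"spam": 1, "ham": 0})

print(len(train_toy), len(test_toy))
```

A fixed `random_state` keeps the split repeatable between runs.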


In [98]:
# Convert the category to a binary label: spam -> 1, ham -> 0
train_data["category"]=train_data["category"].map({'spam': 1, 'ham': 0})
test_data["category"]=test_data["category"].map({'spam': 1, 'ham': 0})

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train_data["category"]=train_data["category"].map({'spam': 1, 'ham': 0})
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_data["category"]=test_data["category"].map({'spam': 1, 'ham': 0})


In [99]:
# Data cleaning and preprocessing function
def clean_text(text):
    text = text.lower()
    # Remove punctuation using regEx
    text = re.sub(r"[^\w\s]", "", text)
    words = text.split()
    # Remove stopwords and short words (words of 3 characters or fewer)
    words = [w for w in words if w not in stop_words and len(w) > 3]
    return words

train_data["message"] = train_data["message"].apply(clean_text)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train_data["message"] = train_data["message"].apply(clean_text)
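
To make the effect of `clean_text` concrete, here is a self-contained sketch of the same steps run on one sentence. The stop-word set is a hypothetical three-word stand-in for the NLTK list, and `clean_text_demo` and `min_len` are illustrative names. The point is that the `len(w) > 3` filter drops three-letter words such as "won" and "you" even when they are not stop words.

```python
import re

# Hypothetical mini stop-word list, used only for this illustration;
# the notebook itself uses NLTK's English stop words.
toy_stop_words = {"your", "have", "this"}


def clean_text_demo(text, stop_words=toy_stop_words, min_len=4):
    """Lowercase, strip punctuation, then drop stop words and short words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return [w for w in text.split() if w not in stop_words and len(w) >= min_len]


tokens = clean_text_demo("You have WON a prize! Claim your reward.")
print(tokens)  # ['prize', 'claim', 'reward']
```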


**And that's it for the data manipulation we need for now!**

## **Step 02**: Feature Extraction for Text

In other words, we need to turn the cleaned tokens into numeric features that the classifier can work with.

### Create vocabulary from messages:

In [100]:
# Merge all tokens into one flat list
messages_tokens = [token for tokens in train_data["message"] for token in tokens]
print(len(messages_tokens))

# Eliminate duplicates; sorting keeps the word order stable across runs
vocabulary = sorted(set(messages_tokens))

vocab_array = np.array(vocabulary)

print(len(vocab_array))

30522
7270


In [101]:
word_to_idx = {word: i for i, word in enumerate(vocab_array)}

def vectorize_message(message):
    vec = np.zeros(len(vocab_array), dtype=int)
    for w in message:
        if w in word_to_idx:
            vec[word_to_idx[w]] += 1
    return vec



train_data["vector"] = train_data["message"].apply(vectorize_message)
test_data["vector"] = test_data["message"].apply(vectorize_message)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train_data["vector"] = train_data["message"].apply(vectorize_message)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_data["vector"] = test_data["message"].apply(vectorize_message)
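
Storing one NumPy array per row in a DataFrame column works, but `np.array(...)` on such a column yields a one-dimensional object array rather than a numeric matrix. Below is a small sketch, on a toy vocabulary, of how the same count vectors could be stacked into a regular 2-D array with `np.stack`. The names `toy_vocab`, `vectorize_demo`, and `toy_matrix` are illustrative and do not appear in the notebook.

```python
import numpy as np

# Toy vocabulary (hypothetical); the notebook builds its own from the training set
toy_vocab = ["claim", "free", "lunch", "prize"]
toy_word_to_idx = {w: i for i, w in enumerate(toy_vocab)}


def vectorize_demo(tokens):
    """Count how often each vocabulary word occurs in a token list."""
    vec = np.zeros(len(toy_vocab), dtype=int)
    for w in tokens:
        idx = toy_word_to_idx.get(w)
        if idx is not None:  # words outside the vocabulary are ignored
            vec[idx] += 1
    return vec


toy_messages = [["free", "prize", "free"], ["lunch", "tomorrow"]]

# One row per message, one column per vocabulary word
toy_matrix = np.stack([vectorize_demo(m) for m in toy_messages])
print(toy_matrix.shape)  # (2, 4)
print(toy_matrix)
```

A 2-D integer matrix allows column sums and boolean row selection without Python loops.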


In [102]:
train_category = np.array(train_data["category"])
train_matrix = np.array(train_data["vector"])

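def train_NB_0(train_matrix,train_category):
    """Estimate per-word likelihoods for ham (0) and spam (1), plus the spam prior."""
    numTrainDocs = len(train_matrix)
    numWords = len(train_matrix[0])
    # Prior P(spam): fraction of training messages labeled 1
    pSpam = sum(train_category) / float(numTrainDocs)
    # Per-class word counts (Num) and per-class total token counts (Denom)
    p0Num = np.zeros(numWords)
    p1Num = np.zeros(numWords)
    p0Denom = 0.0
    p1Denom = 0.0

    for i in range(numTrainDocs):
        if train_category[i] == 1:
            p1Num += train_matrix[i]
            p1Denom += sum(train_matrix[i])
        else:
            p0Num += train_matrix[i]
            p0Denom += sum(train_matrix[i])

    # Additive smoothing so that unseen words never get probability zero
    p1Vect = (1 + p1Num) / (2 + p1Denom)
    p0Vect = (1 + p0Num) / (2 + p0Denom)
    return p0Vect, p1Vect, pSpam


def classify_NB(vec2Classify, p0Vec, p1Vec, pClass1):
    # Compare log-scores; summing logs avoids underflow from multiplying many small probabilities
    p1 = sum(vec2Classify * np.log(p1Vec)) + np.log(pClass1)
    p0 = sum(vec2Classify * np.log(p0Vec)) + np.log(1.0 - pClass1)
    return 1 if p1 > p0 else 0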


p0V, p1V, pSpam = train_NB_0(train_matrix, train_category)
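
The smoothing in `train_NB_0` adds 1 to every word count but only 2 to the denominator, so the resulting vectors do not sum to 1 over the vocabulary. For a count-based (multinomial) model, add-one smoothing is more commonly written with the vocabulary size in the denominator. The sketch below shows how that variant could look in vectorized form. It is not the code that produced the results in this notebook, `train_multinomial_nb` and `alpha` are hypothetical names, and swapping it in would change the reported metrics.

```python
import numpy as np


def train_multinomial_nb(X, y, alpha=1.0):
    """Multinomial Naive Bayes estimates with additive (Laplace) smoothing.

    X: (n_docs, n_words) count matrix, y: labels with 1 = spam, 0 = ham.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_words = X.shape[1]

    p_spam = y.mean()                      # prior P(spam)
    spam_counts = X[y == 1].sum(axis=0)    # per-word counts in spam
    ham_counts = X[y == 0].sum(axis=0)     # per-word counts in ham

    # Denominator uses alpha * vocabulary size, so each vector sums to 1
    p1 = (spam_counts + alpha) / (spam_counts.sum() + alpha * n_words)
    p0 = (ham_counts + alpha) / (ham_counts.sum() + alpha * n_words)
    return p0, p1, p_spam


# Toy counts: 3 documents, 3 vocabulary words (hypothetical values)
X_toy = np.array([[2, 0, 1],
                  [0, 3, 0],
                  [1, 0, 0]])
y_toy = np.array([1, 0, 1])

p0_toy, p1_toy, p_spam_toy = train_multinomial_nb(X_toy, y_toy)
print(p1_toy, p1_toy.sum())
print(p0_toy, p0_toy.sum())
```

In the toy data the spam counts are `[3, 0, 1]`, so the smoothed spam vector is `[4/7, 1/7, 2/7]`; the word never seen in spam still receives a nonzero probability.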

In [103]:
# Extract test labels and vectors
test_labels = np.array(test_data["category"])
test_vectors = np.array(test_data["vector"])

# Classify each test message
predictions = np.array([classify_NB(vec, p0V, p1V, pSpam) for vec in test_vectors])

In [104]:
# Calculate True Positives, False Positives, True Negatives, False Negatives
tp = np.sum((predictions == 1) & (test_labels == 1))
fp = np.sum((predictions == 1) & (test_labels == 0))
tn = np.sum((predictions == 0) & (test_labels == 0))
fn = np.sum((predictions == 0) & (test_labels == 1))

# Accuracy
accuracy = (tp + tn) / len(test_labels)

# Precision
precision = tp / (tp + fp) if (tp + fp) > 0 else 0

# Recall
recall = tp / (tp + fn) if (tp + fn) > 0 else 0

# F1-Score
f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

# Print metrics
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1_score:.2f}")

Accuracy: 0.87
Precision: 0.00
Recall: 0.00
F1-Score: 0.00
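
A recall of 0.00 means that no spam message was caught (`tp` is 0), even though accuracy looks reasonable. Because the same metric code is needed again after any fix, it can help to wrap it in a function. The helper below is a sketch with a hypothetical name, `binary_metrics`, run on toy labels. It uses the same definitions and zero-division guards as the cell above.

```python
import numpy as np


def binary_metrics(y_true, y_pred):
    """Confusion counts plus accuracy, precision, recall and F1 for labels in {0, 1}."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)

    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (hypothetical): one hit, one false alarm, one miss, two correct ham
m = binary_metrics([1, 0, 0, 1, 0], [1, 0, 1, 0, 0])
print(m)

# A classifier that never predicts spam: decent accuracy, zero recall
all_ham = binary_metrics([1, 0, 0], [0, 0, 0])
print(all_ham)
```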


## **Step 03**: Model Evaluation and Diagnostics

In [105]:
# Diagnostic: Check what the model is predicting
print("Prediction distribution:")
print(f"Predicted Ham (0): {np.sum(predictions == 0)}")
print(f"Predicted Spam (1): {np.sum(predictions == 1)}")
print(f"Total predictions: {len(predictions)}")

print("\nActual distribution in test set:")
print(f"Actual Ham (0): {np.sum(test_labels == 0)}")
print(f"Actual Spam (1): {np.sum(test_labels == 1)}")

print("\nConfusion Matrix:")
print(f"True Positives (Spam correctly identified): {tp}")
print(f"False Positives (Ham incorrectly identified as Spam): {fp}")
print(f"True Negatives (Ham correctly identified): {tn}")
print(f"False Negatives (Spam incorrectly identified as Ham): {fn}")

# Check if the test data has the correct preprocessing
print(f"\nTest data shape: {test_data.shape}")
print(f"Test vectors shape: {test_vectors.shape}")
print(f"Vocabulary size: {len(vocab_array)}")

Prediction distribution:
Predicted Ham (0): 1115
Predicted Spam (1): 0
Total predictions: 1115

Actual distribution in test set:
Actual Ham (0): 970
Actual Spam (1): 145

Confusion Matrix:
True Positives (Spam correctly identified): 0
False Positives (Ham incorrectly identified as Spam): 0
True Negatives (Ham correctly identified): 970
False Negatives (Spam incorrectly identified as Ham): 145

Test data shape: (1115, 3)
Test vectors shape: (1115,)
Vocabulary size: 7270


In [106]:
# Check if test data preprocessing is missing
print("Sample test message before cleaning:")
print(test_data.iloc[0])

# Fix: Clean test data messages (this was missing!)
print("\nCleaning test data messages...")
test_data = test_data.copy()  # Avoid SettingWithCopyWarning
test_data["message"] = test_data["message"].apply(clean_text)

print("\nSample test message after cleaning:")
print(test_data.iloc[0])

# Re-vectorize test data with cleaned messages
print("\nRe-vectorizing test data...")
test_data["vector"] = test_data["message"].apply(vectorize_message)

# Update test vectors
test_vectors = np.array(test_data["vector"])
print(f"Updated test vectors shape: {test_vectors.shape}")

Sample test message before cleaning:
category                                                    0
message     If you want to mapquest it or something look u...
vector      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Name: 4457, dtype: object

Cleaning test data messages...

Sample test message after cleaning:
category                                                    0
message     [want, mapquest, something, look, dogwood, dri...
vector      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Name: 4457, dtype: object

Re-vectorizing test data...
Updated test vectors shape: (1115,)
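
Why did the uncleaned test set produce only ham predictions? Before the fix, each test message was still a raw string, and iterating over a string yields single characters. Every vocabulary word has at least four characters, so no character matches and the count vector is all zeros. With an all-zero vector the likelihood terms vanish and `classify_NB` compares only the priors, which favor ham. The sketch below reproduces this on a toy vocabulary; the names are illustrative, and 0.135 is the spam prior reported when the model is saved later in this notebook.

```python
import numpy as np

# Toy vocabulary (hypothetical)
toy_vocab = ["claim", "free", "prize"]
toy_word_to_idx = {w: i for i, w in enumerate(toy_vocab)}


def vectorize_demo(message):
    """Same counting logic as vectorize_message, on the toy vocabulary."""
    vec = np.zeros(len(toy_vocab), dtype=int)
    for w in message:  # a str yields characters, a list yields tokens
        if w in toy_word_to_idx:
            vec[toy_word_to_idx[w]] += 1
    return vec


raw = "free prize"
raw_sum = int(vectorize_demo(raw).sum())             # characters only
token_sum = int(vectorize_demo(raw.split()).sum())   # proper tokens
print(raw_sum, token_sum)  # 0 2

# With a zero vector, only the priors are compared
p_spam = 0.135
prior_says_spam = np.log(p_spam) > np.log(1.0 - p_spam)
print(prior_says_spam)  # False
```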


In [107]:
# Re-run predictions with properly cleaned test data
predictions = np.array([classify_NB(vec, p0V, p1V, pSpam) for vec in test_vectors])

# Recalculate metrics
tp = np.sum((predictions == 1) & (test_labels == 1))
fp = np.sum((predictions == 1) & (test_labels == 0))
tn = np.sum((predictions == 0) & (test_labels == 0))
fn = np.sum((predictions == 0) & (test_labels == 1))

# Updated metrics
accuracy = (tp + tn) / len(test_labels)
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

print("="*50)
print("CORRECTED RESULTS:")
print("="*50)
print(f"Prediction distribution:")
print(f"Predicted Ham (0): {np.sum(predictions == 0)}")
print(f"Predicted Spam (1): {np.sum(predictions == 1)}")

print(f"\nConfusion Matrix:")
print(f"True Positives: {tp}")
print(f"False Positives: {fp}")
print(f"True Negatives: {tn}")
print(f"False Negatives: {fn}")

print(f"\nPerformance Metrics:")
print(f"Accuracy: {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1-Score: {f1_score:.3f}")

CORRECTED RESULTS:
Prediction distribution:
Predicted Ham (0): 952
Predicted Spam (1): 163

Confusion Matrix:
True Positives: 136
False Positives: 27
True Negatives: 943
False Negatives: 9

Performance Metrics:
Accuracy: 0.968
Precision: 0.834
Recall: 0.938
F1-Score: 0.883


In [108]:
# Test individual messages to verify the model works
test_messages = [
    "FREE! Claim your prize now! Limited time offer!",
    "Hey, want to grab lunch tomorrow?",
    "WINNER! You've won $1000! Call now!",
    "Meeting at 3pm in conference room B",
    "Congratulations! Click here to claim your reward"
]

print("Individual Message Testing:")
print("="*60)
for msg in test_messages:
    cleaned = clean_text(msg)
    vectorized = vectorize_message(cleaned)
    prediction = classify_NB(vectorized, p0V, p1V, pSpam)
    
    print(f"Message: {msg}")
    print(f"Cleaned words: {cleaned}")
    print(f"Vector sum: {vectorized.sum()}")
    print(f"Prediction: {'SPAM' if prediction == 1 else 'HAM'}")
    print("-" * 60)

Individual Message Testing:
Message: FREE! Claim your prize now! Limited time offer!
Cleaned words: ['free', 'claim', 'prize', 'limited', 'time', 'offer']
Vector sum: 6
Prediction: SPAM
------------------------------------------------------------
Message: Hey, want to grab lunch tomorrow?
Cleaned words: ['want', 'grab', 'lunch', 'tomorrow']
Vector sum: 4
Prediction: HAM
------------------------------------------------------------
Message: WINNER! You've won $1000! Call now!
Cleaned words: ['winner', 'youve', '1000', 'call']
Vector sum: 4
Prediction: SPAM
------------------------------------------------------------
Message: Meeting at 3pm in conference room B
Cleaned words: ['meeting', 'conference', 'room']
Vector sum: 3
Prediction: HAM
------------------------------------------------------------
Message: Congratulations! Click here to claim your reward
Cleaned words: ['congratulations', 'click', 'claim', 'reward']
Vector sum: 4
Prediction: SPAM
------------------------------------------------------------
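
`classify_NB` returns only a hard label. When a rough confidence is useful, the two log-scores can be normalized into a spam probability under the Naive Bayes assumptions. The sketch below uses a hypothetical helper, `spam_probability`, with toy likelihood vectors. The values it produces are model scores, not calibrated probabilities.

```python
import numpy as np


def spam_probability(vec, p0_vec, p1_vec, p_spam):
    """Normalize the two class log-scores into P(spam | message)."""
    log_p1 = np.dot(vec, np.log(p1_vec)) + np.log(p_spam)
    log_p0 = np.dot(vec, np.log(p0_vec)) + np.log(1.0 - p_spam)

    # Subtract the larger score before exponentiating to avoid underflow
    top = max(log_p0, log_p1)
    e1 = np.exp(log_p1 - top)
    e0 = np.exp(log_p0 - top)
    return float(e1 / (e0 + e1))


# Toy likelihoods over a 3-word vocabulary (hypothetical values)
p0_toy = np.array([0.7, 0.2, 0.1])   # ham
p1_toy = np.array([0.1, 0.3, 0.6])   # spam

prob = spam_probability(np.array([0, 0, 1]), p0_toy, p1_toy, p_spam=0.5)
print(round(prob, 3))  # 0.857

# An empty message falls back to the prior
empty = spam_probability(np.array([0, 0, 0]), p0_toy, p1_toy, p_spam=0.2)
print(round(empty, 3))  # 0.2
```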

## **Step 04**: Model Persistence

Save the trained model and components for later use in production or API deployment.

In [109]:
import pickle
import os

# Create models directory if it doesn't exist
os.makedirs('models', exist_ok=True)

# Save the trained model components
model_data = {
    'p0_vector': p0V,  # Ham probabilities
    'p1_vector': p1V,  # Spam probabilities
    'p_spam': pSpam,   # Prior probability of spam
    'vocabulary': vocab_array,
    'word_to_idx': word_to_idx,
    'stop_words': stop_words
}

# Save model to pickle file
with open('models/spam_classifier_model.pkl', 'wb') as f:
    pickle.dump(model_data, f)

print("Model saved successfully!")
print("Saved components:")
print(f"- Ham probability vector: {len(p0V)} features")
print(f"- Spam probability vector: {len(p1V)} features")
print(f"- Prior spam probability: {pSpam:.3f}")
print(f"- Vocabulary size: {len(vocab_array)}")
print(f"- Model file: models/spam_classifier_model.pkl")

Model saved successfully!
Saved components:
- Ham probability vector: 7270 features
- Spam probability vector: 7270 features
- Prior spam probability: 0.135
- Vocabulary size: 7270
- Model file: models/spam_classifier_model.pkl


In [110]:
# Test loading the saved model
print("Testing model loading...")

with open('models/spam_classifier_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

print("Model loaded successfully!")
print("Loaded components:")
for key, value in loaded_model.items():
    if isinstance(value, np.ndarray):
        print(f"- {key}: shape {value.shape}")
    elif isinstance(value, (dict, set)):
        print(f"- {key}: {len(value)} items")
    else:
        print(f"- {key}: {value}")

# Test the loaded model with a sample message
test_msg = "FREE! Win $1000 now! Limited time offer!"
print(f"\nTesting loaded model with: '{test_msg}'")

# Use loaded model components
loaded_p0V = loaded_model['p0_vector']
loaded_p1V = loaded_model['p1_vector']
loaded_pSpam = loaded_model['p_spam']
loaded_vocab = loaded_model['vocabulary']
loaded_word_to_idx = loaded_model['word_to_idx']
loaded_stop_words = loaded_model['stop_words']

# Clean and vectorize test message
def clean_text_loaded(text):
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    words = text.split()
    words = [w for w in words if w not in loaded_stop_words and len(w) > 3]
    return words

def vectorize_message_loaded(message):
    vec = np.zeros(len(loaded_vocab), dtype=int)
    for w in message:
        if w in loaded_word_to_idx:
            vec[loaded_word_to_idx[w]] += 1
    return vec

cleaned = clean_text_loaded(test_msg)
vectorized = vectorize_message_loaded(cleaned)
prediction = classify_NB(vectorized, loaded_p0V, loaded_p1V, loaded_pSpam)

print(f"Prediction: {'SPAM' if prediction == 1 else 'HAM'}")
print("Model loading and testing successful!")

Testing model loading...
Model loaded successfully!
Loaded components:
- p0_vector: shape (7270,)
- p1_vector: shape (7270,)
- p_spam: 0.1350684316805026
- vocabulary: shape (7270,)
- word_to_idx: 7270 items
- stop_words: 198 items

Testing loaded model with: 'FREE! Win $1000 now! Limited time offer!'
Prediction: SPAM
Model loading and testing successful!
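
For use outside the notebook, the cleaning, vectorizing and scoring steps can be folded into one function that needs only the saved dictionary. The sketch below assumes the same keys as `model_data` above. `predict_message` and `min_len` are hypothetical names, and the toy model stands in for the real pickle file.

```python
import pickle
import re

import numpy as np


def predict_message(text, model, min_len=4):
    """Clean, vectorize and classify one message using a saved model dict."""
    words = re.sub(r"[^\w\s]", "", text.lower()).split()
    tokens = [w for w in words
              if w not in model["stop_words"] and len(w) >= min_len]

    vec = np.zeros(len(model["vocabulary"]), dtype=int)
    for w in tokens:
        idx = model["word_to_idx"].get(w)
        if idx is not None:
            vec[idx] += 1

    log_p1 = np.dot(vec, np.log(model["p1_vector"])) + np.log(model["p_spam"])
    log_p0 = np.dot(vec, np.log(model["p0_vector"])) + np.log(1.0 - model["p_spam"])
    return "SPAM" if log_p1 > log_p0 else "HAM"


# Toy model with the same keys as model_data (values are hypothetical)
toy_model = {
    "p0_vector": np.array([0.1, 0.1, 0.8]),
    "p1_vector": np.array([0.45, 0.45, 0.1]),
    "p_spam": 0.2,
    "vocabulary": np.array(["free", "prize", "lunch"]),
    "word_to_idx": {"free": 0, "prize": 1, "lunch": 2},
    "stop_words": {"your"},
}

# Round-trip through pickle in memory, mirroring the save/load cells
restored = pickle.loads(pickle.dumps(toy_model))

print(predict_message("FREE prize inside!", restored))  # SPAM
print(predict_message("Lunch tomorrow?", restored))     # HAM
```

Keeping the preprocessing inside the prediction function makes it harder to repeat the earlier mistake of vectorizing text that was never cleaned.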
