# **Spam Email Detection — Naive Bayes (Machine Learning)**

This notebook contains a simple, reproducible implementation of an email spam detection pipeline using the Naive Bayes classifier. The project uses the "Spam Email Classification" dataset from Kaggle and demonstrates data loading, preprocessing, model training, evaluation, and basic model export.

## **Step 00** : Install necessary packages

In [1]:
! pip install numpy pandas nltk



## **Step 01** : Data loading and Processing

In [2]:

import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")

stop_words = set(stopwords.words("english"))

# Load the dataset from the CSV file
df = pd.read_csv("data/email.csv")

# Summarize the columns, dtypes, and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   object
 1   Message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ezzoubair/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
# Print the first and last 5 rows as a quick check
print(df.head(5))
print(df.tail(5))

  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...
     Category                                            Message
5567     spam  This is the 2nd time we have tried 2 contact u...
5568      ham               Will ü b going to esplanade fr home?
5569      ham  Pity, * was in mood for that. So...any other s...
5570      ham  The guy did some bitching but I acted like i'd...
5571      ham                         Rofl. Its true to its name


In [4]:
# Encode the label as binary: spam -> 1, ham (not spam) -> 0
df.loc[df["Category"] == "ham","Category"] = 0
df.loc[df["Category"] == "spam","Category"] = 1
# Store the labels with an integer dtype instead of generic objects
df["Category"] = df["Category"].astype(int)


# Print the first and last 5 rows to confirm the encoding
print(df.head(5))
print(df.tail(5))

  Category                                            Message
0        0  Go until jurong point, crazy.. Available only ...
1        0                      Ok lar... Joking wif u oni...
2        1  Free entry in 2 a wkly comp to win FA Cup fina...
3        0  U dun say so early hor... U c already then say...
4        0  Nah I don't think he goes to usf, he lives aro...
     Category                                            Message
5567        1  This is the 2nd time we have tried 2 contact u...
5568        0               Will ü b going to esplanade fr home?
5569        0  Pity, * was in mood for that. So...any other s...
5570        0  The guy did some bitching but I acted like i'd...
5571        0                         Rofl. Its true to its name


In [5]:
# Data cleaning and preprocessing function
def clean_text(text):
    """Lowercase a message, strip punctuation, and return its filtered tokens."""
    text = text.lower()
    # Remove punctuation (anything that is not a word character or whitespace)
    text = re.sub(r"[^\w\s]", "", text)
    words = text.split()
    # Drop stopwords and short words (3 characters or fewer)
    words = [w for w in words if w not in stop_words and len(w) > 3]
    return words

df["Message"] = df["Message"].apply(clean_text)
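Because `clean_text` deletes punctuation without leaving a space, words joined only by punctuation are fused: `So...any` becomes `soany`, which shows up in the cleaned rows further down. The sketch below contrasts that behaviour with a variant that replaces punctuation with a space. `DEMO_STOP_WORDS`, `clean_text_joined`, and `clean_text_spaced` are illustrative names, and the tiny stop-word set is only a stand-in for the NLTK list so the example runs without a download.

```python
import re

# Hypothetical, tiny stop-word set (assumption: stands in for NLTK's English list).
DEMO_STOP_WORDS = {"the", "was", "in", "for", "that", "so", "any", "other"}

def clean_text_joined(text, stop_words=DEMO_STOP_WORDS):
    """Same steps as clean_text: punctuation is deleted outright."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return [w for w in text.split() if w not in stop_words and len(w) > 3]

def clean_text_spaced(text, stop_words=DEMO_STOP_WORDS):
    """Variant: punctuation becomes a space, so neighbouring words stay separate."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return [w for w in text.split() if w not in stop_words and len(w) > 3]

sample = "Pity, * was in mood for that. So...any other suggestions?"
print(clean_text_joined(sample))  # ['pity', 'mood', 'soany', 'suggestions']
print(clean_text_spaced(sample))  # ['pity', 'mood', 'suggestions']
```

Switching the notebook to the spaced variant would also split contractions (`don't` becomes `don` and `t`, both of which the length filter then drops), so the token and vocabulary counts would differ from the ones printed in this notebook.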


In [6]:
# Inspect Data
df.head(5)

  Category                                            Message
0        0  [jurong, point, crazy, available, bugis, great...
1        0                                           [joking]
2        1  [free, entry, wkly, comp, final, tkts, 21st, 2...
3        0                                   [early, already]
4        0         [dont, think, goes, lives, around, though]


**And that's it for the data manipulation we need for now!**

## **Step 02** : Feature Extraction for Text
In other words, we need to turn the cleaned tokens into numeric features that the classifier can work with.

### Create vocabulary from messages :

In [7]:
# Merge all tokens into one big list
messages_tokens = [token for tokens in df["Message"] for token in tokens]
print(len(messages_tokens))

# Eliminate duplicates; sorting keeps the word order stable between runs
vocabulary = sorted(set(messages_tokens))

vocab_array = np.array(vocabulary)

print(len(vocab_array))

38061
8295


In [8]:
word_to_idx = {word: i for i, word in enumerate(vocab_array)}

def vectorize_message(message):
    vec = np.zeros(len(vocab_array), dtype=int)
    for w in message:
        if w in word_to_idx:
            vec[word_to_idx[w]] += 1
    return vec



df["vector"] = df["Message"].apply(vectorize_message)

df.tail(5)

     Category                                            Message                                             vector
5567        1  [time, tried, contact, pound, prize, claim, ea...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
5568        0                           [going, esplanade, home]  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
5569        0                   [pity, mood, soany, suggestions]  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
5570        0  [bitching, acted, like, interested, buying, so...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
5571        0                                 [rofl, true, name]  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
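With more than 8,000 vocabulary entries, the printed vectors are almost entirely zeros, which makes the encoding hard to see. The following toy sketch applies the same bag-of-words idea to a six-word vocabulary. The messages and the `toy_` names are made up for illustration and are not part of the dataset.

```python
import numpy as np

# Hypothetical tokenized messages (assumption: already cleaned).
toy_messages = [
    ["free", "prize", "claim"],
    ["meeting", "tomorrow"],
    ["free", "free", "entry"],
]

# Sorted vocabulary so each word always maps to the same column.
toy_vocab = sorted({w for msg in toy_messages for w in msg})
toy_word_to_idx = {w: i for i, w in enumerate(toy_vocab)}

def toy_vectorize(message):
    """Count how often each vocabulary word occurs; unknown words are ignored."""
    vec = np.zeros(len(toy_vocab), dtype=int)
    for w in message:
        idx = toy_word_to_idx.get(w)
        if idx is not None:
            vec[idx] += 1
    return vec

print(toy_vocab)
# ['claim', 'entry', 'free', 'meeting', 'prize', 'tomorrow']
print(toy_vectorize(["free", "free", "entry", "unknown"]))
# [0 1 2 0 0 0]
```

Each position holds the count for one vocabulary word, so `free` appearing twice gives a 2 in its column, and a word outside the vocabulary leaves the vector unchanged.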


In [9]:
train_category = np.array(df["Category"])
train_matrix = np.array(df["vector"])

def trainNB0(train_matrix, train_category):
    """Estimate per-word probabilities for ham (0) and spam (1) plus the spam prior."""
    numTrainDocs = len(train_matrix)
    numWords = len(train_matrix[0])
    # Prior probability of spam: the fraction of messages labeled 1
    pSpam = sum(train_category) / float(numTrainDocs)
    # Per-word counts and total token counts for each class
    p0Num = np.zeros(numWords)
    p1Num = np.zeros(numWords)
    p0Denom = 0.0
    p1Denom = 0.0

    for i in range(numTrainDocs):
        if train_category[i] == 1:
            p1Num += train_matrix[i]
            p1Denom += train_matrix[i].sum()
        else:
            p0Num += train_matrix[i]
            p0Denom += train_matrix[i].sum()

    # Smooth the estimates (+1 per count, +2 per denominator) so no word gets probability 0
    p1Vect = (1 + p1Num) / (2 + p1Denom)
    p0Vect = (1 + p0Num) / (2 + p0Denom)
    return p0Vect, p1Vect, pSpam

p0V, p1V, pSpam = trainNB0(train_matrix, train_category)
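`trainNB0` adds 1 to every word count and 2 to each class total. In the usual Laplace (add-one) smoothing for a multinomial model, the denominator instead grows by the vocabulary size, so the word probabilities of each class sum to 1. The sketch below shows that variant in vectorized form on a toy count matrix. `train_nb_vectorized`, `alpha`, and the toy data are assumptions for illustration, not the notebook's own code, and swapping it in could change the scores.

```python
import numpy as np

def train_nb_vectorized(X, y, alpha=1.0):
    """Multinomial Naive Bayes estimates with Laplace smoothing.

    X: (n_docs, n_words) matrix of word counts; y: 0/1 labels (1 = spam).
    Returns (p0, p1, p_spam), mirroring trainNB0's return order.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=int)
    n_words = X.shape[1]
    spam_counts = X[y == 1].sum(axis=0)  # per-word counts over spam documents
    ham_counts = X[y == 0].sum(axis=0)   # per-word counts over ham documents
    # Add alpha to every count and alpha * n_words to the class total.
    p1 = (spam_counts + alpha) / (spam_counts.sum() + alpha * n_words)
    p0 = (ham_counts + alpha) / (ham_counts.sum() + alpha * n_words)
    return p0, p1, float(y.mean())

# Hypothetical counts for a 3-word vocabulary (rows are documents).
X_toy = np.array([[2, 0, 1], [0, 3, 0], [1, 0, 2], [0, 1, 1]])
y_toy = np.array([1, 0, 1, 0])

p0, p1, p_spam = train_nb_vectorized(X_toy, y_toy)
print(p1, p1.sum())  # spam word probabilities 4/9, 1/9, 4/9; they sum to 1
print(p0, p0.sum())  # ham word probabilities 1/8, 5/8, 2/8; they sum to 1
print(p_spam)        # 0.5
```

The boolean masks replace the explicit loop over documents, which matters once the matrix has thousands of rows and columns.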

In [None]:
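def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    """Return 1 (spam) or 0 (ham) by comparing log-posterior scores."""
    # Sum log-probabilities instead of multiplying probabilities to avoid underflow
    p1 = np.sum(vec2Classify * np.log(p1Vec)) + np.log(pClass1)
    p0 = np.sum(vec2Classify * np.log(p0Vec)) + np.log(1.0 - pClass1)
    return 1 if p1 > p0 else 0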



test_email = """
Hello Valued Customer,

CONGRATULATIONS!!! You have been randomly SELECTED to receive an exclusive $2,000 shopping voucher to celebrate our anniversary. This is a limited time offer and will expire within 24 hours.

To claim your reward, simply confirm your account by clicking the link below and verifying your information. Failure to claim now will result in the voucher being awarded to another winner.

[CLAIM YOUR PRIZE — LINK REMOVED FOR SAFETY]

Benefits you get immediately:
• $2,000 voucher usable at hundreds of top stores
• Free expedited shipping for one year
• VIP support and bonus coupons

This is a one-time offer sent to a small number of customers. Don't miss out — act now!

Warm regards,
Rewards Team
Freetreats Rewards Dept.
Contact: support@freetreats-notice.com
"""


test_vector = vectorize_message(clean_text(test_email))


print(test_email, "→", classifyNB(test_vector, p0V, p1V, pSpam))





Hello Valued Customer,

CONGRATULATIONS!!! You have been randomly SELECTED to receive an exclusive $2,000 shopping voucher to celebrate our anniversary. This is a limited time offer and will expire within 24 hours.

To claim your reward, simply confirm your account by clicking the link below and verifying your information. Failure to claim now will result in the voucher being awarded to another winner.

[CLAIM YOUR PRIZE — LINK REMOVED FOR SAFETY]

Benefits you get immediately:
• $2,000 voucher usable at hundreds of top stores
• Free expedited shipping for one year
• VIP support and bonus coupons

This is a one-time offer sent to a small number of customers. Don't miss out — act now!

Warm regards,
Rewards Team
Freetreats Rewards Dept.
Contact: support@freetreats-notice.com
 → 1
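The model above is trained on all 5572 messages and checked against a single hand-written email, so it gives no estimate of how well the classifier generalizes. A common way to get one is to hold out part of the data before training. The sketch below shows the mechanics on a small, deliberately separable toy matrix. The data, the `train` and `predict` helpers, the seed, and the 3-row test size are all assumptions for illustration, not results from the dataset.

```python
import numpy as np

def train(X, y):
    """Add-one smoothed word probabilities per class (vocabulary size in the denominator)."""
    n_words = X.shape[1]
    p1 = (X[y == 1].sum(axis=0) + 1) / (X[y == 1].sum() + n_words)
    p0 = (X[y == 0].sum(axis=0) + 1) / (X[y == 0].sum() + n_words)
    return p0, p1, y.mean()

def predict(X, p0, p1, p_spam):
    """Compare the log-posterior scores of both classes for every row of X."""
    score1 = X @ np.log(p1) + np.log(p_spam)
    score0 = X @ np.log(p0) + np.log(1.0 - p_spam)
    return (score1 > score0).astype(int)

# Hypothetical counts over a 4-word vocabulary:
# spam rows only use words 0-1, ham rows only use words 2-3.
spam_rows = [[3, 2, 0, 0], [2, 3, 0, 0], [4, 1, 0, 0], [1, 4, 0, 0], [3, 3, 0, 0]]
ham_rows = [[0, 0, 3, 2], [0, 0, 2, 3], [0, 0, 4, 1], [0, 0, 1, 4], [0, 0, 3, 3]]
X = np.array(spam_rows + ham_rows)
y = np.array([1] * 5 + [0] * 5)

# Shuffle once with a fixed seed, then hold out 3 rows for testing.
rng = np.random.default_rng(42)
order = rng.permutation(len(X))
test_idx, train_idx = order[:3], order[3:]

p0, p1, p_spam = train(X[train_idx], y[train_idx])
y_pred = predict(X[test_idx], p0, p1, p_spam)
y_true = y[test_idx]

# Accuracy plus the four confusion-matrix counts (spam is the positive class).
accuracy = float((y_pred == y_true).mean())
tp = int(((y_pred == 1) & (y_true == 1)).sum())
fp = int(((y_pred == 1) & (y_true == 0)).sum())
fn = int(((y_pred == 0) & (y_true == 1)).sum())
tn = int(((y_pred == 0) & (y_true == 0)).sum())
print(f"accuracy={accuracy:.2f} tp={tp} fp={fp} fn={fn} tn={tn}")
```

On the real data, the same split would be applied to `train_matrix` and `train_category` before calling `trainNB0`. Because ham messages far outnumber spam in most collections of this kind, the false-positive and false-negative counts are usually more informative than accuracy alone.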
