# **Spam Email Detection — Naive Bayes (Machine Learning)**

This notebook contains a simple, reproducible implementation of an email spam detection pipeline using the Naive Bayes classifier. The project uses the "Spam Email Classification" dataset from Kaggle and demonstrates data loading, preprocessing, model training, evaluation, and basic model export.

## **Step 00** : Install necessary packages

In [107]:
! pip install numpy pandas nltk



## **Step 01** : Data loading and processing

In [108]:

import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")

stop_words = set(stopwords.words("english"))

# Load the data from the CSV file
df = pd.read_csv("data/email.csv")

# Print a summary of the dataset (columns, non-null counts, dtypes)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   object
 1   Message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ezzoubair/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [109]:
# Print the first and last 5 rows to check the data
print(df.head(5))
print(df.tail(5))

  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...
     Category                                            Message
5567     spam  This is the 2nd time we have tried 2 contact u...
5568      ham               Will ü b going to esplanade fr home?
5569      ham  Pity, * was in mood for that. So...any other s...
5570      ham  The guy did some bitching but I acted like i'd...
5571      ham                         Rofl. Its true to its name


In [110]:
# Convert the category to binary labels: 1 if the message is spam, 0 if it is not (ham)
df.loc[df["Category"] == "ham","Category"] = 0
df.loc[df["Category"] == "spam","Category"] = 1


# Print the first and last 5 rows to check the new labels
print(df.head(5))
print(df.tail(5))

  Category                                            Message
0        0  Go until jurong point, crazy.. Available only ...
1        0                      Ok lar... Joking wif u oni...
2        1  Free entry in 2 a wkly comp to win FA Cup fina...
3        0  U dun say so early hor... U c already then say...
4        0  Nah I don't think he goes to usf, he lives aro...
     Category                                            Message
5567        1  This is the 2nd time we have tried 2 contact u...
5568        0               Will ü b going to esplanade fr home?
5569        0  Pity, * was in mood for that. So...any other s...
5570        0  The guy did some bitching but I acted like i'd...
5571        0                         Rofl. Its true to its name
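
The label conversion above can also be written as a single mapping step. The sketch below is a minimal illustration on a toy DataFrame (the rows and the names `toy` and `label_map` are made up for the example, not taken from the dataset). It assumes the only labels are `ham` and `spam`, and it checks that assumption before casting, because `Series.map` turns any unmapped label into `NaN`.

```python
import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only)
toy = pd.DataFrame(
    {
        "Category": ["ham", "spam", "ham"],
        "Message": ["see you later", "win a free prize", "ok"],
    }
)

# Hypothetical mapping: 0 = ham (not spam), 1 = spam
label_map = {"ham": 0, "spam": 1}
toy["Category"] = toy["Category"].map(label_map)

# Labels missing from label_map become NaN, so verify before casting
if toy["Category"].isna().any():
    raise ValueError("Found a label that is neither 'ham' nor 'spam'")

# Store the labels as integers rather than generic Python objects
toy["Category"] = toy["Category"].astype(int)

# Count how many messages fall into each class
class_counts = toy["Category"].value_counts().to_dict()
print(class_counts)
```

Compared with assigning through `df.loc`, which leaves the column with the `object` dtype, this variant ends with an integer column, which is usually more convenient when the labels are later passed to a classifier.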


In [None]:
# Data cleaning and preprocessing function
def clean_text(text):
    """Lowercase a message, strip punctuation, and return its filtered tokens."""
    text = text.lower()
    # Remove punctuation with a regex (keeps only word characters and whitespace)
    text = re.sub(r"[^\w\s]", "", text)
    words = text.split()
    # Drop stopwords and short words (3 characters or fewer)
    words = [w for w in words if w not in stop_words and len(w) > 3]
    return words

df["Message"] = df["Message"].apply(clean_text)


   Category                                            Message
0         0  [jurong, point, crazy, available, bugis, great...
1         0                                           [joking]
2         1  [free, entry, wkly, comp, final, tkts, 21st, 2...
3         0                                   [early, already]
4         0         [dont, think, goes, lives, around, though]
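
To see what the cleaning step does to a single message, here is a small standalone sketch of the same logic. It is an illustration only: `clean_text_demo` and `DEMO_STOP_WORDS` are hypothetical names, and the tiny hand-written stopword set is an assumption that stands in for NLTK's full English list so the example runs without any download.

```python
import re

# Assumption: a tiny stand-in for NLTK's English stopword list
DEMO_STOP_WORDS = {"i", "he", "to", "in", "a", "here", "the", "and"}


def clean_text_demo(text, stop_words=DEMO_STOP_WORDS):
    """Apply the same steps as clean_text, with an explicit stopword set."""
    text = text.lower()
    # Strip everything that is not a word character or whitespace
    text = re.sub(r"[^\w\s]", "", text)
    # Keep tokens longer than 3 characters that are not stopwords
    return [w for w in text.split() if w not in stop_words and len(w) > 3]


spam_tokens = clean_text_demo("Free entry in 2 a wkly comp to win FA Cup final tkts!")
ham_tokens = clean_text_demo("Nah I don't think he goes to usf, he lives around here though")

print(spam_tokens)  # ['free', 'entry', 'wkly', 'comp', 'final', 'tkts']
print(ham_tokens)   # ['dont', 'think', 'goes', 'lives', 'around', 'though']
```

Two side effects are worth noticing. Removing punctuation before the stopword check turns `don't` into `dont`, which survives the filter. The length threshold also discards short but potentially informative tokens such as `win` and `cup`.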


**And that's it for the data manipulation we need for now!**