# 📧🛡️ Spam Email Detection Using NLP & Naive Bayes Algorithm

### ✅ Business Problem Statement

In today's digital communication landscape, spam emails pose a significant threat to productivity, privacy, and security. 

Filtering spam accurately helps organizations reduce phishing attacks, unnecessary data overload, and employee distractions. 

The goal of this case study is to build a machine learning model that can classify incoming emails as spam or not spam (ham) using the Naive Bayes algorithm and natural language processing (NLP) techniques.



In [39]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB # for Naive bayes algorithm
from sklearn.metrics import accuracy_score, classification_report


### Step 1: Load dataset

In [40]:
data = pd.read_csv("spam.csv", encoding="latin-1")  

# 📌 encoding="latin-1" is needed because the file contains bytes that are not valid UTF-8,
# so pandas' default utf-8 decoding fails with a UnicodeDecodeError.
# latin-1 maps every possible byte to a character, so the read always succeeds,
# although a few non-ASCII characters may come out as the wrong symbol.
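
The difference between the two codecs can be seen without the dataset. The sketch below decodes a hypothetical byte string (an assumption for illustration, not a line from `spam.csv`) with both encodings.

```python
# Hypothetical sample: the byte 0xA3 is the pound sign in latin-1,
# but on its own it is not a valid UTF-8 sequence.
raw = b"Win \xa3100 cash now"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# latin-1 assigns a character to each of the 256 byte values, so it never raises.
latin1_text = raw.decode("latin-1")

print("utf-8 decoded:", utf8_ok)
print("latin-1 text:", latin1_text)
```

Because latin-1 never fails, it can also hide a wrong guess about the file's real encoding, so it is worth spot-checking a few rows that contain non-ASCII characters.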


In [41]:
data.tail()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,
5571,ham,Rofl. Its true to its name,,,


In [42]:
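data = data[["v1", "v2"]].copy()  # Keep only the label and text columns; copy so later edits do not act on a slice of the original frame

data.columns = ["label", "text"]  # Rename to descriptive column names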


In [44]:
data.head(10)

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...




### Step 2: Preprocessing

In [45]:
data["label"] = data["label"].map({"ham": 0, "spam": 1})  

# Convert text labels to binary format (0 = ham / not spam, 1 = spam) for model training.
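
One thing to watch with `Series.map` is that any label missing from the dictionary becomes `NaN` instead of raising an error. The sketch below shows a quick check on a hypothetical label column (the values are made up for illustration).

```python
import pandas as pd

# Hypothetical labels, including one with unexpected casing.
labels = pd.Series(["ham", "spam", "Spam", "ham"])

mapped = labels.map({"ham": 0, "spam": 1})

# Series.map returns NaN for values missing from the dictionary.
unmapped = labels[mapped.isna()].unique().tolist()
print("Unmapped labels:", unmapped)
```

An empty list here means every row received a 0 or 1; anything else points to labels that need cleaning before training.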

In [46]:
data.head(10)

Unnamed: 0,label,text
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."
5,1,FreeMsg Hey there darling it's been 3 week's n...
6,0,Even my brother is not like to speak with me. ...
7,0,As per your request 'Melle Melle (Oru Minnamin...
8,1,WINNER!! As a valued network customer you have...
9,1,Had your mobile 11 months or more? U R entitle...


### Step 3: Feature Extraction (TF-IDF)

In [47]:
vectorizer = TfidfVectorizer(stop_words="english")

# 🔎 stop_words="english" makes the vectorizer drop common English stop words (e.g., "the", "is") while it builds the vocabulary.


X = vectorizer.fit_transform(data["text"])
y = data["label"]
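
To make the TF-IDF weights less of a black box, here is a minimal pure-Python sketch of the calculation on a hypothetical toy corpus. It assumes the scikit-learn defaults (raw term counts, smoothed IDF, L2 normalisation per document) and skips tokenisation and stop-word removal, so it is an illustration and not a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lowercased with stop words removed.
docs = [
    ["free", "entry", "win"],
    ["free", "call", "now"],
    ["call", "home", "later"],
]

n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: number of documents containing each term.
df = Counter(term for doc in docs for term in set(doc))

# Smoothed IDF, the formula scikit-learn uses when smooth_idf=True (its default).
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """Return the L2-normalised TF-IDF vector of one tokenised document."""
    counts = Counter(doc)
    weights = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]

row = tfidf_row(docs[0])
for term, weight in zip(vocab, row):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

In the first document, "free" appears in two of the three documents and therefore gets a lower weight than "entry" and "win", which each appear in only one.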


In [51]:
y[0:10]

0    0
1    0
2    1
3    0
4    0
5    1
6    0
7    0
8    1
9    1
Name: label, dtype: int64

In [55]:
X[10:15].toarray()

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.20981775, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.21993545, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

### Step 4: Train-Test Split

In [32]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
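
Note that the vectorizer in this notebook is fitted on all messages before the split, so the vocabulary and IDF weights are influenced by the test messages. A common alternative is to split the raw text first and let a pipeline fit TF-IDF on the training part only. The sketch below shows that pattern on hypothetical toy messages; `stratify` is an optional addition that keeps the spam/ham ratio the same in both parts. It is not the notebook's method and would not reproduce the scores reported here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages standing in for data["text"] and data["label"].
texts = [
    "free prize call now", "win cash today", "claim your free reward",
    "urgent winner call", "see you at lunch", "are you home yet",
    "call me later", "thanks for the help", "running late sorry",
    "dinner at eight", "meeting moved to monday", "happy birthday",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Split the raw text first; stratify keeps the spam/ham ratio equal in both parts.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then trains Naive Bayes.
pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
pipeline.fit(text_train, y_train)

print("Test predictions:", pipeline.predict(text_test))
```

With this layout, `pipeline.predict` accepts raw strings, which also makes it easier to score new messages later.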

### Step 5: Train Model (Naive Bayes)

In [56]:

model = MultinomialNB()
model.fit(X_train, y_train)
 
# Train a multinomial Naive Bayes classifier on the TF-IDF features.
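
For intuition about what `MultinomialNB` learns, the sketch below implements the same idea by hand on hypothetical tokenised messages: a class prior plus Laplace-smoothed per-class term probabilities, combined in log space. It uses integer word counts for simplicity, whereas the notebook feeds fractional TF-IDF weights, so treat it as an illustration of the rule and not of the exact numbers.

```python
import math
from collections import Counter

# Hypothetical tokenised training messages with labels (1 = spam, 0 = ham).
train = [
    (["free", "win", "cash"], 1),
    (["free", "call", "now"], 1),
    (["call", "me", "later"], 0),
    (["see", "you", "later"], 0),
    (["lunch", "later"], 0),
]

vocab = sorted({t for tokens, _ in train for t in tokens})
classes = [0, 1]
alpha = 1.0  # Laplace smoothing; also the default alpha of MultinomialNB

# Class frequencies and per-class term counts.
class_counts = Counter(label for _, label in train)
term_counts = {c: Counter() for c in classes}
for tokens, label in train:
    term_counts[label].update(tokens)

def log_posterior(tokens, c):
    """Unnormalised log P(c | tokens) under the multinomial model."""
    total = sum(term_counts[c].values()) + alpha * len(vocab)
    score = math.log(class_counts[c] / len(train))
    for t in tokens:
        if t in vocab:  # terms never seen in training are ignored
            score += math.log((term_counts[c][t] + alpha) / total)
    return score

def predict(tokens):
    """Return the class with the highest log posterior."""
    return max(classes, key=lambda c: log_posterior(tokens, c))

print(predict(["free", "cash"]))
print(predict(["see", "you", "later"]))
```

The smoothing term `alpha` keeps a word that never appeared in one class from driving that class's probability to zero.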

### Step 6: Evaluate Model

In [57]:
y_pred = model.predict(X_test)

In [58]:
y_pred[:5]

array([0, 0, 0, 0, 1], dtype=int64)

In [60]:
data.shape

(5572, 2)

In [63]:
data['label'].value_counts()

0    4825
1     747
Name: label, dtype: int64

In [61]:
print("Accuracy:", accuracy_score(y_test, y_pred))


Accuracy: 0.968609865470852


In [62]:
print("Classification Report:\n", classification_report(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       0.96      1.00      0.98       965
           1       1.00      0.77      0.87       150

    accuracy                           0.97      1115
   macro avg       0.98      0.88      0.93      1115
weighted avg       0.97      0.97      0.97      1115
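
The report does not show the underlying confusion matrix, but the counts can be inferred from it: 1080 of 1115 predictions are correct (accuracy 0.9686) and spam precision is 1.00, which leaves 35 missed spam messages and no ham flagged as spam. The sketch below recomputes the metrics from those inferred counts; treat them as a reconstruction from the rounded report, not as output of the notebook.

```python
# Counts inferred from the report above: 965 ham and 150 spam test messages,
# 1080 of 1115 correct, and spam precision of 1.00 (no ham flagged as spam).
tn, fp, fn, tp = 965, 0, 35, 115

spam_precision = tp / (tp + fp)
spam_recall = tp / (tp + fn)
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)
ham_precision = tn / (tn + fn)
ham_recall = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"spam precision={spam_precision:.3f} recall={spam_recall:.3f} f1={spam_f1:.3f}")
print(f"ham  precision={ham_precision:.3f} recall={ham_recall:.3f}")
print(f"accuracy={accuracy:.4f}")
```

In the notebook itself, `sklearn.metrics.confusion_matrix(y_test, y_pred)` would give the actual counts directly.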



### Observation:

Excellent Performance on Class 0 (Ham):

The model achieves high precision (0.96) and perfect recall (1.00) for class 0, so every legitimate message in the test set is let through, and only a small share of the messages predicted as ham are actually spam.

Class Imbalance Impact on Class 1 (Spam):

Precision is perfect (1.00) for class 1, meaning no legitimate message was flagged as spam. Recall drops to 0.77, however, so the model misses about 23% of spam messages (roughly 35 of 150), likely because spam is the minority class (150 vs 965 test samples). The weighted average (0.97) and the accuracy (0.97) are dominated by the majority class: labeling every message as ham would already reach about 0.87 accuracy (965/1115), so spam recall is the more informative measure of how well the filter works.

