# <span style="color:blue; font-size:36px;">Spam Email Classification using NLP</span>

This notebook walks through a complete workflow for classifying emails as **spam** or **not spam** (ham) with Natural Language Processing (NLP) techniques. Features are extracted with `CountVectorizer`, and classification uses Multinomial Naive Bayes (`MultinomialNB`). The same steps are then repeated with a scikit-learn `Pipeline`.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

In [2]:
# Load the dataset
dataset = pd.read_csv("Email_Classification.csv")

In [3]:
# Display sample data
print("Sample data from the dataset:")
dataset.sample(5)

Sample data from the dataset:


| Unnamed: 0 | Category | Message |
|---|---|---|
| 2686 | spam | URGENT! We are trying to contact U. Todays dra... |
| 3057 | ham | Webpage s not available! |
| 5350 | ham | No one interested. May be some business plan. |
| 3768 | ham | Sir Goodmorning, Once free call me. |
| 3325 | ham | I don wake since. I checked that stuff and saw... |


In [4]:
# Analyze category distribution
print("Category Distribution:")
dataset['Category'].value_counts()

Category Distribution:


Category
ham     4825
spam     747
Name: count, dtype: int64
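
The counts above show a heavily imbalanced dataset, which matters when reading accuracy figures later on. As a rough sketch that uses only the two counts printed above (the variable names are illustrative and not part of the notebook), the accuracy of a trivial classifier that always predicts ham can be worked out like this:

```python
# Counts copied from the value_counts() output above.
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

# Share of messages that are spam.
spam_fraction = spam_count / total

# Accuracy of a baseline that labels every message as ham.
baseline_accuracy = ham_count / total

print(f"Total messages: {total}")
print(f"Spam fraction: {spam_fraction:.3f}")
print(f"Always-ham baseline accuracy: {baseline_accuracy:.3f}")
```

Because roughly 87% of the messages are ham, a model's accuracy is better judged against this baseline than against 50%, and the per-class precision and recall are more informative than accuracy alone.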

In [5]:
# Create a binary column for spam detection
dataset['Spam'] = (dataset['Category'] == 'spam').astype(int)
print("Sample data with Spam column:")
dataset.sample(5)

Sample data with Spam column:


| Unnamed: 0 | Category | Message | Spam |
|---|---|---|---|
| 5512 | ham | Just making dinner, you ? | 0 |
| 3497 | ham | Happy birthday... May u find ur prince charmin... | 0 |
| 3658 | ham | Studying. But i.ll be free next weekend. | 0 |
| 4996 | ham | Just looked it up and addie goes back Monday, ... | 0 |
| 32 | ham | K tell me anything about you. | 0 |


In [6]:
# Split the dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(dataset['Message'], dataset['Spam'], test_size=0.2, random_state=0)
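
The split above is purely random, so with an imbalanced dataset the spam share in the test set can drift slightly from the overall share. One common option, which this notebook does not use, is the `stratify` parameter of `train_test_split`. The sketch below illustrates it on hypothetical toy labels (`messages`, `labels`, and the other names are made up for the example):

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 messages, 20 of them spam (label 1).
messages = [f"message {i}" for i in range(100)]
labels = [1] * 20 + [0] * 80

# stratify=labels asks for the same class ratio in both partitions.
m_train, m_test, l_train, l_test = train_test_split(
    messages, labels, test_size=0.2, random_state=0, stratify=labels
)

print("Spam share in train:", sum(l_train) / len(l_train))
print("Spam share in test:", sum(l_test) / len(l_test))
```

Both partitions keep the 20% spam share of the toy data. Applying the same idea to the notebook would mean passing `stratify=dataset['Spam']`, which would change the reported numbers.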

In [7]:
# Bag of words representation using CountVectorizer
cv = CountVectorizer()
# Learn the vocabulary from the training messages only, then reuse it for the test set
x_train_cv = cv.fit_transform(x_train)
x_test_cv = cv.transform(x_test)

In [8]:
# Train the model using Multinomial Naive Bayes
model = MultinomialNB()
model.fit(x_train_cv, y_train)

In [9]:
# Evaluate the model
print("Classification Report for MultinomialNB:")
y_pred = model.predict(x_test_cv)
print(classification_report(y_test, y_pred))

Classification Report for MultinomialNB:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99       955
           1       0.98      0.93      0.95       160

    accuracy                           0.99      1115
   macro avg       0.98      0.96      0.97      1115
weighted avg       0.99      0.99      0.99      1115
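
In the report above, the row for class `1` (spam) is the one to watch: a recall of 0.93 means about 93% of the 160 spam messages in the test set were caught, while a precision of 0.98 means nearly all messages flagged as spam really were spam. As a small sketch of how these figures are defined, the following plain-Python example computes them by hand for ten hypothetical labels (the data is made up and unrelated to the notebook's predictions):

```python
# Hypothetical labels for ten messages (1 = spam, 0 = ham).
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam correctly flagged
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham wrongly flagged
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam that was missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

For a spam filter, low precision means legitimate mail gets flagged and low recall means spam gets through, so which one to favor depends on the application.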



In [10]:
# Test the model with example emails
emails = [
    'Hey mohan, can we get together to watch football game tomorrow?',
    'Upto 20% discount on parking, exclusive offer just for you. Don\'t miss this reward!'
]
emails_count = cv.transform(emails)
print("Predictions for sample emails:")
print(model.predict(emails_count))

Predictions for sample emails:
[0 1]


In [11]:
# Alternative approach using a Pipeline
clf = Pipeline([
    ('Vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

In [12]:
# Train the pipeline
clf.fit(x_train, y_train)

In [13]:
# Evaluate the pipeline
print("Classification Report for Pipeline:")
y_pred_pipeline = clf.predict(x_test)
print(classification_report(y_test, y_pred_pipeline))

Classification Report for Pipeline:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99       955
           1       0.98      0.93      0.95       160

    accuracy                           0.99      1115
   macro avg       0.98      0.96      0.97      1115
weighted avg       0.99      0.99      0.99      1115
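
The two reports are identical because the pipeline chains the same `CountVectorizer` and `MultinomialNB` steps that were run by hand earlier. The sketch below checks this equivalence on a hypothetical toy corpus (all texts, labels, and variable names are made up for illustration) and shows that a fitted pipeline accepts raw strings directly:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy training data (1 = spam, 0 = ham).
texts = [
    "win free prize now",
    "free cash offer",
    "claim your free reward",
    "are we meeting tomorrow",
    "call me when you are home",
    "lunch at noon tomorrow",
]
labels = [1, 1, 1, 0, 0, 0]
new_messages = ["free prize offer", "see you tomorrow at lunch"]

# Two-step approach: vectorize, then classify.
vec = CountVectorizer()
nb = MultinomialNB()
nb.fit(vec.fit_transform(texts), labels)
manual_pred = nb.predict(vec.transform(new_messages))

# Pipeline approach: both steps behind a single fit/predict interface.
pipe = Pipeline([("vectorizer", CountVectorizer()), ("nb", MultinomialNB())])
pipe.fit(texts, labels)
pipe_pred = pipe.predict(new_messages)

print(manual_pred, pipe_pred)
```

The practical benefit of the pipeline is that the vectorizer can no longer be forgotten or accidentally refitted at prediction time, since `predict` always applies the fitted transform first.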

