# Spam Classification
This project builds an email spam classifier using machine learning. Messages are cleaned, vectorized with TF-IDF, and classified with a random forest, which reaches about 98% accuracy on a held-out 20% test split.
# About Dataset
The original dataset is available on Kaggle: https://www.kaggle.com/datasets/abdallahwagih/spam-emails

# Load the Data:

In [214]:
import pandas as pd

# Load the dataset
data = pd.read_csv('spam.csv', encoding='ISO-8859-1')
data.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


# EDA : Exploratory Data Analysis

In [215]:
data.shape

(5572, 5)

In [216]:
data.isnull().sum()

v1               0
v2               0
Unnamed: 2    5522
Unnamed: 3    5560
Unnamed: 4    5566
dtype: int64

The dataset contains 5,572 messages (rows) and 5 columns. The 3 unnamed columns are almost entirely null.

In [217]:
# Drop irrelevant columns

data = data.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)

data.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [218]:
# Clean up the dataset
data = data[['v1', 'v2']]
data.columns = ['label', 'email']

data.head()

Unnamed: 0,label,email
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [219]:
data.duplicated().sum()

403

In [220]:
# Note: drop_duplicates() returns a new DataFrame; `data` itself still contains the duplicates
data_deduplicated = data.drop_duplicates()
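
Because `drop_duplicates()` does not modify the DataFrame in place, the result has to be assigned back for later cells to see it. A minimal sketch on a toy frame (the rows below are made up for illustration, not taken from `spam.csv`):

```python
import pandas as pd

# Toy frame with one exact duplicate row (hypothetical data)
toy = pd.DataFrame({
    "label": ["ham", "spam", "ham"],
    "email": ["see you soon", "win a prize now", "see you soon"],
})

# drop_duplicates() returns a new DataFrame; the original is untouched
deduped = toy.drop_duplicates()
print(len(toy))      # 3
print(len(deduped))  # 2

# Reassigning (and resetting the index) is what makes the change stick
toy = toy.drop_duplicates().reset_index(drop=True)
print(toy.duplicated().sum())  # 0
```

Applying the same reassignment to `data` would change every count and shape reported later in this notebook, so the outputs below would need to be regenerated.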

**0 as 'Ham'**

**1 as 'Spam'**

In [221]:
# Map 'ham' to 0 and 'spam' to 1
data['label'] = data['label'].map({'ham': 0, 'spam': 1})
data.head()

Unnamed: 0,label,email
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [222]:
data['label'].value_counts()

label
0    4825
1     747
Name: count, dtype: int64

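There are 4,825 ham messages and 747 spam messages. These counts still include the 403 duplicate rows found earlier, because `data` itself was never deduplicated. Without the duplicates, the counts would be 4,516 ham and 653 spam.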
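
The class balance matters when reading the accuracy figure later on. A small arithmetic sketch, using the counts printed by `value_counts()` above, shows the spam share and the accuracy a trivial "always ham" model would get:

```python
# Counts taken from the value_counts() output above
ham, spam = 4825, 747
total = ham + spam

spam_share = spam / total
print(f"total messages: {total}")       # 5572
print(f"spam share: {spam_share:.1%}")  # 13.4%

# Accuracy of a trivial model that always predicts ham (label 0)
baseline_accuracy = ham / total
print(f"always-ham baseline: {baseline_accuracy:.1%}")  # 86.6%
```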

# Preprocess the Data:

In [223]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

**(I) Remove Non-Alphabetic Characters and Convert to Lowercase**

**(II) Remove Stop Words and Apply Stemming**

In [224]:
# Download stopwords from NLTK
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\LENOVO\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [225]:
# Preprocess the data
def preprocess(text):
    # Remove special characters and convert to lowercase
    text = re.sub(r'[^a-zA-Z]', ' ', text).lower()
    text = text.split()
    # Remove stopwords and apply stemming
    text = [stemmer.stem(word) for word in text if word not in stop_words]
    return ' '.join(text)
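
To see what the regex and stop-word steps do in isolation, here is a minimal sketch that needs no NLTK download. It assumes a tiny hard-coded stop-word set (`TOY_STOP_WORDS`, a hypothetical stand-in for NLTK's English list) and leaves out stemming, so its output is not identical to `preprocess`:

```python
import re

# Tiny hard-coded stop-word set, standing in for NLTK's English list (assumption)
TOY_STOP_WORDS = {"i", "he", "to", "don", "t", "a", "in", "the"}

def clean_without_stemming(text):
    # Replace every non-letter with a space, then lowercase
    text = re.sub(r"[^a-zA-Z]", " ", text).lower()
    # Split on whitespace and drop stop words
    tokens = [word for word in text.split() if word not in TOY_STOP_WORDS]
    return " ".join(tokens)

print(clean_without_stemming("Free entry in 2 a wkly comp to win FA Cup"))
# free entry wkly comp win fa cup
```

Note that the regex also replaces apostrophes with spaces, so a word like "don't" becomes the two tokens "don" and "t" before stop-word filtering.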

In [226]:
data['cleaned_email'] = data['email'].apply(preprocess)
data.head()

Unnamed: 0,label,email,cleaned_email
0,0,"Go until jurong point, crazy.. Available only ...",go jurong point crazi avail bugi n great world...
1,0,Ok lar... Joking wif u oni...,ok lar joke wif u oni
2,1,Free entry in 2 a wkly comp to win FA Cup fina...,free entri wkli comp win fa cup final tkt st m...
3,0,U dun say so early hor... U c already then say...,u dun say earli hor u c alreadi say
4,0,"Nah I don't think he goes to usf, he lives aro...",nah think goe usf live around though


**Text Vectorization Using TF-IDF**

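The same fitted vectorizer must be applied to both the training and test sets, so that the test matrix has the same columns (vocabulary terms) as the training matrix. In the cell below, the vectorizer is fitted on the whole dataset before the split, so the vocabulary and IDF weights also reflect the test messages.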

In [227]:
# Text Vectorization using TF-IDF
vectorizer = TfidfVectorizer(max_features=3000)

# Creating independent & dependent variable 
X = vectorizer.fit_transform(data['cleaned_email'])
y = data['label']

In [228]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [229]:
print("X_train shape: {}\nX_test shape: {}\nY_train shape: {}\nY_test shape: {}".format(X_train.shape,X_test.shape,y_train.shape,y_test.shape))

X_train shape: (4457, 3000)
X_test shape: (1115, 3000)
Y_train shape: (4457,)
Y_test shape: (1115,)
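
The cells above fit the vectorizer on all rows and only then split. One possible alternative, sketched here on toy data, is to split the cleaned text first and fit TF-IDF on the training part only. The lists `texts` and `labels` are hypothetical stand-ins for `data['cleaned_email']` and `data['label']`, and `stratify` is an addition that the original split does not use:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-ins for data['cleaned_email'] and data['label'] (hypothetical values)
texts = ["free prize win", "ok see later", "call claim cash", "lunch today",
         "win free entri", "home soon", "urgent claim prize", "thank much",
         "meet tomorrow", "good night"]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 0, 0]

# Split the text first; stratify keeps the spam ratio similar in both parts
text_train, text_test, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Fit the vocabulary and IDF weights on the training text only
split_vectorizer = TfidfVectorizer(max_features=3000)
X_tr = split_vectorizer.fit_transform(text_train)
# Reuse the fitted vectorizer so the test matrix has the same columns
X_te = split_vectorizer.transform(text_test)

print(X_tr.shape[1] == X_te.shape[1])  # True
```

With this ordering the test messages have no influence on the vocabulary or the IDF weights, which gives a stricter estimate of how the model behaves on unseen text.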


# Train the Model:

I used `RandomForestClassifier`. This algorithm is well suited to text classification problems like spam detection.

In [230]:
# Train the Random Forest model
model = RandomForestClassifier(n_jobs=-1, random_state=42)
model.fit(X_train, y_train)


In [231]:
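# Save the model and vectorizer
import os
import joblib

# Make sure the target directory exists before saving
os.makedirs('models', exist_ok=True)

joblib.dump(model, 'models/spam_classifier.pkl')
joblib.dump(vectorizer, 'models/tfidf_vectorizer.pkl')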

['models/tfidf_vectorizer.pkl']
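
Saving the model and the vectorizer as two files works, but they must always be loaded and used together. As a sketch of one alternative (not what this notebook does), scikit-learn's `make_pipeline` can bundle both into a single object that is saved and restored in one step. The toy data and the file name `spam_pipeline.pkl` are assumptions for illustration:

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy cleaned messages and labels (hypothetical; 1 = spam, 0 = ham)
texts = ["free prize win", "ok see later", "claim cash prize", "lunch today"]
labels = [1, 0, 1, 0]

# One object that holds both the vectorizer and the classifier
spam_pipeline = make_pipeline(
    TfidfVectorizer(max_features=3000),
    RandomForestClassifier(n_estimators=20, random_state=42),
)
spam_pipeline.fit(texts, labels)

# Save to a temporary directory and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "spam_pipeline.pkl")
    joblib.dump(spam_pipeline, path)
    restored = joblib.load(path)

# The restored pipeline gives the same predictions as the original
same = list(restored.predict(texts)) == list(spam_pipeline.predict(texts))
print(same)  # True
```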

In [232]:
# Evaluate the model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')


Accuracy: 0.98
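
Accuracy alone can look high on an imbalanced dataset, so precision and recall for the spam class are worth checking as well. The sketch below uses ten made-up labels rather than real model output. In this notebook the equivalent calls would take `y_test` and `y_pred`:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for 10 messages (1 = spam, 0 = ham), not real model output
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_hat)
print(cm)
# [[6 1]
#  [1 2]]

precision = precision_score(y_true, y_hat)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_hat)        # TP / (TP + FN) = 2 / 3
print(f"precision: {precision:.2f}, recall: {recall:.2f}")
```

In this toy case accuracy is 0.80 while precision and recall are both about 0.67, which shows how the three numbers can diverge.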


In [233]:
# Test with a sample email
def predict(email):
    # Apply the same cleaning used for training before vectorizing
    sample_vectorized = vectorizer.transform([preprocess(email)])
    prediction = model.predict(sample_vectorized)
    return 'Spam' if prediction[0] == 1 else 'Not Spam'

sample_email1 = "Congratulations! You've won a $1,000 gift card. Click the link to claim your prize."
sample_email2 = "Hey, what are you up to?"

print(predict(sample_email1))  # Output: Spam
print(predict(sample_email2))  # Output: Not Spam


Spam
Not Spam
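
`predict` returns only a hard label. If a confidence score is useful, `RandomForestClassifier.predict_proba` returns per-class probabilities. The sketch below trains a tiny model on made-up messages (the names `toy_vec`, `toy_model`, and `spam_probability` are hypothetical) and reads off the spam column:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy cleaned messages and labels (hypothetical; 1 = spam, 0 = ham)
toy_texts = ["free prize win", "ok see later", "claim cash prize", "lunch today"]
toy_labels = [1, 0, 1, 0]

toy_vec = TfidfVectorizer()
toy_model = RandomForestClassifier(n_estimators=20, random_state=42)
toy_model.fit(toy_vec.fit_transform(toy_texts), toy_labels)

def spam_probability(message):
    # Column order follows toy_model.classes_, which is [0, 1] here
    proba = toy_model.predict_proba(toy_vec.transform([message]))
    return float(proba[0, 1])

p = spam_probability("free cash prize")
print(0.0 <= p <= 1.0)  # True
```

A probability makes it possible to pick a threshold other than 0.5, for example a higher one if wrongly flagging a legitimate message is more costly than missing spam.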
