#3. Email Spam Detection

The goal of this project is to develop a robust email spam detection system using machine
learning techniques. By analyzing the content and characteristics of emails, the system should
be able to accurately classify incoming emails as either spam or legitimate (ham).

#Importing All Necessary Libraries:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

#Importing Email Spam Detection dataset:

In [2]:
Email_dataset = pd.read_csv('Email_dataset.csv')
Email_dataset

,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1
...,...,...
5723,Subject: re : research and development charges...,0
5724,"Subject: re : receipts from visit jim , than...",0
5725,Subject: re : enron case study update wow ! a...,0
5726,"Subject: re : interest david , please , call...",0


#Checking shape of the dataset:
This returns the number of rows and columns in the dataset.

In [3]:
Email_dataset.shape

(5728, 2)

#Listing the columns present in the dataset:

In [4]:
Email_dataset.columns

Index(['text', 'spam'], dtype='object')

#Checking non-null count and datatype of each column in Email_dataset:

In [5]:
Email_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5728 entries, 0 to 5727
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    5728 non-null   object
 1   spam    5728 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 89.6+ KB


#Checking null values present in the Email_dataset columnwise:

In [6]:
Email_dataset.isnull().sum()

text    0
spam    0
dtype: int64

#Checking the unique values present in each column:

In [7]:
Email_dataset.text.unique()

array(["Subject: naturally irresistible your corporate identity  lt is really hard to recollect a company : the  market is full of suqgestions and the information isoverwhelminq ; but a good  catchy logo , stylish statlonery and outstanding website  will make the task much easier .  we do not promise that havinq ordered a iogo your  company will automaticaily become a world ieader : it isguite ciear that  without good products , effective business organization and practicable aim it  will be hotat nowadays market ; but we do promise that your marketing efforts  will become much more effective . here is the list of clear  benefits : creativeness : hand - made , original logos , specially done  to reflect your distinctive company image . convenience : logo and stationery  are provided in all formats ; easy - to - use content management system letsyou  change your website content and even its structure . promptness : you  will see logo drafts within three business days . affordability : y

In [8]:
Email_dataset.spam.unique()

array([1, 0])

#Describe the dataset:

In [9]:
Email_dataset.describe()

,spam
count,5728.0
mean,0.238827
std,0.426404
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0
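
Because `spam` is a 0/1 column, its mean is the share of spam rows: 0.238827 × 5728 ≈ 1368 spam emails, so roughly one email in four is spam and the classes are imbalanced. A minimal sketch of the same reading on a toy label column (the values below are hypothetical, not taken from Email_dataset.csv):

```python
import pandas as pd

# Toy stand-in for the 'spam' column (hypothetical values)
labels = pd.Series([1, 0, 0, 0, 1, 0, 0, 0], name="spam")

spam_fraction = labels.mean()                       # mean of a 0/1 column = share of ones
class_counts = labels.value_counts()                # absolute count per class
class_share = labels.value_counts(normalize=True)   # relative frequency per class

print(spam_fraction)
print(class_counts)
print(class_share)
```

With imbalanced classes, accuracy alone can look good even when the minority class is handled poorly, which is why the per-class precision and recall matter in the model sections.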


#Finding duplicate rows in Table:

In [10]:
Email_dataset.duplicated().sum()

33

In [11]:
duplicate_rows = Email_dataset[Email_dataset.duplicated()]

# Print duplicate rows
print("Duplicate Rows except first occurrence:")
print(duplicate_rows)

Duplicate Rows except first occurrence:
                                                   text  spam
2155  Subject: research allocations to egm  hi becky...     0
2260  Subject: departure of grant masson  the resear...     0
2412  Subject: re : schedule and more . .  jinbaek ,...     0
2473  Subject: day off tuesday  stinson ,  i would l...     0
2763  Subject: re : your mail  zhendong ,  dr . kami...     0
3123  Subject: re : grades  pam ,  the students rese...     0
3152  Subject: tiger evals - attachment  tiger hosts...     0
3248  Subject: re : i am zhendong  zhendong ,  thank...     0
3249  Subject: hello from enron  dear dr . mcmullen ...     0
3387  Subject: term paper  dr . kaminski ,  attached...     0
3573  Subject: telephone interview with the enron re...     0
3660  Subject: re : summer work . .  jinbaek ,  this...     0
3690  Subject: re : weather and energy price data  m...     0
3823  Subject: research get - together at sandeep ko...     0
4203  Subject: re : willow and

#Deleting duplicated rows from dataset:

In [12]:
Email_dataset = Email_dataset.drop_duplicates()

#Verifying that no duplicate rows remain after deletion:

In [13]:
Email_dataset.duplicated().sum()

0

In [14]:
duplicate_rows = Email_dataset[Email_dataset.duplicated()]

# Print duplicate rows
print("Duplicate Rows except first occurrence:")
print(duplicate_rows)

Duplicate Rows except first occurrence:
Empty DataFrame
Columns: [text, spam]
Index: []
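
A small sketch of how `duplicated()` and `drop_duplicates()` behave, using a hypothetical four-row frame. `duplicated()` only flags rows that match an earlier row in every column, and `drop_duplicates()` keeps the original index labels unless the index is reset. The `reset_index(drop=True)` call and the `subset` check are optional extras, not steps the notebook performs:

```python
import pandas as pd

# Hypothetical rows: row 2 repeats row 0 exactly; row 3 repeats the text with a different label
df = pd.DataFrame({
    "text": ["Subject: hello", "Subject: offer", "Subject: hello", "Subject: hello"],
    "spam": [0, 1, 0, 1],
})

n_dupes = df.duplicated().sum()                      # exact duplicates of an earlier row
n_text_dupes = df.duplicated(subset=["text"]).sum()  # same text, regardless of label

# Drop exact duplicates and renumber the rows 0..n-1
deduped = df.drop_duplicates().reset_index(drop=True)

print(n_dupes, n_text_dupes)
print(deduped)
```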


#Checking zeros in the dataset columnwise:

In [15]:
# Count the zeros in each column
zeros_sum = (Email_dataset == 0).sum()

print(zeros_sum)

# These zeros are valid class labels (0 = ham), so there is no need to replace them.

text       0
spam    4327
dtype: int64


#Selecting independent variables:

In [16]:
X = Email_dataset[['text']]
X

,text
0,Subject: naturally irresistible your corporate...
1,Subject: the stock trading gunslinger fanny i...
2,Subject: unbelievable new homes made easy im ...
3,Subject: 4 color printing special request add...
4,"Subject: do not have money , get software cds ..."
...,...
5723,Subject: re : research and development charges...
5724,"Subject: re : receipts from visit jim , than..."
5725,Subject: re : enron case study update wow ! a...
5726,"Subject: re : interest david , please , call..."


#Converting the text column into bag-of-words count features with CountVectorizer:

In [17]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the 'text' column
X = vectorizer.fit_transform(Email_dataset['text'])

In [18]:
X

<5695x37303 sparse matrix of type '<class 'numpy.int64'>'
	with 704610 stored elements in Compressed Sparse Row format>
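
To make the sparse matrix above more concrete, here is a minimal sketch of what `CountVectorizer` produces on a three-document toy corpus (the corpus is hypothetical). Each column is one vocabulary word and each cell is the number of times that word occurs in the document, so the 37303 columns above correspond to 37303 distinct tokens:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the email dataset
corpus = ["free money now", "meeting at noon", "free free offer"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)   # sparse matrix: documents x vocabulary

# vocabulary_ maps each token to its column index; order the tokens by column
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
dense = counts.toarray()                    # fine for a toy corpus, too large for real data

print(vocab)
print(dense)
```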

#Selecting Target Variable:

In [19]:
y = Email_dataset[['spam']]
y

,spam
0,1
1,1
2,1
3,1
4,1
...,...
5723,0
5724,0
5725,0
5726,0
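
Selecting with double brackets gives a one-column DataFrame of shape `(n, 1)`, whereas scikit-learn estimators expect a 1-D target of shape `(n,)`. The `y = column_or_1d(y, warn=True)` fragments in the model outputs further down appear to be the tail of the warning scikit-learn raises in that situation. A minimal sketch of the difference, on a hypothetical four-row frame:

```python
import pandas as pd

# Hypothetical labels, standing in for Email_dataset
df = pd.DataFrame({"spam": [1, 0, 0, 1]})

y_frame = df[["spam"]]                 # DataFrame, shape (4, 1)
y_series = df["spam"]                  # Series, shape (4,)
y_flat = y_frame.to_numpy().ravel()    # 1-D NumPy array built from the DataFrame

print(y_frame.shape, y_series.shape, y_flat.shape)
```

Passing either `y_series` or `y_flat` to `fit` should avoid the warning without changing the fitted model.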


#Split the dataset into X_train, X_test, y_train, y_test:

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
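
Note that the vectorizer in cell 17 was fit on every row before this split, so its vocabulary also includes words that occur only in the test emails. One common alternative, sketched here under assumptions and not what this notebook does, is to split the raw text first and wrap the vectorizer and the classifier in a `Pipeline`, so the vocabulary is learned from the training rows only. The toy texts, labels, and `max_iter=1000` are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = spam, 0 = ham
texts = [
    "free money now", "win a free prize", "cheap offer click now", "free offer win money",
    "meeting at noon", "project report attached", "lunch tomorrow at noon", "please review the report",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# The pipeline fits the vectorizer on the training text only, then the classifier
pipeline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(texts_train, y_train)

# At prediction time the test text is transformed with the training vocabulary
test_predictions = pipeline.predict(texts_test)
train_vocab = pipeline.named_steps["countvectorizer"].vocabulary_

print(sorted(train_vocab))
print(test_predictions)
```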

In [21]:
X_train

<4271x37303 sparse matrix of type '<class 'numpy.int64'>'
	with 526851 stored elements in Compressed Sparse Row format>

In [22]:
X_test

<1424x37303 sparse matrix of type '<class 'numpy.int64'>'
	with 177759 stored elements in Compressed Sparse Row format>

In [23]:
y_train

,spam
3639,0
3530,0
2949,0
5159,0
3609,0
...,...
4950,0
3273,0
1653,0
2611,0


In [24]:
y_test

,spam
977,1
3275,0
4163,0
751,1
3244,0
...,...
5661,0
4728,0
2522,0
3552,0
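
The split above is random without stratification, so the spam share in the two parts is not guaranteed to match. As a hedged sketch, passing `stratify=y` to `train_test_split` keeps the class ratio the same in both parts. The arrays below are hypothetical toy data with 25% spam:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 16 samples, 4 spam (1) and 12 ham (0)
X_demo = np.arange(32).reshape(16, 2)
y_demo = np.array([1] * 4 + [0] * 12)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Both parts keep the 25% spam share of the full data
print(yd_train.mean(), yd_test.mean())
```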


#1)LogisticRegression:

In [25]:
# Initialize your model
model = LogisticRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the training set
train_predictions = model.predict(X_train)

# Make predictions on the testing set
test_predictions = model.predict(X_test)

# Calculate training accuracy
LogisticRegression_train_accuracy = accuracy_score(y_train, train_predictions)
print(f'Training Accuracy: {LogisticRegression_train_accuracy}')

# Calculate testing accuracy
LogisticRegression_test_accuracy = accuracy_score(y_test, test_predictions)
print(f'Testing Accuracy: {LogisticRegression_test_accuracy}')

# Confusion matrix and classification report for training data
print("Confusion Matrix (Training Data):")
print(confusion_matrix(y_train, train_predictions))

print("\nClassification Report (Training Data):")
print(classification_report(y_train, train_predictions))

# Confusion matrix and classification report for testing data
print("\nConfusion Matrix (Testing Data):")
print(confusion_matrix(y_test, test_predictions))

print("\nClassification Report (Testing Data):")
print(classification_report(y_test, test_predictions))

  y = column_or_1d(y, warn=True)


Training Accuracy: 1.0
Testing Accuracy: 0.9915730337078652
Confusion Matrix (Training Data):
[[3243    0]
 [   0 1028]]

Classification Report (Training Data):
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3243
           1       1.00      1.00      1.00      1028

    accuracy                           1.00      4271
   macro avg       1.00      1.00      1.00      4271
weighted avg       1.00      1.00      1.00      4271


Confusion Matrix (Testing Data):
[[1082    2]
 [  10  330]]

Classification Report (Testing Data):
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      1084
           1       0.99      0.97      0.98       340

    accuracy                           0.99      1424
   macro avg       0.99      0.98      0.99      1424
weighted avg       0.99      0.99      0.99      1424
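
The figures in the test report follow directly from the test confusion matrix. In scikit-learn's layout, rows are the true class and columns the predicted class, so `[[1082, 2], [10, 330]]` means 2 ham emails were flagged as spam and 10 spam emails were missed. A short worked check of that arithmetic:

```python
# Test-set confusion matrix reported above for logistic regression
# rows = true class (0 = ham, 1 = spam), columns = predicted class
tn, fp = 1082, 2    # true ham: predicted ham, predicted spam
fn, tp = 10, 330    # true spam: predicted ham, predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_spam = tp / (tp + fp)    # of the emails flagged as spam, how many were spam
recall_spam = tp / (tp + fn)       # of the actual spam emails, how many were caught
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)

print(round(accuracy, 4), round(precision_spam, 2), round(recall_spam, 2), round(f1_spam, 2))
```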



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
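
The message above appears to be scikit-learn's convergence warning: the solver stopped at its iteration cap (the default `max_iter` is 100) before converging. One way to address it, as the message suggests, is to raise `max_iter`. The sketch below uses synthetic count-like data and an assumed value of `max_iter=1000`; neither comes from the notebook:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic count-like features (hypothetical data, not the email matrix)
rng = np.random.default_rng(0)
X_demo = rng.poisson(1.0, size=(200, 20))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 2).astype(int)

# max_iter=1000 is an assumed value; the default is 100
demo_model = LogisticRegression(max_iter=1000)
demo_model.fit(X_demo, y_demo)

# n_iter_ holds the iterations actually used; a value below max_iter means
# the solver stopped because it converged, not because it hit the cap
print(demo_model.n_iter_, demo_model.max_iter)
```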


#2)Support Vector Machine:

In [26]:
# Initialize your model
model = svm.SVC(kernel='linear')

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the training set
train_predictions = model.predict(X_train)

# Make predictions on the testing set
test_predictions = model.predict(X_test)

# Calculate training accuracy
SVM_train_accuracy = accuracy_score(y_train, train_predictions)
print(f'Training Accuracy: {SVM_train_accuracy}')

# Calculate testing accuracy
SVM_test_accuracy = accuracy_score(y_test, test_predictions)
print(f'Testing Accuracy: {SVM_test_accuracy}')

# Confusion matrix and classification report for training data
print("Confusion Matrix (Training Data):")
print(confusion_matrix(y_train, train_predictions))

print("\nClassification Report (Training Data):")
print(classification_report(y_train, train_predictions))

# Confusion matrix and classification report for testing data
print("\nConfusion Matrix (Testing Data):")
print(confusion_matrix(y_test, test_predictions))

print("\nClassification Report (Testing Data):")
print(classification_report(y_test, test_predictions))

  y = column_or_1d(y, warn=True)


Training Accuracy: 1.0
Testing Accuracy: 0.9817415730337079
Confusion Matrix (Training Data):
[[3243    0]
 [   0 1028]]

Classification Report (Training Data):
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3243
           1       1.00      1.00      1.00      1028

    accuracy                           1.00      4271
   macro avg       1.00      1.00      1.00      4271
weighted avg       1.00      1.00      1.00      4271


Confusion Matrix (Testing Data):
[[1072   12]
 [  14  326]]

Classification Report (Testing Data):
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      1084
           1       0.96      0.96      0.96       340

    accuracy                           0.98      1424
   macro avg       0.98      0.97      0.97      1424
weighted avg       0.98      0.98      0.98      1424



#3)RandomForestClassifier:

In [27]:
# Initialize the model: a random forest with default settings
# (RandomForestClassifier is already imported in cell 1)
model = RandomForestClassifier()

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the training set
train_predictions = model.predict(X_train)

# Make predictions on the testing set
test_predictions = model.predict(X_test)

# Calculate training accuracy
Random_Forest_train_accuracy = accuracy_score(y_train, train_predictions)
print(f'Training Accuracy: {Random_Forest_train_accuracy}')

# Calculate testing accuracy
Random_Forest_test_accuracy = accuracy_score(y_test, test_predictions)
print(f'Testing Accuracy: {Random_Forest_test_accuracy}')

# Confusion matrix and classification report for training data
print("Confusion Matrix (Training Data):")
print(confusion_matrix(y_train, train_predictions))

print("\nClassification Report (Training Data):")
print(classification_report(y_train, train_predictions))

# Confusion matrix and classification report for testing data
print("\nConfusion Matrix (Testing Data):")
print(confusion_matrix(y_test, test_predictions))

print("\nClassification Report (Testing Data):")
print(classification_report(y_test, test_predictions))

  model.fit(X_train, y_train)


Training Accuracy: 1.0
Testing Accuracy: 0.964185393258427
Confusion Matrix (Training Data):
[[3243    0]
 [   0 1028]]

Classification Report (Training Data):
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3243
           1       1.00      1.00      1.00      1028

    accuracy                           1.00      4271
   macro avg       1.00      1.00      1.00      4271
weighted avg       1.00      1.00      1.00      4271


Confusion Matrix (Testing Data):
[[1084    0]
 [  51  289]]

Classification Report (Testing Data):
              precision    recall  f1-score   support

           0       0.96      1.00      0.98      1084
           1       1.00      0.85      0.92       340

    accuracy                           0.96      1424
   macro avg       0.98      0.93      0.95      1424
weighted avg       0.97      0.96      0.96      1424



#4)GradientBoostingClassifier:

In [28]:
# Initialize the model (GradientBoostingClassifier and the metrics are already imported in cell 1)
model = GradientBoostingClassifier(n_estimators=100, random_state=42)

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the training set
train_predictions = model.predict(X_train)

# Make predictions on the testing set
test_predictions = model.predict(X_test)

# Calculate training accuracy
GradientBoosting_train_accuracy = accuracy_score(y_train, train_predictions)
print(f'Training Accuracy: {GradientBoosting_train_accuracy}')

# Calculate testing accuracy
GradientBoosting_test_accuracy = accuracy_score(y_test, test_predictions)
print(f'Testing Accuracy: {GradientBoosting_test_accuracy}')

# Confusion matrix and classification report for training data
print("Confusion Matrix (Training Data):")
print(confusion_matrix(y_train, train_predictions))

print("\nClassification Report (Training Data):")
print(classification_report(y_train, train_predictions))

# Confusion matrix and classification report for testing data
print("\nConfusion Matrix (Testing Data):")
print(confusion_matrix(y_test, test_predictions))

print("\nClassification Report (Testing Data):")
print(classification_report(y_test, test_predictions))

  y = column_or_1d(y, warn=True)


Training Accuracy: 0.9887614141887145
Testing Accuracy: 0.9754213483146067
Confusion Matrix (Training Data):
[[3222   21]
 [  27 1001]]

Classification Report (Training Data):
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      3243
           1       0.98      0.97      0.98      1028

    accuracy                           0.99      4271
   macro avg       0.99      0.98      0.98      4271
weighted avg       0.99      0.99      0.99      4271


Confusion Matrix (Testing Data):
[[1074   10]
 [  25  315]]

Classification Report (Testing Data):
              precision    recall  f1-score   support

           0       0.98      0.99      0.98      1084
           1       0.97      0.93      0.95       340

    accuracy                           0.98      1424
   macro avg       0.97      0.96      0.97      1424
weighted avg       0.98      0.98      0.98      1424



In [29]:
# Print the results
print("Logistic Regression:")
print(f'Training Accuracy: {LogisticRegression_train_accuracy}')
print(f'Testing Accuracy: {LogisticRegression_test_accuracy}\n')

print("SVM:")
print(f'Training Accuracy: {SVM_train_accuracy}')
print(f'Testing Accuracy: {SVM_test_accuracy}\n')

print("Random Forest:")
print(f'Training Accuracy: {Random_Forest_train_accuracy}')
print(f'Testing Accuracy: {Random_Forest_test_accuracy}\n')

print("Gradient Boosting:")
print(f'Training Accuracy: {GradientBoosting_train_accuracy}')
print(f'Testing Accuracy: {GradientBoosting_test_accuracy}\n')

Logistic Regression:
Training Accuracy: 1.0
Testing Accuracy: 0.9915730337078652

SVM:
Training Accuracy: 1.0
Testing Accuracy: 0.9817415730337079

Random Forest:
Training Accuracy: 1.0
Testing Accuracy: 0.964185393258427

Gradient Boosting:
Training Accuracy: 0.9887614141887145
Testing Accuracy: 0.9754213483146067
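
For a side-by-side reading, the accuracies printed above can be collected into one table and ranked by test accuracy. This is a sketch that copies the reported numbers into a DataFrame; the `gap` column (training minus testing accuracy) is an added convenience for spotting overfitting, not something the notebook computes:

```python
import pandas as pd

# Accuracies copied from the output above
results = pd.DataFrame(
    {
        "train_accuracy": [1.0, 1.0, 1.0, 0.9887614141887145],
        "test_accuracy": [0.9915730337078652, 0.9817415730337079,
                          0.964185393258427, 0.9754213483146067],
    },
    index=["Logistic Regression", "SVM", "Random Forest", "Gradient Boosting"],
)

# Difference between training and testing accuracy per model
results["gap"] = results["train_accuracy"] - results["test_accuracy"]

# Best test accuracy first
ranked = results.sort_values("test_accuracy", ascending=False)
print(ranked.round(4))
```

On these numbers logistic regression has both the highest test accuracy and the smallest gap, while the random forest has the largest gap.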

