# INTRODUCTION

This project aims to classify emails as spam or not spam. The dataset used to train the model comes from Kaggle. Throughout the project, I explain my workflow in comments and text cells. This is my second project in this field. Hopefully, it succeeds!

# ABOUT THE DATASET

The dataset can be accessed at the following link:
https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv/data

The data consists of 5172 rows and 3002 columns, with each row representing one email.

The columns are the email number (`Email No.`), the counts of the various words used in the emails, and finally a `Prediction` column that labels each email as spam or not spam.



# STEPS OF EXECUTION OF THE PROJECT


1.   Import the modules
2.   Load the data
3.   Understand the data
4.   Prepare the data
5.   Split the data
6.   Scale, train and evaluate the models
7.   Select the model
8.   Deploy the model








# IMPORTING MODULES

In [16]:
#Importing all the necessary modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score , recall_score
from sklearn.metrics import f1_score , roc_auc_score
from sklearn.metrics import precision_recall_curve , roc_curve
import joblib

# LOADING, UNDERSTANDING, SPLITTING AND PREPARING THE DATA



In [7]:
#Loading the data

emails = pd.read_csv("Emails_Dataset.csv")

In [8]:
#Understanding the data

print ("Number of null values in each column :-")
print(emails.isnull().sum())


Number of null values in each column :-
Email No.     0
the           0
to            0
ect           0
and           0
             ..
military      0
allowing      0
ff            0
dry           0
Prediction    0
Length: 3002, dtype: int64
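
The null check above shows that nothing is missing, but it does not show how many of the emails are spam. Below is a minimal sketch of two further checks. It runs on a tiny stand-in frame with the same layout as the dataset. The toy values, and the reading of `Prediction` as 1 = spam and 0 = not spam, are assumptions for illustration and are not taken from the notebook.

```python
import pandas as pd

# Toy stand-in with the same layout as Emails_Dataset.csv:
# an id column, word-count columns, and a Prediction label (assumed 1 = spam).
toy_emails = pd.DataFrame({
    "Email No.": ["Email 1", "Email 2", "Email 3", "Email 4"],
    "the": [0, 8, 0, 5],
    "to": [0, 13, 0, 22],
    "Prediction": [0, 0, 1, 0],
})

# Total number of missing cells across the whole frame; a single number
# is easier to read than one line per column.
total_nulls = int(toy_emails.isnull().sum().sum())

# Share of each class; an uneven split is the reason to stratify later.
class_share = toy_emails["Prediction"].value_counts(normalize=True)

print("Total nulls:", total_nulls)
print(class_share)
```

On the real data, the same two lines can be applied to `emails` directly.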


In [9]:
#Preprocessing and splitting the data

#Separating the features (word counts) from the target; the email ID is dropped because it carries no signal
emails_train = emails.drop(["Email No.", "Prediction"], axis=1)
emails_prediction = emails["Prediction"]

#Stratified 80/20 split so that both sets keep the same spam ratio
(strat_emails_train, strat_emails_test,
 strat_emails_prediction_train, strat_emails_prediction_test) = train_test_split(
    emails_train, emails_prediction,
    test_size=0.2, random_state=42, stratify=emails_prediction)
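
To see what `stratify` does in the split above, here is a small sketch on hypothetical toy labels (70% class 0, 30% class 1, not the real dataset). It checks that the class shares are the same in the training and test parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 70 samples of class 0 and 30 of class 1.
y = np.array([0] * 70 + [1] * 30)
X = np.arange(len(y)).reshape(-1, 1)  # dummy single-feature matrix

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratify=y, both parts keep the original share of class 1.
print("train share of class 1:", y_train.mean())
print("test share of class 1:", y_test.mean())
```

Without `stratify`, the shares would vary with the random seed, which matters most when one class is much rarer than the other.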

# SCALING (ACCORDING TO MODEL) AND TRAINING ON DATA

In [10]:
#Scaling and training the logistic regression model

log_reg_pipeline = Pipeline([('std_scaler',StandardScaler()),
                            ('log_regression',LogisticRegression(C=0.35938137,max_iter = 10000))])
scores = cross_val_score(log_reg_pipeline,strat_emails_train,strat_emails_prediction_train,cv=5,scoring='accuracy',n_jobs=-1)
print(scores)
print ("Average Score : ",scores.mean())

[0.97342995 0.96980676 0.9637243  0.96977025 0.96251511]
Average Score :  0.9678492776989176
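
The value `C=0.35938137` is not explained in the cell above, and the conclusion only mentions that automatic tuning was done. The number matches the fifth point of `np.logspace(-4, 4, 10)`, which is also the grid `LogisticRegressionCV` builds for its default `Cs=10`. One plausible source is therefore a search like the sketch below. This is a guess at the workflow, not the author's actual tuning code, and it runs on synthetic data from `make_classification` so that it works without the CSV.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Ten log-spaced candidates between 1e-4 and 1e4.
c_grid = np.logspace(-4, 4, 10)
print(c_grid[4])  # the fifth candidate, approximately 0.35938137

# Synthetic stand-in data (hypothetical, not the email dataset).
X, y = make_classification(n_samples=300, n_features=20, random_state=42)

pipeline = Pipeline([
    ("std_scaler", StandardScaler()),
    ("log_regression", LogisticRegression(max_iter=10000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
search = GridSearchCV(
    pipeline,
    param_grid={"log_regression__C": c_grid},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

best_c = search.best_params_["log_regression__C"]
print("Best C:", best_c, "CV accuracy:", search.best_score_)
```

Because the scaler sits inside the pipeline, it is refit on the training folds of every split, so no statistics from the validation fold leak into the search.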


In [11]:
#Scaling and training the k neighbors classifier model

kn_classifier_pipeline = Pipeline([('std_scaler',StandardScaler()),
                            ('kn_classifier',KNeighborsClassifier(n_neighbors=2,weights='uniform'))])
scores = cross_val_score(kn_classifier_pipeline,strat_emails_train,strat_emails_prediction_train,cv=5,scoring='accuracy',n_jobs=-1)
print(scores)
print ("Average Score : ",scores.mean())

[0.87439614 0.8647343  0.88270859 0.86457074 0.87908102]
Average Score :  0.8730981546711529


In [12]:
#Training the random forest classifier model (no scaling step, since tree-based models are insensitive to feature scale)

random_forest_pipeline = Pipeline([
                            ('rf_classifier',RandomForestClassifier(random_state=42))])
scores = cross_val_score(random_forest_pipeline,strat_emails_train,strat_emails_prediction_train,cv=5,scoring='accuracy',n_jobs=-1)
print(scores)
print ("Average Score : ",scores.mean())

[0.96859903 0.97826087 0.96977025 0.96614268 0.95646917]
Average Score :  0.9678484014743939


In [13]:
#Scaling and training the stochastic gradient descent classifier model

sgd_classifier_pipeline = Pipeline([('std_scaler',StandardScaler()),
                            ('sgd_classifier',SGDClassifier(random_state=42))])
scores = cross_val_score(sgd_classifier_pipeline,strat_emails_train,strat_emails_prediction_train,cv=5,scoring='accuracy',n_jobs=-1)
print(scores)
print ("Average Score : ",scores.mean())

[0.96376812 0.93236715 0.91051995 0.96493349 0.94679565]
Average Score :  0.9436768717616204


# FINAL MODEL SELECTED : LOGISTIC REGRESSION

In [14]:
#Testing the Logistic Regression Model
log_reg_pipeline = Pipeline([('std_scaler',StandardScaler()),
                            ('log_regression',LogisticRegression(C=0.35938137,max_iter = 10000))])
log_reg_pipeline.fit(strat_emails_train,strat_emails_prediction_train)
predictions=log_reg_pipeline.predict(strat_emails_test)
accuracy = (predictions==strat_emails_prediction_test).mean()
print("Accuracy of the model :-",accuracy)


Accuracy of the model :- 0.9729468599033816


# EVALUATING PRECISION, RECALL, F1 AND AUC SCORES OF THE MODEL


In [15]:
#Finding out the precision

precision = precision_score(strat_emails_prediction_test,predictions)
print("Precision:",precision)

#Finding out the recall of the model
recall = recall_score(strat_emails_prediction_test,predictions)
print("Recall:",recall)

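#Finding out the area under the receiver operating characteristic curve (recall vs fpr)
#Note: this is computed from hard 0/1 predictions, so it equals the balanced accuracy
auc = roc_auc_score(strat_emails_prediction_test,predictions)
print("Area under curve:",auc)

#Finding out the F1 score (stored as f1 so the imported f1_score function is not overwritten)
f1 = f1_score(strat_emails_prediction_test,predictions)
print("F1 Score :",f1)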

Precision: 0.9331210191082803
Recall: 0.9766666666666667
Area under curve: 0.9740476190476189
F1 Score : 0.9543973941368078
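
All four scores above follow from the confusion-matrix counts. The counts in this sketch are an inference, not notebook output. They are the values that reproduce the reported scores for a 1035-email test set (20% of 5172, rounded up), with spam assumed to be the positive class.

```python
# Confusion-matrix counts inferred from the reported scores (an assumption,
# not output from the notebook); spam is treated as the positive class.
tp, fp, fn, tn = 293, 21, 7, 714

precision = tp / (tp + fp)      # of the emails flagged as spam, share that are spam
recall = tp / (tp + fn)         # of the spam emails, share that were flagged
specificity = tn / (tn + fp)    # of the non-spam emails, share left alone
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

# roc_auc_score on hard 0/1 predictions reduces to the mean of recall and
# specificity (balanced accuracy), because the ROC curve has one interior point.
auc_from_labels = (recall + specificity) / 2

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"accuracy={accuracy:.4f} auc_from_labels={auc_from_labels:.4f}")
```

Passing `log_reg_pipeline.predict_proba(strat_emails_test)[:, 1]` to `roc_auc_score` instead of the hard predictions would give the conventional, threshold-independent AUC.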


# DEPLOY THE TRAINED MODEL

In [17]:
#Saving the trained pipeline (scaler + classifier) to disk for deployment
joblib.dump(log_reg_pipeline , "Srihari_Spam_Classifier_1.pkl")


['Srihari_Spam_Classifier_1.pkl']
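
The saved file only becomes useful once it is loaded again. Here is a minimal sketch of the round trip, using a small pipeline trained on synthetic data and a temporary file. The file name `toy_spam_classifier.pkl` is hypothetical and is not the model saved above.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data and a small pipeline (hypothetical, for illustration).
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
pipeline = Pipeline([
    ("std_scaler", StandardScaler()),
    ("log_regression", LogisticRegression(max_iter=10000)),
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_spam_classifier.pkl")  # hypothetical name
    joblib.dump(pipeline, model_path)

    # Loading restores the scaler and the classifier together.
    restored = joblib.load(model_path)

# The restored pipeline predicts exactly like the original one.
same_predictions = bool((restored.predict(X) == pipeline.predict(X)).all())
print("Predictions identical after reload:", same_predictions)
```

New emails must be converted to the same word-count columns, in the same order, before they are passed to the loaded pipeline. Pickle files should only be loaded from trusted sources and with a compatible scikit-learn version.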

# CONCLUSION

This concludes the first round of classifying emails as spam or not spam. While this is just a starting point, the accuracies are very satisfying. Vectorization isn't shown explicitly here because the data is already vectorized. TF-IDF weights could easily be added with Python functions, but my evaluations showed poorer performance, which I attribute to overly strong regularization (overgeneralization). For example, the accuracy of the logistic regression model dropped from 0.97 to just 0.68. Automatic hyperparameter tuning was also done, but it is not shown here for simplicity and cleanliness. Only a few models have been trained so far. However, upcoming versions of this project will incorporate neural networks and other complex ML models. So, stay tuned!
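
For reference, here is one way TF-IDF weights can be derived from data that is already stored as word counts. The TF-IDF code used in the experiments is not shown in this notebook, so the sketch below is an assumption about the general approach, run on a hypothetical 3 x 4 count matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical word-count matrix: 3 emails x 4 words (not the real data).
counts = np.array([
    [3, 0, 1, 0],
    [2, 0, 0, 4],
    [1, 5, 0, 0],
])

# TfidfTransformer works on counts directly, so no raw text is needed.
tfidf = TfidfTransformer()  # defaults: smoothed idf, L2-normalized rows
weights = tfidf.fit_transform(counts).toarray()

print(np.round(weights, 3))
# Each row is rescaled to unit length, so absolute counts are no longer visible.
print(np.linalg.norm(weights, axis=1))
```

In a pipeline, this would be an extra step such as `('tfidf', TfidfTransformer())` placed before the classifier.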