Classify emails using binary classification. Email spam detection has two
states: a) Normal State – Not Spam, b) Abnormal State – Spam. Use K-Nearest Neighbors and
Support Vector Machine for classification, and analyze their performance.
Dataset: the emails.csv dataset on Kaggle
https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

In [3]:
df=pd.read_csv("emails.csv")
df.head()

   Email No.  the  to  ect  and  for  of    a  you  hou  ...  connevey  jay  valued  lay  infrastructure  military  allowing  ff  dry  Prediction
0    Email 1    0   0    1    0    0   0    2    0    0  ...         0    0       0    0               0         0         0   0    0           0
1    Email 2    8  13   24    6    6   2  102    1   27  ...         0    0       0    0               0         0         0   1    0           0
2    Email 3    0   0    1    0    0   0    8    0    0  ...         0    0       0    0               0         0         0   0    0           0
3    Email 4    0   5   22    0    5   1   51    2   10  ...         0    0       0    0               0         0         0   0    0           0
4    Email 5    7   6   17    1    5   2   57    0    9  ...         0    0       0    0               0         0         0   1    0           0


In [4]:
# Data preprocessing
X = df.drop(columns=['Email No.', 'Prediction'])
y = df['Prediction']

In [6]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# The columns are already per-word counts, so no text vectorization (e.g. TF-IDF) is applied;
# TfidfVectorizer expects raw text documents, not a numeric DataFrame
# KNN Classifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
# SVM Classifier
svm = SVC()
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)

In [7]:
# Evaluation
knn_accuracy = accuracy_score(y_test, y_pred_knn)
svm_accuracy = accuracy_score(y_test, y_pred_svm)
print("KNN Classification Report: \n", classification_report(y_test,y_pred_knn))
print("SVM Classification Report: \n", classification_report(y_test,y_pred_svm))

KNN Classification Report: 
               precision    recall  f1-score   support

           0       0.93      0.87      0.90       739
           1       0.73      0.83      0.78       296

    accuracy                           0.86      1035
   macro avg       0.83      0.85      0.84      1035
weighted avg       0.87      0.86      0.87      1035

SVM Classification Report: 
               precision    recall  f1-score   support

           0       0.80      0.98      0.88       739
           1       0.91      0.40      0.56       296

    accuracy                           0.82      1035
   macro avg       0.86      0.69      0.72      1035
weighted avg       0.83      0.82      0.79      1035
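
The SVM report above shows a spam recall of only 0.40. `SVC()` defaults to an RBF kernel, which is sensitive to feature scale, and the raw word counts here vary widely between columns (compare `a` with `jay` in `df.head()`). One thing that may be worth trying is standardizing the features before fitting. The sketch below is an assumption, not part of the original notebook: `build_scaled_svm` is a hypothetical helper, and the data is a small synthetic stand-in rather than emails.csv, so it says nothing about the accuracy that would be obtained on the real dataset.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_scaled_svm():
    """Hypothetical helper: standardize each feature, then fit a default (RBF) SVC."""
    return make_pipeline(StandardScaler(), SVC())


# Synthetic stand-in for a word-count matrix (NOT the real emails.csv data)
rng = np.random.default_rng(42)
X_demo = rng.poisson(lam=3.0, size=(200, 20)).astype(float)
# Toy label that depends on the first two "word count" columns
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 6).astype(int)

# Fit on the first 150 rows, predict on the remaining 50
scaled_svm = build_scaled_svm()
scaled_svm.fit(X_demo[:150], y_demo[:150])
demo_pred = scaled_svm.predict(X_demo[150:])
print(demo_pred.shape)
```

In the notebook, `build_scaled_svm().fit(X_train, y_train)` could stand in for `SVC().fit(X_train, y_train)`. Whether this actually improves spam recall on this dataset would need to be checked by rerunning the evaluation.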



In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Load the dataset
emails_df = pd.read_csv('emails.csv')
# Features (X): every column except the last; label (y): the last column, 'Prediction'
X = emails_df.iloc[:, :-1]
y = emails_df['Prediction']
# Drop non-numeric columns (here the 'Email No.' identifier), which the classifiers cannot use
non_numeric_columns = X.select_dtypes(include=['object']).columns
X = X.drop(columns=non_numeric_columns)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# K-Nearest Neighbors Classifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
# Support Vector Machine Classifier
svm = SVC()
svm.fit(X_train, y_train)
# Predict on the test set
y_pred_knn = knn.predict(X_test)
y_pred_svm = svm.predict(X_test)



In [12]:
# Evaluate KNN
print("KNN Accuracy: ", accuracy_score(y_test, y_pred_knn))
print("KNN Classification Report: \n", classification_report(y_test,y_pred_knn))
# Evaluate SVM
print("SVM Accuracy: ", accuracy_score(y_test, y_pred_svm))
print("SVM Classification Report: \n", classification_report(y_test,y_pred_svm))
# Confusion matrices
print("KNN Confusion Matrix: \n", confusion_matrix(y_test, y_pred_knn))
print("SVM Confusion Matrix: \n", confusion_matrix(y_test, y_pred_svm))

KNN Accuracy:  0.8628019323671497
KNN Classification Report: 
               precision    recall  f1-score   support

           0       0.93      0.87      0.90       739
           1       0.73      0.83      0.78       296

    accuracy                           0.86      1035
   macro avg       0.83      0.85      0.84      1035
weighted avg       0.87      0.86      0.87      1035

SVM Accuracy:  0.8173913043478261
SVM Classification Report: 
               precision    recall  f1-score   support

           0       0.80      0.98      0.88       739
           1       0.91      0.40      0.56       296

    accuracy                           0.82      1035
   macro avg       0.86      0.69      0.72      1035
weighted avg       0.83      0.82      0.79      1035

KNN Confusion Matrix: 
 [[646  93]
 [ 49 247]]
SVM Confusion Matrix: 
 [[727  12]
 [177 119]]
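
To connect the confusion matrices to the reports above: scikit-learn's `confusion_matrix` puts true labels in rows and predicted labels in columns, so for labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]` with spam (1) as the positive class. The sketch below recomputes accuracy and the spam-class precision and recall directly from the printed matrices; `metrics_from_confusion` is a hypothetical helper introduced only for illustration.

```python
def metrics_from_confusion(cm):
    """Compute metrics from cm = [[TN, FP], [FN, TP]], with class 1 (spam) as positive."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),  # of emails flagged as spam, the share that is spam
        "recall": tp / (tp + fn),     # of actual spam emails, the share that was caught
    }


# Matrices copied from the printed output above
knn_cm = [[646, 93], [49, 247]]
svm_cm = [[727, 12], [177, 119]]

knn_metrics = metrics_from_confusion(knn_cm)
svm_metrics = metrics_from_confusion(svm_cm)

for name, m in (("KNN", knn_metrics), ("SVM", svm_metrics)):
    print(name, {k: round(v, 2) for k, v in m.items()})
```

Read this way, the two models fail differently. KNN misses 49 of the 296 spam emails but flags 93 legitimate ones as spam. The SVM flags only 12 legitimate emails but lets 177 spam emails through, which is why its spam recall is 0.40 despite a spam precision of 0.91.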
