<h1>Spam Email Classifier</h1>
<p>
    In this project, I build a spam email classifier that predicts whether a given email is spam based on the email’s content.
There are four main parts:
(1) clean the data;
(2) build feature vectors;
(3) train a Naive Bayes classifier, an SVM (linear kernel), and a logistic regression model on the data and make predictions;
(4) compare these models and discuss the results.
</p>
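
The cells in this notebook pass the raw message text straight to the vectorizer, so the "clean the data" step is not shown explicitly. A minimal sketch of what such a step could look like is given here. The helper name `clean_message` and the specific rules (lowercasing, stripping ASCII punctuation, collapsing whitespace) are assumptions for illustration, not the cleaning actually used in this project.

```python
import re
import string


def clean_message(text: str) -> str:
    """Hypothetical cleaning helper: lowercase, drop punctuation, collapse whitespace."""
    text = text.lower()
    # Remove ASCII punctuation characters such as "!" and "."
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace into a single space
    return re.sub(r"\s+", " ", text).strip()


print(clean_message("WINNER!!  Claim your FREE prize now..."))
# winner claim your free prize now
```

If a helper like this were adopted, it could be applied with `data['message'].apply(clean_message)` before splitting. Note that `CountVectorizer` already lowercases text and ignores punctuation by default, so the effect on the features may be small.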

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

In [4]:
# Load the dataset and keep only the raw label (v1) and message text (v2) columns
data = pd.read_csv('email_spam.csv', encoding='latin-1')
data = data[['v1', 'v2']]
data.columns = ['label', 'message']

In [5]:
# Map labels to binary values (ham = 0, spam = 1)
data['label'] = data['label'].map({'ham': 0, 'spam': 1})

In [6]:
# Split data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(data['message'], data['label'], test_size=0.2, random_state=42)
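
Spam is the minority class in this dataset, so a plain random split can leave the training and test sets with slightly different spam ratios. One option, sketched here on made-up toy data, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both sets. The toy messages and variable names are assumptions for illustration; the split above does not use stratification.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 40 ham messages and 10 spam messages
toy_messages = [f"ham message {i}" for i in range(40)] + [f"spam message {i}" for i in range(10)]
toy_labels = [0] * 40 + [1] * 10

# stratify=toy_labels keeps the 80/20 ham/spam ratio in both subsets
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_messages, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

print(Counter(toy_y_train))  # 32 ham, 8 spam
print(Counter(toy_y_test))   # 8 ham, 2 spam
```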

In [7]:
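# Feature extraction: learn the vocabulary from the training set only,
# then reuse that same vocabulary to transform the test set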
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
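
To make the bag-of-words representation concrete, here is a small sketch on a two-message toy corpus. The corpus and the `toy_` variable names are made up for illustration; only `CountVectorizer` itself comes from the notebook. Each row of the resulting matrix is one message and each column is the count of one vocabulary word.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical), standing in for the real messages
toy_corpus = ["free prize free entry", "see you at lunch"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_corpus)

# vocabulary_ maps each word to its column index; sort words by that index
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(vocab)
# ['at', 'entry', 'free', 'lunch', 'prize', 'see', 'you']
print(toy_counts.toarray().tolist())
# [[0, 1, 2, 0, 1, 0, 0], [1, 0, 0, 1, 0, 1, 1]]
```

The word "free" appears twice in the first message, so its column holds 2 in the first row. Words that appear only in the test set are dropped by `transform`, because the vocabulary is fixed when `fit_transform` runs on the training data.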

In [8]:
# Naive Bayes classifier (multinomial, suited to word-count features)
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)
nb_pred = nb_classifier.predict(X_test)

In [9]:
# SVM classifier with a linear kernel
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train, y_train)
svm_pred = svm_classifier.predict(X_test)

In [10]:
# Logistic regression classifier
lr_classifier = LogisticRegression()
lr_classifier.fit(X_train, y_train)
lr_pred = lr_classifier.predict(X_test)
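
The classification reports printed in the evaluation cell list precision, recall, and F1 score for each class. As a reminder of how those numbers are derived, here is a small worked sketch in plain Python. The labels and predictions are invented toy values, not results from this notebook, and `precision_recall_f1` is a hypothetical helper rather than a scikit-learn function.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Hypothetical helper: compute precision, recall, and F1 for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: 4 spam (1) and 6 ham (0) messages
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

precision, recall, f1 = precision_recall_f1(toy_true, toy_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))
# 0.667 0.5 0.571
```

In this toy case, 2 of the 3 messages flagged as spam really are spam (precision 2/3), while only 2 of the 4 actual spam messages are caught (recall 1/2). For a spam filter, low recall on the spam class means spam reaches the inbox, and low precision means legitimate mail is flagged.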

In [11]:
# Model evaluation: accuracy and per-class precision/recall/F1 on the test set
print('Naive Bayes Classifier:')
print('Accuracy:', accuracy_score(y_test, nb_pred))
print('Classification Report:')
print(classification_report(y_test, nb_pred))

print('\nSVM Classifier:')
print('Accuracy: ', accuracy_score(y_test, svm_pred))
print('Classification Report:')
print(classification_report(y_test, svm_pred))

print('\nLogistic Regression Classifier:')
print('Accuracy: ', accuracy_score(y_test, lr_pred))
print('Classification Report:')
print(classification_report(y_test, lr_pred))

Naive Bayes Classifier:
Accuracy: 0.9838565022421525
Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       965
           1       0.99      0.89      0.94       150

    accuracy                           0.98      1115
   macro avg       0.98      0.95      0.96      1115
weighted avg       0.98      0.98      0.98      1115


SVM Classifier:
Accuracy:  0.979372197309417
Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       965
           1       0.97      0.87      0.92       150

    accuracy                           0.98      1115
   macro avg       0.98      0.93      0.95      1115
weighted avg       0.98      0.98      0.98      1115


Logistic Regression Classifier:
Accuracy:  0.9775784753363229
Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       965
      