Problem Statement:

We have all received spam email. Spam, or junk mail, is email sent to a massive number of users at once, frequently containing cryptic messages, scams, or, most dangerously, phishing content.

In this project, we use Python to build an email spam detector and then use machine learning to train it to classify messages as spam or non-spam (ham). Let's get started!

1. Collecting Data

In [1]:
import pandas as pd

# Load the dataset
data = pd.read_csv("C:\\Users\\King\\Desktop\\Cipherbyte Internship\\Spam Email Detection - spam.csv")

# Display the first few rows of the dataset
print(data.head(10))


     v1                                                 v2 Unnamed: 2  \
0   ham  Go until jurong point, crazy.. Available only ...        NaN   
1   ham                      Ok lar... Joking wif u oni...        NaN   
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN   
3   ham  U dun say so early hor... U c already then say...        NaN   
4   ham  Nah I don't think he goes to usf, he lives aro...        NaN   
5  spam  FreeMsg Hey there darling it's been 3 week's n...        NaN   
6   ham  Even my brother is not like to speak with me. ...        NaN   
7   ham  As per your request 'Melle Melle (Oru Minnamin...        NaN   
8  spam  WINNER!! As a valued network customer you have...        NaN   
9  spam  Had your mobile 11 months or more? U R entitle...        NaN   

  Unnamed: 3 Unnamed: 4  
0        NaN        NaN  
1        NaN        NaN  
2        NaN        NaN  
3        NaN        NaN  
4        NaN        NaN  
5        NaN        NaN  
6        NaN        NaN  
7        NaN        NaN  
8        NaN        NaN  
9        NaN        NaN  

In [2]:
# Keep only the relevant columns and rename them.
# .copy() avoids pandas' SettingWithCopyWarning when a new column is added later.
data = data[['v1', 'v2']].copy()
data.columns = ['label', 'text']

# Display the first few rows of the dataset
print(data.head(10))

  label                                               text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
5  spam  FreeMsg Hey there darling it's been 3 week's n...
6   ham  Even my brother is not like to speak with me. ...
7   ham  As per your request 'Melle Melle (Oru Minnamin...
8  spam  WINNER!! As a valued network customer you have...
9  spam  Had your mobile 11 months or more? U R entitle...
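
The evaluation further down reports 965 ham and 150 spam messages in the test split, so the classes are clearly imbalanced. A quick way to see this right after loading is to count the labels. The sketch below is only an illustration: it reads a tiny made-up CSV from memory (the rows are hypothetical, only the column layout mirrors the file above) and uses `usecols` to skip the empty trailing columns at read time.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV (hypothetical rows, same column layout).
csv_text = """v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
ham,Ok lar... Joking wif u oni...,,,
spam,WINNER!! You have been selected,,,
ham,See you at lunch,,,
ham,Call me when you get home,,,
"""

# usecols drops the empty "Unnamed" columns while reading.
sample = pd.read_csv(io.StringIO(csv_text), usecols=["v1", "v2"])
sample = sample.rename(columns={"v1": "label", "v2": "text"})

# Absolute and relative class frequencies.
counts = sample["label"].value_counts()
shares = sample["label"].value_counts(normalize=True)
print(counts)
print(shares.round(2))
```

On the real data, the same two `value_counts()` calls on `data['label']` would show how rare spam is, which matters when reading accuracy figures later.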


2. Data Preprocessing

In [3]:
import re

def preprocess_text(text):
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # Remove special characters and punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Convert text to lowercase
    text = text.lower()
    return text

# Apply preprocessing to the 'text' column
data['clean_text'] = data['text'].apply(preprocess_text)

# Display the first few rows of the preprocessed text
print(data['clean_text'].head(10))


0    go until jurong point crazy available only in ...
1                              ok lar joking wif u oni
2    free entry in 2 a wkly comp to win fa cup fina...
3          u dun say so early hor u c already then say
4    nah i dont think he goes to usf he lives aroun...
5    freemsg hey there darling its been 3 weeks now...
6    even my brother is not like to speak with me t...
7    as per your request melle melle oru minnaminun...
8    winner as a valued network customer you have b...
9    had your mobile 11 months or more u r entitled...
Name: clean_text, dtype: object
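
To make the effect of each step concrete, the sketch below repeats `preprocess_text` and runs it on a few short strings. `preprocess_text_safe` is a hypothetical variant, not part of the notebook: it assumes that missing values should become empty strings and that runs of whitespace left behind by removed punctuation should be collapsed.

```python
import re


def preprocess_text(text):
    """Same three steps as the cell above."""
    text = re.sub(r'<[^>]+>', '', text)   # remove HTML tags
    text = re.sub(r'[^\w\s]', '', text)   # remove punctuation
    return text.lower()                   # lowercase


def preprocess_text_safe(text):
    """Hypothetical variant: tolerate non-strings and collapse whitespace."""
    if not isinstance(text, str):
        return ""
    return re.sub(r'\s+', ' ', preprocess_text(text)).strip()


print(repr(preprocess_text("<b>WINNER!!</b> Call now")))
print(repr(preprocess_text("Free - entry")))        # removed dash leaves a double space
print(repr(preprocess_text_safe("Free - entry ")))
print(repr(preprocess_text_safe(float("nan"))))     # NaN cells become empty strings
```

Note that `\w` also matches digits and underscores, so those characters survive the punctuation step.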


3. Feature Extraction

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initializing the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the clean text data
tfidf_matrix = tfidf_vectorizer.fit_transform(data['clean_text'])

# Display the shape of the TF-IDF matrix
print("TF-IDF Matrix Shape:", tfidf_matrix.shape)


TF-IDF Matrix Shape: (5572, 9442)
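
The shape above reads as (number of messages, number of distinct terms). A minimal sketch on a three-message toy corpus (the messages are made up for illustration) shows the same structure at a scale that can be checked by hand, along with two defaults of `TfidfVectorizer` that are easy to miss: rows are L2-normalised, and tokens shorter than two characters are dropped.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration.
corpus = [
    "free entry win prize",
    "ok see you later",
    "win free prize now",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)

# One row per message, one column per distinct term.
print(matrix.shape)
print(sorted(vectorizer.vocabulary_))

# With the default norm="l2", every row has unit length.
row_norms = np.linalg.norm(matrix.toarray(), axis=1)
print(row_norms)

# The default token pattern keeps only tokens of two or more characters,
# so SMS-style words such as "u" and "r" never reach the vocabulary.
short_vocab = TfidfVectorizer().fit(["u r entitled to a prize"]).vocabulary_
print(sorted(short_vocab))
```

If single-character tokens should count as features, the `token_pattern` parameter can be changed; whether that helps on this dataset is untested here.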


4. Model Building

In [5]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

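# Split the data into training and testing sets.
# Note: tfidf_matrix was fit on all rows above, so the vocabulary and IDF weights
# already reflect the test messages; fitting the vectorizer on the training rows only avoids this.
X_train, X_test, y_train, y_test = train_test_split(
    tfidf_matrix, data['label'], test_size=0.2, random_state=42
)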

# Initializing the Naive Bayes classifier
nb_classifier = MultinomialNB()

# Training the classifier
nb_classifier.fit(X_train, y_train)

# Predict on the testing set
y_pred = nb_classifier.predict(X_test)
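
One way to keep the test messages out of feature extraction is to split the raw text first and let a scikit-learn `Pipeline` fit the vectorizer on the training rows only. The following is a sketch under assumptions, not the notebook's method: the toy messages are invented, the step names `tfidf` and `nb` are arbitrary, and `stratify` is added so both splits keep the same spam share.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration: five spam and five ham messages.
texts = [
    "free prize call now", "win cash free entry", "urgent claim your prize",
    "free mobile offer txt now", "winner call to claim cash",
    "see you at lunch", "call me when you get home", "ok see you later",
    "are you coming tonight", "thanks for the lift home",
]
labels = ["spam"] * 5 + ["ham"] * 5

# Split the raw text, keeping the class ratio equal in both parts.
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then trains the classifier.
spam_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
spam_pipeline.fit(X_train_txt, y_train)

# At prediction time the fitted vectorizer is only applied, never refit.
y_pred = spam_pipeline.predict(X_test_txt)
print(list(zip(X_test_txt, y_pred)))
```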


5. Model Evaluation

In [6]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Calculating accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))


Accuracy: 0.95695067264574
Classification Report:
              precision    recall  f1-score   support

         ham       0.95      1.00      0.98       965
        spam       1.00      0.68      0.81       150

    accuracy                           0.96      1115
   macro avg       0.98      0.84      0.89      1115
weighted avg       0.96      0.96      0.95      1115

Confusion Matrix:
[[965   0]
 [ 48 102]]
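
Every number in the report can be recomputed from the confusion matrix, which makes the weak spot easier to see: no ham message is flagged as spam, but 48 of the 150 spam messages are missed. The sketch below redoes the arithmetic with the matrix printed above, treating spam as the positive class.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class,
# in the order [ham, spam].
cm = np.array([[965, 0],
               [48, 102]])

# With spam as the positive class: tn = ham kept, fp = ham flagged,
# fn = spam missed, tp = spam caught.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
spam_precision = tp / (tp + fp)
spam_recall = tp / (tp + fn)
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)
ham_precision = tn / (tn + fn)

print("accuracy:      ", round(accuracy, 4))
print("spam precision:", round(spam_precision, 2))
print("spam recall:   ", round(spam_recall, 2))
print("spam f1:       ", round(spam_f1, 2))
print("ham precision: ", round(ham_precision, 2))
```

Because ham dominates the test set, accuracy stays high even though roughly a third of the spam slips through, so spam recall is the more informative figure here.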


In [7]:
# Alternative model for comparison: Random Forest classifier
from sklearn.ensemble import RandomForestClassifier

rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

# Predict on the testing set; the bare expression displays the predicted labels
y_pred_rf = rf_classifier.predict(X_test)
y_pred_rf

array(['ham', 'ham', 'ham', ..., 'ham', 'ham', 'ham'], dtype=object)
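
The Random Forest predictions above are displayed but never scored. A minimal sketch of how the two models could be compared on the same split follows; it uses invented toy messages and the spam F1 score as the yardstick, both of which are assumptions rather than choices made in the notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB

# Toy data, invented for illustration.
train_texts = [
    "free prize call now", "win cash free entry", "urgent claim your prize",
    "free mobile offer txt now",
    "see you at lunch", "call me when you get home", "ok see you later",
    "are you coming tonight",
]
train_labels = ["spam"] * 4 + ["ham"] * 4
test_texts = ["claim your free prize", "free cash offer",
              "see you tonight", "call me later"]
test_labels = ["spam", "spam", "ham", "ham"]

# Fit TF-IDF on the training text and reuse it for the test text.
vectorizer = TfidfVectorizer()
X_tr = vectorizer.fit_transform(train_texts)
X_te = vectorizer.transform(test_texts)

models = {
    "naive_bayes": MultinomialNB(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Score every model with the same metric on the same held-out rows.
scores = {}
for name, model in models.items():
    model.fit(X_tr, train_labels)
    preds = model.predict(X_te)
    scores[name] = f1_score(test_labels, preds, pos_label="spam", zero_division=0)

for name, score in scores.items():
    print(f"{name}: spam F1 = {score:.2f}")
```

In the notebook itself, passing `y_test` and `y_pred_rf` to `classification_report` would give the Random Forest the same treatment as the Naive Bayes model.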

6. Hyperparameter Tuning

In [8]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {'alpha': [0.1, 0.5, 1.0, 2.0]}

# Initialize the Naive Bayes classifier
nb_classifier = MultinomialNB()

# Initialize GridSearchCV
grid_search = GridSearchCV(nb_classifier, param_grid, cv=5, scoring='accuracy')

# Perform grid search
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Train the classifier with the best hyperparameters
best_nb_classifier = MultinomialNB(alpha=best_params['alpha'])
best_nb_classifier.fit(X_train, y_train)


Best Hyperparameters: {'alpha': 0.1}
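
The search above optimises accuracy, while the earlier report suggests spam recall is the weaker side of the model. As a hedged alternative, the sketch below tunes the same `alpha` grid inside a pipeline, so TF-IDF is refit within each fold, and scores candidates by spam F1 instead. The toy messages, the step names, and the choice of `cv=3` are assumptions made to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration: six spam and six ham messages.
texts = [
    "free prize call now", "win cash free entry", "urgent claim your prize",
    "free mobile offer txt now", "winner call to claim cash", "txt free to win",
    "see you at lunch", "call me when you get home", "ok see you later",
    "are you coming tonight", "thanks for the lift home", "i will call you later",
]
labels = ["spam"] * 6 + ["ham"] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# The labels are strings, so the positive class has to be named explicitly.
spam_f1 = make_scorer(f1_score, pos_label="spam", zero_division=0)

# "<step name>__<parameter>" addresses a parameter inside the pipeline.
param_grid = {"nb__alpha": [0.1, 0.5, 1.0, 2.0]}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring=spam_f1)
search.fit(texts, labels)

print("Best alpha:", search.best_params_["nb__alpha"])
print("Best cross-validated spam F1:", round(search.best_score_, 2))
```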


In [9]:
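# Evaluate the tuned model on the entire dataset.
# Note: this includes the rows used for training, so these scores are optimistic;
# (X_test, y_test) remains the held-out estimate.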
y_pred_all = best_nb_classifier.predict(tfidf_matrix)

# Calculate overall accuracy
accuracy_all = accuracy_score(data['label'], y_pred_all)
print("Overall Accuracy:", accuracy_all)

# Print classification report and confusion matrix for the entire dataset
print("Classification Report:")
print(classification_report(data['label'], y_pred_all))

print("Confusion Matrix:")
print(confusion_matrix(data['label'], y_pred_all))


Overall Accuracy: 0.9944364680545585
Classification Report:
              precision    recall  f1-score   support

         ham       0.99      1.00      1.00      4825
        spam       0.99      0.97      0.98       747

    accuracy                           0.99      5572
   macro avg       0.99      0.98      0.99      5572
weighted avg       0.99      0.99      0.99      5572

Confusion Matrix:
[[4820    5]
 [  26  721]]
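
If a report over every row is wanted without scoring the model on rows it was trained on, out-of-fold predictions are one option: each message is predicted by a model that never saw it. The sketch below uses `cross_val_predict` on invented toy data with the tuned `alpha=0.1`; the fold count and the pipeline layout are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration.
texts = [
    "free prize call now", "win cash free entry", "urgent claim your prize",
    "free mobile offer txt now", "winner call to claim cash", "txt free to win",
    "see you at lunch", "call me when you get home", "ok see you later",
    "are you coming tonight", "thanks for the lift home", "i will call you later",
]
labels = ["spam"] * 6 + ["ham"] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB(alpha=0.1)),
])

# Each row is predicted by a model fitted on the other folds only.
oof_pred = cross_val_predict(pipeline, texts, labels, cv=3)

cm = confusion_matrix(labels, oof_pred, labels=["ham", "spam"])
print(cm)
print(classification_report(labels, oof_pred, zero_division=0))
```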


In [10]:
# Get the highest-probability words for each class (row 0 = ham, row 1 = spam).
# These are the most heavily weighted terms per class, not necessarily the most discriminative ones.
feature_names = tfidf_vectorizer.get_feature_names_out()
top_spam_words = sorted(zip(best_nb_classifier.feature_log_prob_[1], feature_names), reverse=True)[:10]
top_ham_words = sorted(zip(best_nb_classifier.feature_log_prob_[0], feature_names), reverse=True)[:10]

print("Top 10 Spam Words:")
for log_prob, word in top_spam_words:
    print(word)

print("\nTop 10 Ham Words:")
for log_prob, word in top_ham_words:
    print(word)


Top 10 Spam Words:
to
call
free
your
you
or
now
txt
for
mobile

Top 10 Ham Words:
you
to
the
me
in
my
and
is
ok
it


7. Model Deployment

In [11]:
new_email = "Congratulations! You have done the spam email detection sucessfully."

# Preprocess the new email
clean_new_email = preprocess_text(new_email)

# Transform the preprocessed email using the TF-IDF vectorizer
tfidf_new_email = tfidf_vectorizer.transform([clean_new_email])

# Predict the label of the new email
predicted_label = best_nb_classifier.predict(tfidf_new_email)

print("Predicted Label for the New Email:", predicted_label[0])


Predicted Label for the New Email: ham
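
The cell above classifies a message inside the running notebook. To reuse the model elsewhere, the fitted vectorizer and classifier have to be saved together. The sketch below shows one common approach with `joblib`; the file name `spam_pipeline.joblib`, the `classify` helper, and the toy training messages are all hypothetical, and the pipeline takes raw text, relying on `TfidfVectorizer`'s own lowercasing and tokenisation in place of `preprocess_text`.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data, invented for illustration.
texts = [
    "free prize call now", "win cash free entry", "urgent claim your prize",
    "see you at lunch", "call me when you get home", "ok see you later",
]
labels = ["spam"] * 3 + ["ham"] * 3

# Vectorizer and classifier travel together as one object.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB(alpha=0.1)),
]).fit(texts, labels)

# Save to a temporary directory and load it back (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "spam_pipeline.joblib"
    joblib.dump(pipeline, model_path)
    restored = joblib.load(model_path)


def classify(message, model=restored):
    """Hypothetical helper: return 'spam' or 'ham' for one raw message."""
    return model.predict([message])[0]


print(classify("Claim your free prize now"))
print(classify("See you at lunch"))
```

A file saved this way should only be loaded from a trusted source and with a compatible scikit-learn version, since it is a pickle under the hood.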
