## Spam Email Classification: Building an Efficient Filter to Identify Unwanted Emails

Spam email classification is a core task in email filtering. This project develops a spam email classifier using machine learning techniques. Using the "SpamAssassin Public Corpus" (https://spamassassin.apache.org/old/publiccorpus/) dataset, a collection of labeled spam and ham (legitimate) emails, the project explores text classification algorithms and feature extraction methods. The goal is a filter that accurately distinguishes legitimate emails from unwanted spam. By implementing, training, and evaluating a model, the project also aims to show which techniques work well for email filtering, with the practical payoff of a better user experience and improved email security.

### Data Sourcing

In [7]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [2]:
import nltk

# Download the WordNet resource
nltk.download('wordnet')


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\chris\AppData\Roaming\nltk_data...


True

In [4]:
import nltk

# Download the Open Multilingual WordNet resource
nltk.download('omw-1.4')


[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\chris\AppData\Roaming\nltk_data...


True

The preprocessing step below also relies on the NLTK `stopwords` corpus and the `punkt` tokenizer models (`punkt_tab` on recent NLTK releases). If they are not already installed, they can be downloaded the same way with `nltk.download('stopwords')` and `nltk.download('punkt')`.

### Data Preprocessing

In [5]:
import os
import email
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Define paths to the downloaded dataset folders
ham_folder = 'C:\\Users\\chris\\DSC680-T301\\emailham'
spam_folder = 'C:\\Users\\chris\\DSC680-T301\\emailspam'


# Initialize lists to store preprocessed email data
emails = []
labels = []

# Preprocessing function
def preprocess_text(text):
    # Remove HTML tags
    soup = BeautifulSoup(text, "html.parser")
    text = soup.get_text(separator=' ')

    # Convert to lowercase and remove punctuation
    text = re.sub(r'[^\w\s]', ' ', text.lower())

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatize the tokens
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # Return preprocessed text as a single string
    return ' '.join(tokens)

# Iterate through ham folder
for filename in os.listdir(ham_folder):
    with open(os.path.join(ham_folder, filename), 'r', encoding='latin1') as file:
        # Parse email content
        content = file.read()
        msg = email.message_from_string(content)

        # Extract subject and body
        subject = msg.get('Subject', '')
        body = ''
        if msg.is_multipart():
            for part in msg.walk():
                content_type = part.get_content_type()
                if content_type == 'text/plain':
                    body = part.get_payload()
        else:
            body = msg.get_payload()

        # Preprocess email text
        preprocessed_text = preprocess_text(subject + ' ' + body)

        # Append preprocessed email and label to lists
        emails.append(preprocessed_text)
        labels.append(0)  # 0 for ham

# Iterate through spam folder
for filename in os.listdir(spam_folder):
    with open(os.path.join(spam_folder, filename), 'r', encoding='latin1') as file:
        # Parse email content
        content = file.read()
        msg = email.message_from_string(content)

        # Extract subject and body
        subject = msg.get('Subject', '')
        body = ''
        if msg.is_multipart():
            for part in msg.walk():
                content_type = part.get_content_type()
                if content_type == 'text/plain':
                    body = part.get_payload()
        else:
            body = msg.get_payload()

        # Preprocess email text
        preprocessed_text = preprocess_text(subject + ' ' + body)

        # Append preprocessed email and label to lists
        emails.append(preprocessed_text)
        labels.append(1)  # 1 for spam

# Print a few preprocessed emails and their corresponding labels
for i in range(5):
    print("Email:", emails[i])
    print("Label:", "Spam" if labels[i] == 1 else "Ham")
    print()





Email: new sequence window date wed 21 aug 2002 10 54 46 0500 chris garrigues message id 1029945287 4797 tmda deepeddy vircio com reproduce error repeatable like every time without fail debug log pick happening 18 19 03 pick_it exec pick inbox list lbrace lbrace subject ftp rbrace rbrace 4852 4852 sequence mercury 18 19 03 exec pick inbox list lbrace lbrace subject ftp rbrace rbrace 4852 4852 sequence mercury 18 19 04 ftoc_pickmsgs 1 hit 18 19 04 marking 1 hit 18 19 04 tkerror syntax error expression int note run pick command hand delta pick inbox list lbrace lbrace subject ftp rbrace rbrace 4852 4852 sequence mercury 1 hit 1 hit come obviously version nmh using delta pick version pick nmh 1 0 4 compiled fuchsia c mu oz au sun mar 17 14 55 56 ict 2002 relevant part mh_profile delta mhparam pick seq sel list since pick command work sequence actually one explicit command line search popup one come mh_profile get created kre p still using version code form day ago able reach cv repository
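
The ham and spam loops above repeat the same parsing logic. In addition, `get_payload()` called without `decode=True` returns base64 or quoted-printable bodies in their encoded form. The following is a minimal sketch of a shared helper, not the code that produced the results in this notebook. The name `extract_text` and the sample message are hypothetical. Unlike the loops above, it joins every text part rather than keeping only the last `text/plain` part, so using it would likely change the feature matrix.

```python
import email
from email.message import Message


def extract_text(msg: Message) -> str:
    """Return the subject followed by all decoded text parts of an email."""
    pieces = [str(msg.get('Subject', ''))]
    for part in msg.walk():
        # Skip multipart containers and non-text attachments
        if part.get_content_maintype() != 'text':
            continue
        payload = part.get_payload(decode=True)  # bytes after transfer decoding
        if payload is None:
            continue
        charset = part.get_content_charset() or 'latin1'
        try:
            pieces.append(payload.decode(charset, errors='replace'))
        except LookupError:
            # Unknown charset name declared in the header: fall back to latin1
            pieces.append(payload.decode('latin1', errors='replace'))
    return ' '.join(pieces)


# Hypothetical base64-encoded message used only to exercise the helper
raw = (
    "Subject: Hello\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=\"utf-8\"\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "ZnJlZSBtb25leQ==\n"
)
text = extract_text(email.message_from_string(raw))
print(text)
```

With such a helper, each loop body would reduce to parsing the file, calling `preprocess_text(extract_text(msg))`, and appending the label.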

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Convert the preprocessed emails to numerical features
X = vectorizer.fit_transform(emails)

# Print the shape of the feature matrix
print("Shape of feature matrix:", X.shape)


Shape of feature matrix: (3052, 37356)


The feature matrix has 3052 rows, one per preprocessed email, and 37356 columns, one per unique term in the vocabulary learned from the corpus. Each entry is the TF-IDF weight of that term in that email.
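
To make the weighting concrete, here is a small pure-Python sketch of how a TF-IDF matrix can be built. It assumes the formula documented for scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row). The toy corpus and the whitespace tokenization are illustrative simplifications, not what `TfidfVectorizer` does internally.

```python
import math
from collections import Counter

# Toy corpus standing in for the preprocessed emails
docs = ["free money now", "meeting notes attached", "free meeting"]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}


def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF vector for one document."""
    counts = Counter(tokens)
    weights = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]


matrix = [tfidf_row(doc) for doc in tokenized]
print("Shape:", (len(matrix), len(vocab)))
```

In the first toy document, "money" appears in only one document and therefore gets a larger weight than "free", which appears in two.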

### Train a machine learning model using the preprocessed emails and corresponding labels

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(emails, labels, test_size=0.2, random_state=42)

# Create a TF-IDF vectorizer and fit it on the training data
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)

# Transform the test data using the fitted vectorizer
X_test_tfidf = vectorizer.transform(X_test)

# Initialize and train a Support Vector Machine (SVM) classifier
svm_classifier = SVC()
svm_classifier.fit(X_train_tfidf, y_train)

# Predict the labels for the test data
y_pred = svm_classifier.predict(X_test_tfidf)

# Calculate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 0.972176759410802


The SVM classifier achieved an accuracy of 97.22% on the test data. Accuracy alone can overstate performance when ham greatly outnumbers spam, as it does in this test set, so precision and recall are examined next.

In [8]:
from sklearn.metrics import precision_score, recall_score, f1_score

# Calculate precision
precision = precision_score(y_test, y_pred)

# Calculate recall
recall = recall_score(y_test, y_pred)

# Calculate F1 score
f1 = f1_score(y_test, y_pred)

print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)


Precision: 1.0
Recall: 0.8152173913043478
F1 Score: 0.8982035928143712


Based on the evaluation metrics, the email classification model achieved the following results:

A precision of 1.0 indicates that every email predicted as spam was indeed spam. The recall of 0.8152 means the model identified approximately 81.52% of the actual spam emails and missed the rest. The F1 score of 0.8982 is the harmonic mean of precision and recall, so it is pulled down by the lower recall.

Overall, these metrics indicate that the model is performing well in terms of precision, but there is some room for improvement in terms of recall. It's important to find the right balance between precision and recall based on your specific requirements and the costs associated with false positives and false negatives in email classification.

If you have a large number of false negatives (spam emails classified as ham), you may want to focus on improving recall to catch more spam emails. On the other hand, if you have a high number of false positives (ham emails classified as spam), you may want to focus on improving precision to reduce the number of legitimate emails mistakenly classified as spam.
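
One common way to trade precision for recall is to move the decision threshold applied to the classifier's scores. The sketch below illustrates the effect using hypothetical scores and labels, not values from this dataset. With scikit-learn's `SVC`, such scores could be obtained from `svm_classifier.decision_function(X_test_tfidf)`, where the default threshold is 0.

```python
def precision_recall_at(scores, labels, threshold):
    """Compute precision and recall when scores >= threshold count as spam."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predicted, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical decision scores (higher means more spam-like) and true labels
scores = [-1.2, -0.8, -0.3, -0.1, 0.2, 0.9, 1.5]
labels = [0, 0, 1, 0, 1, 1, 1]

for threshold in (0.0, -0.5):
    p, r = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

In this toy example, lowering the threshold catches the spam email that was missed, at the cost of flagging one ham email.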

In [11]:
from sklearn.metrics import confusion_matrix

# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Print the confusion matrix
print("Confusion Matrix:")
print(cm)


Confusion Matrix:
[[519   0]
 [ 17  75]]


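In scikit-learn's `confusion_matrix`, rows are true labels and columns are predicted labels, ordered ham (0) then spam (1). Treating spam as the positive class:

1. True Negatives (TN): 519 - The model correctly classified 519 ham emails as ham.
2. False Positives (FP): 0 - The model incorrectly classified 0 ham emails as spam.
3. False Negatives (FN): 17 - The model incorrectly classified 17 spam emails as ham.
4. True Positives (TP): 75 - The model correctly classified 75 spam emails as spam.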
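
The scores reported earlier can be recomputed by hand from the confusion matrix. This sketch assumes scikit-learn's layout (rows are true labels, columns are predicted labels, in label order 0 = ham, 1 = spam) and treats spam as the positive class. The `baseline` value is an added point of comparison: the accuracy of a trivial classifier that labels every test email as ham.

```python
# Confusion matrix printed above: rows = true label, columns = predicted label
cm = [[519, 0],
      [17, 75]]

tn, fp = cm[0]  # true ham: predicted ham, predicted spam
fn, tp = cm[1]  # true spam: predicted ham, predicted spam
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of always predicting ham, for comparison
baseline = (tn + fp) / total

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("Always-ham baseline accuracy:", baseline)
```

The baseline of roughly 85% shows why the 97.22% accuracy should be read together with recall: most of the test set is ham, so a model can score well on accuracy while still missing spam.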

### Test the model

In [14]:
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

def preprocess_text(text):
    # Remove HTML tags
    soup = BeautifulSoup(text, "html.parser")
    text = soup.get_text(separator=' ')

    # Convert to lowercase and remove punctuation
    text = re.sub(r'[^\w\s]', ' ', text.lower())

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatize the tokens
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # Return preprocessed text as a single string
    return ' '.join(tokens)

# Example email to classify
email_content = """
Hi, 

Here are a few reasons why 2,300+ developers have signed up for ProjectPro :

1) This is the only product in the world that provides pre-built, verified, end-to-end project recipes in Machine Learning and Big Data.

2) Impress your boss by having on-demand access to pre-built, reusable project solutions using the latest frameworks like Tensorflow, PySpark, BERT etc. 

3) Get assigned to hot projects in Machine Learning and Big Data in your company and have the confidence to work on these projects with the help of our reusable solutions. 

4) Impress your job interviewers with implementation knowledge on a variety of real world live projects. 
"""

# Preprocess the email text
preprocessed_email = preprocess_text(email_content)

# Transform the preprocessed email text using the fitted vectorizer
email_tfidf = vectorizer.transform([preprocessed_email])

# Predict the label for the email
email_prediction = svm_classifier.predict(email_tfidf)

# Print the predicted label
# predict() returns an array, so take its single element
if email_prediction[0] == 1:
    print("The email is predicted as spam.")
else:
    print("The email is predicted as ham.")


The email is predicted as ham.
