#Week Ten - Assignment: Document Classification

## Prompt:

It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  Here is one example of such data:  UCI Machine Learning Repository: Spambase Data Set

For this project, you can either use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).

For more adventurous students, you are welcome (encouraged!) to come up a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.

# Import Packages

In [56]:
import nltk
import random
import pandas as pd
import requests
from bs4 import BeautifulSoup
import os
import tarfile
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.utils import shuffle

# Loading Dataset

In [57]:
def extract_tar_bz2(file_path, extract_dir):
    with tarfile.open(file_path, "r:bz2") as tar:
        tar.extractall(path=extract_dir)

ham_path = "/content/20021010_easy_ham.tar.bz2"
spam_path = "/content/20021010_spam.tar.bz2"

extract_tar_bz2(ham_path, "/content/ham")
extract_tar_bz2(spam_path, "/content/spam")

I first started by extracting data from the spamassassin public corpus which can be found here: https://spamassassin.apache.org/old/publiccorpus/. I extracted data from one ham ("20021010_easy_ham.tar.bz2") and one spam ("20021010_spam.tar.bz2") folder and created two separate directories -- one for each. I set a seed beforehand for reproducibility.

# Summary Statistics

In [58]:
def count_emails_in_directory(directory):
    total_count = 0
    for root, dirs, files in os.walk(directory):
        for file in files:
            file_path = os.path.join(root, file)
            with open(file_path, "r", encoding="latin1") as f:
                email_count = f.read().count("From:")
                total_count += email_count
    return total_count
ham_dir = "/content/ham"
spam_dir = "/content/spam"
ham_count = count_emails_in_directory(ham_dir)
spam_count = count_emails_in_directory(spam_dir)
print("Number of ham emails:", ham_count)
print("Number of spam emails:", spam_count)

Number of ham emails: 2889
Number of spam emails: 512


Here, I have counted the number of emails in each directory by iterating through all files from each directory, and counting the occurrences of the string "From:" in each file. This gets applied to both ham and spam directories to obtain the counts of ham and spam emails. We can see that there are 2889 ham emails and 512 spam emails.

# Creating a Combined Dataframe

In [59]:
def read_emails_in_directory(directory):
    emails = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            file_path = os.path.join(root, file)
            with open(file_path, "r", encoding="latin1") as f:
                email_content = f.read()
                emails.append({"File": file, "Content": email_content})
    df = pd.DataFrame(emails)
    return df
ham_df = read_emails_in_directory(ham_dir)
spam_df = read_emails_in_directory(spam_dir)

ham_df.head()
spam_df.head()

Unnamed: 0,File,Content
0,0300.fa3ece84a195f3d36a70f2550824071f,From news@risingtidestudios.com Fri Sep 13 13...
1,0095.e1db2d3556c2863ef7355faf49160219,From sitescooper-talk-admin@lists.sourceforge....
2,0021.15185fdb3fb02dffd041fa8f70d19791,From ilug-admin@linux.ie Fri Aug 23 11:07:47 ...
3,0267.0bf79a17115bffdf00bb0997f773dfc5,Return-Path: ler@lerami.lerctr.org\nDelivery-D...
4,0323.badf0273f656afd0dfebaa63af1c81f6,From webmake-talk-admin@lists.sourceforge.net ...


We can then read all email files within each directory and create a dataframe. In the dataframe, each row contains the file name and its corresponding email content.

In [60]:
def read_emails_in_directory(directory, label):
    emails = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            file_path = os.path.join(root, file)
            with open(file_path, "r", encoding="latin1") as f:
                email_content = f.read()
                if 'ham' in email_content.lower():
                    label = 'ham'
                elif 'spam' in email_content.lower():
                    label = 'spam'
                emails.append({"File": file, "Content": email_content, "Label": label})
    df = pd.DataFrame(emails)
    return df

ham_dir = "/content/ham"
spam_dir = "/content/spam"
ham_df = read_emails_in_directory(ham_dir, 'ham')
spam_df = read_emails_in_directory(spam_dir, 'spam')
print("Ham DataFrame:")
print(ham_df.head())
print("\nSpam DataFrame:")
print(spam_df.head())

Ham DataFrame:
                                    File  \
0  1058.f26006b375cfbfb03f4903683f585808   
1  1408.c202263092b223a607078977ed7aa6c3   
2  2222.21412f6d911e6718ab62011cbc6d9eea   
3  0051.9281d3f8a3faf47d09a7fafdf2caf26e   
4  0152.10d3220188413990b1deb862c509c818   

                                             Content Label  
0  From exmh-users-admin@redhat.com  Mon Sep  9 2...  spam  
1  From spamassassin-talk-admin@lists.sourceforge...  spam  
2  From rssfeeds@jmason.org  Tue Oct  1 10:36:56 ...  spam  
3  From ilug-admin@linux.ie  Fri Aug 23 11:07:52 ...  spam  
4  From sentto-2242572-55941-1034006157-zzzz=exam...  spam  

Spam DataFrame:
                                    File  \
0  0300.fa3ece84a195f3d36a70f2550824071f   
1  0095.e1db2d3556c2863ef7355faf49160219   
2  0021.15185fdb3fb02dffd041fa8f70d19791   
3  0267.0bf79a17115bffdf00bb0997f773dfc5   
4  0323.badf0273f656afd0dfebaa63af1c81f6   

                                             Content Label  
0  From new

We can label each email based on whether the email was already labeled as 'spam' or 'ham' from the corpus.  

# Creating Testing and Training Sets

In [63]:
combined_df_shuffled = shuffle(combined_df, random_state=42)
train_df, test_df = train_test_split(combined_df_shuffled, test_size=0.20, random_state=42)
print("Training set shape:", train_df.shape)
print("Testing set shape:", test_df.shape)

Training set shape: (2441, 3)
Testing set shape: (611, 3)


We can then combine the databases and shuffle them before splitting. We want to split the data such that a good aount of it can be used for machine learning, and such that there is still enough to test. I decided on an 80-20 split, making the test size 20%. This means that there are 2441 randomized emails in the training set, and 611 in the testing set.

# Data Analysis

In [65]:
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_df['Content'])
y_train = train_df['Label']

X_test = vectorizer.transform(test_df['Content'])
y_test = test_df['Label']

clf = MultinomialNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Data Analysis Summary:")
print(classification_report(y_test, y_pred))

Accuracy: 0.8363338788870703
Data Analysis Summary:
              precision    recall  f1-score   support

         ham       1.00      0.01      0.02       101
        spam       0.84      1.00      0.91       510

    accuracy                           0.84       611
   macro avg       0.92      0.50      0.47       611
weighted avg       0.86      0.84      0.76       611



The accuracy of the model is 83.63%, indicating that it correctly classified approximately 83.63% of the test data. However, the precision for the 'ham' class is unusually high at 100%, suggesting that all instances classified as 'ham' were correctly predicted. The precision for the 'spam' class is 84%, indicating that 84% of the instances classified as 'spam' were correctly predicted. Additionally, despite the high precision for 'ham', the extremely low recall indicates that the model failed to capture most of the 'ham' instances, leading to an F1-score of only 0.02. The macro and weighted average metrics show an overall imbalance in the model's performance between. This may be due to the unequal distribution of spam and ham email counts.

# Conclusion

In conclusion, the model achieved an overall accuracy of 83.63%, with particularly high precision for the 'ham' class at 100%. However, its performance on 'spam' classification, which was still decent at about 84% precision, experienced low recall for 'ham' instances. This led to an F1-score of only 0.02, indicating poor balance in the model's ability to correctly classify both classes. The macro and weighted average metrics also point to this imbalance, likely stemming from the unequal distribution of spam and ham emails in the dataset.

# Presentation Link