Stephanie Chiang  
DATA 620 Summer 2025  
### Assignment Week 5 Part 2:
# Document Classification

In this project, I will use the [UCI Machine Learning Repository: Spambase Data Set](https://archive.ics.uci.edu/dataset/94/spambase) corpus of labeled spam and non-spam e-mails to predict whether or not a new document is spam.

There are 58 variables (57 features plus the target) and 4601 instances in the data. Each feature in X is a word or character frequency, or a statistic on the lengths of runs of consecutive capital letters. The target variable is binary: spam (1) or not spam (0).

In [89]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
spambase = fetch_ucirepo(id=94) 
  
# data (as pandas dataframes) 
X = spambase.data.features
y = spambase.data.targets 

# number of variables (57 features plus the target), then number of instances
print(len(spambase.variables))
print(len(X))

58
4601


### Data Preparation

First, I will combine the features and the target into a single dataframe so that the rows can be shuffled together, keeping each label attached to its features. Shuffling helps make the training and test sets representative of the overall data. After shuffling, I will split the data back into features (X) and target variable (y).

Next, I will set aside 500 instances for testing later and use the remaining 4101 for training. The class counts below show a broadly similar balance in both sets: about 40% spam in the training set (1638 of 4101) and 35% in the test set (175 of 500).

In [91]:
import pandas as pd

# Combine X and y into one DataFrame
combined_df = pd.concat([X, y], axis=1)

# Shuffle rows
shuffled_df = combined_df.sample(frac=1, random_state=101).reset_index(drop=True)

# Split back into X and y
X_shuffled = shuffled_df[X.columns]
y_shuffled = shuffled_df[y.columns[0]]

# Split into training and test sets
train_X, test_X = X_shuffled[:-500].reset_index(drop=True), X_shuffled[-500:].reset_index(drop=True)
train_y, test_y = y_shuffled[:-500].reset_index(drop=True), y_shuffled[-500:].reset_index(drop=True)

print(train_y.value_counts())
print(test_y.value_counts())

Class
0    2463
1    1638
Name: count, dtype: int64
Class
0    325
1    175
Name: count, dtype: int64
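
The balance check above can be made explicit by turning the printed counts into proportions. This is a minimal sketch that hard-codes the counts from the output; `spam_share` is a helper name introduced here for illustration and is not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
train_counts = {0: 2463, 1: 1638}
test_counts = {0: 325, 1: 175}


def spam_share(counts):
    """Return the fraction of instances labeled spam (1)."""
    return counts[1] / sum(counts.values())


train_share = spam_share(train_counts)
test_share = spam_share(test_counts)

print(f"Training spam share: {train_share:.3f}")
print(f"Test spam share:     {test_share:.3f}")
```

The two shares differ by about five percentage points. If a closer match were needed, a stratified split would be one option.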


Here we can build the classifier in NLTK using Naive Bayes. The trainer requires the data to be reformatted as a list of tuples, where each tuple contains a dictionary of features and the numeric label.

In [95]:
import nltk

train_set = [
    (train_X.iloc[i].to_dict(), train_y.iloc[i])
    for i in range(len(train_X))
]

test_set = [
    (test_X.iloc[i].to_dict(), test_y.iloc[i])
    for i in range(len(test_X))
]

classifier = nltk.NaiveBayesClassifier.train(train_set)
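
One caveat with this format: `nltk.NaiveBayesClassifier` treats every feature value as a discrete category, so a frequency of 0.32 and one of 0.33 are unrelated values as far as the model is concerned. A possible refinement, not used in this notebook, is to bin each continuous value before training. The sketch below shows one way to do that under stated assumptions: the helpers `bin_value` and `bin_features` and the cut points in `EDGES` are hypothetical, chosen only for illustration, and the same cut points are applied to every feature for simplicity (the capital-run features would need their own scale).

```python
from bisect import bisect_left

# Hypothetical cut points; real ones would be chosen from the training data.
EDGES = [0.0, 0.1, 0.5, 1.0]
LABELS = ["zero", "low", "medium", "high", "very_high"]


def bin_value(value, edges=EDGES, labels=LABELS):
    """Map a non-negative number to a coarse label.

    A value equal to a cut point falls into the lower bin,
    so 0.0 -> "zero" and 0.1 -> "low".
    """
    return labels[bisect_left(edges, value)]


def bin_features(row):
    """Bin every value in a {feature_name: number} dictionary."""
    return {name: bin_value(value) for name, value in row.items()}


# Toy row using feature names from the dataset; the values are made up.
example_row = {"word_freq_free": 0.32, "word_freq_remove": 0.0, "word_freq_will": 0.7}
binned = bin_features(example_row)
print(binned)
```

Applied to the notebook's data, each training tuple could be built as `(bin_features(train_X.iloc[i].to_dict()), train_y.iloc[i])`. Whether this improves accuracy here is untested.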

The accuracy on the 500 held-out emails is 0.882, well above the 0.65 that would result from always predicting non-spam (325 of the 500 test emails). Some of the most informative features include the total length of capital-letter runs and the frequency of words like "free" and "receive". In the output below, low `capital_run_length_total` values (4 to 6) favor non-spam, while the listed word-frequency values favor spam.

In [97]:
accuracy = nltk.classify.accuracy(classifier, test_set)
print(f"Accuracy: {accuracy}")

classifier.show_most_informative_features(10)

Accuracy: 0.882
Most Informative Features
capital_run_length_total = 5.0                 0 : 1      =     49.4 : 1.0
          word_freq_free = 0.32                1 : 0      =     28.9 : 1.0
       word_freq_receive = 0.1                 1 : 0      =     24.3 : 1.0
           word_freq_000 = 0.34                1 : 0      =     24.2 : 1.0
capital_run_length_total = 6.0                 0 : 1      =     20.8 : 1.0
capital_run_length_total = 4.0                 0 : 1      =     19.3 : 1.0
          word_freq_will = 0.7                 1 : 0      =     19.0 : 1.0
         word_freq_order = 0.09                1 : 0      =     18.3 : 1.0
        word_freq_remove = 0.05                1 : 0      =     18.2 : 1.0
        word_freq_remove = 0.32                1 : 0      =     18.2 : 1.0
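
Accuracy alone does not show which kind of error the classifier makes. As a complement, precision and recall for the spam class can be computed from gold and predicted labels. The sketch below uses a small made-up label list, not the notebook's results, and `precision_recall` is a hypothetical helper. With the real data, the predictions could come from calling `classifier.classify(features)` on each feature dictionary in `test_set`.

```python
def precision_recall(gold, predicted, positive=1):
    """Return (precision, recall) for the class given by `positive`."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels for illustration only (1 = spam, 0 = not spam).
gold = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

precision, recall = precision_recall(gold, predicted)
print(f"Spam precision: {precision:.2f}, recall: {recall:.2f}")
```

For a spam filter, precision reflects how often flagged mail really is spam, and recall reflects how much spam is caught, so the two can matter differently depending on the cost of losing a legitimate email.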
