Veer Khosla

CS 251: Data Analysis and Visualization

Fall 2023

Project 6: Supervised learning

In [80]:
import os
import random
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

plt.style.use(['seaborn-v0_8-colorblind', 'seaborn-v0_8-darkgrid'])
plt.rcParams.update({'font.size': 20})

np.set_printoptions(suppress=True, precision=5)

# Automatically reload external modules
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Task 3: Preprocess full spam email dataset 

Before you build a Naive Bayes spam email classifier, run the full spam email dataset through your preprocessing code.

Download and extract the full **Enron** emails (*zip file should be ~29MB large*). You should see a base `enron` folder, with `spam` and `ham` subfolders when you extract the zip file (these are the 2 classes).

Run the test code below to check everything over.

### 3a. Preprocess dataset

In [81]:
import email_preprocessor as epp

#### Test `count_words` and `find_top_words`

In [82]:
word_freq, num_emails = epp.count_words()

In [83]:
print(f'You found {num_emails} emails in the datset. You should have found 32625.')

You found 32625 emails in the datset. You should have found 32625.


In [84]:
top_words, top_counts = epp.find_top_words(word_freq)
print(f"Your top 10 words are\n{top_words[:10]}\nand they should be\n['the', 'to', 'and', 'of', 'a', 'in', 'for', 'you', 'is', 'enron']")
print(f"The associated counts are\n{top_counts[:10]}\nand they should be\n[277459, 203659, 148873, 139578, 111796, 100961, 80765, 77592, 68097, 60852]")

Your top 10 words are
['the', 'to', 'and', 'of', 'a', 'in', 'for', 'you', 'is', 'enron']
and they should be
['the', 'to', 'and', 'of', 'a', 'in', 'for', 'you', 'is', 'enron']
The associated counts are
[277459, 203659, 148873, 139578, 111796, 100961, 80765, 77592, 68097, 60852]
and they should be
[277459, 203659, 148873, 139578, 111796, 100961, 80765, 77592, 68097, 60852]


### 3b. Make feature and class vectors

In [85]:
features, y = epp.make_feature_vectors(top_words, num_emails)

#### Verify class label coding

There are
- 16544 `ham` emails
- 16081 `spam` emails.

In the cell below, print out the number of emails that have class label `0` and the number that have class label `1`.
- The count for class label `0` should be 16544
- The count for class label `1` should be 16081.

If the counts across the labels are reversed, recode class label `0` as `1` and class label `1` as `0` in the label vector `y`. If you do this, print out the counts for each label again and verify you get the above counts.

In [86]:
count_class_0 = np.sum(y == 0)
count_class_1 = np.sum(y == 1)

print("Count for class label 0:", count_class_0)
print("Count for class label 1:", count_class_1)

if count_class_0 > count_class_1:
    y = 1 - y

    count_class_0 = np.sum(y == 0)
    count_class_1 = np.sum(y == 1)

    print("\nAfter recoding:")
    print("Count for class label 0:", count_class_0)
    print("Count for class label 1:", count_class_1)


Count for class label 0: 16544
Count for class label 1: 16081

After recoding:
Count for class label 0: 16081
Count for class label 1: 16544


### 3b. Make train and test splits of the dataset

Here we divide the email features into a 80/20 train/test split (80% of data used to train the supervised learning model, 20% we withhold and use for testing / prediction).

In [87]:
np.random.seed(0)
x_train, y_train, inds_train, x_test, y_test, inds_test = epp.make_train_test_sets(features, y)

print('Shapes for train/test splits:')
print(f'Train {x_train.shape}, classes {y_train.shape}')
print(f'Test {x_test.shape}, classes {y_test.shape}')
print('\nThey should be:\nTrain (26100, 200), classes (26100,)\nTest (6525, 200), classes (6525,)')

Shapes for train/test splits:
Train (26100, 200), classes (26100,)
Test (6525, 200), classes (6525,)

They should be:
Train (26100, 200), classes (26100,)
Test (6525, 200), classes (6525,)


### 3c. Save data in binary format

It adds a lot of overhead to have to run through your raw email -> train/test feature split every time you wanted to work on your project! In this step, you will export the data in memory to disk in a binary format. That way, you can quickly load all the data back into memory (directly in ndarray format) whenever you want to work with it again. No need to parse from text files!

Running the following cell uses numpy's `save` function to make six files in `.npy` format (e.g. `email_train_x.npy`, `email_train_y.npy`, `email_train_inds.npy`, `email_test_x.npy`, `email_test_y.npy`, `email_test_inds.npy`).

In [88]:
np.save('data/email_train_x.npy', x_train)
np.save('data/email_train_y.npy', y_train)
np.save('data/email_train_inds.npy', inds_train)
np.save('data/email_test_x.npy', x_test)
np.save('data/email_test_y.npy', y_test)
np.save('data/email_test_inds.npy', inds_test)

## Task 4: Naive Bayes Classifier

After finishing your email preprocessing pipeline, implement the one other supervised learning algorithm we we will use to classify email, **Naive Bayes**.

### 4a. Implement Naive Bayes

In `naive_bayes.py`, implement the following methods:
- Constructor
- get methods
- `train(data, y)`: Train the Naive Bayes classifier so that it records the "statistics" of the training set: class priors (i.e. how likely an email is in the training set to be spam or ham?) and the class likelihoods (the probability of a word appearing in each class — spam or ham).
- `predict(data)`: Combine the class likelihoods and priors to compute the posterior distribution. The predicted class for a test sample is the class that yields the highest posterior probability.
- `accuracy(y, y_pred)`: The usual definition :)


#### Bayes rule ingredients: Priors and likelihood (`train`)

To compute class predictions (probability that a test example belong to either spam or ham classes), we need to evaluate **Bayes Rule**. This means computing the priors and likelihoods based on the training data.

**Prior:** $$P_c = \frac{N_c}{N}$$ where $P_c$ is the prior for class $c$ (spam or ham), $N_c$ is the number of training samples that belong to class $c$ and $N$ is the total number of training samples.

**Likelihood:** $$L_{c,w} = \frac{T_{c,w} + 1}{T_{c} + M}$$ where
- $L_{c,w}$ is the likelihood that word $w$ belongs to class $c$ (*i.e. what we are solving for*)
- $T_{c,w}$ is the total count of **word $w$** in emails that are only in class $c$ (*either spam or ham*)
- $T_{c}$ is the total count of **all words** that appear in emails of the class $c$ (*total number of words in all spam emails or total number of words in all ham emails*)
- $M$ is the number of features (*number of top words*).

#### Bayes rule ingredients: Posterior (`predict`)

To make predictions, we now combine the prior and likelihood to get the posterior:

**Log Posterior:** $$Log(\text{Post}_{i, c}) = Log(P_c) + \sum_{j \in J_i}x_{i,j}Log(L_{c,j})$$

 where
- $\text{Post}_{i,c}$ is the posterior for class $c$ for test sample $i$(*i.e. evidence that email $i$ is spam or ham*). We solve for its logarithm.
- $Log(P_c)$ is the logarithm of the prior for class $c$.
- $x_{i,j}$ is the number of times the jth word appears in the ith email.
- $Log(L_{c,j})$: is the log-likelihood of the jth word in class $c$.

In [89]:
from naive_bayes import NaiveBayes

#### Test `train`

###### Class priors and likelihoods

The following test should be used only if storing the class priors and likelihoods directly.

In [90]:
num_test_classes = 4
np.random.seed(0)
data_test = np.random.randint(low=0, high=20, size=(100, 6))
y_test = np.random.randint(low=0, high=num_test_classes, size=(100,))

nbc = NaiveBayes(num_classes=num_test_classes)
nbc.train(data_test, y_test)

print(f'Your class priors are: {nbc.get_priors()}\nand should be          [0.28 0.22 0.32 0.18].')
print(f'Your class likelihoods shape is {nbc.get_likelihoods().shape} and should be (4, 6).')
print(f'Your likelihoods are:\n{nbc.get_likelihoods()}')

print(f'and should be')
print('''[[0.15997 0.15091 0.2079  0.19106 0.14184 0.14832]
 [0.11859 0.16821 0.17914 0.16905 0.18082 0.18419]
 [0.16884 0.17318 0.14495 0.14332 0.18784 0.18187]
 [0.16126 0.17011 0.15831 0.13963 0.18977 0.18092]]''')

Your class priors are: [0.28 0.22 0.32 0.18]
and should be          [0.28 0.22 0.32 0.18].
Your class likelihoods shape is (4, 6) and should be (4, 6).
Your likelihoods are:
[[0.15997 0.15091 0.2079  0.19106 0.14184 0.14832]
 [0.11859 0.16821 0.17914 0.16905 0.18082 0.18419]
 [0.16884 0.17318 0.14495 0.14332 0.18784 0.18187]
 [0.16126 0.17011 0.15831 0.13963 0.18977 0.18092]]
and should be
[[0.15997 0.15091 0.2079  0.19106 0.14184 0.14832]
 [0.11859 0.16821 0.17914 0.16905 0.18082 0.18419]
 [0.16884 0.17318 0.14495 0.14332 0.18784 0.18187]
 [0.16126 0.17011 0.15831 0.13963 0.18977 0.18092]]


###### Log of class priors and likelihoods

This test should be used only if storing the log of the class priors and likelihoods.

In [91]:
num_test_classes = 4
np.random.seed(0)
data_test = np.random.randint(low=0, high=20, size=(100, 6))
y_test = np.random.randint(low=0, high=num_test_classes, size=(100,))

nbc = NaiveBayes(num_classes=num_test_classes)
nbc.train(data_test, y_test)

print(f'Your log class priors are: {nbc.get_priors()}\nand should be              [-1.27297 -1.51413 -1.13943 -1.7148 ].')
print(f'Your log class likelihoods shape is {nbc.get_likelihoods().shape} and should be (4, 6).')
print(f'Your log likelihoods are:\n{nbc.get_likelihoods()}')


print(f'and should be')
print('''[[-1.83274 -1.89109 -1.57069 -1.65516 -1.95306 -1.90841]
 [-2.13211 -1.78255 -1.71958 -1.77756 -1.71023 -1.6918 ]
 [-1.77881 -1.75342 -1.93136 -1.94266 -1.67217 -1.70448]
 [-1.82475 -1.77132 -1.84321 -1.96879 -1.66192 -1.70968]]''')

Your log class priors are: [0.28 0.22 0.32 0.18]
and should be              [-1.27297 -1.51413 -1.13943 -1.7148 ].
Your log class likelihoods shape is (4, 6) and should be (4, 6).
Your log likelihoods are:
[[0.15997 0.15091 0.2079  0.19106 0.14184 0.14832]
 [0.11859 0.16821 0.17914 0.16905 0.18082 0.18419]
 [0.16884 0.17318 0.14495 0.14332 0.18784 0.18187]
 [0.16126 0.17011 0.15831 0.13963 0.18977 0.18092]]
and should be
[[-1.83274 -1.89109 -1.57069 -1.65516 -1.95306 -1.90841]
 [-2.13211 -1.78255 -1.71958 -1.77756 -1.71023 -1.6918 ]
 [-1.77881 -1.75342 -1.93136 -1.94266 -1.67217 -1.70448]
 [-1.82475 -1.77132 -1.84321 -1.96879 -1.66192 -1.70968]]


#### Test `predict`

In [92]:
num_test_classes = 4
np.random.seed(0)
data_train = np.random.randint(low=0, high=15, size=(100, 10))
data_test = np.random.randint(low=0, high=15, size=(15, 10))
y_test = np.random.randint(low=0, high=num_test_classes, size=(100,))

nbc = NaiveBayes(num_classes=num_test_classes)
nbc.train(data_train, y_test)
test_y_pred = nbc.predict(data_test)

print(f'Your predicted classes are\n{test_y_pred}\nand should be\n[2 0 0 3 0 3 2 1 2 3 1 0 0 1 0]]')

Your predicted classes are
[2 0 0 3 0 3 2 1 2 3 1 0 0 1 0]
and should be
[2 0 0 3 0 3 2 1 2 3 1 0 0 1 0]]


### 4b. Spam filtering

Use your Naive Bayes classifier to predict whether emails in the Enron email dataset are spam! Start by running the following code that uses `np.load` to load in the train/test split that you created last week.


In [93]:
import email_preprocessor as ep

In [98]:
x_train = np.load('data/email_train_x.npy')
y_train = np.load('data/email_train_y.npy')
inds_train = np.load('data/email_train_inds.npy')
x_test = np.load('data/email_test_x.npy')
y_test = np.load('data/email_test_y.npy')
inds_test = np.load('data/email_test_inds.npy')

In [102]:
num_classes = 2
enron_nbc = NaiveBayes(num_classes=num_classes)

enron_nbc.train(x_train, y_train)

test_y_pred = enron_nbc.predict(x_test)

accuracy = enron_nbc.accuracy(y_test, test_y_pred)

print(f"Accuracy: {accuracy*100:.2f}%")


Accuracy: 96.18%


### 4c. Questions

**Question 7:** What accuracy do you get on the test set with Naive Bayes. It should be roughly 89%.

**Answer 7:** 96.18 ?????

### 4d. Confusion matrix

To get a better sense of the errors that the Naive Bayes classifier makes, create a confusion matrix. 

- Implement `confusion_matrix` in `naive_bayes.py`.
- Print out a confusion matrix of the spam classification results. Assign the confusion matrix below to the variable `conf_matrix_nb`.
- Run below to help test your confusion matrix

In [113]:
nb = NaiveBayes(num_classes=2)

nb.train(data=x_train, y=y_train)

y_test_pred = nb.predict(data=x_test)

conf_matrix_nb = nb.confusion_matrix(y_test, y_test_pred)

print(conf_matrix_nb)


[[2951.  242.]
 [   7. 3325.]]


#### Test confusion matrix

In [114]:
print(f'The total number of entries in your confusion matrix is {int(conf_matrix_nb.sum())} and should be {len(y_test)}.')
print(f'The total number of ham entries in your confusion matrix is {int(conf_matrix_nb[0].sum())} and should be {int(np.sum(y_test == 0))}.')
print(f'The total number of spam entries in your confusion matrix is {int(conf_matrix_nb[1].sum())} and should be {int(np.sum(y_test == 1))}.')

The total number of entries in your confusion matrix is 6525 and should be 6525.
The total number of ham entries in your confusion matrix is 3193 and should be 3193.
The total number of spam entries in your confusion matrix is 3332 and should be 3332.


### 4e. Questions

**Question 8:** Interpret the confusion matrix, using the convention that positive detection means spam (*e.g. a false positive means classifying a ham email as spam*). What types of errors are made more frequently by the classifier? What does this mean (*i.e. X (spam/ham) is more likely to be classified than Y (spam/ham) than the other way around*)?

**Answer 8:** The classifier more often makes errors of false positives, mistaking ham for spam, than false negatives, mistaking spam for ham.

## Task 5: Comparison with KNN

In [116]:
from knn import KNN

### 5a. KNN spam email classification accuracy
Run a similar analysis to what you did with Naive Bayes above. When computing accuracy on the test set, you may want to reduce the size of the test set (e.g. to the first 500 emails in the test set).

In [120]:
x_train = np.load('data/email_train_x.npy')
y_train = np.load('data/email_train_y.npy')
inds_train = np.load('data/email_train_inds.npy')
x_test = np.load('data/email_test_x.npy')[:500]
y_test = np.load('data/email_test_y.npy')[:500]
inds_test = np.load('data/email_test_inds.npy')
num_classes = 2

enron_knn = KNN(num_classes)
enron_knn.train(x_train, y_train)

test_y_pred = enron_knn.predict(x_test, k=3)

accuracy = enron_knn.accuracy(y_test, test_y_pred)
print(f"Accuracy: {accuracy*100:.2f}%")

confusion_matrix = enron_knn.confusion_matrix(y_test, test_y_pred)
print(confusion_matrix)

Accuracy: 99.40%
[[267.   0.]
 [  3. 230.]]


### 5b. KNN spam email confusion matrix
Copy-paste your `confusion_matrix` method into `knn.py` so that you can run the same analysis on a KNN classifier.

### 5c. Questions

**Question 9:** What accuracy did you get on the test set (potentially reduced in size)?

**Question 10:** How does the confusion matrix compare to that obtained by Naive Bayes (*If you reduced the test set size, keep that in mind*)?

**Question 11:** Briefly describe at least one pro/con of KNN compared to Naive Bayes on this dataset.

**Question 12:** When potentially reducing the size of the test set here, why is it important that we shuffled our train and test set?

**Answer 9:** I got 99% accuracy

**Answer 10:** the confusion matrix shows 3 false positives and 0 false negatives, which is very low, showing that the accuracy is correct in predicting 

**Answer 11:** KNN works well with datasets that have decision boundaries like this one. However, it can also take long computing time on larger data sets.

**Answer 12:** It is important to shuffle the train and test sets because it ensures representativeness and reduces biases

### 6. Sources and AI

Sources:
1. Ruby Nunez (Friend). Helped me on some KNN problems I had and some naive bayes tests.
2. Also: https://www.geeksforgeeks.org/ml-naive-bayes-scratch-implementation-using-python/ - link to naive bayes
3. class notes and pictures from website

AI Used: None

## Extensions

### 0. Classify your own datasets

- Find datasets that you find interesting and run classification on them using your KNN algorithm (and if applicable, Naive Bayes). Analysis the performance of your classifer.

### 1. Better text preprocessing

- If you look at the top words extracted from the email dataset, many of them are common "stop words" (e.g. a, the, to, etc.) that do not carry much meaning when it comes to differentiating between spam vs. non-spam email. Improve your preprocessing pipeline by building your top words without stop words. Analyze performance differences.

### 2. Feature size

- Explore how the number of selected features for the email dataset influences accuracy and runtime performance.

### 3. Distance metrics
- Compare KNN performance with the $L^2$ and $L^1$ distance metrics

### 4. K-Fold Cross-Validation

- Research this technique and apply it to data and your KNN and/or Naive Bayes classifiers.

### 5. Email error analysis

- Dive deeper into the properties of the emails that were misclassified (FP and/or FN) by Naive Bayes or KNN. What is their word composition? How many words were skipped because they were not in the training set? What could plausibly account for the misclassifications?

### 6. Investigate the misclassification errors

Numbers are nice, but they may not the best for developing your intuition. Sometimes, you want to see what an misclassification *actually looks like* to help you improve your algorithm. Retrieve the actual text of some example emails of false positive and false negative misclassifications to see if helps you understand why the misclassification occurred. Here is an example workflow:

- Decide on how many FP and FN emails you would like to retrieve. Find the indices of this many false positive and false negative misclassification. Remember to use your `test_inds` array to look up the index of the emails BEFORE shuffling happened.
- Implement the function `retrieve_emails` in `email_preprocessor.py` to return the string of the raw email at the error indices.
- Call your function to print out the emails that produced misclassifications.

Do the FP and FN emails make sense? Why? Do the emails have properties in common? Can you quantify and interpret them?

### 7. KNN for regression

KNN can also be used to perform regression between one or more independent variables and a dependent variable. The potential advantage of this approach is that the regression performed by KNN does not assume any specific form of the regression curve (e.g. line, polynomial, etc.) — the regression is entirely training data-dependent.

KNN for regression is largely the same as for classification except for the following change during prediction:
- For each test sample (validation data), the predicted "y value" is the average "y value" of the K nearest training samples.
- You can use MSE to evaluate how well the regression fits the test samples that you plug in.