# CA-02 Naive Bayes
### BSAN 6070<br>Spring 2025

##### Tina Brauneck <br>02/04/2025

CA02: This is an email spam classifier that uses the Naive Bayes supervised machine learning algorithm.

This was coded for a graduate business analytics course. Sample code was provided for reading and vectorizing the emails. I have added annotations, made updates, and added the Naive Bayes algorithm to predict spam.

IMPORTANT NOTE:

The paths of your data folders 'train-mails' and 'test-mails' must be './train-mails' and './test-mails'. This means you must keep your .ipynb file and these folders in the SAME FOLDER on your laptop or Google Drive. That way, the peer reviewers and I can run your code from our own computers using the exact same relative paths, irrespective of our folder hierarchy.

## Setup

In [6]:
# import packages

import os
import numpy as np
from collections import Counter
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

In [7]:
# print(os.getcwd()) # optional: uncomment and run to check the current working directory

## Step 1: Word Frequency Dictionary

Using a custom function, we can create a dictionary of the most common words found in our emails.

In [10]:
# This is a custom function that:
#   1) counts the words across all emails
#   2) removes words of length 1 and words that are not purely alphabetic
#   3) returns the most common words (limited to 3000 words)

word_limit = 3000 # this sets the number of most common words used for training

def make_Dictionary(root_dir, word_limit):
  all_words = [] # collects every word from every email
  emails = [os.path.join(root_dir,f) for f in os.listdir(root_dir)] # creates a list of paths to all files in root_dir
  for mail in emails: # loop over each file
    with open(mail) as m: # opens the file at this path
      for line in m:
        words = line.split() # splits the line on whitespace
        all_words += words # adds each word to the all_words list
  dictionary = Counter(all_words) # creates a dictionary of word counts, with the word as the key and the count as the value
  list_to_remove = list(dictionary) # copies the keys (words) into a list so the dictionary can be modified while looping

  for item in list_to_remove:
    if not item.isalpha(): # True when the word contains any non-letter character
      del dictionary[item] # removes the word from the dictionary if it is not purely alphabetic
    elif len(item) == 1:
      del dictionary[item] # removes the word from the dictionary if it is only one letter long, such as 'a' or 'I'
  dictionary = dictionary.most_common(word_limit) # keeps only the word_limit most common words, as a list of (word, count) tuples
  return dictionary # returns the list of (word, count) tuples
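
The counting and filtering steps can be tried on a few in-memory strings before running them on the email folders. The snippet below is a minimal sketch, not part of the assignment code: `toy_lines` is made-up data, and the filter is written as a dictionary comprehension rather than the delete loop used in `make_Dictionary`.

```python
from collections import Counter

# Hypothetical sample text standing in for email lines
toy_lines = ["free money now", "a free offer 4 you", "meeting at noon"]

# Count every whitespace-separated token
counts = Counter(word for line in toy_lines for word in line.split())

# Keep only purely alphabetic words longer than one character
filtered = Counter({w: c for w, c in counts.items() if w.isalpha() and len(w) > 1})

# most_common returns a list of (word, count) tuples, not a dict
top_words = filtered.most_common(2)
print(top_words)
```

Because `most_common` returns a list of tuples, later code has to read the word with `d[0]`, which is what `extract_features` does.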
            

## Step 2: Features Matrix

Next, we set up a custom function to generate a matrix of word counts and corresponding labels. Emails marked as spam in the filename will be given a label of 1. All others will be given a label of 0.

In [13]:
def extract_features(mail_dir, word_limit):
  # NOTE: this function reads the global variable 'dictionary', which must be created with make_Dictionary() before calling it
  files = [os.path.join(mail_dir,fi) for fi in os.listdir(mail_dir)] # joins each filename to the folder path, creating a list of all file paths within the folder
  features_matrix = np.zeros((len(files),word_limit)) # matrix of zeros with one row per file and one column per dictionary word
  train_labels = np.zeros(len(files)) # one-dimensional array of zeros, one label per file; the default label of 0 means "not spam"
  for docID, fil in enumerate(files): # loops through all files; docID is the row index of the current email
    with open(fil) as fi: # file is opened; open file is aliased as "fi"
      for i, line in enumerate(fi): # returns the index (i) and content (line) of each line in the open file
        if i == 2: # only the third line of each file is used for features
          words = line.split()
          for word in words:
            for wordID, d in enumerate(dictionary): # d is a (word, count) tuple; wordID is its column index
              if d[0] == word:
                features_matrix[docID,wordID] = words.count(word)
    filename = os.path.basename(fil) # isolates the filename on any operating system
    if filename.startswith("spmsg"): # a filename starting with "spmsg" indicates spam
      train_labels[docID] = 1 # assigns a label of 1 when the message is spam
  return features_matrix, train_labels
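
The shape of the output is easier to see on a tiny example. The sketch below is an illustration under assumptions, not the notebook's method: `toy_dictionary` and `toy_emails` are made-up data, and the word lookup uses a word-to-column dictionary instead of the nested loop above, which gives the same counts with one lookup per word.

```python
import numpy as np

# Hypothetical dictionary in the same (word, count) format that most_common returns
toy_dictionary = [("free", 5), ("money", 3), ("meeting", 2)]
word_to_col = {word: col for col, (word, _) in enumerate(toy_dictionary)}

# Hypothetical emails keyed by filename; "spmsg" prefix marks spam
toy_emails = {
    "spmsg_example.txt": "free money free",
    "msg_example.txt": "meeting about money",
}

toy_matrix = np.zeros((len(toy_emails), len(toy_dictionary)))
toy_labels = np.zeros(len(toy_emails))
for row, (name, body) in enumerate(toy_emails.items()):
    for word in body.split():
        col = word_to_col.get(word)  # None when the word is not in the dictionary
        if col is not None:
            toy_matrix[row, col] += 1
    if name.startswith("spmsg"):
        toy_labels[row] = 1

print(toy_matrix)
print(toy_labels)
```

Each row is one email and each column is one dictionary word, so the first row here holds the counts for "free", "money", and "meeting" in the spam example.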

## Step 3: Execute Training

In [15]:
# Enter the "path" of your "train-mails" and "test-mails" FOLDERS in this cell ...
# for example: TRAIN_DIR = '../../train-mails'
#              TEST_DIR = '../../test-mails'
# Make all paths relative

TRAIN_DIR = 'Data/train-mails'
TEST_DIR = 'Data/test-mails'
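
Since the paths are relative, they only resolve when the notebook is run from the expected working directory. The following is an optional sketch for checking that before training; `missing_dirs` is a hypothetical helper name, not part of the sample code.

```python
import os

def missing_dirs(*paths):
    """Return the paths that are not existing directories."""
    return [p for p in paths if not os.path.isdir(p)]

# Same relative paths as in the cell above
missing = missing_dirs("Data/train-mails", "Data/test-mails")
if missing:
    print("Missing data folders (relative to", os.getcwd() + "):", missing)
```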

In [16]:
dictionary = make_Dictionary(TRAIN_DIR, word_limit=word_limit) # passes the global variable word_limit into the custom function as the second argument

In [17]:
print("reading and processing emails from TRAIN and TEST folders")
features_matrix, labels = extract_features(TRAIN_DIR, word_limit) # creates the matrix and labels vector for the training dataset
test_features_matrix, test_labels = extract_features(TEST_DIR, word_limit) # creates the matrix and labels vector for the test dataset

reading and processing emails from TRAIN and TEST folders


In [18]:
# print(test_features_matrix) # Option to print the matrix

In [19]:
model = MultinomialNB() # Defines the type of Naive Bayes algorithm
model.fit(features_matrix, labels) # Fits the model to our training data
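
To make the fitting step less of a black box, here is a minimal NumPy sketch of the standard multinomial Naive Bayes formulation with additive (Laplace) smoothing. It is an illustration under assumptions, not scikit-learn's actual implementation: the function names and toy data are made up, and `alpha=1.0` is chosen to match the default smoothing of `MultinomialNB`.

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log word probabilities per class."""
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    # Total count of each word within each class, plus the smoothing term
    counts = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
    log_likelihood = np.log(counts / counts.sum(axis=1, keepdims=True))
    return classes, log_prior, log_likelihood

def predict_multinomial_nb(X, classes, log_prior, log_likelihood):
    """Pick the class with the highest log prior plus weighted log likelihood."""
    scores = X @ log_likelihood.T + log_prior
    return classes[np.argmax(scores, axis=1)]

# Hypothetical word-count rows: columns could stand for "free", "money", "meeting"
X_toy = np.array([[2, 1, 0], [3, 0, 0], [0, 1, 2], [0, 0, 3]], dtype=float)
y_toy = np.array([1, 1, 0, 0], dtype=float)  # 1 = spam, 0 = normal

params = fit_multinomial_nb(X_toy, y_toy)
toy_pred = predict_multinomial_nb(np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]), *params)
print(toy_pred)
```

The smoothing term keeps a word that never appeared in one class from driving that class's probability to zero.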

In [20]:
print(test_labels) # Check that the labels are accurate; 0 = normal, 1 = spam

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]


In [21]:
len(test_labels) # Check that the correct number of labels is generated; this should match the number of emails in your test dataset.

260

In [22]:
print(labels) # Check that the labels are accurate; 0 = normal, 1 = spam

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.

## Step 4: Generate Predictions

In [24]:
# Adapted from code at https://www.geeksforgeeks.org/multinomial-naive-bayes/

y_pred = model.predict(test_features_matrix) # generates predictions for the test dataset
accuracy = accuracy_score(test_labels, y_pred) # compares the predictions to the target feature (i.e., spam) to evaluate accuracy

print("Completed classification of the Test Data...now printing Accuracy Score by comparing the Predicted Labels with the Test Labels:",
      f"\n{accuracy * 100:.2f}%\n")

Completed classification of the Test Data...now printing Accuracy Score by comparing the Predicted Labels with the Test Labels: 
96.15%
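
With 260 test emails, an accuracy of 96.15% corresponds to 250 correct predictions and 10 errors. Accuracy alone does not show whether those errors are normal emails flagged as spam or spam that slipped through. The sketch below shows one way to count each kind with NumPy; the arrays are made-up toy values, not the notebook's predictions.

```python
import numpy as np

# Hypothetical labels and predictions: 0 = normal, 1 = spam
y_true_toy = np.array([0, 0, 1, 1, 1])
y_pred_toy = np.array([0, 1, 1, 1, 0])

# Accuracy is the share of predictions that match the true labels
toy_accuracy = np.mean(y_true_toy == y_pred_toy)

# Normal emails wrongly flagged as spam
false_positives = int(np.sum((y_pred_toy == 1) & (y_true_toy == 0)))
# Spam emails that were missed
false_negatives = int(np.sum((y_pred_toy == 0) & (y_true_toy == 1)))

print(f"accuracy={toy_accuracy:.2f}, false positives={false_positives}, false negatives={false_negatives}")
```

The same expressions could be applied to `test_labels` and `y_pred` from the cell above.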



======================= END OF PROGRAM =========================