CA02: This is an email spam classifier that uses the Naive Bayes supervised machine learning algorithm.

In this assignment you will:
1. Complete the code so that it works correctly with the given parts of the program.
2. Explain as clearly as possible what each part of the code is doing. Use Markdown text and code comments to explain the code.

IMPORTANT NOTE:

The paths of your data folders 'train-mails' and 'test-mails' must be './train-mails' and './test-mails'. This means your .ipynb file and these folders must be in the SAME FOLDER on your laptop or Google Drive. That way the peer reviewers and I can run your code on our own computers using the same relative paths, regardless of our folder hierarchies.
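
A quick way to confirm this layout before running the notebook is sketched below. The helper name `missing_data_dirs` is hypothetical and not part of the assignment code; it only checks that the two folders named above exist next to the current working directory.

```python
import os

def missing_data_dirs(base_dir=".", names=("train-mails", "test-mails")):
    """Return the expected data folders that are not present under base_dir."""
    return [n for n in names if not os.path.isdir(os.path.join(base_dir, n))]

# Check the folder the notebook is running from
missing = missing_data_dirs()
if missing:
    print("Missing data folders next to the notebook:", missing)
else:
    print("Both data folders found.")
```

If a folder is reported missing, the relative paths used later in the notebook will most likely fail when the files are listed.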

## Import necessary packages

In [1]:
import os
import numpy as np
from collections import Counter

# import GaussianNB from sklearn which implements the Naive Bayes model
from sklearn.naive_bayes import GaussianNB

# import accuracy_score from sklearn to measure the accuracy of the model
from sklearn.metrics import accuracy_score


## Useful functions for processing data
`make_Dictionary` builds a vocabulary from a directory of text files: a list of the 3000 most common words, each paired with its frequency count.

`extract_features` returns a feature matrix and a label array for the emails in a directory.
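
To make the vocabulary step concrete, here is a minimal sketch on a few made-up lines instead of email files. The helper `top_words` and the sample text are assumptions for illustration; the filtering rule (purely alphabetic, longer than one character) mirrors the one used in this notebook.

```python
from collections import Counter

def top_words(lines, k):
    """Count whitespace-separated tokens, drop tokens that are not purely
    alphabetic or have length 1, and return the k most common (word, count) pairs."""
    counts = Counter(word for line in lines for word in line.split())
    kept = Counter({w: c for w, c in counts.items() if w.isalpha() and len(w) > 1})
    return kept.most_common(k)

# Made-up sample text standing in for email contents
toy_lines = ["free money free offer", "meeting at 10 am", "free meeting a b"]
print(top_words(toy_lines, 2))  # [('free', 3), ('meeting', 2)]
```

Note that the result is a list of `(word, count)` tuples rather than a `dict`, which is why later code reads the word with index `[0]`.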

In [2]:
def make_Dictionary(root_dir):
  """
    This function builds a vocabulary from the emails in the specified directory.
    It reads all the emails, splits them into whitespace-separated words, and counts the frequency of each word.
    Words containing non-alphabetic characters and single-character words are removed.
    Finally, it returns the 3000 most common words with their counts.
    
    Args:
    root_dir (str): The directory containing the email files.
    
    Returns:
    list: A list of (word, count) tuples for the 3000 most common words.
  """
  all_words = []
  # get the path of all emails
  emails = [os.path.join(root_dir,f) for f in os.listdir(root_dir)]
  # loop through each email and read it
  for mail in emails:
    with open(mail) as m:
      # loop through each line of the email
      for line in m:
        # split the line into words and add them to the list of all words
        words = line.split()
        all_words += words
  # create a dictionary of the words and their frequency counts
  dictionary = Counter(all_words)
  # take a snapshot of the keys so entries can be deleted while looping
  list_to_remove = list(dictionary)

  for item in list_to_remove:
    # drop words that contain non-alphabetic characters and single-character words
    if not item.isalpha():
      del dictionary[item]
    elif len(item) == 1:
      del dictionary[item]
  # keep only the 3000 most common words, as a list of (word, count) tuples
  dictionary = dictionary.most_common(3000)
  return dictionary
            

In [3]:
def extract_features(mail_dir):
  """
    This function extracts feature matrix and labels from emails in the given directory.
    It reads all the files, extracts features based on the frequency of each word in the dictionary,
    and assigns a label (0 for ham, 1 for spam).
    
    Args:
    mail_dir (str): The directory containing the email files.
    
    Returns:
    tuple: A tuple containing the features matrix and the labels array.
  """
  # List all the files in the directory
  files = [os.path.join(mail_dir, fi) for fi in os.listdir(mail_dir)]
  # Initialize a matrix of zeros: one row per file, one column per dictionary word
  features_matrix = np.zeros((len(files), 3000))
  # Initialize an array of zeros to hold the label of each email (0 = ham, 1 = spam)
  train_labels = np.zeros(len(files))
  # Map each dictionary word to its column index (uses the global `dictionary`)
  word_index = {entry[0]: idx for idx, entry in enumerate(dictionary)}
  # Loop through each file; docID is the row of the current email
  for docID, fil in enumerate(files):
    with open(fil) as fi:
      # Read each line in the email file
      for i, line in enumerate(fi):
        # The body of the email is expected to be on the 3rd line (index 2)
        if i == 2:
          # Split the line into words
          words = line.split()
          # Record how often each dictionary word occurs in the body
          for word in set(words):
            if word in word_index:
              features_matrix[docID, word_index[word]] = words.count(word)
    # Label the email as spam (1) if its file name starts with 'spmsg'; otherwise it stays ham (0)
    if os.path.basename(fil).startswith("spmsg"):
      train_labels[docID] = 1
  return features_matrix, train_labels
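
The two ideas inside `extract_features` (turning an email body into a row of word counts, and deriving the label from the file name) can be sketched in isolation as follows. The helper names, the three-word vocabulary, and the file name are hypothetical; the notebook itself stores the rows in a NumPy matrix with 3000 columns.

```python
import os

def count_vector(body, vocabulary):
    """Return one feature row: how often each vocabulary word occurs in body."""
    index = {word: col for col, word in enumerate(vocabulary)}
    row = [0] * len(vocabulary)
    for word in body.split():
        if word in index:
            row[index[word]] += 1
    return row

def label_from_filename(path):
    """Return 1 (spam) if the file name starts with 'spmsg', else 0 (ham)."""
    return 1 if os.path.basename(path).startswith("spmsg") else 0

toy_vocabulary = ["free", "meeting", "offer"]  # hypothetical 3-word dictionary
print(count_vector("free offer free today", toy_vocabulary))  # [2, 0, 1]
print(label_from_filename("./train-mails/spmsga1.txt"))       # 1 (hypothetical file name)
```

Words that are not in the vocabulary, such as "today" above, are simply ignored, so every email maps to a row of the same length.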

## Set training and test directories

In [4]:
TRAIN_DIR = './train-mails'
TEST_DIR = './test-mails'


## Create dictionary from training directory and transform training and test data to feature matrices and labels

In [5]:
dictionary = make_Dictionary(TRAIN_DIR)

print("reading and processing emails from TRAIN and TEST folders")
features_matrix, labels = extract_features(TRAIN_DIR)
test_features_matrix, test_labels = extract_features(TEST_DIR)

reading and processing emails from TRAIN and TEST folders


## Train the Naive Bayes model and measure its accuracy on the test set
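
As background for the cell below, here is a rough pure-Python sketch of what a Gaussian Naive Bayes classifier computes: per-class priors, per-feature means and variances, and a prediction by the highest log posterior. It is a simplified illustration, not scikit-learn's implementation. In particular, the small constant `eps` is an assumed stand-in for scikit-learn's `var_smoothing`, which scales with the largest feature variance, and the toy data are made up.

```python
import math

def fit_gaussian_nb(rows, labels, eps=1e-9):
    """Estimate per-class priors, feature means, and feature variances."""
    model = {}
    for cls in sorted(set(labels)):
        members = [r for r, l in zip(rows, labels) if l == cls]
        n_features = len(members[0])
        means = [sum(r[j] for r in members) / len(members) for j in range(n_features)]
        # eps keeps the variance strictly positive (simplified smoothing)
        variances = [
            sum((r[j] - means[j]) ** 2 for r in members) / len(members) + eps
            for j in range(n_features)
        ]
        model[cls] = (len(members) / len(rows), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log prior plus Gaussian log-likelihood."""
    best_cls, best_score = None, -math.inf
    for cls, (prior, means, variances) in model.items():
        score = math.log(prior)
        for x, mu, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls

# Made-up word-count rows: column 0 = count of "free", column 1 = count of "meeting"
toy_rows = [[3, 0], [4, 1], [0, 2], [1, 3]]
toy_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham
toy_model = fit_gaussian_nb(toy_rows, toy_labels)
print(predict_gaussian_nb(toy_model, [5, 0]))  # 1
print(predict_gaussian_nb(toy_model, [0, 4]))  # 0
```

The "naive" part is the inner loop: each feature contributes its own log-likelihood term, as if the word counts were independent given the class.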

In [6]:
# Training the Naive Bayes model
model = GaussianNB()
print('Training Model using Gaussian Naive Bayes algorithm .....')
model.fit(features_matrix, labels)
print('Training completed')

# Predicting the labels for the test data
print('testing trained model to predict Test Data labels')
predicted_labels = model.predict(test_features_matrix)

# Calculating the accuracy of the model
print('Completed classification of the Test Data .... now printing Accuracy Score by comparing the Predicted Labels with the Test Labels:')
accuracy = accuracy_score(test_labels, predicted_labels)
print(f'Accuracy: {accuracy}')


Training Model using Gaussian Naive Bayes algorithm .....
Training completed
testing trained model to predict Test Data labels
Completed classification of the Test Data .... now printing Accuracy Score by comparing the Predicted Labels with the Test Labels:
Accuracy: 0.9653846153846154
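
The accuracy reported above is the fraction of test emails whose predicted label equals the true label, which is what `accuracy_score` returns with its default arguments. The sketch below reproduces that calculation by hand on made-up labels and also counts the four outcome types, since accuracy alone does not show whether the errors are missed spam or ham flagged as spam. The helper names and toy labels are assumptions for illustration.

```python
def accuracy(true_labels, predicted):
    """Fraction of positions where the predicted label equals the true label."""
    matches = sum(1 for t, p in zip(true_labels, predicted) if t == p)
    return matches / len(true_labels)

def confusion_counts(true_labels, predicted, positive=1):
    """Count true/false positives and negatives, treating `positive` as spam."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for t, p in zip(true_labels, predicted):
        if p == positive:
            counts["tp" if t == positive else "fp"] += 1
        else:
            counts["fn" if t == positive else "tn"] += 1
    return counts

toy_true = [1, 1, 0, 0]  # made-up true labels (1 = spam, 0 = ham)
toy_pred = [1, 0, 0, 0]  # made-up predictions
print(accuracy(toy_true, toy_pred))          # 0.75
print(confusion_counts(toy_true, toy_pred))  # {'tp': 1, 'fp': 0, 'fn': 1, 'tn': 2}
```

In this toy case the single error is a spam email classified as ham (a false negative).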


======================= END OF PROGRAM =========================