
CA02: Spam eMail Detection using Naive Bayes

Goal: Train a Naive Bayes classifier to predict whether an email is Spam (1) or Not Spam (0).

Data:
- Training folder: "./train-mails"
- Test folder: "./test-mails"
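
Both folders are expected to sit next to the notebook. As a minimal sketch of a pre-flight check, the helper below reports which folders are absent before any file is read; the name `missing_dirs` is hypothetical and not part of the original notebook.

```python
import os

def missing_dirs(*paths):
    """Return the given paths that are not existing directories (hypothetical helper)."""
    return [p for p in paths if not os.path.isdir(p)]

# Folder names taken from the Data section above
missing = missing_dirs("./train-mails", "./test-mails")
if missing:
    print("Missing data folders:", missing)
```

Running a check like this first turns a mid-run failure into an early, readable message.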

In [None]:
#Import required libraries
import os                        #To list files in folders
import numpy as np               #To store feature matrices
from collections import Counter  #To count word frequencies

#Machine Learning (Naive Bayes) & evaluation
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

Step 1: Build the dictionary

We read all training emails and count token frequency.

Then, we remove:

- Tokens that are not alphabetic (numbers or punctuation)
- Single-character tokens

Lastly, we keep the 3000 most common words, which become our feature list.
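
The filtering rules above can be illustrated on a toy token list. This is only a sketch: the tokens are made up, and it keeps the 2 most common words instead of 3000 so the result is easy to inspect.

```python
from collections import Counter

# Made-up token stream standing in for the words read from the training emails
tokens = "free money now ! free offer 100 a free money".split()

counts = Counter(tokens)

# Drop non-alphabetic and single-character tokens
for token in list(counts):
    if not token.isalpha() or len(token) == 1:
        del counts[token]

# Keep the 2 most common words here (the notebook keeps 3000)
top_words = counts.most_common(2)
print(top_words)  # [('free', 3), ('money', 2)]
```

Note that `most_common` returns a list of `(word, count)` tuples, not a dict, which is why later code reads the word from position 0 of each entry.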

In [None]:
#Function (make_Dictionary) builds a dictionary of the 3000 most common words in the training emails

def make_Dictionary(root_dir):
    all_words = []

    #List all files in the training directory
    emails = [os.path.join(root_dir, f) for f in sorted(os.listdir(root_dir))]

    #Read each email and collect words
    for mail in emails:
        with open(mail) as m:
            for line in m:
                words = line.split()
                all_words += words

    #Count word frequencies
    dictionary = Counter(all_words)

    #Remove non-alphabetic & single-letter words
    list_to_remove = list(dictionary)
    for item in list_to_remove:
        if not item.isalpha() or len(item) == 1:
            del dictionary[item]

    #Keep the 3000 most common words
    dictionary = dictionary.most_common(3000)

    return dictionary

Step 2: Extract features & labels

Each email becomes a vector of length 3000 (a row in a matrix).

- Column j represents the j-th dictionary word.
- The value stored in that column is how many times that dictionary word appears in the email.
  
Label rule:
- If the filename starts with "spmsg", the label = 1 (Spam)
- Otherwise, the label = 0 (Not Spam)
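
The vector layout and the label rule can be sketched on toy data. The 3-word vocabulary, the email text, the file names, and the helper names `vectorize` and `label_from_filename` are all hypothetical; the notebook itself uses a 3000-word dictionary.

```python
import numpy as np

# Hypothetical 3-word dictionary; column j corresponds to vocab[j]
vocab = ["free", "money", "meeting"]
word_index = {word: j for j, word in enumerate(vocab)}

def vectorize(text):
    """Count how often each dictionary word occurs in the text."""
    vector = np.zeros(len(vocab))
    for word in text.split():
        j = word_index.get(word)
        if j is not None:  # words outside the dictionary are ignored
            vector[j] += 1
    return vector

def label_from_filename(filename):
    """Apply the label rule: 1 if the name starts with 'spmsg', else 0."""
    return 1 if filename.startswith("spmsg") else 0

print(vectorize("free money free today"))  # [2. 1. 0.]
print(label_from_filename("spmsgA1.txt"), label_from_filename("msg42.txt"))  # 1 0
```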

In [None]:
#Function (extract_features) converts each email into a numeric feature vector & assigns spam/ham labels

def extract_features(mail_dir):
    #List files inside the folder
    files = [os.path.join(mail_dir, fi) for fi in os.listdir(mail_dir)]

    #Matrix: rows = emails & columns = 3000 dictionary words
    features_matrix = np.zeros((len(files), 3000))
    labels = np.zeros(len(files))
    #Map each dictionary word to its column index for fast lookup
    word_index = {d[0]: j for j, d in enumerate(dictionary)}

    #Going through each email one by one
    for docID, fil in enumerate(files):
        with open(fil) as fi:
            for i, line in enumerate(fi):
                if i >= 2:  #Skip the first 2 lines and start from line 3
                    for word in line.split():  #Going through each word in the line
                        wordID = word_index.get(word)
                        if wordID is not None:
                            #Add to the count so words repeated across lines accumulate
                            features_matrix[docID, wordID] += 1

        filename = os.path.basename(fil) #getting the file name only (without the path)

        if filename.startswith("spmsg"):  #If the file name starts with "spmsg", it is spam (1), otherwise not spam (0)
            labels[docID] = 1
        else:
            labels[docID] = 0

    return features_matrix, labels

Step 3: Train and evaluate (Naive Bayes)

1. Build dictionary from training emails
2. Convert training emails into feature matrix + labels  
3. Convert test emails into feature matrix + labels  
4. Train Naive Bayes model  
5. Predict and print accuracy
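
Steps 4 and 5 can be tried in isolation on a tiny synthetic count matrix. The numbers below are invented for illustration and are not results from the email dataset.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Synthetic word-count rows; imagine the columns are ["free", "money", "meeting"]
X_train = np.array([[3, 2, 0], [2, 3, 0], [0, 0, 2], [0, 1, 3]])
y_train = np.array([1, 1, 0, 0])  # 1 = Spam, 0 = Not Spam
X_test = np.array([[4, 1, 0], [0, 0, 3]])
y_test = np.array([1, 0])

# Train the model, predict the test labels, then compare with the true labels
model = GaussianNB()
model.fit(X_train, y_train)
predicted = model.predict(X_test)
accuracy = accuracy_score(y_test, predicted)
print(predicted, accuracy)
```

The same three calls (`fit`, `predict`, `accuracy_score`) are what the notebook applies to the real feature matrices.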

In [None]:
#Set relative paths to training and testing folders
TRAIN_DIR = "./train-mails"
TEST_DIR  = "./test-mails"

In [None]:
#Build dictionary from TRAIN data
dictionary = make_Dictionary(TRAIN_DIR)
print("reading and processing emails from TRAIN and TEST folders")

#Extract features & labels for TRAIN and TEST
features_matrix, labels = extract_features(TRAIN_DIR)
test_features_matrix, test_labels = extract_features(TEST_DIR)

FileNotFoundError: [Errno 2] No such file or directory: './train-mails'

In [None]:
#Train the Naive Bayes model, predict labels on TEST data & evaluate performance using accuracy

print("Training Model using Gaussian Naive Bayes algorithm .....")

model = GaussianNB()

#Train the model
model.fit(features_matrix, labels)

print("Training completed")
print("testing trained model to predict Test Data labels")

#Predict test labels
predicted_labels = model.predict(test_features_matrix)

print("Completed classification of the Test Data .... now printing Accuracy Score by comparing the Predicted Labels with the Test Labels:")

#Calculate accuracy
accuracy = accuracy_score(test_labels, predicted_labels)
print(accuracy)

======================= END OF PROGRAM =========================

Weaknesses of current design + possible improvements

Weaknesses:
- The feature extraction step skips the first two lines of each email, so any text in those lines never reaches the model
- Text preprocessing is very minimal (no stopword removal, stemming or lemmatization)
- The model uses raw word counts, which can give too much weight to very common words
  
Improvements:
- Check whether the skipped header lines carry useful text (such as a subject) and, if so, include them when extracting features
- Add more text preprocessing (e.g., removing stopwords & normalizing words using stemming or lemmatization)
- Replace raw word counts with TF-IDF features to capture word importance
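
As a rough sketch of the TF-IDF idea, scikit-learn's `TfidfTransformer` can reweight an existing count matrix. The counts below are synthetic; in this notebook the input would be `features_matrix`, which is an assumption about how the improvement might be wired in, not something the original code does.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Synthetic counts: column 0 occurs in every email, columns 1 and 2 in one email each
counts = np.array([[1, 1, 0],
                   [1, 0, 1],
                   [1, 0, 0]])

# Default settings: smoothed IDF and L2-normalized rows
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(counts).toarray()
print(tfidf.round(3))
```

In the first row both words occur once, yet the rarer word (column 1) ends up with the larger weight. When applying this to real data, one would fit the transformer on the training matrix only and reuse it via `transformer.transform(...)` on the test matrix.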