# Phase A: The MLE Approach

Here, we use maximum likelihood estimation (MLE) to estimate the probability of each word appearing in a spam or ham message, using [the SMS Spam Collection dataset](https://archive.ics.uci.edu/dataset/228/sms+spam+collection).
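
As a sketch of what is being estimated, assuming the per-class word-frequency model that the code in this section implements (the symbols below are notation introduced here for illustration), the estimate for a word $w$ in class $c$ is its relative frequency among all tokens of that class:

$$
\hat{\theta}_{w \mid c} = \frac{\mathrm{count}(w, c)}{\sum_{w'} \mathrm{count}(w', c)}, \qquad c \in \{\text{spam}, \text{ham}\}
$$

Here $\mathrm{count}(w, c)$ is the number of times token $w$ occurs across all messages labeled $c$, so the estimates within each class sum to 1.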

## Limitations
Vocabulary Bias: Because tokenization uses `.split()`, every distinct whitespace-delimited string is treated as its own feature, so a word with punctuation attached (such as "win!") is counted separately from the bare word ("win").
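
A minimal sketch of the effect, comparing whitespace splitting with a hypothetical regex tokenizer (`tokenize_simple` and its pattern are illustrative assumptions, not part of the code in this section):

```python
import re

def tokenize_split(message):
    # Same tokenization as count_words_by_category: lowercase, split on whitespace.
    return message.lower().split()

def tokenize_simple(message):
    # Hypothetical alternative: keep only runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", message.lower())

message = "WIN a prize! Reply to win!"
split_tokens = tokenize_split(message)
simple_tokens = tokenize_simple(message)

print(split_tokens)   # "win" and "win!" appear as two different tokens
print(simple_tokens)  # both occurrences collapse to "win"
```

With whitespace splitting, the counts for "win" are spread over two vocabulary entries, which dilutes the estimate for each of them.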

Bag-of-Words Assumption: The model assumes that word order doesn't matter, only how often each word occurs.
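
To illustrate, here is a small sketch (the helper name `bag_of_words` is hypothetical) showing that two messages with the same words in a different order are indistinguishable to this model:

```python
from collections import Counter

def bag_of_words(message):
    # Lowercase, split on whitespace, and count each token.
    return Counter(message.lower().split())

a = bag_of_words("call now to win cash")
b = bag_of_words("cash to win now call")

# The two orderings produce identical word counts.
print(a == b)
```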

Zero-multiplier: Because we multiply per-word probabilities, a single word that never appears in the spam training messages has an estimated probability of 0 and drives the spam probability of the whole message to 0, even when the message contains many other words that strongly indicate spam.
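
The following sketch reproduces the problem on toy numbers. The probabilities in `theta_spam_toy` are made up for illustration, and `message_likelihood` is a hypothetical helper rather than code from this notebook:

```python
import math

# Toy per-word probabilities for the spam class (made-up numbers for illustration).
theta_spam_toy = {"free": 0.05, "win": 0.04, "cash": 0.03, "prize": 0.03}

def message_likelihood(message, theta):
    # Multiply per-word probabilities. A word missing from theta gets 0,
    # which is what the unsmoothed MLE assigns to unseen words.
    return math.prod(theta.get(word, 0.0) for word in message.lower().split())

seen = message_likelihood("free cash prize", theta_spam_toy)
unseen = message_likelihood("free cash prize win xyzzy", theta_spam_toy)

print(seen)    # positive: every word was seen in spam
print(unseen)  # zero: the single unseen word "xyzzy" wipes out the product
```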

```python
import csv
import os
from collections import defaultdict

import pandas as pd

# Notebooks do not define __file__, so resolve the data path from the current
# working directory, which is normally the notebook's own directory.
notebook_dir = os.getcwd()
data_path = os.path.join(notebook_dir, '..', 'data', 'sms_spam', 'SMSSpamCollection')

# Load the dataset (tab-separated, no header row).
# QUOTE_NONE keeps double quotes inside messages from being parsed as field quoting.
df = pd.read_csv(data_path, sep='\t', header=None, names=['label', 'message'],
                 quoting=csv.QUOTE_NONE)

# Separate spam and ham messages
spam_messages = df[df['label'] == 'spam']['message']
ham_messages = df[df['label'] == 'ham']['message']

# Count word occurrences per category
def count_words_by_category(messages):
    word_counts = defaultdict(int)
    for message in messages:
        # Lowercase and split on whitespace; punctuation stays attached to words
        words = message.lower().split()
        for word in words:
            word_counts[word] += 1
    return word_counts

spam_word_counts = count_words_by_category(spam_messages)
ham_word_counts = count_words_by_category(ham_messages)

# Total number of word tokens in each category
total_spam_words = sum(spam_word_counts.values())
total_ham_words = sum(ham_word_counts.values())

# MLE estimate of theta_word: relative frequency of each word within its category
theta_spam = {word: count / total_spam_words for word, count in spam_word_counts.items()}
theta_ham = {word: count / total_ham_words for word, count in ham_word_counts.items()}

print(theta_ham, theta_spam)
```
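
Printing both dictionaries in full yields one entry per distinct token, which is hard to read. One possible way to inspect the estimates is to look only at the most probable words. The sketch below uses toy data and a hypothetical helper `top_words`, neither of which is part of the notebook:

```python
def top_words(theta, k=3):
    # Sort (word, probability) pairs by descending probability and keep the first k.
    return sorted(theta.items(), key=lambda item: item[1], reverse=True)[:k]

# Toy estimates standing in for theta_spam (made-up numbers).
theta_toy = {"free": 0.05, "to": 0.08, "win": 0.04, "call": 0.06}

top = top_words(theta_toy, k=2)
print(top)
```

The same helper could be applied to `theta_spam` and `theta_ham` to compare the highest-probability words in each class.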


