# Phase A: The MLE Approach

Here, we use maximum likelihood estimation (MLE) to estimate the probability of each word appearing in a spam or ham message, using [the SMS Spam Collection dataset](https://archive.ics.uci.edu/dataset/228/sms+spam+collection).
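
For reference, the per-category estimate computed in the code further down can be written as follows. Here **count** is the number of times the word appears in the category and **total count** is the total number of word occurrences in that category. The subscript notation is an assumption chosen to mirror the smoothed formula in Phase B.

$$\hat{\theta}_{\text{word, MLE}} = \frac{\text{count}}{\text{total count}}$$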

## Limitations
Vocabulary Bias: Because tokenization uses `.split()`, every unique whitespace-delimited string is treated as its own feature, so a word with punctuation attached (such as "win!") is counted separately from the bare word ("win").
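
A small sketch of this effect on a made-up message (not taken from the dataset). The regex tokenizer is a hypothetical alternative shown only for contrast; it is not what this notebook uses.

```python
import re

# Made-up example message (an assumption for illustration)
message = "Win! Reply now to win"

# Whitespace tokenization, as used in this notebook
split_tokens = message.lower().split()
print(split_tokens)

# Hypothetical alternative: keep only runs of letters, digits and apostrophes
regex_tokens = re.findall(r"[a-z0-9']+", message.lower())
print(regex_tokens)

# "win!" and "win" are two distinct features under .split(),
# but collapse into one under the regex tokenizer
print(len(set(split_tokens)), len(set(regex_tokens)))
```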

Bag-of-Words Assumption: This model assumes that the order of words doesn't matter, only their frequency.
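
To make the assumption concrete, the sketch below counts words in two made-up messages that differ only in word order. Both messages are invented for illustration.

```python
from collections import Counter

# Two made-up messages with the same words in a different order
message_a = "call me now"
message_b = "now call me"

counts_a = Counter(message_a.lower().split())
counts_b = Counter(message_b.lower().split())

# The word counts are identical, so a bag-of-words model
# cannot distinguish the two messages
print(counts_a == counts_b)
```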

Zero-multiplier: If a message containing many spam words also contains a single word that never appears in the spam training messages, that word's estimated probability is 0. Because we multiply the per-word probabilities, the probability of the whole message under the spam model becomes 0, even though the presence of multiple spam words makes it very likely that the message is spam.
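
The sketch below reproduces this effect with toy numbers. The probabilities and the helper `message_likelihood` are hypothetical, invented for illustration rather than taken from the dataset or the notebook.

```python
import math

# Toy values for P(word | spam); invented for illustration
toy_theta_spam = {"free": 0.05, "prize": 0.04, "call": 0.03}

def message_likelihood(message, theta):
    """Multiply per-word probabilities; words missing from theta get probability 0."""
    return math.prod(theta.get(word, 0.0) for word in message.lower().split())

seen_only = message_likelihood("free prize call", toy_theta_spam)
with_unseen = message_likelihood("free prize call grandma", toy_theta_spam)

print(seen_only)    # small but positive
print(with_unseen)  # exactly 0.0 because "grandma" is not in toy_theta_spam
```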

In [3]:
import pandas as pd
from collections import defaultdict
import os

# Resolve the data path relative to the notebook location.
# __file__ is not defined inside a notebook, so use the current working directory.
notebook_dir = os.getcwd()
data_path = os.path.join(notebook_dir, '..', 'data', 'sms_spam', 'SMSSpamCollection')

# Load the training data
df = pd.read_csv(data_path, sep='\t', header=None, names=['label', 'message'])

# Separate spam and ham messages
spam_messages = df[df['label'] == 'spam']['message']
ham_messages = df[df['label'] == 'ham']['message']

# Count word occurrences per category
def count_words_by_category(messages):
    word_counts = defaultdict(int)
    for message in messages:
        words = message.lower().split()
        for word in words:
            word_counts[word] += 1
    return word_counts

spam_word_counts = count_words_by_category(spam_messages)
ham_word_counts = count_words_by_category(ham_messages)

# Calculate theta_word for each category
total_spam_words = sum(spam_word_counts.values())
total_ham_words = sum(ham_word_counts.values())

theta_spam = {word: count / total_spam_words for word, count in spam_word_counts.items()}
theta_ham = {word: count / total_ham_words for word, count in ham_word_counts.items()}

top_n = 5

top_ham = sorted(theta_ham.items(), key=lambda x: x[1], reverse=True)[:top_n]
top_spam = sorted(theta_spam.items(), key=lambda x: x[1], reverse=True)[:top_n]

print(f"Top {top_n} ham words:")
for word, prob in top_ham:
    print(f"{word}: {prob:.6f}")

print(f"\nTop {top_n} spam words:")
for word, prob in top_spam:
    print(f"{word}: {prob:.6f}")


Top 5 ham words:
i: 0.031587
you: 0.024172
to: 0.022477
the: 0.016293
a: 0.015323

Top 5 spam words:
to: 0.038350
a: 0.020994
call: 0.019147
your: 0.014724
you: 0.014108


# Phase B: Bayesian Inference (Laplace Smoothing)

## Problem with MLE Approach
The MLE approach has a critical flaw: the **zero-probability trap**. If a word never appears in spam messages, its estimated probability is exactly 0, so any message containing that word has zero probability of being spam (since we multiply probabilities).

## Solution: Bayesian Inference with Laplace Smoothing

Instead of using raw counts, we add a **pseudo-count** α to every word's count, representing our prior belief that any word in the vocabulary can appear in either category:

$$\hat{\theta}_{\text{word, Bayesian}} = \frac{\text{count} + \alpha}{\text{total count} + \alpha \cdot V}$$

Where:
- **count**: Number of times the word appears in the category
- **α**: Pseudo-count for each word (typically 1)
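- **V**: Vocabulary size (number of unique words across both categories)
- **total count**: Total number of word occurrences in the category
- **total count + α·V**: Normalization factor, which makes the smoothed probabilities sum to 1 over the vocabulary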

This ensures every word in the vocabulary has a small but non-zero probability in both categories, preventing the zero-probability trap.
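
As a worked example of the formula, the sketch below applies it to a toy category. The corpus, the vocabulary, and the variable names are assumptions made for illustration; they are not part of the notebook's data.

```python
from collections import Counter

# Toy word counts for one category; invented for illustration
toy_counts = Counter("free prize free call".split())

# Toy vocabulary; "hello" never appears in the toy category
toy_vocabulary = {"free", "prize", "call", "hello"}

toy_alpha = 1
toy_total = sum(toy_counts.values())   # 4 word occurrences
toy_V = len(toy_vocabulary)            # 4 unique words

# (count + alpha) / (total count + alpha * V) for every vocabulary word
toy_smoothed = {
    word: (toy_counts[word] + toy_alpha) / (toy_total + toy_alpha * toy_V)
    for word in toy_vocabulary
}

print(toy_smoothed["free"])         # (2 + 1) / (4 + 4) = 0.375
print(toy_smoothed["hello"])        # (0 + 1) / (4 + 4) = 0.125
print(sum(toy_smoothed.values()))   # probabilities sum to 1.0
```

The unseen word receives probability 1/8 instead of 0, and the estimates still form a valid distribution over the vocabulary.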


In [6]:
# Laplace smoothing parameter (pseudo-count)
alpha = 1  # Pseudo-count added to every word's count

# Get vocabulary (all unique words)
vocabulary = set(spam_word_counts.keys()) | set(ham_word_counts.keys())
vocab_size = len(vocabulary)

# Implement Bayesian with Laplace Smoothing
# Formula: theta_word_bayesian = (count + alpha) / (total_count + alpha * vocab_size)
theta_spam_bayes = {}
theta_ham_bayes = {}

for word in vocabulary:
    # Spam probabilities with smoothing
    spam_count = spam_word_counts.get(word, 0)
    theta_spam_bayes[word] = (spam_count + alpha) / (total_spam_words + alpha * vocab_size)
    
    # Ham probabilities with smoothing
    ham_count = ham_word_counts.get(word, 0)
    theta_ham_bayes[word] = (ham_count + alpha) / (total_ham_words + alpha * vocab_size)

print("Vocabulary size:", vocab_size)
print("\n--- Comparison (MLE vs Bayesian Smoothing) ---")
print("\nExample words that don't appear in one category (would be zero without smoothing):")

# Find words that appear in one category but not the other
unique_spam_words = set(spam_word_counts.keys()) - set(ham_word_counts.keys())
unique_ham_words = set(ham_word_counts.keys()) - set(spam_word_counts.keys())

sample_unique = list(unique_spam_words)[:3]
for word in sample_unique:
    mle_prob = theta_ham.get(word, 0)  # Would be 0 in MLE
    bayes_prob = theta_ham_bayes.get(word, 0)
    print(f"  Word '{word}' (spam-only): MLE P(word|ham)={mle_prob}, Bayesian P(word|ham)={bayes_prob:.6f}")

sample_unique = list(unique_ham_words)[:3]
for word in sample_unique:
    mle_prob = theta_spam.get(word, 0)  # Would be 0 in MLE
    bayes_prob = theta_spam_bayes.get(word, 0)
    print(f"  Word '{word}' (ham-only): MLE P(word|spam)={mle_prob}, Bayesian P(word|spam)={bayes_prob:.6f}")

print("\n✓ Notice: Laplace smoothing prevents zero-probability trap!")
print(f"✓ Smallest non-zero probability (Bayesian): {min(min(theta_spam_bayes.values()), min(theta_ham_bayes.values())):.10f}")


Vocabulary size: 13579

--- Comparison (MLE vs Bayesian Smoothing) ---

Example words that don't appear in one category (would be zero without smoothing):
  Word '87077' (spam-only): MLE P(word|ham)=0, Bayesian P(word|ham)=0.000012
  Word '80878.' (spam-only): MLE P(word|ham)=0, Bayesian P(word|ham)=0.000012
  Word '2yr' (spam-only): MLE P(word|ham)=0, Bayesian P(word|ham)=0.000012
  Word 'mo' (ham-only): MLE P(word|spam)=0, Bayesian P(word|spam)=0.000032
  Word 'ystrday.ice' (ham-only): MLE P(word|spam)=0, Bayesian P(word|spam)=0.000032
  Word 'loose' (ham-only): MLE P(word|spam)=0, Bayesian P(word|spam)=0.000032

✓ Notice: Laplace smoothing prevents zero-probability trap!
✓ Smallest non-zero probability (Bayesian): 0.0000121027
