# Phase A: The MLE Approach

Here, we use maximum likelihood estimation (MLE) to estimate the probability of a word appearing in a spam or ham message, using [the SMS Spam Collection dataset](https://archive.ics.uci.edu/dataset/228/sms+spam+collection).
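
For reference, the estimate that the code in this phase computes (`theta_spam` and `theta_ham`) is the relative frequency of a word within one category. The subscript "MLE" is a label chosen here to mirror the notation of the smoothed formula in Phase B:

$$\hat{\theta}_{word, MLE} = \frac{\text{count}}{\text{total count}}$$

Here, **count** is the number of times the word appears in the category and **total count** is the total number of word occurrences in that category.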

## Limitations
Vocabulary Bias: Because we tokenize with `.split()`, every unique whitespace-delimited string is treated as a separate feature, including words with punctuation attached (for example, "win!" and "win" are counted as two different words).

Bag-of-Words Assumption: This model assumes that the order of words doesn't matter, only their frequency.

Zero-multiplier: Because we multiply per-word probabilities, a single word that never appeared in the spam training messages drives the probability of the whole message to 0, even when the message contains many words that are strongly associated with spam.
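
The vocabulary-bias limitation can be illustrated with a small sketch. It compares the whitespace tokenization used in this notebook with a regex-based alternative. The helper names `split_tokens` and `regex_tokens`, the regex pattern, and the sample sentence are assumptions made for illustration; they are not part of the notebook's pipeline.

```python
import re

def split_tokens(message):
    """Tokenize the way the notebook does: lowercase, then split on whitespace."""
    return message.lower().split()

def regex_tokens(message):
    """Hypothetical alternative: keep only runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", message.lower())

sample = "WIN! Win a prize, win now"

print(split_tokens(sample))  # ['win!', 'win', 'a', 'prize,', 'win', 'now']
print(regex_tokens(sample))  # ['win', 'win', 'a', 'prize', 'win', 'now']
```

With `.split()`, "win!" and "win" land in separate vocabulary entries, so their counts are diluted. Stripping punctuation merges them, at the cost of discarding signals such as repeated exclamation marks that may themselves be informative for spam.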

In [2]:
import pandas as pd
from collections import defaultdict
import os

# Build the data path relative to the notebook's working directory
# (__file__ is not defined inside a notebook, so use the current directory)
notebook_dir = os.getcwd()
data_path = os.path.join(notebook_dir, '..', 'data', 'sms_spam', 'SMSSpamCollection')

# Load the training data
df = pd.read_csv(data_path, sep='\t', header=None, names=['label', 'message'])

# Separate spam and ham messages
spam_messages = df[df['label'] == 'spam']['message']
ham_messages = df[df['label'] == 'ham']['message']

# Count word occurrences per category
def count_words_by_category(messages):
    word_counts = defaultdict(int)
    for message in messages:
        words = message.lower().split()
        for word in words:
            word_counts[word] += 1
    return word_counts

spam_word_counts = count_words_by_category(spam_messages)
ham_word_counts = count_words_by_category(ham_messages)

# Calculate theta_word for each category
total_spam_words = sum(spam_word_counts.values())
total_ham_words = sum(ham_word_counts.values())

theta_spam = {word: count / total_spam_words for word, count in spam_word_counts.items()}
theta_ham = {word: count / total_ham_words for word, count in ham_word_counts.items()}

top_n = 5

top_ham = sorted(theta_ham.items(), key=lambda x: x[1], reverse=True)[:top_n]
top_spam = sorted(theta_spam.items(), key=lambda x: x[1], reverse=True)[:top_n]

print(f"Top {top_n} ham words:")
for word, prob in top_ham:
    print(f"{word}: {prob:.6f}")

print(f"\nTop {top_n} spam words:")
for word, prob in top_spam:
    print(f"{word}: {prob:.6f}")


Top 5 ham words:
i: 0.031587
you: 0.024172
to: 0.022477
the: 0.016293
a: 0.015323

Top 5 spam words:
to: 0.038350
a: 0.020994
call: 0.019147
your: 0.014724
you: 0.014108


# Phase B: Bayesian Inference (Laplace Smoothing)

## Problem with MLE Approach
The MLE approach has a critical flaw, the **zero-probability trap**. If a word never appears in spam messages, its estimated probability is exactly 0, so any message containing that word has zero probability of being spam (since we multiply probabilities).
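
A minimal sketch of the trap, using made-up numbers: `toy_theta_spam` and its values are hypothetical and are not taken from the dataset, and `mle_likelihood` is a helper introduced only for this illustration.

```python
import math

# Hypothetical per-word MLE estimates for the spam class (illustrative values only)
toy_theta_spam = {"free": 0.02, "prize": 0.01, "call": 0.02}

def mle_likelihood(words, theta):
    """Multiply per-word probabilities; a word missing from theta contributes 0."""
    return math.prod(theta.get(word, 0.0) for word in words)

seen_only = ["free", "prize", "call"]
with_unseen = ["free", "prize", "call", "thursday"]

print(mle_likelihood(seen_only, toy_theta_spam))    # small but positive
print(mle_likelihood(with_unseen, toy_theta_spam))  # 0.0
```

Adding one unseen word ("thursday") wipes out the evidence from the three spam-associated words.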

## Solution: Bayesian Inference with Laplace Smoothing

Instead of using raw counts, we add a **pseudo-count** α to the count of every word in the vocabulary, representing our prior belief:

$$\hat{\theta}_{word, Bayesian} = \frac{\text{count} + \alpha}{\text{total count} + \alpha \cdot V}$$

Where:
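- **count**: Number of times the word appears in the category
- **total count**: Total number of word occurrences in the category
- **α**: Pseudo-count added to each word (typically 1)
- **V**: Vocabulary size (number of unique words across both categories)
- **total count + α·V**: Normalization factor, so that the smoothed probabilities sum to 1 over the vocabulary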

This ensures every word has a small but non-zero probability, preventing the zero-probability trap.
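
The formula can be checked by hand on a toy example. The counts, the four-word vocabulary, and the names `toy_counts`, `toy_vocabulary`, `toy_alpha`, and `smoothed_theta` are all hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from collections import Counter

# Hypothetical word counts for one category
toy_counts = Counter({"free": 3, "call": 2, "now": 1})
# "lunch" is in the vocabulary but was never seen in this category
toy_vocabulary = ["free", "call", "now", "lunch"]
toy_alpha = 1

def smoothed_theta(counts, vocabulary, alpha):
    """Apply (count + alpha) / (total count + alpha * V) to every vocabulary word."""
    total = sum(counts.values())
    denominator = total + alpha * len(vocabulary)
    return {word: (counts.get(word, 0) + alpha) / denominator for word in vocabulary}

toy_theta = smoothed_theta(toy_counts, toy_vocabulary, toy_alpha)
for word, prob in toy_theta.items():
    print(f"{word}: {prob:.2f}")
# free: 0.40
# call: 0.30
# now: 0.20
# lunch: 0.10
```

With a total count of 6 and V = 4, the denominator is 10. The unseen word "lunch" receives probability 1/10 instead of 0, and the four probabilities still sum to 1.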


In [4]:
# Laplace smoothing parameter (pseudo-count added to every word)
alpha = 1

# Get vocabulary (all unique words)
vocabulary = set(spam_word_counts.keys()) | set(ham_word_counts.keys())
vocab_size = len(vocabulary)

# Implement Bayesian with Laplace Smoothing
# Formula: theta_word_bayesian = (count + alpha) / (total_count + alpha * vocab_size)
theta_spam_bayes = {}
theta_ham_bayes = {}

for word in vocabulary:
    # Spam probabilities with smoothing
    spam_count = spam_word_counts.get(word, 0)
    theta_spam_bayes[word] = (spam_count + alpha) / (total_spam_words + alpha * vocab_size)
    
    # Ham probabilities with smoothing
    ham_count = ham_word_counts.get(word, 0)
    theta_ham_bayes[word] = (ham_count + alpha) / (total_ham_words + alpha * vocab_size)

print("Vocabulary size:", vocab_size)
print("\n--- Comparison (MLE vs Bayesian Smoothing) ---")
print("\nExample words that don't appear in one category (would be zero without smoothing):")

# Find words that appear in one category but not the other
unique_spam_words = set(spam_word_counts.keys()) - set(ham_word_counts.keys())
unique_ham_words = set(ham_word_counts.keys()) - set(spam_word_counts.keys())

sample_unique = list(unique_spam_words)[:3]
for word in sample_unique:
    mle_prob = theta_ham.get(word, 0)  # Would be 0 in MLE
    bayes_prob = theta_ham_bayes.get(word, 0)
    print(f"  Word '{word}' (spam-only): MLE P(word|ham)={mle_prob}, Bayesian P(word|ham)={bayes_prob:.6f}")

sample_unique = list(unique_ham_words)[:3]
for word in sample_unique:
    mle_prob = theta_spam.get(word, 0)  # Would be 0 in MLE
    bayes_prob = theta_spam_bayes.get(word, 0)
    print(f"  Word '{word}' (ham-only): MLE P(word|spam)={mle_prob}, Bayesian P(word|spam)={bayes_prob:.6f}")

print("\n✓ Notice: Laplace smoothing prevents zero-probability trap!")
print(f"✓ Smallest non-zero probability (Bayesian): {min(min(theta_spam_bayes.values()), min(theta_ham_bayes.values())):.10f}")


Vocabulary size: 13579

--- Comparison (MLE vs Bayesian Smoothing) ---

Example words that don't appear in one category (would be zero without smoothing):
  Word '09099726429' (spam-only): MLE P(word|ham)=0, Bayesian P(word|ham)=0.000012
  Word '84128' (spam-only): MLE P(word|ham)=0, Bayesian P(word|ham)=0.000012
  Word 'posh' (spam-only): MLE P(word|ham)=0, Bayesian P(word|ham)=0.000012
  Word '3230' (ham-only): MLE P(word|spam)=0, Bayesian P(word|spam)=0.000032
  Word 'omg' (ham-only): MLE P(word|spam)=0, Bayesian P(word|spam)=0.000032
  Word 'jesus..' (ham-only): MLE P(word|spam)=0, Bayesian P(word|spam)=0.000032

✓ Notice: Laplace smoothing prevents zero-probability trap!
✓ Smallest non-zero probability (Bayesian): 0.0000121027


# Phase C: Log-Likelihood & Classification

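To classify a new message, we sum the log-probabilities of its words instead of multiplying the raw probabilities, which avoids floating-point underflow on long messages. For the spam class:

$$\log P(\text{spam} | \text{words}) = \log P(\text{spam}) + \sum_i \log P(\text{word}_i | \text{spam}) + C$$

The same expression with "ham" in place of "spam" gives the ham score. The constant $C$ is identical for both classes, so it can be ignored when comparing the two scores.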
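
The reason for working in log space can be seen in a short sketch. The message length (200 words) and the per-word probability (1e-4) are hypothetical values picked to trigger underflow; real messages in this dataset are usually much shorter.

```python
import math

# Hypothetical: a 200-word message where every word has probability 1e-4
word_probs = [1e-4] * 200

product = math.prod(word_probs)                  # direct multiplication
log_sum = sum(math.log(p) for p in word_probs)   # sum of logs

print(product)  # 0.0, because 1e-800 is below the smallest representable float
print(log_sum)  # about -1842.07
```

The product underflows to exactly 0.0 for any class, which would make the two classes indistinguishable, while the log sum stays a finite number that can be compared directly.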

In [5]:
import numpy as np

# Calculate prior probabilities
total_messages = len(df)
prior_spam = len(spam_messages) / total_messages
prior_ham = len(ham_messages) / total_messages

print(f"Prior P(spam) = {prior_spam:.4f}")
print(f"Prior P(ham) = {prior_ham:.4f}")

# Classification function using log-likelihood
def classify_message(message, theta_spam, theta_ham, prior_spam, prior_ham, vocab_size, alpha):
    """
    Classify a message as spam or ham using log-likelihood.
    
    Args:
        message: The text message to classify
        theta_spam: Dictionary of word probabilities for spam
        theta_ham: Dictionary of word probabilities for ham
        prior_spam: Prior probability of spam
        prior_ham: Prior probability of ham
        vocab_size: Size of vocabulary
        alpha: Laplace smoothing parameter
    
    Returns:
        tuple: (prediction, log_prob_spam, log_prob_ham)
    """
    words = message.lower().split()
    
    # Start with log priors
    log_prob_spam = np.log(prior_spam)
    log_prob_ham = np.log(prior_ham)
    
    # Add log-likelihood for each word
    for word in words:
        # Get probability for spam (use smoothed probability even for unknown words)
        if word in theta_spam:
            log_prob_spam += np.log(theta_spam[word])
        else:
            # Unknown word: use smoothing probability
            log_prob_spam += np.log(alpha / (total_spam_words + alpha * vocab_size))
        
        # Get probability for ham
        if word in theta_ham:
            log_prob_ham += np.log(theta_ham[word])
        else:
            # Unknown word: use smoothing probability
            log_prob_ham += np.log(alpha / (total_ham_words + alpha * vocab_size))
    
    # Classify based on which has higher log probability
    prediction = "spam" if log_prob_spam > log_prob_ham else "ham"
    
    return prediction, log_prob_spam, log_prob_ham

# Test the classifier with example messages
test_messages = [
    "Congratulations! You won a free prize. Call now!",
    "Hey, are we still meeting for lunch tomorrow?",
    "URGENT! Claim your lottery winnings now!!!",
    "Can you pick up milk on your way home?"
]

print("\n--- Classification Results ---")
for msg in test_messages:
    pred, log_spam, log_ham = classify_message(
        msg, theta_spam_bayes, theta_ham_bayes, 
        prior_spam, prior_ham, vocab_size, alpha
    )
    print(f"\nMessage: \"{msg}\"")
    print(f"  Prediction: {pred.upper()}")
    print(f"  Log P(spam|message) = {log_spam:.2f}")
    print(f"  Log P(ham|message) = {log_ham:.2f}")


Prior P(spam) = 0.1341
Prior P(ham) = 0.8659

--- Classification Results ---

Message: "Congratulations! You won a free prize. Call now!"
  Prediction: SPAM
  Log P(spam|message) = -49.92
  Log P(ham|message) = -65.27

Message: "Hey, are we still meeting for lunch tomorrow?"
  Prediction: HAM
  Log P(spam|message) = -68.21
  Log P(ham|message) = -57.04

Message: "URGENT! Claim your lottery winnings now!!!"
  Prediction: SPAM
  Log P(spam|message) = -50.11
  Log P(ham|message) = -62.05

Message: "Can you pick up milk on your way home?"
  Prediction: HAM
  Log P(spam|message) = -71.64
  Log P(ham|message) = -60.48
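
The values printed above are unnormalized log scores (log prior plus summed log-likelihoods), not true log posteriors, since the shared normalizing constant is dropped. As a sketch, they can be turned into an actual posterior probability by normalizing the two scores against each other. The helper `posterior_spam` is hypothetical and not part of the notebook; it subtracts the larger score before exponentiating to avoid underflow.

```python
import math

def posterior_spam(log_score_spam, log_score_ham):
    """Normalize two unnormalized log scores into P(spam | message)."""
    # Shift by the maximum so the largest exponent is exp(0) = 1
    shift = max(log_score_spam, log_score_ham)
    exp_spam = math.exp(log_score_spam - shift)
    exp_ham = math.exp(log_score_ham - shift)
    return exp_spam / (exp_spam + exp_ham)

# Scores copied from the first two classification results above
print(posterior_spam(-49.92, -65.27))  # very close to 1 (spam)
print(posterior_spam(-68.21, -57.04))  # very close to 0 (ham)
```

Because the scores differ by more than 10 log units in each case, the resulting posteriors are extreme. This overconfidence is typical of naive Bayes, since the independence assumption counts correlated words as separate pieces of evidence.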
