<style>
    .info-card {
        max-width: 650px;
        margin: 25px auto;
        padding: 25px 30px;
        border: 1px solid #e0e0e0;
        border-radius: 12px;
        box-shadow: 0 4px 12px rgba(0, 0, 0, 0.05);
        font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", Arial, sans-serif;
        background-color: #fdfdfd;
        color: #333;
    }
    .info-card .title {
        color: #1a237e; /* Dark Indigo */
        font-size: 24px;
        font-weight: 600;
        margin-top: 0;
        margin-bottom: 15px;
        text-align: center;
        border-bottom: 2px solid #e8eaf6; /* Light Indigo */
        padding-bottom: 10px;
    }
    .info-card .details-grid {
        display: grid;
        grid-template-columns: max-content 1fr;
        gap: 12px 20px;
        margin-top: 20px;
        font-size: 16px;
    }
    .info-card .label {
        font-weight: 600;
        color: #555;
        text-align: right;
    }
    .info-card .value {
        font-weight: 400;
        color: #222;
    }
</style>

<div class="info-card">
    <h2 class="title">Unit 4.1 Exercise</h2>
    <div class="details-grid">
        <div class="label">Names:</div>
        <div class="value">Ethan Jed V. Carbonell & Vincent L. Corpes Jr.</div>
        <div class="label">Date:</div>
        <div class="value">February 12, 2026</div>
        <div class="label">Year & Section:</div>
        <div class="value">BSCS 3A AI</div>
        <div></div>
    </div>
</div>

> # Manual Method for Developing a Naive Bayes Model

In [46]:
import re
from collections import defaultdict

# --- Data Definition ---
documents = [
    ("Free money now!!!", "SPAM"),
    ("Hi mom, how are you?", "HAM"),
    ("Lowest price for your meds", "SPAM"),
    ("Are we still on for dinner?", "HAM"),
    ("Win a free iPhone today", "SPAM"),
    ("Let's catch up tomorrow at the office", "HAM"),
    ("Meeting at 3 PM tomorrow", "HAM"),
    ("Get 50% off, limited time!", "SPAM"),
    ("Team meeting in the office", "HAM"),
    ("Click here for prizes!", "SPAM"),
    ("Can you send the report?", "HAM")
]

test_sentences = [
    "Limited offer, click here!",
    "Meeting at 2 PM with the manager."
]

# Pre-processing: lowercase the text and split it into word tokens (runs of \w characters)
def tokenize(text):
    return re.findall(r'\b\w+\b', text.lower())

> A. Generate a Bag of Words (for word frequency)

In [47]:
spam_freq = defaultdict(int)
ham_freq = defaultdict(int)
vocabulary = set()

n_spam = 0 # Total words in SPAM
n_ham = 0  # Total words in HAM
spam_docs_count = 0
ham_docs_count = 0

for doc, label in documents:
    tokens = tokenize(doc)
    if label == "SPAM":
        spam_docs_count += 1
        for token in tokens:
            spam_freq[token] += 1
            vocabulary.add(token)
            n_spam += 1
    else:
        ham_docs_count += 1
        for token in tokens:
            ham_freq[token] += 1
            vocabulary.add(token)
            n_ham += 1

v_size = len(vocabulary)

print("BAG OF WORDS")
print(f"Total words in SPAM (N_spam): {n_spam}")
print(f"Total words in HAM (N_ham): {n_ham}")
print(f"Vocabulary size (|V|): {v_size}\n")

BAG OF WORDS
Total words in SPAM (N_spam): 22
Total words in HAM (N_ham): 34
Vocabulary size (|V|): 45
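
The same per-class counts can also be gathered with `collections.Counter`. The sketch below is one possible alternative to the loop above, not the method used in this exercise; the names `docs` and `class_counts` are introduced here only for illustration.

```python
import re
from collections import Counter

# Same training data as in the exercise (text, label)
docs = [
    ("Free money now!!!", "SPAM"),
    ("Hi mom, how are you?", "HAM"),
    ("Lowest price for your meds", "SPAM"),
    ("Are we still on for dinner?", "HAM"),
    ("Win a free iPhone today", "SPAM"),
    ("Let's catch up tomorrow at the office", "HAM"),
    ("Meeting at 3 PM tomorrow", "HAM"),
    ("Get 50% off, limited time!", "SPAM"),
    ("Team meeting in the office", "HAM"),
    ("Click here for prizes!", "SPAM"),
    ("Can you send the report?", "HAM"),
]

# One Counter per class; update() adds the token counts of each document
class_counts = {"SPAM": Counter(), "HAM": Counter()}
for text, label in docs:
    class_counts[label].update(re.findall(r"\b\w+\b", text.lower()))

# Total tokens per class and the combined vocabulary
n_spam = sum(class_counts["SPAM"].values())
n_ham = sum(class_counts["HAM"].values())
vocab = set(class_counts["SPAM"]) | set(class_counts["HAM"])

print(n_spam, n_ham, len(vocab))
```

The three printed values should agree with N_spam, N_ham, and |V| reported above.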



> B. Calculate the priors for the classes HAM and SPAM

In [48]:
total_docs = len(documents)
p_spam = spam_docs_count / total_docs
p_ham = ham_docs_count / total_docs

print("PRIORS")
print(f"P(SPAM) = {spam_docs_count}/{total_docs} = {p_spam:.4f}")
print(f"P(HAM) = {ham_docs_count}/{total_docs} = {p_ham:.4f}\n")

PRIORS
P(SPAM) = 5/11 = 0.4545
P(HAM) = 6/11 = 0.5455



> C. Calculate the likelihood of the tokens in the vocabulary with respect to the class.

In [76]:

def get_likelihood(word, label):
    """Calculates P(word | label) using Laplace (add-1) smoothing."""
    if label == "SPAM":
        count = spam_freq.get(word, 0)
        # (Count + 1) / (N_spam + |V|)
        return (count + 1) / (n_spam + v_size)
    else:
        count = ham_freq.get(word, 0)
        # (Count + 1) / (N_ham + |V|)
        return (count + 1) / (n_ham + v_size)


print("LIKELIHOOD OF TOKENS")
for word in sorted(vocabulary):
    prob_spam = get_likelihood(word, "SPAM")
    prob_ham = get_likelihood(word, "HAM")
    print(f"P({word:<8} | SPAM) = {prob_spam:.4f}   |   P({word:<8} | HAM) = {prob_ham:.4f}")
print()

LIKELIHOOD OF TOKENS
P(3        | SPAM) = 0.0149   |   P(3        | HAM) = 0.0253
P(50       | SPAM) = 0.0299   |   P(50       | HAM) = 0.0127
P(a        | SPAM) = 0.0299   |   P(a        | HAM) = 0.0127
P(are      | SPAM) = 0.0149   |   P(are      | HAM) = 0.0380
P(at       | SPAM) = 0.0149   |   P(at       | HAM) = 0.0380
P(can      | SPAM) = 0.0149   |   P(can      | HAM) = 0.0253
P(catch    | SPAM) = 0.0149   |   P(catch    | HAM) = 0.0253
P(click    | SPAM) = 0.0299   |   P(click    | HAM) = 0.0127
P(dinner   | SPAM) = 0.0149   |   P(dinner   | HAM) = 0.0253
P(for      | SPAM) = 0.0448   |   P(for      | HAM) = 0.0253
P(free     | SPAM) = 0.0448   |   P(free     | HAM) = 0.0127
P(get      | SPAM) = 0.0299   |   P(get      | HAM) = 0.0127
P(here     | SPAM) = 0.0299   |   P(here     | HAM) = 0.0127
P(hi       | SPAM) = 0.0149   |   P(hi       | HAM) = 0.0253
P(how      | SPAM) = 0.0149   |   P(how      | HAM) = 0.0253
P(in       | SPAM) = 0.0149   |   P(in       | HAM) = 0.0253
... (output truncated)
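
As a hand check on the smoothed likelihoods, the short sketch below recomputes one row of the table with exact fractions. It assumes the totals printed in part A (N_spam = 22, N_ham = 34, |V| = 45) and the raw counts of the token "for" in the training data (twice in SPAM, once in HAM); the variable names are illustrative only.

```python
from fractions import Fraction

# Totals reported in the bag-of-words step
n_spam, n_ham, v_size = 22, 34, 45

# "for" occurs twice in SPAM documents and once in HAM documents
count_for_spam, count_for_ham = 2, 1

# Laplace (add-1) smoothing: (count + 1) / (N_class + |V|)
p_for_spam = Fraction(count_for_spam + 1, n_spam + v_size)  # 3/67
p_for_ham = Fraction(count_for_ham + 1, n_ham + v_size)     # 2/79

print(f"P(for | SPAM) = {p_for_spam} = {float(p_for_spam):.4f}")
print(f"P(for | HAM)  = {p_for_ham} = {float(p_for_ham):.4f}")
```

The decimal values should match the `for` row in the likelihood table (0.0448 and 0.0253).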

> D. Determine the class of the following test sentences:
> > I. Limited offer, click here!
>
> > II. Meeting at 2 PM with the manager.

In [63]:
print("TEST SENTENCES CLASSIFICATION")
for i, sentence in enumerate(test_sentences):
    print(f"\nTest Sentence {i+1}: '{sentence}'")
    
    tokens = tokenize(sentence)
    
    # Drop out-of-vocabulary tokens (words never seen in the training documents)
    valid_tokens = [t for t in tokens if t in vocabulary]
    print(f"Tokens in Vocabulary (V): {valid_tokens}")
    
    # Initialize scores with the prior probabilities
    score_spam = p_spam
    score_ham = p_ham
    
    # Multiply by the likelihood of each valid token
    for token in valid_tokens:
        score_spam *= get_likelihood(token, "SPAM")
        score_ham *= get_likelihood(token, "HAM")
        
    print(f"Score(SPAM): {score_spam / 1e-6:.6f} × 10^-6")
    print(f"Score(HAM):  {score_ham / 1e-6:.6f} × 10^-6")

    
    # Compare scores to determine the class
    if score_spam > score_ham:
        print("Predicted Class: SPAM")
    else:
        print("Predicted Class: HAM")

TEST SENTENCES CLASSIFICATION

Test Sentence 1: 'Limited offer, click here!'
Tokens in Vocabulary (V): ['limited', 'click', 'here']
Score(SPAM): 12.090462 × 10^-6
Score(HAM):  1.106311 × 10^-6
Predicted Class: SPAM

Test Sentence 2: 'Meeting at 2 PM with the manager.'
Tokens in Vocabulary (V): ['meeting', 'at', 'pm', 'the']
Score(SPAM): 0.022557 × 10^-6
Score(HAM):  1.008284 × 10^-6
Predicted Class: HAM
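
The scores above are products of many small probabilities, which can underflow to zero for longer messages. A common workaround is to add log-probabilities instead of multiplying. The sketch below is a minimal illustration of that idea, not part of the original exercise; `log_score` is a hypothetical helper, and the counts are hard-coded from the training data for the first test sentence.

```python
import math

def log_score(tokens, prior, freq, n_class, v_size):
    """Return log P(class) + sum of log P(token | class) with add-1 smoothing."""
    score = math.log(prior)
    for token in tokens:
        score += math.log((freq.get(token, 0) + 1) / (n_class + v_size))
    return score

# In-vocabulary tokens of test sentence 1
tokens = ["limited", "click", "here"]

# Raw training counts for these tokens (each appears once in SPAM, never in HAM)
spam_counts = {"limited": 1, "click": 1, "here": 1}
ham_counts = {}

log_spam = log_score(tokens, 5 / 11, spam_counts, n_class=22, v_size=45)
log_ham = log_score(tokens, 6 / 11, ham_counts, n_class=34, v_size=45)

# The larger log-score wins, exactly as the larger product would
predicted = "SPAM" if log_spam > log_ham else "HAM"
print(f"log Score(SPAM) = {log_spam:.4f}")
print(f"log Score(HAM)  = {log_ham:.4f}")
print(f"Predicted Class: {predicted}")
```

Because the logarithm is monotonic, the comparison gives the same class as the product form, and `math.exp(log_spam)` recovers the Score(SPAM) printed above.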


> # Using scikit-learn for Developing a Naive Bayes Model

In [51]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

> Defining Data

In [52]:
documents = [
    "Free money now!!!",
    "Hi mom, how are you?",
    "Lowest price for your meds",
    "Are we still on for dinner?",
    "Win a free iPhone today",
    "Let's catch up tomorrow at the office",
    "Meeting at 3 PM tomorrow",
    "Get 50% off, limited time!",
    "Team meeting in the office",
    "Click here for prizes!",
    "Can you send the report?"
]

labels = [
    "SPAM", "HAM", "SPAM", "HAM", "SPAM",
    "HAM", "HAM", "SPAM", "HAM", "SPAM", "HAM"
]

test_sentences = [
    "Limited offer, click here!",
    "Meeting at 2 PM with the manager."
]

> Vectorizing Text Data (Bag of Words)

In [64]:
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X_train = vectorizer.fit_transform(documents)

print("VOCABULARY LEARNED")
print(vectorizer.get_feature_names_out())
print()

VOCABULARY LEARNED
['3' '50' 'a' 'are' 'at' 'can' 'catch' 'click' 'dinner' 'for' 'free' 'get'
 'here' 'hi' 'how' 'in' 'iphone' 'let' 'limited' 'lowest' 'meds' 'meeting'
 'mom' 'money' 'now' 'off' 'office' 'on' 'pm' 'price' 'prizes' 'report'
 's' 'send' 'still' 'team' 'the' 'time' 'today' 'tomorrow' 'up' 'we' 'win'
 'you' 'your']



In [None]:

# Train the Multinomial Naïve Bayes classifier.
# alpha=1.0 (the default) applies Laplace (add-one) smoothing.
clf = MultinomialNB(alpha=1.0)
clf.fit(X_train, labels)

> Predicting the test sentences

In [67]:
X_test = vectorizer.transform(test_sentences)
predictions = clf.predict(X_test)
probabilities = clf.predict_proba(X_test)
classes = clf.classes_

> Results

In [69]:
print("SCIKIT-LEARN PREDICTION RESULTS")
for i, sentence in enumerate(test_sentences):
    print(f"\nTest Sentence {i+1}: '{sentence}'")
    print("Probability Distribution:")
    for j, c in enumerate(classes):
        print(f"P({c}): {probabilities[i][j]:.8f}")
    print(f"Predicted Class: {predictions[i]}")

SCIKIT-LEARN PREDICTION RESULTS

Test Sentence 1: 'Limited offer, click here!'
Probability Distribution:
P(HAM): 0.08383194
P(SPAM): 0.91616806
Predicted Class: SPAM

Test Sentence 2: 'Meeting at 2 PM with the manager.'
Probability Distribution:
P(HAM): 0.97811802
P(SPAM): 0.02188198
Predicted Class: HAM
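
The probabilities reported by `predict_proba` are the manual scores from part D divided by their sum, so the two approaches can be cross-checked. The sketch below rebuilds the manual scores from the smoothed likelihoods with exact fractions and normalizes them; `posterior` is a hypothetical helper introduced only for this check.

```python
from fractions import Fraction

def posterior(score_spam, score_ham):
    """Normalize two joint scores so they sum to 1."""
    total = score_spam + score_ham
    return float(score_spam / total), float(score_ham / total)

# Sentence 1 tokens: limited, click, here
# Each has count 1 in SPAM (2/67) and count 0 in HAM (1/79)
s1_spam = Fraction(5, 11) * Fraction(2, 67) ** 3
s1_ham = Fraction(6, 11) * Fraction(1, 79) ** 3

# Sentence 2 tokens: meeting, at, pm, the
# None appear in SPAM (1/67 each); HAM counts are 2, 2, 1, 3
s2_spam = Fraction(5, 11) * Fraction(1, 67) ** 4
s2_ham = (Fraction(6, 11) * Fraction(3, 79) * Fraction(3, 79)
          * Fraction(2, 79) * Fraction(4, 79))

p1_spam, p1_ham = posterior(s1_spam, s1_ham)
p2_spam, p2_ham = posterior(s2_spam, s2_ham)

print(f"Sentence 1: P(HAM) = {p1_ham:.8f}, P(SPAM) = {p1_spam:.8f}")
print(f"Sentence 2: P(HAM) = {p2_ham:.8f}, P(SPAM) = {p2_spam:.8f}")
```

The normalized values should agree with the scikit-learn probability distribution above, which indicates that the manual model and `MultinomialNB(alpha=1.0)` implement the same calculation on this data.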
