# Naive Bayes — Student Lab

Implement Multinomial Naive Bayes from scratch (log-space + Laplace smoothing).

In [None]:
import re
import numpy as np

def check(name: str, cond: bool):
    if not cond:
        raise AssertionError(f'Failed: {name}')
    print(f'OK: {name}')

rng = np.random.default_rng(0)

## Section 0 — Toy dataset
We use a tiny spam/ham dataset so you can verify logic easily.

In [None]:
docs = [
    ('spam', 'win money now'),
    ('spam', 'free money win'),
    ('spam', 'claim your free prize'),
    ('ham',  'meeting schedule tomorrow'),
    ('ham',  'let us schedule a meeting'),
    ('ham',  'project update tomorrow'),
]

labels = np.array([1 if y=='spam' else 0 for y,_ in docs])
texts = [t for _,t in docs]

## Section 1 — Tokenization + Vocabulary

### Task 1.1: Tokenize

**TODO:** implement the tokenizer.

**HINT:** lowercase the text, then split on non-letter characters.

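**FAANG gotcha:** Inconsistent punctuation or casing fragments the counts: without normalization, `Free`, `FREE`, and `free!` are counted as three different tokens.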
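To make the gotcha concrete, here is a minimal sketch on a made-up sentence (the string `raw` and both counters are illustrative names, not part of the lab). It compares a plain whitespace split with one possible normalization, lowercasing and keeping only runs of letters; other tokenization rules are equally valid.

```python
import re
from collections import Counter

# Made-up example text, not taken from the lab dataset.
raw = "Free money! FREE prize, free."

# Plain whitespace split: case and punctuation make every token distinct.
naive_counts = Counter(raw.split())

# One possible normalization: lowercase, then keep runs of letters only.
normalized_counts = Counter(re.findall(r"[a-z]+", raw.lower()))

print(naive_counts)       # five different tokens, each seen once
print(normalized_counts)  # 'free' is now counted three times
```

With the naive split the token `free` is never seen at all, because every occurrence carries different casing or trailing punctuation.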

In [None]:
def tokenize(text):
    # TODO
    ...

print(tokenize('Free MONEY!!!'))

### Task 1.2: Build vocabulary

**TODO:** build a `vocab` dict mapping each word to a column index, using all docs.

**Checkpoint:** What happens if a word appears only in test?

In [None]:
# TODO
vocab = ...
print(vocab)
check('vocab_nonempty', len(vocab) > 0)

## Section 2 — Vectorization (Count vectors)

### Task 2.1: Convert docs to count matrix X (n_docs, |V|)

**TODO:** implement `vectorize`.

**HINT:** looping over the tokens inside a doc is fine; avoid looping over the whole vocabulary for every doc.
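The hint can be sketched on a hypothetical three-word vocabulary (`toy_vocab`, `toy_docs`, and `X_toy` are illustrative names; the documents are assumed to be tokenized already). Each token costs one dictionary lookup, so the work scales with document length rather than vocabulary size. Skipping unknown tokens is an assumption made here, and only one of several ways to handle words missing from the vocabulary.

```python
import numpy as np

# Hypothetical vocabulary and pre-tokenized documents.
toy_vocab = {"free": 0, "money": 1, "meeting": 2}
toy_docs = [["free", "money", "free"], ["meeting", "lunch"]]

X_toy = np.zeros((len(toy_docs), len(toy_vocab)), dtype=np.int64)
for i, tokens in enumerate(toy_docs):
    for tok in tokens:
        j = toy_vocab.get(tok)  # None for out-of-vocabulary tokens
        if j is not None:       # assumption: unknown tokens are skipped
            X_toy[i, j] += 1

print(X_toy)
```

Here `lunch` is not in `toy_vocab`, so it contributes nothing to the second row.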


In [None]:
def vectorize(texts, vocab):
    # TODO
    ...

X = vectorize(texts, vocab)
print('X shape', X.shape)
check('shape', X.shape[0] == len(texts) and X.shape[1] == len(vocab))

## Section 3 — Train Multinomial Naive Bayes

### Task 3.1: Fit with Laplace smoothing

Compute:
- class priors P(y)
- word likelihoods P(word|y) with Laplace smoothing alpha

**HINT:**
- count words per class: sum the rows of X where y == c
- add alpha to each word count (and alpha * |V| to the class total) so the smoothed probabilities still sum to 1
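As a worked illustration of the smoothing step, the sketch below uses a hypothetical 3-document, 3-word count matrix (`X_toy`, `y_toy`, and the other names are illustrative). It assumes the usual Laplace estimate, (count + alpha) / (class total + alpha * |V|), and shows the estimates for a single class only.

```python
import numpy as np

# Hypothetical count matrix: 3 docs, 3 vocabulary words.
X_toy = np.array([[2, 1, 0],
                  [1, 0, 0],
                  [0, 0, 3]])
y_toy = np.array([1, 1, 0])
alpha = 1.0
V = X_toy.shape[1]

# Word counts for class 1: sum the rows where y == 1.
counts_c1 = X_toy[y_toy == 1].sum(axis=0)

# Without smoothing, the third word gets probability exactly 0 in class 1.
unsmoothed = counts_c1 / counts_c1.sum()

# With Laplace smoothing every word keeps a small non-zero probability.
smoothed = (counts_c1 + alpha) / (counts_c1.sum() + alpha * V)

print(unsmoothed)
print(smoothed)
```

A zero probability matters because its logarithm is negative infinity, which would rule out the class for any document containing that word.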

**Checkpoint:** Why is smoothing required?

In [None]:
def fit_nb(X, y, alpha=1.0):
    # TODO
    # return log_priors (2,), log_likelihoods (2, V)
    ...

log_priors, log_lik = fit_nb(X, labels, alpha=1.0)
check('priors_shape', log_priors.shape == (2,))
check('lik_shape', log_lik.shape[0] == 2 and log_lik.shape[1] == X.shape[1])

### Task 3.2: Predict in log-space

For each doc:
log P(y=c|x) = log P(y=c) + sum_j x_j * log P(word_j|c) + const, where const does not depend on c and can be ignored when comparing classes
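The scoring rule above can be written as a single matrix product. The sketch below uses made-up probabilities for 2 classes and 3 words (all `toy_` names and values are hypothetical) and assumes row `c` of the likelihood matrix belongs to class `c`.

```python
import numpy as np

# Hypothetical parameters: 2 classes, 3 vocabulary words.
toy_log_priors = np.log(np.array([0.5, 0.5]))
toy_log_lik = np.log(np.array([[0.2, 0.2, 0.6],    # class 0
                               [0.6, 0.3, 0.1]]))  # class 1
X_toy = np.array([[2, 1, 0],
                  [0, 0, 3]])

# scores[i, c] = log P(y=c) + sum_j X_toy[i, j] * log P(word_j | c)
scores = toy_log_priors + X_toy @ toy_log_lik.T

# Pick the class with the highest unnormalized log-posterior per doc.
pred_toy = scores.argmax(axis=1)
print(pred_toy)
```

Broadcasting adds the length-2 prior vector to every row of the (n_docs, 2) product.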

**Interview Angle:** Why use logs?
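One way to see the numerical side of this question is the sketch below. It multiplies 1000 made-up per-word probabilities of 1e-4 (an arbitrary illustrative value) and compares the result with the sum of their logs.

```python
import numpy as np

# 1000 hypothetical per-word probabilities, each 1e-4.
probs = np.full(1000, 1e-4)

# The true product is 1e-4000, far below the smallest positive float64.
product = np.prod(probs)

# Summing logs keeps the same information in a representable range.
log_sum = np.sum(np.log(probs))

print(product)
print(log_sum)
```

The direct product underflows to zero for every class, so classes can no longer be compared, while the log-space sum stays finite.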

In [None]:
def predict_nb(X, log_priors, log_lik):
    # TODO
    ...

pred = predict_nb(X, log_priors, log_lik)
print('pred', pred)
acc = float(np.mean(pred == labels))
print('train_acc', acc)
check('acc_reasonable', acc >= 0.8)

## Section 4 — Error Analysis

### Task 4.1: Most influential tokens per class
Show top tokens by log-likelihood ratio between classes.

**HINT:** sort `log_lik[1] - log_lik[0]` (spam is class 1, ham is class 0, assuming the rows of `log_lik` are indexed by class label).
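A minimal sketch of the ranking idea, using a hypothetical 4-word vocabulary and made-up likelihoods (all `toy_` names and values are illustrative). It assumes row 1 holds the spam likelihoods and row 0 the ham likelihoods.

```python
import numpy as np

# Hypothetical index -> word mapping and per-class word probabilities.
toy_inv_vocab = {0: "free", 1: "meeting", 2: "money", 3: "tomorrow"}
toy_log_lik = np.log(np.array([[0.1, 0.4, 0.1, 0.4],    # row 0: ham
                               [0.4, 0.1, 0.3, 0.2]]))  # row 1: spam

# Positive ratio: the word is more likely under spam than under ham.
llr = toy_log_lik[1] - toy_log_lik[0]

# argsort is ascending, so reverse it to get the most spam-like words first.
order = np.argsort(llr)[::-1]
top_spam = [toy_inv_vocab[int(i)] for i in order[:2]]
top_ham = [toy_inv_vocab[int(i)] for i in order[::-1][:2]]

print("top spam tokens:", top_spam)
print("top ham tokens:", top_ham)
```

Reading the same ordering from the other end gives the most ham-like tokens, so one sort serves both classes.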


In [None]:
# TODO
inv_vocab = {i:w for w,i in vocab.items()}
score = ...
top = ...
print('top spam tokens:', [inv_vocab[i] for i in top])

---
## Submission Checklist
- All TODOs completed
- Train accuracy shown
- Top tokens printed
- Checkpoint questions answered
