# Naive Bayes Example

Build a Naive Bayes classifier with smoothing using the training data below to predict the label of the test example.

| Set | Label | Document | # |
| --- | --- | --- | --- |
| Train | - | bad experience | 1 |
| Train | - | lacked ambiance | 2 |
| Train | - | food was not great | 3 |
| Train | + | amazing experience | 4 |
| Train | + | delicious food | 5 |
| Test  | ? | food was amazing, great ambiance | 6 |

In [None]:
import numpy as np
train_docs = ["bad experience", "lacked ambiance", "food was not great", "amazing experience", "delicious food"]
# Labels ∈ {0, 1} <=> {-,+}
ytr = np.array([0, 0, 0, 1, 1])
# Punctuation from the table is removed so that split(" ") yields clean tokens
test_doc = "food was amazing great ambiance"

### Construct Vocab

In [None]:
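# Map each distinct training word to a column index.
# Sorting the set makes the word-to-index assignment reproducible across runs.
featurizer = {word: idx for idx, word in enumerate(sorted(set(" ".join(train_docs).split(" "))))}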
featurizer

### Construct BOW count features (smoothing is applied during parameter estimation)

In [None]:
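# Training matrix: one row per document, one column per vocabulary word
Xtr = np.zeros(shape=(len(train_docs), len(featurizer)))
for i, doc in enumerate(train_docs):
    for word in doc.split(" "):
        j = featurizer[word]
        Xtr[i, j] += 1  # count occurrences of word j in document i

# Test vector: every test word must appear in the training vocabulary,
# otherwise the lookup below raises a KeyError
Xte = np.zeros(shape=(len(featurizer)))
for word in test_doc.split(" "):
    j = featurizer[word]
    Xte[j] += 1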
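
The test loop above works only because every test word also occurs in the training set. A minimal sketch of one common way to handle unseen words, assuming they are simply skipped, is shown below. The helper `featurize` and the vocabulary `toy_vocab` are hypothetical names introduced for illustration and are not part of the notebook.

```python
import numpy as np

def featurize(doc, vocab):
    """Return a bag-of-words count vector for `doc`, skipping out-of-vocabulary words."""
    x = np.zeros(len(vocab))
    for word in doc.split():
        j = vocab.get(word)  # None when the word was never seen in training
        if j is not None:
            x[j] += 1
    return x

# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"food": 0, "great": 1, "ambiance": 2}
x = featurize("great food great service", toy_vocab)
print(x)  # "service" is not in toy_vocab, so it contributes nothing
```

Skipping unseen words is a design choice: it leaves the class scores unchanged for those words, which is reasonable when the model has no evidence about them.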

In [None]:
M = len(train_docs)   # number of training documents
N = len(featurizer)   # vocabulary size
K = len(set(ytr))     # number of classes
M, N, K

### Parameter estimation for $P(y ; \boldsymbol{\mu})$, where $\boldsymbol{\mu} \in \mathbb{R}^{K}$

In [None]:
# Class prior: fraction of training documents that belong to each class
mu_hat = np.array([np.mean(ytr == k) for k in range(K)])
mu_hat
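
Written out, the estimate above is the fraction of training documents that carry each label. The notation $y^{(i)}$ for the label of training document $i$ is introduced here only for illustration. With three negative and two positive training documents:

$$
\hat{\mu}_k = \frac{1}{M} \sum_{i=1}^{M} \mathbb{1}\left[y^{(i)} = k\right],
\qquad
\hat{\boldsymbol{\mu}} = \left[\tfrac{3}{5}, \tfrac{2}{5}\right] = [0.6, 0.4]
$$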

### Parameter estimation for $P(\boldsymbol{x} | y; \boldsymbol{\phi})$, where $\boldsymbol{\phi} \in \mathbb{R}^{K \times N}$

In [None]:
# Total number of word occurrences in each class
word_count_by_class = {k: Xtr[ytr == k].sum() for k in range(K)}
word_count_by_class

In [None]:
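# Smoothing constant (any real number > 0)
alpha = 1.0
phi_hat = np.zeros(shape=(K, N))
for word, j in featurizer.items():
    for k in range(K):
        # Occurrences of word j across all documents of class k.
        # Boolean indexing avoids np.squeeze, which collapses to a 0-d array
        # (and breaks sum) when a class contains exactly one document.
        num_word_j_class_k = Xtr[ytr == k, j].sum()
        phi_hat[k, j] = (alpha + num_word_j_class_k) / (alpha * N + word_count_by_class[k])
# Sanity check: each row of phi_hat should sum to 1
np.sum(phi_hat, axis=1)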
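
The double loop above can also be written as a pair of matrix operations. The following is a minimal sketch on a hypothetical toy count matrix (`X`, `y`, `onehot`, and `counts` are illustrative names, not part of the notebook). It applies the same estimate, $\hat{\phi}_{k,j} = (\alpha + n_{k,j}) / (\alpha N + \sum_{j'} n_{k,j'})$, where $n_{k,j}$ is the number of times word $j$ occurs in documents of class $k$.

```python
import numpy as np

# Hypothetical toy data: 3 documents, 4 vocabulary words
X = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)
y = np.array([0, 0, 1])
K, N = 2, X.shape[1]
alpha = 1.0

# One-hot class membership, shape (K, M): row k selects the documents of class k
onehot = (y[None, :] == np.arange(K)[:, None]).astype(float)

# Per-class word counts n[k, j], shape (K, N)
counts = onehot @ X

# Additive smoothing: add alpha to every count, alpha * N to every class total
phi = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * N)
print(phi.sum(axis=1))  # each row is a probability distribution over the vocabulary
```

For a vocabulary of ten words the loop is just as fast; the vectorized form mainly pays off when `N` or `K` is large.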

### Inference

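Predict the label by taking the argmax over class labels of the log posterior. By Bayes' rule, the log posterior equals the log likelihood plus the log prior, up to an additive constant that does not depend on the class: $\log P(y | \boldsymbol{x}_{te}) = \boldsymbol{x}_{te} (\log \boldsymbol{\phi})^{T} + \log \boldsymbol{\mu} + \text{const}$.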

In [None]:
# Unnormalized log posterior score for each class (these are not probabilities)
p_y_given_Xte = Xte.dot(np.log(phi_hat).T) + np.log(mu_hat)
p_y_given_Xte

In [None]:
# Index of the highest-scoring class: 0 = negative (-), 1 = positive (+)
yte_hat = np.argmax(p_y_given_Xte)
yte_hat
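
As a hand check, the two scores can be worked out directly from the training counts, assuming `alpha = 1.0` and the vocabulary of $N = 10$ distinct training words. The symbol $s_k$ for the unnormalized log posterior of class $k$ is introduced here for illustration. The negative class has 8 word occurrences (denominator $10 + 8 = 18$) and contains four of the five test words once each. The positive class has 4 word occurrences (denominator $10 + 4 = 14$) and contains only *food* and *amazing*:

$$
\begin{aligned}
s_0 &= \log\tfrac{3}{5} + 4\log\tfrac{2}{18} + \log\tfrac{1}{18} \approx -12.19 \\
s_1 &= \log\tfrac{2}{5} + 2\log\tfrac{2}{14} + 3\log\tfrac{1}{14} \approx -12.73
\end{aligned}
$$

Since $s_0 > s_1$, the argmax selects class 0, the negative label.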

### Discussion

Our NB model misclassifies the test example: it predicts the negative class for a review that is clearly positive. There are a few things going on here:
1. $M$ is small (5), and there is class imbalance, as reflected in $\hat{\boldsymbol{\mu}} = [0.6, 0.4]$. Because the log posterior for $P(y | \boldsymbol{x})$ adds the log prior to the log likelihood, the prior can have a sizable effect when the likelihood terms are close. With `alpha = 1.0`, the prior contributes $\log(0.6 / 0.4) \approx 0.41$ of the roughly $0.54$ gap between the two class scores.
2. The contextual meaning of words is largely lost with BOW features. Here, the word *great* appears only in a negative training document ("food was not great"), yet it expresses positive sentiment in the test example.
3. The log posterior scores are very close to each other. With so few examples and so few words, adding the constant `alpha = 1` to every per-class word count pulls the estimated word probabilities toward uniform, and therefore toward each other. Try adjusting the `alpha` parameter to be big (e.g., 2) and small (e.g., 0.01).
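
The experiment suggested in the last point can be sketched as a small self-contained script. It rebuilds the same training data with a sorted vocabulary and recomputes the class scores for several smoothing constants. The helper names `bow` and `scores` and the dictionary `results` are hypothetical and introduced only for this illustration.

```python
import numpy as np

train_docs = ["bad experience", "lacked ambiance", "food was not great",
              "amazing experience", "delicious food"]
ytr = np.array([0, 0, 0, 1, 1])  # 0 = negative, 1 = positive
test_doc = "food was amazing great ambiance"

vocab = {w: j for j, w in enumerate(sorted(set(" ".join(train_docs).split())))}

def bow(doc):
    """Bag-of-words count vector over the training vocabulary."""
    x = np.zeros(len(vocab))
    for w in doc.split():
        x[vocab[w]] += 1
    return x

Xtr = np.stack([bow(d) for d in train_docs])
Xte = bow(test_doc)
K, N = 2, len(vocab)
log_mu = np.log(np.bincount(ytr, minlength=K) / len(ytr))
counts = np.stack([Xtr[ytr == k].sum(axis=0) for k in range(K)])  # shape (K, N)

def scores(alpha):
    """Unnormalized log posterior per class for smoothing constant `alpha`."""
    phi = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * N)
    return np.log(phi) @ Xte + log_mu

results = {}
for alpha in (0.01, 1.0, 2.0):
    s = scores(alpha)
    results[alpha] = (s, int(np.argmax(s)))
    print(f"alpha={alpha}: scores={np.round(s, 3)}, "
          f"gap={s[0] - s[1]:.3f}, pred={results[alpha][1]}")
```

Under these assumptions the prediction stays at class 0 for all three values. A small `alpha` widens the gap, because the positive class has never seen three of the five test words, while a large `alpha` shrinks the gap toward the prior difference $\log(0.6 / 0.4) \approx 0.41$. Smoothing alone therefore does not appear to fix the misclassification on this data.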