# Naive Bayes

A step-by-step video explanation of the algorithm is available here: https://youtu.be/O2L2Uv9pdDA

Bayesian classifiers, and the naive Bayes classifier in particular, are a family of probabilistic classification algorithms that are well suited to problems such as text classification.

When to use it:

* The target function $f$ takes values from a finite set $V=\{v_1,...,v_k\}$
* A moderate or large training data set is available
* The attributes $\langle a_1,...,a_n \rangle$ that describe instances are conditionally independent given the classification:

$$P(a_1,a_2,...,a_n|v_j)=\prod_i P(a_i|v_j)$$

The most probable value of $f(x)$ is:

\begin{align}
v_{MAP} &= \mbox{argmax}_{v_j \in V}P(v_j|a_1,a_2,...,a_n) \\
      &= \mbox{argmax}_{v_j \in V}\frac{P(a_1,a_2,...,a_n|v_j)P(v_j)}{P(a_1,a_2,...,a_n)}\\
      &= \mbox{argmax}_{v_j \in V} P(a_1,a_2,...,a_n|v_j)P(v_j)\\
      &= \mbox{argmax}_{v_j \in V} P(v_j)\prod_i P(a_i|v_j)
\end{align}

where MAP stands for [_maximum a posteriori probability_](https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation).
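
The last line of the derivation can be sketched in a few lines of Python. The probability tables below are made-up numbers for illustration only, not estimates from any dataset, and `naive_bayes_map` is a hypothetical helper name. The sketch sums logarithms instead of multiplying probabilities, which yields the same argmax and avoids numerical underflow when there are many attributes.

```python
import math

# Hypothetical priors P(v_j) and conditionals P(a_i = 1 | v_j) for two classes
# and three binary attributes; the numbers are illustrative only.
priors = {"ham": 0.4, "spam": 0.6}
cond = {
    "ham": [0.7, 0.2, 0.3],
    "spam": [0.1, 0.8, 0.5],
}

def naive_bayes_map(x, priors, cond):
    """Return the class maximising P(v_j) * prod_i P(a_i | v_j) for binary x."""
    scores = {}
    for v, prior in priors.items():
        score = math.log(prior)
        for a_i, p in zip(x, cond[v]):
            # P(a_i | v_j) is p when the attribute is present, 1 - p otherwise.
            score += math.log(p if a_i else 1.0 - p)
        scores[v] = score
    return max(scores, key=scores.get), scores

label, scores = naive_bayes_map([1, 0, 1], priors, cond)
print(label)
```

With these assumed tables the unnormalised scores are $0.4 \cdot 0.7 \cdot 0.8 \cdot 0.3 = 0.0672$ for "ham" and $0.6 \cdot 0.1 \cdot 0.2 \cdot 0.5 = 0.006$ for "spam", so "ham" is selected.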

As an example, let's consider a simplified dataset of only 12 messages, 8 of which are spam. For each message, we only record whether the words "study", "free" and "money" occur in it:

In [14]:
import pandas as pd
features = ['study', 'free', 'money']
target = 'is_spam'
messages = pd.DataFrame(
  [(1, 0, 0, 0),
  (0, 0, 1, 0),
  (1, 0, 0, 0),
  (1, 1, 0, 0)] +
  [(0, 1, 0, 1)] * 4 +
  [(0, 1, 1, 1)] * 4,
columns=features+[target])
messages

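| | study | free | money | is_spam |
|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 | 0 |
| 3 | 1 | 1 | 0 | 0 |
| 4 | 0 | 1 | 0 | 1 |
| 5 | 0 | 1 | 0 | 1 |
| 6 | 0 | 1 | 0 | 1 |
| 7 | 0 | 1 | 0 | 1 |
| 8 | 0 | 1 | 1 | 1 |
| 9 | 0 | 1 | 1 | 1 |
| 10 | 0 | 1 | 1 | 1 |
| 11 | 0 | 1 | 1 | 1 |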


Given this labelled dataset, a common requirement is to classify a new message, for which the label is unknown. For example, the message "money for psychology study", can be encoded as:

In [15]:
new_messages = pd.DataFrame(
  [(1, 0, 1)],
columns = features)
new_messages

| | study | free | money |
|---|---|---|---|
| 0 | 1 | 0 | 1 |


Using the [`BernoulliNB`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html) learner from `sklearn`, we can train a regular Naive Bayes classifier with:

In [16]:
from sklearn.naive_bayes import BernoulliNB
X = messages[features]
y = messages[target]
cl = BernoulliNB().fit(X, y)

and then predict the class of the new message with:

In [17]:
cl.predict(new_messages)

array([0])

The prediction is 0, so this message is not considered to be spam.

To see the probability of each class, rather than only the most probable class, we can call `predict_proba`:

In [18]:
cl.predict_proba(new_messages)

array([[0.93676815, 0.06323185]])

To see the classes corresponding to these probabilities, we can look at the `classes_` attribute:

In [19]:
cl.classes_

array([0, 1])

which means the first probability is for class '0', while the second probability is for class '1'.
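
These probabilities can be checked by hand. The sketch below assumes the default smoothing of `BernoulliNB` (`alpha=1`, i.e. Laplace smoothing), under which each conditional is estimated as $(\text{count} + \alpha)/(n + 2\alpha)$, where $n$ is the number of training messages in the class. The variable names are illustrative.

```python
# The 12 training rows as (study, free, money, is_spam), same as `messages`.
rows = (
    [(1, 0, 0, 0), (0, 0, 1, 0), (1, 0, 0, 0), (1, 1, 0, 0)]
    + [(0, 1, 0, 1)] * 4
    + [(0, 1, 1, 1)] * 4
)
new_message = (1, 0, 1)  # "money for psychology study"
alpha = 1.0  # assumed smoothing; this is BernoulliNB's default

joint = {}
for label in (0, 1):
    in_class = [r[:3] for r in rows if r[3] == label]
    n = len(in_class)
    score = n / len(rows)  # prior P(v_j) estimated from class frequencies
    for i, a_i in enumerate(new_message):
        count = sum(r[i] for r in in_class)
        p = (count + alpha) / (n + 2 * alpha)  # smoothed P(a_i = 1 | v_j)
        score *= p if a_i else 1.0 - p
    joint[label] = score

# Normalise the joint scores so that they sum to one.
total = sum(joint.values())
posterior = {label: s / total for label, s in joint.items()}
print(posterior)
```

The result agrees with `predict_proba`: roughly 0.937 for class 0 and 0.063 for class 1. Note that the absence of "free" contributes a factor $1 - P(\text{free}|v_j)$, so in the Bernoulli model a missing word counts as evidence too.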

Some of the most useful attributes provided by this learner are:

* `classes_`: class labels known to the classifier;
* `class_count_`: number of samples encountered for each class during fitting;
* `class_log_prior_`: natural logarithm of the probability of each class (smoothed);
* `feature_count_`: number of samples encountered for each (class, feature) during fitting;
* `feature_log_prob_`: empirical log probability of features given a class, $\log P(a_i|v_j)$.
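
As a quick illustration, these attributes can be inspected on the spam example. The sketch below rebuilds the 12-message dataset as plain NumPy arrays so that it runs on its own; `X_demo`, `y_demo` and `model` are names chosen for this example. Exponentiating the two log attributes brings them back to the probability scale.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Rebuild the 12-message dataset as plain arrays (columns: study, free, money).
X_demo = np.array([(1, 0, 0), (0, 0, 1), (1, 0, 0), (1, 1, 0)]
                  + [(0, 1, 0)] * 4 + [(0, 1, 1)] * 4)
y_demo = np.array([0] * 4 + [1] * 8)
model = BernoulliNB().fit(X_demo, y_demo)

print(model.class_count_)               # samples per class
print(model.feature_count_)             # per-class count of each word
print(np.exp(model.class_log_prior_))   # priors on the probability scale
print(np.exp(model.feature_log_prob_))  # smoothed P(a_i = 1 | v_j)
```

With the default smoothing, the last array holds $(\text{count}+1)/(n+2)$ for each word and class, for instance $4/6$ for "study" in class 0, where 3 of the 4 non-spam messages contain the word.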

---
**Give it a try!**

The datasets `X_art` and `y_art` below describe 6 news articles. `X_art` holds the frequency of words while `y_art` holds the topic of the article. `X_new_art` is meant to represent a new article, for which we don't know the topic. What is the probability that this article is about weather?

In [20]:
import pandas as pd
import numpy as np
rng = np.random.RandomState(1)
cols = [f'word_{i}' for i in range(100)]

X_art = pd.DataFrame(rng.randint(5, size=(6, 100)), columns=cols)
y_art = pd.Series(np.array(['politics', 'economy', 'weather', 'sports', 'sports', 'culture']))
X_new_art = pd.DataFrame(X_art[2:3])

# Your code here

Expected result: 0.99999998.

---

## Prior probabilities

By default, the prior probabilities of the two classes (spam and non-spam) are estimated from the dataset. In the results above, the prior probability of 'spam' is taken to be $8/12$, so approximately 0.67. If, however, we want to make the prediction more conservative and label fewer messages as spam, we can set the prior probability of spam directly to a lower value such as 0.1:

In [21]:
cl = BernoulliNB(class_prior=[0.9,0.1]).fit(X, y)
cl.predict_proba(new_messages)

array([[0.99626401, 0.00373599]])

As expected, the computed probability that the message is spam has decreased, from around 0.06 to 0.0037.
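
The effect of the prior can be followed by hand. Assuming the default Laplace smoothing of `BernoulliNB` (`alpha=1`), the likelihood of the new message works out to $4/27$ under non-spam and $1/200$ under spam; only the prior changes between the two runs. `posterior_spam` is a hypothetical helper written for this sketch.

```python
# Likelihoods P(a_1, a_2, a_3 | v_j) of the new message, derived from the
# training counts under the assumed default smoothing (alpha=1).
lik_ham = (4 / 6) * (1 - 2 / 6) * (2 / 6)      # = 4/27
lik_spam = (1 / 10) * (1 - 9 / 10) * (5 / 10)  # = 1/200

def posterior_spam(prior_spam):
    """Posterior P(spam | message) for a given prior P(spam)."""
    joint_spam = lik_spam * prior_spam
    joint_ham = lik_ham * (1 - prior_spam)
    return joint_spam / (joint_spam + joint_ham)

print(posterior_spam(8 / 12))  # prior estimated from the data
print(posterior_spam(0.1))     # manually specified prior
```

The two printed values are approximately 0.0632 and 0.0037, matching the outputs of `predict_proba` before and after setting `class_prior`.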