# Naive Bayes, Part 2

Derived from:

https://stackoverflow.com/questions/48177318/what-does-this-arg-max-notation-mean-in-the-scikit-learn-docs-for-naive-bayes

### Naive Bayes by Example 2
In part 1, you saw a simple example of a Naive Bayes classifier for fruit classification. That model had only one input feature, color. Now let's work through a slightly more complex example that has more than one input feature.

Consider a fictional training dataset that describes weather conditions, shown in the following table. Each row records the weather conditions and whether it rained ('Yes') or not ('No').

| No. | Outlook | Temperature | Humidity | Rain |
|-----|---------|-------------|----------|------|
| 1   | Cloudy  | Cool        | High     | Yes  |
| 2   | Sunny   | Mild        | High     | No   |
| 3   | Cloudy  | Hot         | Normal   | No   |
| 4   | Sunny   | Cool        | Normal   | No   |
| 5   | Sunny   | Hot         | Low      | No   |
| 6   | Cloudy  | Mild        | Normal   | Yes  |
| 7   | Cloudy  | Hot         | High     | Yes  |
| 8   | Cloudy  | Cool        | Low      | No   |
| 9   | Sunny   | Cool        | High     | Yes  |
| 10  | Sunny   | Hot         | High     | No   |
| 11  | Cloudy  | Hot         | Low      | No   |
| 12  | Cloudy  | Cool        | Normal   | Yes  |
| 13  | Sunny   | Mild        | Low      | No   |
| 14  | Cloudy  | Cool        | Low      | Yes  |
| 15  | Sunny   | Mild        | Low      | No   |

In this training data, we have three features which are outlook, temperature, and humidity. Outlook has two possible values: cloudy and sunny. Temperature has three possible values: cool, mild, and hot. Humidity has three possible values: low, normal, and high. Then, we have one label that has two possible values: yes and no.
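To make the table concrete, here is a minimal sketch of how the training data could be represented in Python. The names `ROWS` and `FEATURES` are illustrative choices, not part of the original example; the snippet only checks which values each column takes.

```python
# Training data copied from the table: (outlook, temperature, humidity, rain).
# ROWS and FEATURES are illustrative names chosen for this sketch.
ROWS = [
    ("Cloudy", "Cool", "High", "Yes"),
    ("Sunny", "Mild", "High", "No"),
    ("Cloudy", "Hot", "Normal", "No"),
    ("Sunny", "Cool", "Normal", "No"),
    ("Sunny", "Hot", "Low", "No"),
    ("Cloudy", "Mild", "Normal", "Yes"),
    ("Cloudy", "Hot", "High", "Yes"),
    ("Cloudy", "Cool", "Low", "No"),
    ("Sunny", "Cool", "High", "Yes"),
    ("Sunny", "Hot", "High", "No"),
    ("Cloudy", "Hot", "Low", "No"),
    ("Cloudy", "Cool", "Normal", "Yes"),
    ("Sunny", "Mild", "Low", "No"),
    ("Cloudy", "Cool", "Low", "Yes"),
    ("Sunny", "Mild", "Low", "No"),
]
FEATURES = ("Outlook", "Temperature", "Humidity")

# Collect the distinct values observed in each feature column and in the label.
distinct = {name: sorted({row[i] for row in ROWS}) for i, name in enumerate(FEATURES)}
labels = sorted({row[-1] for row in ROWS})

print(distinct)
print(labels)
```

The printed sets should match the possible values listed in the paragraph above.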

We denote the features as $X_{i}$, where $i \in \{1, 2, 3\}$ corresponds to outlook, temperature, and humidity, respectively. We denote the label as $y$. The Naive Bayes classifier assumes that the features are conditionally independent of one another given the label. In practice this assumption rarely holds exactly, but the classifier often works well anyway.

In part 1, you saw the Naive Bayes classifier in the following form:
$$P(y|X)\propto P(X|y)P(y)$$
where $X$ is the input feature and $y$ is the output label that you want to predict. If you have more than one feature, the Naive Bayes classifier formula becomes:
$$P(y|X_{1},\ldots,X_{n})\propto P(X_{1}|y)\cdots P(X_{n}|y)P(y)$$
where $X_{1}$ to $X_{n}$ are the input features. You calculate the likelihood by multiplying the conditional probabilities of all features $X_{i}$ given the label $y$, so the formula can be written compactly as:
$$P(y|X_{1},\ldots,X_{n})\propto P(y)\prod_{i=1}^{n}P(X_{i}|y)$$
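The product form translates almost directly into code. The following is a small sketch, assuming the prior and the per-feature conditional probabilities are already known; the function name `unnormalized_posterior` and the toy numbers are made up for illustration.

```python
from fractions import Fraction
from math import prod


def unnormalized_posterior(prior, likelihoods):
    """Return P(y) * prod_i P(X_i | y) for a single label y."""
    return prior * prod(likelihoods)


# Toy numbers (not from the weather example): P(y) = 1/2,
# P(X_1 | y) = 1/2 and P(X_2 | y) = 1/3.
score = unnormalized_posterior(Fraction(1, 2), [Fraction(1, 2), Fraction(1, 3)])
print(score)  # 1/12
```

Using `Fraction` keeps the arithmetic exact, which makes it easy to compare against hand calculations.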

Back to our example: suppose today is cloudy, the temperature is cool, and the humidity is normal. Can you predict whether it will rain today? Let's apply the same steps as in the example in part 1. Here, we want to calculate the following:

$$P(y=yes|X_{1}=cloudy,X_{2}=cool,X_{3}=normal) \propto P(y=yes)P(X_{1}=cloudy|y=yes)P(X_{2}=cool|y=yes)P(X_{3}=normal|y=yes)$$
$$P(y=no|X_{1}=cloudy,X_{2}=cool,X_{3}=normal) \propto P(y=no)P(X_{1}=cloudy|y=no)P(X_{2}=cool|y=no)P(X_{3}=normal|y=no)$$

First, create a frequency table for each feature of the training data as follows:

| Outlook | Yes   | No     | Total |
|---------|-------|--------|-------|
| Cloudy  | 5     | 3      | 8     |
| Sunny   | 1     | 6      | 7     |
| Total   | 6     | 9      | 15    |

| Temperature | Yes   | No     | Total |
|-------------|-------|--------|-------|
| Cool        | 4     | 2      | 6     |
| Mild        | 1     | 3      | 4     |
| Hot         | 1     | 4      | 5     |
| Total       | 6     | 9      | 15    |

| Humidity | Yes   | No     | Total |
|----------|-------|--------|-------|
| Low      | 1     | 5      | 6     |
| Normal   | 2     | 2      | 4     |
| High     | 3     | 2      | 5     |
| Total    | 6     | 9      | 15    |
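The frequency tables can also be built programmatically. This is one possible sketch using `collections.Counter`; the variable names (`ROWS`, `FEATURES`, `label_counts`, `joint_counts`) are illustrative and the rows are copied from the training table.

```python
from collections import Counter

# Training data copied from the table: (outlook, temperature, humidity, rain).
ROWS = [
    ("Cloudy", "Cool", "High", "Yes"),
    ("Sunny", "Mild", "High", "No"),
    ("Cloudy", "Hot", "Normal", "No"),
    ("Sunny", "Cool", "Normal", "No"),
    ("Sunny", "Hot", "Low", "No"),
    ("Cloudy", "Mild", "Normal", "Yes"),
    ("Cloudy", "Hot", "High", "Yes"),
    ("Cloudy", "Cool", "Low", "No"),
    ("Sunny", "Cool", "High", "Yes"),
    ("Sunny", "Hot", "High", "No"),
    ("Cloudy", "Hot", "Low", "No"),
    ("Cloudy", "Cool", "Normal", "Yes"),
    ("Sunny", "Mild", "Low", "No"),
    ("Cloudy", "Cool", "Low", "Yes"),
    ("Sunny", "Mild", "Low", "No"),
]
FEATURES = ("Outlook", "Temperature", "Humidity")

# Number of rows per label ('Yes' / 'No'): the "Total" row of each table.
label_counts = Counter(row[-1] for row in ROWS)

# joint_counts[(feature, value, label)] = number of rows with that combination.
joint_counts = Counter(
    (name, row[i], row[-1]) for row in ROWS for i, name in enumerate(FEATURES)
)

# Print the tables in the same layout as above: value, Yes, No, Total.
for i, name in enumerate(FEATURES):
    print(name)
    for value in sorted({row[i] for row in ROWS}):
        yes = joint_counts[(name, value, "Yes")]
        no = joint_counts[(name, value, "No")]
        print(f"  {value:<7} {yes} {no} {yes + no}")
```

Each printed row should agree with the corresponding row of the hand-built tables.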

#### **Step 1: compute the probabilities for each value of the label**
Out of 15 observations, you have 6 yes and 9 no. So the respective probabilities are:
$$P(y=yes)=\frac{6}{15}$$
$$P(y=no)=\frac{9}{15}$$
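As a quick check, the priors can be computed from the label totals. This is a minimal sketch; `label_counts` and `priors` are illustrative names, and the counts are taken from the "Total" row of the frequency tables.

```python
from fractions import Fraction

# Label totals from the frequency tables (illustrative variable names).
label_counts = {"Yes": 6, "No": 9}
total = sum(label_counts.values())

# P(y) = (number of rows with label y) / (number of rows)
priors = {label: Fraction(count, total) for label, count in label_counts.items()}

print(priors["Yes"], priors["No"])  # 2/5 3/5
```

`Fraction` reduces $\frac{6}{15}$ and $\frac{9}{15}$ to $\frac{2}{5}$ and $\frac{3}{5}$; the values are the same.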

#### **Step 2: compute the conditional probability**
Out of 6 yes, you have 5 cloudy. So the probability is:
$$P(X_{1}=cloudy|y=yes)=\frac{5}{6}$$
Out of 6 yes, you have 4 cool. So the probability is:
$$P(X_{2}=cool|y=yes)=\frac{4}{6}$$
Out of 6 yes, you have 2 normal. So the probability is:
$$P(X_{3}=normal|y=yes)=\frac{2}{6}$$
Out of 9 no, you have 3 cloudy. So the probability is:
$$P(X_{1}=cloudy|y=no)=\frac{3}{9}$$
Out of 9 no, you have 2 cool. So the probability is:
$$P(X_{2}=cool|y=no)=\frac{2}{9}$$
Out of 9 no, you have 2 normal. So the probability is:
$$P(X_{3}=normal|y=no)=\frac{2}{9}$$
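The six conditional probabilities above follow one pattern: a count from the frequency tables divided by the label total. A small sketch of that pattern, with illustrative names and only the counts needed for the query (cloudy, cool, normal):

```python
from fractions import Fraction

# Label totals and per-label counts copied from the frequency tables.
# Only the values needed for the query (cloudy, cool, normal) are included.
label_counts = {"Yes": 6, "No": 9}
counts = {
    "Yes": {"Cloudy": 5, "Cool": 4, "Normal": 2},
    "No": {"Cloudy": 3, "Cool": 2, "Normal": 2},
}

# P(X_i = value | y = label) = count(value, label) / count(label)
conditionals = {
    label: {value: Fraction(n, label_counts[label]) for value, n in by_value.items()}
    for label, by_value in counts.items()
}

for label, by_value in conditionals.items():
    for value, p in by_value.items():
        print(f"P({value} | {label}) = {p}")
```

The fractions are printed in reduced form, for example $\frac{4}{6}$ appears as `2/3`.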

#### **Step 3: substitute the probabilities into the Naive Bayes formula**
$$P(y=yes|X_{1}=cloudy,X_{2}=cool,X_{3}=normal) \propto \frac{6}{15}\frac{5}{6}\frac{4}{6}\frac{2}{6}=\frac{240}{3240}\approx 0.0741$$
$$P(y=no|X_{1}=cloudy,X_{2}=cool,X_{3}=normal) \propto \frac{9}{15}\frac{3}{9}\frac{2}{9}\frac{2}{9}=\frac{108}{10935}\approx 0.0099$$

Since $P(y=yes|X_{1}=cloudy,X_{2}=cool,X_{3}=normal) > P(y=no|X_{1}=cloudy,X_{2}=cool,X_{3}=normal)$, 'yes' has the higher probability than 'no', so 'yes' is our predicted output.
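The arithmetic in Step 3 can be verified with exact fractions. The sketch below also normalizes the two scores so that they sum to 1; that normalization is an extra step not performed in the text, shown only to illustrate what the $\propto$ sign leaves out. All variable names are illustrative.

```python
from fractions import Fraction
from math import prod

# Priors and conditional probabilities from Steps 1 and 2.
priors = {"Yes": Fraction(6, 15), "No": Fraction(9, 15)}
likelihoods = {
    "Yes": [Fraction(5, 6), Fraction(4, 6), Fraction(2, 6)],
    "No": [Fraction(3, 9), Fraction(2, 9), Fraction(2, 9)],
}

# Unnormalized scores: P(y) * prod_i P(X_i | y)
scores = {label: priors[label] * prod(likelihoods[label]) for label in priors}
print(scores["Yes"], scores["No"])  # 2/27 4/405

# Extra step (not in the text): divide by the sum of the scores
# to turn them into probabilities that add up to 1.
evidence = sum(scores.values())
posteriors = {label: score / evidence for label, score in scores.items()}
print(posteriors["Yes"], posteriors["No"])  # 15/17 2/17
```

Normalizing does not change which label wins, which is why the comparison can be made on the unnormalized scores directly.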

### The Naive Bayes Classifier Formula (Updated)
To be more mathematically precise, you can rewrite the Naive Bayes classifier formula in this form:
$$\hat{y}=\arg\max_{y}P(y)\prod_{i=1}^{n}P(X_{i}|y)$$
where $\hat{y}$ is the prediction. In statistics, the hat denotes an estimator or an estimated/predicted value. The $\arg\max$ of a function is the input value (the 'argument') at which the function attains its maximum. In other words, it is the label $y$ that has the maximum probability. In the above example, the label $y$ with the maximum probability is 'yes', so $\hat{y}=yes$.
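In Python, $\arg\max$ over a small set of labels can be sketched with the built-in `max` and a `key` function. The `scores` dictionary below holds the two unnormalized values from Step 3; the variable names are illustrative.

```python
# Unnormalized scores from Step 3 (illustrative variable names).
scores = {"Yes": 240 / 3240, "No": 108 / 10935}

# max(...) would return the largest score; max(..., key=...) returns
# the label whose score is largest, which is the arg max.
y_hat = max(scores, key=scores.get)

print(y_hat)  # Yes
```

Note the difference: `max(scores.values())` gives the maximum score itself, while `max(scores, key=scores.get)` gives the label that achieves it.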