# Supervised Machine Learning - Naive Bayes

<!-- TOC START min:2 max:4 link:true asterisk:true update:true -->
* [What you will learn in this class](#what-you-will-learn-in-this-class)
* [Conditional probability](#conditional-probability)
* [Bayes theorem](#bayes-theorem)
* [Naive Bayes](#naive-bayes)
  * [Case where $X_i$ is qualitative](#case-where--is-qualitative)
  * [Case where $X_i$ is quantitative](#case-where--is-quantitative)
* [General remarks](#general-remarks)
<!-- TOC END -->



## What you will learn in this class

This class covers the Naive Bayes model. The model relies on the explanatory variables being independent of each other given the target variable, a very strong hypothesis that is rarely verified in practice. Nevertheless, the model can be very useful and gives a clear view of how each variable influences the target variable within the model.

In statistics, naive Bayes classifiers are a family of probabilistic classifiers based on Bayes' theorem.

## Conditional probability
Let's start this lecture by introducing the concept of conditional probability. Consider the following example:

You flip two balanced coins, one after the other. It may not seem so, but this already involves several probabilistic concepts:
* Flipping a coin defines a *random variable*: a set of possible values, in our case *tail* and *head*, with a probability associated with each value (or a probability density when the value space is continuous, or more exactly uncountable). Here the coins are balanced, which means $P(tail) = P(head) = 0.5$.
* Flipping two coins one after the other introduces two additional concepts. The first is the *random experiment*: throwing two coins and examining the outcome of two random variables. The second is *independence*: the two coins are not related to each other in any way, so the outcome of one flip has no influence on the other flip, and vice versa.

Let us now define two *events*:
* Event $A$: the first coin is *tail*
* Event $B$: both coins are *tail*

We can now calculate the probabilities of these two events:

$P(A)=P(coin_1 = tail)=0.5$

$P(B)=P(coin_1 = tail,\, coin_2 = tail)=P(coin_1 = tail)\times P(coin_2 = tail) = 0.5\times 0.5 = 0.25$

To calculate the probability of event $B$, we are actually calculating the *joint* probability of coin 1 and coin 2 both being tail during the same random experiment. For two independent events, the joint probability equals the product of the individual probabilities.

Let us now introduce the concept of conditional probability. The conditional probability of $B$ given $A$ is the probability that event $B$ occurs given that event $A$ has already occurred. It is written $P(B|A)$ and, in our example, is computed as follows:

$ \begin{align*}
P(B|A) &= P(coin_1 = tail,\, coin_2 = tail|coin_1 = tail) \\
       &= P(coin_1 = tail | coin_1 = tail)\times P(coin_2 = tail | coin_1 = tail) \\
       &= 1 \times P(coin_2 = tail) \\
       &= 1 \times 0.5 \\
       &= 0.5
\end{align*} $

To go from line 1 to line 2, we use the fact that coin 1 and coin 2 are independent, so the joint conditional probability becomes the product of the two conditional probabilities. The probability of coin 1 being tail given that coin 1 is tail is 1, since we already know that coin 1 is tail. Then, because the result of coin 1 does not influence coin 2, the second conditional probability is simply that of coin 2 being tail, which is 0.5. The conditional probability of $B$ given $A$ is therefore 0.5, whereas the unconditional probability of $B$ is 0.25. The same result follows from the general definition $P(B|A)=\frac{P(A,B)}{P(A)}=\frac{0.25}{0.5}=0.5$, since here $A$ and $B$ both happening is the same as $B$ happening.
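The two-coin example can be checked by listing the four equally likely outcomes and counting. The snippet below is a minimal sketch; the helper names `prob`, `is_a` and `is_b` are introduced here for illustration only.

```python
from fractions import Fraction
from itertools import product

# Sample space of the random experiment: every (coin_1, coin_2) outcome.
# The coins are balanced and independent, so the four outcomes are equally likely.
outcomes = list(product(["tail", "head"], repeat=2))

def prob(event):
    """Probability of an event given as a predicate on (coin_1, coin_2)."""
    favourable = [o for o in outcomes if event(o)]
    return Fraction(len(favourable), len(outcomes))

def is_a(o):
    # Event A: the first coin is tail.
    return o[0] == "tail"

def is_b(o):
    # Event B: both coins are tail.
    return o == ("tail", "tail")

p_a = prob(is_a)
p_b = prob(is_b)
p_a_and_b = prob(lambda o: is_a(o) and is_b(o))

# Definition of conditional probability: P(B|A) = P(A, B) / P(A).
p_b_given_a = p_a_and_b / p_a

print(p_a, p_b, p_b_given_a)  # 1/2 1/4 1/2
```

Using `Fraction` keeps the results exact, so they can be compared directly with the values 0.5, 0.25 and 0.5 obtained by hand.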

Everything else in the lecture will be based on these probability rules.

## Bayes theorem

Bayes' theorem states the following:

Let $A$ and $B$ be two events with $P(B) > 0$, then the following equality holds:

### $P(A|B)=\frac{P(B|A)\cdot{P(A)}}{P(B)}$


The conditional probability of $A$ given $B$, $P(A|B)$, is equal to the conditional probability of $B$ given $A$, $P(B|A)$, multiplied by the probability of $A$ and divided by the probability of $B$.
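As a quick sanity check, the theorem can be applied to the two coins from the previous section, where $A$ is "the first coin is tail" and $B$ is "both coins are tail". The function name `bayes` is a hypothetical helper used only for this sketch.

```python
from fractions import Fraction

p_a = Fraction(1, 2)          # P(A): the first coin is tail
p_b = Fraction(1, 4)          # P(B): both coins are tail
p_b_given_a = Fraction(1, 2)  # P(B|A), from the conditional probability section

def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B); requires P(B) > 0."""
    if p_b == 0:
        raise ValueError("P(B) must be strictly positive")
    return p_b_given_a * p_a / p_b

p_a_given_b = bayes(p_b_given_a, p_a, p_b)
print(p_a_given_b)  # 1
```

The result is 1, as expected: if both coins are tail, the first coin is necessarily tail.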



## Naive Bayes

We consider the situation where $Y$ is the qualitative target variable that we are trying to predict (so this is a classification problem), and $X=(X_1,X_2,...,X_p)$ is a collection of explanatory variables. The problem is to estimate, for each observation, the distribution $P(Y|X)$, which gives the probability for $Y$ to take each of its possible values, given the values of $X$ for this observation.

Bayes' theorem applies here and lets us write:


### $P(Y|X)=\frac{P(X|Y)\cdot{P(Y)}}{P(X)}$



The denominator does not involve $Y$: it is the same for every possible value of $Y$ and therefore does not change which value is the most probable. We will only be interested in the numerator, which can be recursively decomposed using the properties of conditional probabilities. The property we use here is:

$P(A,B)=P(A|B)P(B)$

We then get the following:

### $P(X|Y)\cdot{P(Y)}=P(X_1,X_2,...,X_p,Y)$

### $P(X|Y)\cdot{P(Y)}=P(X_1|X_2,...,X_p,Y)\cdot{P(X_2,...,X_p,Y)}$

### $P(X|Y)\cdot{P(Y)}=P(X_1|X_2,...,X_p,Y)\cdot{P(X_2|X_3,...,X_p,Y)}\cdot{P(X_3,...,X_p,Y)}$

### $P(X|Y)\cdot{P(Y)}=P(X_1|X_2,...,X_p,Y)\cdot{P(X_2|X_3,...,X_p,Y)}\cdot{P(X_3|X_4,...,X_p,Y)}\cdots{P(X_p|Y)}\cdot{P(Y)}$

This is where we need the fundamental and naive assumption that allows us to build our estimates: *all the explanatory variables must be independent of each other given $Y$*, because then we have for all $i$ between $1$ and $p$:

### $P(X_i|X_{i+1},...,X_p,Y)=P(X_i|Y)$


As a result, we get:

### $P(Y|X)=\frac{P(Y)P(X_1|Y)P(X_2|Y)...P(X_p|Y)}{P(X)}$



This is very simple to calculate, since for each value of $Y$ we only need to estimate the distribution of each $X_i$ separately, along with $P(Y)$.
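To make the formula concrete, here is a minimal sketch that computes $P(Y|X)$ for a single observation once $P(Y)$ and each $P(X_i|Y)$ are known. All numbers, the class names `"yes"` and `"no"`, and the function name `posterior` are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical values, for illustration only.
prior = {"yes": 0.4, "no": 0.6}   # P(Y = y)
likelihood = {                    # P(X_i = observed value | Y = y), for i = 1, 2
    "yes": [0.8, 0.3],
    "no": [0.1, 0.5],
}

def posterior(prior, likelihood):
    """Return P(Y = y | X) for every class y under the naive assumption."""
    # Numerator of the formula: P(Y) * P(X_1|Y) * ... * P(X_p|Y).
    scores = {y: prior[y] * math.prod(likelihood[y]) for y in prior}
    # Denominator P(X): the sum of the numerators over all classes.
    evidence = sum(scores.values())
    return {y: score / evidence for y, score in scores.items()}

post = posterior(prior, likelihood)
predicted = max(post, key=post.get)  # most probable class
print(post, predicted)
```

Dividing by the sum of the numerators only rescales the scores so that they add up to 1; the predicted class would be the same without this step.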



### Case where $X_i$ is qualitative

In the case where $X_i$ is a qualitative explanatory variable that takes the modalities $x_{i1},...,x_{iq}$, we can write:

### $\hat{P}(X_i=x_{ik}|Y=y)=\frac{Card(X_i=x_{ik}, Y=y)}{Card(Y=y)}$


We estimate the probability that $X_i$ takes the modality $x_{ik}$ given that $Y = y$ as the proportion of observations where $X_i=x_{ik}$ among all observations where $Y = y$.
This is very easy to calculate and, because each variable is handled separately, it can work well even with relatively little data.
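The counting estimator can be sketched in a few lines. The toy data set below (a single variable with modalities `"sunny"` and `"rainy"`, and a target with values `"yes"` and `"no"`) is hypothetical, as is the function name `p_hat`.

```python
from collections import Counter

# Hypothetical toy data: one (value of X_i, value of Y) pair per observation.
data = [
    ("sunny", "yes"), ("sunny", "no"), ("rainy", "no"),
    ("sunny", "yes"), ("rainy", "yes"), ("rainy", "no"),
]

pair_counts = Counter(data)                 # Card(X_i = x, Y = y)
class_counts = Counter(y for _, y in data)  # Card(Y = y)

def p_hat(x, y):
    """Estimate P(X_i = x | Y = y) as a proportion of observations."""
    return pair_counts[(x, y)] / class_counts[y]

print(p_hat("sunny", "yes"))
print(p_hat("rainy", "no"))
```

Note that with this plain estimator, a modality that was never observed together with a given class gets an estimated probability of 0, since `Counter` returns 0 for a missing pair.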


### Case where $X_i$ is quantitative

When $X_i$ is quantitative, we can go back to the qualitative case by cutting the range of values of $X_i$ into $K$ intervals indexed by $j\in\{0,...,K-1\}$ and delimited by the values $-\infty=\alpha_0,\alpha_1,...,\alpha_{K-1},\alpha_K=+\infty$. The estimated probability becomes:

### $\hat{P}(X_i\in[\alpha_j,\alpha_{j+1}]|Y=y)=\frac{Card(X_i\in[\alpha_j,\alpha_{j+1}],Y=y)}{Card(Y=y)}$


That is, the proportion of observations for which the value of $X_i$ belongs to the interval $[\alpha_j,\alpha_{j+1}]$ among all observations for which $Y = y$. This technique is called discretization of a continuous variable.
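Discretization can be sketched with the standard library as follows. The cut points, the toy data and the helper names are hypothetical. One choice made here, not taken from the text, is to use half-open intervals (lower bound included, upper bound excluded) so that a value equal to a cut point falls into exactly one interval.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical cut points alpha_1 < alpha_2; alpha_0 = -inf and alpha_3 = +inf
# are implicit, so there are K = 3 intervals.
cuts = [10.0, 20.0]

def interval_index(x):
    """Return j such that alpha_j <= x < alpha_{j+1}."""
    return bisect_right(cuts, x)

# Hypothetical toy data: one (value of X_i, value of Y) pair per observation.
data = [(5.0, "a"), (12.0, "a"), (18.0, "a"), (15.0, "a"), (25.0, "b"), (8.0, "b")]

bin_counts = Counter((interval_index(x), y) for x, y in data)  # Card(X_i in bin j, Y = y)
class_counts = Counter(y for _, y in data)                     # Card(Y = y)

def p_hat(j, y):
    """Estimate P(X_i in interval j | Y = y) as a proportion of observations."""
    return bin_counts[(j, y)] / class_counts[y]

print(p_hat(1, "a"))  # 0.75
print(p_hat(2, "b"))  # 0.5
```

Once the values are mapped to interval indices, the estimation is exactly the counting procedure of the qualitative case.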

Another way to estimate the distribution of $X_i$ given $Y$ is to assume that, for each value $y$ of $Y$, $X_i$ follows a normal distribution whose parameters $\mu_{iy}$ (mean) and $\sigma_{iy}$ (standard deviation) are estimated from the available data on $X_i$. Under this normality assumption, the mean is estimated as:


### $\mu_{iy}=\frac{1}{N_y}\sum_{j=1}^{N_y}x_{ij}$ 


This is the average value of $X_i$ among the $N_y$ individuals for whom $Y = y$. Similarly, we calculate the variance of the normal distribution:


### $\sigma_{iy}^2=\frac{1}{N_y-1}\sum_{j=1}^{N_y}(x_{ij}-\mu_{iy})^2$



This is the variance estimator of $X_i$ among individuals for whom $Y = y$. Once this estimation is done, we get the following density:


### $\hat{P}(X_i=x_{ik}|Y=y)=\frac{1}{\sqrt{2\pi\sigma_{iy}^2}}\exp\left(\frac{-(x_{ik}-\mu_{iy})^2}{2\sigma_{iy}^2}\right)$


Once all the conditional probabilities have been calculated, we obtain for each observation and for each modality of $Y$ a probability which determines our classification. Each observation will be classified in the modality of $Y$ that is the most probable given the values of the explanatory variables $X$.
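The whole procedure under the normality assumption can be sketched with the standard library only. The toy data and the function names `fit`, `log_normal_density` and `predict` are hypothetical. The sketch sums log-densities instead of multiplying densities, a common choice to avoid numerical underflow that does not change which class is the most probable. It also assumes at least two observations per class and a non-zero variance for every variable.

```python
import math
from collections import defaultdict
from statistics import mean, variance

def fit(X, y):
    """Estimate P(Y = y), mu_iy and sigma_iy^2 from rows X and labels y."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        columns = list(zip(*rows))  # one tuple of values per variable X_i
        model[label] = {
            "prior": len(rows) / len(X),
            "mu": [mean(c) for c in columns],
            "var": [variance(c) for c in columns],  # uses the 1/(N_y - 1) estimator
        }
    return model

def log_normal_density(x, mu, var):
    """Logarithm of the normal density with mean mu and variance var, at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def predict(model, row):
    """Return the class maximizing log P(Y) + sum_i log p(x_i | Y)."""
    def log_score(params):
        return math.log(params["prior"]) + sum(
            log_normal_density(x, m, v)
            for x, m, v in zip(row, params["mu"], params["var"])
        )
    return max(model, key=lambda label: log_score(model[label]))

# Hypothetical toy data: two quantitative variables, two classes.
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.2], [3.2, 3.8], [2.9, 4.0]]
y = ["a", "a", "a", "b", "b", "b"]

model = fit(X, y)
print(predict(model, [1.1, 2.0]))  # a
print(predict(model, [3.1, 4.1]))  # b
```

The denominator $P(X)$ is left out of `predict` because it is the same for every class and does not affect the comparison.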

## General remarks

An advantage of the naive Bayes model is that it allows us to avoid making assumptions about the distribution laws of the explanatory variables if we transform them into qualitative variables. However, it is very rare that the fundamental hypothesis of the independence of the explanatory variables is verified in practice.

Naive Bayes models can be aggregated in the same way as the trees of a random forest, which generally gives much more stable results. It also better respects the hypothesis of independence of the explanatory variables if, as we have seen in the case of random forests, only part of the explanatory variables is used to build each model.
