# **TP2 (BONUS part): building, training and evaluating your first supervised classifiers**


# 3. Gaussian Naive Bayes classifier

## Model overview

Below, we develop the Gaussian Naive Bayes classifier for a multi-class classification task. To understand its principle, let's reverse the classical machine learning workflow and first look at how this classifier predicts the class $C_i$ of a new data sample $(x_1,...,x_n)$ described by $n$ features (independent variables). Based on [Bayes' theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem), the prediction relies on the following conditional probability

\begin{equation} 
P(C_i | x_1,...,x_n ) = \frac{P(x_1,...,x_n | C_i ) \, P( C_i )}{P(x_1,...,x_n)} \tag{eq. 4.1}
\end{equation}

The denominator $P(x_1,...,x_n)$ does not depend on the class, so it can be dropped when comparing classes: the predicted class is the one that maximizes the numerator $P(x_1,...,x_n | C_i ) \, P( C_i )$. This model comes with two strong assumptions on these probability distributions, summed up below:

-  **Data columns are conditionally independent of each other given the class**, i.e. the input variables are treated separately, that is

\begin{equation}
P(x_1,...,x_n | C_i) = P(x_1| C_i) \times \dots \times P(x_n| C_i) \tag{eq. 4.2}
\end{equation}

-  **Data are normally distributed**, i.e. the class-conditional distribution of each input attribute (each column of our data), $P(x_k | C_i)$, is modeled as a Gaussian distribution.
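
To make the second assumption concrete, the per-feature term can be written with the standard Gaussian density. The symbols $\mu_{k,i}$ and $\sigma_{k,i}$ (mean and standard deviation of feature $k$ within class $C_i$) and $\hat{C}$ (predicted class) are notation introduced here for illustration only:

\begin{equation}
P(x_k | C_i) = \frac{1}{\sqrt{2\pi\sigma_{k,i}^2}} \exp\left(-\frac{(x_k-\mu_{k,i})^2}{2\sigma_{k,i}^2}\right)
\end{equation}

Combined with the independence assumption of eq. 4.2, the predicted class is then the one with the largest product of prior and per-feature densities:

\begin{equation}
\hat{C} = \arg\max_{i} \; P(C_i) \prod_{k=1}^{n} P(x_k | C_i)
\end{equation}

Strictly speaking, $P(x_k | C_i)$ is a probability density here, so its value can exceed 1; only the comparison between classes matters.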



## Dataset preparation

To illustrate this classifier on a multi-class, single-label classification task, we will use the famous [Iris](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html) dataset (see its documentation for details). The code cell below loads it into a pandas DataFrame and creates the training and test variables.

In [None]:
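import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset (150 samples, 4 features, 3 classes)
dataset = load_iris()

# Class labels (0, 1, 2) and features as a DataFrame with named columns
y = dataset['target']
X = pd.DataFrame(dataset.data, columns=dataset['feature_names'])

# Hold out 20% of the samples for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

# Preview the first rows
X.head()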

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [None]:
np.unique(dataset['target'])

array([0, 1, 2])

## Model training: calculate P(X1,...,Xn |class) and P(class)

To calculate the probability of the data given the class they belong to, i.e. P(data | class), we need to 1) separate our training data by class and 2) calculate the mean and standard deviation, $\mathbf{\mu}$ and $\Sigma$, of each column within each class. These statistics are what is needed to fit the normal distribution of each attribute given a class.

**Question 3.1** Using the [groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) function of pandas DataFrames, combine these two operations in a single line of code to compute $\mathbf{\mu}$, and then do the same for $\Sigma$.
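
As a hint at the pattern, here is a minimal sketch on a toy dataset rather than on Iris. The names `X_toy` and `y_toy` and their values are hypothetical; the idea is that `groupby` accepts an external label vector and that an aggregation such as `mean()` or `std()` is then applied to every column within each group.

```python
import pandas as pd

# Toy training set (hypothetical values, not the Iris data)
X_toy = pd.DataFrame({"f1": [1.0, 3.0, 10.0, 14.0],
                      "f2": [2.0, 4.0, 5.0, 7.0]})
y_toy = pd.Series([0, 0, 1, 1], name="target")

# Split the rows by class label and aggregate each column per class
mu = X_toy.groupby(y_toy).mean()     # one row per class, one column per feature
sigma = X_toy.groupby(y_toy).std()   # sample standard deviation (ddof=1 by default)

print(mu)
print(sigma)
```

Note that pandas computes the sample standard deviation (`ddof=1`) by default, whereas scikit-learn's `GaussianNB` uses the population variance plus a small smoothing term, so small numerical differences between the two are to be expected.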

**Question 3.2** We also need to compute the class prior probability, which is simply the number of elements of a class divided by the total number of elements in the training set. Still using the `groupby()` method, together with a `lambda` expression inside an `apply()` call, compute this probability in a single line of code. 

*Tips*: to help you with the `apply()` method and `lambda` expressions, first try to understand this toy [example](https://pandas.pydata.org/pandas-docs/version/0.22.0/generated/pandas.core.groupby.GroupBy.apply.html)
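
The `groupby()` / `apply()` / `lambda` combination can be illustrated on a toy label vector. This is only a sketch under the assumption that the labels are held in a pandas Series; `y_toy` and its values are hypothetical.

```python
import pandas as pd

# Toy label vector (hypothetical): three samples of class 0, two of class 1, one of class 2
y_toy = pd.Series([0, 0, 0, 1, 1, 2], name="target")

# For each group of identical labels, divide the group size by the total number of samples
priors = y_toy.groupby(y_toy).apply(lambda g: len(g) / len(y_toy))

print(priors)
```

The result is a Series indexed by class label whose values sum to 1, which is a quick sanity check for any prior estimate.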

## Model test: calculate P( class | X1 , ... , Xn) 

**Question 3.3** Before going any further, let's first make sure you understand the meaning of the term $P(x_k | C_i)$ in eq. 4.2.

In one sentence, can you explain what this term computes during the test phase? Using the function `univariate_normal`, calculate the values of this term for $x_k = [1,2,0]$, $\mu = [1,1,1]$ and $\Sigma = [1,1,1]$. Comment on the results.
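
Since the definition of `univariate_normal` is not shown in this section, the sketch below uses a hypothetical stand-in, `gaussian_pdf`, which assumes that its third argument is a standard deviation (with a value of 1 the distinction from a variance does not matter). It evaluates the Gaussian density element-wise for the values given in the question.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Hypothetical stand-in for `univariate_normal`; `sigma` is a standard deviation."""
    x, mu, sigma = (np.asarray(a, dtype=float) for a in (x, mu, sigma))
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Values from the question: x_k = [1, 2, 0], mu = [1, 1, 1], Sigma = [1, 1, 1]
densities = gaussian_pdf([1, 2, 0], [1, 1, 1], [1, 1, 1])
print(densities)  # approximately [0.3989, 0.2420, 0.2420]
```

The density is highest when $x_k$ equals the mean, and it is symmetric: $x_k = 2$ and $x_k = 0$ lie at the same distance from $\mu = 1$ and therefore get the same value.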

**Question 3.4** Implement the calculation of P( class | X1 , ... , Xn) based on equations 4.1 and 4.2.

*Tips*: we recommend using three `for` loops over 1) the test samples, 2) the possible classes and 3) the columns of each sample. For loop 3), you can use the `enumerate` function. 
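
The three-loop structure suggested in the tips can be sketched as follows on a toy dataset. Everything here is an illustrative assumption rather than the expected solution: the helper names `gaussian_pdf` and `predict_proba_loops`, the toy values, and the choice to normalize the scores so that they sum to 1 for each sample.

```python
import numpy as np
import pandas as pd

def gaussian_pdf(x, mu, sigma):
    """Hypothetical Gaussian density helper; `sigma` is a standard deviation."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def predict_proba_loops(X_test, mu, sigma, priors):
    """Posterior P(class | x_1..x_n) per test sample, following eq. 4.1 and eq. 4.2."""
    classes = list(mu.index)
    scores = np.zeros((len(X_test), len(classes)))
    for s, (_, row) in enumerate(X_test.iterrows()):      # loop 1: test samples
        for c, label in enumerate(classes):               # loop 2: possible classes
            score = priors.loc[label]                     # start from the prior P(class)
            for k, col in enumerate(X_test.columns):      # loop 3: sample columns
                score *= gaussian_pdf(row.iloc[k], mu.loc[label, col], sigma.loc[label, col])
            scores[s, c] = score
    # Divide by the sum over classes, i.e. the evidence P(x_1..x_n)
    return scores / scores.sum(axis=1, keepdims=True)

# Toy training data (hypothetical values, not the Iris data)
X_toy = pd.DataFrame({"f1": [1.0, 3.0, 10.0, 14.0], "f2": [2.0, 4.0, 5.0, 7.0]})
y_toy = pd.Series([0, 0, 1, 1], name="target")
mu = X_toy.groupby(y_toy).mean()
sigma = X_toy.groupby(y_toy).std()
priors = y_toy.groupby(y_toy).apply(lambda g: len(g) / len(y_toy))

# Two toy test samples, each placed at the mean of one class
X_new = pd.DataFrame({"f1": [2.0, 12.0], "f2": [3.0, 6.0]})
proba = predict_proba_loops(X_new, mu, sigma, priors)
pred = mu.index.to_numpy()[proba.argmax(axis=1)]
print(proba)
print(pred)
```

With many features, the product of densities can underflow, which is why implementations commonly sum log-probabilities instead; for the four Iris features the plain product is sufficient.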

**Question 3.5** Verify that your model performs similarly to the scikit-learn `GaussianNB` classifier.
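
One way to obtain the reference predictions is sketched below, reusing the same split parameters as in the dataset preparation cell. The variable names `reference`, `ref_pred` and `ref_acc` are hypothetical, and the last two commented lines are a placeholder for the predictions of your own model.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Same split as in the dataset preparation cell
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

# Fit the scikit-learn reference model and predict the test labels
reference = GaussianNB().fit(X_train, y_train)
ref_pred = reference.predict(X_test)
ref_acc = accuracy_score(y_test, ref_pred)
print(f"GaussianNB test accuracy: {ref_acc:.3f}")

# Placeholder: compare with the predictions of the hand-written model
# my_pred = ...
# print("agreement with GaussianNB:", (my_pred == ref_pred).mean())
```

Comparing the predicted labels sample by sample, and not only the two accuracy scores, shows more directly whether both models make the same decisions.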

# SOURCES

## Naive Bayes

- https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/

- https://towardsdatascience.com/implementing-naive-bayes-in-2-minutes-with-python-3ecd788803fe
