## Naive Bayes (generative) model

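A Naive Bayes model is a generative classifier: it models how the features are distributed within each class, then uses Bayes' rule to predict the most probable class for a new observation. It is called "naive" because it assumes the features are conditionally independent given the class.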

**Conditional probability and Bayes' rule:**  

$$ P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A) P(B|A)}{P(B)}  $$ 
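
As a quick numerical illustration of Bayes' rule, the sketch below computes $P(A|B)$ from a prior and two likelihoods. The function name `posterior` and the probabilities 0.3, 0.8 and 0.1 are made up for this example and do not come from the wine data.

```python
def posterior(prior_a, lik_b_given_a, lik_b_given_not_a):
    """Return P(A|B) from P(A), P(B|A) and P(B|not A) using Bayes' rule."""
    # Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
    evidence = lik_b_given_a * prior_a + lik_b_given_not_a * (1 - prior_a)
    return lik_b_given_a * prior_a / evidence

# Hypothetical numbers: P(A) = 0.3, P(B|A) = 0.8, P(B|not A) = 0.1
p_a_given_b = posterior(0.3, 0.8, 0.1)
print(p_a_given_b)
```

Observing $B$ raises the probability of $A$ above its prior of 0.3 because $B$ is much more likely under $A$ than under its complement.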

Three common types of Naive Bayes model:
1. Multinomial (discrete counts or frequencies, e.g. word counts)
2. Bernoulli (binary features)
3. Gaussian (continuous features, assumed normally distributed within each class)

**Learning with a practical dataset:**
Here we fit a model to each class independently. Say we have two classes with one-dimensional probability distributions $P_1(x)$ and $P_2(x)$. 

Let's say a fraction $\pi_1$ of our training set belongs to class one and a fraction $\pi_2$ to class two ($\pi_1 + \pi_2 = 1$). 

For a test point $x$, we predict the class $i$ for which $\pi_i P_i(x)$ is largest. Note that $\pi_i$ is estimated from the training dataset. 
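
The decision rule can be sketched with two hypothetical Gaussian class densities. The priors (0.7 and 0.3), means (0 and 2) and unit standard deviations below are illustrative assumptions, chosen so that the prior visibly shifts the decision boundary.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

# Hypothetical class priors pi_i and Gaussian parameters (mean, std) for P_i
priors = {1: 0.7, 2: 0.3}
params = {1: (0.0, 1.0), 2: (2.0, 1.0)}

def predict_class(x):
    """Return the class label i that maximises pi_i * P_i(x)."""
    scores = {c: priors[c] * gaussian_pdf(x, *params[c]) for c in priors}
    return max(scores, key=scores.get)

print(predict_class(1.2), predict_class(1.5))
```

With equal priors the boundary would sit halfway between the means at $x = 1$. Because class one is more frequent here, the boundary moves toward class two, so $x = 1.2$ is still assigned to class one even though it lies closer to the mean of class two.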

In [None]:
# load python packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

%matplotlib inline
plt.rcParams["figure.figsize"] = (8, 6)
plt.rcParams["font.size"] = 16

In [None]:
# import UC Irvine wine classification dataset
headers = ['Category', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium',
           'Total phenols', 'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
           'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']
data = pd.read_csv("../input/wineuci/Wine.csv", names=headers)
data.head()

In [None]:
data.info()

In [None]:
# Let's consider one feature (Alcohol) and plot it for the first category
cat1 = data.loc[data['Category'] == 1]
plt.hist(cat1["Alcohol"], density=True)
plt.show()

In [None]:
# Let's fit it with a Gaussian distribution (maximum-likelihood mean and variance)
mu = np.mean(cat1["Alcohol"])                # mean
var = np.var(cat1["Alcohol"])                # variance
std = np.sqrt(var)                           # standard deviation

x_axis = np.linspace(mu - 3*std, mu + 3*std, 1000)
plt.plot(x_axis, norm.pdf(x_axis,mu,std), 'r', lw=2)
plt.hist(cat1["Alcohol"], density=True)
plt.xlabel('Alcohol content')
plt.ylabel('Probability density')
plt.show()

We fit a probability distribution $P_i$ for each category. The prior $\pi_i$ of each category is simply its relative frequency in the training dataset: (number of samples in that category)/(total sample size). For a new data point $x$, we calculate $\pi_i P_i(x)$ for every category and choose the label for which it is largest. 
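
The procedure can be sketched from scratch for a single feature. The toy arrays `x_train` and `y_train` below are made-up values rather than wine measurements, and the helper names are hypothetical.

```python
import numpy as np

# Toy, made-up one-dimensional training data (not from the wine dataset)
x_train = np.array([1.0, 1.2, 0.8, 1.1, 3.0, 3.2, 2.9])
y_train = np.array([1, 1, 1, 1, 2, 2, 2])

def fit_gaussian_classes(x, y):
    """Estimate the prior, mean and standard deviation of each class."""
    fitted = {}
    for label in np.unique(y):
        x_c = x[y == label]
        fitted[int(label)] = {
            "prior": len(x_c) / len(x),  # pi_i: relative class frequency
            "mean": x_c.mean(),
            "std": x_c.std(),            # maximum-likelihood estimate (ddof=0)
        }
    return fitted

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def predict(x_new, fitted):
    """Return the label that maximises pi_i * P_i(x_new)."""
    scores = {c: p["prior"] * gaussian_pdf(x_new, p["mean"], p["std"])
              for c, p in fitted.items()}
    return max(scores, key=scores.get)

fitted = fit_gaussian_classes(x_train, y_train)
print(predict(1.05, fitted), predict(3.1, fitted))
```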

Now we will do the same using scikit-learn, this time using all the predictor variables in the data.  
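
With several features, a Gaussian Naive Bayes model treats the features as independent within each class, so the class score is the prior times the product of the per-feature densities. A minimal sketch of that combination, computed in log space to avoid underflow, is shown below. The priors, means and standard deviations are made-up values, and `class_log_score` is a hypothetical helper rather than part of scikit-learn.

```python
import numpy as np

def log_gaussian(x, mean, std):
    """Elementwise log-density of a normal distribution."""
    return -0.5 * np.log(2 * np.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2

def class_log_score(x, prior, means, stds):
    """log(pi_i) plus the sum of per-feature log-densities (naive independence)."""
    return np.log(prior) + np.sum(log_gaussian(x, means, stds))

# Hypothetical two-feature example with equal priors and unit variances
x_new = np.array([1.0, 5.0])
score_1 = class_log_score(x_new, 0.5, np.array([1.0, 5.0]), np.array([1.0, 1.0]))
score_2 = class_log_score(x_new, 0.5, np.array([3.0, 8.0]), np.array([1.0, 1.0]))
predicted = 1 if score_1 > score_2 else 2
print(predicted)
```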

In [None]:
# first do a train test split of our data
X, X_test, y, y_test = train_test_split(data.drop(['Category'], axis=1),
                                        data['Category'], test_size=0.3, random_state=0)

In [None]:
# check that we have sufficient training data for each wine category
y.value_counts()

In [None]:
y_test.value_counts()

In [None]:
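# fit a Gaussian Naive Bayes classifier on the training data
GaussNB = GaussianNB()
GaussNB.fit(X, y)
# predict the fifth test sample; iloc[[4]] keeps a one-row DataFrame with feature names
GaussNB.predict(X_test.iloc[[4]])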

In [None]:
y_test.head()

In [None]:
# mean accuracy on the held-out test set
GaussNB.score(X_test, y_test)