# Naive Bayes

The scikit-learn library provides several Naive Bayes models. Three commonly used ones are:

**Gaussian**: It is used for classification with continuous features, and it assumes that the features follow a normal distribution within each class.

**Multinomial**: It is used for discrete counts. In a text classification problem, for example, it goes one step beyond Bernoulli trials: instead of “word occurs in the document”, the feature is “how often the word occurs in the document”. You can think of it as “the number of times outcome x_i is observed over the n trials”.

**Bernoulli**: The Bernoulli model is useful if your feature vectors are binary (i.e. zeros and ones). One application is text classification with a ‘bag of words’ model, where 1 and 0 mean “word occurs in the document” and “word does not occur in the document” respectively.
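
To make the difference between the Multinomial and Bernoulli feature encodings concrete, here is a minimal sketch in plain Python. The vocabulary and the sample document are made up for illustration and are not part of the original example.

```python
from collections import Counter

# Hypothetical vocabulary and document, chosen only for illustration.
vocabulary = ["cheap", "meeting", "offer", "report"]
document = "cheap offer cheap cheap report"

word_counts = Counter(document.split())

# Multinomial-style features: how often each vocabulary word occurs.
count_features = [word_counts[word] for word in vocabulary]

# Bernoulli-style features: 1 if the word occurs at all, otherwise 0.
binary_features = [1 if word_counts[word] > 0 else 0 for word in vocabulary]

print(count_features)   # [3, 0, 1, 1]
print(binary_features)  # [1, 0, 1, 1]
```

The same document produces both vectors; only the question asked of each word changes.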




---




**Pros:**
*   Easy implementation
*   Fast
*   It works with binary and multiclass problems, and with discrete or continuous variables

**Cons:**

*   Independent predictors: The model is called naive because it ignores the relationships between features. For example, an animal may be considered a cat if it has cat eyes, whiskers and a long tail. Even if these features depend on each other or on the existence of other features, the model treats each one as contributing *independently* to the probability that the animal is a cat, and that is why it is known as ‘Naive’. Truly independent features are rare in the real world.
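
Written as a formula, the independence assumption means the posterior factorizes into one term per feature. The notation here is an assumption of this note rather than something defined earlier: $y$ is the class (for example "cat") and $x_1, \dots, x_n$ are the observed features.

$$
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

Each factor $P(x_i \mid y)$ is estimated separately, which is what makes the model simple and fast, and also what makes it naive.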




---




*Example:* Suppose we want the probability that a student passes given that they wear a “red” dress on the exam day. We can compute it as a posterior probability.
By Bayes' theorem, P(Pass | Red) = P(Red | Pass) * P(Pass) / P(Red).
Let us assume the following values:
* P(Red | Pass) = 3/9 = 0.33,
* P(Red) = 5/14 = 0.36,
* P(Pass) = 9/14 = 0.64.

So, **P(Pass | Red) = 0.33 * 0.64 / 0.36 ≈ 0.60** (exactly 3/5 when computed from the fractions). Since this is above 0.5, Pass is the more probable outcome for a student wearing red.
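
The arithmetic can be checked with exact fractions. This is a small sketch using only the three values given in the example; the assumption that "Fail" is the only alternative to "Pass" is mine.

```python
from fractions import Fraction

# Values taken from the example.
p_red_given_pass = Fraction(3, 9)
p_red = Fraction(5, 14)
p_pass = Fraction(9, 14)

# Bayes' theorem: P(Pass | Red) = P(Red | Pass) * P(Pass) / P(Red)
p_pass_given_red = p_red_given_pass * p_pass / p_red

# Assumption: "Fail" is the only other outcome, so the two posteriors sum to 1.
p_fail_given_red = 1 - p_pass_given_red

print(p_pass_given_red, float(p_pass_given_red))  # 3/5 0.6
print(p_fail_given_red, float(p_fail_given_red))  # 2/5 0.4
```

Working with the rounded decimals gives roughly 0.59, so the exact fractions are the safer way to reproduce the 0.60 figure.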




# Python implementation

**Iris Flower Dataset** 

It consists of 3 classes of flowers. There are 4 independent variables: sepal_length, sepal_width, petal_length and petal_width. The dependent variable is the species, which we will predict from the four features of the flowers.

**1) Import libraries**

In [4]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets

**2) Create dataframe**

In [5]:
dataset = pd.read_csv('https://raw.githubusercontent.com/mk-gurucharan/Classification/master/IrisDataset.csv')
X = dataset.iloc[:,:4].values
y = dataset['species'].values
dataset.head(5)

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


**3) Splitting the dataset into the Training set and Test set**

Once we have obtained our data set, we split it into a training set and a test set. The data set has 150 rows, 50 for each of the 3 classes. Because the rows are ordered by class, the split must be random. Here test_size=0.2 means that 20% of the data set is held out as the test set, and the remaining 80% is used to train the Naive Bayes classification model.

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
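
Because the split is random, the numbers reported later will change from run to run. The sketch below shows one way to make the split repeatable and to keep the class proportions equal in both sets. It assumes scikit-learn's bundled copy of Iris (`load_iris`) as a stand-in for the CSV file, and the seed value 0 is arbitrary.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled Iris data stands in for the CSV used in this notebook.
iris = load_iris()
X, y = iris.data, iris.target

# random_state makes the split repeatable; stratify keeps class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)               # (120, 4) (30, 4)
print(sorted(Counter(y_test.tolist()).items()))  # [(0, 10), (1, 10), (2, 10)]
```

With stratification, each of the 3 classes contributes 10 of the 30 test rows.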

**4) Feature Scaling:**

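Next, the features are standardized with StandardScaler, which rescales each feature to zero mean and unit variance. The scaler is fitted on X_train only and then applied to X_test, so no information from the test set leaks into training. Gaussian Naive Bayes estimates a separate mean and variance for each feature, so this step is not expected to change its predictions much; it matters more for models that are sensitive to feature scale.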

In [7]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
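
As a rough illustration of what "fit on the training set, transform both sets" means, here is a plain-Python sketch of standardization for a single feature. The values are made up, and the population standard deviation is used, which matches how StandardScaler computes it.

```python
from statistics import fmean, pstdev

# Hypothetical single-feature values, for illustration only.
train_values = [4.0, 5.0, 6.0, 7.0, 8.0]
test_values = [5.0, 9.0]

# "Fit": learn the mean and population standard deviation from the training data.
mean = fmean(train_values)
std = pstdev(train_values)

# "Transform": apply the training statistics to both sets.
train_scaled = [(v - mean) / std for v in train_values]
test_scaled = [(v - mean) / std for v in test_values]

print(round(fmean(train_scaled), 6), round(pstdev(train_scaled), 6))  # 0.0 1.0
print([round(v, 3) for v in test_scaled])                             # [-0.707, 2.121]
```

The test values are not guaranteed to have zero mean or unit variance after the transform, because the statistics come from the training data only.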

**5) Training the Naive Bayes Classification model on the Training Set**

In this step, we import the GaussianNB class from the sklearn.naive_bayes module. We use the Gaussian model here; other available models include Bernoulli, Categorical and Multinomial. We create a GaussianNB instance, assign it to the variable classifier, and fit it to the X_train and y_train values.

In [8]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)
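
To show roughly what fitting a Gaussian model involves, here is a minimal from-scratch sketch in plain Python. It is not scikit-learn's implementation: the function names, the toy data and the small constant added to the variance are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Small constant added so a zero variance cannot cause division by zero
        # (an assumption of this sketch, not scikit-learn's exact smoothing rule).
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one sample."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for value, mean, var in zip(row, means, variances):
            # Log of the normal density, one independent term per feature.
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: two features, two classes.
X_toy = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
y_toy = ["a", "a", "a", "b", "b", "b"]

model = fit_gaussian_nb(X_toy, y_toy)
print(predict_gaussian_nb(model, [1.1, 2.0]))  # a
print(predict_gaussian_nb(model, [3.1, 4.0]))  # b
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow when there are many features.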

**6) Predicting the Test set results**

Once the model is trained, we use classifier.predict() to predict the values for the Test set and store the predictions in the variable y_pred.

In [9]:
y_pred = classifier.predict(X_test) 
y_pred

array(['setosa', 'virginica', 'versicolor', 'versicolor', 'setosa',
       'virginica', 'setosa', 'versicolor', 'versicolor', 'setosa',
       'setosa', 'setosa', 'virginica', 'setosa', 'versicolor',
       'versicolor', 'virginica', 'setosa', 'setosa', 'versicolor',
       'virginica', 'versicolor', 'versicolor', 'versicolor', 'virginica',
       'virginica', 'setosa', 'virginica', 'versicolor', 'virginica'],
      dtype='<U10')

**7) Confusion Matrix and Accuracy**

This step is common to most classification techniques. We compute the accuracy of the trained model and its confusion matrix.
The confusion matrix is a table that shows the number of correct and incorrect predictions on a classification problem when the real values of the Test Set are known. In scikit-learn's layout, each row is a real class and each column is a predicted class, so the diagonal entries are the number of correct predictions made.

In [10]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
from sklearn.metrics import accuracy_score 
print ("Accuracy : ", accuracy_score(y_test, y_pred))
cm

Accuracy :  0.9666666666666667


array([[10,  0,  0],
       [ 0, 11,  1],
       [ 0,  0,  8]])

From the above confusion matrix, we infer that, out of 30 test set samples, 29 were correctly classified and only 1 was incorrectly classified (a versicolor flower predicted as virginica). For this particular random split, that gives a high accuracy of 96.67%.
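
The accuracy can be recovered directly from the confusion matrix, since the correct predictions sit on its diagonal. A small sketch in plain Python, with the matrix copied from the output above:

```python
# Confusion matrix copied from the output above:
# rows are real classes, columns are predicted classes.
cm = [
    [10, 0, 0],
    [0, 11, 1],
    [0, 0, 8],
]

correct = sum(cm[i][i] for i in range(len(cm)))  # diagonal entries
total = sum(sum(row) for row in cm)              # all test samples
misclassified = total - correct
accuracy = correct / total

print(correct, total, misclassified)  # 29 30 1
print(accuracy)                       # 0.9666666666666667
```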

**8) Comparing the Real Values with Predicted Values**

In this step, a Pandas DataFrame is created to compare the first ten real values of the Test set (y_test) with the corresponding predicted results (y_pred).

In [11]:
df = pd.DataFrame({'Real Values':y_test[0:10], 'Predicted Values':y_pred[0:10]})
df

|   | Real Values | Predicted Values |
|---|---|---|
| 0 | setosa | setosa |
| 1 | virginica | virginica |
| 2 | versicolor | versicolor |
| 3 | versicolor | versicolor |
| 4 | setosa | setosa |
| 5 | versicolor | virginica |
| 6 | setosa | setosa |
| 7 | versicolor | versicolor |
| 8 | versicolor | versicolor |
| 9 | setosa | setosa |


# Tips to improve the power of a Naive Bayes model

* If continuous features are not normally distributed, apply a transformation or another method to **bring them closer to a normal distribution**.
* If a feature value in the test data set was never seen together with a given class during training (the zero frequency issue), the model assigns it a probability of zero. Apply a smoothing technique such as “***Laplace Correction***” so the class of such test data can still be predicted.
* **Remove correlated features**, as highly correlated features are effectively counted twice in the model, which can inflate their importance.
* Naive Bayes classifiers have limited options for parameter tuning, such as alpha=1 for smoothing and fit_prior=[True|False] to choose whether to learn class prior probabilities, plus a few other options. I would recommend that you **focus on your pre-processing of data and the feature selection.**
* You might think of applying a classifier combination technique such as ensembling, bagging or boosting, but these methods tend to help little here. Their purpose is largely to reduce variance, and **Naive Bayes is a low-variance model with little variance to reduce.**
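
The smoothing tip can be illustrated with a short sketch of additive (Laplace) smoothing for word counts. The function name and the counts are hypothetical; `alpha` plays the same role as the smoothing parameter mentioned in the list.

```python
def smoothed_likelihood(count, class_total, vocabulary_size, alpha=1.0):
    """P(word | class) with additive (Laplace) smoothing."""
    return (count + alpha) / (class_total + alpha * vocabulary_size)


# Hypothetical counts: the word never appeared in this class during training.
count_in_class = 0
total_words_in_class = 20
vocabulary_size = 5

unsmoothed = count_in_class / total_words_in_class
smoothed = smoothed_likelihood(count_in_class, total_words_in_class, vocabulary_size)

print(unsmoothed)  # 0.0
print(smoothed)    # 0.04
```

Without smoothing, a single unseen word would force the whole product of likelihoods for that class to zero.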
