# Naive Bayes classification algorithm 

## What is a classifier?
A classifier is a machine learning model that is used to `discriminate different objects based on certain features.`


Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable. 

## What is the Naive Bayes algorithm?

Naive Bayes is a supervised machine learning algorithm used mainly for classification. It is based on Bayes’ theorem with an assumption of independence among predictors. `In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.`  For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, the classifier treats each of them as contributing independently to the probability that the fruit is an apple, which is why it is known as ‘Naive’.  
`A Naive Bayes model is easy to build and particularly useful for very large data sets. Despite its simplicity, Naive Bayes can perform competitively with more sophisticated classification methods on some problems.`

### Principle of Naive Bayes Classifier:

A Naive Bayes classifier is a probabilistic machine learning model used for classification tasks. It applies Bayes' theorem of probability to predict the class of unseen samples.

### Bayes Theorem:

<img src="https://miro.medium.com/max/638/1*tjcmj9cDQ-rHXAtxCu5bRQ.png">


Bayes' theorem states that P(A|B) = P(B|A) P(A) / P(B). Using it, we can find the probability of A happening given that B has occurred. Here, B is the evidence and A is the hypothesis. The assumption made here is that the predictors/features are independent given the class, that is, the presence of one particular feature does not affect the others. Hence it is called naive.
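To make the theorem concrete, here is a small worked sketch. The function name `posterior` and all probabilities are made up for illustration and are not taken from the dataset used later; the evidence P(B) is obtained with the law of total probability.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(A|B) from P(A), P(B|A) and P(B|not A) using Bayes' theorem."""
    # P(B) by the law of total probability
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1 - prior_a))
    return likelihood_b_given_a * prior_a / evidence

# Hypothetical numbers: 30% of patients are diabetic (A); a high glucose
# reading (B) is seen in 80% of diabetic and 20% of non-diabetic patients.
p = posterior(prior_a=0.3, likelihood_b_given_a=0.8, likelihood_b_given_not_a=0.2)
print(round(p, 4))  # 0.24 / 0.38, about 0.6316
```

Observing the evidence raises the probability of the hypothesis from the prior of 0.3 to roughly 0.63.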


To understand this, we first need to know conditional probability. Please refer to <a href="http://www.cs.uni.edu/~campbell/stat/prob4.html">Conditional probability and the product rule</a>.



Let's import the required libraries.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # for data visualization purposes
import seaborn as sns # for statistical data visualization
%matplotlib inline

from sklearn.model_selection import train_test_split
import warnings

warnings.filterwarnings('ignore')

## Data
We will use the Pima Indians Diabetes dataset. The data is available on Kaggle and can be downloaded from <a href='https://www.kaggle.com/uciml/pima-indians-diabetes-database'>here</a>. The dataset has nine columns: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age and Outcome. The first eight are features and the last one (Outcome) is the label. Outcome takes two values: 0 (Non-Diabetic) and 1 (Diabetic).

In [None]:
df = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')
df.head()

## Exploratory data analysis

In [None]:
# let's check the general info about the dataset.
df.info()

We can see that all columns are numeric and that there are no null values.

In [None]:
df.isnull().sum()

In [None]:
df.describe()

Let's check the correlation between the attributes.

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df.corr(),annot=True)
plt.show()

In [None]:
sns.pairplot(df,hue='Outcome')

## Number of labels: cardinality
The number of distinct labels within an attribute is known as its `cardinality`. A large number of labels within a variable is known as `high cardinality.` High cardinality can cause problems for a machine learning model, for example by producing many sparse columns when a categorical variable is one-hot encoded.

In [None]:
columns = df.columns
for col in columns:
    print(col, 'contains', len(df[col].unique()), 'labels.')

`Note`: here all the data is numeric, so this check is not very meaningful. For nominal (categorical) data, however, we would have to check the cardinality.

## Declare feature vector and target variable

In [None]:
x = df.drop(['Outcome'],axis=1)
y = df['Outcome']

In [None]:
x.head()

In [None]:
y.head()

## Split data into separate training and test set

<img src='https://github.com/nijatullahmansoor/IBM-Data-Science-Professional-Certificate/raw/master/Machine%20learning%20With%20python/MSC/kaggle.PNG'>

We will be using the second method here, the train/test split.

In [None]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3,random_state=0)
x_train.shape,x_test.shape
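As a rough illustration of what a 70/30 split does, the sketch below shuffles row indices with a fixed seed and cuts them into two groups. The helper `split_indices` is hypothetical and will not reproduce the exact partition chosen by `train_test_split`; it only shows the idea. Rounding the test share up gives the 537/231 sizes seen in this notebook.

```python
import math
import random

def split_indices(n_rows, test_size=0.3, seed=0):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # seeded, so the split is repeatable
    n_test = math.ceil(n_rows * test_size)    # round the test share up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(537 + 231)
print(len(train_idx), len(test_idx))  # 537 231
```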

## Feature Scaling 

`Feature scaling is a method used to normalize the range of independent variables or features of data.` In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.
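The idea behind robust scaling can be sketched in a few lines: each feature is centred on its median and divided by its interquartile range (IQR), and those statistics are learned from the training data and then reused for the test data. The helper names and the glucose values below are made up for illustration, and this is not scikit-learn's implementation.

```python
import statistics

def fit_robust(values):
    """Learn the median and interquartile range (IQR) of one feature."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return statistics.median(values), q3 - q1

def transform_robust(values, median, iqr):
    """Centre on the median and scale by the IQR."""
    return [(v - median) / iqr for v in values]

train_glucose = [85, 90, 100, 110, 120, 140, 180]   # made-up training values
test_glucose = [95, 150]                            # made-up test values

median, iqr = fit_robust(train_glucose)             # statistics come from train only
scaled_train = transform_robust(train_glucose, median, iqr)
scaled_test = transform_robust(test_glucose, median, iqr)
print(median, iqr)
print([round(v, 3) for v in scaled_test])
```

Because the median and IQR are barely affected by extreme values, a single outlier such as 180 does not distort the scale of the other points.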

In [None]:
col = x_train.columns

In [None]:
col

In [None]:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
# fit the scaler on the training data only, then reuse its statistics for the test data
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

In [None]:
# pass the column Index directly; wrapping it in a list would create MultiIndex columns
x_train = pd.DataFrame(x_train, columns=col)
x_test = pd.DataFrame(x_test, columns=col)

In [None]:
x_train.head()

In [None]:
x_test.head()

In [None]:
len(x_train)

We will use these `537` records to train our model.

In [None]:
len(x_test)

We will use `231` records to test our model.

#### We now have the x_train dataset ready to be fed into the Gaussian Naive Bayes classifier.

In [None]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()

gnb.fit(x_train,y_train)
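To show what a Gaussian Naive Bayes model learns, here is a minimal from-scratch sketch. It is not scikit-learn's implementation: the function names, the toy `[glucose, BMI]` rows and the small constant added to each variance are assumptions made for illustration. Fitting stores a prior plus a per-feature mean and variance for every class; prediction picks the class with the highest log posterior under the independence assumption.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate the prior and per-feature mean/variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # small constant keeps every variance strictly positive (sketch assumption)
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(rows), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one sample."""
    def log_posterior(prior, means, variances):
        # features are treated as independent, so their log densities add up
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
            for x, mean, var in zip(row, means, variances))
        return math.log(prior) + log_likelihood
    return max(model, key=lambda label: log_posterior(*model[label]))

# made-up toy samples: [glucose, BMI]
rows = [[85, 22.0], [90, 24.5], [95, 23.0], [150, 33.0], [160, 35.5], [170, 31.0]]
labels = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(rows, labels)
print(predict_gaussian_nb(model, [88, 23.5]))
print(predict_gaussian_nb(model, [165, 34.0]))
```

Working in log space avoids numerical underflow when the densities of many features are multiplied together.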

## Predict the results

In [None]:
y_pred = gnb.predict(x_test)

In [None]:
y_pred

# Evaluating Model

## Check accuracy score
Accuracy is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions our model got right. Formally, accuracy has the following definition: Accuracy = Number of correct predictions / Total number of predictions.
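The definition can be written out directly. This is a minimal sketch with made-up labels; the helper `accuracy` is illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true_toy = [1, 0, 0, 1, 0]   # made-up true labels
y_pred_toy = [1, 0, 1, 1, 0]   # made-up predictions, one mistake
print(accuracy(y_true_toy, y_pred_toy))  # 4 of 5 correct: 0.8
```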

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
print('Model accuracy score: {0:0.4f}'. format(accuracy_score(y_test, y_pred)))

Model accuracy is `76%`.

Here, y_test holds the true class labels and y_pred holds the predicted class labels for the test set.

## Compare the train-set and test-set accuracy

Now, I will compare the train-set and test-set accuracy to check for `overfitting` and `underfitting`.

In [None]:
y_pred_train = gnb.predict(x_train)
y_pred_train

In [None]:
print("Training-set accuracy score: {0:0.4f}".format(accuracy_score(y_train,y_pred_train)))

In [None]:
# print the scores on training and test set

print('Training set score: {:.4f}'.format(gnb.score(x_train, y_train)))

print('Test set score: {:.4f}'.format(gnb.score(x_test, y_test)))

The training-set accuracy score is `0.7672` while the test-set accuracy is `0.7619`. These two values are quite comparable, so there is no sign of overfitting.

### Compare model accuracy with null accuracy
So, the model accuracy is `0.7229`. But we cannot say that our model is very good based on this accuracy alone. We must compare it with the null accuracy, which is the accuracy that could be achieved by always predicting the most frequent class.

So, we should first check the class distribution in the test set.

In [None]:
y_train.value_counts()

In [None]:
y_test.value_counts()

We can see that the most frequent class occurs 157 times in the test set. So, we can calculate the null accuracy by dividing 157 by the total number of occurrences.

In [None]:
# check null accuracy score

null_accuracy = (157/(157+74))

print('Null accuracy score: {0:0.4f}'. format(null_accuracy))
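The same value can be computed from the labels themselves rather than from hard-coded counts. The sketch below rebuilds a label list with the class counts used above (157 and 74); the helper `null_accuracy_from_labels` is illustrative.

```python
from collections import Counter

def null_accuracy_from_labels(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

toy_labels = [0] * 157 + [1] * 74   # same class counts as the test set above
print(f"{null_accuracy_from_labels(toy_labels):.4f}")  # 0.6797
```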

We can see that our model accuracy score is 0.7229 but the null accuracy score is 0.6797. So, we can conclude that our Gaussian Naive Bayes classification model does better than always predicting the most frequent class.  
However, accuracy does not give the underlying distribution of values, and it does not tell us anything about the types of errors our classifier is making.

We have another tool called `Confusion matrix` that comes to our rescue.

# Confusion matrix 

A confusion matrix is a tool for summarizing the performance of a classification algorithm. A confusion matrix will give us a clear picture of classification model performance and the types of errors produced by the model. It gives us a summary of correct and incorrect predictions broken down by each category. The summary is represented in a tabular form.


Four types of outcomes are possible while evaluating a classification model's performance. These four outcomes are described below:

- `True Positives (TP)` – True Positives occur when we predict an observation belongs to a certain class and the observation actually belongs to that class.

- `True Negatives (TN)` – True Negatives occur when we predict an observation does not belong to a certain class and the observation actually does not belong to that class.

- `False Positives (FP)` – False Positives occur when we predict an observation belongs to a certain class but the observation actually does not belong to that class. This type of error is called `Type I error.`

- `False Negatives (FN)` – False Negatives occur when we predict an observation does not belong to a certain class but the observation actually belongs to that class. This is a very serious error and it is called `Type II error`.
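The four outcomes can be counted by hand for a binary problem. This is a minimal sketch with made-up labels in which class 1 is treated as the positive class; the helper `confusion_counts` is illustrative.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary classification problem."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            # predicted positive: correct if the sample really is positive
            counts["TP" if actual == positive else "FP"] += 1
        else:
            # predicted negative: a miss if the sample was actually positive
            counts["FN" if actual == positive else "TN"] += 1
    return counts

actual_toy = [1, 1, 1, 0, 0, 0, 0, 1]      # made-up true labels
predicted_toy = [1, 0, 1, 0, 1, 0, 0, 1]   # made-up predictions
counts = confusion_counts(actual_toy, predicted_toy)
print(counts)
```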

These four outcomes are summarized in a confusion matrix given below.


In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
cm = confusion_matrix(y_test,y_pred)

In [None]:
print('Confusion matrix\n\n', cm)

In [None]:
# scikit-learn puts actual classes in rows and predicted classes in columns, ordered 0 then 1
print('\nTrue Positives(TP) = ', cm[1,1])

print('\nTrue Negatives(TN) = ', cm[0,0])

print('\nFalse Positives(FP) = ', cm[0,1])

print('\nFalse Negatives(FN) = ', cm[1,0])

The confusion matrix shows `43 + 124 = 167 correct predictions` and `33 + 31 = 64 incorrect predictions`.

In this case, taking Diabetic (1) as the positive class, we have

- True Positives (Actual Positive:1 and Predict Positive:1) - 43
- True Negatives (Actual Negative:0 and Predict Negative:0) - 124
- False Positives (Actual Negative:0 but Predict Positive:1) - 33 (Type I error)
- False Negatives (Actual Positive:1 but Predict Negative:0) - 31 (Type II error)

These counts correspond to True Positive/TP (person has diabetes and was predicted diabetic), True Negative/TN (person did not have diabetes and was predicted non-diabetic), False Positive/FP (person did not have diabetes but was predicted diabetic) and False Negative/FN (person had diabetes but was predicted non-diabetic).

### Visualize confusion matrix with seaborn heatmap

In [None]:
plt.figure(figsize=(10,8))
# rows of cm are actual classes and columns are predicted classes, ordered 0 then 1
cm_matrix = pd.DataFrame(data=cm, columns=['Predict Negative:0', 'Predict Positive:1'], 
                                 index=['Actual Negative:0', 'Actual Positive:1'])

sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')
plt.show()

## Let's generate a classification report to measure the quality of predictions from the Naive Bayes model.

In [None]:
test_pred = gnb.predict(x_test)

In [None]:
from sklearn import metrics

In [None]:
print(metrics.classification_report(y_test,test_pred))

`Precision` - What percent of the positive predictions were correct? Accuracy of positive predictions.

`Precision = TP/(TP+FP)`

`Recall` - What percent of the positive cases did we catch? Fraction of positives that were correctly identified.

`Recall = TP/(TP+FN)` 

`F1 score` - The harmonic mean of precision and recall, which is high only when both are high.

`F1 score = 2*(Recall*Precision)/(Recall+Precision)`
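The three formulas can be checked with a short sketch. The counts passed in below are illustrative and are not this notebook's results; the helper `precision_recall_f1` is hypothetical and does not guard against division by zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of positive predictions that were right
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# illustrative counts: 40 true positives, 10 false positives, 20 false negatives
precision, recall, f1 = precision_recall_f1(tp=40, fp=10, fn=20)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values, so a model cannot score well on it by maximizing only precision or only recall.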



## Results and conclusion

- I applied the Naive Bayes classification algorithm to predict whether or not the patients in the dataset have diabetes. To evaluate the model, we used accuracy, the confusion matrix, and the classification report generated using sklearn.

1. The model performs reasonably well: its accuracy on the training set was found to be `0.7672`.


2. The training-set accuracy score is `0.7672` while the test-set accuracy is `0.7619`. These two values are quite comparable, so there is no sign of overfitting.

3. I compared the model accuracy score of `0.7229` with the null accuracy score of `0.6797`. So, we can conclude that our Gaussian Naive Bayes classifier does better than always predicting the most frequent class.