# Pima Indians Diabetes Dataset - Sanket Mayekar

**Context:**

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo("pN4HqWRybwk")

**Data Description:**

The dataset consists of several medical predictor variables and one target variable, 'class' (Outcome).

Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.


**About the data**

Numeric : Preg             : Number of times pregnant

Numeric : Plas             : Plasma glucose concentration at 2 hours in an oral glucose tolerance test

Numeric : Pres             : Diastolic blood pressure (mm Hg)

Numeric : skin             : Triceps skin fold thickness (mm)

Numeric : test             : 2-Hour serum insulin (mu U/ml)

Numeric : mass             : Body mass index (weight in kg/(height in m)^2)

Numeric : pedi             : Diabetes pedigree function

Numeric : age              : Age (years)

Numeric : class            : Class variable (0 or 1)

**Problem Statement:**

Can you build a machine learning model to accurately predict whether or not the patients in the dataset have diabetes?

**Note: Use Naive Bayes Classifier**

------------------------------------------------------------------------

# Step 1: Read dataset

**a. Import packages**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(color_codes = True)


#ignore warning messages 
import warnings
warnings.filterwarnings('ignore')

**b. Import Data**

In [None]:
pima = pd.read_csv('../input/pimaindiansdiabetescsv/pima-indians-diabetes.csv')

**c. Display DataFrame**

In [None]:
pima

**d. Show first 5 observations**

In [None]:
pima.head()

**e. Show last 5 observations**

In [None]:
pima.tail()

**f. Get shape of the DataFrame**

In [None]:
pima.shape

In [None]:
pima.dtypes

The dataset contains 768 observations and 9 variables.

In [None]:
pima.info()

The output above also shows 768 entries and 9 columns.

There are no null values in any of the variables.

However, some columns contain values that do not make sense: they are recorded as zero (0).

For example, a zero in the 'Pres' (BloodPressure) variable is not plausible for a living patient, so it most likely marks a missing measurement.

There are also zero values in the 'Plas' (Glucose), 'skin' (SkinThickness), 'test' (Insulin), and 'mass' (BMI) variables.

We can also check for null values using the following:

In [None]:
pima.isnull().values.any()

In [None]:
pima.describe().T

This tells us the following. Take the 'Preg' (Pregnancies) variable as an example:
1. count: Total number of entries or rows = 768 (no missing values)
2. mean: Sum of all values of Pregnancies divided by 768 = 3.845052
3. std: Standard deviation, a measure of how far the values typically lie from their mean = 3.369578
4. min: Minimum number of pregnancies = 0
5. 25%: First Quartile (Q1) = 1
6. 50%: Second Quartile (Q2, the median) = 3
7. 75%: Third Quartile (Q3) = 6
8. max: Maximum number of pregnancies = 17

The five-number summary in this list (min, Q1, median, Q3, max) can be better explained using a boxplot.
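
As a rough illustration of how a five-number summary feeds a boxplot, the sketch below uses a small made-up sample (not the real 'Preg' column) and NumPy's default percentile interpolation. The 1.5 × IQR rule for flagging outliers is the usual boxplot convention and is assumed here.

```python
import numpy as np

# Hypothetical toy sample, chosen only for illustration (not the real data)
sample = np.array([0, 1, 3, 6, 17])

minimum, q1, median, q3, maximum = np.percentile(sample, [0, 25, 50, 75, 100])
iqr = q3 - q1  # interquartile range: the height of the box

# Usual boxplot convention: points beyond 1.5 * IQR from the box are outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]

print("five-number summary:", minimum, q1, median, q3, maximum)
print("IQR:", iqr, "fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

In this toy sample only the value 17 lies beyond the upper fence, so a boxplot would draw it as a separate point.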


-------------------------------------------------------------------------

We have to predict whether or not the patients in the dataset have diabetes.

So let us check how many people have diabetes and how many do not.


In [None]:
pima['class'].value_counts()

In [None]:
plt.figure(figsize=(10,7))
sns.set(font_scale = 1.5)
sns.countplot(x = 'class', data=pima, palette="Set2")
plt.ylabel('Number of People')

We can see from above plot that:

People who do not have diabetes: 500

People who have diabetes : 268


In [None]:
plt.figure(figsize=(8,8))
pieC = pima['class'].value_counts()
explode = (0.05, 0)
colors = ['moccasin', 'coral']
labels = ['0 - Non Diabetic', '1 - Diabetic']
sns.set(font_scale = 1.5)
plt.pie(pieC, labels = labels, autopct = "%.2f%%", explode = explode, colors = colors)
plt.legend(labels, loc = 'lower left')

We can see from above pie plot that:

65.10% out of 768 Pima Indian women do not have diabetes

34.90% out of 768 Pima Indian women have diabetes


------------------------------------------------------------

# Step 2: Exploratory Data Analysis (EDA)

**a. MISSING VALUES**

There are variables that have a minimum value of zero (0). 

For some variables, a value of zero does not make sense and thus indicates a missing value.

The following variables have invalid zero values:

1. 'Pres' (BloodPressure)

2. 'Plas' (Glucose)

3. 'Skin' (SkinThickness)

4. 'Test' (Insulin)

5. 'Mass' (BMI)


**Count of missing values(zeros) in above mentioned 5 variables**

In [None]:
pima[pima['Plas'] == 0]

In [None]:
missingPlas = pima[pima['Plas'] == 0].shape[0]
print ("Number of zeros in variable Plas (Glucose): ", missingPlas)

In [None]:
missingPres = pima[pima['Pres'] == 0].shape[0]
print ("Number of zeros in variable Pres (BloodPressure): ", missingPres)

In [None]:
missingSkin = pima[pima['skin'] == 0].shape[0]
print ("Number of zeros in variable Skin (SkinThickness): ", missingSkin)

In [None]:
missingTest = pima[pima['test'] == 0].shape[0]
print ("Number of zeros in variable Test (Insulin): ", missingTest)

In [None]:
missingMass = pima[pima['mass'] == 0].shape[0]
print ("Number of zeros in variable Mass (BMI): ", missingMass)

**Another method to calculate total number of zeros in a column**

Replace zero (0) values with NaN and then sum the NaN values in each column to get the count of missing values per column.

In [None]:
pima_copy = pima.copy(deep = True)

In [None]:
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
zero_cols = ['Plas','Pres','skin','test','mass']
pima_copy[zero_cols] = pima_copy[zero_cols].replace(0, np.nan)
print(pima_copy.isnull().sum())
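
The zeros can also be counted for several columns in one step, without a temporary NaN copy. The sketch below is a minimal illustration on a made-up frame; the column names mirror the notebook, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the notebook, values are made up
toy = pd.DataFrame({
    "Plas": [148, 0, 183, 89],
    "Pres": [72, 66, 0, 0],
    "mass": [33.6, 26.6, 0.0, 28.1],
})

# Comparing to 0 gives a boolean frame; summing counts the True values per column
zero_counts = (toy == 0).sum()
print(zero_counts)
```

Applied to the real data, the same expression restricted to the five affected columns should reproduce the per-column counts printed above.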

------------------------------------------------------------

**b. Visualization for understanding and analysing the distribution of data for different variables**

In [None]:
pima.hist(figsize = (20,16),grid=True)

In [None]:
pima.plot(kind= 'box' , subplots=True, layout=(3,3), sharex=False, sharey=False, figsize=(15,15))
sns.set(font_scale = 1.5)

In [None]:
fig, ax = plt.subplots(4,2, figsize=(16,16))
sns.set(font_scale = 1)
sns.distplot(pima.Plas, ax = ax[0,0], color = 'orange')
sns.distplot(pima.Preg, ax = ax[0,1], color = 'red')
sns.distplot(pima.Pres, ax = ax[1,0], color = 'seagreen')
sns.distplot(pima.age, ax = ax[1,1], color = 'purple')
sns.distplot(pima.mass, ax = ax[2,0], color = 'deeppink')
sns.distplot(pima.pedi, ax = ax[2,1], color = 'brown')
sns.distplot(pima.skin, ax = ax[3,0], color = 'royalblue')
sns.distplot(pima.test, ax = ax[3,1], color = 'coral')

The variables 'age' (Age), 'pedi' (DiabetesPedigreeFunction), 'Preg' (Pregnancies), 'skin' (SkinThickness), and 'test' (Insulin) are right-skewed, that is:

Mean > Median
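
To see why a right-skewed variable has a mean above its median, here is a small sketch on a made-up sample (the values are hypothetical and only illustrate the effect of a long right tail).

```python
import statistics

# Hypothetical right-skewed sample: one large value stretches the right tail
values = [1, 2, 2, 3, 3, 4, 20]

mean = statistics.mean(values)      # pulled upward by the large value
median = statistics.median(values)  # middle value, unaffected by the tail

print("mean:", mean, "median:", median)
print("mean > median:", mean > median)
```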


------------------------------------------------------------

**c. DATA CLEANING**

Let us check how data is distributed for variables that have an invalid zero value

See if there are any outliers

See if data is normally distributed, left skewed or right skewed

We will use boxplot and distplot(Histogram)


In [None]:
plt.figure(figsize=(15,6))
sns.set(font_scale = 1.5)
sns.distplot(pima['Plas'], kde = True, rug = True, color = 'orange')

In [None]:
plt.figure(figsize=(15,6))
sns.set(font_scale = 1.5)
sns.boxplot(x = pima.Plas, color = 'orange')

Looks like variable 'Plas' [Glucose] has one outlier

In [None]:
plt.figure(figsize=(15,6))
sns.set(font_scale = 1.5)
sns.distplot(pima['Pres'], kde = True, rug = True, color = 'seagreen')

In [None]:
plt.figure(figsize=(15,6))
sns.set(font_scale = 1.5)
sns.boxplot(x = pima.Pres, color = 'seagreen')

Variable 'Pres' [BloodPressure] has very few outliers

In [None]:
plt.figure(figsize=(15,6))
sns.set(font_scale = 1.5)
sns.distplot(pima['skin'], kde = True, rug = True, color = 'royalblue')

In [None]:
plt.figure(figsize=(15,6))
sns.set(font_scale = 1.5)
sns.boxplot(x = pima.skin, color = 'royalblue')

'skin' [SkinThickness] has 227 invalid zero values, which is why the lower whisker and Q1 (the 25th percentile) coincide

In [None]:
plt.figure(figsize=(15,6))
sns.set(font_scale = 1.5)
sns.distplot(pima['test'], kde = True, rug = True, color = 'coral')

In [None]:
plt.figure(figsize=(15,6))
sns.set(font_scale = 1.5)
sns.boxplot(x = pima.test, color = 'coral')

'test' [Insulin] has 374 invalid zero values, which is why the lower whisker and Q1 (the 25th percentile) coincide

There are also many outliers

In [None]:
plt.figure(figsize=(15,6))
sns.set(font_scale = 1.5)
sns.distplot(pima['mass'], kde = True, rug = True, color = 'deeppink')

In [None]:
plt.figure(figsize=(15,6))
sns.set(font_scale = 1.5)
sns.boxplot(x = pima.mass, color = 'deeppink')

--------------------------------------------------------

Replace the NaN values (earlier we replaced 0s with NaN in the pima_copy DataFrame) with the median or the mean

The variables 'Plas' [Glucose] and 'Pres' [BloodPressure] have few outliers and few values to fill, so we will use the mean here

The variables 'skin' [SkinThickness], 'test' [Insulin], and 'mass' [BMI] are more dispersed and, for 'skin' and 'test', have many more values to fill, so we will use the median here because it is less sensitive to outliers


In [None]:
# Assign the result back instead of using inplace=True on a single column
pima_copy['Plas'] = pima_copy['Plas'].fillna(pima_copy['Plas'].mean())

In [None]:
pima_copy['Pres'] = pima_copy['Pres'].fillna(pima_copy['Pres'].mean())

In [None]:
pima_copy['skin'] = pima_copy['skin'].fillna(pima_copy['skin'].median())

In [None]:
pima_copy['test'] = pima_copy['test'].fillna(pima_copy['test'].median())

In [None]:
pima_copy['mass'] = pima_copy['mass'].fillna(pima_copy['mass'].median())

In [None]:
print(pima_copy.isnull().sum())

Thus we have replaced all NaN values with the mean or the median.

Data is cleaned now!
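
The same mean/median imputation can be written as a single `fillna` call with a dictionary of per-column fill values. The sketch below is only an illustration on a made-up frame (the values and the two-column layout are hypothetical), not a replacement for the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with NaN marking missing measurements
toy = pd.DataFrame({
    "Plas": [100.0, np.nan, 120.0, 140.0],
    "test": [50.0, np.nan, 70.0, 600.0],
})

# Mean for the well-behaved column, median for the one with an extreme value
fill_values = {
    "Plas": toy["Plas"].mean(),    # NaN is skipped by default
    "test": toy["test"].median(),  # the median is not dragged up by the 600
}
filled = toy.fillna(fill_values)

print(fill_values)
print(filled)
```

Note how the median of 'test' stays at a typical value even though one observation is very large, which is the reason for preferring it on dispersed columns.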

In [None]:
pima_copy.describe().T

---------------------------------------------------------

**d. VISUALIZATION**

Let us perform visualization on **clean data (pima_copy)**

In [None]:
pima_copy.plot(kind= 'box' , subplots=True, layout=(3,3), sharex=False, sharey=False, figsize=(15,15))
sns.set(font_scale = 1.5)

In [None]:
plt.figure(figsize=(15,6))
sns.set(font_scale = 1.5)
sns.countplot(x = 'Preg', data = pima_copy)
plt.ylabel('Number of People')

The plot above shows the number of Pima Indian women for each number of pregnancies

Let us calculate the average number of children per Pima Indian woman

In [None]:
print("Average number of children had by Pima woman: ", pima_copy['Preg'].mean())

In [None]:
pima_copy['Preg'].median()

In [None]:
preg = pima_copy[pima_copy['Preg'] >= 1].shape[0]
print('Number of Pima Woman who had children: ', preg)

In [None]:
notPreg = pima_copy[pima_copy['Preg'] == 0].shape[0]
print('Number of Pima woman who did not have children: ', notPreg)

In [None]:
pregPlusDiabetes = pima_copy[(pima_copy['Preg'] >= 1) & (pima_copy['class'] == 1)].shape[0]
print('Number of woman who have children and are diabetic: ',pregPlusDiabetes)

In [None]:
pregPlusNotDiabetes = pima_copy[(pima_copy['Preg'] >= 1) & (pima_copy['class'] == 0)].shape[0]
print('Number of woman who have children and are not diabetic: ',pregPlusNotDiabetes)

In [None]:
notPregPlusDiabetes = pima_copy[(pima_copy['Preg'] == 0) & (pima_copy['class'] == 1)].shape[0]
print('Number of woman who do not have children and are diabetic: ',notPregPlusDiabetes)

In [None]:
notPregPlusNotDiabetes = pima_copy[(pima_copy['Preg'] == 0) & (pima_copy['class'] == 0)].shape[0]
print('Number of woman who do not have children and are not diabetic: ',notPregPlusNotDiabetes)

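From the counts above, it appears that Pima women who have had children are more likely to be diabetic, although a fair comparison needs the diabetic rate within each group rather than the raw counts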
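
Because the two groups differ in size, comparing the share of diabetic women within each group is more informative than comparing raw counts. The sketch below shows one way to do that with `groupby`; the toy values are hypothetical, and only the column names 'Preg' and 'class' are taken from the notebook.

```python
import pandas as pd

# Hypothetical toy data; 'Preg' and 'class' mirror the notebook's column names
toy = pd.DataFrame({
    "Preg":  [0, 0, 1, 2, 3, 5],
    "class": [0, 0, 0, 1, 1, 0],
})

# Label each row by whether the woman has had at least one pregnancy
group = (toy["Preg"] >= 1).map({True: "had pregnancy", False: "no pregnancy"})

# The mean of a 0/1 column is the share of 1s, i.e. the diabetic rate per group
rates = toy.groupby(group)["class"].mean()
print(rates)
```

On the real data, the same two lines applied to `pima_copy` would give the diabetic rate for each group.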

-------------------------------------------------

**e. BIVARIATE ANALYSIS**

In [None]:
corr = pima_copy.corr()
corr

In [None]:
plt.figure(figsize=(15,10))
sns.set(font_scale = 1.5)
sns.heatmap(corr, annot = True, cmap = 'plasma', vmin = -1, vmax = 1, linecolor='white', linewidths= 1)

**From above heatmap, we can conclude following:**

1. No feature variable has a strong correlation with the target 'class' (Outcome): none reaches +0.70, the level that would indicate a strong uphill (positive) linear relationship

2. The best predictor of the target variable 'class' (Outcome) is 'Plas' (Glucose) at 0.49, which is close to 0.50 and indicates a moderate uphill (positive) relationship

3. The second-best predictor of the target variable 'class' (Outcome) is 'mass' (BMI) at 0.31, which indicates a weak uphill (positive) linear relationship

4. The correlation between 'mass' (BMI) and 'skin' (SkinThickness) is 0.54, which indicates that BMI tends to increase with skin thickness

5. The correlation between 'age' (Age) and 'Preg' (Pregnancies) is 0.54, which indicates that older women tend to have had more pregnancies

6. No pair of variables is strongly correlated, so we cannot eliminate any variable just by looking at the correlation matrix

In [None]:
print('Average Glucose for Pima woman who has diabetes: ', pima_copy[pima_copy['class'] == 1]['Plas'].mean())

In [None]:
print('Average Glucose for Pima woman who does not have diabetes: ', pima_copy[pima_copy['class'] == 0]['Plas'].mean())

In [None]:
plt.figure(figsize=(15,7))
sns.boxplot(x = pima_copy['class'], y = pima_copy['Plas'], palette="Set2")
sns.set(font_scale = 1.5)

Pima women who have diabetes tend to have higher glucose levels,

whereas

Pima women who do not have diabetes tend to have lower glucose levels

In [None]:
print('Average BMI for Pima woman who has diabetes: ', pima_copy[pima_copy['class'] == 1]['mass'].mean())

In [None]:
print('Average BMI for Pima woman who does not have diabetes: ', pima_copy[pima_copy['class'] == 0]['mass'].mean())

I looked up the standard BMI scale, and here it is:

Underweight: BMI is less than 18.5

Normal weight: BMI is 18.5 to 24.9

Overweight: BMI is 25 to 29.9

Obese: BMI is 30 or more

Thus I can say that, on average, Pima women fall in the obese range whether or not they have diabetes, as both group averages are more than 30

Pima women who have diabetes have a higher BMI than those who do not have diabetes


In [None]:
plt.figure(figsize=(15,6))
sns.set(font_scale = 1.5)
sns.boxplot(x = pima_copy['class'], y = pima_copy['age'], palette = "Set3")

In [None]:
oneOutcome = pima_copy[pima_copy['class'] == 1]
print("Minimum age of Pima woman who has Diabetes: ",oneOutcome['age'].min())

In [None]:
print("Maximum age of Pima woman who has Diabetes: ",oneOutcome['age'].max())

In [None]:
zeroOutcome = pima_copy[pima_copy['class'] == 0]
print("Minimum age of Pima woman who does not have Diabetes: ",zeroOutcome['age'].min())

In [None]:
zeroOutcome = pima_copy[pima_copy['class'] == 0]
print("Maximum age of Pima woman who does not have Diabetes: ",zeroOutcome['age'].max())

In [None]:
print('Average Age of Pima woman who has diabetes: ',pima_copy[pima_copy['class'] == 1]['age'].mean())

In [None]:
print('Average Age of Pima woman who does not have diabetes: ',pima_copy[pima_copy['class'] == 0]['age'].mean())

So diabetic women are older on average, which suggests that the risk of diabetes increases with age

In [None]:
plt.figure(figsize=(15,6))
sns.set(font_scale = 1.5)
sns.countplot(x = 'Preg', hue = 'class', data = pima_copy, palette = 'Set2')

There is very little correlation between 'Preg' (Pregnancies) and 'class' (Outcome); that is, they are weakly correlated

Pima women with fewer children appear to have a lower chance of diabetes


In [None]:
print('Average Skin Thickness of Pima woman who has diabetes: ', pima_copy[pima_copy['class'] == 1]['skin'].mean())

In [None]:
print('Average Skin Thickness of Pima woman who does not have diabetes: ', pima_copy[pima_copy['class'] == 0]['skin'].mean())

Since 'SkinThickness' had 227 zero values that we replaced with the median, the data for this variable is not very representative.

Still, the averages suggest that greater skin thickness is associated with a higher probability of diabetes

In [None]:
print('Average Insulin of Pima woman who has diabetes: ', pima_copy[pima_copy['class'] == 1]['test'].mean())

In [None]:
print('Average Insulin of Pima woman who does not have diabetes: ', pima_copy[pima_copy['class'] == 0]['test'].mean())

As with 'SkinThickness', we replaced the invalid zero values of 'Insulin' with the median, so the data is not very representative.

Still, the averages suggest that higher insulin is associated with a higher chance of diabetes.

Insulin also has a moderate correlation with Glucose







In [None]:
sns.set(font_scale = 1.5)
sns.pairplot(data = pima_copy, hue = 'class', diag_kind = 'kde', palette = 'Set2')

**Analysis**

1. The diagonal shows the distribution of each variable as a kernel density plot.

2. The scatter plots show the pairwise relationship between the features. Looking at the scatter plots, no pair of attributes is able to clearly separate the two outcome classes.

-----------------------------------------------

# Step 3: Build a Model using Naive Bayes Classifier

Build a predictive model so that, when a new person walks in with values for these variables, we feed them to the model and it tells us the probability of class 0 or 1

We will use the **"Logistic Regression"** and **"Naive Bayes classifier"** algorithms to predict 0 or 1


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve, auc

In [None]:
from sklearn.model_selection import train_test_split
X = pima_copy.drop('class', axis  = 1)
y = pima_copy['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 17)

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.size)
print(y_test.size)

# 1. Logistic Regression

In [None]:
model = LogisticRegression()

In [None]:
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)

**a. MODEL EVALUATION**

In [None]:
confusion = metrics.confusion_matrix(y_test,y_pred)
confusion

In [None]:
ylabel = ["Actual [Non-Diab]","Actual [Diab]"]
xlabel = ["Pred [Non-Diab]","Pred [Diab]"]
#sns.set(font_scale = 1.5)
plt.figure(figsize=(15,6))
sns.heatmap(confusion, annot=True, xticklabels = xlabel, yticklabels = ylabel, linecolor='white', linewidths=1)

A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known.

So here:

TN = 132

TP = 45

FN = 36

FP = 18

This is how I interpret the confusion matrix: 0 = No Diabetes;  1 = Diabetes

**TP**: Our model predicted 45 women as diabetic and they actually were diabetic (the model was correct here)

**TN**: Our model predicted 132 women as non-diabetic and they actually were non-diabetic (the model was correct here)

**FP**: Our model predicted 18 women as diabetic but they actually were non-diabetic (the model was wrong here: a "Type I error")

**FN**: Our model predicted 36 women as non-diabetic but they actually were diabetic (the model was wrong here: a "Type II error")


**Accuracy**: Overall, how often is the classifier correct?

Accuracy = (TP + TN) / total

= (45 + 132) / (45 + 132 + 18 + 36)

= 177 / 231

≈ 0.7662

In [None]:
print('Accuracy of Logistic Regression is: ', model.score(X_test,y_test) * 100,'%')

In [None]:
print(classification_report(y_test,y_pred))

In [None]:
TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]

**Precision**: When the model predicts yes, how often is it correct?

TP / predicted yes

Precision = TP / (TP + FP)

= 45 / (45 + 18)

= 45 / 63

≈ 0.7143

So when our model predicts 1, it is correct about 71% of the time. Precision should be as high as possible.


In [None]:
# print ('Precision: ', metrics.precision_score(y_test,y_pred))
Precision = TP / ( TP + FP )
print ('Precision: ', Precision)

**Recall**: When the actual value is positive, how often is the prediction correct?
    
TP/actual yes

Recall = TP / (TP + FN)
    
= 45 / (45 + 36)
        
= 45 / 81
    
≈ 0.5556

In other words: when the actual answer is yes, how often does the model predict yes?

Recall is also known as “sensitivity” and “true positive rate” (TPR).

In [None]:
Recall = TP / ( TP + FN )
print ('Recall: ', Recall)

**Analysis**:

**Low recall, high precision**: This indicates that we miss a lot of positive examples (high FN), but most of those we predict as positive are indeed positive (low FP).

In other words, we miss a lot of diabetic cases, but most of those we predict as diabetic are indeed diabetic

In [None]:
pima_copy['class'].value_counts()

There are 500 non-diabetic Pima Indian women and 268 diabetic Pima Indian women, so this dataset is clearly imbalanced.

It is critical to know this for a few reasons:

1. Algorithms like Logistic Regression, fitted with default settings, tend to favor the majority class when the data is imbalanced. In this case, that is the non-diabetic class (500).

2. Accuracy alone is no longer a reliable metric.

For example, suppose we have 100 emails, 99 of which are spam and 1 of which is a good email. An algorithm that puts every email in the spam folder has an accuracy of 99%.
But it is not doing a great job, because the one good email also ends up in the spam folder.
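
The spam example can be checked in a few lines of plain Python. The labels below follow the example in the text (99 spam, 1 good email); the always-spam "classifier" is just a constant prediction.

```python
# 1 = spam, 0 = good email; the counts follow the example in the text
y_true = [1] * 99 + [0]
y_pred = [1] * 100  # a classifier that sends everything to the spam folder

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Share of good emails that stay out of the spam folder (specificity here)
good_total = y_true.count(0)
good_kept = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
specificity = good_kept / good_total

print("accuracy:", accuracy)        # 0.99
print("specificity:", specificity)  # 0.0
```

Accuracy looks excellent while the classifier never recognizes a good email, which is why accuracy alone is misleading on imbalanced data.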

Because of this, we will look at other metrics such as Sensitivity, Specificity and Roc_Auc

We will use **Roc_Auc score as the metric of success**

**F-measure**


It is difficult to compare two models when one has low precision and high recall and the other the reverse. To make them comparable, we use the F-score, which measures recall and precision at the same time. It uses the harmonic mean instead of the arithmetic mean, which penalizes extreme values more.

In other words, the F1 score conveys the balance between precision and recall.

F1 = 2 * (Precision * Recall) / (Precision + Recall)

F1 = 2 * (0.71 * 0.56) / (0.71 + 0.56)

F1 ≈ 0.63

In [None]:
metrics.f1_score(y_test, y_pred)

**Specificity** (also called the **true negative rate**) measures the proportion of actual negatives that are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).

If a person does not have the disease how often will the test be negative (true negative rate)?

In [None]:
Specificity = TN / ( TN + FP )
print ('Specificity: ', Specificity)

**Sensitivity** (also called the **true positive rate**, the **recall**, or **probability of detection** in some fields) measures the proportion of actual positives that are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition).

If a person has a disease, how often will the test be positive (true positive rate)? 

In [None]:
Sensitivity = TP / ( TP + FN )
print ('Sensitivity: ', Sensitivity)
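
The hand calculations for this model can be cross-checked from the four counts reported for the Logistic Regression confusion matrix (TN = 132, TP = 45, FN = 36, FP = 18). This is a plain-Python sketch that does not depend on the fitted model.

```python
# Counts reported above for the Logistic Regression confusion matrix
TP, TN, FP, FN = 45, 132, 18, 36

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)  # same quantity as sensitivity
specificity = TN / (TN + FP)
f1 = 2 * precision * recall / (precision + recall)

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("specificity", specificity), ("f1", f1)]:
    print(f"{name}: {value:.4f}")
```

With unrounded precision and recall the F1 score comes out at 0.625; the hand calculation, which starts from the rounded values 0.71 and 0.56, reports it as 0.63.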

**AUC ROC** tells us how capable the model is of distinguishing between classes.

AUC is the Area Under the Curve. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.

In our case, the higher the AUC, the better the model is at distinguishing between diabetic and non-diabetic Pima Indian women.

The ROC curve plots the **True Positive Rate (Sensitivity)** against the **False Positive Rate (1 - Specificity)**, where TPR is on the y-axis and FPR is on the x-axis

In [None]:
# y_pred holds hard 0/1 labels, so this score equals (sensitivity + specificity) / 2
Roc_Auc = metrics.roc_auc_score(y_test, y_pred)
print ('Roc Auc Score: ', Roc_Auc)

In [None]:
y_pred_prob = model.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
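
The AUC can be read as the share of (diabetic, non-diabetic) pairs in which the diabetic patient receives the higher score. The sketch below implements that pairwise definition in plain Python; `rank_auc` is a hypothetical helper written for this illustration and the labels and probabilities are made up. It also shows that feeding hard 0/1 predictions instead of probabilities throws away the ranking and generally gives a different value.

```python
def rank_auc(y_true, scores):
    """Share of (positive, negative) pairs where the positive scores higher; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities (not from the Pima data)
y_true = [0, 0, 0, 1, 1, 1]
probs = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9]
hard_labels = [int(p >= 0.5) for p in probs]  # 0/1 predictions at a 0.5 cut-off

auc_probs = rank_auc(y_true, probs)         # uses the full ranking
auc_labels = rank_auc(y_true, hard_labels)  # only two distinct scores remain
print(round(auc_probs, 4), round(auc_labels, 4))
```

In scikit-learn terms, passing the predicted probabilities (such as `y_pred_prob`) to `metrics.roc_auc_score` corresponds to the first number, while passing the output of `predict` corresponds to the second.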

In [None]:
plt.figure(figsize=(15,6))
sns.set(font_scale = 1.5)
plt.plot(fpr, tpr)
plt.title('ROC Curve for Logistic Regression Diabetes Classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)
plt.xlim(0.0, 1.0)
plt.ylim(0.0, 1.0)
plt.plot([0, 1], [0, 1], color='red', lw=2, linestyle='--')

**Analysis**

So it has a fairly good **ROC AUC** score. Roughly speaking, there is a **72%** chance that the model will be able to distinguish between the diabetic class and the non-diabetic class.

But it performs poorly on **sensitivity**, and sensitivity (recall) is the metric that we want to improve. *Sensitivity: When the actual value is positive, how often is the prediction correct?*

Or in other words: if you have diabetes, will the model detect it?

The cost of raising sensitivity is that the model will also flag some people who do not have diabetes. But that is better than sending someone home after telling them that they do not have diabetes when they actually do.
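
One common way to trade specificity for sensitivity is to lower the probability cut-off below the usual 0.5. This notebook does not do that; the sketch below only illustrates the trade-off on made-up probabilities, and `sens_spec` is a hypothetical helper.

```python
# Hypothetical predicted probabilities of diabetes for six patients
y_true = [0, 0, 0, 1, 1, 1]
probs = [0.10, 0.30, 0.45, 0.40, 0.70, 0.90]

def sens_spec(threshold):
    """Return (sensitivity, specificity) when predicting 1 for prob >= threshold."""
    preds = [int(p >= threshold) for p in probs]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, preds))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, preds))
    return tp / y_true.count(1), tn / y_true.count(0)

results = {}
for threshold in (0.5, 0.35):
    results[threshold] = sens_spec(threshold)
    sensitivity, specificity = results[threshold]
    print(threshold, round(sensitivity, 3), round(specificity, 3))
```

In this toy example, lowering the cut-off from 0.5 to 0.35 catches every diabetic patient but flags one non-diabetic patient as well.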

-----------------------------------------------

# 2. Naive Bayes Classifier

In [None]:
nbModel = GaussianNB()

In [None]:
nbModel.fit(X_train, y_train)

In [None]:
nb_y_pred = nbModel.predict(X_test)

**a. MODEL EVALUATION**

In [None]:
nbConfusion = metrics.confusion_matrix(y_test, nb_y_pred)
nbConfusion

In [None]:
ylabel = ["Actual [Non-Diab]","Actual [Diab]"]
xlabel = ["Pred [Non-Diab]","Pred [Diab]"]
#sns.set(font_scale = 1.5)
plt.figure(figsize=(15,6))
sns.heatmap(nbConfusion, annot=True, xticklabels = xlabel, yticklabels = ylabel, linecolor='white', linewidths=1)

A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known.

So here:

**TN** = 124

**TP** = 47

**FN** = 34

**FP** = 26

This is how I interpret the confusion matrix: 0 = No Diabetes; 1 = Diabetes

**TP**: Our model predicted 47 women as diabetic and they actually were diabetic (the model was correct here)

**TN**: Our model predicted 124 women as non-diabetic and they actually were non-diabetic (the model was correct here)

**FP**: Our model predicted 26 women as diabetic but they actually were non-diabetic (the model was wrong here: a "Type I error")

**FN**: Our model predicted 34 women as non-diabetic but they actually were diabetic (the model was wrong here: a "Type II error")

**Accuracy**: Overall, how often is the classifier correct?

Accuracy = (TP + TN) / total

= (47 + 124) / (47 + 124 + 26 + 34)

= 171 / 231

≈ 0.7403

In [None]:
print('Accuracy of Naive Bayes Classifier is: ', nbModel.score(X_test,y_test) * 100,'%')

In [None]:
print(classification_report(y_test, nb_y_pred))

In [None]:
nb_TP = nbConfusion[1, 1]
nb_TN = nbConfusion[0, 0]
nb_FP = nbConfusion[0, 1]
nb_FN = nbConfusion[1, 0]

**Precision**: When the model predicts yes, how often is it correct?

TP / predicted yes

Precision = TP / (TP + FP)

= 47 / (47 + 26)

= 47 / 73

≈ 0.6438

So when our model predicts 1, it is correct about 64% of the time. Precision should be as high as possible.


In [None]:
# print ('Precision: ', metrics.precision_score(y_test, nb_y_pred))
nb_Precision = nb_TP / ( nb_TP + nb_FP)
print ('Precision: ', nb_Precision)

**Recall**: When the actual value is positive, how often is the prediction correct?
    
TP/actual yes

Recall = TP / (TP + FN)
    
= 47 / (47 + 34)
        
= 47 / 81
    
≈ 0.5802

In other words: when the actual answer is yes, how often does the model predict yes?

Recall is also known as “sensitivity” and “true positive rate” (TPR).

In [None]:
nb_Recall = nb_TP / ( nb_TP + nb_FN )
print ('Recall: ', nb_Recall)

**Analysis**:

**Low recall, high precision**: This indicates that we miss a lot of positive examples (high FN), but most of those we predict as positive are indeed positive (low FP).

In other words, we miss a lot of diabetic cases, but most of those we predict as diabetic are indeed diabetic

In [None]:
pima_copy['class'].value_counts()

As in the Logistic Regression section, the classes are imbalanced (500 non-diabetic versus 268 diabetic Pima Indian women), so accuracy alone is not a reliable metric here either.

Because of this, we will again look at other metrics such as Sensitivity, Specificity and Roc_Auc

We will use **Roc_Auc score as the metric of success**

**F-measure**


As before, the F1 score balances precision and recall through their harmonic mean, which penalizes extreme values more than the arithmetic mean does.

F1 = 2 * (Precision * Recall) / (Precision + Recall)

F1 = 2 * (0.64 * 0.58) / (0.64 + 0.58)

F1 ≈ 0.61

In [None]:
metrics.f1_score(y_test, nb_y_pred)

**Specificity** (also called the **true negative rate**) measures the proportion of actual negatives that are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).

If a person does not have the disease how often will the test be negative (true negative rate)?

In [None]:
nb_Specificity = nb_TN / ( nb_TN + nb_FP )
print ('Specificity: ', nb_Specificity)

**Sensitivity** (also called the **true positive rate**, the **recall**, or **probability of detection** in some fields) measures the proportion of actual positives that are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition).

If a person has a disease, how often will the test be positive (true positive rate)? 

In [None]:
nb_Sensitivity = nb_TP / ( nb_TP + nb_FN )
print ('Sensitivity: ', nb_Sensitivity)

**AUC ROC** tells us how capable the model is of distinguishing between classes.

AUC is the Area Under the Curve. The higher the AUC, the better the model is at distinguishing between diabetic and non-diabetic Pima Indian women.

The ROC curve plots the **True Positive Rate (Sensitivity)** against the **False Positive Rate (1 - Specificity)**, where TPR is on the y-axis and FPR is on the x-axis

In [None]:
# nb_y_pred holds hard 0/1 labels, so this score equals (sensitivity + specificity) / 2
nb_Roc_Auc = metrics.roc_auc_score(y_test,nb_y_pred)
print ('Roc Auc Score: ', nb_Roc_Auc)

In [None]:
nb_y_pred_prob = nbModel.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y_test, nb_y_pred_prob)

In [None]:
plt.figure(figsize=(15,6))
sns.set(font_scale = 1.5)
plt.plot(fpr, tpr)
plt.title('ROC Curve for Naive Bayes Diabetes Classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)
plt.xlim(0.0, 1.0)
plt.ylim(0.0, 1.0)
plt.plot([0, 1], [0, 1], color='red', lw=2, linestyle='--')

**Analysis**

So it has a fairly good **ROC AUC** score. Roughly speaking, there is a **70%** chance that the model will be able to distinguish between the diabetic class and the non-diabetic class.

But it performs poorly on **sensitivity**, and sensitivity (recall) is the metric that we want to improve. *Sensitivity: When the actual value is positive, how often is the prediction correct?*

Or in other words: if you have diabetes, will the model detect it?

The cost of raising sensitivity is that the model will also flag some people who do not have diabetes. But that is better than sending someone home after telling them that they do not have diabetes when they actually do.
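
To put the two models side by side, the sketch below recomputes sensitivity and specificity from the confusion-matrix counts reported in each section. It also prints their average, which is what a ROC AUC computed from hard 0/1 predictions reduces to.

```python
# Confusion-matrix counts reported above for each model: (TP, TN, FP, FN)
counts = {
    "Logistic Regression": (45, 132, 18, 36),
    "Naive Bayes": (47, 124, 26, 34),
}

summary = {}
for name, (tp, tn, fp, fn) in counts.items():
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # For hard 0/1 predictions, ROC AUC reduces to this average
    summary[name] = (sensitivity, specificity, (sensitivity + specificity) / 2)
    print(f"{name}: sens={sensitivity:.3f} spec={specificity:.3f} "
          f"mean={summary[name][2]:.3f}")
```

On these counts Naive Bayes detects slightly more diabetic patients, while Logistic Regression makes fewer false alarms and has the slightly higher average, consistent with the 72% and 70% scores quoted in the two analyses.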

-------------------------