## Data Pre-processing

In [None]:
# Importing important libraries
import pandas as pd, numpy as np
import seaborn as sns

import os

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Current working directory
os.getcwd()

In [None]:
# List the files available in the input directory
print(os.listdir("../input"))

In [None]:
#Load the data
health =  pd.read_csv("../input/pima-indians-diabetes/health care diabetes.csv")

In [None]:
#Check the nature of the datasets
print(health.shape)

In [None]:
#View the first 10 instances
health.head(10)

In [None]:
#Column wise null values check
health.isna().sum()

In [None]:
#For the entire DataFrame null values check
health.isnull().any().any()

#### We observe that the data contains no null values; however, let's check the value counts to look for anomalies in each variable

In [None]:
#The mean and median of the variable 'Insulin'
print(health['Insulin'].mean(), health['Insulin'].median())

In [None]:
# Describe the data to get summary statistics for the entire DataFrame (missing values are excluded)
health.describe()

In [None]:
health.info()

In [None]:
#Check the value counts for each of the columns
for col in health.columns:
    print('The value counts in '+col+' are:', health[col].value_counts())

We can see that there are 0 values in the columns BloodPressure, SkinThickness, Insulin and BMI. These are anomalies, because 
these measurements are among the first things taken for every patient before the doctor's consultation.
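
Scanning the `value_counts()` output by eye is tedious, so here is a minimal sketch of counting the zeros per column directly. The small `toy` frame and its values are made up for illustration and only stand in for `health`.

```python
import pandas as pd

# Hypothetical toy frame standing in for `health`; the values are made up.
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "Insulin": [0, 0, 0, 94],
})

# Comparing against 0 gives a boolean frame; summing counts the True cells per column.
zero_counts = (toy == 0).sum()
print(zero_counts)
```

Applied to the real data, the same expression would be `(health == 0).sum()`.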

#### Below is a value-count plot for Glucose. We can see about six zeros.

In [None]:
# Bar plot of the value counts for the Glucose column, to see whether it contains any 0 values.

plt.figure(figsize=(18,8))
health['Glucose'].value_counts().plot.bar(title='Frequency Distribution of Glucose')

In [None]:
#The histogram of the Glucose value counts is positively skewed: most of the mass sits on the left,
#with fewer values on the right. Note that this plots the frequencies from value_counts(), not the raw readings.
plt.figure(figsize=(18,8))
health['Glucose'].value_counts().plot.hist(title='Frequency Distribution of Glucose', color='g')

In [None]:
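# Data type conversions
# Note: astype('int64') truncates the fractional part, so BMI and DiabetesPedigreeFunction lose precision here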
health['BMI'] = health['BMI'].astype('int64')
health['DiabetesPedigreeFunction'] = health['DiabetesPedigreeFunction'].astype('int64')

In [None]:
# Show new data types
health.dtypes

In [None]:
health.head()

In [None]:
col_with_0 = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']   

#### Visually explore these columns using histogram

In [None]:
# Exploring the variables using histogram
health.hist(column=col_with_0, rwidth=0.95, figsize=(15,8))

In [None]:
# We set an appropriate range to observe the 0 value counts
health.hist(column=col_with_0, bins=[0,10,15], rwidth=0.95, figsize=(15,8))

In [None]:
# Replace all the zero values with NaN
cols = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
health[cols] = health[cols].replace(0, np.nan)

In [None]:
health

#### Treat the missing values with the pandas fillna method using the median.

In [None]:
health.fillna(health.median())

In [None]:
health = health.fillna(health.median())
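
The two steps above (zeros become NaN, then NaN is filled with the column median) can be sketched on a tiny frame. The `toy` data is made up for illustration; only `replace`, `median` and `fillna` are taken from the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; zeros play the role of "missing" measurements.
toy = pd.DataFrame({
    "Insulin": [0, 94, 168, 0, 88],
    "BMI": [33.6, 0.0, 23.3, 28.1, 43.1],
})
cols = ["Insulin", "BMI"]

# Step 1: mark zeros as missing so they no longer influence the statistics.
toy[cols] = toy[cols].replace(0, np.nan)

# Step 2: the median skips NaN by default, so it is computed from real readings only.
medians = toy.median()
filled = toy.fillna(medians)
print(filled)
```

Computing the median after the replacement matters: with the zeros still in place they would pull the median down.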

In [None]:
health.dtypes

In [None]:
health['Pregnancies']=health['Pregnancies'].astype('int64')

## Exploratory Data Analysis

In [None]:
health.hist(column=cols, rwidth=0.95, figsize=(15,8))

In [None]:
# Plot a histogram to compare the frequency distribution of each column.
# We also define bin ranges to compare and analyse the different distributions.

bins=[40,50,60,70,80,90,100,110,120,130,140,150,160,170,180,190,200]

plt.hist(health['Glucose'], rwidth=0.95, bins=bins, color='g')
plt.xlabel('Glucose range')
plt.ylabel('Total no of patients')
plt.title('Glucose Analysis')

#### The normal glucose level is between 80 and 115. Above 115 a patient is considered diabetic, i.e. the outcome is '1'.

We can observe above that in the 120 glucose range there are more than 100 patients who are considered possibly diabetic. The 
maximum number of patients without diabetes in a single range is almost 120. The patients who are normal outnumber the 
pre-diabetic or diabetic patients.

In [None]:
bin_BP=[10,20,30,40,50,60,70,80,90,100,110,120,130]

plt.hist(health['BloodPressure'], bins=bin_BP, rwidth=0.9, color='orange', orientation='horizontal')
plt.title('Blood Pressure Analysis')
plt.xlabel('No of patients')

plt.ylabel('BloodPressure range')

In [None]:
# Check the new data type
health.dtypes

In [None]:
sns.countplot(x=health.dtypes.map(str))
plt.show()

In [None]:
from matplotlib.pyplot import figure
figure(figsize=(16,8))
sns.countplot(data=health, x='Age', hue='Outcome')

We note in the above age count plot that:
+ Between ages 21 and 30 there are more patients without diabetes than with diabetes. Most of our data falls in this age 
range, and patients in this range are more likely not to have diabetes.
+ Between ages 31 and 54 there are more diabetic than non-diabetic patients, so the likelihood of diabetes rises with age. 
One possible factor is less activity and exercise compared with the 21-30 age group.
+ Older patients, from age 54 up to 81, are sparsely represented, but they are more likely to have diabetes.

In [None]:
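# Set the size on the figure that holds the two axes; a separate figure() call would open a new, empty figure
fig, ax = plt.subplots(1, 2, figsize=(12, 6))
sns.countplot(x=health['Age'], ax=ax[0])
sns.countplot(x=health['DiabetesPedigreeFunction'], ax=ax[1])
plt.show()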

In [None]:
health=health.astype('int64')

In [None]:
health.dtypes

In [None]:
sns.countplot(x=health['Glucose'])

In [None]:
fig, ax = plt.subplots(2, 4, figsize=(20, 10))
for variable, subplot in zip(health, ax.flatten()):
    sns.countplot(x=health[variable], ax=subplot)
    for label in subplot.get_xticklabels():
        label.set_rotation(90)

In [None]:
health['Outcome'].value_counts()

In [None]:
sns.countplot(data=health, x='Outcome', palette='hls')
# Save before plt.show(); once the figure has been shown, savefig would write an empty canvas
plt.savefig('count_plot')
plt.show()

### This is an Imbalanced Dataset
+ Looking at the count plot above we can see that we are dealing with imbalanced data, that is, a classification problem 
in which the classes are not represented equally.
+ The ratio of class-0 (no diabetes) to class-1 (diabetes) instances is 500:268, or roughly 2:1.
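
To make the imbalance concrete, here is a short sketch using only the two class counts quoted above (500 and 268). It also shows the accuracy that a trivial model would reach by always predicting the larger class, which is a useful floor when judging later results.

```python
# Class counts quoted in the text above: 500 vs 268 out of 768 rows.
counts = {"majority": 500, "minority": 268}

total = sum(counts.values())
ratio = counts["majority"] / counts["minority"]

# A model that always predicts the majority class is right this often.
baseline_accuracy = counts["majority"] / total

print(f"ratio = {ratio:.2f}:1")
print(f"majority-class baseline accuracy = {baseline_accuracy:.3f}")
```

Any classifier trained on this data should be compared against that baseline of about 65% rather than against 50%.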

### Techniques we can use to combat the imbalanced training data:
#### 1. Collect More Data
+ Collecting more data is a good way of addressing the imbalance. A larger dataset might expose a
different and perhaps more balanced perspective on the classes.

#### 2. Change the Performance Metric
+ Accuracy is not the metric to use when dealing with an imbalanced dataset, because it can be misleading.
+ There are metrics that have been designed to tell a more truthful story when working with an imbalanced dataset.
+ The following performance measures can give more insight into the quality of the model than the traditional 
classification accuracy:
    
##### Confusion Matrix: 
+ A breakdown of predictions into a table showing correct predictions (the diagonal) and the types of incorrect 
 predictions made (which classes the incorrect predictions were assigned to).
##### Precision: 
+ A measure of a classifier's exactness.
##### Recall:
+ A measure of a classifier's completeness.
##### F1 Score (or F-score): 
+ The harmonic mean of precision and recall.

We can also look at the following:
##### ROC Curves: 
+ Just as precision and recall split up accuracy, the ROC curve splits it into sensitivity and specificity, and models can be
chosen based on how a threshold balances these two values.
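
As a small illustration of why the F1 score is preferred over a plain average, the sketch below computes it as the harmonic mean of precision and recall. The helper name `f1_score_from` and the input values are hypothetical and chosen only for the comparison.

```python
def f1_score_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two hypothetical classifiers with the same arithmetic mean (0.7) of precision and recall.
balanced = f1_score_from(0.7, 0.7)
lopsided = f1_score_from(0.9, 0.5)

print(f"balanced: {balanced:.3f}")
print(f"lopsided: {lopsided:.3f}")
```

The lopsided classifier scores lower even though its arithmetic mean is identical, because the harmonic mean penalises a large gap between precision and recall.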
            
#### 3. Try Different Algorithms  
+ Using one favorite algorithm on every problem is not advisable. Different algorithms give different results and 
accuracy on the same problem. Random Forest and Decision Tree algorithms often perform well.

In [None]:
sns.scatterplot(y=health['Age'], x=health['Outcome']);

#### Feature-Feature Relationships
+ We explore the relationship between the attributes.

In [None]:
# The scatter charts between the pair of variables to understand the relationship.
sns.pairplot(health)

Looking at the pair plot above, we can see a positive correlation between BMI and SkinThickness and no correlation between Glucose and SkinThickness. Age has a strong positive correlation with BloodPressure.

In [None]:
from pandas.plotting import scatter_matrix
scatter_matrix(health, alpha=0.2, figsize=(16, 10), diagonal='kde')

+ Looking at the matrix of scatter plots of all attributes against each other, we can see a possible correlation between 
BloodPressure and BMI, and another possible relationship between Age and Pregnancies.

## Correlation Analysis and Feature Selection

#### Univariate Analysis

In [None]:
#Important libraries
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
X = health.iloc[:,0:8]
y = health.iloc[:,-1]

In [None]:
#Apply the SelectKBest class to extract the top 6 features
bestfeatures = SelectKBest(score_func=chi2, k=6)
fit = bestfeatures.fit(X,y)

In [None]:
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)

In [None]:
#Concat two DataFrames for better visualization
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns = ['col','Score']

The features with the highest scores are the ones most strongly associated with the 'Outcome' column in our dataset.
Below, the higher the score, the more important the feature is. We can see that Insulin and Glucose have the
highest scores, which suggests that as Glucose and Insulin increase, the number of patients with a diabetic Outcome also increases.

In [None]:
featureScores

In [None]:
# Print the 6 best features
print(featureScores.nlargest(6,'Score'))

Above we can observe the top six features that are correlated to the Outcome variable.
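
To show where such scores come from, here is a hand-rolled sketch of the chi-squared statistic for non-negative features: per class, the observed feature totals are compared with the totals expected if the feature were independent of the class. This is my understanding of what `chi2` computes and is written with NumPy only; the four-row dataset is made up.

```python
import numpy as np

# Hypothetical toy data: feature 0 differs between classes, feature 1 does not.
X = np.array([[1, 2], [1, 3], [4, 2], [4, 3]], dtype=float)
y = np.array([0, 0, 1, 1])

classes = np.unique(y)

# Observed: sum of each feature within each class (rows = classes, columns = features).
observed = np.array([X[y == c].sum(axis=0) for c in classes])

# Expected under independence: class frequency times the overall feature total.
class_prob = np.array([(y == c).mean() for c in classes])
expected = np.outer(class_prob, X.sum(axis=0))

chi2_scores = ((observed - expected) ** 2 / expected).sum(axis=0)
print(chi2_scores)
```

The first feature gets a clearly positive score and the second gets zero, which is the ranking `SelectKBest` would use. Note that the score measures association only; it does not say in which direction the feature moves with the outcome.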

#### Correlation Matrix with Heatmap

In [None]:
#The correlation of each feature in the dataset
corrmat = health.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(16,8))
#We plot the heatmap
sns.heatmap(health[top_corr_features].corr(), annot=True, cmap='RdYlGn')

+ Here we can infer that "Outcome" has a strong positive correlation with "Glucose", whereas it has almost no correlation 
  with "DiabetesPedigreeFunction".
+ "SkinThickness" and "Insulin" have almost no correlation with "Pregnancies".
+ There is a strong correlation between some of the independent features: "Insulin and Glucose", "BMI and SkinThickness", "Age and
  Pregnancies".
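
Each cell of the heatmap is a Pearson correlation coefficient, which is the default of `DataFrame.corr()`. A minimal sketch of that calculation on two made-up arrays, checked against NumPy's own `corrcoef`:

```python
import numpy as np

# Hypothetical paired measurements, made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

def pearson(a: np.ndarray, b: np.ndarray) -> float:
    """Covariance of a and b divided by the product of their spreads."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return float((a_centered * b_centered).sum()
                 / np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum()))

r = pearson(x, y)
print(round(r, 4), round(np.corrcoef(x, y)[0, 1], 4))
```

Values near +1 or -1 indicate a strong linear relationship and values near 0 indicate none, which is how the colours in the heatmap should be read.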

#### The correlation matrix with heatmap also gives a clear picture of the correlation between independent features, which is very useful.

In [None]:
# We can drop these two features since they are not correlated with the target variable (Outcome)
health.drop(['BloodPressure','DiabetesPedigreeFunction'], axis=1, inplace=True)

In [None]:
health.head()

## Data Modeling

#### To model this dataset I will use Logistic Regression

#### Logistic Regression

In statistics, the logistic model is used to model the probability of a certain class or event existing such as pass/fail, win/lose, alive/dead or healthy/sick.

Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). ... Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
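
As a rough sketch of the idea, logistic regression passes a weighted sum of the features through the sigmoid function to obtain a probability between 0 and 1. The weights, bias and feature values below are hypothetical and are not the coefficients fitted later in this notebook.

```python
import math

def sigmoid(z: float) -> float:
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of class 1 for a single observation."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients and feature values, for illustration only.
p = predict_proba(features=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(p)
```

A probability above the chosen threshold (0.5 by default) is reported as class 1, otherwise as class 0.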

In [None]:
#split the dataset into features and target variable
X = health.drop("Outcome", axis=1)
y = health["Outcome"]

#### Split the dataset
To understand the model performance we split the dataset into a training set and a test set.

In [None]:
# Split X and y into training and testing set.
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

75% of the data will be used for model training and 25% for model testing.

#### Model Development and predictions
+ Import the Logistic Regression module and create a Logistic Regression classifier object using the LogisticRegression() function.

In [None]:
# import the class
from sklearn.linear_model import LogisticRegression

In [None]:
#instantiate the model
logreg = LogisticRegression()

#fit the model with train data
logreg.fit(X_train, y_train)

#predict the outcome for the test features (X_test)
y_pred = logreg.predict(X_test)

#### Model Evaluation using Confusion Matrix

#### Confusion Matrix

A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) 
on a set of test data for which the true values are known. It is a specific table layout that allows visualization of the 
performance of an algorithm, typically a supervised learning one.

In [None]:
# import the metrics
from sklearn import metrics
cf_matrix = metrics.confusion_matrix(y_test, y_pred)
cf_matrix

The dimension of this matrix is 2*2 because this is a binary classification model with two classes, 0 and 1.
Diagonal values represent accurate predictions, while non-diagonal elements are inaccurate predictions.

scikit-learn puts the actual classes in rows and the predicted classes in columns, so the top-left cell counts class 0. There 
were 118 True Negatives, patients without diabetes that were correctly classified, and 33 True Positives, patients with
diabetes that were correctly classified. However, the algorithm misclassified 29 patients that did have diabetes by saying 
they did not (False Negative) and misclassified 12 patients that did not have diabetes by saying that they
did (False Positive).
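
To make the cell layout explicit, the sketch below builds a 2x2 confusion matrix by hand with rows for the actual class and columns for the predicted class, which is the convention `confusion_matrix` uses. The two label lists are made up for illustration.

```python
# Hypothetical true and predicted labels (0 = no diabetes, 1 = diabetes).
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0]

# matrix[actual][predicted] counts how often each combination occurs.
matrix = [[0, 0], [0, 0]]
for actual, predicted in zip(y_true, y_hat):
    matrix[actual][predicted] += 1

# With labels 0 and 1 the layout is [[TN, FP], [FN, TP]].
(tn, fp), (fn, tp) = matrix
print(matrix)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

Reading the matrix this way, the top-left cell holds the correctly classified negatives and the bottom-right cell the correctly classified positives.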

#### Visualizing the Confusion Matrix using a Heatmap
We visualize the results of the model in the form of the confusion matrix using matplotlib and seaborn.

In [None]:
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

#### Evaluation Metrics from the Confusion Matrix
+ Accuracy, Precision, and Recall

    + Accuracy is the proportion of true results among the total number of cases examined.
    + Precision gives the proportion of predicted Positives that are truly Positive.
    + Recall measures the proportion of actual Positives that were correctly classified.

In [None]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))

We got a classification rate of 78%, which is considered good accuracy. This is the proportion of true results among the total 
number of cases examined.

Precision: 73% of the patients the model predicted to have diabetes actually have it.

Recall: of the patients who actually have diabetes, Logistic Regression captures 53%.
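
The three percentages can be re-derived from the four counts quoted in the confusion matrix discussion. This sketch assumes those counts are laid out in scikit-learn's order, [[TN, FP], [FN, TP]] = [[118, 12], [29, 33]].

```python
# Counts quoted earlier, read in scikit-learn's [[TN, FP], [FN, TP]] layout (an assumption).
tn, fp, fn, tp = 118, 12, 29, 33

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are real
recall = tp / (tp + fn)                      # share of real positives that were found

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
```

The results round to the 78%, 73% and 53% reported above, which supports reading the 33 in the bottom-right cell as the true positives.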


## Comparing the KNN Algorithm with DecisionTree, Random Forest, and Logistic Regression classifier.

#### We use the pipeline in sklearn

In [None]:
# Import the libraries (pipeline and models)
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

In [None]:
pipeline_dt=Pipeline([('dt_classifier',DecisionTreeClassifier(random_state=0))])

In [None]:
pipeline_rf=Pipeline([('rf_classifier',RandomForestClassifier())])

In [None]:
pipeline_knn=Pipeline([('kn_classifier',KNeighborsClassifier())])

In [None]:
pipeline_lr=Pipeline([('lr_classifier',LogisticRegression())])

In [None]:
#Make the list of pipelines
pipelines = [pipeline_dt,pipeline_rf,pipeline_knn,pipeline_lr]

In [None]:
best_accuracy=0.0
best_classifier=0
best_pipeline=""

In [None]:
#Dictionary of pipelines and classifier types for ease of reference
pipe_dict = {0: 'Decision Tree', 1: 'RandomForest', 2: 'KNeighbors', 3:'Logistic Regression'}

#Fit the pipelines
for pipe in pipelines:
    pipe.fit(X_train, y_train)

In [None]:
for i,model in enumerate(pipelines):
    print("{} Test Accuracy: {}".format(pipe_dict[i],model.score(X_test,y_test)))

In [None]:
for i,model in enumerate(pipelines):
    if model.score(X_test,y_test)>best_accuracy:
        best_accuracy=model.score(X_test,y_test)
        best_pipeline=model
        best_classifier=i
print('Classifier with best accuracy:{}'.format(pipe_dict[best_classifier]))         

Logistic Regression is the best classifier compared with the other algorithms, with a good accuracy of 78%. This is the
proportion of true results among the total number of cases examined.

### The Confusion Matrix for each Algorithm

#### Decision Tree Confusion Matrix

In [None]:
y_pred_0 = pipeline_dt.predict(X_test)

In [None]:
dt_cnf_matrix=metrics.confusion_matrix(y_test,y_pred_0)
dt_cnf_matrix

#### Random Forest Confusion Matrix

In [None]:
y_pred_1 = pipeline_rf.predict(X_test)

In [None]:
rf_cnf_matrix=metrics.confusion_matrix(y_test,y_pred_1)
rf_cnf_matrix

#### KNeighborsClassifier Confusion Matrix

In [None]:
y_pred_3 = pipeline_knn.predict(X_test)

In [None]:
knn_cnf_matrix=metrics.confusion_matrix(y_test,y_pred_3)
knn_cnf_matrix

#### LogisticRegression Confusion Matrix

In [None]:
y_pred_4=pipeline_lr.predict(X_test)

In [None]:
lr_cnf_matrix=metrics.confusion_matrix(y_test,y_pred_4)
lr_cnf_matrix

Reading each matrix in scikit-learn's layout (rows are actual classes, columns are predicted classes), KNN correctly classified 
113 patients without diabetes (True Negatives) and 34 patients with diabetes (True Positives), more True Negatives than the 
Decision Tree and Random Forest. Decision Tree and Random Forest are the only classifiers with a better number of True Positives 
than the other classifiers. KNN is close to Logistic Regression, which correctly classified 118 patients without diabetes and 
33 patients with diabetes.

If identifying patients with diabetes was more important in this data, we would choose the Random Forest or Decision Tree Classifier. 
If identifying patients without diabetes was more important, we would choose Logistic Regression.

#### Logistic Regression and the KNeighbors Classifier are better choices than the Random Forest and Decision Tree Classifiers for correctly identifying patients without diabetes (true negatives).

However, the confusion matrices of Logistic Regression and the KNeighbors Classifier make it hard to choose which machine
learning method is the better fit for this data, because they differ by only a small margin.
+ We need more sophisticated metrics, such as Sensitivity, Specificity, ROC and AUC, to help us make a decision.

### Sensitivity and Specificity
+ Sensitivity tells us what percentage of patients with diabetes were correctly identified.
+ Specificity tells us what percentage of patients without diabetes were correctly identified.

#### Sensitivity and Specificity of KNeighbors and Logistic Regression

In [None]:
# For KNeighbors: counts read from the confusion matrix as [[TN, FP], [FN, TP]]
# sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)
sensitivity, specificity= [34/(34+28), 113/(113+17)]
print("sensitivity:", sensitivity)
print("specificity:", specificity)

In [None]:
# For Logistic Regression: counts read from the confusion matrix as [[TN, FP], [FN, TP]]
sensitivity, specificity= [33/(33+29), 117/(117+13)]
print("sensitivity:", sensitivity)
print("specificity:", specificity)

Identifying patients with diabetes and patients without diabetes are both important. The two classifiers have almost the same 
sensitivity (about 55% for KNeighbors and 53% for Logistic Regression), while Logistic Regression has the higher specificity 
(90% against about 87%) because it makes fewer False Positive errors. We therefore choose Logistic Regression as the best 
classifier of the two.

### ROC Curve and Area Under the ROC Curve (AUC) in Machine Learning 
+ The ROC or Receiver Operating Characteristic curve is used to evaluate classification models in Machine Learning.
+ The ROC plot is used to visualise the performance of a binary classifier. It gives us the
trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR) at different classification thresholds.
+ It is simply a graph displaying the performance of a classification model.
+ It is a very popular tool for measuring the performance of a classification model.
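
One way to build intuition for the area under the ROC curve: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The sketch below computes it that way in plain Python; the helper name `auc_pairwise` and the four scores are made up for illustration.

```python
def auc_pairwise(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = auc_pairwise(y_true, scores)
print(auc)
```

An AUC of 0.5 corresponds to the diagonal chance line in the plots below, and 1.0 to a classifier that ranks every positive above every negative.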

## Let's try to improve the performance of the Random Forest Classifier by thresholding

### ROC Curve of the Random Forest Classifier vs. the chance line (default threshold 0.5)

In [None]:
# Import the roc libraries and use roc_curve() to get the threshold, TPR, and FPR
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
fpr, tpr, thresholds = roc_curve(y_test, pipeline_rf.predict_proba(X_test)[:,1])

The fpr, tpr and threshold arrays:

In [None]:
fpr

In [None]:
tpr

In [None]:
thresholds

In [None]:
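# roc_auc_score() computes the AUC. Note: pipeline_rf.predict() returns hard 0/1 labels, so this is the AUC at the
# single default threshold (the mean of sensitivity and specificity); pass predict_proba(X_test)[:,1] for the full curve
rf_roc_auc1 = roc_auc_score(y_test, pipeline_rf.predict(X_test))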

In [None]:
#Plot the ROC Curve
plt.figure()
plt.plot(fpr, tpr, label='Random Forest(AUC = %0.3f)' % rf_roc_auc1)
plt.plot([0,1], [0,1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()

Tuning the threshold value to build a classifier model with more desired output.

In [None]:
# The predicted probabilities of class 1(diabetes patients)
y_pred_11 = pipeline_rf.predict_proba(X_test)[:,1]
y_pred_11 = y_pred_11.reshape(1,-1)
y_pred_11

In [None]:
# Set the threshold at 0.6 (threshold must be passed by keyword in recent scikit-learn versions)
from sklearn.preprocessing import binarize
y_pred_11 = binarize(y_pred_11, threshold=0.6)[0]
y_pred_11

In [None]:
#Converting the array from float data type to integer data type
y_pred_11 = y_pred_11.astype(int)
y_pred_11
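
The thresholding step can also be written directly with NumPy, which may make the rule easier to see: a probability strictly greater than the threshold becomes 1 and everything else becomes 0. The probabilities below are made up for illustration.

```python
import numpy as np

# Hypothetical predicted probabilities of class 1.
proba = np.array([0.10, 0.45, 0.60, 0.61, 0.90])

# Strictly greater than the threshold maps to 1, otherwise 0.
labels_at_06 = (proba > 0.6).astype(int)
labels_at_05 = (proba > 0.5).astype(int)

print(labels_at_06)
print(labels_at_05)
```

Raising the threshold from 0.5 to 0.6 turns borderline cases into negatives, so fewer patients are flagged as diabetic.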

In [None]:
rf_cnf_matrix1 = metrics.confusion_matrix(y_test, y_pred_11)
rf_cnf_matrix1

+ Note: Here
    + True Negative is 117
    + True Positive is 32
    + False Positive is 13
    + False Negative is 30
    
We can clearly see the classifier has improved. Compare with the previous Confusion Matrix given below:    

In [None]:
rf_cnf_matrix

In [None]:
# Other performance metrics
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_11))

We compare the performance metrics (above) at threshold 0.6 to the performance metrics at default threshold (below).

In [None]:
print(classification_report(y_test, y_pred_1))

We can see an improvement in the precision of the thresholded classifier, which reaches 70% when classifying patients with diabetes,
whereas the default classifier got 69%.

To observe how sensitivity changes with the threshold, we plot the ROC Curve again.

In [None]:
rf_roc_auc2 = roc_auc_score(y_test, y_pred_11)

In [None]:
plt.figure()
plt.plot(fpr, tpr, label='Threshold=.5=>AUC = %0.3f,\nThreshold=.6=>AUC= %0.3f' % (rf_roc_auc1,rf_roc_auc2))
plt.plot([0,1], [0,1], 'r--')
plt.plot([0,1], [0,1], 'b--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="best")
plt.savefig('Log_ROC')         
plt.show()

The score at threshold 0.6 is slightly below the score at the default threshold. This is because the number of False 
Negatives increased.

### ROC Curve for Logistic Regression Classifier

In [None]:
# We use roc_curve() to get the threshold, TPR and FPR
fpr, tpr, thresholds = roc_curve(y_test, pipeline_lr.predict_proba(X_test)[:,1])

In [None]:
fpr

In [None]:
tpr

In [None]:
thresholds

In [None]:
# For AUC we use roc_auc_score() function for ROC
lr_roc_auc3 = roc_auc_score(y_test, pipeline_lr.predict(X_test))

In [None]:
#Plot the ROC Curve
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression(AUC = %0.3f)' % lr_roc_auc3)
plt.plot([0,1], [0,1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()

+ We can tune the threshold value to build a classifier with a more desirable output for the Logistic Regression algorithm and 
other classifier models.
+ The ROC curve helps to choose a threshold that balances sensitivity and specificity in a way that makes sense for the 
problem at hand.
+ Increasing the threshold may decrease the sensitivity and increase the specificity, so the two move in opposite 
directions. There is always a trade-off between sensitivity and specificity.
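
The trade-off can be seen by sweeping the threshold over a fixed set of scores. This is a minimal sketch in plain Python; the helper name `sens_spec`, the labels and the scores are all made up for illustration.

```python
# Hypothetical labels and predicted probabilities of class 1.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.3, 0.45, 0.7, 0.4, 0.55, 0.8, 0.9]

def sens_spec(y_true, scores, threshold):
    """Return (sensitivity, specificity) when scores above the threshold count as class 1."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s > threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s <= threshold)
    tn = sum(1 for t, s in zip(y_true, scores) if t == 0 and s <= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s > threshold)
    return tp / (tp + fn), tn / (tn + fp)

results = {t: sens_spec(y_true, scores, t) for t in (0.35, 0.5, 0.6)}
for threshold, (sensitivity, specificity) in results.items():
    print(f"threshold={threshold}: sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

As the threshold rises, sensitivity never goes up and specificity never goes down, so the right threshold depends on which kind of error is more costly for the problem.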