# Pima Indians Diabetes Database

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

* This tutorial is highly recommended for beginners.
* This is my fourth notebook. Please point out any mistakes in the comments section.
* I achieved about 77% accuracy on the test data.
* If you find my work interesting, please upvote it.

This is the default first cell in any Kaggle kernel. It imports the NumPy and Pandas libraries and lists the available input files. NumPy is the fundamental package for scientific computing with Python. Pandas is one of the most popular Python libraries for data analysis.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Importing Necessary Libraries

In [None]:
# Plotting Libraries

import matplotlib.pyplot as plt
import seaborn as sns
import cufflinks as cf
%matplotlib inline

# Metrics for Classification technique

from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

# Scaler

from sklearn.preprocessing import RobustScaler, StandardScaler

# Cross Validation

from sklearn.model_selection import KFold, cross_val_score, GridSearchCV, RandomizedSearchCV, train_test_split

# Linear Models

from sklearn.linear_model import LogisticRegression

# Ensemble Technique

from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier

# Other model

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Model Stacking 

from mlxtend.classifier import StackingCVClassifier

# Other libraries

from datetime import datetime
from scipy.stats import skew
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax
from sklearn.impute import SimpleImputer
from numpy import nan
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform

# Loading Dataset

Our first step is to load the data. We will use the pandas function read_csv: specify the location of the dataset and read it into a DataFrame.

In [None]:
data = pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv")
data.head(6) # The argument is the number of rows to display from the top

# Exploring Dataset

In [None]:
# Shape of the dataset

data.shape

**There are 768 rows and 9 columns in the dataset.**

In [None]:
data.info()

**There are no null values in the dataset. Two columns are of float type and the rest are int type. Zeros that stand in for missing measurements are checked in the Feature Engineering section.**

In [None]:
data.describe().transpose()

# EDA

**Let's check the correlation between the features.**

In [None]:
plt.figure(figsize=(20,12))
sns.set_context('notebook',font_scale = 1.3)
sns.heatmap(data.corr(),annot=True,cmap='coolwarm')
plt.tight_layout()

**Let's check whether the dependent variable is balanced or not.**

In [None]:
sns.countplot(x='Outcome', data=data)

**The ratio of negative to positive patients is approximately 2:1. This is only a mild imbalance, as there are enough samples of both classes 0 and 1.**
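The class ratio can also be read off numerically rather than from the plot. The sketch below uses a small made-up label series in place of `data['Outcome']`, so the numbers are illustrative only.

```python
import pandas as pd

# Toy labels standing in for data['Outcome'] (hypothetical values, not the real dataset)
outcome = pd.Series([0, 0, 1, 0, 1, 0], name="Outcome")

counts = outcome.value_counts()                 # absolute count per class
shares = outcome.value_counts(normalize=True)   # fraction of rows per class

print(counts)
print(shares.round(2))
```

On the real data, the same two calls on `data['Outcome']` would give the exact counts behind the bar chart.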

# Feature Engineering

In [None]:
X = data.drop('Outcome',axis = 1)
y = data['Outcome']


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=42)
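
Because the classes are not evenly split, one option is to pass `stratify` to `train_test_split` so that both splits keep the same class ratio. This is a sketch on made-up arrays (`X_toy` and `y_toy` are hypothetical), not a change to the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 90 samples with a 2:1 class ratio (hypothetical values)
X_toy = np.arange(180).reshape(90, 2)
y_toy = np.array([0] * 60 + [1] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=42, stratify=y_toy
)

# The share of positives is the same in both splits
print(y_tr.mean(), y_te.mean())
```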

**Checking for zero values.**

In [None]:
print("total number of rows : {0}".format(len(data)))
print("number of rows missing glucose_conc: {0}".format(len(data.loc[data['Glucose'] == 0])))
print("number of rows missing diastolic_bp: {0}".format(len(data.loc[data['BloodPressure'] == 0])))
print("number of rows missing insulin: {0}".format(len(data.loc[data['Insulin'] == 0])))
print("number of rows missing bmi: {0}".format(len(data.loc[data['BMI'] == 0])))
print("number of rows missing diab_pred: {0}".format(len(data.loc[data['DiabetesPedigreeFunction'] == 0])))
print("number of rows missing age: {0}".format(len(data.loc[data['Age'] == 0])))

In [None]:
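# Filling zero values
# Note: missing_values=0 treats every zero as missing, in all columns.

fill_values = SimpleImputer(missing_values=0, strategy="mean")

# Fit on the training set only, then reuse the learned means on the test set
X_train = fill_values.fit_transform(X_train)
X_test = fill_values.transform(X_test)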
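
An imputer configured with `missing_values=0` replaces every zero, including zeros in Pregnancies, where zero is a legitimate value. A minimal sketch of limiting the imputation to chosen columns, with the means learned from the training rows only, is shown below. The frames and the `zero_is_missing` list are hypothetical; which columns belong in that list is a judgment call.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames (hypothetical values): zero is valid for Pregnancies but not for Glucose
train = pd.DataFrame({"Pregnancies": [0, 2, 4], "Glucose": [100.0, 0.0, 140.0]})
test = pd.DataFrame({"Pregnancies": [0, 1], "Glucose": [0.0, 120.0]})

zero_is_missing = ["Glucose"]  # assumed list of columns where 0 is not plausible

# Mark the implausible zeros as NaN in the selected columns only
train[zero_is_missing] = train[zero_is_missing].replace(0, np.nan)
test[zero_is_missing] = test[zero_is_missing].replace(0, np.nan)

imputer = SimpleImputer(strategy="mean")
train[zero_is_missing] = imputer.fit_transform(train[zero_is_missing])  # learn means on train
test[zero_is_missing] = imputer.transform(test[zero_is_missing])        # reuse the train means

print(test)
```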

# Modelling and Stacking 

In [None]:
# RandomForestClassifier

random_forest_model = RandomForestClassifier(random_state = 42)

random_forest_model.fit(X_train, y_train)

In [None]:
predict_test_data = random_forest_model.predict(X_test)

print("Accuracy = {0:.3f}".format(accuracy_score(y_test, predict_test_data)))

In [None]:
## Hyperparameter Optimization

params1={
    
    "n_estimators" : [100, 300, 500, 800, 1200], 
    "max_depth" : [5, 8, 15, 25, 30],
    "min_samples_split" : [2, 5, 10, 15, 100],
    "min_samples_leaf" : [1, 2, 5, 10] 

}

In [None]:
rfm = RandomForestClassifier(random_state = 42)

In [None]:
rfms = RandomizedSearchCV(rfm,param_distributions=params1,n_iter=5,scoring='roc_auc',n_jobs=-1,cv=5,verbose=3)

In [None]:
def timer(start_time=None):
    """Return the current time when called without an argument; otherwise print the time elapsed since start_time."""
    if start_time is None:
        return datetime.now()
    thour, temp_sec = divmod((datetime.now() - start_time).total_seconds(), 3600)
    tmin, tsec = divmod(temp_sec, 60)
    print('\n Time taken: %i hours %i minutes and %s seconds.' % (thour, tmin, round(tsec, 2)))
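
The same start/stop timing can be wrapped in a context manager so that the two calls cannot get out of step. This is an alternative sketch, not the helper used in this notebook; the name `timed` is made up for illustration.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label="block"):
    """Print how long the enclosed block took (hypothetical helper)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        hours, rem = divmod(elapsed, 3600)
        minutes, seconds = divmod(rem, 60)
        print(f"{label}: {int(hours)} hours {int(minutes)} minutes and {seconds:.2f} seconds")

# Usage: everything inside the with-block is timed
with timed("toy loop"):
    total = sum(range(100_000))
```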

In [None]:
# Here we go
start_time = timer(None) # timing starts from this point for "start_time" variable
rfms.fit(X_train,y_train)
timer(start_time) # timing ends here for "start_time" variable

In [None]:
rfms.best_estimator_
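
Instead of copying the printed parameters into a new model by hand, the fitted search object can be used directly: `best_params_` holds the winning combination and `best_estimator_` is already refit on the full training data when `refit=True` (the default). The sketch below runs on synthetic data from `make_classification` with a deliberately small grid, so its values are not the ones found in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for X_train / y_train (hypothetical data)
X_toy, y_toy = make_classification(n_samples=120, n_features=8, random_state=42)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={"n_estimators": [10, 20], "max_depth": [3, 5]},
    n_iter=3, scoring="roc_auc", cv=3, random_state=42,
)
search.fit(X_toy, y_toy)

print(search.best_params_)            # winning parameter combination
best_model = search.best_estimator_   # already refit on all of X_toy
preds = best_model.predict(X_toy)
```

Passing `random_state` to the search makes the sampled combinations repeatable between runs.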

In [None]:
model1 = RandomForestClassifier(max_depth=8, min_samples_split=10, n_estimators=500,
                       random_state=42)

In [None]:
model1.fit(X_train,y_train)
y_pred1 = model1.predict(X_test)

In [None]:
print(accuracy_score(y_test,y_pred1))
print(confusion_matrix(y_test,y_pred1))

**So after doing Hyperparameter optimization, we are able to achieve accuracy of 76% approx through RandomForestClassifier model. Let's try XGBoost Classifier.**
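The accuracy and the confusion matrix printed above are two views of the same counts. As a small worked example on made-up labels (not this notebook's predictions), accuracy is the diagonal of the matrix divided by its total, and recall for the positive class uses only the bottom row.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels (hypothetical values)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 0, 1, 0])

cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()   # row = true class, column = predicted class

accuracy = (tp + tn) / cm.sum()   # correct predictions over all predictions
recall = tp / (tp + fn)           # share of true positives that were found

print(cm)
print(accuracy, accuracy_score(y_true, y_hat))
print(recall)
```

For a diagnostic task, recall on the positive class is often worth reporting alongside accuracy, since a missed diabetic patient is counted in `fn`.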

In [None]:
model2 = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [None]:
model2.fit(X_train,y_train)
y_pred2 = model2.predict(X_test)

In [None]:
print(accuracy_score(y_test,y_pred2))
print(confusion_matrix(y_test,y_pred2))

**With the fixed parameters above (no hyperparameter search was run for this model), we achieve an accuracy of approximately 73% with the XGBoost Classifier. Let's try CatBoostClassifier.**

In [None]:
model3 = CatBoostClassifier()

In [None]:
model3.fit(X_train,y_train)
y_pred3 = model3.predict(X_test)

In [None]:
print(accuracy_score(y_test,y_pred3))
print(confusion_matrix(y_test,y_pred3))

**With default parameters (no hyperparameter search was run for this model), we achieve an accuracy of approximately 75% with CatBoostClassifier. Let's try SVC.**

In [None]:
params4 = {'C': [0.1, 1, 10, 100, 1000],  
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001], 
              'kernel': ['rbf']} 

In [None]:
svcs = RandomizedSearchCV(SVC(),param_distributions=params4,n_iter=5,scoring='roc_auc',n_jobs=-1,cv=5,verbose=3)

In [None]:
# Here we go
start_time = timer(None) # timing starts from this point for "start_time" variable
svcs.fit(X_train,y_train)
timer(start_time) # timing ends here for "start_time" variable

In [None]:
svcs.best_estimator_

In [None]:
model4 = SVC(C=0.1, gamma=0.001)

In [None]:
model4.fit(X_train,y_train)
y_pred4 = model4.predict(X_test)

In [None]:
print(accuracy_score(y_test,y_pred4))
print(confusion_matrix(y_test,y_pred4))

**So after doing Hyperparameter optimization, we are able to achieve accuracy of 70% approx through SVC. Let's try AdaBoost Classifier.**
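An RBF-kernel SVC is sensitive to feature scale, and the features here are on very different ranges, so scaling may help. Whether it does on this dataset has not been checked in this notebook. The sketch below shows one way to combine `StandardScaler` and `SVC` in a pipeline, on synthetic data from `make_classification`, so that the scaler is fit on the training rows only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the diabetes features (hypothetical data)
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.30, random_state=42)

# The scaler learns its mean and variance from X_tr, then the SVC sees scaled inputs
svc_pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svc_pipeline.fit(X_tr, y_tr)

test_accuracy = svc_pipeline.score(X_te, y_te)
print(test_accuracy)
```

The pipeline can be passed to `RandomizedSearchCV` like any other estimator; parameter names then take a step prefix such as `svc__C`.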

In [None]:
params5 = {'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1]}

In [None]:
adas = RandomizedSearchCV(AdaBoostClassifier(),param_distributions=params5,n_iter=5,scoring='roc_auc',n_jobs=-1,cv=5,verbose=3)

In [None]:
# Here we go
start_time = timer(None) # timing starts from this point for "start_time" variable
adas.fit(X_train,y_train)
timer(start_time) # timing ends here for "start_time" variable

In [None]:
adas.best_estimator_

In [None]:
model5 = AdaBoostClassifier(learning_rate=0.01, n_estimators=500)

In [None]:
model5.fit(X_train,y_train)
y_pred5 = model5.predict(X_test)

In [None]:
print(accuracy_score(y_test,y_pred5))
print(confusion_matrix(y_test,y_pred5))

**So after doing Hyperparameter optimization, we are able to achieve accuracy of 77% approx through AdaBoost Classifier. Let's try LightGBM Classifier.**

In [None]:
params6 = {
    'learning_rate': [ 0.1],
    'num_leaves': [31],
    'boosting_type' : ['gbdt'],
    'objective' : ['binary']
}

In [None]:
# params6 contains a single combination, so one iteration covers the whole search space
lgbs = RandomizedSearchCV(LGBMClassifier(),param_distributions=params6,n_iter=1,scoring='roc_auc',n_jobs=-1,cv=5,verbose=3)

In [None]:
# Here we go
start_time = timer(None) # timing starts from this point for "start_time" variable
lgbs.fit(X_train,y_train)
timer(start_time) # timing ends here for "start_time" variable

In [None]:
lgbs.best_estimator_

In [None]:
model6 = LGBMClassifier(objective='binary')

In [None]:
model6.fit(X_train,y_train)
y_pred6 = model6.predict(X_test)

In [None]:
print(accuracy_score(y_test,y_pred6))
print(confusion_matrix(y_test,y_pred6))

**The search space above contains a single parameter combination, so this is effectively an untuned model. We achieve an accuracy of approximately 74% with the LightGBM Classifier. Let's try GradientBoostingClassifier.**

In [None]:
model7 = GradientBoostingClassifier(random_state = 42)

In [None]:
model7.fit(X_train,y_train)
y_pred7 = model7.predict(X_test)

In [None]:
print(accuracy_score(y_test,y_pred7))
print(confusion_matrix(y_test,y_pred7))

**With default parameters (no hyperparameter search was run for this model), we achieve an accuracy of approximately 74% with GradientBoostingClassifier. Now let's try stacking the models.**

In [None]:
## Stacking of Models

model8 = StackingCVClassifier(classifiers=[model1,model2,model3,model5,model6,model7],
                            meta_classifier=model1,
                            random_state=42)

In [None]:
model8.fit(X_train,y_train)

In [None]:
y_pred8 = model8.predict(X_test)

In [None]:
print(accuracy_score(y_test,y_pred8))
print(confusion_matrix(y_test,y_pred8))

**After stacking the models, we achieve an accuracy of 76%. The highest accuracy so far is 77%, achieved by the AdaBoost Classifier.**
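The accuracies compared above come from a single 70/30 split, and differences of one or two points can easily be split-dependent. One way to get a steadier comparison is k-fold cross-validation with `cross_val_score`. The sketch below uses synthetic data and two models with reduced `n_estimators` to keep it fast; the names in `candidates` are illustrative and the scores it prints say nothing about the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the imputed feature matrix (hypothetical data)
X_toy, y_toy = make_classification(n_samples=150, n_features=8, random_state=42)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=42),
}

results = {}
for name, model in candidates.items():
    # Five accuracy scores per model, one per held-out fold
    scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")
    results[name] = scores
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```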

**Note: My next work will be on a Malaria dataset. My aim is to work on at least 5 disease datasets and then create a web app using Flask where users can check whether they are suffering from those diseases. After completing the web app, I will deploy it on Heroku, and the code will be accessible on GitHub.**

# Thank You!!