**Problem Statement**:
The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given medical details.
It is a binary (2-class) classification problem with imbalanced classes: about 35% of the observations are positive. There are 768 observations with 8 input variables and 1 output variable. Missing values are believed to be encoded as zeros. The variables are as follows:
1.	Number of times pregnant.
2.	Plasma glucose concentration 2 hours in an oral glucose tolerance test.
3.	Diastolic blood pressure (mm Hg).
4.	Triceps skinfold thickness (mm).
5.	2-Hour serum insulin (mu U/ml).
6.	Body mass index (weight in kg/(height in m)^2).
7.	Diabetes pedigree function.
8.	Age (years).
9.	Is Diabetic (0 or 1).

In [40]:
import numpy as np
import pandas as pd
import xgboost as xgb
from xgboost import XGBClassifier
import pickle
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

In [6]:
# Reading the CSV file into a DataFrame
data = pd.read_csv('pima-indians-diabetes.csv')

In [7]:
data.head()

Unnamed: 0,Number of times pregnant,Plasma glucose concentration,Diastolic blood pressure (mm Hg),Triceps skinfold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age,Is Diabetic
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [12]:
data.columns

Index(['Number of times pregnant', 'Plasma glucose concentration',
       'Diastolic blood pressure (mm Hg)', 'Triceps skinfold thickness (mm)',
       '2-Hour serum insulin (mu U/ml)',
       'Body mass index (weight in kg/(height in m)^2)',
       'Diabetes pedigree function', 'Age', 'Is Diabetic'],
      dtype='object')

Of the 8 feature columns, only 'Number of times pregnant' can legitimately be 0. As noted in the data description above, missing values are encoded as 0, so we first replace those zeros with NaN and then impute them.
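
Before replacing anything, it can help to count how many zeros each column holds, so the later NaN counts can be cross-checked. The following is a minimal sketch on a toy frame; the short column names and values are hypothetical and only stand in for the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical column names; 0 stands in for "missing"
# everywhere except the pregnancy count, where 0 is a valid value.
toy = pd.DataFrame({
    "pregnancies": [0, 2, 1, 0],
    "glucose": [148, 0, 183, 89],
    "insulin": [0, 0, 94, 168],
})

zero_as_missing = ["glucose", "insulin"]

# Count the zeros that will be treated as missing, per column.
zero_counts = (toy[zero_as_missing] == 0).sum()

# Replace those zeros with NaN; the pregnancy column is left untouched.
cleaned = toy.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)
nan_counts = cleaned.isna().sum()

print(zero_counts)
print(nan_counts)
```

The per-column zero counts before the replacement should equal the NaN counts after it, which is a quick sanity check that no valid zeros were touched.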

In [14]:
cols = ['Plasma glucose concentration',
       'Diastolic blood pressure (mm Hg)', 'Triceps skinfold thickness (mm)',
       '2-Hour serum insulin (mu U/ml)',
       'Body mass index (weight in kg/(height in m)^2)',
       'Diabetes pedigree function', 'Age']

In [15]:
# Replace zeros with NaN in all affected columns at once
data[cols] = data[cols].replace(0, np.nan)

In [16]:
# checking for missing values
data.isna().sum()

Number of times pregnant                            0
Plasma glucose concentration                        5
Diastolic blood pressure (mm Hg)                   35
Triceps skinfold thickness (mm)                   227
2-Hour serum insulin (mu U/ml)                    374
Body mass index (weight in kg/(height in m)^2)     11
Diabetes pedigree function                          0
Age                                                 0
Is Diabetic                                         0
dtype: int64

In [17]:
data.describe()

Unnamed: 0,Number of times pregnant,Plasma glucose concentration,Diastolic blood pressure (mm Hg),Triceps skinfold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age,Is Diabetic
count,768.0,763.0,733.0,541.0,394.0,757.0,768.0,768.0,768.0
mean,3.845052,121.686763,72.405184,29.15342,155.548223,32.457464,0.471876,33.240885,0.348958
std,3.369578,30.535641,12.382158,10.476982,118.775855,6.924988,0.331329,11.760232,0.476951
min,0.0,44.0,24.0,7.0,14.0,18.2,0.078,21.0,0.0
25%,1.0,99.0,64.0,22.0,76.25,27.5,0.24375,24.0,0.0
50%,3.0,117.0,72.0,29.0,125.0,32.3,0.3725,29.0,0.0
75%,6.0,141.0,80.0,36.0,190.0,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [18]:
# Imputing the missing values: mode for glucose and blood pressure, mean for the other columns
data['Plasma glucose concentration']= data['Plasma glucose concentration'].fillna(data['Plasma glucose concentration'].mode()[0])
data['Diastolic blood pressure (mm Hg)']= data['Diastolic blood pressure (mm Hg)'].fillna(data['Diastolic blood pressure (mm Hg)'].mode()[0])
data['Triceps skinfold thickness (mm)']= data['Triceps skinfold thickness (mm)'].fillna(data['Triceps skinfold thickness (mm)'].mean())
data['2-Hour serum insulin (mu U/ml)']= data['2-Hour serum insulin (mu U/ml)'].fillna(data['2-Hour serum insulin (mu U/ml)'].mean())
data['Body mass index (weight in kg/(height in m)^2)']= data['Body mass index (weight in kg/(height in m)^2)'].fillna(data['Body mass index (weight in kg/(height in m)^2)'].mean())
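
The summary statistics above show that serum insulin is right-skewed (mean about 155.5, median 125, maximum 846), so the median may be a more robust fill value than the mean for that column. The sketch below is one possible alternative, not the approach used in this notebook: it uses scikit-learn's `SimpleImputer` on a toy array and learns the fill values from the training rows only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy training and test matrices (two hypothetical features) with gaps.
train = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [np.nan, 30.0],
                  [100.0, 20.0]])
test = np.array([[np.nan, np.nan]])

# The median is less sensitive to extreme values than the mean.
imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)  # fill values learned from train only
test_filled = imputer.transform(test)        # the same fill values reused on test

print(imputer.statistics_)
print(test_filled)
```

Learning the fill values on the training rows and reusing them on the test rows keeps information from the test set out of the preprocessing step.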


In [19]:
# checking for missing values
data.isna().sum()

Number of times pregnant                          0
Plasma glucose concentration                      0
Diastolic blood pressure (mm Hg)                  0
Triceps skinfold thickness (mm)                   0
2-Hour serum insulin (mu U/ml)                    0
Body mass index (weight in kg/(height in m)^2)    0
Diabetes pedigree function                        0
Age                                               0
Is Diabetic                                       0
dtype: int64

In [22]:
# Separating the features (x) from the label (y)
x = data.drop('Is Diabetic', axis = 1)
y = data['Is Diabetic']

In [23]:
x.head()

Unnamed: 0,Number of times pregnant,Plasma glucose concentration,Diastolic blood pressure (mm Hg),Triceps skinfold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age
0,6,148.0,72.0,35.0,155.548223,33.6,0.627,50
1,1,85.0,66.0,29.0,155.548223,26.6,0.351,31
2,8,183.0,64.0,29.15342,155.548223,23.3,0.672,32
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33


In [28]:
# scaling the datapoints
scaler = StandardScaler()
scaled_data = scaler.fit_transform(x)

In [27]:
scaled_data

array([[ 0.63994726,  0.86840303, -0.02442979, ...,  0.16629174,
         0.46849198,  1.4259954 ],
       [-0.84488505, -1.19914997, -0.52034382, ..., -0.85253118,
        -0.36506078, -0.19067191],
       [ 1.23388019,  2.01704359, -0.68564849, ..., -1.33283341,
         0.60439732, -0.10558415],
       ...,
       [ 0.3429808 , -0.01769112, -0.02442979, ..., -0.91074963,
        -0.68519336, -0.27575966],
       [-0.84488505,  0.14640039, -1.01625784, ..., -0.34311972,
        -0.37110101,  1.17073215],
       [-0.84488505, -0.93660356, -0.18973447, ..., -0.29945588,
        -0.47378505, -0.87137393]])

In [56]:
train_x, test_x, train_y, test_y = train_test_split(scaled_data, y, test_size= 0.25, random_state= 1)
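
In the cells above the scaler is fitted on all rows before the split, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training part only. The sketch below illustrates this on synthetic data (the arrays are made up); `stratify=y` is an optional addition that keeps the class ratio similar in both parts. Tree-based models such as XGBoost are generally insensitive to feature scaling, so this matters mostly if other model families are tried on the same data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic features
y = (rng.random(200) < 0.35).astype(int)              # roughly 35% positives

# Split first; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # same mean/std applied to test rows

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```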

In [57]:
model = XGBClassifier(objective = 'binary:logistic')
model.fit(train_x, train_y)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [58]:
#checking training accuracy
y_pred = model.predict(train_x)
accuracy = accuracy_score(train_y,y_pred)
accuracy

1.0

In [59]:
#checking test accuracy
y_pred = model.predict(test_x)
accuracy = accuracy_score(test_y,y_pred)
accuracy

0.7239583333333334
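
A training accuracy of 1.0 against a test accuracy of about 0.72 suggests the default model overfits. Because the classes are imbalanced, accuracy alone can also hide how well the positive class is detected. The sketch below shows how a confusion matrix, precision, and recall could be computed with scikit-learn; the labels and predictions are hypothetical and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions (1 = diabetic), for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

precision = precision_score(y_true, y_hat)  # TP / (TP + FP)
recall = recall_score(y_true, y_hat)        # TP / (TP + FN)
accuracy = (tp + tn) / cm.sum()

print(cm)
print(precision, recall, accuracy)
```

In this toy case the accuracy is 0.7 while recall on the positive class is only 0.5, which is the kind of gap that accuracy by itself would not reveal.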

## Hyper-parameter Tuning using GridSearchCV

In [60]:
param_grid={
    'learning_rate': [1,0.5,0.1,0.01,0.001],
    'max_depth': [3,5],
    'n_estimators': [10,50,100,200]
}
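
The size of this grid determines how many models are trained. As a quick check, the number of candidates can be counted with scikit-learn's `ParameterGrid`; multiplied by the number of cross-validation folds (5 is the `GridSearchCV` default in recent scikit-learn versions), this gives the total number of fits.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "learning_rate": [1, 0.5, 0.1, 0.01, 0.001],
    "max_depth": [3, 5],
    "n_estimators": [10, 50, 100, 200],
}

n_candidates = len(ParameterGrid(param_grid))  # 5 * 2 * 4 combinations
n_folds = 5                                    # default cv of GridSearchCV
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```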

In [61]:
grid=GridSearchCV(model,param_grid,verbose=3)

In [62]:
grid.fit(train_x,train_y)

Fitting 5 folds for each of 40 candidates, totalling 200 fits
[CV] learning_rate=1, max_depth=3, n_estimators=10 ...................
[CV]  learning_rate=1, max_depth=3, n_estimators=10, score=0.698, total=   0.0s
[CV] learning_rate=1, max_depth=3, n_estimators=10 ...................
[CV]  learning_rate=1, max_depth=3, n_estimators=10, score=0.757, total=   0.0s
[CV] learning_rate=1, max_depth=3, n_estimators=10 ...................
[CV]  learning_rate=1, max_depth=3, n_estimators=10, score=0.765, total=   0.0s
[CV] learning_rate=1, max_depth=3, n_estimators=10 ...................
[CV]  learning_rate=1, max_depth=3, n_estimators=10, score=0.739, total=   0.0s
[CV] learning_rate=1, max_depth=3, n_estimators=10 ...................
[CV]  learning_rate=1, max_depth=3, n_estimators=10, score=0.765, total=   0.0s
[CV] learning_rate=1, max_depth=3, n_estimators=50 ...................
[CV]  learning_rate=1, max_depth=3, n_estimators=50, score=0.724, total=   0.0s
[CV] learning_rate=1, max_depth=

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s



[CV] learning_rate=1, max_depth=3, n_estimators=50 ...................
[CV]  learning_rate=1, max_depth=3, n_estimators=50, score=0.765, total=   0.0s
[CV] learning_rate=1, max_depth=3, n_estimators=50 ...................
[CV]  learning_rate=1, max_depth=3, n_estimators=50, score=0.748, total=   0.0s
[CV] learning_rate=1, max_depth=3, n_estimators=50 ...................
[CV]  learning_rate=1, max_depth=3, n_estimators=50, score=0.722, total=   0.0s
[CV] learning_rate=1, max_depth=3, n_estimators=100 ..................
[CV]  learning_rate=1, max_depth=3, n_estimators=100, score=0.690, total=   0.1s
[CV] learning_rate=1, max_depth=3, n_estimators=100 ..................
[CV]  learning_rate=1, max_depth=3, n_estimators=100, score=0.748, total=   0.1s
[CV] learning_rate=1, max_depth=3, n_estimators=100 ..................
[CV]  learning_rate=1, max_depth=3, n_estimators=100, score=0.739, total=   0.1s
[CV] learning_rate=1, max_depth=3, n_estimators=100 ..................
[CV]  learning_rate

[CV]  learning_rate=0.5, max_depth=5, n_estimators=50, score=0.748, total=   0.0s
[CV] learning_rate=0.5, max_depth=5, n_estimators=50 .................
[CV]  learning_rate=0.5, max_depth=5, n_estimators=50, score=0.713, total=   0.0s
[CV] learning_rate=0.5, max_depth=5, n_estimators=50 .................
[CV]  learning_rate=0.5, max_depth=5, n_estimators=50, score=0.757, total=   0.1s
[CV] learning_rate=0.5, max_depth=5, n_estimators=100 ................
[CV]  learning_rate=0.5, max_depth=5, n_estimators=100, score=0.733, total=   0.1s
[CV] learning_rate=0.5, max_depth=5, n_estimators=100 ................
[CV]  learning_rate=0.5, max_depth=5, n_estimators=100, score=0.748, total=   0.1s
[CV] learning_rate=0.5, max_depth=5, n_estimators=100 ................
[CV]  learning_rate=0.5, max_depth=5, n_estimators=100, score=0.774, total=   0.1s
[CV] learning_rate=0.5, max_depth=5, n_estimators=100 ................
[CV]  learning_rate=0.5, max_depth=5, n_estimators=100, score=0.713, total=   0

[CV]  learning_rate=0.01, max_depth=3, n_estimators=50, score=0.774, total=   0.0s
[CV] learning_rate=0.01, max_depth=3, n_estimators=50 ................
[CV]  learning_rate=0.01, max_depth=3, n_estimators=50, score=0.757, total=   0.0s
[CV] learning_rate=0.01, max_depth=3, n_estimators=50 ................
[CV]  learning_rate=0.01, max_depth=3, n_estimators=50, score=0.722, total=   0.0s
[CV] learning_rate=0.01, max_depth=3, n_estimators=100 ...............
[CV]  learning_rate=0.01, max_depth=3, n_estimators=100, score=0.698, total=   0.1s
[CV] learning_rate=0.01, max_depth=3, n_estimators=100 ...............
[CV]  learning_rate=0.01, max_depth=3, n_estimators=100, score=0.783, total=   0.1s
[CV] learning_rate=0.01, max_depth=3, n_estimators=100 ...............
[CV]  learning_rate=0.01, max_depth=3, n_estimators=100, score=0.765, total=   0.1s
[CV] learning_rate=0.01, max_depth=3, n_estimators=100 ...............
[CV]  learning_rate=0.01, max_depth=3, n_estimators=100, score=0.748, tot

[CV]  learning_rate=0.001, max_depth=5, n_estimators=10, score=0.722, total=   0.0s
[CV] learning_rate=0.001, max_depth=5, n_estimators=10 ...............
[CV]  learning_rate=0.001, max_depth=5, n_estimators=10, score=0.730, total=   0.0s
[CV] learning_rate=0.001, max_depth=5, n_estimators=50 ...............
[CV]  learning_rate=0.001, max_depth=5, n_estimators=50, score=0.690, total=   0.0s
[CV] learning_rate=0.001, max_depth=5, n_estimators=50 ...............
[CV]  learning_rate=0.001, max_depth=5, n_estimators=50, score=0.757, total=   0.0s
[CV] learning_rate=0.001, max_depth=5, n_estimators=50 ...............
[CV]  learning_rate=0.001, max_depth=5, n_estimators=50, score=0.739, total=   0.0s
[CV] learning_rate=0.001, max_depth=5, n_estimators=50 ...............
[CV]  learning_rate=0.001, max_depth=5, n_estimators=50, score=0.722, total=   0.0s
[CV] learning_rate=0.001, max_depth=5, n_estimators=50 ...............
[CV]  learning_rate=0.001, max_depth=5, n_estimators=50, score=0.748, 

[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:   12.6s finished


GridSearchCV(estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                     colsample_bylevel=1, colsample_bynode=1,
                                     colsample_bytree=1, gamma=0, gpu_id=-1,
                                     importance_type='gain',
                                     interaction_constraints='',
                                     learning_rate=0.300000012,
                                     max_delta_step=0, max_depth=6,
                                     min_child_weight=1, missing=nan,
                                     monotone_constraints='()',
                                     n_estimators=100, n_jobs=0,
                                     num_parallel_tree=1, random_state=0,
                                     reg_alpha=0, reg_lambda=1,
                                     scale_pos_weight=1, subsample=1,
                                     tree_method='exact', validate_parameters=1,
                            

In [63]:
# Best Parameters using GridSearchCV
grid.best_params_

{'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 50}

In [64]:
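# Creating a new model with max_depth and n_estimators from the grid search.
# Note: learning_rate=1 here does not match the best value found above (0.1);
# the outputs recorded below correspond to learning_rate=1.
new_model = XGBClassifier(learning_rate=1, max_depth=5, n_estimators= 50)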
new_model.fit(train_x,train_y)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=1, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=50, n_jobs=0, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [65]:
y_pred_new = new_model.predict(test_x)
accuracy_new = accuracy_score(test_y,y_pred_new)
accuracy_new

0.75
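
The grid search reported `learning_rate` 0.1 as the best value, while the refit cell passes `learning_rate=1`. One way to avoid such transcription slips is to reuse the search results directly, either through `best_estimator_` or by unpacking `best_params_`. The sketch below assumes synthetic data and uses scikit-learn's `GradientBoostingClassifier` as a stand-in so that it runs without xgboost; `XGBClassifier` accepts the same three parameter names, so `XGBClassifier(**grid.best_params_)` would follow the same pattern.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data: 8 features, roughly 35% positives.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.65], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

# A deliberately small grid so the sketch runs quickly.
small_grid = {
    "learning_rate": [0.1, 0.01],
    "max_depth": [3, 5],
    "n_estimators": [10, 50],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=0), small_grid, cv=3)
search.fit(X_train, y_train)

# Option 1: the estimator already refit on the full training set
# with the best settings (refit=True is the default).
best_model = search.best_estimator_

# Option 2: build a fresh model by unpacking the dictionary of best settings.
rebuilt = GradientBoostingClassifier(random_state=0, **search.best_params_)
rebuilt.fit(X_train, y_train)

test_accuracy = best_model.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

Either option guarantees that the final model uses exactly the settings the search selected.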