# Pima Indians Diabetes Database

## Context
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

## Content
The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

## Acknowledgements
Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261–265). IEEE Computer Society Press.

## Inspiration
Can you build a machine learning model to accurately predict whether the patients in the dataset have diabetes?

## So let's begin here...

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

## Load Data

In [None]:
data = pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv")
data.shape

In [None]:
data.head(5)

In [None]:
data.info()

In [None]:
data.isnull().sum()
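
`isnull()` only counts NaN entries. In this dataset, physiologically implausible zeros in columns such as `Glucose`, `BloodPressure` and `BMI` are commonly treated as missing values, so counting zeros can be a useful extra check. A minimal sketch, assuming those column names and using a small made-up frame (`sample`) in place of the real `data`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`; the values are invented for illustration.
sample = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Outcome": [1, 0, 1, 0],
})

# Columns where a zero is assumed to mean "not recorded".
zero_as_missing = ["Glucose", "BloodPressure", "BMI"]

# Count zeros per column.
zero_counts = (sample[zero_as_missing] == 0).sum()
print(zero_counts)

# Optionally mark those zeros as NaN so that isnull() reports them.
cleaned = sample.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)
print(cleaned.isnull().sum())
```

Whether to impute or drop such rows afterwards is a separate modelling decision and is not covered here.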

### Correlation

In [None]:
import seaborn as sns

# Pairwise correlations between all columns, drawn as an annotated heatmap
corr_ds = data.corr()
plt.figure(figsize=(20,20))
g = sns.heatmap(corr_ds, annot = True)

In [None]:
data.corr()

In [None]:
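# Pass the column by keyword; recent seaborn versions treat the first positional argument as `data`
sns.countplot(x='Outcome', data=data)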

From the graph above, both classes are well represented, so the data is not severely imbalanced.
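
The plot gives a visual impression; the class proportions can also be read off directly. A minimal sketch, with a made-up `outcome` series standing in for `data['Outcome']`:

```python
import pandas as pd

# Hypothetical labels; in the notebook this would be data['Outcome'].
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 1, 0, 0], name="Outcome")

counts = outcome.value_counts()                # absolute count per class
shares = outcome.value_counts(normalize=True)  # fraction per class
print(counts)
print(shares.round(3))

# Ratio of majority to minority class; 1.0 would mean perfectly balanced.
imbalance_ratio = counts.max() / counts.min()
print("Majority/minority ratio:", round(imbalance_ratio, 2))
```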

## Train/Test Split

In [None]:
X = data.drop(['Outcome'], axis = 1)
y = data['Outcome']

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score

## XGBoost

In [None]:
import xgboost
from sklearn.model_selection import RandomizedSearchCV

xgb_model = xgboost.XGBClassifier()

In [None]:
param = {
    'learning_rate':[0.05,0.1,0.15,0.2,0.25,0.3],
    'max_depth':[3,4,5,6,8,10,12],
    'min_child_weight':[1,3,5,7],
    'gamma':[0.0,0.1,0.2,0.3,0.4],
    'colsample_bytree':[0.3,0.4,0.5,0.7]
}

In [None]:
random_search = RandomizedSearchCV(xgb_model, param_distributions = param, n_iter = 5,
                                     scoring = 'roc_auc', n_jobs = -1, cv = 5, verbose = 3)
random_search.fit(X_train,y_train)

In [None]:
random_search.best_estimator_
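
With the default `refit=True`, `RandomizedSearchCV` retrains the best configuration on the whole training set, so `best_estimator_` can be used for prediction directly instead of copying its parameters by hand. A minimal sketch of that pattern follows. To keep it light on dependencies it uses scikit-learn's `DecisionTreeClassifier` and synthetic data from `make_classification` as stand-ins for XGBoost and the diabetes data; both are assumptions of this example, as are the names `search`, `best_model` and `demo_acc`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),  # stand-in for XGBClassifier
    param_distributions={
        "max_depth": [2, 3, 4, 5, 6],
        "min_samples_leaf": [1, 3, 5, 7],
    },
    n_iter=5,
    scoring="roc_auc",
    cv=5,
    random_state=10,  # fixes the sampled candidates so reruns give the same result
)
search.fit(X_tr, y_tr)

print(search.best_params_)            # only the tuned values
best_model = search.best_estimator_   # already refit on X_tr, y_tr
demo_acc = accuracy_score(y_te, best_model.predict(X_te))
print("Accuracy:", demo_acc)
```

Setting `random_state` on the search is a design choice for reproducibility; without it, each run may sample different candidates and return a different best estimator.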

In [None]:
# Rebuild the best estimator found above. Only the values that are not library
# defaults in the printed repr are kept; version-specific arguments such as
# gpu_id and validate_parameters are dropped because newer xgboost releases may reject them.
xgb_model = xgboost.XGBClassifier(base_score=0.5, booster='gbtree',
              colsample_bytree=0.4, gamma=0.1,
              learning_rate=0.05, max_depth=12,
              min_child_weight=5, n_estimators=100,
              random_state=0, tree_method='exact')

In [None]:
xgb_model.fit(X_train,y_train)

In [None]:
pred_xgb = xgb_model.predict(X_test)

acc_xgb = accuracy_score(y_test,pred_xgb)
print("Accuracy XGB:", acc_xgb)

In [None]:
cm_xgb = confusion_matrix(y_test,pred_xgb)
sns.heatmap(cm_xgb, annot=True)
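
Accuracy alone hides how the errors split between the two classes. scikit-learn's `confusion_matrix` puts true labels on rows and predicted labels on columns, so for binary labels 0/1 the cells are `[[TN, FP], [FN, TP]]`, and precision and recall follow directly. A minimal sketch using invented counts (`cm_demo`) rather than the notebook's actual results:

```python
import numpy as np

# Hypothetical confusion matrix in scikit-learn layout: [[TN, FP], [FN, TP]].
cm_demo = np.array([[85, 10],
                    [25, 34]])
tn, fp, fn, tp = cm_demo.ravel()

accuracy = (tp + tn) / cm_demo.sum()  # share of all predictions that are correct
precision = tp / (tp + fp)            # share of predicted positives that are correct
recall = tp / (tp + fn)               # share of actual positives that are found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```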

## Support Vector Classifier

In [None]:
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

svc_model = make_pipeline(StandardScaler(), SVC(gamma='auto'))

svc_model.fit(X_train, y_train)

In [None]:
pred_svc = svc_model.predict(X_test)

acc_svc = accuracy_score(y_test,pred_svc)
print("Accuracy SVC:", acc_svc)

In [None]:
cm_svc = confusion_matrix(y_test,pred_svc)
sns.heatmap(cm_svc, annot=True)

**Accuracy for other algorithms**

I also tried several other classification algorithms; the accuracies I obtained are listed below.

Accuracy : 0.7142857142857143 (Random Forest)<br>
Accuracy : 0.727272727272727 (XGBoost)<br>
Accuracy : 0.7337662337662337 (Logistic Regression)<br>
Accuracy : 0.7467532467532467 (Support Vector Classifier)<br>
Accuracy : 0.7012987012987013 (Decision Tree)<br>
Accuracy : 0.7142857142857143 (Naive Bayes)<br>
Accuracy : 0.512987012987013 (Stochastic Gradient Descent)<br>
Accuracy : 0.7077922077922078 (K Nearest Neighbor)<br>
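
A comparison like the one above can be produced with a single loop over models. The sketch below is not the code that generated those numbers: the model list, the default hyperparameters and the synthetic data from `make_classification` are assumptions made for illustration. Scale-sensitive models are wrapped in a `StandardScaler` pipeline, as was done for the SVC earlier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)

# Hypothetical model line-up with mostly default settings.
models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "Stochastic Gradient Descent": make_pipeline(StandardScaler(), SGDClassifier(random_state=0)),
    "K Nearest Neighbor": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

# Fit each model on the same split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))

# Print from best to worst.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"Accuracy : {score:.4f} ({name})")
```

Gradient-based linear models such as SGD tend to be sensitive to feature scale, so a very low score for one of them may be worth rechecking with scaled inputs before drawing conclusions.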
