**Characterized by high levels of sugar in the blood, Type 2 diabetes can be prevented or delayed with lifestyle changes. By modelling diabetes in patients, individuals can be better informed about their risk of developing the disease.**

In [None]:
import numpy as np
import pandas as pd
pd.set_option("display.max_columns",100)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
params={"figure.facecolor":(0.0,0.0,0.0,0),
        "axes.facecolor":(1.0,1.0,1.0,1),
        "savefig.facecolor":(0.0,0.0,0.0,0)}
plt.rcParams.update(params)

In [None]:
df=pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv")
df.head()

This dataset contains only female patients of Pima Indian heritage aged 21 and above.

In [None]:
df.shape

The data has 768 rows (one per patient) and 9 columns.

The columns are:
* Pregnancies: the number of times the patient has been pregnant.
* Glucose: the plasma glucose concentration after 2 hours in an oral glucose tolerance test.
* BloodPressure: the diastolic blood pressure (mm Hg).
* SkinThickness: the triceps skin fold thickness (mm).
* Insulin: the 2-hour serum insulin (mu U/ml).
* BMI: the body mass index.
* DiabetesPedigreeFunction: a function which scores the likelihood of diabetes based on family history.
* Age: the age (years).
* Outcome: "0" as no diabetes, "1" as with diabetes.

In [None]:
df.info()

Good, no null values and all the columns are numeric.

In [None]:
df.describe()

But is it possible to have 0 levels of *Glucose*, *BloodPressure*, *SkinThickness*, *Insulin*, *BMI*? Let's take a closer look:

In [None]:
# sns.distplot is deprecated in recent seaborn releases; histplot draws the same histogram.
for i in ["Glucose","BloodPressure","SkinThickness","Insulin","BMI"]:
    plt.figure()
    sns.histplot(df[i],color="#AE9CCD")

For *Glucose*, *BloodPressure* and *BMI* it is obvious that the 0s must be addressed. While less obvious for *SkinThickness* and *Insulin*, these 0s should also be addressed, as it is not possible to have 0 mm of skin fold thickness or 0 mu U/ml of insulin.

As such the 0s should be treated as missing data, and changed to NaN. So then we do have null values...

In [None]:
df[["Glucose","BloodPressure","SkinThickness","Insulin","BMI"]]=df[["Glucose","BloodPressure","SkinThickness","Insulin","BMI"]].replace(0,np.nan)

In [None]:
df.isnull().sum()
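
Alongside the raw counts, the share of missing values per column can be easier to compare. The sketch below uses a small made-up DataFrame (`toy`, not the real dataset) to show the same zero-to-NaN replacement followed by a percentage summary; the column selection is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 0, 94],
    "Pregnancies": [6, 1, 0, 1],  # a 0 here is a valid value, so it is left alone
})

# Only columns where 0 is physically impossible are treated as missing.
zero_as_missing = ["Glucose", "Insulin"]
toy[zero_as_missing] = toy[zero_as_missing].replace(0, np.nan)

# isnull().mean() gives the fraction of missing entries per column.
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```

On the real data, the same `df.isnull().mean()*100` expression would show which columns are most affected.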

In [None]:
sns.heatmap(df.isnull(),cmap="RdPu")

If we drop the null values, we will lose too much information. So let's replace them, but replace them with what? Do we choose the mean, median, mode or some other arbitrary number?

If we choose the mean, median or mode, we must first split the data into training and testing sets and compute the value from the training set only, so that no information leaks over from the testing set.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x=df.drop(["Outcome"],axis=1)
y=df["Outcome"]

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.33,random_state=7)

In [None]:
print("Training set x shape:",x_train.shape,"and y shape:",y_train.shape)
print("Testing set x shape:",x_test.shape,"and y shape:",y_test.shape)

Now we can compute the replacement value (i.e. mean, median or mode) from the training set only.

Let's take a look at each column to decide which metric to use:

In [None]:
# Plot the training set only (not the full df), so the choice is not informed by the testing set.
for i in ["Glucose","BloodPressure","SkinThickness","Insulin","BMI"]:
    plt.figure()
    sns.histplot(x_train[i],color="#AE9CCD")

For *Glucose* and *BloodPressure* I will use the mean; while for the *SkinThickness*, *Insulin* and *BMI* I will use the median. This is based on their distribution (*Glucose* and *BloodPressure* have a more normal distribution whereas *SkinThickness*, *Insulin* and *BMI* are more skewed).
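
The visual judgement above can be backed up with a number. As a rough sketch, pandas' `skew()` returns a value near 0 for a symmetric column and a clearly positive value for a right-skewed one, where the mean is pulled above the median. The two columns below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up columns: one symmetric, one with a single large outlier.
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_skewed": [1, 1, 2, 2, 50],
})

# Sample skewness per column.
skew = toy.skew()
print(skew)

# For the skewed column the mean sits far above the median,
# which is why the median is the safer fill value there.
summary = toy.agg(["mean", "median"])
print(summary)
```

Running `x_train[cols].skew()` on the five affected columns would give a comparable check on the real training data.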

First fill in the missing values in the training set:

In [None]:
print("The number of null values:")
print(x_train.isnull().sum())

In [None]:
# Assign the result back: inplace=True on a column selection may not update x_train
# in recent pandas versions.
x_train=x_train.fillna({"Glucose":x_train["Glucose"].mean(),
                        "BloodPressure":x_train["BloodPressure"].mean(),
                        "SkinThickness":x_train["SkinThickness"].median(),
                        "Insulin":x_train["Insulin"].median(),
                        "BMI":x_train["BMI"].median()})

In [None]:
print("Check there are no more null values:")
print(x_train.isnull().sum())

Now fill in the missing values in the testing set using the training set:

In [None]:
print("The number of null values:")
print(x_test.isnull().sum())

In [None]:
# Fill the testing set with statistics computed from the training set only.
x_test=x_test.fillna({"Glucose":x_train["Glucose"].mean(),
                      "BloodPressure":x_train["BloodPressure"].mean(),
                      "SkinThickness":x_train["SkinThickness"].median(),
                      "Insulin":x_train["Insulin"].median(),
                      "BMI":x_train["BMI"].median()})

In [None]:
print("Check there are no more null values:")
print(x_test.isnull().sum())
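
The same fit-on-train, apply-to-test pattern can also be expressed with scikit-learn's `SimpleImputer`, which is not what this notebook uses but may be convenient when the steps need to be reused. The sketch below is a minimal illustration on made-up values; the two-column frames `train` and `test` are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Made-up miniature training and testing sets.
train = pd.DataFrame({"Glucose": [100.0, np.nan, 140.0],
                      "Insulin": [50.0, 80.0, np.nan]})
test = pd.DataFrame({"Glucose": [np.nan], "Insulin": [np.nan]})

# Mean for the roughly symmetric column, median for the skewed one.
imputer = ColumnTransformer([
    ("mean", SimpleImputer(strategy="mean"), ["Glucose"]),
    ("median", SimpleImputer(strategy="median"), ["Insulin"]),
])

# The fill values are learned from the training set only...
imputer.fit(train)
# ...and then applied to the testing set.
test_filled = imputer.transform(test)
print(test_filled)
```

Here the test row is filled with the training mean of *Glucose* (120) and the training median of *Insulin* (65), so nothing from the testing set influences the fill values.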

The data is now nice and clean, but will need to be scaled to ensure columns with higher values do not have a higher weighting.

But as before, to avoid any data leakage, we will only fit the scaler to the training set and not the testing set (i.e. fit and transform the training set, but only transform the testing set).

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler=StandardScaler()

In [None]:
x_train=scaler.fit_transform(x_train)

In [None]:
x_test=scaler.transform(x_test)
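
To see what fitting on the training set alone means in practice, the following sketch scales randomly generated data (not the diabetes data). After scaling, the training columns have mean 0 and standard deviation 1 by construction, while the testing columns are only approximately standardised because they reuse the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in data; the loc/scale values are arbitrary.
rng = np.random.default_rng(7)
train = rng.normal(loc=120, scale=30, size=(200, 2))
test = rng.normal(loc=120, scale=30, size=(100, 2))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training mean and std

print("train mean/std:", train_scaled.mean(axis=0), train_scaled.std(axis=0))
print("test mean/std: ", test_scaled.mean(axis=0), test_scaled.std(axis=0))
```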

It is finally time to model the data!

Let's try as many algorithms as we can.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier,AdaBoostClassifier
from sklearn.svm import SVC
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

The algorithms will be evaluated based on the number of False Negatives and accuracy.

The accuracy score tells us the proportion of all predictions that are correct (True Positives and True Negatives), which is still essential. However, there is a trade-off between a model that predicts more False Positives (predicting diabetes in a healthy person) and one that predicts more False Negatives (predicting no diabetes in a person who does have diabetes), i.e. type I error vs type II error respectively. I think it is more important to prevent type II errors to ensure that patients are not overlooked.
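
As a small worked example of these quantities, the sketch below uses ten made-up labels (not model output) and shows how a model can reach 70% accuracy while still missing half of the patients with diabetes.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels: 1 = diabetes, 0 = no diabetes.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() returns tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)

accuracy = accuracy_score(y_true, y_pred)  # (tp + tn) / all predictions
recall = recall_score(y_true, y_pred)      # tp / (tp + fn)
print("accuracy:", accuracy, "recall:", recall)
```

Recall is the metric most directly tied to False Negatives, since it falls whenever a patient with diabetes is missed.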

In [None]:
from sklearn.metrics import classification_report,confusion_matrix,roc_curve,roc_auc_score

In [None]:
name=["Logistic","kNN","DecisionTree","RandomForest","GradientBoost","AdaBoost","SVM","LGBM","XGB"]
models=[LogisticRegression(random_state=7),
        KNeighborsClassifier(),
        DecisionTreeClassifier(random_state=7),
        RandomForestClassifier(random_state=7),
        GradientBoostingClassifier(random_state=7),
        AdaBoostClassifier(random_state=7),
        SVC(random_state=7),
        LGBMClassifier(random_state=7),
        XGBClassifier(random_state=7)]
score=[]
falsenegative=[]

for model in models:
    model.fit(x_train,y_train)
    score.append(model.score(x_test,y_test))
    y_predict=model.predict(x_test)
    tn,fp,fn,tp=confusion_matrix(y_test,y_predict).ravel()
    falsenegative.append(fn)

In [None]:
results=pd.DataFrame({"name":name,"models":models,"score":score,"fn":falsenegative})
results.sort_values(["fn","score"],ascending=[True,False])

Thus the best model is the Gradient Boosting Classifier with the lowest number of False Negatives and the third highest accuracy.

In [None]:
model=GradientBoostingClassifier(random_state=7)
model.fit(x_train,y_train)
y_predict=model.predict(x_test)

In [None]:
print(classification_report(y_test,y_predict))

In [None]:
print(confusion_matrix(y_test,y_predict))

In [None]:
y_score=model.predict_proba(x_test)[:,1]

false_positive_rate,true_positive_rate,threshold=roc_curve(y_test,y_score)
print("roc_auc_score: ",roc_auc_score(y_test,y_score))

plt.plot(false_positive_rate,true_positive_rate)
plt.plot([0,1],ls="--")
plt.plot([0,0],[1,0],c=".7")
plt.plot([1,1],c=".7")
plt.title("Receiver Operating Characteristic")
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.show()
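
Each point on the ROC curve corresponds to a decision threshold on the predicted probability. Since the stated priority is fewer False Negatives, one option (not used in this notebook) is to lower the threshold below the default of 0.5. The sketch below illustrates the effect on synthetic data from `make_classification`; the 0.3 threshold is an arbitrary assumption, and in practice a threshold would be chosen on validation data rather than on the testing set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced stand-in data (not the diabetes dataset).
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.65, 0.35], random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=7)

model = GradientBoostingClassifier(random_state=7).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class

fn_by_threshold = {}
for threshold in (0.5, 0.3):
    # Predict positive whenever the probability reaches the threshold.
    y_pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    fn_by_threshold[threshold] = fn
    print("threshold", threshold, "FN:", fn, "FP:", fp)
```

Lowering the threshold can never increase the number of False Negatives, but it usually raises the number of False Positives, which is the trade-off described earlier.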

The model is based on its default parameters:

* loss="deviance" (named "log_loss" in scikit-learn 1.1 and later)
* learning_rate=0.1
* n_estimators=100
* subsample=1.0
* criterion="friedman_mse"
* min_samples_split=2
* min_samples_leaf=1
* min_weight_fraction_leaf=0.0
* max_depth=3
* min_impurity_decrease=0.0
* init=None
* random_state=None (set to 7 in this notebook)
* max_features=None
* verbose=0
* max_leaf_nodes=None
* warm_start=False
* validation_fraction=0.1
* n_iter_no_change=None
* tol=1e-4
* ccp_alpha=0.0

Perhaps the model can be improved by tuning its parameters.

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
params={
    "loss":["exponential"],
    "learning_rate":[0.01,0.05,0.1,0.15,0.2],
    #"n_estimators":[],
    "subsample":[0.8,0.9,1.0],
    "criterion":["friedman_mse"],
    #"min_samples_split":[],
    #"min_samples_leaf":[],
    "max_depth":[3,5,8],
    "max_features":["log2","sqrt"]
}

In [None]:
grid=GridSearchCV(GradientBoostingClassifier(random_state=7),params,refit=True,verbose=1)
grid.fit(x_train,y_train)
y_predict=grid.predict(x_test)
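
By default `GridSearchCV` ranks classifier parameter combinations by accuracy. Because False Negatives matter most here, one possible variation is to pass `scoring="recall"` and then inspect `best_params_`. The sketch below is an assumption-laden illustration on synthetic data with a deliberately tiny grid (`small_grid` is hypothetical), not a rerun of the search above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data so the sketch runs on its own.
X, y = make_classification(n_samples=200, n_features=8, random_state=7)

# A deliberately small grid to keep the example fast.
small_grid = {"learning_rate": [0.05, 0.1], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=7),
    small_grid,
    scoring="recall",  # rank combinations by recall instead of accuracy
    cv=3,
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated recall:", search.best_score_)
```

Optimising recall alone can favour models that over-predict the positive class, so it is worth checking precision or accuracy alongside it.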

In [None]:
print(classification_report(y_test,y_predict))

In [None]:
print(confusion_matrix(y_test,y_predict))

In [None]:
y_score=grid.predict_proba(x_test)[:,1]

false_positive_rate,true_positive_rate,threshold=roc_curve(y_test,y_score)
print("roc_auc_score: ",roc_auc_score(y_test,y_score))

plt.plot(false_positive_rate,true_positive_rate)
plt.plot([0,1],ls="--")
plt.plot([0,0],[1,0],c=".7")
plt.plot([1,1],c=".7")
plt.title("Receiver Operating Characteristic")
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.show()

The model has improved, although only slightly: precision, recall, f1-score, accuracy and ROC AUC score all increased, and the number of False Negatives and False Positives both decreased!

The model may be further improved by, for example, tuning more of its parameters (as above); collecting more data, ideally with fewer missing values; and conducting feature engineering and selection.