# Subject of the Project

The dataset is primarily used to predict the onset of diabetes within five years in females of Pima Indian heritage aged 21 or older, given diagnostic measurements. It corresponds to a binary (2-class) classification machine learning problem.

The dependent variable indicates whether or not a person has diabetes. Our goal is to model the relationship between the other variables and this outcome.

Given a person's features, we want a machine learning model that predicts whether that person will have diabetes. This is a classification problem.

## Dataset Information

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

We have 9 columns and 768 instances (rows). The column names are provided as follows (in order):

    - Pregnancies: Number of times pregnant
    - Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
    - BloodPressure: Diastolic blood pressure (mm Hg)
    - SkinThickness: Triceps skinfold thickness (mm)
    - Insulin: 2-Hour serum insulin measurement (mu U/ml)
    - BMI: Body mass index (weight in kg / (height in m)^2)
    - DiabetesPedigreeFunction: Diabetes pedigree function
    - Age: Age (years)
    - Outcome: Class variable (0 or 1, 0 = non-diabetic, 1 = diabetic)

# Data Understanding

In [None]:
#import of libraries
import numpy as np
import pandas as pd 
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, mean_squared_error, r2_score, roc_auc_score, roc_curve, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import KFold

In [None]:
#all warnings are suppressed to keep the notebook output readable
import warnings
warnings.simplefilter(action = "ignore") 

In [None]:
#reading the dataset
df = pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv")
#selection of the first 5 observations
df.head() 

In [None]:
#random sample of 3 rows
df.sample(3) 

In [None]:
#random sample of a fraction of the rows (here 1% of the dataset)
df.sample(frac = 0.01) 

In [None]:
#size information
df.shape

In [None]:
#dataframe's index dtype and column dtypes, non-null values and memory usage information
df.info()

In [None]:
#descriptive statistics, including the specified percentiles
#.T transposes the table, which makes it easier to read
df.describe([0.10,0.25,0.50,0.75,0.90,0.95,0.99]).T

In [None]:
#correlation between variables
df.corr()

Our eventual goal is to exploit patterns in the data to predict the onset of diabetes. We start by visualizing some of the differences between those who developed diabetes and those who did not.

In [None]:
#get a histogram of the Glucose column for both classes

col = 'Glucose'
plt.hist(df[df['Outcome']==0][col], 10, alpha=0.5, label='non-diabetes')
plt.hist(df[df['Outcome']==1][col], 10, alpha=0.5, label='diabetes')
plt.legend(loc='upper right')
plt.xlabel(col)
plt.ylabel('Frequency')
plt.title('Histogram of {}'.format(col))
plt.show()


The histogram shows a pronounced difference in the distribution of Glucose between the two prediction classes.

In [None]:
for col in ['BMI', 'BloodPressure']:
    plt.hist(df[df['Outcome']==0][col], 10, alpha=0.5, label='non-diabetes')
    plt.hist(df[df['Outcome']==1][col], 10, alpha=0.5, label='diabetes')
    plt.legend(loc='upper right')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.title('Histogram of {}'.format(col))
    plt.show()

Together with the previous plot, these histograms show the distributions of 'BMI', 'BloodPressure' and 'Glucose' for the two classes (non-diabetes and diabetes).

'Glucose' appears markedly higher for those who will eventually develop diabetes. To back this up, we can visualize the correlation matrix to quantify the relationship between these variables.

In [None]:
def plot_corr(df,size = 9): 
    corr = df.corr() #correlation matrix of the columns
    fig, ax = plt.subplots(figsize = (size,size)) 
    #fig = the whole figure, ax = the axes to draw on; figsize sets the size of the chart
    cax = ax.matshow(corr, interpolation = 'nearest') #draws the matrix as a grid of colored squares
    fig.colorbar(cax) #color scale next to the chart
    plt.xticks(range(len(corr.columns)),corr.columns,rotation=65) 
    #x tick labels, rotated by 65 degrees so that long column names do not overlap
    plt.yticks(range(len(corr.columns)),corr.columns) #y tick labels

In [None]:
#plot the correlation matrix of df with the function above
plot_corr(df) 

In [None]:
#correlation matrix with the seaborn library (already imported as sns)
sns.heatmap(df.corr());

In [None]:
#annot = True prints the correlation value in each cell
sns.heatmap(df.corr(),annot =True); 

Conclusion: The variables most strongly correlated with Outcome are Glucose, BMI, Age and Pregnancies.
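The ranking read off the heatmap can also be obtained directly by sorting the Outcome column of the correlation matrix. The sketch below is only illustrative: it uses a small synthetic frame (`toy`, with made-up values) rather than the real dataset, so the numbers it prints say nothing about the actual correlations.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the diabetes DataFrame (illustrative values only).
rng = np.random.default_rng(0)
n = 200
outcome = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "Glucose": 100 + 30 * outcome + rng.normal(0, 10, size=n),  # tied to Outcome by construction
    "BloodPressure": rng.normal(70, 10, size=n),                 # unrelated noise
    "Outcome": outcome,
})

# Correlation of every feature with Outcome, strongest first.
outcome_corr = toy.corr()["Outcome"].drop("Outcome").sort_values(ascending=False)
print(outcome_corr)
```

On the real data, the same one-liner applied to `df` would list the features in the order discussed above.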

In [None]:
#proportions of classes 0 and 1 in Outcome
df["Outcome"].value_counts()*100/len(df)

In [None]:
#how many classes are 0 and 1
df.Outcome.value_counts()

In [None]:
#histogram of the Age variable
df["Age"].hist(edgecolor = "black");

In [None]:
#Age, Glucose and BMI means according to Outcome variable
df.groupby("Outcome").agg({"Age":"mean","Glucose":"mean","BMI":"mean"})

# Data Pre-Processing

## Missing Data Analysis

In [None]:
#number of explicit missing values (NaN) per column; none are reported
df.isnull().sum()

In [None]:
#zeros in these variables actually mean missing values, so NaN is assigned instead of 0
df[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']] = df[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']].replace(0, np.nan)

At first glance the dataset has no missing values, but a closer look at the variables shows that the zeros in these columns represent NA.
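A quick way to see how many values are affected is to count the zeros before replacing them. The following is a minimal sketch on a hypothetical four-row frame (`toy`); the column list `cols_with_hidden_na` is shortened for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset (values are made up).
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 94, 168],
    "Pregnancies": [6, 0, 8, 1],  # zero is a legitimate value here
})
cols_with_hidden_na = ["Glucose", "Insulin"]

# Count the zeros that stand in for missing measurements.
zero_counts = (toy[cols_with_hidden_na] == 0).sum()

# Replace them with NaN only in those columns, then recount missing values.
toy[cols_with_hidden_na] = toy[cols_with_hidden_na].replace(0, np.nan)
na_counts = toy.isnull().sum()

print(zero_counts)
print(na_counts)
```

Columns where zero is meaningful, such as Pregnancies, are left out of the replacement on purpose.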

In [None]:
#number of missing values per column
df.isnull().sum()

In [None]:
def median_target(var):   
    #keep only the rows where var is not null
    temp = df[df[var].notnull()] 
    #median of var within each Outcome class; reset_index turns Outcome back into a regular column
    temp = temp[[var, 'Outcome']].groupby(['Outcome'])[[var]].median().reset_index()
    
    return temp

The independent variable and the dependent variable are selected from the dataframe, the rows are grouped by the dependent variable (Outcome), and the median of the independent variable is taken within each group.
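The same per-class median imputation can be sketched in a more compact form with `groupby(...).transform`. This is an alternative formulation on a hypothetical toy frame, not the code used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two missing Glucose values, one in each class.
toy = pd.DataFrame({
    "Glucose": [100.0, np.nan, 110.0, 150.0, np.nan, 170.0],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

# Median of the observed values within each Outcome class, broadcast to every row.
class_median = toy.groupby("Outcome")["Glucose"].transform("median")

# Fill each missing value with the median of its own class.
toy["Glucose"] = toy["Glucose"].fillna(class_median)
print(toy)
```

Note that this kind of imputation uses the label itself, so it can only be applied to rows whose Outcome is already known.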

In [None]:
#median of glucose taken according to Outcome's value of 0 and 1
median_target("Glucose")

In [None]:
#missing values are filled with the median of the same Outcome class (non-diabetic or diabetic)

columns = df.columns

columns = columns.drop("Outcome")

for col in columns:
    #class medians of the current column
    medians = median_target(col)
    #.loc[rows, column]: the part before the comma filters the rows, the part after it selects the column
    df.loc[(df['Outcome'] == 0 ) & (df[col].isnull()), col] = medians[col][0]
    df.loc[(df['Outcome'] == 1 ) & (df[col].isnull()), col] = medians[col][1]

## Feature Engineering

In [None]:
#BMI is binned into ranges and each range is assigned a categorical label
NewBMI = pd.Series(["Underweight", "Normal", "Overweight", "Obesity 1", "Obesity 2", "Obesity 3"], dtype = "category")

df["NewBMI"] = NewBMI

df.loc[df["BMI"] < 18.5, "NewBMI"] = NewBMI[0]

df.loc[(df["BMI"] >= 18.5) & (df["BMI"] <= 24.9), "NewBMI"] = NewBMI[1] #>= so that a BMI of exactly 18.5 is not left unlabeled
df.loc[(df["BMI"] > 24.9) & (df["BMI"] <= 29.9), "NewBMI"] = NewBMI[2]
df.loc[(df["BMI"] > 29.9) & (df["BMI"] <= 34.9), "NewBMI"] = NewBMI[3]
df.loc[(df["BMI"] > 34.9) & (df["BMI"] <= 39.9), "NewBMI"] = NewBMI[4]
df.loc[df["BMI"] > 39.9 ,"NewBMI"] = NewBMI[5]
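Binning of this kind can also be written with `pd.cut`, which assigns every value to exactly one interval. The sketch below is an assumption-laden alternative: it uses left-closed intervals with the cut points 18.5, 25, 30, 35 and 40, which differ slightly from the 24.9/29.9/34.9/39.9 boundaries in the cell above for values that fall between them (for example 24.95).

```python
import numpy as np
import pandas as pd

# Hypothetical BMI values covering every category.
bmi = pd.Series([17.0, 18.5, 24.9, 27.3, 32.0, 37.5, 45.0])

# Assumed cut points; right=False makes each bin include its lower edge.
bins = [-np.inf, 18.5, 25, 30, 35, 40, np.inf]
labels = ["Underweight", "Normal", "Overweight", "Obesity 1", "Obesity 2", "Obesity 3"]

new_bmi = pd.cut(bmi, bins=bins, labels=labels, right=False)
print(new_bmi.tolist())
```

Because the bins are contiguous, no value can slip between two ranges.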

In [None]:
df.head()

In [None]:
#categorical variable creation according to the insulin value
def set_insulin(row):
    if row["Insulin"] >= 16 and row["Insulin"] <= 166:
        return "Normal"
    else:
        return "Abnormal"     

In [None]:
df.head()

In [None]:
#NewInsulinScore variable added with set_insulin
df["NewInsulinScore"] = df.apply(set_insulin, axis=1)
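The row-wise `apply` is easy to read but loops over the rows in Python. As a sketch of a vectorized equivalent, assuming the same inclusive 16 to 166 range, `Series.between` combined with `np.where` gives the same labels on a hypothetical series:

```python
import numpy as np
import pandas as pd

# Hypothetical insulin values, including both boundary values.
insulin = pd.Series([10.0, 16.0, 100.0, 166.0, 200.0])

# between() is inclusive on both ends by default, matching >= 16 and <= 166.
score = pd.Series(
    np.where(insulin.between(16, 166), "Normal", "Abnormal"),
    index=insulin.index,
)
print(score.tolist())
```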

In [None]:
df.head()

In [None]:
#Glucose is binned into ranges and each range is assigned a categorical label
#note: values above 126 are labeled "Secret"; the "High" label is defined but never assigned

NewGlucose = pd.Series(["Low", "Normal", "Overweight", "Secret", "High"], dtype = "category")

df["NewGlucose"] = NewGlucose

df.loc[df["Glucose"] <= 70, "NewGlucose"] = NewGlucose[0]

df.loc[(df["Glucose"] > 70) & (df["Glucose"] <= 99), "NewGlucose"] = NewGlucose[1]

df.loc[(df["Glucose"] > 99) & (df["Glucose"] <= 126), "NewGlucose"] = NewGlucose[2]

df.loc[df["Glucose"] > 126 ,"NewGlucose"] = NewGlucose[3]

In [None]:
df.head()

## One-Hot Encoding

In [None]:
#categorical variables are converted into numerical (0/1) columns with one-hot encoding
#drop_first = True drops one level per variable, which avoids the dummy variable trap
df = pd.get_dummies(df, columns =["NewBMI","NewInsulinScore", "NewGlucose"], drop_first = True)
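To make the effect of `drop_first = True` concrete, here is a minimal sketch on a hypothetical three-row frame. With two levels, only one indicator column is kept, because the dropped level is implied whenever the remaining column is 0.

```python
import pandas as pd

# Hypothetical frame with a single two-level categorical column.
toy = pd.DataFrame({"NewInsulinScore": ["Normal", "Abnormal", "Normal"]})

# Levels are ordered alphabetically, so "Abnormal" is the first level and is dropped.
encoded = pd.get_dummies(toy, columns=["NewInsulinScore"], drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```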

In [None]:
df.head()

## Variable Standardization

In [None]:
#one-hot encoded (categorical) columns are set aside so that they are not scaled
categorical_df = df[['NewBMI_Obesity 1','NewBMI_Obesity 2', 'NewBMI_Obesity 3', 'NewBMI_Overweight','NewBMI_Underweight',
                     'NewInsulinScore_Normal','NewGlucose_Low','NewGlucose_Normal', 'NewGlucose_Overweight', 'NewGlucose_Secret']]

In [None]:
#y is the target; X keeps only the numerical columns (Outcome and the one-hot columns are dropped)
y = df["Outcome"]
X = df.drop(["Outcome",'NewBMI_Obesity 1','NewBMI_Obesity 2', 'NewBMI_Obesity 3', 'NewBMI_Overweight','NewBMI_Underweight',
                     'NewInsulinScore_Normal','NewGlucose_Low','NewGlucose_Normal', 'NewGlucose_Overweight', 'NewGlucose_Secret'], axis = 1)
cols = X.columns
index = X.index

In [None]:
y.head()

In [None]:
X.head()

In [None]:
#the numerical variables are scaled with RobustScaler (median and interquartile range), which is less sensitive to outliers
from sklearn.preprocessing import RobustScaler
transformer = RobustScaler().fit(X)
X = transformer.transform(X)
X = pd.DataFrame(X, columns = cols, index = index)
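With its default settings, `RobustScaler` subtracts the median of each column and divides by its interquartile range. The sketch below checks this against a manual computation on a hypothetical single-column array that contains one extreme value.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical column with one outlier (100.0).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Scaling with the default quantile range (25th to 75th percentile).
scaled = RobustScaler().fit_transform(x)

# Manual version: (x - median) / (Q3 - Q1).
median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)
manual = (x - median) / (q3 - q1)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

The outlier stays large after scaling, but it does not distort the center and spread used for the other values, which is the motivation for choosing this scaler.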

In [None]:
X.head()

In [None]:
#combining non-categorical and categorical variables
X = pd.concat([X, categorical_df], axis = 1)

In [None]:
X.head()

# Modeling

In [None]:
models = []
models.append(('LR', LogisticRegression(random_state = 12345)))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier(random_state = 12345)))
models.append(('RF', RandomForestClassifier(random_state = 12345)))
models.append(('SVM', SVC(gamma='auto', random_state = 12345)))
models.append(('XGB', GradientBoostingClassifier(random_state = 12345))) #scikit-learn's gradient boosting, not the xgboost library
models.append(("LightGBM", LGBMClassifier(random_state = 12345)))

#evaluate each model in turn
results = []
names = []

for name, model in models:
    cv_results = cross_val_score(model, X, y, cv = 10, scoring= "accuracy")
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
        
#boxplot algorithm comparison
fig = plt.figure(figsize=(15,10))
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results,
            vert=True, #vertical box alignment
            patch_artist=True) #fill with color
                         
ax.set_xticklabels(names)
plt.show()

RF, XGB (scikit-learn's GradientBoostingClassifier) and LightGBM gave the best results, so we focused on optimizing these models.
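For classifiers, `cross_val_score` with an integer `cv` uses stratified folds. The sketch below makes the splitter explicit and, as one possible variation, wraps the scaler and the model in a pipeline so that scaling is fitted inside each training fold. It runs on synthetic data from `make_classification`; the names `X_toy`, `y_toy` and `pipe` are illustrative and not part of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic binary classification data (stand-in for X and y).
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=12345)

# Scaler and model in one estimator: the scaler is refit on each training fold.
pipe = make_pipeline(RobustScaler(), LogisticRegression(max_iter=1000, random_state=12345))

# Explicit 10-fold stratified splitter with shuffling for reproducible folds.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=12345)

scores = cross_val_score(pipe, X_toy, y_toy, cv=cv, scoring="accuracy")
print("%f (%f)" % (scores.mean(), scores.std()))
```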

# Model Optimization

## Model Tuning

### Random Forests Tuning

In [None]:
rf_params = {"n_estimators" :[100,200,500,1000], 
             "max_features": [3,5,7], 
             "min_samples_split": [2,5,10,30],
            "max_depth": [3,5,8,None]}

In [None]:
rf_model = RandomForestClassifier(random_state = 12345)

In [None]:
gs_cv = GridSearchCV(rf_model, 
                    rf_params,
                    cv = 10,
                    n_jobs = -1,
                    verbose = 2).fit(X, y)

In [None]:
gs_cv.best_params_

### Final Model Fitting

In [None]:
rf_tuned = RandomForestClassifier(random_state = 12345, **gs_cv.best_params_) #same seed as in the search, for reproducible results

In [None]:
rf_tuned = rf_tuned.fit(X,y)

In [None]:
cross_val_score(rf_tuned, X, y, cv = 10).mean()
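After a grid search, the fitted `GridSearchCV` object already exposes the mean cross-validated score of the best parameter combination (`best_score_`) and a model refitted on all the data (`best_estimator_`). The following is a small sketch on synthetic data with a deliberately tiny, hypothetical grid so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (stand-in for X and y).
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=12345)

# Tiny illustrative grid; the real search above covers many more combinations.
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=12345),
    {"max_depth": [3, None], "min_samples_split": [2, 10]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_toy, y_toy)

best_params = grid.best_params_      # winning hyperparameter combination
best_score = grid.best_score_        # its mean cross-validated accuracy
refit_model = grid.best_estimator_   # refitted on all of X_toy, y_toy (refit=True by default)
print(best_params, best_score)
```

Because the same folds are used both to choose the parameters and to report the score, this figure tends to be somewhat optimistic.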

In [None]:
feature_imp = pd.Series(rf_tuned.feature_importances_,
                        index=X.columns).sort_values(ascending=False)

sns.barplot(x=feature_imp, y=feature_imp.index, palette="Blues_d")
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Feature Importance Levels")
plt.show()

### Gradient Boosting (XGB) Tuning

In [None]:
xgb = GradientBoostingClassifier(random_state = 12345)

In [None]:
xgb_params = {
    "learning_rate": [0.01, 0.1, 0.2, 1],
    "min_samples_split": np.linspace(0.1, 0.5, 3),
    "max_depth":[3,5,8],
    "subsample":[0.5, 0.9, 1.0],
    "n_estimators": [100,500]}

In [None]:
xgb_cv = GridSearchCV(xgb,xgb_params, cv = 10, n_jobs = -1, verbose = 2).fit(X, y)

In [None]:
xgb_cv.best_params_

### Final Model Fitting

In [None]:
xgb_tuned = GradientBoostingClassifier(random_state = 12345, **xgb_cv.best_params_).fit(X,y) #same seed as in the search

In [None]:
cross_val_score(xgb_tuned, X, y, cv = 10).mean()

In [None]:
feature_imp = pd.Series(xgb_tuned.feature_importances_,
                        index=X.columns).sort_values(ascending=False)

sns.barplot(x=feature_imp, y=feature_imp.index, palette="Blues_d")
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Feature Importance Levels")
plt.show()

### LightGBM Tuning

In [None]:
lgbm = LGBMClassifier(random_state = 12345)

In [None]:
lgbm_params = {"learning_rate": [0.01, 0.03, 0.05, 0.1, 0.5],
              "n_estimators": [500, 1000, 1500],
              "max_depth":[3,5,8]}

In [None]:
gs_cv = GridSearchCV(lgbm, 
                     lgbm_params, 
                     cv = 10, 
                     n_jobs = -1, 
                     verbose = 2).fit(X, y)

In [None]:
gs_cv.best_params_

### Final Model Fitting

In [None]:
lgbm_tuned = LGBMClassifier(random_state = 12345, **gs_cv.best_params_).fit(X,y) #same seed as in the search

In [None]:
cross_val_score(lgbm_tuned, X, y, cv = 10).mean()

In [None]:
feature_imp = pd.Series(lgbm_tuned.feature_importances_,
                        index=X.columns).sort_values(ascending=False)

sns.barplot(x=feature_imp, y=feature_imp.index, palette="Blues_d")
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Feature Importance Levels")
plt.show()

# Comparison of Final Models

In [None]:
models = []

models.append(('RF', RandomForestClassifier(random_state = 12345, max_depth = 8, max_features = 7, min_samples_split = 2, n_estimators = 500)))
models.append(('XGB', GradientBoostingClassifier(random_state = 12345, learning_rate = 0.1, max_depth = 5, min_samples_split = 0.1, n_estimators = 100, subsample = 1.0)))
models.append(("LightGBM", LGBMClassifier(random_state = 12345, learning_rate = 0.01,  max_depth = 3, n_estimators = 1000)))

results = []
names = []

With hyperparameter tuning, the models made better predictions than the base models. The tuned models are compared below with 10-fold cross-validation.

In [None]:
for name, model in models:
    cv_results = cross_val_score(model, X, y, cv = 10, scoring= "accuracy")
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
        
# boxplot algorithm comparison
fig = plt.figure(figsize=(15,10))
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results,
            vert=True, #vertical box alignment
            patch_artist=True) #fill with color
                         
ax.set_xticklabels(names)
plt.show()

# Conclusion

    - Machine learning models were built to predict whether a person has diabetes from the diagnostic measurements in the dataset.

    - The 3 classification models that performed best on the dataset were selected, tuned, and compared by their cross-validated accuracy. The compared models are Random Forests, gradient boosting (scikit-learn's GradientBoostingClassifier, labeled XGB in the code) and LightGBM.

    - In this comparison, LightGBM was the model that gave the best results.

# Resources

    - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html
    
    - https://github.com/omarozt/MachineLearningWorkshop
    
    - https://www.kaggle.com/ibrahimyildiz/pima-indians-diabetes-pred-0-9078-acc
    
    - https://seaborn.pydata.org/examples/color_palettes.html
        
    - Feature Engineering Made Easy, Sinan Ozdemir and Divya Susarla 