## Hi Kagglers! This is my first ever Kaggle notebook. Please let me know how I can improve it. Have fun!

# **Dataset** : Pima Indians Diabetes Database
# **Source** : National Institute of Diabetes and Digestive and Kidney Diseases
## **Objective** :  Diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset

# **Features**:
### *Pregnancies* : Number of times pregnant
### *Glucose* : Plasma glucose concentration at 2 hours in an oral glucose tolerance test
### *BloodPressure* : Diastolic blood pressure (mm Hg)
### *SkinThickness* : Triceps skin fold thickness (mm)
### *Insulin* : 2-Hour serum insulin (mu U/ml)
### *BMI* : Body mass index (weight in kg/(height in m)^2)
### *DiabetesPedigreeFunction* : Diabetes pedigree function
### *Age* : Age (years)

# **Target**:
### *Outcome* : Class variable (0 or 1) 268 of 768 are 1, the others are 0

# Importing libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import KFold,StratifiedKFold, GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report,roc_auc_score ,accuracy_score
from sklearn.svm import LinearSVC, SVC
from sklearn.tree import DecisionTreeClassifier
import warnings
import optuna
warnings.filterwarnings('ignore')

# Data Preprocessing

In [None]:
# Reading the dataset using Pandas and checking the head(first 5) of the dataset
df = pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv")
df.head()

In [None]:
# Checking the number of datapoints in each class
print(df['Outcome'].value_counts())
sns.countplot(x='Outcome', data=df)
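The class counts give a useful reference point for every accuracy figure later in the notebook. A minimal sketch, using only the counts stated in the dataset description (268 positives out of 768 rows), of what a model that always predicts the majority class would score:

```python
# Class counts as stated in the dataset description
n_total = 768
n_positive = 268
n_negative = n_total - n_positive

# Share of patients with Outcome == 1
positive_rate = n_positive / n_total

# Accuracy of a trivial model that always predicts the majority class (0)
baseline_accuracy = max(n_positive, n_negative) / n_total

print(f"Positive rate     : {positive_rate:.3f}")
print(f"Baseline accuracy : {baseline_accuracy:.3f}")
```

Any model worth keeping should beat this baseline accuracy by a clear margin.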

In [None]:
# Checking for missing values (NaN) with a seaborn heatmap of the null mask
plt.figure(figsize=(12,6))
sns.heatmap(df.isnull())

In [None]:
# Counting the zero entries in each feature (zeros can act as placeholders for missing values)
for i in df.drop('Outcome', axis=1).columns:
    print(i, df[df[i] == 0][i].count())

### Observation : Several features contain 0 values that are not physically plausible (e.g. Glucose or BMI), so they most likely mark missing data. Pregnancies can legitimately be 0, so we leave it unchanged and fill the zeros in the other affected features with each feature's median.

In [None]:
# First replace 0 with NaN, then fill NaN with the column median. The median is used instead of the mean because it is less sensitive to outliers.
to_process_features = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
df[to_process_features] = df[to_process_features].replace(0, np.nan)
for i in to_process_features:
    # Assign back instead of using inplace=True on a column selection, which may not update df in newer pandas
    df[i] = df[i].fillna(df[i].median())
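The medians above are computed on the full dataset, before any train/test split. One possible alternative, sketched here on a tiny made-up frame (the `toy` values and variable names are illustrative assumptions, not part of this notebook), is to learn the medians from the training rows only with scikit-learn's `SimpleImputer` and reuse them on the test rows:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Tiny made-up frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137, 116, 0, 115],
    "BMI": [33.6, 26.6, 23.3, 0.0, 43.1, 25.6, 31.0, 35.3],
})

# Treat zeros as missing values
toy = toy.replace(0, np.nan)

# Split first, so the test rows play no part in computing the medians
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=42)

# Learn the medians on the training rows, then apply them to both parts
imputer = SimpleImputer(strategy="median")
toy_train_filled = pd.DataFrame(
    imputer.fit_transform(toy_train), columns=toy.columns, index=toy_train.index
)
toy_test_filled = pd.DataFrame(
    imputer.transform(toy_test), columns=toy.columns, index=toy_test.index
)

print("Medians learned from the training rows:", imputer.statistics_)
```

This is the same fit-on-train, transform-on-test pattern that is normally used for scaling.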

In [None]:
df.isnull().sum()

In [None]:
# Plotting the dataset
df.hist()
plt.show()

In [None]:
# Plotting pairwise relationships in the dataset
sns.pairplot(df, hue='Outcome')

In [None]:
# Defining our features and target
X = df.drop('Outcome',axis=1)
y = df['Outcome']
# Splitting the dataset using train_test_split() in 80-20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)
# Using Standard Scaler to fit and transform the training data but only transforming the test data, so that no data leakage happens
scaler = StandardScaler()
# Wrapping the scaled arrays back into DataFrames and keeping the original column names
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

In [None]:
# Checking the distribution of the scaled train and test data
X_train_scaled.hist()
plt.show()

In [None]:
X_test_scaled.hist()
plt.show()

In [None]:
# Function to fit the model, predict on the test set and print the scores
def evaluation(model, X_train_scaled, y_train, X_test_scaled, y_test):
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    print(f"Accuracy Score : {accuracy_score(y_test, y_pred)}")
    print("*"*50)
    # Note: ROC AUC is computed here from hard class predictions, not from predicted probabilities
    print(f"\nRoc auc score : {roc_auc_score(y_test, y_pred)}")
    print("*"*50)
    print(f"\nConfusion Matrix : \n {confusion_matrix(y_test, y_pred)}")
    print("*"*50)
    print(f"\nClassification Report : \n {classification_report(y_test, y_pred)}")   
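The function above passes hard 0/1 predictions to `roc_auc_score`. For a binary problem that value equals the balanced accuracy, because the ROC curve collapses to a single threshold. ROC AUC is more commonly computed from scores such as `predict_proba`. A small sketch on synthetic data (the `*_demo` and `Xd_*` names and the `make_classification` settings are assumptions for illustration only):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a class balance loosely resembling this dataset
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.65, 0.35], random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(Xd_train, yd_train)

# AUC from hard labels: only one operating point of the ROC curve is used
hard_labels = clf.predict(Xd_test)
auc_from_labels = roc_auc_score(yd_test, hard_labels)

# AUC from probabilities of the positive class: uses the full ranking
auc_from_scores = roc_auc_score(yd_test, clf.predict_proba(Xd_test)[:, 1])

print(f"ROC AUC from hard labels   : {auc_from_labels:.4f}")
print(f"Balanced accuracy          : {balanced_accuracy_score(yd_test, hard_labels):.4f}")
print(f"ROC AUC from probabilities : {auc_from_scores:.4f}")
```

Models without `predict_proba` (for example `LinearSVC`) can supply `decision_function` scores instead.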

# Machine Learning models

In [None]:
# Logistic Regression
lr = LogisticRegression(random_state=42)
evaluation(lr, X_train_scaled, y_train, X_test_scaled, y_test)

In [None]:
# KNN
knn = KNeighborsClassifier()
evaluation(knn, X_train_scaled, y_train, X_test_scaled, y_test)

In [None]:
# Support Vector Classification
sv = SVC()
evaluation(sv, X_train_scaled, y_train, X_test_scaled, y_test)

In [None]:
# Linear SVC: 'hinge' is the standard SVM loss (used e.g. by the SVC class), while 'squared_hinge',
# the square of the hinge loss, is the default loss of LinearSVC
lrsv = LinearSVC(random_state=42)
evaluation(lrsv, X_train_scaled, y_train, X_test_scaled, y_test)

In [None]:
# Decision Tree
dt = DecisionTreeClassifier(random_state=42)
evaluation(dt, X_train_scaled, y_train, X_test_scaled, y_test)

In [None]:
# Random Forest
rf = RandomForestClassifier(random_state=42)
evaluation(rf, X_train_scaled, y_train, X_test_scaled, y_test)

In [None]:
# XGBoost classifier 
xgbc = XGBClassifier(eval_metric='logloss')
evaluation(xgbc, X_train_scaled, y_train, X_test_scaled, y_test)

# Observation:
### 1) The XGBoost classifier worked best overall.
### 2) Accuracy score: 0.7597 and ROC_AUC score: 0.7566
### 3) Its confusion matrix shows 14 False Negatives (Type II error), fewer than the other models, and 23 False Positives (Type I error)
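To read false negatives and false positives off a confusion matrix without guessing which cell is which, the four cells can be unpacked by name. A minimal sketch with made-up labels (the `y_true_demo` and `y_pred_demo` lists are illustrative, not results from this notebook):

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_demo = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1]

# For binary labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]],
# so ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

recall = tp / (tp + fn)       # share of actual positives that were found
precision = tp / (tp + fp)    # share of predicted positives that were correct

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Recall={recall:.2f} Precision={precision:.2f}")
```

For a diagnostic task, a false negative (a diabetic patient predicted as healthy) is usually the costlier mistake, so recall on class 1 deserves attention alongside accuracy.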

# Hyperparameter Tuning of the ML models

In [None]:
# Function for computing fit, predict and scores after using GridSearchCV()
def grid_evaluation(model, X_train_scaled, y_train, X_test_scaled, y_test):
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    print(f"Best Parameters : {model.best_params_}")
    print("*"*50)
    # Best Score: Mean cross-validated score of the best_estimator
    print(f"\nBest Score :  {model.best_score_}")
    print("*"*50)
    print(f"\nAccuracy Score (Train set) :  {model.score(X_train_scaled,y_train)}")
    print("*"*50)
    print(f"\nAccuracy Score (Test set): {accuracy_score(y_test, y_pred)}")
    print("*"*50)
    print(f"\nRoc auc score : {roc_auc_score(y_test, y_pred)}")
    print("*"*50)
    print(f"\nConfusion Matrix : \n {confusion_matrix(y_test, y_pred)}")
    print("*"*50)
    print(f"\nClassification Report : \n {classification_report(y_test, y_pred)}") 

In [None]:
# KNN tuned
param_grid = {'n_neighbors' : np.arange(1, 30, 2),
             'metric' : ['euclidean', 'minkowski', 'manhattan']}

best_param_knn = {'metric': ['euclidean'], 'n_neighbors': [25]}

knnt = KNeighborsClassifier()
grid_knnt = GridSearchCV(knnt, best_param_knn, scoring='accuracy', cv=10, refit=True)
grid_evaluation(grid_knnt, X_train_scaled, y_train, X_test_scaled, y_test)
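The tuned cells in this section pass a single-point grid (the `best_param_*` dictionaries) to `GridSearchCV`, so no actual search is run. A sketch of running the search over a whole grid, on synthetic data to keep it quick (the data, `demo_grid` and `search` are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data so the example runs on its own
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Every combination in the grid is evaluated with cross-validation
demo_grid = {
    "n_neighbors": list(range(1, 30, 2)),
    "metric": ["euclidean", "manhattan"],
}

search = GridSearchCV(
    KNeighborsClassifier(), demo_grid, scoring="accuracy", cv=5, refit=True
)
search.fit(X_demo, y_demo)

print("Combinations tried :", len(search.cv_results_["params"]))
print("Best parameters    :", search.best_params_)
print("Best CV accuracy   :", round(search.best_score_, 4))
```

With `refit=True`, `search` can be used directly for prediction, since it is refitted on all the data passed to `fit` using the best parameters.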

In [None]:
# SVC tuned
param_grid = {'C': [0.01, 0.1, 1, 10, 100], 
              'gamma': [1,0.1,0.01,0.001],
              'kernel': ['rbf']}

best_param_svc = [{'C': [100], 'gamma': [0.001], 'kernel': ['rbf']}]

svct = SVC()
grid_svct = GridSearchCV(svct, best_param_svc, scoring='accuracy', cv=10, refit=True)
grid_evaluation(grid_svct, X_train_scaled, y_train, X_test_scaled, y_test)

In [None]:
# Decision Tree tuned
# "auto" is no longer accepted for max_features in recent scikit-learn versions,
# and min_weight_fraction_leaf must lie in [0, 0.5]
param_grid = {"splitter":["best","random"],
            "max_depth" : [1,3,5,7,9,11,12],
            "min_samples_leaf":[1,2,3,4,5,6,7,8,9,10],
            "min_weight_fraction_leaf":[0.1,0.2,0.3,0.4,0.5],
            "max_features":["log2","sqrt",None],
            "max_leaf_nodes":[None,10,20,30,40,50,60,70,80,90]}

best_param_dt = {'max_depth': [5],
                 "max_features":[None],
                 "max_leaf_nodes":[None],
                 'min_samples_leaf': [1], 
                 'min_weight_fraction_leaf': [0.1], 
                 'splitter': ['best']}

dt = DecisionTreeClassifier(random_state=42)
grid_dt = GridSearchCV(dt, best_param_dt ,scoring='accuracy',cv = 10,refit = True)
grid_evaluation(grid_dt, X_train_scaled, y_train, X_test_scaled, y_test)

In [None]:
# Random Forest tuned
param_grid = {'n_estimators':[200,500,1000,2000],
              'max_depth':[2,3,4,5]
              }
best_param_rf = {'max_depth': [3], 'n_estimators': [2000]}

rfct = RandomForestClassifier(random_state=42, n_jobs=-1)
grid_rfct = GridSearchCV(rfct, best_param_rf ,scoring='accuracy',cv = 10,refit = True)
grid_evaluation(grid_rfct, X_train_scaled, y_train, X_test_scaled, y_test)

In [None]:
# Using Optuna for hyperparameter tuning of XGBoost. Uncomment and run to search for the optimum parameters
# # def objective(trial):
    
# #     learning_rate = trial.suggest_float("learning_rate", 1e-2, 0.25, log=True)
# #     reg_lambda = trial.suggest_float("reg_lambda", 1e-8, 100.0, log=True)
# #     reg_alpha = trial.suggest_float("reg_alpha", 1e-8, 100.0, log=True)
# #     subsample = trial.suggest_float("subsample", 0.1, 1.0)
# #     colsample_bytree = trial.suggest_float("colsample_bytree", 0.1, 1.0)
# #     max_depth = trial.suggest_int("max_depth", 1, 7)
# #     n_estimators = trial.suggest_int("n_estimators", 50, 2000, step=50)

# #     model = XGBClassifier(eval_metric='logloss',
# #         random_state=42,
# #         tree_method="gpu_hist",
# #         gpu_id=0,
# #         predictor="gpu_predictor",
# #         n_estimators = n_estimators,
# #         learning_rate=learning_rate,
# #         reg_lambda=reg_lambda,
# #         reg_alpha=reg_alpha,
# #         subsample=subsample,
# #         colsample_bytree=colsample_bytree,
# #         max_depth=max_depth,
# #     )
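# #     model.fit(X_train_scaled, y_train, verbose=1)
# #     # Note: scoring each trial on the test set leaks test data into the tuning;
# #     # a separate validation split or cross-validation on the training data is safer
# #     preds = model.predict(X_test_scaled)
# #     accuracy = accuracy_score(y_test, preds)
# #     return accuracy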

# # study = optuna.create_study(direction='maximize')
# # study.optimize(objective, n_trials=1000)
# # print("Number of finished trials: ", len(study.trials))
# # print("Best trial:")
# # trial = study.best_trial

# # print("  Value: {}".format(trial.value))
# # print("  Params: ")
# # for key, value in trial.params.items():
# #     print("    {}: {}".format(key, value))

In [None]:
# XGBoost tuned
learning_rate= 0.1135930253853376
reg_lambda= 0.0015187772228404815
reg_alpha= 3.1569434136364856e-08
subsample= 0.19543620768271805
colsample_bytree= 0.9783970896407462
max_depth= 1
n_estimators= 100

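# The GPU settings below (gpu_hist, gpu_id, predictor) target XGBoost versions before 2.0;
# from 2.0 on, tree_method="hist" together with device="cuda" is the replacement
xgb = XGBClassifier(eval_metric='logloss',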
        random_state=42,
        tree_method="gpu_hist",
        gpu_id=0,
        predictor="gpu_predictor",
        n_estimators = n_estimators,
        learning_rate=learning_rate,
        reg_lambda=reg_lambda,
        reg_alpha=reg_alpha,
        subsample=subsample,
        colsample_bytree=colsample_bytree,
        max_depth=max_depth
    )

evaluation(xgb, X_train_scaled, y_train, X_test_scaled, y_test)

# Observation:
### 1) XGBoost works best this time as well, with a higher accuracy and ROC AUC score than before tuning
### 2) XGBoost: Accuracy = 0.8052 and Roc_auc = 0.7838
### 3) XGBoost has 16 False Negatives and 14 False Positives: false negatives rose slightly from 14, while false positives fell from 23, giving 30 errors in total instead of 37.
### 4) Random Forest also improved. Accuracy = 0.7922 and Roc_auc = 0.7495

# Stacking

In [None]:
# Stacking two best models RandomForestClassifier() and XGBClassifier() to see if performance increases
estimators_list = [
    ('rf', grid_rfct),
    ('xgb', xgb)]

stack_model = StackingClassifier(estimators = estimators_list, final_estimator=LogisticRegression())

stack_model.fit(X_train_scaled, y_train)
y_train_pred = stack_model.predict(X_train_scaled)

y_test_pred = stack_model.predict(X_test_scaled)

print(f"Accuracy Score (Train set) : {accuracy_score(y_train, y_train_pred)}")
print("*"*50)
print(f"\nAccuracy Score (Test set) : {accuracy_score(y_test, y_test_pred)}")
print("*"*50)
print(f"\nRoc auc score : {roc_auc_score(y_test, y_test_pred)}")
print("*"*50)
print(f"\nConfusion Matrix : \n {confusion_matrix(y_test, y_test_pred)}")
print("*"*50)
print(f"\nClassification Report : \n {classification_report(y_test, y_test_pred)}") 

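# Observation:
### After Stacking: Accuracy = 0.7922 and Roc_auc = 0.7957. The accuracy is lower than that of the tuned XGBClassifier() (0.8052), although the Roc_auc reported here is slightly higher than its 0.7838.
### So, judged by accuracy, the hyperparameter-tuned XGBClassifier is the best model among those tried.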
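All the scores above come from a single 80-20 split, so the test set holds only about 154 rows, and a gap of 0.01 in accuracy corresponds to one or two patients. A possible way to compare models more robustly is stratified k-fold cross-validation. The sketch below uses synthetic data and two stand-in models (the data, the `candidates` dictionary and its model choices are assumptions for illustration, not the tuned models from this notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with a class balance loosely resembling this dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.65, 0.35], random_state=42
)

# Stratified folds keep the class ratio roughly equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Stand-in candidates; scaling sits inside the pipeline so it is refit on each training fold
candidates = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "rf": RandomForestClassifier(n_estimators=100, random_state=42),
}

cv_results = {}
for name, model in candidates.items():
    # scoring="roc_auc" ranks by predicted scores rather than hard labels
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: ROC AUC = {scores.mean():.4f} +/- {scores.std():.4f}")
```

If two models differ by less than the spread across folds, the ranking between them is probably not reliable.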

## Thank you. Leave an upvote if you liked my notebook. Leave a suggestion if I can improve it.