## About Dataset

Context

Diabetes is among the most prevalent chronic diseases in the United States, impacting millions of Americans each year and exerting a significant financial burden on the economy. Diabetes is a serious chronic disease in which individuals lose the ability to effectively regulate levels of glucose in the blood, and can lead to reduced quality of life and life expectancy. After different foods are broken down into sugars during digestion, the sugars are then released into the bloodstream. This signals the pancreas to release insulin. Insulin helps enable cells within the body to use those sugars in the bloodstream for energy. Diabetes is generally characterized by either the body not making enough insulin or being unable to use the insulin that is made as effectively as needed.

Complications like heart disease, vision loss, lower-limb amputation, and kidney disease are associated with chronically high levels of sugar remaining in the bloodstream for those with diabetes. While there is no cure for diabetes, strategies like losing weight, eating healthily, being active, and receiving medical treatments can mitigate the harms of this disease in many patients. Early diagnosis can lead to lifestyle changes and more effective treatment, making predictive models for diabetes risk important tools for the public and for public health officials.

The scale of this problem is also important to recognize. The Centers for Disease Control and Prevention has indicated that as of 2018, 34.2 million Americans have diabetes and 88 million have prediabetes. Furthermore, the CDC estimates that 1 in 5 diabetics and roughly 8 in 10 prediabetics are unaware of their risk. While there are different types of diabetes, type II diabetes is the most common form, and its prevalence varies by age, education, income, location, race, and other social determinants of health. Much of the burden of the disease also falls on those of lower socioeconomic status. Diabetes places a massive burden on the economy as well, with diagnosed diabetes costing roughly $327 billion and total costs, including undiagnosed diabetes and prediabetes, approaching $400 billion annually.

Content

The Behavioral Risk Factor Surveillance System (BRFSS) is a health-related telephone survey that is conducted annually by the CDC. Each year, the survey collects responses from over 400,000 Americans on health-related risk behaviors, chronic health conditions, and the use of preventative services. It has been conducted every year since 1984. For this project, a CSV file of the 2015 dataset available on Kaggle was used. This original dataset contains responses from 441,455 individuals and has 330 features. These features are either questions directly asked of participants, or calculated variables based on individual participant responses.

This dataset contains 3 files:

diabetes_012_health_indicators_BRFSS2015.csv is a clean dataset of 253,680 survey responses to the CDC's BRFSS2015. The target variable Diabetes_012 has 3 classes. 0 is for no diabetes or only during pregnancy, 1 is for prediabetes, and 2 is for diabetes. There is class imbalance in this dataset. This dataset has 21 feature variables.
diabetes_binary_5050split_health_indicators_BRFSS2015.csv is a clean dataset of 70,692 survey responses to the CDC's BRFSS2015. It has an equal 50-50 split of respondents with no diabetes and with either prediabetes or diabetes. The target variable Diabetes_binary has 2 classes. 0 is for no diabetes, and 1 is for prediabetes or diabetes. This dataset has 21 feature variables and is balanced.
diabetes_binary_health_indicators_BRFSS2015.csv is a clean dataset of 253,680 survey responses to the CDC's BRFSS2015. The target variable Diabetes_binary has 2 classes. 0 is for no diabetes, and 1 is for prediabetes or diabetes. This dataset has 21 feature variables and is not balanced.

Explore some of the following research questions:

Can survey questions from the BRFSS provide accurate predictions of whether an individual has diabetes?
What risk factors are most predictive of diabetes risk?
Can we use a subset of the risk factors to accurately predict whether an individual has diabetes?
Can we create a short form of questions from the BRFSS using feature selection to accurately predict if someone might have diabetes or is at high risk of diabetes?
Acknowledgements

It is important to reiterate that I did not create this dataset; it is just a cleaned and consolidated dataset created from the BRFSS 2015 dataset already on Kaggle. That dataset can be found here and the notebook I used for the data cleaning can be found here.

Inspiration

The paper "Building Risk Prediction Models for Type 2 Diabetes Using Machine Learning Techniques" by Zidian Xie et al., which uses the 2014 BRFSS, was the inspiration for creating this dataset and exploring the BRFSS in general. Link

In [50]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rcParams
import plotly.express as px
import numpy as np
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

In [51]:
df = pd.read_csv('csv\\diabetes.csv')

In [52]:
df.isnull().sum()

Diabetes_012            0
HighBP                  0
HighChol                0
CholCheck               0
BMI                     0
Smoker                  0
Stroke                  0
HeartDiseaseorAttack    0
PhysActivity            0
Fruits                  0
Veggies                 0
HvyAlcoholConsump       0
AnyHealthcare           0
NoDocbcCost             0
GenHlth                 0
MentHlth                0
PhysHlth                0
DiffWalk                0
Sex                     0
Age                     0
Education               0
Income                  0
dtype: int64

In [53]:
df.columns

Index(['Diabetes_012', 'HighBP', 'HighChol', 'CholCheck', 'BMI', 'Smoker',
       'Stroke', 'HeartDiseaseorAttack', 'PhysActivity', 'Fruits', 'Veggies',
       'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost', 'GenHlth',
       'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age', 'Education',
       'Income'],
      dtype='object')

In [54]:
input_cols = ['HighBP', 'HighChol', 'CholCheck', 'BMI', 'Smoker',
       'Stroke', 'HeartDiseaseorAttack', 'PhysActivity', 'Fruits', 'Veggies',
       'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost', 'GenHlth',
       'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age', 'Education',
       'Income']

target_col = 'Diabetes_012'

numeric_cols=  ['HighBP', 'HighChol', 'CholCheck', 'BMI', 'Smoker',
       'Stroke', 'HeartDiseaseorAttack', 'PhysActivity', 'Fruits', 'Veggies',
       'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost', 'GenHlth',
       'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age', 'Education',
       'Income']

In [55]:
train_inputs = df[input_cols].copy()
train_targets = df[target_col].copy()

In [56]:
train_inputs

Unnamed: 0,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,Veggies,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,1.0,1.0,1.0,40.0,1.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,5.0,18.0,15.0,1.0,0.0,9.0,4.0,3.0
1,0.0,0.0,0.0,25.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,3.0,0.0,0.0,0.0,0.0,7.0,6.0,1.0
2,1.0,1.0,1.0,28.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,1.0,5.0,30.0,30.0,1.0,0.0,9.0,4.0,8.0
3,1.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,11.0,3.0,6.0
4,1.0,1.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,1.0,...,1.0,0.0,2.0,3.0,0.0,0.0,0.0,11.0,5.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
253675,1.0,1.0,1.0,45.0,0.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,3.0,0.0,5.0,0.0,1.0,5.0,6.0,7.0
253676,1.0,1.0,1.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,4.0,0.0,0.0,1.0,0.0,11.0,2.0,4.0
253677,0.0,0.0,1.0,28.0,0.0,0.0,0.0,1.0,1.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,5.0,2.0
253678,1.0,0.0,1.0,23.0,0.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,3.0,0.0,0.0,0.0,1.0,7.0,5.0,1.0


In [57]:
train_targets

0         0.0
1         0.0
2         0.0
3         0.0
4         0.0
         ... 
253675    0.0
253676    2.0
253677    0.0
253678    0.0
253679    2.0
Name: Diabetes_012, Length: 253680, dtype: float64

In [58]:
# from sklearn.preprocessing import MinMaxScaler

# scaler = MinMaxScaler()
# scaler.fit(train_inputs[numeric_cols])

# train_inputs[numeric_cols] = scaler.transform(train_inputs[numeric_cols])
# train_inputs
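
If the commented-out scaling step above is re-enabled, one option is to fit the scaler on the training rows only and reuse it for the validation rows, so that validation statistics do not leak into training. The sketch below shows the idea on made-up values; the toy frames and their two columns are illustrative stand-ins, not the notebook's data.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins for the training and validation inputs (values are made up)
toy_train = pd.DataFrame({'BMI': [20.0, 30.0, 40.0], 'Age': [1.0, 7.0, 13.0]})
toy_val = pd.DataFrame({'BMI': [25.0, 45.0], 'Age': [4.0, 13.0]})

scaler = MinMaxScaler()
scaler.fit(toy_train)  # learn per-column min and max from the training rows only

toy_train_scaled = pd.DataFrame(scaler.transform(toy_train), columns=toy_train.columns)
toy_val_scaled = pd.DataFrame(scaler.transform(toy_val), columns=toy_val.columns)

print(toy_train_scaled)
print(toy_val_scaled)  # values above 1 appear when a validation value exceeds the training max
```

Scaling mainly matters for LogisticRegression here; the tree-based models used later split on thresholds and are largely insensitive to it.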

In [59]:
from sklearn.model_selection import train_test_split
X_train, X_val, train_targets, val_targets = train_test_split(train_inputs, train_targets, test_size=0.20, random_state=42)
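
Because the target classes are imbalanced, a stratified split may be worth considering so that each class keeps the same share in the training and validation parts. This is a sketch on synthetic labels with made-up proportions, not a change to the split used in this notebook; `stratify` is a standard `train_test_split` argument.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 840 rows of class 0, 20 of class 1, 140 of class 2 (made-up proportions)
y = np.array([0] * 840 + [1] * 20 + [2] * 140)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Share of each class in the two parts
train_share = np.bincount(y_tr) / len(y_tr)
val_share = np.bincount(y_va) / len(y_va)
print(train_share)
print(val_share)
```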

In [60]:
X_train

Unnamed: 0,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,Veggies,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
31141,0.0,1.0,1.0,20.0,1.0,0.0,0.0,1.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,1.0,12.0,6.0,8.0
98230,0.0,0.0,1.0,34.0,0.0,0.0,0.0,1.0,0.0,1.0,...,1.0,0.0,3.0,0.0,0.0,0.0,1.0,8.0,5.0,8.0
89662,1.0,1.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,1.0,...,1.0,0.0,2.0,0.0,5.0,0.0,1.0,12.0,5.0,6.0
208255,0.0,1.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,1.0,...,1.0,0.0,1.0,0.0,0.0,0.0,1.0,5.0,6.0,7.0
233415,0.0,1.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,1.0,...,1.0,0.0,3.0,0.0,0.0,1.0,0.0,12.0,4.0,6.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119879,1.0,0.0,1.0,45.0,1.0,0.0,0.0,1.0,1.0,0.0,...,1.0,1.0,1.0,15.0,0.0,0.0,0.0,5.0,4.0,1.0
103694,1.0,1.0,1.0,29.0,1.0,0.0,0.0,1.0,0.0,1.0,...,1.0,0.0,3.0,0.0,0.0,0.0,1.0,11.0,6.0,7.0
131932,0.0,1.0,1.0,25.0,0.0,0.0,0.0,1.0,1.0,1.0,...,1.0,0.0,2.0,0.0,3.0,0.0,0.0,9.0,6.0,8.0
146867,0.0,0.0,0.0,23.0,0.0,0.0,0.0,0.0,1.0,1.0,...,1.0,1.0,2.0,0.0,0.0,0.0,0.0,5.0,6.0,6.0


In [61]:
train_targets

31141     0.0
98230     0.0
89662     2.0
208255    0.0
233415    0.0
         ... 
119879    0.0
103694    0.0
131932    0.0
146867    0.0
121958    2.0
Name: Diabetes_012, Length: 202944, dtype: float64

In [62]:
X_val

Unnamed: 0,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,Veggies,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
219620,0.0,0.0,1.0,21.0,0.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,3.0,3.0,7.0,0.0,0.0,7.0,4.0,2.0
132821,1.0,1.0,1.0,28.0,0.0,0.0,0.0,1.0,1.0,1.0,...,1.0,0.0,3.0,0.0,0.0,0.0,0.0,13.0,6.0,6.0
151862,0.0,0.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,1.0,...,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,4.0,7.0
139717,0.0,0.0,1.0,27.0,1.0,0.0,0.0,1.0,0.0,1.0,...,1.0,0.0,2.0,3.0,0.0,0.0,1.0,2.0,4.0,7.0
239235,0.0,1.0,1.0,31.0,1.0,0.0,0.0,0.0,1.0,1.0,...,1.0,1.0,4.0,27.0,27.0,1.0,0.0,8.0,3.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
169513,1.0,0.0,1.0,29.0,1.0,0.0,0.0,1.0,1.0,1.0,...,1.0,0.0,3.0,0.0,10.0,0.0,0.0,9.0,6.0,7.0
182415,0.0,0.0,1.0,25.0,0.0,0.0,0.0,1.0,1.0,1.0,...,1.0,0.0,2.0,1.0,10.0,0.0,0.0,10.0,5.0,8.0
109739,0.0,1.0,1.0,28.0,0.0,0.0,0.0,1.0,1.0,1.0,...,1.0,0.0,3.0,3.0,0.0,0.0,1.0,6.0,6.0,8.0
181671,0.0,0.0,1.0,24.0,1.0,0.0,0.0,0.0,0.0,1.0,...,1.0,1.0,4.0,0.0,0.0,0.0,1.0,13.0,4.0,5.0


In [63]:
val_targets

219620    0.0
132821    0.0
151862    0.0
139717    0.0
239235    0.0
         ... 
169513    2.0
182415    0.0
109739    0.0
181671    0.0
202118    0.0
Name: Diabetes_012, Length: 50736, dtype: float64

## RandomForest

In [64]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score,confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

In [65]:
def evalmodel(model, **params):
    """Fit model(**params) on the training split and report train/validation accuracy."""
    best_model = model(**params).fit(X_train, train_targets)
    train_preds = best_model.predict(X_train)
    val_preds = best_model.predict(X_val)
    return {
        'Model': model,
        # accuracy_score expects (y_true, y_pred)
        'train_score': f'{round(accuracy_score(train_targets, train_preds) * 100, 2)}%',
        'val_score': f'{round(accuracy_score(val_targets, val_preds) * 100, 2)}%',
    }

In [66]:
evalmodel(DecisionTreeClassifier)

{'Model': sklearn.tree._classes.DecisionTreeClassifier,
 'train_score': '99.33%',
 'val_score': '76.84%'}
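
Accuracy alone can hide how a model treats the rarer classes. As a rough illustration on made-up labels (not this dataset), the sketch below computes per-class recall and its unweighted mean by hand; `per_class_recall` is a hypothetical helper written for this example. scikit-learn's `confusion_matrix` and `balanced_accuracy_score` provide the same kind of information.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Return {class: fraction of that class's true rows that were predicted correctly}."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}

# Made-up labels: 8 rows of class 0, 1 row of class 1, 1 row of class 2
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]
y_pred = [0] * 10  # a model that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recalls = per_class_recall(y_true, y_pred)
balanced_accuracy = sum(recalls.values()) / len(recalls)

print(accuracy)           # 0.8
print(recalls)            # {0: 1.0, 1: 0.0, 2: 0.0}
print(balanced_accuracy)  # one third: only one of three classes is ever recovered
```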

In [67]:
evalmodel(LogisticRegression)

{'Model': sklearn.linear_model._logistic.LogisticRegression,
 'train_score': '84.34%',
 'val_score': '84.5%'}

In [68]:
evalmodel(RandomForestClassifier)

{'Model': sklearn.ensemble._forest.RandomForestClassifier,
 'train_score': '99.32%',
 'val_score': '84.24%'}

In [69]:
from xgboost import XGBClassifier
evalmodel(XGBClassifier)

{'Model': xgboost.sklearn.XGBClassifier,
 'train_score': '85.99%',
 'val_score': '85.04%'}

In [70]:
from lightgbm import LGBMClassifier
evalmodel(LGBMClassifier)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.009618 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 202
[LightGBM] [Info] Number of data points in the train set: 202944, number of used features: 21
[LightGBM] [Info] Start training from score -0.171805
[LightGBM] [Info] Start training from score -4.008117
[LightGBM] [Info] Start training from score -1.968338


{'Model': lightgbm.sklearn.LGBMClassifier,
 'train_score': '85.31%',
 'val_score': '85.09%'}
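
The three "Start training from score" lines in the log above appear to be the natural logarithms of the class shares in the training split, one per class. Under that assumption (it is an interpretation of the log, not something the notebook states), the shares can be recovered with `exp`, which also gives the accuracy of a model that always predicts the most common class.

```python
import math

# Initial scores copied from the LightGBM log above, one per class (0, 1, 2)
init_scores = [-0.171805, -4.008117, -1.968338]

# Assumption: each score is the natural log of that class's share of the training rows
class_shares = [math.exp(score) for score in init_scores]

for label, share in enumerate(class_shares):
    print(f"class {label}: {share:.1%}")
print(f"sum of shares: {sum(class_shares):.4f}")

# Accuracy obtained by always predicting the most common class
majority_baseline = max(class_shares)
print(f"majority-class baseline: {majority_baseline:.1%}")
```

If the interpretation holds, the shares sum to about 1 and the majority-class baseline is roughly 84%, which is useful context for the validation scores of 84% to 85% reported in this notebook.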

## Manual Hyper Parameter Tuning

In [71]:
params= {          
            'random_state' : 69,
            'n_jobs': -1,
            'max_depth': 10,
            'n_estimators': 200,                  
}
evalmodel(RandomForestClassifier,**params)

{'Model': sklearn.ensemble._forest.RandomForestClassifier,
 'train_score': '85.13%',
 'val_score': '84.93%'}

In [72]:
from hyperopt import hp,fmin,tpe,STATUS_OK,Trials

{'criterion': 2,
 'max_depth': 5,
 'min_samples_leaf': 0.4984700988274807,
 'min_samples_split': 0.5286996901950891,
 'n_estimator': 3}

In [73]:
params= {          
            'random_state' : 69,
            'n_jobs': -1,
            'max_depth': 15,
            'criterion' : 'gini',
            'min_samples_leaf': 0.4984700988274807,
            'min_samples_split': 0.5286996901950891,
            'n_estimators': 350,                  
}
evalmodel(RandomForestClassifier,**params)

{'Model': sklearn.ensemble._forest.RandomForestClassifier,
 'train_score': '84.21%',
 'val_score': '84.35%'}

## LightGBM

In [74]:
from lightgbm import LGBMClassifier

## SPACE 1
space = {
    'learning_rate':    hp.uniform('learning_rate',0.1,1),

    'max_depth':        hp.choice('max_depth',        np.arange(2, 100, 1, dtype=int)),

    'min_child_weight': hp.choice('min_child_weight', np.arange(1, 50, 1, dtype=int)),

    'colsample_bytree': hp.uniform('colsample_bytree',0.4,1),

    'subsample':        hp.uniform('subsample', 0.6, 1),

    'num_leaves':       hp.choice('num_leaves',       np.arange(2, 200, 1, dtype=int)),

    'min_split_gain':   hp.uniform('min_split_gain', 0, 1),

    'reg_alpha':        hp.uniform('reg_alpha',0,1),

    'reg_lambda':       hp.uniform('reg_lambda',0,1),

    'n_estimators':     400
}

{'colsample_bytree': 0.7307032857576866,

 'learning_rate': 0.10738152629339512,

 'max_depth': 36,

 'min_child_weight': 23,

 'min_split_gain': 0.2528786101471297,

 'num_leaves': 33,

 'reg_alpha': 0.12465033908508727,

 'reg_lambda': 0.9611554948975123,
 
 'subsample': 0.8281005540944155}

## SPACE 2

space = {
    'learning_rate':    hp.uniform('learning_rate',0.010738152629339512,0.030058749798),

    'max_depth':        hp.choice('max_depth',        np.arange(30, 50, 1, dtype=int)),

    'min_child_weight': hp.choice('min_child_weight', np.arange(28, 70, 1, dtype=int)),

    'colsample_bytree': hp.uniform('colsample_bytree',0.44076,0.66847),

    'subsample':        hp.uniform('subsample', 0.6281005540944155, 0.9281005540944155),

    'num_leaves':       hp.choice('num_leaves',       np.arange(30, 45, 1, dtype=int)),

    'min_split_gain':   hp.uniform('min_split_gain', 0.23, 0.65),

    'reg_alpha':        hp.uniform('reg_alpha',0.86113,1),

    'reg_lambda':       hp.uniform('reg_lambda',0.6281005,1),

    'n_estimators':     hp.choice('n_estimators', np.arange(350,600,1,dtype=int))
}

{'colsample_bytree': 0.50846296990002,

 'learning_rate': 0.01785493532879305,

 'max_depth': 11,

 'min_child_weight': 25,

 'min_split_gain': 0.2561298152675693,

 'n_estimators': 180,

 'num_leaves': 14,

 'reg_alpha': 0.9063404866626565,

 'reg_lambda': 0.9419746238500906,

 'subsample': 0.814215896898399}

In [75]:
space = {
    'learning_rate':    hp.uniform('learning_rate',0.010738152629339512,0.030058749798),
    'max_depth':        hp.choice('max_depth',        np.arange(30, 50, 1, dtype=int)),
    'min_child_weight': hp.choice('min_child_weight', np.arange(28, 70, 1, dtype=int)),
    'colsample_bytree': hp.uniform('colsample_bytree',0.44076,0.66847),
    'subsample':        hp.uniform('subsample', 0.6281005540944155, 0.9281005540944155),
    'num_leaves':       hp.choice('num_leaves',       np.arange(30, 45, 1, dtype=int)),
    'min_split_gain':   hp.uniform('min_split_gain', 0.23, 0.65),
    'reg_alpha':        hp.uniform('reg_alpha',0.86113,1),
    'reg_lambda':       hp.uniform('reg_lambda',0.6281005,1),
    'n_estimators':     hp.choice('n_estimators', np.arange(350,600,1,dtype=int)),
    "bagging_freq": hp.choice('bagging_freq', np.arange(1,10,1,dtype=int)), 
    "bagging_fraction": hp.uniform('bagging_fraction',0.50,0.90)
}

def objective(space):
    model = LGBMClassifier(n_jobs=-1,**space)
    accuracy = cross_val_score(model, X_train,train_targets, cv = 5).mean()

    # We aim to maximize accuracy, therefore we return it as a negative value
    return -accuracy

In [76]:
# trials = Trials()
# best = fmin(fn= objective,
#             space= space,
#             algo= tpe.suggest,
#             max_evals = 60,
#             trials= trials)
# best
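
One thing to watch when reading `best`: hyperopt's `fmin` reports parameters defined with `hp.choice` as the index of the chosen option, not the option itself, and `hyperopt.space_eval(space, best)` maps the indices back to values. The sketch below does that mapping by hand for the SPACE 2 ranges, assuming the values listed under SPACE 2 came from `fmin` and are therefore indices (the reported `max_depth` of 11 lies outside the 30 to 49 range, which is consistent with this).

```python
# Option lists matching the hp.choice definitions in SPACE 2
# (range(a, b) holds the same values as np.arange(a, b, 1, dtype=int))
choices = {
    'max_depth': range(30, 50),
    'min_child_weight': range(28, 70),
    'num_leaves': range(30, 45),
    'n_estimators': range(350, 600),
}

# Values listed under SPACE 2, treated here as indices (an assumption)
reported = {'max_depth': 11, 'min_child_weight': 25, 'num_leaves': 14, 'n_estimators': 180}

# Look up the option each index points to
decoded = {name: choices[name][index] for name, index in reported.items()}
print(decoded)
```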

In [77]:
params={
        'n_jobs' : -1,
        'random_state':68,
        'colsample_bytree': 0.50846296990002,
        'learning_rate': 0.03785493532879305,
        'max_depth': 2**20,
        'min_child_weight': 25,
        'min_split_gain': 0.2561298152675693,
        'n_estimators': 180,
        'num_leaves': 20,
        'reg_alpha': 0.09963404866626565,
        'reg_lambda': 0.9419746238500906,
        'subsample': 0.784215896898399,
        "bagging_freq": 5, 
        "bagging_fraction": 0.75,
}

evalmodel(LGBMClassifier,**params)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.008706 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 202
[LightGBM] [Info] Number of data points in the train set: 202944, number of used features: 21
[LightGBM] [Info] Start training from score -0.171805
[LightGBM] [Info] Start training from score -4.008117
[LightGBM] [Info] Start training from score -1.968338


{'Model': lightgbm.sklearn.LGBMClassifier,
 'train_score': '85.08%',
 'val_score': '85.1%'}
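
A note on the very large `max_depth` used above: a binary tree with a fixed number of leaves cannot be deeper than one split per leaf, so once `num_leaves` is set, any `max_depth` beyond that bound has no further effect (LightGBM also treats `max_depth <= 0` as no limit). A small sketch of the arithmetic, with `depth_bounds` as a hypothetical helper:

```python
import math

def depth_bounds(num_leaves):
    """Smallest and largest depth a binary tree with num_leaves leaves can have."""
    min_depth = math.ceil(math.log2(num_leaves))  # perfectly balanced tree
    max_depth = num_leaves - 1                    # one long chain of splits
    return min_depth, max_depth

# num_leaves values that appear in this notebook
for leaves in (20, 31, 43):
    print(leaves, depth_bounds(leaves))
```

With `num_leaves=20`, for example, no tree can exceed depth 19, so `max_depth=2**20` behaves like an unlimited depth.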

## Hyper Parameter Tuning using Optuna

## Trial#1

params={
        'n_jobs' : -1,

        'random_state':68,

        'colsample_bytree': 0.50846296990002,

        'learning_rate': 0.021961346754085575,

        'max_depth': 2**31,

        'min_child_weight': 25,

        'min_split_gain': 0.3053041848061701,

        'n_estimators': 383,

        'min_child_samples': 10,

        'num_leaves': 31,

        'lambda_l1': 0.38499061611697205,

        'lambda_l2': 0.09216828113298257,

        'subsample': 0.784215896898399,

        "bagging_freq": 2, 
        
        'bagging_fraction': 0.9786503969324173
}

In [78]:
import optuna
def objective(trial):

    param = {
        "objective": "multiclass",
        "metric": "multi_logloss",
        "verbosity": -1,
        "boosting_type": "gbdt",
        "lambda_l1": trial.suggest_float("lambda_l1", 0.2344,0.64897, log=True),
        "lambda_l2": trial.suggest_float("lambda_l2", 0.0821,0.98763, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 32,64),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.756271, 1.0),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.8, 1.0),
        "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
        "min_child_samples": trial.suggest_int("min_child_samples", 10,50),
        "n_estimators": trial.suggest_int("n_estimators", 300,600),
        'max_depth': -1,  # no depth limit in LightGBM; tree size is controlled by num_leaves
        'min_split_gain': trial.suggest_float("min_split_gain", 1e-1, 1, log=True),
        'learning_rate': trial.suggest_float("learning_rate", 0.01, 0.05, log=True),
    }
    model = LGBMClassifier(n_jobs=-1,**param)
    accuracy = cross_val_score(model, X_train,train_targets, cv = 5).mean()
    return accuracy

In [79]:
# study = optuna.create_study(direction="maximize")
# study.optimize(objective, n_trials=150)

# print("Number of finished trials: {}".format(len(study.trials)))

# print("Best trial:")
# trial = study.best_trial

# print("  Value: {}".format(trial.value))
# trial.params

In [80]:
#trial.params

In [81]:
params={
        'n_jobs' : -1,
        'random_state':42,
        'colsample_bytree': 0.50846296990002,
        'learning_rate': 0.0339636787391576,
        'max_depth': 2**43,
        'min_child_weight': 27,
        'min_split_gain': 0.7032416584200394,
        'n_estimators': 515,
        'min_child_samples': 27,
        'num_leaves': 43,
        'lambda_l1': 0.34020299510463375,
        'lambda_l2': 0.9176514188818733,
        'subsample': 0.784215896898399,
        "bagging_freq": 5, 
        'bagging_fraction': 0.826303189778174,
}

evalmodel(LGBMClassifier,**params)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.010923 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 202
[LightGBM] [Info] Number of data points in the train set: 202944, number of used features: 21
[LightGBM] [Info] Start training from score -0.171805
[LightGBM] [Info] Start training from score -4.008117
[LightGBM] [Info] Start training from score -1.968338


{'Model': lightgbm.sklearn.LGBMClassifier,
 'train_score': '85.2%',
 'val_score': '85.12%'}

In [82]:
# import plotly.express as px
# fig = optuna.visualization.plot_optimization_history(study)
# fig.update_layout( width=900, height=600)
# fig.show()

<img src='files\\Optimization Plots\\optimization history plot.png'>

In [83]:
# import botorch
# from optuna.visualization import plot_terminator_improvement
# fig = plot_terminator_improvement(study, plot_error=True)
# fig.update_layout( width=900, height=600)
# fig.show()

## XGBOOST

In [84]:
evalmodel(XGBClassifier)

{'Model': xgboost.sklearn.XGBClassifier,
 'train_score': '85.99%',
 'val_score': '85.04%'}

In [85]:
params = {
    'max_depth': 200,
    'n_estimators' : 350,
    'eta': 0.019,
    'gamma': 4,
    'min_child_weight':10,
    'subsample':0.4344,
    'alpha':  1,
    'lambda' : 2,
    # 'colsample_bytree':0.5,
    # 'colsample_bylevel':0.57748,
    # 'colsample_bynode':0.5,
    'max_delta_step':6,
    'grow_policy':'lossguide',
    'max_leaves':20,
}
evalmodel(XGBClassifier,**params)

{'Model': xgboost.sklearn.XGBClassifier,
 'train_score': '84.99%',
 'val_score': '85.11%'}

## Trial #1

{
    
'max_depth': 507,

 'n_estimators': 465,

 'eta': 0.024923147602866933,

 'gamma': 1,

 'min_child_weight': 17,

 'subsample': 0.44205873287201014,

 'alpha': 6,

 'lambda': 1,

 'colsample_bytree': 0.4108221535102925,

 'max_delta_step': 2,

 'max_leaves': 29
 }

In [86]:
import optuna
def objective(trial):

    param = {
        'max_depth': trial.suggest_int("max_depth",200,600),
        "n_estimators": trial.suggest_int("n_estimators", 300,600),
        'eta': trial.suggest_float('eta',0.01,0.06),
        'gamma': trial.suggest_int("gamma",1,20),
        'min_child_weight':trial.suggest_int("min_child_weight",8,30),
        'subsample':trial.suggest_float('subsample',0.3,1),
        'alpha': trial.suggest_int("alpha",1,10),
        'lambda' : trial.suggest_int("lambda",1,10),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.2, 1.0),
        'max_delta_step':trial.suggest_int("max_delta_step",1,10),
        'grow_policy':'lossguide',
        'max_leaves':trial.suggest_int("max_leaves",1,30),
    }
    model = XGBClassifier(n_jobs=-1,**param)
    accuracy = cross_val_score(model, X_train,train_targets, cv = 3).mean()
    return accuracy

In [87]:
# study = optuna.create_study(direction="maximize")
# study.optimize(objective, n_trials=150)

# print("Number of finished trials: {}".format(len(study.trials)))

# print("Best trial:")
# trial = study.best_trial

# print("  Value: {}".format(trial.value))
# trial.params

In [88]:
best_params = {
        'max_depth': 2**29,
        'n_estimators': 465,
        'eta': 0.024923147602866933,
        'gamma': 0,
        'min_child_weight': 17,
        'subsample': 0.44205873287201014,
        'alpha': 6,
        'lambda': 1,
        'colsample_bytree': 0.4108221535102925,
        'max_delta_step': 2,
        'max_leaves': 29
}
evalmodel(XGBClassifier,**best_params)

{'Model': xgboost.sklearn.XGBClassifier,
 'train_score': '85.09%',
 'val_score': '85.11%'}
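
To put the tuning results side by side, the sketch below collects the validation accuracies reported in the cell outputs of this notebook (default settings versus the best tuned run shown for each model) and prints the gain from tuning. The numbers are copied from the outputs above; only the layout is new.

```python
# Validation accuracy (%) copied from the cell outputs in this notebook
val_scores = {
    'RandomForest': {'default': 84.24, 'tuned': 84.93},
    'LightGBM': {'default': 85.09, 'tuned': 85.12},
    'XGBoost': {'default': 85.04, 'tuned': 85.11},
}

# Gain in percentage points from tuning, rounded to two decimals
gains = {name: round(s['tuned'] - s['default'], 2) for name, s in val_scores.items()}

for name, s in val_scores.items():
    print(f"{name:<13} default={s['default']:.2f}  tuned={s['tuned']:.2f}  gain={gains[name]:+.2f}")
```

For the two boosted models the gain from tuning stays under 0.1 percentage points on this validation split.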

In [101]:
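from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import RocCurveDisplay
import matplotlib.pyplot as plt

# plot_roc_curve is not available in recent scikit-learn releases; RocCurveDisplay is used instead
model = RandomForestClassifier(random_state=76)
model.fit(X_train, train_targets)

# Diabetes_012 has three classes, so plot a one-vs-rest curve for the diabetes class (2.0)
class_idx = list(model.classes_).index(2.0)
probs = model.predict_proba(X_train)[:, class_idx]
rf_disp = RocCurveDisplay.from_predictions((train_targets == 2.0).astype(int), probs,
                                           name='RandomForest, class 2 vs rest')
plt.show()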

In [102]:
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import train_test_split

# Split the training data again; new names avoid overwriting X_train and train_targets
X_tr, X_test, y_tr, y_test = train_test_split(X_train, train_targets, random_state=42)
svc = SVC(random_state=42)
svc.fit(X_tr, y_tr)

# One-vs-rest ROC curve for the diabetes class (2.0), using the SVC decision scores
class_idx = list(svc.classes_).index(2.0)
scores = svc.decision_function(X_val)[:, class_idx]
svc_disp = RocCurveDisplay.from_predictions((val_targets == 2.0).astype(int), scores,
                                            name='SVC, class 2 vs rest')
plt.show()