<a href="https://colab.research.google.com/github/stanislavlia/msds_cred_scoring/blob/master/training_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install plotly
!pip install optuna
!pip install xgboost
!pip install catboost
!pip install shap

Collecting shap
  Downloading shap-0.44.0-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (533 kB)
Collecting slicer==0.0.7 (from shap)
  Downloading slicer-0.0.7-py3-none-any.whl (14 kB)
Installing collected packages: slicer, shap
Successfully installed shap-0.44.0 slicer-0.0.7


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import optuna
import plotly.express as px
import logging
import shap

from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

#Set up logging
logger = logging.getLogger(optuna.__name__)
logger.setLevel(logging.DEBUG)

handler = logging.StreamHandler()
logger.addHandler(handler)

In [None]:
train_df = pd.read_csv("drive/MyDrive/msds_homecredit/processed_train.csv")
test_df = pd.read_csv("drive/MyDrive/msds_homecredit/processed_test.csv")

In [None]:
train_df.drop('Unnamed: 0', axis=1, inplace=True)
test_df.drop('Unnamed: 0', axis=1, inplace=True)

In [None]:
X_train = train_df.drop("TARGET", axis=1)
y_train = train_df["TARGET"]

X_test = test_df

## Choice of metric

Our target variable is heavily imbalanced (9% defaulters vs. 91% non-defaulters).
This means we shouldn't choose accuracy: a model that predicts "no default" for everyone would already score about 91% while catching no defaulters. We are interested both in catching potential defaulters and in avoiding false-positive default predictions, because each wrongly rejected applicant is essentially lost money for the bank. I considered two metrics that fit these goals: **F1-score**, which balances **precision** and **recall** at a fixed decision threshold, and **AUC** (area under the ROC curve), which measures how well the model ranks defaulters above non-defaulters across all thresholds. I chose to go with **AUC**, although *F1-score* would be a good metric as well.

Chosen metric - **AUC**
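
To make the argument against accuracy concrete, here is a minimal sketch on synthetic labels with the same 9% / 91% split. The helpers `accuracy` and `auc_score` are hypothetical pure-Python stand-ins written for this illustration (the notebook itself relies on scikit-learn's `roc_auc_score`); `auc_score` computes AUC as the probability that a random positive is scored above a random negative, counting ties as one half.

```python
def accuracy(y_true, y_pred):
    """Share of labels predicted correctly."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


def auc_score(y_true, scores):
    """AUC as P(score of a positive > score of a negative), ties count 0.5."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Synthetic labels: 9 defaulters, 91 non-defaulters (toy data, not the real set)
y_true = [1] * 9 + [0] * 91

# A "model" that never predicts default and gives everyone the same score
constant_labels = [0] * 100
constant_scores = [0.0] * 100

baseline_accuracy = accuracy(y_true, constant_labels)
baseline_auc = auc_score(y_true, constant_scores)
print(baseline_accuracy, baseline_auc)
```

The constant predictor looks strong on accuracy but is no better than chance on AUC, which is why AUC is the more informative number here.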



## Discovering Important Features in Data

In order to get a better understanding of our features, I will train a small *Random Forest* model to compute feature importance from the trees as well as SHAP importance.

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_discover = RandomForestClassifier(max_depth=10,
                                       n_estimators=50,
                                       bootstrap=True)
rf_discover.fit(X_train, y_train)

In [None]:
feature_importances = pd.DataFrame({"feature" : X_train.columns,
                                     "importance" : rf_discover.feature_importances_})
feature_importances = feature_importances.sort_values(by="importance", ascending=False)

In [None]:
feature_importances.head(15)

| | feature | importance |
|---|---|---|
| 28 | EXT_SOURCE_2 | 0.135829 |
| 29 | EXT_SOURCE_3 | 0.122082 |
| 27 | EXT_SOURCE_1 | 0.040218 |
| 89 | DAYS_CREDIT_mean | 0.023999 |
| 7 | DAYS_BIRTH | 0.022154 |
| 99 | DAYS_CREDIT_UPDATE_mean | 0.020172 |
| 91 | DAYS_CREDIT_ENDDATE_mean | 0.016375 |
| 128 | AMT_PAYMENT_mean | 0.015854 |
| 155 | NAME_CONTRACT_STATUS_mode | 0.015701 |
| 139 | NAME_EDUCATION_TYPE | 0.014433 |
139,NAME_EDUCATION_TYPE,0.014433


In [None]:
fig = px.bar(feature_importances[:50], x='feature', y='importance', title='50 Most Important Features')
fig.update_layout(
    autosize=False,
    width=1100,     # Width of the figure in pixels
    height=700,     # Height of the figure in pixels
    yaxis=dict(
        tickmode='auto',
        nticks=20   # Number of ticks on y-axis, for more precision
    )
)
fig.show()

## Hand-crafted features

In [None]:
# Adding a small constant to avoid division by zero
epsilon = 1e-6

train_df["EXT_SOUCE_AVG"] = (train_df["EXT_SOURCE_1"] + train_df["EXT_SOURCE_2"] + train_df["EXT_SOURCE_3"]) / 3
train_df["LABOR_PERIOD_RATE"] = train_df["DAYS_EMPLOYED"] / (train_df["DAYS_BIRTH"] + epsilon)
train_df["CURR_VS_PREV_GOODS_PRICE"] = train_df["AMT_GOODS_PRICE"] / (train_df["AMT_GOODS_PRICE_mean"] + epsilon)
train_df["CURR_VS_PREV_ANNUITY"] = train_df["AMT_ANNUITY"] / (train_df["AMT_ANNUITY_mean"] + epsilon)
train_df["CONSUMPTION_RATE"] = train_df["AMT_GOODS_PRICE"] / (train_df["AMT_INCOME_TOTAL"] + epsilon)
train_df["CURR_REGISTRATION_PERIOD"] = train_df["DAYS_REGISTRATION"] / (train_df["DAYS_BIRTH"] + epsilon)
train_df["CREDIT_LOAD"] = train_df["AMT_CREDIT"] / (train_df["AMT_INCOME_TOTAL"] + epsilon)
train_df["CREDIT_LOAD_MEAN"] = train_df["AMT_CREDIT_mean"] / (train_df["AMT_INCOME_TOTAL"] + epsilon)
train_df["DECISION_ACTION_TIME"] = (- train_df["DAYS_DECISION_mean"]) - (- train_df["DAYS_ENTRY_PAYMENT_mean"])
train_df["PAYMENT_ANNUITY_RATIO"] = train_df["AMT_PAYMENT_mean"] / (train_df["AMT_ANNUITY_mean"] + epsilon)

# Applying the same changes to test_df
test_df["EXT_SOUCE_AVG"] = (test_df["EXT_SOURCE_1"] + test_df["EXT_SOURCE_2"] + test_df["EXT_SOURCE_3"]) / 3
test_df["LABOR_PERIOD_RATE"] = test_df["DAYS_EMPLOYED"] / (test_df["DAYS_BIRTH"] + epsilon)
test_df["CURR_VS_PREV_GOODS_PRICE"] = test_df["AMT_GOODS_PRICE"] / (test_df["AMT_GOODS_PRICE_mean"] + epsilon)
test_df["CURR_VS_PREV_ANNUITY"] = test_df["AMT_ANNUITY"] / (test_df["AMT_ANNUITY_mean"] + epsilon)
test_df["CONSUMPTION_RATE"] = test_df["AMT_GOODS_PRICE"] / (test_df["AMT_INCOME_TOTAL"] + epsilon)
test_df["CURR_REGISTRATION_PERIOD"] = test_df["DAYS_REGISTRATION"] / (test_df["DAYS_BIRTH"] + epsilon)
test_df["CREDIT_LOAD"] = test_df["AMT_CREDIT"] / (test_df["AMT_INCOME_TOTAL"] + epsilon)
test_df["CREDIT_LOAD_MEAN"] = test_df["AMT_CREDIT_mean"] / (test_df["AMT_INCOME_TOTAL"] + epsilon)
test_df["DECISION_ACTION_TIME"] = (- test_df["DAYS_DECISION_mean"]) - (- test_df["DAYS_ENTRY_PAYMENT_mean"])
test_df["PAYMENT_ANNUITY_RATIO"] = test_df["AMT_PAYMENT_mean"] / (test_df["AMT_ANNUITY_mean"] + epsilon)


Add some comments about why I chose such combinations...
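
The cell above writes every expression twice, once for `train_df` and once for `test_df`. One possible way to keep the two in sync is to describe the ratio features once and apply them through a helper. The sketch below is only an illustration: `add_ratio_features` and `RATIO_FEATURES` are hypothetical names, only three of the ratios are listed, and the toy frame holds made-up values rather than rows from the real data.

```python
import pandas as pd

EPSILON = 1e-6  # same small constant as above, guards against division by zero

# (new column, numerator column, denominator column)
RATIO_FEATURES = [
    ("LABOR_PERIOD_RATE", "DAYS_EMPLOYED", "DAYS_BIRTH"),
    ("CREDIT_LOAD", "AMT_CREDIT", "AMT_INCOME_TOTAL"),
    ("PAYMENT_ANNUITY_RATIO", "AMT_PAYMENT_mean", "AMT_ANNUITY_mean"),
]


def add_ratio_features(df, ratios=RATIO_FEATURES, epsilon=EPSILON):
    """Return a copy of df with one extra column per (name, num, den) triple."""
    out = df.copy()
    for new_col, numerator, denominator in ratios:
        out[new_col] = out[numerator] / (out[denominator] + epsilon)
    return out


# Toy frame with made-up values, only to show the call
toy = pd.DataFrame({
    "DAYS_EMPLOYED": [-1000.0, -200.0],
    "DAYS_BIRTH": [-10000.0, -20000.0],
    "AMT_CREDIT": [200000.0, 50000.0],
    "AMT_INCOME_TOTAL": [100000.0, 100000.0],
    "AMT_PAYMENT_mean": [5000.0, 0.0],
    "AMT_ANNUITY_mean": [10000.0, 0.0],
})
toy_features = add_ratio_features(toy)
print(toy_features[["LABOR_PERIOD_RATE", "CREDIT_LOAD", "PAYMENT_ANNUITY_RATIO"]])
```

With such a helper, the same call could be applied to both `train_df` and `test_df`, so a change to one feature definition cannot be forgotten on the other frame.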

## Choice of models

Overall, our training data has 177 features, so we are dealing with fairly high-dimensional data. For this problem, **tree-based ensembles** are a good choice because of their predictive power and computational efficiency. Another advantage of such models is that their results can be interpreted by analyzing feature importance.

Models to consider:
  - RandomForest
  - CatBoost
  - XGBoost
  - LightGBM
  

## Optuna framework for hyperparameter tuning

Optuna is an open-source hyperparameter optimization framework designed for machine learning. It provides an efficient way to search for the set of hyperparameters that gives a model its best performance. By default, Optuna samples new trials with the Tree-structured Parzen Estimator (TPE), a form of *Bayesian optimization*: it builds a probabilistic model from the trials completed so far and uses it to decide which hyperparameters to try next. This is particularly effective when each evaluation (such as training and validating a machine learning model) is expensive in time and compute.

We need to define an objective function to optimize. In our case, we want to maximize the mean **AUC** over the cross-validation folds.
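
The contract of an objective function can be shown without Optuna at all: it takes a set of hyperparameters and returns a single score, and the tuner keeps the best one. The sketch below deliberately swaps in plain random search for Optuna's sampler, so it illustrates the loop, not the Bayesian part. `toy_objective` is a made-up function standing in for "mean cross-validated AUC", and `random_search` is a hypothetical helper.

```python
import random


def toy_objective(params):
    """Made-up score that peaks at learning_rate=0.05 and max_depth=6."""
    return (1.0
            - 10.0 * (params["learning_rate"] - 0.05) ** 2
            - 0.001 * (params["max_depth"] - 6) ** 2)


def random_search(objective, n_trials, seed=0):
    """Sample hyperparameters uniformly at random and keep the best trial."""
    rng = random.Random(seed)
    best_params, best_value = None, float("-inf")
    for _ in range(n_trials):
        params = {
            "learning_rate": rng.uniform(0.01, 0.2),
            "max_depth": rng.randint(3, 8),
        }
        value = objective(params)
        if value > best_value:  # we maximize, as with direction="maximize"
            best_params, best_value = params, value
    return best_params, best_value


best_params, best_value = random_search(toy_objective, n_trials=50)
print(best_params, best_value)
```

Optuna replaces the uniform sampling step with proposals informed by earlier trials, which matters most when each call to the objective takes minutes, as it does in the studies that follow.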

## Tuning for RandomForest

In [None]:

def rf_objective(trial):
    # Define the hyperparameter search space
    n_estimators = trial.suggest_int("n_estimators", 100, 1000)
    max_depth = trial.suggest_int("max_depth", 2, 32, log=True)
    min_samples_split = trial.suggest_int("min_samples_split", 2, 14)
    min_samples_leaf = trial.suggest_int("min_samples_leaf", 1, 14)

    clf = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        random_state=42
    )

    scores = cross_val_score(clf, X_train, y_train, cv=3, scoring='roc_auc')
    return np.mean(scores)

rf_study = optuna.create_study(direction="maximize")
rf_study.optimize(rf_objective, n_trials=5, timeout=2400)


rf_trial = rf_study.best_trial
print(f'Best trial: {rf_trial.params}')

##Saving study
rf_study_df = rf_study.trials_dataframe()
rf_study_df.to_csv("drive/MyDrive/msds_homecredit/rf_study.csv", index=False)

[I 2024-01-13 07:00:15,035] A new study created in memory with name: no-name-1e51066b-19c6-48b0-833c-639f9a13f4e4
[I 2024-01-13 07:32:12,671] Trial 0 finished with value: 0.7249206881502835 and parameters: {'n_estimators': 472, 'max_depth': 17, 'min_samples_split': 7, 'min_samples_leaf': 2}. Best is trial 0 with value: 0.7249206881502835.
[I 2024-01-13 07:48:11,760] Trial 1 finished with value: 0.730204166165661 and parameters

Best trial: {'n_estimators': 631, 'max_depth': 6, 'min_samples_split': 6, 'min_samples_leaf': 10}


In [None]:
rf_best_params = {'n_estimators': 568,
                  'max_depth': 14,
                  'min_samples_split': 7,
                  'min_samples_leaf': 10}



In [None]:
optuna.visualization.plot_optimization_history(rf_study, target_name="AUC")



In [None]:
optuna.visualization.plot_param_importances(rf_study)

### Reduce the size of the model
I think it is a good idea to train the following models on reduced data that contains only the top *K* most important features according to feature importance. This will allow us to iterate faster over models and sets of hyperparameters.

In [None]:
X_train = train_df.drop("TARGET", axis=1)
y_train = train_df["TARGET"]

X_test = test_df



In [None]:
rf_best_model = RandomForestClassifier(n_estimators=50,
                                      max_depth=14,
                                      min_samples_split=7,
                                      min_samples_leaf=10,
                                      n_jobs=-1)
rf_best_model.fit(X_train, y_train)


In [None]:
new_feature_importances = pd.DataFrame({"feature" : X_train.columns,
                                     "importance" : rf_best_model.feature_importances_})
new_feature_importances = new_feature_importances.sort_values(by="importance", ascending=False)

In [None]:
fig = px.bar(new_feature_importances[:50], x='feature',
             y='importance', title='New TOP 50 Important features')

fig.update_traces(marker_color='orange')

fig.update_layout(
    autosize=False,
    width=1100,
    height=700,
    yaxis=dict(
        tickmode='auto',
        nticks=20
    )
)
fig.show()

First of all, we can see that some of the hand-crafted features turned out to be successful combinations. The most useful feature is **EXT_SOUCE_AVG** (the column name as spelled in the code), the hand-crafted average of the three external scoring sources.
**PAYMENT_ANNUITY_RATIO** is also among the top features.
Overall, *feature engineering* gave us a few more useful columns to consider.

Now I suggest using only the 75 most important features in order to make training iterations faster.
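
One way to sanity-check a cut-off such as 75 is to look at how much of the total importance the kept features cover. The sketch below assumes importances are available as a name-to-score mapping; `top_k_features` is a hypothetical helper and the five toy importances are invented for the example, not taken from the model above.

```python
def top_k_features(importances, k):
    """Return the k highest-scoring feature names and their share of total importance."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    selected = [name for name, _ in ranked[:k]]
    total = sum(importances.values())
    covered = sum(score for _, score in ranked[:k]) / total
    return selected, covered


# Invented importances, only to show the call
toy_importances = {
    "feat_a": 0.40,
    "feat_b": 0.30,
    "feat_c": 0.15,
    "feat_d": 0.10,
    "feat_e": 0.05,
}
selected, covered = top_k_features(toy_importances, k=3)
print(selected, round(covered, 2))
```

If the covered share is already close to 1 for a smaller K, the feature set could probably be reduced further; if it is low, the cut-off may be discarding useful signal.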

In [None]:
TOP_75_FEATS = new_feature_importances[:75]["feature"].values
TOP_75_FEATS

array(['EXT_SOUCE_AVG', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'EXT_SOURCE_1',
       'DAYS_CREDIT_mean', 'DAYS_BIRTH', 'PAYMENT_ANNUITY_RATIO',
       'DAYS_CREDIT_UPDATE_mean', 'DAYS_CREDIT_ENDDATE_mean',
       'AMT_PAYMENT_mean', 'DAYS_EMPLOYED', 'LABOR_PERIOD_RATE',
       'CNT_INSTALMENT_FUTURE_mean', 'DAYS_DECISION_mean',
       'NUM_INSTALMENT_NUMBER_std', 'CURR_VS_PREV_ANNUITY',
       'DAYS_ID_PUBLISH', 'DAYS_INSTALMENT_mean',
       'DAYS_LAST_PHONE_CHANGE', 'AMT_INSTALMENT_mean', 'AMT_ANNUITY',
       'DAYS_LAST_DUE_1ST_VERSION_mean', 'DAYS_ENTRY_PAYMENT_mean',
       'CNT_INSTALMENT_FUTURE_std', 'DAYS_REGISTRATION', 'AMT_CREDIT',
       'AMT_CREDIT_SUM_DEBT_mean', 'CURR_REGISTRATION_PERIOD',
       'SELLERPLACE_AREA_mean', 'CNT_INSTALMENT_mean',
       'DAYS_ENDDATE_FACT_mean', 'CNT_PAYMENT_mean',
       'AMT_CREDIT_SUM_mean', 'AMT_INSTALMENT_std',
       'AMT_GOODS_PRICE_mean', 'AMT_ANNUITY_mean', 'CREDIT_LOAD_MEAN',
       'AMT_APPLICATION_mean', 'CREDIT_LOAD', 'SK_ID_CURR',
  

In [None]:
##Leave only important features

X_reduced_train = X_train[TOP_75_FEATS]
X_reduced_test = X_test[TOP_75_FEATS]

## Tuning for LightGBM

 *Using reduced data*

In [None]:
import lightgbm as lgb


def lgbm_objective(trial):
    param = {
        'objective': 'binary',
        'metric': 'auc',
        'verbosity': -1,
        'n_jobs' : -1,
        'boosting_type': 'gbdt',
        'n_estimators': trial.suggest_int('n_estimators', 100, 900),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.2),
        'max_depth': trial.suggest_int('max_depth', 3, 8),
        'num_leaves': trial.suggest_int('num_leaves', 20, 300),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 0.0, 1.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.0, 1.0)
    }

    clf = lgb.LGBMClassifier(**param)
    scores = cross_val_score(clf, X_reduced_train, y_train, cv=3, scoring='roc_auc')
    return np.mean(scores)

lgbm_study = optuna.create_study(direction="maximize")
lgbm_study.optimize(lgbm_objective, n_trials=20, timeout=2400)

# Best trial
lgbm_trial = lgbm_study.best_trial
print(f'Best trial for Gradient Boosting: {lgbm_trial.params}')

##Saving study
lgbm_study_df = lgbm_study.trials_dataframe()
lgbm_study_df.to_csv("drive/MyDrive/msds_homecredit/lgbm_study.csv", index=False)

[I 2024-01-13 07:49:37,393] A new study created in memory with name: no-name-f9ba2fad-2f1e-431c-bd9e-0df99f1ce2f5
[I 2024-01-13 07:50:07,965] Trial 0 finished with value: 0.7648578302655195 and parameters: {'n_estimators': 115, 'learning_rate': 0.16355054833961238, 'max_depth': 4, 'num_leaves': 115, 'min_child_samples': 80, 'subsample': 0.8336654115917892, 'colsample_bytree': 0.5657631877734255, 'reg_alpha': 0.11399181101533484, 'reg_lambda': 0.0418844737267855}. Best is trial 0 with value: 0.7648578302655195.

Best trial for Gradient Boosting: {'n_estimators': 732, 'learning_rate': 0.031316147584151494, 'max_depth': 4, 'num_leaves': 254, 'min_child_samples': 53, 'subsample': 0.7984504967334111, 'colsample_bytree': 0.8251989329299895, 'reg_alpha': 0.6757873677566055, 'reg_lambda': 0.5459196324312461}


In [None]:
optuna.visualization.plot_optimization_history(lgbm_study, target_name="AUC")

In [None]:
optuna.visualization.plot_param_importances(lgbm_study)

In [None]:
lgbm_best_params = {'n_estimators': 732, 'learning_rate': 0.031316147584151494, 'max_depth': 4,
                    'num_leaves': 254, 'min_child_samples': 53, 'subsample': 0.7984504967334111,
                    'colsample_bytree': 0.8251989329299895,
                    'reg_alpha': 0.6757873677566055, 'reg_lambda': 0.5459196324312461}

## Tuning for XGBoost
---



In [None]:
from xgboost import XGBClassifier

def xgb_objective(trial):

    param = {
        'max_depth': trial.suggest_int('max_depth', 1, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 1.0),
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'gamma': trial.suggest_float('gamma', 1e-8, 1.0, log=True),
        'subsample': trial.suggest_float('subsample', 0.01, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.01, 1.0, log=True),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 1.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 1.0, log=True),
        'eval_metric': 'auc',  # binary target; 'mlogloss' is the multiclass log loss
        'use_label_encoder': False,
        'device' : 'cuda'
    }


    clf = XGBClassifier(**param)
    scores = cross_val_score(clf, X_reduced_train, y_train, cv=3, scoring='roc_auc')
    return np.mean(scores)

In [None]:
xgb_study = optuna.create_study(direction="maximize")
xgb_study.optimize(xgb_objective, n_trials=20, timeout=2400)

# Best trial
xgb_trial = xgb_study.best_trial
print(f'Best trial for Gradient Boosting: {xgb_trial.params}')

##Saving study
xgb_study_df = xgb_study.trials_dataframe()
xgb_study_df.to_csv("drive/MyDrive/msds_homecredit/xgb_study.csv", index=False)

[I 2024-01-13 08:34:56,975] A new study created in memory with name: no-name-30cd1719-c560-446e-934c-b5fa44ea5628
[I 2024-01-13 08:35:07,890] Trial 0 finished with value: 0.754263443183404 and parameters: {'max_depth': 1, 'learning_rate': 0.927200974282978, 'n_estimators': 70, 'min_child_weight': 2, 'gamma': 0.007460526328726022, 'subsample': 0.6988110327627195, 'colsample_bytree': 0.13103387474723274, 'reg_alpha': 0.00011259149658238417, 'reg_lambda': 8.65148311804781e-08}. Best is trial 0 with value: 0.754263443183404.

Best trial for Gradient Boosting: {'max_depth': 2, 'learning_rate': 0.15885761012327237, 'n_estimators': 415, 'min_child_weight': 6, 'gamma': 3.9356793970459116e-07, 'subsample': 0.6311152293659329, 'colsample_bytree': 0.05892904769636796, 'reg_alpha': 5.406899871601363e-08, 'reg_lambda': 0.020467130907403503}


In [None]:
xgb_best_params = {'max_depth': 2, 'learning_rate': 0.15885761012327237,
                   'n_estimators': 415, 'min_child_weight': 6,
                   'gamma': 3.9356793970459116e-07, 'subsample': 0.6311152293659329,
                   'colsample_bytree': 0.05892904769636796,
                   'reg_alpha': 5.406899871601363e-08, 'reg_lambda': 0.020467130907403503}

In [None]:
optuna.visualization.plot_optimization_history(xgb_study, target_name="AUC")

In [None]:
optuna.visualization.plot_param_importances(xgb_study)

## Tuning for CatBoost

In [None]:
import catboost as cb

def cat_objective(trial):
  param = {
        "objective": trial.suggest_categorical("objective", ["Logloss", "CrossEntropy"]),
        "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.01, 0.1),
        "depth": trial.suggest_int("depth", 1, 12),
        "boosting_type": trial.suggest_categorical("boosting_type", ["Ordered", "Plain"]),
        "bootstrap_type": trial.suggest_categorical(
            "bootstrap_type", ["Bayesian", "Bernoulli", "MVS"]
        ),
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.1),
        'silent' : True

    }
  clf = cb.CatBoostClassifier(**param)
  scores = cross_val_score(clf, X_reduced_train, y_train, cv=3, scoring='roc_auc')
  return np.mean(scores)

cat_study = optuna.create_study(direction="maximize")
cat_study.optimize(cat_objective, n_trials=20, timeout=2400)

# Best trial
cat_trial = cat_study.best_trial
print(f'Best trial for Gradient Boosting: {cat_trial.params}')

##Saving study
cat_study_df = cat_study.trials_dataframe()
cat_study_df.to_csv("drive/MyDrive/msds_homecredit/cat_study.csv", index=False)

[I 2024-01-13 08:51:45,916] A new study created in memory with name: no-name-8f0fb612-97f1-47b1-811b-117010c04f54
[I 2024-01-13 08:54:27,357] Trial 0 finished with value: 0.7652014996781626 and parameters: {'objective': 'Logloss', 'colsample_bylevel': 0.039055149801086805, 'depth': 9, 'boosting_type': 'Plain', 'bootstrap_type': 'Bernoulli', 'l2_leaf_reg': 9.692986359915572, 'learning_rate': 0.07629718148933579}. Best is trial 0 with value: 0.7652014996781626.

Best trial for Gradient Boosting: {'objective': 'Logloss', 'colsample_bylevel': 0.05280499488636512, 'depth': 8, 'boosting_type': 'Plain', 'bootstrap_type': 'Bayesian', 'l2_leaf_reg': 4.401996001745513, 'learning_rate': 0.061871791340297806}


In [None]:
catb_best_params = {'objective': 'Logloss', 'colsample_bylevel': 0.05280499488636512, 'depth': 8,
                    'boosting_type': 'Plain', 'bootstrap_type': 'Bayesian',
                    'l2_leaf_reg': 4.401996001745513, 'learning_rate': 0.061871791340297806}

In [None]:
optuna.visualization.plot_optimization_history(cat_study, target_name="AUC")

In [None]:
optuna.visualization.plot_param_importances(cat_study)


## Summary

###  Best model
The best model turned out to be **LightGBM**, which achieved $AUC = 0.77$ on 3-fold *cross-validation*. I suggest going with this model and set of parameters and refitting it on the whole training dataset.
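
The AUC quoted here is a mean over three folds. When `cross_val_score` receives an integer `cv` and a classifier, scikit-learn uses stratified folds, so each fold keeps roughly the same share of defaulters. The sketch below imitates that idea with a simple round-robin assignment per class; `stratified_folds` is a hypothetical helper, not scikit-learn's actual algorithm, and the labels are synthetic.

```python
def stratified_folds(y, n_splits):
    """Assign indices to folds round-robin within each class label."""
    folds = [[] for _ in range(n_splits)]
    for label in sorted(set(y)):
        indices = [i for i, value in enumerate(y) if value == label]
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return [sorted(fold) for fold in folds]


# Synthetic labels with the same 9% / 91% split as the target
y_toy = [1] * 9 + [0] * 91
folds = stratified_folds(y_toy, n_splits=3)

fold_sizes = [len(fold) for fold in folds]
positives_per_fold = [sum(y_toy[i] for i in fold) for fold in folds]
print(fold_sizes, positives_per_fold)
```

Without stratification, a fold of an imbalanced dataset can end up with very few positives, which makes the per-fold AUC noisy and the averaged score less trustworthy.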

In [None]:
lgbm_best_params

{'n_estimators': 732,
 'learning_rate': 0.031316147584151494,
 'max_depth': 4,
 'num_leaves': 254,
 'min_child_samples': 53,
 'subsample': 0.7984504967334111,
 'colsample_bytree': 0.8251989329299895,
 'reg_alpha': 0.6757873677566055,
 'reg_lambda': 0.5459196324312461}

In [None]:
lgbm_study_df.sort_values(by="value", ascending=False).head(1)

Unnamed: 0,number,value,datetime_start,datetime_complete,duration,params_colsample_bytree,params_learning_rate,params_max_depth,params_min_child_samples,params_n_estimators,params_num_leaves,params_reg_alpha,params_reg_lambda,params_subsample,state
15,15,0.770105,2024-01-13 08:18:06.613229,2024-01-13 08:20:18.924746,0 days 00:02:12.311517,0.825199,0.031316,4,53,732,254,0.675787,0.54592,0.79845,COMPLETE


### Comparison



In [None]:
studies = [rf_study_df, lgbm_study_df, xgb_study_df, cat_study_df]

AUCS_achieved  = []
model_names = ["RandomForest", "LightGBM", "XGboost", "Catboost"]

for study in studies:
  # Best cross-validated AUC recorded in this study's trials dataframe
  auc = float(study["value"].max())
  AUCS_achieved.append(auc)

comparison_table = pd.DataFrame({"model" : model_names, "best AUC" : AUCS_achieved})

In [None]:
comparison_table.sort_values(by="best AUC", ascending=False).head()

| | model | best AUC |
|---|---|---|
| 1 | LightGBM | 0.770105 |
| 2 | XGboost | 0.766084 |
| 3 | Catboost | 0.765781 |
| 0 | RandomForest | 0.730204 |


### Retraining the best model

In [None]:
lightgbm_model = lgb.LGBMClassifier(**lgbm_best_params, verbose=-1)
lightgbm_model.fit(X_reduced_train, y_train)






In [None]:
lightgbm_model.booster_.save_model("drive/MyDrive/msds_homecredit/lgbm_mode.txt")

<lightgbm.basic.Booster at 0x79dc1e6c8b80>

### Computing predictions for test

In [None]:
predicted_probs = lightgbm_model.predict_proba(X_reduced_test)[:, 1]
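# Take the IDs from X_test so the submission does not depend on
# SK_ID_CURR having been kept among the top-75 features
submission = pd.DataFrame({"SK_ID_CURR" : X_test["SK_ID_CURR"],
                           "TARGET" : predicted_probs})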


In [None]:
submission.head(6)

| | SK_ID_CURR | TARGET |
|---|---|---|
| 0 | 100001 | 0.03459 |
| 1 | 100005 | 0.140334 |
| 2 | 100013 | 0.027351 |
| 3 | 100028 | 0.036764 |
| 4 | 100038 | 0.169699 |
| 5 | 100042 | 0.056141 |


In [None]:
submission.to_csv("drive/MyDrive/msds_homecredit/submission.csv", index=False)