# Baselines

In this notebook we get baseline AUC scores for LightGBM's `LGBMClassifier` and scikit-learn's `HistGradientBoostingClassifier`. XGBoost and CatBoost are tested separately in Kaggle notebooks, since both run very slowly on my local computer, which has no GPU.

In each case we use 3-fold stratified cross-validation, fix the random seed, set a high value for the number of trees/iterations and use early stopping to avoid overfitting. All other settings are left at their defaults; hyperparameter tuning is the subject of the next few notebooks.
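
As a quick illustration of the cross-validation scheme described above, here is a minimal sketch on synthetic data (the arrays `X` and `y` below are made up for the example and are not the competition data). It shows that `StratifiedKFold` keeps the positive rate of every validation fold close to the overall rate, which is why we use it for a binary target.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced binary target (roughly 20% positives); illustrative only
rng = np.random.default_rng(0)
y = (rng.random(3000) < 0.2).astype(int)
X = rng.normal(size=(3000, 4))

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
overall_rate = y.mean()
fold_rates = []

for fold, (train_idx, valid_idx) in enumerate(skf.split(X, y)):
    # Each validation fold should have nearly the same positive rate as the full data
    fold_rates.append(y[valid_idx].mean())
    print(f"Fold {fold}: {len(valid_idx)} rows, positive rate {fold_rates[-1]:.4f}")

print(f"Overall positive rate: {overall_rate:.4f}")
```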

In [1]:
# Global variables for testing changes to this notebook quickly
NUM_TREES = 10000
EARLY_STOP = 150
NUM_FOLDS = 3
RANDOM_SEED = 0

In [2]:
# Essential imports
import numpy as np
import pandas as pd
import pyarrow
import time
import os
import gc

# Model evaluation
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import roc_auc_score

# Models
from lightgbm import LGBMClassifier
from sklearn.ensemble import HistGradientBoostingClassifier

# Hide warnings
import warnings
warnings.filterwarnings('ignore')

In [3]:
# Generate training set
train = pd.read_feather("../data/train.feather")

# Save features and categorical features
features = [x for x in train.columns if x not in ['id','target']]
lgbm_cat_features = [x for x in features if train[x].dtype.name.startswith("int")]
hist_cat_features = [train[x].dtype.name.startswith("int") for x in features]
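
The two lists above encode the same information in different shapes: LightGBM takes the categorical columns by name, while `HistGradientBoostingClassifier` accepts a boolean mask with one entry per feature. The sketch below uses a toy DataFrame with hypothetical column names (`f0`, `f1`, `f2`) to make the difference concrete. Note that the `startswith("int")` test would not match unsigned dtypes such as `uint8`.

```python
import pandas as pd

# Toy stand-in for the training set; column names are hypothetical
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "f0": [0.1, 0.5, 0.9],   # float column, treated as continuous
    "f1": [1, 0, 2],         # integer column, treated as categorical
    "f2": [3.2, 1.1, 0.4],   # float column, treated as continuous
    "target": [0, 1, 0],
})

toy_features = [c for c in toy.columns if c not in ["id", "target"]]

# Name-based list (the form passed to LightGBM in this notebook)
toy_cat_names = [c for c in toy_features if toy[c].dtype.name.startswith("int")]

# Boolean mask aligned with the feature order (the form passed to HistGradientBoostingClassifier)
toy_cat_mask = [toy[c].dtype.name.startswith("int") for c in toy_features]

print(toy_cat_names)  # ['f1']
print(toy_cat_mask)   # [False, True, False]
```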

# Model 1: LGBMClassifier

The first model we test is the [LGBMClassifier](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html) from the LightGBM library.

In [4]:
def score_lightgbm():
    start = time.time()
    scores = np.zeros(NUM_FOLDS)
    print('')
    
    skf = StratifiedKFold(n_splits = NUM_FOLDS, shuffle = True, random_state = RANDOM_SEED)
    for fold, (train_idx, valid_idx) in enumerate(skf.split(train, train['target'])):
        
        # train, valid split for cross-validation
        X_train, y_train = train[features].iloc[train_idx], train['target'].iloc[train_idx]
        X_valid, y_valid = train[features].iloc[valid_idx], train['target'].iloc[valid_idx]

        # model with params
        model = LGBMClassifier(
            n_estimators = NUM_TREES,
            random_state = RANDOM_SEED,
        )

        model.fit(
            X_train, y_train,
            eval_set = [(X_valid, y_valid)],
            eval_metric = 'auc',
            early_stopping_rounds = EARLY_STOP,
            categorical_feature = lgbm_cat_features,
            verbose = False,
        )

        valid_preds = model.predict_proba(X_valid)[:,1]
        
        scores[fold] = roc_auc_score(y_valid, valid_preds)
        print(f"Fold {fold} (AUC):", scores[fold])
        
    end = time.time()
    return scores.mean(), round(end-start, 2)

In [5]:
lgbm_score, lgbm_time = score_lightgbm()

print("\nTraining Time:", lgbm_time)
print("Holdout (AUC):", lgbm_score)


Fold 0 (AUC): 0.8551987614596401
Fold 1 (AUC): 0.8551955781455962
Fold 2 (AUC): 0.8550151870874342

Training Time: 463.0
Holdout (AUC): 0.8551365088975569


# Model 2: HistGradientBoostingClassifier

The second model we consider is the [HistGradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html#sklearn.ensemble.HistGradientBoostingClassifier) from scikit-learn, which itself is modeled after LightGBM.

In [6]:
def score_histgbm():
    
    start = time.time()
    scores = np.zeros(NUM_FOLDS)
    print('')
    
    skf = StratifiedKFold(n_splits = NUM_FOLDS, shuffle = True, random_state = RANDOM_SEED)
    for fold, (train_idx, valid_idx) in enumerate(skf.split(train, train['target'])):
        
        # train, valid split for cross-validation
        X_train, y_train = train[features].iloc[train_idx], train['target'].iloc[train_idx]
        X_valid, y_valid = train[features].iloc[valid_idx], train['target'].iloc[valid_idx]

        # model with params
        model = HistGradientBoostingClassifier(
            max_iter = NUM_TREES,
            early_stopping = True,
            n_iter_no_change = EARLY_STOP,
            categorical_features = hist_cat_features,
            validation_fraction = 0.1,
            # fix the seed so the internal early-stopping split is reproducible
            random_state = RANDOM_SEED,
        )

        model.fit(X_train, y_train)

        valid_preds = model.predict_proba(X_valid)[:,1]
        
        scores[fold] = roc_auc_score(y_valid, valid_preds)
        print(f"Fold {fold} (AUC):", scores[fold])
        
    end = time.time()
    return scores.mean(), round(end-start, 2)

In [7]:
hist_score, hist_time = score_histgbm()

print("\nTraining Time:", hist_time)
print("Holdout (AUC):", hist_score)


Fold 0 (AUC): 0.854598950948193
Fold 1 (AUC): 0.8547508884946713
Fold 2 (AUC): 0.8546138417438963

Training Time: 659.82
Holdout (AUC): 0.8546545603955868
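
To compare the two baselines side by side, the fold scores and timings printed above can be collected into a small table. This is only a convenience sketch: the numbers are copied by hand from the cell outputs, and the standard deviation across folds is an addition that the notebook itself does not compute.

```python
import numpy as np
import pandas as pd

# Fold AUCs and wall-clock times (seconds) copied from the outputs above
results = {
    "LGBMClassifier": {
        "folds": [0.8551987614596401, 0.8551955781455962, 0.8550151870874342],
        "seconds": 463.0,
    },
    "HistGradientBoostingClassifier": {
        "folds": [0.854598950948193, 0.8547508884946713, 0.8546138417438963],
        "seconds": 659.82,
    },
}

summary = pd.DataFrame.from_dict(
    {
        name: {
            "mean_auc": np.mean(r["folds"]),
            "std_auc": np.std(r["folds"]),
            "seconds": r["seconds"],
        }
        for name, r in results.items()
    },
    orient="index",
)

print(summary)
```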
