# Tuning LightGBM hyperparameters (scikit-learn API)

This example demonstrates how to use FCVOpt to tune LightGBM hyperparameters for a binary classification task.

Key Features:
- Uses `SklearnCVObj` to wrap LightGBM's scikit-learn API
- Demonstrates FCVOpt's hierarchical Gaussian process for efficient cross-validation
- Shows how to define a hyperparameter search space with appropriate scales

Requirements: `lightgbm` must be installed, for example with `pip install lightgbm`.

In [1]:
# Note: set OpenMP threads to 1 to avoid threading conflicts on macOS
import os
os.environ['OMP_NUM_THREADS'] = '1'

In [2]:
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score

# FCVOpt imports
from fcvopt.crossvalidation import SklearnCVObj
from fcvopt.optimizers import FCVOpt
from fcvopt.configspace import ConfigurationSpace
from ConfigSpace import Integer, Float

## Generate Sample Data

We create a synthetic binary classification dataset with roughly 90% negatives and 10% positives, since class imbalance of this kind is common in real classification problems.

In [3]:
# Generate binary classification dataset with class imbalance (90% vs 10%)
# Using 2000 samples, 25 features (5 informative, 10 redundant)
X, y = make_classification(
    n_samples=2000,
    n_features=25,
    n_informative=5,
    n_redundant=10,
    n_classes=2,
    n_clusters_per_class=2,
    weights=[0.9, 0.1],  # imbalanced classes
    random_state=23
)

print(f"Shape of features matrix: {X.shape}")
print(f"Class distribution: {np.bincount(y)}")

Shape of features matrix: (2000, 25)
Class distribution: [1796  204]


## Define Hyperparameter Search Space

We'll tune key LightGBM hyperparameters using appropriate ranges and scales:

| Hyperparameter | Range | Scale | Description |
|----------------|-------|-------|-------------|
| `n_estimators` | [50, 1000] | log | Number of boosting rounds (trees) |
| `learning_rate` | [1e-3, 0.25] | log | Shrinkage rate for updates |
| `num_leaves` | [2, 128] | log | Maximum number of leaves per tree |
| `min_data_in_leaf` | [2, 100] | log | Minimum samples required in a leaf |
| `colsample_bytree` | [0.05, 1.0] | log | Fraction of features used per tree |

In [4]:
# Create configuration space for hyperparameter search
config = ConfigurationSpace()

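# Add hyperparameters with appropriate ranges and scales
# Note: the LightGBM scikit-learn API calls the number of boosting rounds
# 'n_estimators' (an alias of 'num_iterations' / 'num_round' in the native API).
# 'min_data_in_leaf' is an alias of the scikit-learn style 'min_child_samples'.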
config.add([
    Integer('n_estimators', bounds=(50, 1000), log=True),
    Float('learning_rate', bounds=(1e-3, 0.25), log=True),
    Integer('num_leaves', bounds=(2, 128), log=True),
    Integer('min_data_in_leaf', bounds=(2, 100), log=True),
    Float('colsample_bytree', bounds=(0.05, 1), log=True),
])
print(config)

Configuration space object:
  Hyperparameters:
    colsample_bytree, Type: UniformFloat, Range: [0.05, 1.0], Default: 0.22360679775, on log-scale
    learning_rate, Type: UniformFloat, Range: [0.001, 0.25], Default: 0.0158113883008, on log-scale
    min_data_in_leaf, Type: UniformInteger, Range: [2, 100], Default: 14, on log-scale
    n_estimators, Type: UniformInteger, Range: [50, 1000], Default: 224, on log-scale
    num_leaves, Type: UniformInteger, Range: [2, 128], Default: 16, on log-scale
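
The defaults printed above are consistent with each default being the midpoint of its range on a log scale, that is, the geometric mean of the bounds (rounded for integer hyperparameters). The following is a minimal sketch of that arithmetic in plain Python. It only reproduces the numbers and is not taken from the ConfigSpace source; the helper name `log_midpoint` is made up for illustration.

```python
import math

# Bounds copied from the search space defined above.
bounds = {
    "colsample_bytree": (0.05, 1.0),
    "learning_rate": (1e-3, 0.25),
    "min_data_in_leaf": (2, 100),
    "n_estimators": (50, 1000),
    "num_leaves": (2, 128),
}
integer_params = {"min_data_in_leaf", "n_estimators", "num_leaves"}


def log_midpoint(low, high):
    """Midpoint of [low, high] on a log scale, i.e. the geometric mean."""
    return math.exp(0.5 * (math.log(low) + math.log(high)))


log_defaults = {}
for name, (low, high) in bounds.items():
    mid = log_midpoint(low, high)
    # Integer hyperparameters are rounded to the nearest whole number.
    log_defaults[name] = round(mid) if name in integer_params else mid

for name, value in log_defaults.items():
    print(f"{name}: {value}")
```

For comparison, the arithmetic midpoint of the `n_estimators` range is 525, while the log-scale midpoint is 224. A log scale therefore spends about half of the search effort on values between 50 and 224, which is usually what one wants for parameters that span more than one order of magnitude.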



## Define Cross-Validation Objective

We wrap the LightGBM classifier in a `SklearnCVObj` that:
- Evaluates models using 10-fold stratified cross-validation
- Uses ROC-AUC as the performance metric (converted to a loss: 1 - AUC)
- Handles the model training and evaluation for each hyperparameter configuration

In [5]:
# Define loss metric: we want to maximize AUC, so we minimize (1 - AUC)
def auc_loss(y_true, y_pred):
    return 1 - roc_auc_score(y_true, y_pred)

# Create CV objective that wraps the LightGBM classifier
cv_obj = SklearnCVObj(
    estimator=lgb.LGBMClassifier(objective="binary", verbosity=-1),  # Suppress LightGBM output
    X=X, 
    y=y,
    loss_metric=auc_loss,
    needs_proba=True,
    task='classification',
    n_splits=10,  # 10-fold cross-validation
    rng_seed=42,
    stratified=True  # Use stratified splits to preserve class distribution
)

print(f"Created CV objective with {cv_obj.cv.get_n_splits()} folds")

print(f"Number of CV folds: {cv_obj.cv.get_n_splits()}")
print(f"Training samples: {len(cv_obj.y)}")
print(f"Features: {cv_obj.X.shape[1]}")

Created CV objective with 10 folds
Number of CV folds: 10
Training samples: 2000
Features: 25
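
To see why `stratified=True` matters for this dataset, the sketch below deals the samples of each class out to the 10 folds in round-robin order, so every fold keeps close to the overall 10% positive rate. This is a simplified illustration in plain Python, not the splitter that `SklearnCVObj` uses internally. It does no shuffling, and the helper `stratified_fold_ids` is hypothetical; in practice scikit-learn's `StratifiedKFold` is the usual tool.

```python
from collections import Counter

# Class counts taken from the dataset printed earlier: 1796 negatives, 204 positives.
labels = [0] * 1796 + [1] * 204
n_splits = 10


def stratified_fold_ids(labels, n_splits):
    """Assign each sample a fold id by dealing out every class round-robin."""
    fold_ids = [0] * len(labels)
    seen = Counter()  # number of samples of each class assigned so far
    for i, label in enumerate(labels):
        fold_ids[i] = seen[label] % n_splits
        seen[label] += 1
    return fold_ids


fold_ids = stratified_fold_ids(labels, n_splits)

# Count positives and total samples in each validation fold.
positives_per_fold = [
    sum(1 for f, y in zip(fold_ids, labels) if f == k and y == 1)
    for k in range(n_splits)
]
fold_sizes = [fold_ids.count(k) for k in range(n_splits)]

print("Positives per fold:", positives_per_fold)
print("Fold sizes:", fold_sizes)
```

Each validation fold ends up with 20 or 21 positives out of roughly 200 samples. Without stratification, a fold could by chance contain very few positives, which makes the per-fold ROC-AUC noisy (and undefined if a fold contains only one class).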


## Run Hyperparameter Optimization with FCVOpt

In [6]:
# Initialize FCVOpt optimizer
optimizer = FCVOpt(
    obj=cv_obj.cvloss,  # The cross-validation objective function
    n_folds=cv_obj.cv.get_n_splits(),  # Total number of folds (10)
    config=config,  # Search space definition
    acq_function='LCB',  # Lower Confidence Bound acquisition function
    tracking_dir='./hpt_opt_runs/',  # Directory for saving optimization results
    experiment='lgb_sklearn_tuning',  # Experiment name for tracking
    seed=123
)

# Run the optimization for 50 Bayesian optimization (BO) trials
best_conf = optimizer.optimize(n_trials=50)


Number of candidates evaluated.....: 50
Observed obj at incumbent..........: 0.0347222
Estimated obj at incumbent.........: 0.0648997

 Best Configuration at termination:
 Configuration(values={
  'colsample_bytree': 0.4551726039474,
  'learning_rate': 0.0387740433153,
  'min_data_in_leaf': 10,
  'n_estimators': 123,
  'num_leaves': 32,
})
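
The `acq_function='LCB'` setting refers to the lower confidence bound acquisition function. In its textbook form for minimization, each candidate is scored as the predicted mean loss minus a multiple of the predictive standard deviation, and the candidate with the smallest score is evaluated next. The sketch below illustrates that rule only. The candidate values and the weight `kappa` are made up, and the exact form used inside FCVOpt may differ.

```python
# Hypothetical posterior summaries for three candidate configurations:
# (predicted mean loss, predictive standard deviation).
candidates = {
    "A": (0.070, 0.005),
    "B": (0.075, 0.020),
    "C": (0.090, 0.010),
}
kappa = 2.0  # assumed exploration weight, not an FCVOpt default


def lower_confidence_bound(mean, std, kappa):
    """Optimistic estimate of the loss: mean minus kappa standard deviations."""
    return mean - kappa * std


lcb_scores = {
    name: lower_confidence_bound(mean, std, kappa)
    for name, (mean, std) in candidates.items()
}

# For minimization, the next configuration to evaluate has the smallest LCB.
next_candidate = min(lcb_scores, key=lcb_scores.get)
print("Next candidate:", next_candidate)
```

Here candidate B is selected even though A has the lower predicted mean, because B's larger uncertainty leaves more room for improvement. A larger `kappa` favors exploration and a smaller one favors exploitation.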


In [7]:
# Evaluate the best configuration found by FCVOpt
# Convert loss back to AUC for easier interpretation (loss = 1 - AUC)
best_cv_loss = cv_obj(best_conf)
best_cv_auc = 1 - best_cv_loss

print(f"10-fold CV Loss: {best_cv_loss:.4f}")
print(f"10-fold CV ROC-AUC: {best_cv_auc:.4f}")

10-fold CV Loss: 0.0665
10-fold CV ROC-AUC: 0.9335
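
As an aid to reading this number, ROC-AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way on a toy example and converts it to the loss used in this notebook. The labels and scores are invented for illustration, and `pairwise_auc` is a hypothetical helper rather than part of scikit-learn or FCVOpt.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and predicted scores, made up for illustration.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70]

auc = pairwise_auc(y_true, scores)
loss = 1 - auc  # same conversion as auc_loss above

print(f"AUC:  {auc:.4f}")   # 8 of 9 pairs are ranked correctly -> 0.8889
print(f"Loss: {loss:.4f}")  # 0.1111
```

Read the same way, a cross-validated ROC-AUC of 0.9335 means that the tuned model ranks a random positive above a random negative about 93% of the time.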
