# Extending CVObjective for Custom Models

This notebook demonstrates how to extend the `CVObjective` base class to work with models that don't have a scikit-learn-compatible API. We'll use LightGBM's native Python API as an example.

When to extend CVObjective:
- Your model doesn't follow scikit-learn's `fit`/`predict` interface
- You need custom training logic (e.g., early stopping with validation sets)
- You want fine-grained control over the training process

When to use SklearnCVObj instead:
- Your model follows scikit-learn's API (has `fit` and `predict` methods)
- Examples: `RandomForestClassifier`, `XGBClassifier`, `LGBMClassifier`
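
As a rough rule of thumb, the choice between the two lists above comes down to duck typing: does the estimator expose callable `fit` and `predict` attributes? The helper below is a minimal sketch of that check. `has_sklearn_api` and the two toy classes are hypothetical names introduced for illustration; they are not part of FCVOpt, scikit-learn, or LightGBM.

```python
def has_sklearn_api(model) -> bool:
    """Return True if `model` exposes callable `fit` and `predict` attributes."""
    return all(callable(getattr(model, name, None)) for name in ("fit", "predict"))


class SklearnLike:
    """Toy stand-in for an estimator such as RandomForestClassifier."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [0 for _ in X]


class NativeBooster:
    """Toy stand-in for a model that is trained by a function call, not by `fit`."""

    def predict(self, X):
        return [0.5 for _ in X]


print(has_sklearn_api(SklearnLike()))    # True
print(has_sklearn_api(NativeBooster()))  # False
```

A model in the second category is the kind of case where writing a `CVObjective` subclass is likely the simpler route.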

Requirements: `lightgbm` must be installed prior to running this notebook.

In [1]:
# Note: Set OpenMP threads to 1 to avoid threading conflicts on macOS
import os
os.environ['OMP_NUM_THREADS'] = '1'

In [2]:
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score

# FCVOpt imports
from fcvopt.crossvalidation import CVObjective
from fcvopt.optimizers import FCVOpt
from fcvopt.configspace import ConfigurationSpace
from ConfigSpace import Integer, Float

## Generate Sample Data

We use the same dataset as in the sklearn API example for consistency.

In [3]:
# Generate an imbalanced binary classification dataset (roughly 90% vs. 10%)
# with 2000 samples and 25 features (5 informative, 10 redundant)
X, y = make_classification(
    n_samples=2000,
    n_features=25,
    n_informative=5,
    n_redundant=10,
    n_classes=2,
    n_clusters_per_class=2,
    weights=[0.9, 0.1],  # imbalanced classes
    random_state=23
)

print(f"Shape of features matrix: {X.shape}")
print(f"Class distribution: {np.bincount(y)}")

Shape of features matrix: (2000, 25)
Class distribution: [1796  204]


## Define Hyperparameter Search Space

We use the same hyperparameter space as in the sklearn API example:

| Hyperparameter | Range | Scale | Description |
|----------------|-------|-------|-------------|
| `num_round` | [50, 1000] | log | Number of boosting rounds (trees) |
| `learning_rate` | [1e-3, 0.25] | log | Shrinkage rate for updates |
| `num_leaves` | [2, 128] | log | Maximum number of leaves per tree |
| `min_data_in_leaf` | [2, 100] | log | Minimum samples required in a leaf |
| `colsample_bytree` | [0.05, 1.0] | log | Fraction of features used per tree |

In [4]:
# Create configuration space for hyperparameter search
config = ConfigurationSpace()

# Add hyperparameters with appropriate ranges and scales
config.add([
    Integer('num_round', bounds=(50, 1000), log=True),
    Float('learning_rate', bounds=(1e-3, 0.25), log=True),
    Integer('num_leaves', bounds=(2, 128), log=True),
    Integer('min_data_in_leaf', bounds=(2, 100), log=True),
    Float('colsample_bytree', bounds=(0.05, 1), log=True)
])
print(config)

Configuration space object:
  Hyperparameters:
    colsample_bytree, Type: UniformFloat, Range: [0.05, 1.0], Default: 0.22360679775, on log-scale
    learning_rate, Type: UniformFloat, Range: [0.001, 0.25], Default: 0.0158113883008, on log-scale
    min_data_in_leaf, Type: UniformInteger, Range: [2, 100], Default: 14, on log-scale
    num_leaves, Type: UniformInteger, Range: [2, 128], Default: 16, on log-scale
    num_round, Type: UniformInteger, Range: [50, 1000], Default: 224, on log-scale
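
The defaults printed above are consistent with taking the midpoint of each range in log space, that is, the geometric mean of the bounds, rounded to the nearest integer for integer hyperparameters. The sketch below reproduces the printed values under that assumption. `log_midpoint` is a hypothetical helper written for this example, not a ConfigSpace function.

```python
import math


def log_midpoint(lower: float, upper: float) -> float:
    """Midpoint of [lower, upper] on a log scale (the geometric mean)."""
    return math.sqrt(lower * upper)


# Bounds copied from the search space defined above
bounds = {
    "colsample_bytree": (0.05, 1.0),
    "learning_rate": (1e-3, 0.25),
    "min_data_in_leaf": (2, 100),
    "num_leaves": (2, 128),
    "num_round": (50, 1000),
}
integer_params = {"min_data_in_leaf", "num_leaves", "num_round"}

defaults = {
    name: round(log_midpoint(lo, hi)) if name in integer_params else log_midpoint(lo, hi)
    for name, (lo, hi) in bounds.items()
}

for name, value in defaults.items():
    print(name, value)
```

This is why the default `num_round` is 224 rather than the arithmetic midpoint of 525: on a log scale, 224 is about as far from 50 as it is from 1000.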



## Extend CVObjective for LightGBM Native API

To create a custom CV objective, extend the `CVObjective` base class and implement the `fit_and_test` method.

Key responsibilities:
1. `CVObjective` base class: Handles the CV loop, fold splitting, and aggregation
2. Your `fit_and_test` method: Trains and evaluates on a SINGLE fold

The `fit_and_test` method receives train/test indices for one fold and should return a scalar loss value.
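
The division of labor can be sketched with a toy base class. This is an illustration of the pattern only, not FCVOpt's actual implementation: `ToyCVObjective`, its round-robin fold assignment, and `HeldOutFractionObjective` are all hypothetical and use nothing beyond the standard library.

```python
from statistics import mean


class ToyCVObjective:
    """Hypothetical base class: owns the fold loop and the aggregation."""

    def __init__(self, n_samples: int, n_splits: int):
        indices = list(range(n_samples))
        # Round-robin fold assignment; a real implementation would shuffle or stratify.
        self.folds = [indices[k::n_splits] for k in range(n_splits)]

    def fit_and_test(self, params, train_index, test_index) -> float:
        raise NotImplementedError("Subclasses evaluate a single fold here.")

    def __call__(self, params) -> float:
        losses = []
        for k, test_index in enumerate(self.folds):
            # Training indices are all indices outside the held-out fold k
            train_index = [i for j, fold in enumerate(self.folds) if j != k for i in fold]
            losses.append(self.fit_and_test(params, train_index, test_index))
        return mean(losses)


class HeldOutFractionObjective(ToyCVObjective):
    """Subclass that implements only the per-fold step."""

    def fit_and_test(self, params, train_index, test_index) -> float:
        # Dummy "loss": the fraction of the data held out in this fold
        return len(test_index) / (len(train_index) + len(test_index))


obj = HeldOutFractionObjective(n_samples=20, n_splits=5)
print(obj({"learning_rate": 0.1}))
```

The subclass never sees more than one fold at a time, which is the same contract `fit_and_test` follows in the LightGBM class below.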

In [5]:
class LightGBMCVObj(CVObjective):
    """
    Custom CVObjective for LightGBM native API.
    
    This demonstrates how to extend CVObjective for models that don't follow
    scikit-learn's API. The parent class handles cross-validation, while this 
    class focuses on training/evaluating a single fold.
    """    
    def fit_and_test(self, params, train_index, test_index):
        """
        Train and evaluate model on a SINGLE fold.
        
        This method is called by CVObjective for each fold during cross-validation.
        Do NOT loop over folds here - that's handled by the parent class.
        
        Parameters
        ----------
        params : dict
            Hyperparameters to evaluate (e.g., {'num_round': 100, 'learning_rate': 0.1})
        train_index : array-like
            Indices for training data in this fold
        test_index : array-like
            Indices for test data in this fold
            
        Returns
        -------
        float
            Loss value for this fold (lower is better)
        """
        # Step 1: Extract data for this fold
        # Handle both DataFrame and ndarray formats
        if isinstance(self.X, pd.DataFrame):
            X_train = self.X.iloc[train_index]
            X_test = self.X.iloc[test_index]
        else:
            X_train = self.X[train_index]
            X_test = self.X[test_index]

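        # np.asarray keeps the indexing positional even if y was passed as a pandas Series
        y_train = np.asarray(self.y)[train_index]
        y_test = np.asarray(self.y)[test_index]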

        # Step 2: Prepare LightGBM parameters
        # Separate num_round from model parameters
        lgb_params = {
            'objective': 'binary',  # Binary classification
            'metric': 'auc',  # Evaluation metric (only reported when validation sets are passed)
            'seed': self.rng_seed,
            'verbosity': -1  # Suppress output
        }

        # Extract num_round separately (it's not a model parameter)
        num_round = params.get('num_round', 100)  # Default to 100 if not specified
        
        # Add all other parameters to lgb_params
        for param, val in params.items():
            if param != 'num_round':
                lgb_params[param] = val
        
        # Step 3: Create LightGBM Dataset objects
        # LightGBM's native API requires Dataset objects
        train_data = lgb.Dataset(X_train, label=y_train)
        
        # Step 4: Train model using lgb.train (native API)
        # This gives more control than the sklearn API
        bst = lgb.train(lgb_params, train_data, num_round)
        
        # Step 5: Predict on test set
        # Returns predicted probabilities for binary classification
        y_pred_proba = bst.predict(X_test)
        
        # Step 6: Compute and return loss
        # Use the loss_metric provided during initialization
        return self.loss_metric(y_test, y_pred_proba)
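
The parameter handling in Step 2 of `fit_and_test` can also be factored into a small pure function, which makes it easy to check without training a model. The following is a minimal sketch: `split_lgb_params` is a hypothetical helper written for this example, and the fixed entries simply mirror the ones used in the class above.

```python
def split_lgb_params(params, seed=None, default_num_round=100):
    """Split a flat hyperparameter dict into (lgb_params, num_round)."""
    # Fixed settings, mirroring the ones in fit_and_test
    lgb_params = {"objective": "binary", "metric": "auc", "verbosity": -1}
    if seed is not None:
        lgb_params["seed"] = seed

    # Work on a copy so the caller's dict is not mutated by pop()
    remaining = dict(params)
    num_round = int(remaining.pop("num_round", default_num_round))

    # Everything else is passed through as a model parameter
    lgb_params.update(remaining)
    return lgb_params, num_round


config_values = {"num_round": 224, "learning_rate": 0.0158, "num_leaves": 16}
lgb_params, num_round = split_lgb_params(config_values, seed=42)
print(num_round)
print(lgb_params)
```

Keeping `num_round` out of `lgb_params` matters because it is passed to the training call as the number of boosting rounds rather than as an entry in the parameter dictionary.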

In [6]:
# Define loss metric: minimize (1 - AUC)
def auc_loss(y_true, y_pred):
    return 1 - roc_auc_score(y_true, y_pred)

# Create our custom CV objective
# Pass all the same arguments as we would to CVObjective
cv_obj = LightGBMCVObj(
    X=X, 
    y=y,
    loss_metric=auc_loss,  # Our custom loss function
    task='classification',
    n_splits=10,  # 10-fold cross-validation
    stratified=True,  # Use stratified splits to preserve class distribution
    rng_seed=42
)

print("LightGBMCVObj initialized")
print(f"Number of CV folds: {cv_obj.cv.get_n_splits()}")
print(f"Training samples: {len(cv_obj.y)}")
print(f"Features: {cv_obj.X.shape[1]}")

LightGBMCVObj initialized
Number of CV folds: 10
Training samples: 2000
Features: 25
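
To make the loss concrete: ROC-AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The reference implementation below computes it by brute force over all positive/negative pairs. It is a sketch for intuition, not how `roc_auc_score` is implemented, and `pairwise_auc` and `auc_loss_reference` are hypothetical names.

```python
def pairwise_auc(y_true, y_score):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined when only one class is present.")
    # A correctly ordered pair counts 1, a tie counts 0.5, a misordered pair counts 0
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_loss_reference(y_true, y_score):
    """Loss to minimize: 1 - AUC, so 0 is a perfect ranking."""
    return 1.0 - pairwise_auc(y_true, y_score)


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, y_score))        # 0.75
print(auc_loss_reference(y_true, y_score))  # 0.25
```

The `ValueError` branch also shows why stratified splits are useful with a 90/10 class balance: the metric cannot be computed on a test fold that contains only one class.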


### Test the Custom CV Objective

Before running full optimization, test that our custom implementation works correctly.

In [8]:
# Test with the default configuration
test_config = config.get_default_configuration()
print("Testing with configuration:")
print(test_config)

# Evaluate this configuration using our custom CV objective
# This calls cv_obj.__call__(), which internally:
# 1. Loops over all 10 folds
# 2. Calls fit_and_test() for each fold
# 3. Returns the mean loss across folds
test_loss = cv_obj(dict(test_config))
test_auc = 1 - test_loss

print(f"\nCV Loss (1-AUC): {test_loss:.6f}")
print(f"CV AUC: {test_auc:.6f}")
print("Custom CV objective working correctly!")

Testing with configuration:
Configuration(values={
  'colsample_bytree': 0.22360679775,
  'learning_rate': 0.0158113883008,
  'min_data_in_leaf': 14,
  'num_leaves': 16,
  'num_round': 224,
})

CV Loss (1-AUC): 0.072066
CV AUC: 0.927934
Custom CV objective working correctly!


## Run Hyperparameter Optimization with FCVOpt

Now use the custom CV objective with FCVOpt, exactly as we did with `SklearnCVObj` in the previous example.

In [9]:
# Initialize the FCVOpt optimizer with our custom CV objective
# Note: cv_obj.cvloss is deprecated but still supported; cv_obj can also be passed directly
optimizer = FCVOpt(
    obj=cv_obj.cvloss,  # Our custom CV objective's evaluation method
    n_folds=cv_obj.cv.get_n_splits(),  # Total number of folds (10)
    config=config,  # Search space definition
    acq_function='LCB',  # Lower Confidence Bound acquisition function
    tracking_dir='./hpt_opt_runs/',  # Directory for saving optimization results
    experiment='lgb_native_tuning',  # Experiment name for tracking
    seed=123
)

# Run optimization for 50 BO iterations
best_conf = optimizer.optimize(n_trials=50)

2025/10/25 19:26:40 INFO mlflow.tracking.fluent: Experiment with name 'lgb_native_tuning' does not exist. Creating a new experiment.



Number of candidates evaluated.....: 50
Observed obj at incumbent..........: 0.0386111
Estimated obj at incumbent.........: 0.0706391

 Best Configuration at termination:
 Configuration(values={
  'colsample_bytree': 1.0,
  'learning_rate': 0.0383681477115,
  'min_data_in_leaf': 6,
  'num_leaves': 38,
  'num_round': 446,
})


In [10]:
# Evaluate the best configuration found by FCVOpt
# Convert loss back to AUC for easier interpretation (loss = 1 - AUC)
best_cv_loss = cv_obj(best_conf)
best_cv_auc = 1 - best_cv_loss

print(f"10-fold CV Loss: {best_cv_loss:.4f}")
print(f"10-fold CV ROC-AUC: {best_cv_auc:.4f}")

10-fold CV Loss: 0.0635
10-fold CV ROC-AUC: 0.9365
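
For a quick comparison with the default configuration evaluated earlier, the arithmetic below uses only the two numbers printed in this notebook (the tuned loss is available to four decimals only). It suggests the tuned configuration cuts the 10-fold CV loss by roughly 12% relative to the default. The variable names are introduced here for illustration.

```python
default_loss = 0.072066  # 10-fold CV loss (1 - AUC) at the default configuration
tuned_loss = 0.0635      # 10-fold CV loss at the configuration returned by FCVOpt

# Because loss = 1 - AUC, a drop in loss is an equal gain in AUC
auc_gain = default_loss - tuned_loss
relative_reduction = auc_gain / default_loss

print(f"AUC gain: {auc_gain:.4f}")
print(f"Relative loss reduction: {relative_reduction:.1%}")
```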
