# Introduction to FCVOpt 

This notebook demonstrates the FCVOpt API for efficient hyperparameter optimization using **fractional cross-validation**. We'll tune a Random Forest classifier on a synthetic dataset to illustrate the key concepts and workflow.

## What is FCVOpt?

FCVOpt addresses a basic cost problem in hyperparameter optimization:

- **The Problem**: K-fold cross-validation is more robust than a single train-test split, but it requires fitting K models at each hyperparameter configuration, which makes optimization computationally expensive.

- **The Solution**: FCVOpt uses a hierarchical Gaussian process model to exploit the correlation between folds across the hyperparameter space. This allows the algorithm to evaluate only one CV fold for many configurations while still providing reliable estimates of the full K-fold CV loss.
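
To make the cost difference concrete, the sketch below counts model fits under full K-fold evaluation and under single-fold evaluation. The function name `count_model_fits` and the example numbers are illustrative and not part of FCVOpt. The optimizer may evaluate more than one fold at some configurations, so treat the single-fold count as a lower bound.

```python
def count_model_fits(n_configs: int, n_folds: int, folds_per_config: int) -> int:
    """Number of model fits needed when each configuration is evaluated on
    `folds_per_config` of the `n_folds` available folds."""
    if not 1 <= folds_per_config <= n_folds:
        raise ValueError("folds_per_config must be between 1 and n_folds")
    return n_configs * folds_per_config


n_configs, n_folds = 25, 10  # illustrative values

# Full K-fold CV fits one model per fold at every configuration
full_cv_fits = count_model_fits(n_configs, n_folds, folds_per_config=n_folds)
# Fractional CV, in the best case, fits one model per configuration
single_fold_fits = count_model_fits(n_configs, n_folds, folds_per_config=1)

print(f"Full {n_folds}-fold CV: {full_cv_fits} fits")
print(f"One fold per configuration: {single_fold_fits} fits")
```

With these numbers, single-fold evaluation needs a tenth of the model fits, which is where the savings come from when each fit is expensive.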

In [1]:
# Import required libraries
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import zero_one_loss

from fcvopt.optimizers import FCVOpt
from fcvopt.crossvalidation import SklearnCVObj
from fcvopt.configspace import ConfigurationSpace
from ConfigSpace import Integer, Float

## Generating the data

We'll create a synthetic binary classification dataset with the following characteristics:
- 1,250 samples with 50 features
- 10 features are informative, 25 are redundant (linear combinations of the informative ones), and the remaining 15 are noise
- About 10% of labels are assigned at random (`flip_y=0.1`) to make the problem more realistic
- 80/20 train/test split for final model evaluation

**Note**: The test set is held out entirely and will only be used to evaluate the final optimized model. It plays no role in hyperparameter optimization.

In [2]:
# Generate sample classification data
X, y = make_classification(
    n_samples=1250, 
    n_features=50, 
    n_informative=10,
    n_redundant=25,
    n_classes=2,
    flip_y=0.1,
    random_state=42
)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} samples, {X_train.shape[1]} features")
print(f"Test set: {X_test.shape[0]} samples")
print(f"Class distribution: {np.bincount(y_train)}")

Training set: 1000 samples, 50 features
Test set: 250 samples
Class distribution: [497 503]


## FCVOpt API

FCVOpt follows a simple and intuitive three-step process:

```
1. Define Cross-Validation Objective
   ↓
2. Define Hyperparameter Search Space  
   ↓
3. Run Optimization
```
Let's walk through each step in detail.

## Step 1: Define the Cross-Validation Objective
The CV objective encapsulates:
- **What model** we're optimizing (RandomForestClassifier)
- **What data** we're using (X_train, y_train)
- **What metric** we're minimizing (misclassification rate)
- **How many folds** to use (10-fold CV)

For scikit-learn estimators, FCVOpt provides the convenient `SklearnCVObj` wrapper class.
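
Conceptually, a CV objective maps a configuration and a fold index to a validation loss. The sketch below illustrates that idea in plain Python with a toy majority-class "estimator". All names here (`make_folds`, `fold_loss`, `fit_majority`, `misclassification_rate`) are hypothetical, and the code is not how `SklearnCVObj` is implemented.

```python
import random


def make_folds(n_samples, n_splits, seed=0):
    """Shuffle sample indices and deal them into n_splits disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[k::n_splits] for k in range(n_splits)]


def fold_loss(fit, loss, X, y, folds, fold_idx):
    """Train on every fold except `fold_idx`, then score on `fold_idx`."""
    val_idx = folds[fold_idx]
    val_set = set(val_idx)
    train_idx = [i for i in range(len(y)) if i not in val_set]
    predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
    y_pred = [predict(X[i]) for i in val_idx]
    return loss([y[i] for i in val_idx], y_pred)


def fit_majority(X_train, y_train):
    """Toy stand-in for an estimator: always predicts the most common label."""
    majority = max(set(y_train), key=y_train.count)
    return lambda x: majority


def misclassification_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true labels."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


# Tiny toy dataset: 20 samples, 14 labeled 0 and 6 labeled 1
X = [[float(i)] for i in range(20)]
y = [0] * 14 + [1] * 6

folds = make_folds(len(y), n_splits=5)
per_fold = [
    fold_loss(fit_majority, misclassification_rate, X, y, folds, k)
    for k in range(5)
]
cv_loss = sum(per_fold) / len(per_fold)  # the full K-fold CV loss
print(per_fold, cv_loss)
```

The full CV loss is the average of the per-fold losses. A fractional approach evaluates only some entries of `per_fold` for a given configuration and relies on a model to estimate the average.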

In [3]:
# Create CV objective for Random Forest
cv_obj = SklearnCVObj(
    estimator=RandomForestClassifier(random_state=42),
    X=X_train, 
    y=y_train,
    loss_metric=zero_one_loss,  # Minimize misclassification rate
    task='binary-classification',
    n_splits=10,  # 10-fold cross-validation
    rng_seed=42
)

print(f"Created CV objective with {cv_obj.cv.get_n_splits()} folds")

Created CV objective with 10 folds


## Step 2: Define the Hyperparameter Search Space

The configuration space specifies which hyperparameters to optimize and their valid ranges. For Random Forest, we'll tune:

| Hyperparameter | Range | Scale | Description |
|----------------|-------|-------|-------------|
| `n_estimators` | [50, 1000] | Log | Number of trees in the forest |
| `max_depth` | [1, 15] | Log | Maximum depth of each tree |
| `max_features` | [0.01, 1.0] | Log | Fraction of features to consider for splits |
| `min_samples_split` | [2, 200] | Log | Minimum samples required to split a node |

In [4]:
# Define hyperparameter search space
config = ConfigurationSpace()
config.add([
    Integer('n_estimators', bounds=(50, 1000), log=True),
    Integer('max_depth', bounds=(1, 15), log=True),
    Float('max_features', bounds=(0.01, 1.0), log=True),
    Integer('min_samples_split', bounds=(2, 200), log=True)
])
config.generate_indices()

print(config)

Configuration space object:
  Hyperparameters:
    max_depth, Type: UniformInteger, Range: [1, 15], Default: 4, on log-scale
    max_features, Type: UniformFloat, Range: [0.01, 1.0], Default: 0.1, on log-scale
    min_samples_split, Type: UniformInteger, Range: [2, 200], Default: 20, on log-scale
    n_estimators, Type: UniformInteger, Range: [50, 1000], Default: 224, on log-scale
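
With `log=True`, a range is treated as uniform in log space, so its natural center is the geometric mean of the bounds rather than the arithmetic mean. The sketch below computes that midpoint for each range. The helper `log_scale_midpoint` is illustrative, and rounding integer hyperparameters to the nearest whole number is an assumption made here. This is a check against the defaults printed above, not a description of how ConfigSpace computes them.

```python
import math


def log_scale_midpoint(lower: float, upper: float) -> float:
    """Midpoint of [lower, upper] in log space, i.e. the geometric mean."""
    return math.sqrt(lower * upper)


bounds = {
    "n_estimators": (50, 1000),
    "max_depth": (1, 15),
    "max_features": (0.01, 1.0),
    "min_samples_split": (2, 200),
}
integer_valued = {"n_estimators", "max_depth", "min_samples_split"}

midpoints = {}
for name, (lower, upper) in bounds.items():
    mid = log_scale_midpoint(lower, upper)
    # Assumption: integer hyperparameters are rounded to the nearest integer
    midpoints[name] = round(mid) if name in integer_valued else mid

print(midpoints)
```

The resulting values (224, 4, 0.1, and 20) agree with the printed defaults, whereas an arithmetic midpoint for `n_estimators` would be 525.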



## Step 3: Initialize and Run the Optimizer

Now we're ready to optimize! Key parameters for `FCVOpt`:

- `obj`: The objective function to minimize (CV loss)
- `n_folds`: Number of CV folds in the objective
- `config`: The search space we defined
- `acq_function`: Acquisition function for Bayesian optimization
  - `'LCB'` (Lower Confidence Bound): Faster, good for exploration/exploitation balance
  - `'KG'` (Knowledge Gradient): Often better results but slower
- `tracking_dir`: Directory for MLflow experiment tracking
- `experiment`: Name for this optimization run
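
To give some intuition for the `'LCB'` option, the sketch below applies the textbook lower confidence bound rule for minimization, `mean - kappa * std`, to a few made-up surrogate predictions. The value of `kappa`, the candidate names, and the numbers are all assumptions for illustration. FCVOpt's own acquisition code and defaults may differ.

```python
def lower_confidence_bound(mean: float, std: float, kappa: float = 2.0) -> float:
    """LCB for minimization: an optimistic estimate of the loss.
    Larger `kappa` rewards uncertainty more (more exploration)."""
    return mean - kappa * std


# Hypothetical surrogate predictions (mean, std) for three candidates
candidates = {
    "A": (0.12, 0.01),   # well explored, moderate predicted loss
    "B": (0.10, 0.005),  # well explored, lowest predicted loss
    "C": (0.13, 0.03),   # poorly explored, high uncertainty
}

scores = {
    name: lower_confidence_bound(mean, std)
    for name, (mean, std) in candidates.items()
}
# The next configuration to evaluate is the one with the smallest LCB
next_candidate = min(scores, key=scores.get)
print(scores, next_candidate)
```

Here the uncertain candidate `C` is chosen even though `B` has the lowest predicted loss. Setting `kappa=0` would reduce the rule to pure exploitation and select `B`.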

We'll run 25 trials. Each trial evaluates one hyperparameter configuration on a single held-out fold.

In [5]:
# Initialize FCVOpt optimizer
optimizer = FCVOpt(
    obj=cv_obj.cvloss,
    n_folds=cv_obj.cv.get_n_splits(),
    config=config,
    acq_function='LCB',  # Lower Confidence Bound acquisition
    tracking_dir='./hp_opt_runs/',  # MLflow tracking directory
    experiment='rf_tuning_example',
)

# Run the optimizer for 25 trials
best_conf = optimizer.optimize(n_trials=25)

2025/10/22 21:54:59 INFO mlflow.tracking.fluent: Experiment with name 'rf_tuning_example' does not exist. Creating a new experiment.



Number of candidates evaluated.....: 25
Observed obj at incumbent..........: 0.09
Estimated obj at incumbent.........: 0.0996302

 Best Configuration at termination:
 Configuration(values={
  'max_depth': 15,
  'max_features': 0.1314003108823,
  'min_samples_split': 5,
  'n_estimators': 1000,
})


## Training the Final Model

Now we train a Random Forest with the optimized hyperparameters on the full training set and evaluate it on the held-out test set.

In [6]:
# Build a model with the best hyperparameters found
best_model = cv_obj.construct_model(dict(best_conf))

# Train the model on the full training set
best_model.fit(X_train, y_train)

In [7]:
# Evaluate on training and test sets
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)

train_mcr = zero_one_loss(y_train, y_train_pred)
test_mcr = zero_one_loss(y_test, y_test_pred)

print("Final Model Performance:")
print(f"  Training Misclassification Rate....: {train_mcr:.4f}")
print(f"  Test Misclassification Rate........: {test_mcr:.4f}")

Final Model Performance:
  Training Misclassification Rate....: 0.0020
  Test Misclassification Rate........: 0.1000
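
As a rough sanity check on these numbers, the sketch below estimates the sampling uncertainty of the test error rate. It treats the 250 test predictions as independent Bernoulli trials. That is a standard approximation and an assumption added here, not part of the FCVOpt workflow. All input values are copied from the outputs above.

```python
import math

n_test = 250             # test set size reported above
test_mcr = 0.1000        # test misclassification rate reported above
train_mcr = 0.0020       # training misclassification rate reported above
cv_estimate = 0.0996302  # "Estimated obj at incumbent" from the optimizer output

# Binomial standard error of an error rate measured on n_test samples
std_error = math.sqrt(test_mcr * (1 - test_mcr) / n_test)

print(f"Test errors..............: {round(test_mcr * n_test)} of {n_test}")
print(f"Approx. standard error...: {std_error:.4f}")
print(f"Train/test gap...........: {test_mcr - train_mcr:.4f}")
print(f"|CV estimate - test|.....: {abs(cv_estimate - test_mcr):.4f}")
```

Under this approximation, the optimizer's estimated CV loss differs from the test error by much less than one standard error. The large gap between training and test error is typical of deep random forests, which can nearly memorize the training set, so the training error says little about generalization.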
