
# Hyperparameter Optimization for a Random Forest Classifier

This notebook first introduces cross-validation and then demonstrates three methods for hyperparameter tuning:

1. Grid Search (scikit-learn)  
2. Random Search (scikit-learn)  
3. Bayesian Optimization (using Optuna)

We use a real-life binary classification dataset (the breast cancer dataset from scikit-learn) and a Random Forest classifier.

The hyperparameter optimization techniques shown here apply to other types of models as well.


## 1. Setup and Data Loading

In [1]:

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training samples: {X_train.shape[0]}, Test samples: {X_test.shape[0]}")

Training samples: 455, Test samples: 114
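
The split above passes `stratify=y`, which keeps the class proportions roughly equal in the training and test sets. A minimal sketch to check this, assuming the same dataset and split settings (the variable names `full_ratio`, `train_ratio`, and `test_ratio` are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load features and labels as pandas objects
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Labels are 0/1, so the mean is the fraction of samples with label 1
full_ratio = y.mean()
train_ratio = y_train.mean()
test_ratio = y_test.mean()

print(f"Label-1 fraction: full={full_ratio:.3f}, "
      f"train={train_ratio:.3f}, test={test_ratio:.3f}")
```

Without `stratify`, a small test set can end up with a noticeably different class balance, which makes accuracy harder to compare between splits.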


In [2]:
X.head()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678



## 2. Cross-Validation

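Cross-validation estimates model performance by splitting the training data into $k$ folds. Each fold serves once as the validation set while the model is trained on the remaining $k-1$ folds, and the final score is the average over all folds: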

$$
\text{score} = \frac{1}{k} \sum_{i=1}^{k} \text{performance on fold } i
$$

We typically use $k=5$ or $k=10$.
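
The averaging formula above can be written as an explicit loop. This is only a sketch of the idea, not scikit-learn's internal implementation; it assumes stratified folds (which `cross_val_score` also uses for classifiers when `cv` is an integer), and `n_estimators=50` and the name `fold_scores` are illustrative choices:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=50, random_state=42)

k = 5
folds = StratifiedKFold(n_splits=k)

fold_scores = []
for train_idx, val_idx in folds.split(X, y):
    # Train a fresh copy of the model on the other k-1 folds
    fold_model = clone(model)
    fold_model.fit(X[train_idx], y[train_idx])
    # Evaluate accuracy on the held-out fold
    fold_scores.append(fold_model.score(X[val_idx], y[val_idx]))

# The cross-validation score is the mean over the k folds
cv_score = np.mean(fold_scores)
print("Per-fold accuracy:", np.round(fold_scores, 4))
print(f"Mean accuracy: {cv_score:.4f}")
```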


In [None]:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier(random_state=42)   # default hyperparameters

scores = cross_val_score(rf, X_train, y_train, cv=5, scoring="accuracy")
print("Cross-validation accuracy scores:", scores)
print("Mean accuracy: {:.4f} ± {:.4f}".format(scores.mean(), scores.std()))


Cross-validation accuracy scores: [0.96703297 0.98901099 0.92307692 0.93406593 0.95604396]
Mean accuracy: 0.9538 ± 0.0235



## 3. Grid Search (scikit-learn)

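Grid Search exhaustively evaluates every combination in a predefined parameter grid $\Theta$.

Denote the hyperparameters as $\theta = \{\text{n\_estimators}, \text{max\_depth}, \ldots\}$. We select the combination with the highest cross-validated score: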

$$
\hat{\theta} = \arg\max_{\theta \in \Theta} \text{CVScore}(\theta)
$$
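
The cost of this search grows multiplicatively with the number of values per parameter. As a small sketch, assuming the three-parameter grid used in this section and 5-fold cross-validation, `ParameterGrid` can enumerate the candidates before any model is trained (`n_candidates` and `n_fits` are illustrative names):

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5],
}

# Every combination of the listed values is one candidate
candidates = list(ParameterGrid(param_grid))
n_candidates = len(candidates)   # 2 * 3 * 2 combinations
n_fits = n_candidates * 5        # each candidate is fitted once per fold

print(f"{n_candidates} candidates, {n_fits} model fits with 5-fold CV")
print("First candidate:", candidates[0])
```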


In [4]:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),  # any other estimator works here
    param_grid,
    cv=5,
    scoring='accuracy',   # other metrics such as 'f1' or 'roc_auc' also work
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print(f"Best CV accuracy: {grid_search.best_score_:.4f}")

Best parameters: {'max_depth': None, 'min_samples_split': 5, 'n_estimators': 50}
Best CV accuracy: 0.9582


In [6]:
best_rf = grid_search.best_estimator_
print(best_rf)
test_acc = best_rf.score(X_test, y_test)
print(f"Test set accuracy: {test_acc:.4f}")

RandomForestClassifier(min_samples_split=5, n_estimators=50, random_state=42)
Test set accuracy: 0.9474
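
Besides `best_params_`, a fitted search keeps the score of every candidate in `cv_results_`, which helps judge whether the winner is clearly better or just marginally ahead. The sketch below assumes a smaller, illustrative grid than the one above so that it runs quickly:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [50, 100], "max_depth": [None, 5]},  # illustrative grid
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# One row per candidate, sorted from best to worst mean CV score
results = pd.DataFrame(search.cv_results_)
table = results[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(table.to_string(index=False))
```

If the top candidates differ by less than their `std_test_score`, the ranking among them is probably not meaningful.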



## 4. Random Search (scikit-learn)

Random Search evaluates a fixed number (`n_iter`) of randomly sampled parameter combinations, which is more efficient than Grid Search when the search space is large.
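
To see what gets sampled, the sketch below draws candidates with `ParameterSampler`, assuming the same distributions as the search in this section. Note that `scipy.stats.randint(low, high)` excludes `high`, so `randint(50, 200)` yields integers from 50 to 199:

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

param_dist = {
    "n_estimators": randint(50, 200),           # integers 50..199
    "max_depth": [None] + list(range(3, 15)),   # a list is sampled uniformly
    "min_samples_split": randint(2, 10),        # integers 2..9
}

# Draw 20 candidate settings without training any model
samples = list(ParameterSampler(param_dist, n_iter=20, random_state=123))

for params in samples[:3]:
    print(params)
```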


In [13]:

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_dist = {
    'n_estimators': randint(50, 200),
    'max_depth': [None] + list(range(3, 15)),
    'min_samples_split': randint(2, 10)
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    cv=5,
    scoring='accuracy',
    random_state=123,
    n_jobs=-1
)

random_search.fit(X_train, y_train)
print("Best parameters (Random Search):", random_search.best_params_)
print(f"Best CV accuracy: {random_search.best_score_:.4f}")


Best parameters (Random Search): {'max_depth': 11, 'min_samples_split': 2, 'n_estimators': 161}
Best CV accuracy: 0.9604


In [14]:
best_rf_rand = random_search.best_estimator_
test_acc_rand = best_rf_rand.score(X_test, y_test)
print(f"Test set accuracy (Random Search): {test_acc_rand:.4f}")

Test set accuracy (Random Search): 0.9561



## 5. Bayesian Optimization (Optuna)

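Bayesian optimization uses the results of earlier trials to decide which hyperparameters to try next, instead of sampling each candidate independently. We demonstrate it with Optuna, whose default sampler is the Tree-structured Parzen Estimator (TPE).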


In [15]:

!pip install optuna







In [16]:

import optuna
from sklearn.model_selection import cross_val_score

def objective(trial):
    n_estimators = trial.suggest_int('n_estimators', 50, 200)
    max_depth = trial.suggest_int('max_depth', 3, 15)
    min_samples_split = trial.suggest_int('min_samples_split', 2, 10)

    clf = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        random_state=42
    )
    scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='accuracy')
    return scores.mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=30)

print("Optuna best params:", study.best_params)
print(f"Optuna best CV accuracy: {study.best_value:.4f}")


[I 2025-08-09 12:19:24,723] A new study created in memory with name: no-name-0e614e66-6ab0-4709-bb17-3bb1183fa658
[I 2025-08-09 12:19:26,386] Trial 0 finished with value: 0.9538461538461538 and parameters: {'n_estimators': 162, 'max_depth': 13, 'min_samples_split': 5}. Best is trial 0 with value: 0.9538461538461538.
[I 2025-08-09 12:19:27,118] Trial 1 finished with value: 0.9560439560439562 and parameters: {'n_estimators': 77, 'max_depth': 15, 'min_samples_split': 3}. Best is trial 1 with value: 0.9560439560439562.
[I 2025-08-09 12:19:28,914] Trial 2 finished with value: 0.953846153846154 and parameters: {'n_estimators': 198, 'max_depth': 8, 'min_samples_split': 9}. Best is trial 1 with value: 0.9560439560439562.
[I 2025-08-09 12:19:29,696] Trial 3 finished with value: 0.9582417582417584 and parameters: {'n_estimators': 87, 'max_depth': 11, 'min_samples_split': 6}. Best is trial 3 with value: 0.9582417582417584.
...

Optuna best params: {'n_estimators': 87, 'max_depth': 11, 'min_samples_split': 6}
Optuna best CV accuracy: 0.9582


In [17]:

optuna_rf = RandomForestClassifier(
    **study.best_params,
    random_state=42
)
optuna_rf.fit(X_train, y_train)
optuna_test_acc = optuna_rf.score(X_test, y_test)
print(f"Test set accuracy (Optuna): {optuna_test_acc:.4f}")


Test set accuracy (Optuna): 0.9561



## 6. Summary and Comparison

The following numbers may vary between runs and library versions. The data split, the models, and the random search are seeded, but the Optuna study is not, so its results can change from run to run.

| Method             | Best CV Accuracy | Test Accuracy |
|--------------------|------------------|---------------|
| Grid Search        | 0.9582           | 0.9474        |
| Random Search      | 0.9604           | 0.9561        |
| Optuna (Bayesian)  | 0.9582           | 0.9561        |

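On large datasets with large hyperparameter spaces, random search and Bayesian optimization usually find good configurations with far fewer evaluations than an exhaustive grid search. On this small test set (114 samples), the gap between 0.9474 and 0.9561 corresponds to a single sample, so the three methods perform essentially the same here.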
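
To put the test accuracies in the table in perspective, the sketch below converts each one into a count of correct predictions out of the 114 test samples and adds a rough binomial standard error. The normal-approximation formula $\sqrt{p(1-p)/n}$ is an assumption made here for illustration, not part of the original analysis:

```python
import math

n_test = 114  # size of the held-out test set

test_accuracy = {
    "Grid Search": 0.9474,
    "Random Search": 0.9561,
    "Optuna (Bayesian)": 0.9561,
}

correct = {}
std_err = {}
for method, acc in test_accuracy.items():
    # Number of correctly classified test samples behind each accuracy
    correct[method] = round(acc * n_test)
    # Rough standard error of an accuracy estimated from n_test samples
    std_err[method] = math.sqrt(acc * (1 - acc) / n_test)
    print(f"{method:<18} {correct[method]}/{n_test} correct, "
          f"std. error ~ {std_err[method]:.3f}")
```

The standard error is larger than the differences between methods, so the test set cannot separate them.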


## 7. Saving and Loading Models in scikit-learn Using joblib

In [18]:
import joblib

In [19]:
# save the model
# joblib.dump(model, "path")

joblib.dump(optuna_rf, 'best_rf_model.joblib')

['best_rf_model.joblib']

In [20]:
# Load the model
# joblib.load("model path")

loaded_model = joblib.load("best_rf_model.joblib")

print(loaded_model)

print(loaded_model.score(X_test, y_test))

RandomForestClassifier(max_depth=11, min_samples_split=6, n_estimators=87,
                       random_state=42)
0.956140350877193
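
A quick way to confirm that a saved model survives the round trip is to compare predictions before and after loading. The sketch below assumes a small illustrative forest and a temporary directory rather than the file written above; the `compress` argument is optional and trades save time for a smaller file:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rf_model.joblib")  # illustrative file name
    joblib.dump(model, path, compress=3)             # compressed save
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print("Predictions identical after reload:", same_predictions)
```

Because joblib files are pickle-based, load them only from trusted sources, and preferably with the same scikit-learn version that saved them.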
