Skip to content

HistGradientBoosting should shuffle features when exploring for new splits #26428

@glemaitre

Description

@glemaitre

Summary

Our implementation of HistGradientBoosting does not shuffle the feature at each node to find the best split. Note that our GradientBoosting, RandomForest, and DecisionTree use a Fisher-Yates permutation.

Not permuting features introduces a bias that can have some impact. For instance, it will impact the inspection and could lead to misinterpretation if one does not know about this implementation detail.

Below, I show an example where correlated features will have close importance in the case of a GradientBoosting. For HistGradientBoosting, a single feature (in an arbitrary manner) will be used and thus only a single feature is reported as important.

Reproducer

# %%
import numpy as np

rng = np.random.default_rng(42)

n_samples, n_features = 100, 3
X = rng.normal(size=(n_samples, n_features))
X = np.concatenate([X, X], axis=1)  # create correlated features
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=n_samples)

# %%
from sklearn.ensemble import GradientBoostingRegressor, HistGradientBoostingRegressor

gbdt = GradientBoostingRegressor(n_estimators=100, random_state=42).fit(X, y)
hgbdt = HistGradientBoostingRegressor(max_iter=100, random_state=42).fit(X, y)

# %%
import pandas as pd
from sklearn.inspection import permutation_importance

importances = {}
for name, model in zip(["gbdt", "hgbdt"], [gbdt, hgbdt]):
    result = permutation_importance(model, X, y, n_repeats=100, random_state=42, n_jobs=-1)
    importances[name] = pd.DataFrame(
        result.importances.T, columns=[f"Features #{i}" for i in range(X.shape[1])]
    )

# %%
import matplotlib.pyplot as plt

fig, axes = plt.subplots(ncols=2, figsize=(10, 4))
for ax, (name, importance) in zip(axes, importances.items()):
    importance.plot.box(ax=ax, vert=False, whis=100)
    ax.set_title(name)
    ax.set_xlabel("Decrease in R2 score")
fig.tight_layout()

image

Resolution

We can use a shuffling in _find_node_split to alleviate this bias and have a similar implementation as in the current tree implementation.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions