-
-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Description
Summary
Our implementation of HistGradientBoosting
does not shuffle the feature at each node to find the best split. Note that our GradientBoosting
, RandomForest
, and DecisionTree
use a Fisher-Yates permutation.
Not permuting features introduces a bias that can have some impact. For instance, it will impact the inspection and could lead to misinterpretation if one does not know about this implementation detail.
Below, I show an example where correlated features will have close importance in the case of a GradientBoosting
. For HistGradientBoosting
, a single feature (in an arbitrary manner) will be used and thus only a single feature is reported as important.
Reproducer
# %%
import numpy as np
rng = np.random.default_rng(42)
n_samples, n_features = 100, 3
X = rng.normal(size=(n_samples, n_features))
X = np.concatenate([X, X], axis=1) # create correlated features
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=n_samples)
# %%
from sklearn.ensemble import GradientBoostingRegressor, HistGradientBoostingRegressor
gbdt = GradientBoostingRegressor(n_estimators=100, random_state=42).fit(X, y)
hgbdt = HistGradientBoostingRegressor(max_iter=100, random_state=42).fit(X, y)
# %%
import pandas as pd
from sklearn.inspection import permutation_importance
importances = {}
for name, model in zip(["gbdt", "hgbdt"], [gbdt, hgbdt]):
result = permutation_importance(model, X, y, n_repeats=100, random_state=42, n_jobs=-1)
importances[name] = pd.DataFrame(
result.importances.T, columns=[f"Features #{i}" for i in range(X.shape[1])]
)
# %%
import matplotlib.pyplot as plt
fig, axes = plt.subplots(ncols=2, figsize=(10, 4))
for ax, (name, importance) in zip(axes, importances.items()):
importance.plot.box(ax=ax, vert=False, whis=100)
ax.set_title(name)
ax.set_xlabel("Decrease in R2 score")
fig.tight_layout()
Resolution
We can use a shuffling in _find_node_split
to alleviate this bias and have a similar implementation as in the current tree implementation.