[MRG] Add warm_start parameter to HistGradientBoosting #14012
Conversation
I have a question regarding the use of validation data for early stopping when warm_start is enabled. What are your thoughts @NicolasHug?
IMO: assume the same training data and document that assumption and its implications.
I also have two questions unrelated to this PR: […]
I don't understand?
Indeed, the docstring is wrong, since n_iter_ <= max_iter. And yes, we should assume that warm start is used with the same training data.
I don't think that the code is good enough for a merge right now, but a review could be helpful. Here are my remarks:
Any feedback is welcome. Edit: it looks like the […]
Whoops, brain lag. Let me try one more time: why is train_score_ not computed when early stopping is not used?
I am going to fix that the next time I push (no need to run the whole CI just for that, since I will probably make more changes after the review).
That's just a choice we made: it's not worth the extra computation unless users explicitly ask for it. If they want train_score_ without early stopping, they can just set n_iter_no_change to max_iter. In any case, GBDTs should almost never be used without early stopping. We should probably make that clearer, since our default is to not early stop.
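A minimal sketch of that workaround (the parameter values are only illustrative; with n_iter_no_change >= max_iter the early-stopping machinery runs, so train_score_ is recorded, but boosting never actually stops early):

import numpy as np
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.datasets import make_regression

X, y = make_regression(random_state=0)
gb = HistGradientBoostingRegressor(max_iter=100, n_iter_no_change=200,
                                   random_state=0)
gb.fit(X, y)
print(len(gb.train_score_))  # one score per iteration, plus the initial one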
Thanks @johannfaouzi, I'll take a look later today.
A first glance:
- There are a lot of tests in sklearn/ensemble/tests/test_gradient_boosting.py. You can port them (please use with pytest.raises(..., match=...): instead of assert_raises; see the sketch after this list).
- Checking the predictors might be interesting, but as a first pass we can just check the predictions.
- About the binning done twice: let's leave it as-is for now. That might be a good incentive to allow pre-binned data, but that's something we can worry about later ;)
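As an illustration of the suggested style, here is a sketch of what a ported check might look like. The specific scenario and error message are hypothetical, not taken from the existing test file:

import pytest
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.datasets import make_regression


def test_warm_start_smaller_max_iter_raises():
    # Hypothetical port: instead of assert_raises(ValueError, gb.fit, X, y),
    # use the context manager form with a `match` pattern.
    X, y = make_regression(random_state=0)
    gb = HistGradientBoostingRegressor(max_iter=10, warm_start=True)
    gb.fit(X, y)
    gb.set_params(max_iter=5)  # lower than the number of fitted iterations
    with pytest.raises(ValueError, match='max_iter'):
        gb.fit(X, y)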
def _tolist_state(self):
    """Convert all array attributes to lists."""
    # self.n_estimators is the number of additional est to fit
revert comment
Also these checks shouldn't be done here (not sure where yet).
# Compute raw predictions
raw_predictions = self._raw_predict(X_binned_train)
if hasattr(self, '_indices'):
Hmm, is there a way to avoid this? We try to avoid saving this kind of stateful attribute as much as possible (the random state is an exception).
Yes. It can be done by:
- making sure that the conditions are met: if self.do_early_stopping_ and self.scoring != 'loss':
- computing the indices again:

subsample_size = 10000  # should we expose this parameter?
indices = np.arange(X_binned_train.shape[0])
self._indices = indices
if X_binned_train.shape[0] > subsample_size:
    # TODO: not critical but stratify using resample()
    indices = rng.choice(indices, subsample_size, replace=False)
Since we already recompute several things (the binning step), we can also do it for the indices. Should I replace choice with sklearn.utils.resample?
Ok maybe use a small helper to compute the indices then.
Should I replace choice with sklearn.utils.resample?
I'd rather have another PR for this so we can keep track of each change individually (feel free to open it ;) )
There is a drawback when the indices are not saved in a private attribute: if random_state is a RandomState instance, the instance is mutated when it is used (see jnothman's comment), and thus the indices are different when they are computed the second time. However, the indices are identical when an integer is used for random_state.
The current test does not check this, as the training set has 100 samples (and the subsampling is only used when there are more than 10k samples).
Maybe the best solution is to mention that random_state should be an integer when using warm_start (and to enforce it in the parameter validation).
I am trying to write a test but it fails because of the RNG. This test fails:

import numpy as np
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.datasets import make_regression

X, y = make_regression()
rng = np.random.RandomState(0)

clf_1 = HistGradientBoostingRegressor(
    max_iter=100, n_iter_no_change=200, random_state=rng)
clf_1.fit(X, y)
clf_2 = HistGradientBoostingRegressor(
    max_iter=100, n_iter_no_change=200, random_state=rng)
clf_2.fit(X, y)

for (pred_ith_1, pred_ith_2) in zip(clf_1._predictors, clf_2._predictors):
    for (predictor_1, predictor_2) in zip(pred_ith_1, pred_ith_2):
        np.testing.assert_array_equal(
            predictor_1.nodes,
            predictor_2.nodes
        )

The same test passes when rng = 42 is used instead.
It looks like sklearn.model_selection.train_test_split is the reason why it fails. This test passes:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(random_state=0)
rng = 42

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=rng)
X_train_new, X_val_new, y_train_new, y_val_new = train_test_split(X, y, random_state=rng)

np.testing.assert_array_equal(X_train, X_train_new)
np.testing.assert_array_equal(X_val, X_val_new)
np.testing.assert_array_equal(y_train, y_train_new)
np.testing.assert_array_equal(y_val, y_val_new)

The same test fails when rng = np.random.RandomState(42) is used instead.
That's expected behaviour. The RandomState object is mutated when it is used.
I didn't know that this was the expected behavior. Is it better to provide an integer rather than a RandomState instance? (See scikit-learn/sklearn/ensemble/_hist_gradient_boosting/tests/test_gradient_boosting.py, lines 151 to 174 at 6675c9e.)
I am a bit confused by this new information. Edit: the documentation for random_state says that a RandomState instance is used as the random number generator. So if the random number generator is provided, why are the results different? I'm really confused. Edit2: I only understood your answer now. Thanks for the explanation!
Use an int if you need identical outputs.
Great point about the RNG. I think it is OK not to expect exactly the same indices, as long as we document it.
A few comments, but this is looking pretty good.
At the end of the docstring for the train_score_ attribute, please add something like: "[If scoring is not 'loss', scores are computed on a subset of at most 10 000 samples.] This subset may vary between different calls to fit if warm_start is True."
This could use a few more tests though:
- Please port test_warm_start_max_depth.
- Please add a test that makes sure early stopping works as expected (a rough sketch follows below this list), e.g.:
  - set max_iter to 10000 and n_iter_no_change to 5
  - the estimator should early stop somewhere
  - call fit again: the number of additional iterations should be 5 (or maybe only slightly higher) due to early stopping
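A rough sketch of such a test, drawing on the snippets discussed later in this thread (the tolerance and the exact assertion are illustrative, not the final test):

from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.datasets import make_regression


def test_warm_start_early_stopping_sketch():
    # First fit early-stops somewhere; the warm-started second fit should
    # only add a handful of extra iterations before stopping again.
    X, y = make_regression(random_state=0)
    n_iter_no_change = 5
    gb = HistGradientBoostingRegressor(
        max_iter=10000, n_iter_no_change=n_iter_no_change, tol=1e-3,
        warm_start=True, random_state=42)
    gb.fit(X, y)
    n_iter_first_fit = gb.n_iter_
    gb.fit(X, y)
    n_iter_second_fit = gb.n_iter_
    assert n_iter_second_fit - n_iter_first_fit < n_iter_no_change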
self.train_score_ = self.train_score_.tolist()
self.validation_score_ = self.validation_score_.tolist()

def _compute_subsample_indices(self, X_binned_train, rng,
Let's call it _compute_small_trainset_indices. Also, can you add a comment in the docstring:
"For efficiency, we need to subsample the training set to compute scores with scorers. Also note that the returned indices are not expected to be the same between different calls to fit() in a warm start context, since the rng may have been consumed."
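A minimal sketch of what such a helper could look like, based on the subsampling snippet quoted earlier (the standalone function and its signature are illustrative; in the PR this would be a method of the estimator):

import numpy as np


def compute_small_trainset_indices(n_samples, rng, subsample_size=10000):
    """Return indices of a small training subset used to compute scores.

    For efficiency, scores computed with scorers are evaluated on at most
    ``subsample_size`` samples. The returned indices are not expected to be
    identical between different calls to fit() in a warm start context,
    since the rng may have been consumed in between.
    """
    indices = np.arange(n_samples)
    if n_samples > subsample_size:
        # TODO: not critical but stratify using resample()
        indices = rng.choice(indices, subsample_size, replace=False)
    return indices


# Example usage with a consumable RandomState instance:
rng = np.random.RandomState(0)
print(compute_small_trainset_indices(25000, rng).shape)  # (10000,)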
# Convert array attributes to lists
self._tolist_state()
Since we only have one call, let's not make it a function and directly convert to lists here
@@ -578,6 +669,15 @@ class HistGradientBoostingRegressor(BaseHistGradientBoosting, RegressorMixin):
verbose : int, optional (default=0)
    The verbosity level. If not zero, print some information about the
    fitting process.
warm_start : bool, optional (default=False)
    When set to ``True``, reuse the solution of the previous call to fit
    and add more estimators to the ensemble, otherwise, just erase the
I would remove the "otherwise" part
Also I think that this would be enough (no need to go case-by-case): "For results to be valid, the estimator should be re-trained on the same data only."
Are you sure about this point? There will be cases where samples that were in the validation set for the first fit end up in the training set in the second fit, and vice versa. Since HistGBM uses boosting, this implies data leakage imho.
It's not an issue for subsampling the training set, which is what the new helper is used for. You're right that it's an issue for the train/val split, good catch! It looks like the previous gradient boosting has this issue too, but it doesn't seem to be documented anywhere. I'll get back to you on this.
OK, I opened #14034, let's wait for the others' feedback regarding this.
I made changes following your comments, except for the following points:
- I did not understand what you expect.
- I replied in the section, so I did not change this for the moment. I agree that […]
OK, let's keep the strict equality checks then.
Regarding test_warm_start_max_depth: it's a test in ensemble/tests/test_gradient_boosting.py, testing the "old" version of our gradient boosting estimators. You just need to adapt it to this new one. There are other tests there, feel free to also port those that you deem relevant.
Regarding the train/val split leak: we discussed it with @amueller, and maybe the best option would be to:
- store a seed attribute, e.g. _train_val_split_seed, that would be generated once, the first time fit is called (sketched below)
- pass this seed as the random_state parameter to train_test_split()
- add a small test making sure this parameter stays constant between different calls to fit
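A minimal sketch of that idea, assuming the seed is drawn once from the estimator's RNG and reused for every subsequent warm-started fit (the toy class and attribute names are illustrative, not the PR's actual code):

from sklearn.model_selection import train_test_split
from sklearn.utils import check_random_state


class _TrainValSplitSketch:
    """Toy illustration: keep the train/val split constant across warm starts."""

    def __init__(self, random_state=None, warm_start=False,
                 validation_fraction=0.1):
        self.random_state = random_state
        self.warm_start = warm_start
        self.validation_fraction = validation_fraction

    def _train_val_split(self, X, y):
        rng = check_random_state(self.random_state)
        # Draw the seed only once; warm-started fits reuse the stored value,
        # so train_test_split always produces the same partition.
        if not (self.warm_start and hasattr(self, '_train_val_split_seed')):
            self._train_val_split_seed = rng.randint(2 ** 31 - 1)
        return train_test_split(
            X, y, test_size=self.validation_fraction,
            random_state=self._train_val_split_seed)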
Please add an entry to the change log at doc/whats_new/v*.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself with :user:.
I think you can mark it [MRG] now :)
""" | ||
# if not warmstart - clear the estimator state | ||
if not self.warm_start: | ||
self._clear_state() |
I'm not sure this is still useful? We clear the state below
n_iter_first_fit = gb.n_iter_
gb.fit(X, y)
n_iter_second_fit = gb.n_iter_
assert n_iter_second_fit - n_iter_first_fit < 2 * n_iter_no_change
Out of curiosity, what's the actual value of n_iter_second_fit - n_iter_first_fit here?
I changed the value of tol (tol=1e-3 instead of the default value tol=1e-7) because it fails otherwise for the regression...:
- tol=1e-3: classification (6 -> 7), regression (103 -> 105)
- tol=1e-7: classification (6 -> 7), regression (105 -> 141)
It feels a bit tricky to change the default value of tol, but when the sample size is small (100 by default in make_classification and make_regression), the validation set is very small (10 by default) and the performance evaluation is a bit noisy (high variance).
Sounds good. Then please remove the factor 2.
I made some changes following your remarks. I added a seed for the train/val split and for the small trainset subsampling.
A few more comments. I'll mark it [MRG] so we can get other reviews
from sklearn.datasets import make_classification, make_regression
from sklearn.utils.testing import assert_equal, assert_not_equal
We're moving away from these, you can just use assert == and assert !=.
    (HistGradientBoostingRegressor, X_regression, y_regression)
])
def test_identical_train_val_split_int(GradientBoosting, X, y):
    # Test if identical splits are generated when random_state is an int.
I think we could factorize all the tests below with the following one:

@pytest.mark.parametrize('GradientBoosting, X, y', [
    (HistGradientBoostingClassifier, X_classification, y_classification),
    (HistGradientBoostingRegressor, X_regression, y_regression)
])
@pytest.mark.parametrize('rng_type', ('int', 'instance', None))
def test_random_seeds_warm_start(GradientBoosting, X, y, rng_type):
    # Make sure the seeds for train/val split and small trainset subsampling
    # are correctly set in a warm start context.

    def _get_rng(rng_type):
        # Helper to avoid consuming rngs
        if rng_type == 'int':
            return 42
        elif rng_type == 'instance':
            return np.random.RandomState(0)
        else:
            return None

    random_state = _get_rng(rng_type)
    gb_1 = GradientBoosting(n_iter_no_change=5, random_state=random_state)
    gb_1.fit(X, y)
    train_val_seed_1 = gb_1._train_val_split_seed
    small_trainset_seed_1 = gb_1._small_trainset_seed

    random_state = _get_rng(rng_type)
    gb_2 = GradientBoosting(n_iter_no_change=5, random_state=random_state,
                            warm_start=True)
    gb_2.fit(X, y)  # inits state
    train_val_seed_2 = gb_2._train_val_split_seed
    small_trainset_seed_2 = gb_2._small_trainset_seed
    gb_2.fit(X, y)  # clears old state and equals est
    train_val_seed_3 = gb_2._train_val_split_seed
    small_trainset_seed_3 = gb_2._small_trainset_seed

    # Check that all seeds are equal
    if rng_type is None:
        assert train_val_seed_1 != train_val_seed_2
        assert small_trainset_seed_1 != small_trainset_seed_2
    else:
        assert train_val_seed_1 == train_val_seed_2
        assert small_trainset_seed_1 == small_trainset_seed_2
    assert train_val_seed_2 == train_val_seed_3
    assert small_trainset_seed_2 == small_trainset_seed_3
Great idea! But we still need to use a large dataset (> 10k samples) to test the seed for the subsampled training set, so I think that keeping the large datasets (and setting max_iter and max_depth to low values so that the tests are still fast) is still relevant.
Thanks for your remarks, I made some changes according to them.
It looks like the tests are failing sometimes: when random_state is None, the two generated seeds can happen to be equal, so the inequality checks fail.
Yeah, the seeds are probably the same. Let's remove the inequality check for None then.
A few more comments, probably the last (or close ;) ).
Since there are quite a lot of new tests, I'd be in favor of putting them all in a new test_warm_start.py file.
Nice work @johannfaouzi and thanks for sticking to it!
    The scores at each iteration on the training data. The first entry
    is the score of the ensemble before the first iteration. Scores are
    computed according to the ``scoring`` parameter. If ``scoring`` is
    not 'loss', scores are computed on a subset of at most 10 000
    samples. Empty if no early stopping.
validation_score_ : ndarray, shape (max_iter + 1,)
    samples. This subset may vary between different calls to ``fit`` if
Not true anymore (same for Classifier below)
@pytest.mark.parametrize('GradientBoosting, X, y', [
    (HistGradientBoostingClassifier, X_classification_large,
Why do you need large datasets? The small ones should be enough?
Nevermind, the seed is always generated, even if there are fewer than 10k samples in the training set. I had in mind the fact that it was generated only if there are more than 10k samples.
# Check that all seeds are equal
if rng_type is None:
    assert train_val_seed_1 != train_val_seed_2
    assert small_trainset_seed_1 != small_trainset_seed_2
let's remove these 2 lines above since None can sometimes give the same seed
# Get the predictors from the previous fit
predictors = self._predictors

begin_at_stage = len(predictors)
Suggested change: begin_at_stage = len(predictors) -> begin_at_stage = self.n_iter_
(same but more explicit IMO)
@@ -761,13 +852,14 @@ class HistGradientBoostingClassifier(BaseHistGradientBoosting,
    The number of trees that are built at each iteration. This is equal to 1
    for binary classification, and to ``n_classes`` for multiclass
    classification.
train_score_ : ndarray, shape (max_iter + 1,)
train_score_ : ndarray, shape (n_iter_ + 1,)
I don't understand why our rendered html generates a (void) reference for this.
@adrinjalali could this be linked to the recent changes made to sphinx?
I really doubt that; the change was only to not raise a warning when a reference is not found, nothing else changed AFAIK.
    (HistGradientBoostingRegressor, X_regression, y_regression)
])
def test_warm_start_clear(GradientBoosting, X, y):
    # Test if fit clears state.
Please also check that train_scores and val_scores are the same, to directly check _clear_state().
I'm not sure that I understand what you mean. I will add assertions that check that all the attributes are the same in _assert_predictor_equal(); let me know if that's what you expected.
What I mean is that since _clear_state() only clears train_scores, val_scores and the _rng, that's what this test should check. You just need to set n_iter_no_change=5 (or something else) and make sure the attributes are the same. (Please revert the changes made to _assert_predictor_equal; it should only check predictors and predictions.)
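A sketch of what that check could look like (an assumed test, using n_iter_no_change to populate the score attributes; the exact assertions are illustrative, not the final test):

import numpy as np
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.datasets import make_regression


def test_warm_start_clear_sketch():
    # A non-warm-started estimator and a warm-started one whose state has
    # been cleared should end up with identical scores.
    X, y = make_regression(random_state=0)
    gb_1 = HistGradientBoostingRegressor(n_iter_no_change=5, random_state=42)
    gb_1.fit(X, y)

    gb_2 = HistGradientBoostingRegressor(n_iter_no_change=5, random_state=42,
                                         warm_start=True)
    gb_2.fit(X, y)  # inits state
    gb_2.set_params(warm_start=False)
    gb_2.fit(X, y)  # clears old state and refits from scratch

    np.testing.assert_allclose(gb_1.train_score_, gb_2.train_score_)
    np.testing.assert_allclose(gb_1.validation_score_, gb_2.validation_score_)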
Okay, I made some changes, let me know if it's alright.
X_classification_large, y_classification_large = make_classification(
    n_samples=20000, random_state=0)
X_regression_large, y_regression_large = make_regression(
    n_samples=20000, random_state=0)
I don't think we should need these
It's getting close ;) Thank you very much for your remarks! It was more time-consuming than I expected, but also a great learning experience :)
ping @adrinjalali @glemaitre @thomasjpfan if one of you wants to review this? LGTM already.
# Check identical nodes for each tree
for (pred_ith_1, pred_ith_2) in zip(gb_1._predictors, gb_2._predictors):
    for (predictor_1, predictor_2) in zip(pred_ith_1, pred_ith_2):
        np.testing.assert_array_equal(
@NicolasHug Did we agree on not using sklearn.utils.testing.assert_* and using the np.testing utils? I don't recall, but I might have missed the discussion.
I think we don't use assert_array_equal from sklearn.utils and use the one from np.testing now. But I don't know, I'm not sure. It has never been clear to me, cf. #13180. I have decided not to care about this anymore.
Before merging, I will just make an explicit import:
from numpy.testing import assert_array_equal
Like this, only the import will change if we decide to make it homogeneous.
I tried to address the remaining remarks, let me know if you think more changes are required.
Comments were addressed @glemaitre, wanna give this a last round?
Thanks @johannfaouzi
Reference Issues/PRs
Fixes #13967.
What does this implement/fix? Explain your changes.
This PR adds a warm_start parameter to sklearn.ensemble.HistGradientBoostingClassifier and sklearn.ensemble.HistGradientBoostingRegressor, similarly to sklearn.ensemble.GradientBoostingClassifier and sklearn.ensemble.GradientBoostingRegressor.
TODO list
- Add the warm_start parameter, with False as default value, in the __init__ method of both classes.
- Document the warm_start parameter in the docstring of both classes.
- Handle the warm_start parameter in BaseHistGradientBoosting.
- Add _is_initialized() to determine if the estimator is initialized or not.
- Add _clear_state() to clear the results (done, but changes probably needed).
- Add an if/else branch for the cases where the estimator is already initialized and where it is not.
Any other comments?
This is my first PR.
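To summarize the TODO list above, here is a rough, hypothetical skeleton of the warm start logic (simplified toy code, not the PR's actual implementation):

class _WarmStartSkeleton:
    """Toy outline of the fit() flow described in the TODO list."""

    def __init__(self, max_iter=100, warm_start=False):
        self.max_iter = max_iter
        self.warm_start = warm_start

    def _is_initialized(self):
        # The estimator is initialized once it holds fitted predictors.
        return hasattr(self, '_predictors') and len(self._predictors) > 0

    def _clear_state(self):
        # Drop the attributes accumulated by a previous fit.
        for attr in ('train_score_', 'validation_score_', '_rng'):
            if hasattr(self, attr):
                delattr(self, attr)

    def fit(self, X, y):
        if not (self.warm_start and self._is_initialized()):
            # Fresh fit: erase any previous state and start from scratch.
            self._clear_state()
            self._predictors = []
        elif self.max_iter < len(self._predictors):
            raise ValueError('max_iter must be >= the number of iterations '
                             'already fitted when warm_start is True')
        # ... boosting loop: grow trees from len(self._predictors)
        #     up to self.max_iter ...
        return self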