
[MRG] Add warm_start parameter to HistGradientBoosting #14012

Merged
merged 27 commits into scikit-learn:master from johannfaouzi:add_warm_start_to_hist_gradient_boost on Jun 21, 2019

Conversation

@johannfaouzi (Contributor) commented Jun 3, 2019

Reference Issues/PRs

Fixes #13967.

What does this implement/fix? Explain your changes.

This PR adds a warm_start parameter to sklearn.ensemble.HistGradientBoostingClassifier and sklearn.ensemble.HistGradientBoostingRegressor, similarly to sklearn.ensemble.GradientBoostingClassifier and sklearn.ensemble.GradientBoostingRegressor.

TODO list

  • Add a warm_start parameter with False as default value in the __init__ method of both classes.
  • Add the warm_start parameter to the docstring of both classes.
  • Add a warm_start parameter in BaseHistGradientBoosting.
  • Add a private method _is_initialized() to determine whether the estimator is already initialized.
  • Add a private method _clear_state() to clear the fitted state (done, but changes are probably needed).
  • Add an if/else branch for the cases where the estimator is already initialized and where it is not (a rough sketch of the intended flow follows below).
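
A rough, self-contained sketch of the control flow I have in mind (a toy class, not the real estimator; the private method names are the ones from the TODO list above):

class _WarmStartFlowSketch:
    """Toy illustration of the warm-start control flow, not the real estimator."""

    def __init__(self, max_iter=10, warm_start=False):
        self.max_iter = max_iter
        self.warm_start = warm_start

    def _is_initialized(self):
        # The real check would look at fitted attributes such as _predictors.
        return hasattr(self, '_predictors')

    def _clear_state(self):
        # Drop everything that remains from a previous fit.
        for attr in ('_predictors', 'train_score_', 'validation_score_'):
            if hasattr(self, attr):
                delattr(self, attr)

    def fit(self, X, y):
        if not (self.warm_start and self._is_initialized()):
            self._clear_state()
            self._predictors = []
        # Grow the ensemble until it contains max_iter "trees" in total;
        # a warm-started fit only adds the missing ones.
        while len(self._predictors) < self.max_iter:
            self._predictors.append('tree_%d' % len(self._predictors))
        return self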

Any other comments?

This is my first PR.

@johannfaouzi (Contributor, Author) commented Jun 4, 2019

I have a question regarding the use of validation data for early stopping.

When the user sets warm_start=True, do we make the assumption that the user uses the same data or not? I think that the use of the same data is the most common case (and that is the case mentioned in the related issue), but we could also imagine a case of transfer learning (though it is more common in deep learning). I think that it would be better to avoid any data leakage, but forcing the use of the same data might be too restrictive (and maybe a bit costly for memory if the dataset is large).

What are your thoughts @NicolasHug ?

@jnothman (Member) commented Jun 4, 2019 (this comment has been minimized)

@johannfaouzi (Contributor, Author) commented Jun 4, 2019

I also have two questions unrelated to this PR:

  • Why is train_score_ not computed even if it's not empty? Is the cost too important compared to the information that it provides?
  • Shouldn't the shapes of train_score_ and validation_score_ be (n_iter_ + 1,) instead of (max_iter + 1,)? Because the number of iterations is equal to n_iter_ (and you add 1 for the initial prediction).
@NicolasHug (Contributor) commented Jun 4, 2019

Why is train_score_ not computed even if it's not empty?

I don't understand?

Shouldn't the shapes of train_score_ and validation_score_ be (n_iter_ + 1,) instead of (max_iter + 1,)?

Indeed, the docstring is wrong since n_iter_ <= max_iter.

And yes we should assume that warm start is used with the same training data

@johannfaouzi (Contributor, Author) commented Jun 4, 2019

I don't think that the code is good enough for a merge right now, but a review could be helpful. Here are my remarks:

  • The tests in _hist_gradient_boosting are passing.
  • I did not add any test. I am not sure what can be tested; maybe, for instance, that fitting 50 trees, then fitting 50 more trees with warm_start=True, is identical to fitting 100 trees at once. However, I am not sure how to compare the _predictors attributes (which are lists of lists of TreePredictor objects).
  • Currently, a lot of "preprocessing" (binning the data, doing the train/val split) is redone. This might not be needed if we assume that the same data is used, but this assumption is never checked.

Any feedback is welcome.

Edit: it looks like the TreePredictor objects have a nodes attribute, which is a structured array that could be used to perform that comparison.
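
For the record, such a comparison might look roughly like this (only a sketch: it assumes the final warm_start API, an integer random_state, and early stopping left disabled by default):

import numpy as np
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.datasets import make_regression

X, y = make_regression(random_state=0)

# 50 iterations, then 50 more with warm_start=True...
gb_warm = HistGradientBoostingRegressor(max_iter=50, warm_start=True,
                                        random_state=42)
gb_warm.fit(X, y)
gb_warm.set_params(max_iter=100)
gb_warm.fit(X, y)

# ...should give the same trees as 100 iterations fitted at once.
gb_full = HistGradientBoostingRegressor(max_iter=100, random_state=42)
gb_full.fit(X, y)

for preds_warm, preds_full in zip(gb_warm._predictors, gb_full._predictors):
    for pred_warm, pred_full in zip(preds_warm, preds_full):
        np.testing.assert_array_equal(pred_warm.nodes, pred_full.nodes)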

@johannfaouzi (Contributor, Author) commented Jun 4, 2019

Why is train_score_ not computed even if it's not empty?

I don't understand?

Whoops, brain lag. Let me try one more time: why is train_score_ not computed when there is no early stopping?

Shouldn't the shapes of train_score_ and validation_score_ be (n_iter_ + 1,) instead of (max_iter + 1,)?

Indeed, the docstring is wrong since n_iter_ <= max_iter.

I am going to fix that the next time I push (no need to run the whole CI just for that, since I will probably make more changes after the review).

@NicolasHug (Contributor) commented Jun 4, 2019

Why is train_score_ not computed when there is no early stopping?

That's just a choice we made. It's just not worth the extra computation unless users explicitly ask for it. If they want train_score_ without early stopping, they can just set n_iter_no_change to max_iter.

In any case, GBDTs should almost never be used without early stopping. We probably should make that clearer since our default is to not early stop.
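
Concretely, that workaround might look like this (a minimal illustration, with arbitrary parameter values):

from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.datasets import make_regression

X, y = make_regression(random_state=0)

# With n_iter_no_change >= max_iter, scores are recorded at every
# iteration but early stopping can never actually trigger.
gb = HistGradientBoostingRegressor(max_iter=100, n_iter_no_change=100)
gb.fit(X, y)
print(gb.train_score_.shape)  # one score per iteration, plus the initial one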

@NicolasHug (Contributor) commented Jun 4, 2019

Thanks @johannfaouzi I'll take a look later today

NicolasHug (Contributor) left a comment

A first glance.

There are a lot of tests in sklearn/ensemble/tests/test_gradient_boosting.py. You can port them (please use with pytest.raises(..., match=...): ... instead of assert_raises).

Checking the predictors might be interesting but as a first pass we can just check the predictions.

About the binning done twice: let's leave it as-is for now. That might be a good incentive to allow pre-binned data, but that's something we can worry about later ;)
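
For reference, the pytest idiom might look like this (the invalid parameter and the matched error message are only an illustration, not taken from the actual tests):

import numpy as np
import pytest
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingRegressor

X, y = np.random.RandomState(0).randn(10, 2), np.arange(10.0)

def test_invalid_max_iter():
    gb = HistGradientBoostingRegressor(max_iter=0)
    # match is a regex searched in the raised error message
    with pytest.raises(ValueError, match='max_iter'):
        gb.fit(X, y)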


    def _tolist_state(self):
        """Convert all array attributes to lists."""
        # self.n_estimators is the number of additional est to fit

@NicolasHug (Contributor), Jun 4, 2019:

revert comment

@NicolasHug (Contributor), Jun 4, 2019:

Also these checks shouldn't be done here (not sure where yet).


# Compute raw predictions
raw_predictions = self._raw_predict(X_binned_train)
if hasattr(self, '_indices'):

@NicolasHug (Contributor), Jun 4, 2019:

hmm is there a way to avoid this? We try to avoid saving this kind of stateful attribute as much as possible (the random state is an exception)

@johannfaouzi (Author, Contributor), Jun 5, 2019:

Yes. It can be done by:

  • making sure that the conditions are met:

        if self.do_early_stopping_ and self.scoring != 'loss':

  • computing the indices again:

        subsample_size = 10000  # should we expose this parameter?
        indices = np.arange(X_binned_train.shape[0])
        self._indices = indices
        if X_binned_train.shape[0] > subsample_size:
            # TODO: not critical but stratify using resample()
            indices = rng.choice(indices, subsample_size, replace=False)

Since we already recompute several things (binning step), we can also do it for the indices. Should I replace choice with sklearn.utils.resample?

@NicolasHug (Contributor), Jun 5, 2019:

Ok maybe use a small helper to compute the indices then.

Should I replace choice with sklearn.utils.resample?

I'd rather have another PR for this so we can keep track of each change individually (feel free to open it ;) )

@johannfaouzi (Author, Contributor), Jun 6, 2019:

There is a drawback when the indices are not saved in a private attribute: if random_state is a RandomState instance, the instance is mutated when it is used (see jnothman's comment), and thus the indices are different when they are computed the second time. However, the indices are identical when an integer is used for random_state.

The current test does not check this, as the training set has 100 samples (and the subsampling is only used when there are more than 10k samples).

Maybe the best solution is to mention that random_state should be an integer when using warm_start (and to enforce it in the parameter validation).

@johannfaouzi (Contributor, Author) commented Jun 5, 2019

I am trying to write a test but it fails because of RNG.

The test fails with rng = np.random.RandomState(0):

import numpy as np
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.datasets import make_regression


X, y = make_regression()

rng = np.random.RandomState(0)

clf_1 = HistGradientBoostingRegressor(
    max_iter=100, n_iter_no_change=200, random_state=rng)
clf_1.fit(X, y)

clf_2 = HistGradientBoostingRegressor(
    max_iter=100, n_iter_no_change=200, random_state=rng)
clf_2.fit(X, y)

for (pred_ith_1, pred_ith_2) in zip(clf_1._predictors, clf_2._predictors):
    for (predictor_1, predictor_2) in zip(pred_ith_1, pred_ith_2):
        np.testing.assert_array_equal(
            predictor_1.nodes,
            predictor_2.nodes
        )

The test passes with rng = 42:

import numpy as np
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.datasets import make_regression


X, y = make_regression()

rng = 42

clf_1 = HistGradientBoostingRegressor(
    max_iter=100, n_iter_no_change=200, random_state=rng)
clf_1.fit(X, y)

clf_2 = HistGradientBoostingRegressor(
    max_iter=100, n_iter_no_change=200, random_state=rng)
clf_2.fit(X, y)

for (pred_ith_1, pred_ith_2) in zip(clf_1._predictors, clf_2._predictors):
    for (predictor_1, predictor_2) in zip(pred_ith_1, pred_ith_2):
        np.testing.assert_array_equal(
            predictor_1.nodes,
            predictor_2.nodes
        )

_bin_data has a rng parameter that is not used, although I don't think that it is the problem. I will investigate:

https://github.com/johannfaouzi/scikit-learn/blob/10bcaa976ebacf85e1ccf369d1a8ef8024c449d8/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py#L450-L474

@johannfaouzi (Contributor, Author) commented Jun 5, 2019

It looks like sklearn.model_selection.train_test_split is the reason why it fails.

This test passes:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split


X, y = make_regression(random_state=0)
rng = 42

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=rng)
X_train_new, X_val_new, y_train_new, y_val_new = train_test_split(X, y, random_state=rng)

np.testing.assert_array_equal(X_train, X_train_new)
np.testing.assert_array_equal(X_val, X_val_new)
np.testing.assert_array_equal(y_train, y_train_new)
np.testing.assert_array_equal(y_val, y_val_new)

This test fails:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split


X, y = make_regression(random_state=0)
rng = np.random.RandomState(42)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=rng)
X_train_new, X_val_new, y_train_new, y_val_new = train_test_split(X, y, random_state=rng)

np.testing.assert_array_equal(X_train, X_train_new)
np.testing.assert_array_equal(X_val, X_val_new)
np.testing.assert_array_equal(y_train, y_train_new)
np.testing.assert_array_equal(y_val, y_val_new)
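
The underlying behaviour can be reproduced directly with check_random_state (a small standalone illustration):

import numpy as np
from sklearn.utils import check_random_state

rng = np.random.RandomState(42)
# A RandomState instance is returned as-is and its internal state advances
# every time it is used, so successive calls draw different numbers...
print(check_random_state(rng).randint(10**6), check_random_state(rng).randint(10**6))

# ...whereas an int seed builds a fresh generator each time, so the draws
# are reproducible.
print(check_random_state(42).randint(10**6), check_random_state(42).randint(10**6))
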
@jnothman (Member) commented Jun 5, 2019 (this comment has been minimized)

@johannfaouzi (Contributor, Author) commented Jun 5, 2019

I didn't know that this was the expected behavior. Is it therefore better to provide an integer rather than a RandomState object when writing a test? I looked at another test for HistGBM and a RandomState object was used:

def test_binning_train_validation_are_separated():
    # Make sure training and validation data are binned separately.
    # See issue 13926
    rng = np.random.RandomState(0)
    validation_fraction = .2
    gb = HistGradientBoostingClassifier(
        n_iter_no_change=5,
        validation_fraction=validation_fraction,
        random_state=rng
    )
    gb.fit(X_classification, y_classification)
    mapper_training_data = gb.bin_mapper_
    # Note that since the data is small there is no subsampling and the
    # random_state doesn't matter
    mapper_whole_data = _BinMapper(random_state=0)
    mapper_whole_data.fit(X_classification)
    n_samples = X_classification.shape[0]
    assert np.all(mapper_training_data.actual_n_bins_ ==
                  int((1 - validation_fraction) * n_samples))
    assert np.all(mapper_training_data.actual_n_bins_ !=
                  mapper_whole_data.actual_n_bins_)

I am a bit confused by this new information (new to me, at least).

Edit: the documentation for random_state states:

If RandomState instance, random_state is the random number generator.

So if the random number generator is provided, why are the results different? I'm really confused.

Edit 2: I only understood your answer now. Thanks for the explanation!

@jnothman (Member) commented Jun 5, 2019 (this comment has been minimized)

@NicolasHug (Contributor) commented Jun 6, 2019

There is a drawback when the indices are not saved in a private attribute ....

Great point about the RNG. I think this is OK not to expect exactly the same indices, as long as we document it.

NicolasHug (Contributor) left a comment

A few comments but this is looking pretty good.

At the end of the docstring for the train_score_ attribute, please add something like:
"[If scoring is not 'loss', scores are computed on a subset of at most 10 000 samples.] This subset may vary between different calls to fit if warm_start is True."


This could use a few more tests though:

Please port test_warm_start_max_depth

Please add a test that makes sure early stopping works as expected, e.g.:

  • set max_iter to 10000 and n_iter_no_change to 5
  • the estimator should early stop somewhere
  • call fit again: the number of iterations should be 5 (or maybe only slightly higher) due to early stopping (see the sketch below)
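
A rough sketch of such a test (the parametrized estimators and the module-level X/y fixtures are assumed, as in the existing test file):

@pytest.mark.parametrize('GradientBoosting, X, y', [
    (HistGradientBoostingClassifier, X_classification, y_classification),
    (HistGradientBoostingRegressor, X_regression, y_regression)
])
def test_warm_start_early_stopping(GradientBoosting, X, y):
    # The first fit early stops; a warm-started second fit should stop
    # again almost immediately, since the model has already converged.
    n_iter_no_change = 5
    gb = GradientBoosting(max_iter=10000, n_iter_no_change=n_iter_no_change,
                          random_state=42, warm_start=True)
    gb.fit(X, y)
    n_iter_first_fit = gb.n_iter_
    gb.fit(X, y)
    n_iter_second_fit = gb.n_iter_
    assert n_iter_second_fit - n_iter_first_fit < 2 * n_iter_no_change
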
self.train_score_ = self.train_score_.tolist()
self.validation_score_ = self.validation_score_.tolist()

def _compute_subsample_indices(self, X_binned_train, rng,

@NicolasHug (Contributor), Jun 6, 2019:

Let's call it _compute_small_trainset_indices

Also can you add a comment in the docstring:

For efficiency, we need to subsample the training set to compute scores with scorers.
Also note that the returned indices are not expected to be the same between different calls to fit() in a warm start context, since the rng may have been consumed.
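
For reference, the renamed helper might look roughly like this (the signature and the 10000 subsample size come from the snippets above; the merged version may differ, e.g. by taking a stored seed instead of the rng):

import numpy as np

def _compute_small_trainset_indices(self, X_binned_train, rng,
                                    subsample_size=10000):
    """Compute the indices of the small training set.

    For efficiency, we need to subsample the training set to compute
    scores with scorers. Note that the returned indices are not expected
    to be the same between different calls to fit() in a warm start
    context, since the rng may have been consumed.
    """
    indices = np.arange(X_binned_train.shape[0])
    if X_binned_train.shape[0] > subsample_size:
        # TODO: not critical but stratify using resample()
        indices = rng.choice(indices, subsample_size, replace=False)
    return indices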

)

# Convert array attributes to lists
self._tolist_state()

@NicolasHug (Contributor), Jun 6, 2019:

Since we only have one call, let's not make it a function and directly convert to lists here

@@ -578,6 +669,15 @@ class HistGradientBoostingRegressor(BaseHistGradientBoosting, RegressorMixin):
verbose: int, optional (default=0)
The verbosity level. If not zero, print some information about the
fitting process.
warm_start : bool, optional (default=False)
When set to ``True``, reuse the solution of the previous call to fit
and add more estimators to the ensemble, otherwise, just erase the

@NicolasHug (Contributor), Jun 6, 2019:

I would remove the "otherwise" part

Also I think that this would be enough (no need to go case-by-case) "For results to be valid, the estimator should be re-trained on the same data only."

@johannfaouzi (Contributor, Author) commented Jun 6, 2019

I think this is OK not to expect exactly the same indices, as long as we document it.

Are you sure about this point? Because there will be cases when you have samples in the validation set for the first fit that will end up in the training set in the second fit, and vice versa. Since HistGBM uses boosting, it will imply data leakage imho.

@NicolasHug (Contributor) commented Jun 6, 2019

It's not an issue for subsampling the training set, which is what the new helper is used for.

You're right that it's an issue regarding the train/val split, good catch! Looks like the previous gradient boosting also has this issue but it's not documented anywhere it seems. I'll get back to you on this

@NicolasHug (Contributor) commented Jun 6, 2019

OK, I opened #14034, let's wait for the others' feedback regarding this.

@johannfaouzi (Contributor, Author) commented Jun 7, 2019

I made changes following your comments, except for the following points:

Please port test_warm_start_max_depth

I did not understand what you expect.

I think we should play it safe and use np.testing.assert_allclose instead (here and below), since some of the values are floats.

I replied in the section, so I did not change this for the moment. I agree that np.testing.assert_allclose is more suited for floats, but I don't know how to use it for structured arrays. Help appreciated!

NicolasHug (Contributor) left a comment

Ok let's keep the strict equality checks then.

Regarding test_warm_start_max_depth: it's a test in ensemble/tests/test_gradient_boosting.py, testing the "old" version of our gradient boosting estimators. You just need to adapt it to this new one.
There are other tests there; feel free to also port those that you deem relevant.

Regarding the train/val split leak: we discussed it with @amueller, maybe the best option would be to:

  • store a seed attribute, e.g. _train_val_split_seed, that would be generated once, the first time fit is called
  • pass this seed as the random_state parameter to train_test_split()
  • add a small test making sure this seed stays constant between different calls to fit (a toy illustration of the scheme follows below)
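
A toy, self-contained illustration of that scheme (the attribute name comes from the bullets above; drawing the seed from a 2**10 range is only an assumption, see the later discussion):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import check_random_state

class _SplitSeedDemo:
    """Toy illustration of the proposed seeding scheme, not the real estimator."""

    def __init__(self, random_state=None, warm_start=False):
        self.random_state = random_state
        self.warm_start = warm_start

    def fit(self, X, y):
        rng = check_random_state(self.random_state)
        if not (self.warm_start and hasattr(self, '_train_val_split_seed')):
            # Drawn once per "cold" fit and then reused by warm-started fits,
            # so the train/val split stays identical across calls to fit.
            self._train_val_split_seed = rng.randint(1024)
        return train_test_split(X, y,
                                random_state=self._train_val_split_seed)

X, y = np.arange(20).reshape(10, 2), np.arange(10)
est = _SplitSeedDemo(random_state=np.random.RandomState(0), warm_start=True)
X_tr_1, X_val_1, _, _ = est.fit(X, y)
X_tr_2, X_val_2, _, _ = est.fit(X, y)
np.testing.assert_array_equal(X_tr_1, X_tr_2)  # same split on both fits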

Please add an entry to the change log at doc/whats_new/v*.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself with :user:.

I think you can mark it [MRG] now :)

"""
# if not warmstart - clear the estimator state
if not self.warm_start:
    self._clear_state()

@NicolasHug (Contributor), Jun 7, 2019:

I'm not sure this is still useful? We clear the state below

n_iter_first_fit = gb.n_iter_
gb.fit(X, y)
n_iter_second_fit = gb.n_iter_
assert n_iter_second_fit - n_iter_first_fit < 2 * n_iter_no_change

@NicolasHug (Contributor), Jun 7, 2019:

out of curiosity what's the actual value of n_iter_second_fit - n_iter_first_fit here?

@johannfaouzi (Author, Contributor), Jun 11, 2019:

I changed the value of tol (tol=1e-3 instead of the default tol=1e-7) because the test fails otherwise for the regression case:

  • tol=1e-3 : classification (6 -> 7), regression (103 -> 105)
  • tol=1e-7 : classification (6 -> 7), regression (105 -> 141)

It feels a bit tricky to change the default value of tol, but when the sample size is small (100 by default in make_classification and make_regression), the validation set is very small (10 samples by default) and the performance evaluation is a bit noisy (high variance).

@NicolasHug (Contributor), Jun 11, 2019:

Sounds good. Then please remove the factor 2.

@johannfaouzi (Contributor, Author) commented Jun 12, 2019

I made some changes following your remarks.

I added a seed for:

  • the training and validation split (self._train_val_split_seed)
  • the small training subset (self._small_trainset_seed)
NicolasHug (Contributor) left a comment

A few more comments. I'll mark it [MRG] so we can get other reviews

from sklearn.datasets import make_classification, make_regression
from sklearn.utils.testing import assert_equal, assert_not_equal

@NicolasHug (Contributor), Jun 12, 2019:

We're moving away from these, you can just use assert == and assert !=

    (HistGradientBoostingRegressor, X_regression, y_regression)
])
def test_identical_train_val_split_int(GradientBoosting, X, y):
    # Test if identical splits are generated when random_state is an int.

@NicolasHug (Contributor), Jun 12, 2019:

I think we could factorize all the tests below with the following one:

@pytest.mark.parametrize('GradientBoosting, X, y', [
    (HistGradientBoostingClassifier, X_classification, y_classification),
    (HistGradientBoostingRegressor, X_regression, y_regression)
])
@pytest.mark.parametrize('rng_type', ('int', 'instance', None))
def test_random_seeds_warm_start(GradientBoosting, X, y, rng_type):
    # Make sure the seeds for train/val split and small trainset subsampling
    # are correctly set in a warm start context.
    def _get_rng(rng_type):
        # Helper to avoid consuming rngs
        if rng_type == 'int':
            return 42
        elif rng_type == 'instance':
            return np.random.RandomState(0)
        else:
            return None

    random_state = _get_rng(rng_type)
    gb_1 = GradientBoosting(n_iter_no_change=5, random_state=random_state)
    gb_1.fit(X, y)
    train_val_seed_1 = gb_1._train_val_split_seed
    small_trainset_seed_1 = gb_1._small_trainset_seed

    random_state = _get_rng(rng_type)
    gb_2 = GradientBoosting(n_iter_no_change=5, random_state=random_state,
                            warm_start=True)
    gb_2.fit(X, y)  # inits state
    train_val_seed_2 = gb_2._train_val_split_seed
    small_trainset_seed_2 = gb_2._small_trainset_seed
    gb_2.fit(X, y)  # clears old state and equals est
    train_val_seed_3 = gb_2._train_val_split_seed
    small_trainset_seed_3 = gb_2._small_trainset_seed

    # Check that all seeds are equal
    if rng_type is None:
        assert train_val_seed_1 != train_val_seed_2
        assert small_trainset_seed_1 != small_trainset_seed_2
    else:
        assert train_val_seed_1 == train_val_seed_2
        assert small_trainset_seed_1 == small_trainset_seed_2

    assert train_val_seed_2 == train_val_seed_3
    assert small_trainset_seed_2 == small_trainset_seed_3

@johannfaouzi (Author, Contributor), Jun 12, 2019:

Great idea! But we still need to use a large dataset (> 10k samples) to test the seed for the small training subset, so I think that keeping the large datasets (and setting max_iter and max_depth to low values so that the tests remain fast) is still relevant.

NicolasHug changed the title from [WIP] Add warm_start parameter to HistGradientBoosting to [MRG] Add warm_start parameter to HistGradientBoosting on Jun 12, 2019
@johannfaouzi (Contributor, Author) commented Jun 12, 2019

Thanks for your remarks, I made some changes according to them.

@johannfaouzi (Contributor, Author) commented Jun 12, 2019

It looks like the tests fail sometimes: when random_state=None, the same seeds are sometimes returned. I am not very familiar with the np.random module, but I think that it uses the system clock in that scenario. Maybe we should increase the space of seeds (2**20 instead of 2**10) or use sleep in the tests (there is always the possibility that we are unlucky and all the seeds are equal, which is probably not ideal for a test).
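
For what it's worth, with seeds drawn from a space of 2**10 values an occasional collision is plausible over many CI runs (a back-of-the-envelope check, not a diagnosis):

# Chance that two independently drawn seeds from a space of size 2**10
# coincide, i.e. that the inequality assertion fails by chance:
print(1 / 2 ** 10)   # ~1e-3, so roughly one run in a thousand is flaky
print(1 / 2 ** 20)   # widening the space to 2**20 brings this down to ~1e-6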

@NicolasHug (Contributor) commented Jun 12, 2019

Yeah the seeds are probably the same. Let's remove the inequality check for None then.

NicolasHug (Contributor) left a comment

A few more comments, probably the last (or close ;) )

Since there are quite a lot of new tests, I'd be in favor of putting them all in a new test_warm_start.py file.

Nice work @johannfaouzi and thanks for sticking to it!

The scores at each iteration on the training data. The first entry
is the score of the ensemble before the first iteration. Scores are
computed according to the ``scoring`` parameter. If ``scoring`` is
not 'loss', scores are computed on a subset of at most 10 000
samples. Empty if no early stopping.
validation_score_ : ndarray, shape (max_iter + 1,)
samples. This subset may vary between different calls to ``fit`` if

@NicolasHug (Contributor), Jun 12, 2019:

Not true anymore (same for Classifier below)



@pytest.mark.parametrize('GradientBoosting, X, y', [
(HistGradientBoostingClassifier, X_classification_large,

@NicolasHug (Contributor), Jun 12, 2019:

Why do you need large datasets? The small ones should be enough?

@johannfaouzi (Author, Contributor), Jun 13, 2019:

Never mind, the seed is always generated, even if there are fewer than 10k samples in the training set. I had in mind that it was only generated when there are more than 10k samples.

    # Check that all seeds are equal
    if rng_type is None:
        assert train_val_seed_1 != train_val_seed_2
        assert small_trainset_seed_1 != small_trainset_seed_2

@NicolasHug (Contributor), Jun 12, 2019:

let's remove these 2 lines above since None can sometimes give the same seed

# Get the predictors from the previous fit
predictors = self._predictors

begin_at_stage = len(predictors)

@NicolasHug (Contributor), Jun 12, 2019:
Suggested change:
-    begin_at_stage = len(predictors)
+    begin_at_stage = self.n_iter_

(same but more explicit IMO)

@@ -761,13 +852,14 @@ class HistGradientBoostingClassifier(BaseHistGradientBoosting,
The number of tree that are built at each iteration. This is equal to 1
for binary classification, and to ``n_classes`` for multiclass
classification.
-train_score_ : ndarray, shape (max_iter + 1,)
+train_score_ : ndarray, shape (n_iter_ + 1,)

@NicolasHug (Contributor), Jun 12, 2019:

I don't understand why our rendered html generates a (void) reference for this.

@adrinjalali could this be linked to the recent changes made to sphinx?

@adrinjalali (Member), Jun 14, 2019:

I really doubt that, the change was only not to raise a warning if a reference is not found, nothing else is changed AFAIK.

    (HistGradientBoostingRegressor, X_regression, y_regression)
])
def test_warm_start_clear(GradientBoosting, X, y):
    # Test if fit clears state.

@NicolasHug (Contributor), Jun 12, 2019:

Please also check that train_scores and val_scores are the same, to directly check _clear_state()

@johannfaouzi (Author, Contributor), Jun 13, 2019:

I'm not sure that I understand what you mean. I will add assertions that check that all the attributes are the same in _assert_predictor_equal(); let me know if that's what you expected.

@NicolasHug (Contributor), Jun 13, 2019:

What I mean is that since _clear_state() only clears train_scores, val_scores and the _rng, that's what this test should check. You just need to set n_iter_no_change=5 (or something else) and make sure the attributes are the same.

(please revert the changes made to _assert_predictor_equal, it should only check predictors and predictions)
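
A sketch of what the test might then look like (the _assert_predictor_equal helper, its signature, and the module-level fixtures are assumed from the discussion above):

@pytest.mark.parametrize('GradientBoosting, X, y', [
    (HistGradientBoostingClassifier, X_classification, y_classification),
    (HistGradientBoostingRegressor, X_regression, y_regression)
])
def test_warm_start_clear(GradientBoosting, X, y):
    # Refitting without warm_start should clear the state and give the same
    # scores and predictors as a freshly fitted estimator.
    gb_1 = GradientBoosting(n_iter_no_change=5, random_state=42)
    gb_1.fit(X, y)

    gb_2 = GradientBoosting(n_iter_no_change=5, random_state=42,
                            warm_start=True)
    gb_2.fit(X, y)  # inits state
    gb_2.set_params(warm_start=False)
    gb_2.fit(X, y)  # clears old state and refits from scratch

    np.testing.assert_allclose(gb_1.train_score_, gb_2.train_score_)
    np.testing.assert_allclose(gb_1.validation_score_, gb_2.validation_score_)
    _assert_predictor_equal(gb_1, gb_2, X)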

@johannfaouzi (Author, Contributor), Jun 13, 2019:

Okay, I made some changes; let me know if it's alright.

X_classification_large, y_classification_large = make_classification(
    n_samples=20000, random_state=0)
X_regression_large, y_regression_large = make_regression(
    n_samples=20000, random_state=0)

@NicolasHug (Contributor), Jun 12, 2019:

I don't think we should need these

@johannfaouzi (Contributor, Author) commented Jun 13, 2019

It's getting close ;)

Thank you very much for your remarks! It was more time-consuming than I expected, but also a great learning experience :)

NicolasHug (Contributor) left a comment

LGTM apart from this minor comment: #14012 (comment)!

@NicolasHug (Contributor) commented Jun 13, 2019

ping @adrinjalali @glemaitre @thomasjpfan if one of you wants to review this? LGTM already.
(diff is a bit scary but most of the code was simply moved, not changed)

glemaitre (Contributor) left a comment

This looks good. Just a couple of questions.

    # Check identical nodes for each tree
    for (pred_ith_1, pred_ith_2) in zip(gb_1._predictors, gb_2._predictors):
        for (predictor_1, predictor_2) in zip(pred_ith_1, pred_ith_2):
            np.testing.assert_array_equal(

@glemaitre (Contributor), Jun 13, 2019:

@NicolasHug Did we agree on not using sklearn.utils.testing.assert_* and using the np.testing utils instead?
I don't recall, but I might have missed the discussion.

@NicolasHug (Contributor), Jun 13, 2019:

I think we don't use assert_array_equal from sklearn.utils anymore and use the one from np.testing now.

But I don't know, I'm not sure.
It has never been clear to me, cf. #13180.

I have decided not to care about this anymore.

@glemaitre (Contributor), Jun 21, 2019:

Before merging, I will just make the import explicit:

from numpy.testing import assert_array_equal

This way, only the import will need to change if we decide to make it homogeneous later.

@johannfaouzi (Contributor, Author) commented Jun 14, 2019

I tried to fix the remaining remarks, let me know if you think that more changes are required.

@NicolasHug (Contributor) commented Jun 21, 2019

Comments were addressed @glemaitre, wanna give this a last round?

glemaitre self-assigned this on Jun 21, 2019
glemaitre merged commit da96da9 into scikit-learn:master on Jun 21, 2019
11 of 13 checks passed

ci/circleci: doc CircleCI is running your tests
ci/circleci: doc-min-dependencies CircleCI is running your tests
LGTM analysis: C/C++ No code changes detected
LGTM analysis: JavaScript No code changes detected
LGTM analysis: Python No new or fixed alerts
ci/circleci: lint Your tests passed on CircleCI!
scikit-learn.scikit-learn Build #20190621.12 succeeded
scikit-learn.scikit-learn (Linux py35_conda_openblas) succeeded
scikit-learn.scikit-learn (Linux py35_np_atlas) succeeded
scikit-learn.scikit-learn (Linux pylatest_conda) succeeded
scikit-learn.scikit-learn (Windows py35_32) succeeded
scikit-learn.scikit-learn (Windows py37_64) succeeded
scikit-learn.scikit-learn (macOS pylatest_conda) succeeded

@glemaitre (Contributor) commented Jun 21, 2019

Thanks @johannfaouzi

johannfaouzi deleted the johannfaouzi:add_warm_start_to_hist_gradient_boost branch on Jun 21, 2019
koenvandevelde added a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019