[MRG] Add warm_start parameter to HistGradientBoosting #14012
Conversation
I have a question regarding the use of validation data for early stopping when warm_start is enabled. What are your thoughts @NicolasHug?
IMO: assume the same training data and document that assumption and its implications.
I also have two questions unrelated to this PR: […]
I don't understand?
Indeed, the docstring is wrong, since n_iter_ <= max_iter. And yes, we should assume that warm start is used with the same training data.
I don't think that the code is good enough for a merge right now, but a review could be helpful. Here are my remarks:
Any feedback is welcome. Edit: it looks like the […]
Whoops, brain lag. Let me try one more time: why is train_score_ not computed when early stopping is not used?
I am going to fix that the next time I push (no need to run the whole CI just for that, since I will probably make more changes after the review).
That's just a choice we made: it's not worth the extra computation unless users explicitly ask for it. If they want train_score_ without early stopping, they can just set n_iter_no_change to max_iter. In any case, GBDTs should almost never be used without early stopping. We should probably make that clearer, since our default is to not early stop.
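A minimal sketch of that workaround (the parameter values are only illustrative; with n_iter_no_change >= max_iter the early-stopping machinery runs, so train_score_ is recorded, but boosting never actually stops early):

import numpy as np
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.datasets import make_regression

X, y = make_regression(random_state=0)
gb = HistGradientBoostingRegressor(max_iter=100, n_iter_no_change=200,
                                   random_state=0)
gb.fit(X, y)
print(len(gb.train_score_))  # one score per iteration, plus the initial one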
Thanks @johannfaouzi, I'll take a look later today.
A first glance:
- There are a lot of tests in sklearn/ensemble/tests/test_gradient_boosting.py. You can port them (please use with pytest.raises(..., match=...): instead of assert_raises; see the sketch after this list).
- Checking the predictors might be interesting, but as a first pass we can just check the predictions.
- About the binning done twice: let's leave it as-is for now. That might be a good incentive to allow pre-binned data, but that's something we can worry about later ;)
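As an illustration of the suggested style, here is a sketch of what a ported check might look like. The specific scenario and error message are hypothetical, not taken from the existing test file:

import pytest
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.datasets import make_regression


def test_warm_start_smaller_max_iter_raises():
    # Hypothetical port: instead of assert_raises(ValueError, gb.fit, X, y),
    # use the context manager form with a `match` pattern.
    X, y = make_regression(random_state=0)
    gb = HistGradientBoostingRegressor(max_iter=10, warm_start=True)
    gb.fit(X, y)
    gb.set_params(max_iter=5)  # lower than the number of fitted iterations
    with pytest.raises(ValueError, match='max_iter'):
        gb.fit(X, y)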
def _tolist_state(self):
    """Convert all array attributes to lists."""
    # self.n_estimators is the number of additional est to fit
revert comment
Also these checks shouldn't be done here (not sure where yet).
# Compute raw predictions
raw_predictions = self._raw_predict(X_binned_train)
if hasattr(self, '_indices'):
Hmm, is there a way to avoid this? We try to avoid saving this kind of stateful attribute as much as possible (the random state is an exception).
Yes. It can be done by:
- making sure that the conditions are met: if self.do_early_stopping_ and self.scoring != 'loss':
- computing the indices again:

subsample_size = 10000  # should we expose this parameter?
indices = np.arange(X_binned_train.shape[0])
self._indices = indices
if X_binned_train.shape[0] > subsample_size:
    # TODO: not critical but stratify using resample()
    indices = rng.choice(indices, subsample_size, replace=False)
Since we already recompute several things (the binning step), we can also do it for the indices. Should I replace choice with sklearn.utils.resample?
Ok maybe use a small helper to compute the indices then.
Should I replace choice with sklearn.utils.resample?
I'd rather have another PR for this so we can keep track of each change individually (feel free to open it ;) )
There is a drawback when the indices are not saved in a private attribute: if random_state is a RandomState instance, the instance is mutated when it is used (see jnothman's comment), and thus the indices are different when they are computed the second time. However, the indices are identical when an integer is used for random_state.
The current test does not check this, as the training set has 100 samples (and the subsampling is only used when there are more than 10k samples).
Maybe the best solution is to mention that random_state should be an integer when using warm_start (and to enforce it in the parameter validation).
I am trying to write a test but it fails because of the RNG. This test fails:

import numpy as np
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.datasets import make_regression

X, y = make_regression()
rng = np.random.RandomState(0)

clf_1 = HistGradientBoostingRegressor(
    max_iter=100, n_iter_no_change=200, random_state=rng)
clf_1.fit(X, y)
clf_2 = HistGradientBoostingRegressor(
    max_iter=100, n_iter_no_change=200, random_state=rng)
clf_2.fit(X, y)

for (pred_ith_1, pred_ith_2) in zip(clf_1._predictors, clf_2._predictors):
    for (predictor_1, predictor_2) in zip(pred_ith_1, pred_ith_2):
        np.testing.assert_array_equal(
            predictor_1.nodes,
            predictor_2.nodes
        )

The same test passes when rng = 42 is used instead.
It looks like sklearn.model_selection.train_test_split is the reason why it fails. This test passes:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(random_state=0)
rng = 42

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=rng)
X_train_new, X_val_new, y_train_new, y_val_new = train_test_split(X, y, random_state=rng)

np.testing.assert_array_equal(X_train, X_train_new)
np.testing.assert_array_equal(X_val, X_val_new)
np.testing.assert_array_equal(y_train, y_train_new)
np.testing.assert_array_equal(y_val, y_val_new)

The same test fails when rng = np.random.RandomState(42) is used instead.
That's expected behaviour. The RandomState object is mutated when it is used.
I didn't know that this was the expected behavior. Is it better to provide an integer rather than a RandomState instance? (See scikit-learn/sklearn/ensemble/_hist_gradient_boosting/tests/test_gradient_boosting.py, lines 151 to 174 at 6675c9e.)
I am a bit confused by this new information. Edit: the documentation for random_state says that a RandomState instance is used as the random number generator. So if the random number generator is provided, why are the results different? I'm really confused. Edit2: I only understood your answer now. Thanks for the explanation!
Use an int if you need identical outputs.
Great point about the RNG. I think it is OK not to expect exactly the same indices, as long as we document it.
A few comments, but this is looking pretty good.
At the end of the docstring for the train_score_ attribute, please add something like: "[If scoring is not 'loss', scores are computed on a subset of at most 10 000 samples.] This subset may vary between different calls to fit if warm_start is True."
This could use a few more tests though:
- Please port test_warm_start_max_depth.
- Please add a test that makes sure early stopping works as expected (a rough sketch follows below this list), e.g.:
  - set max_iter to 10000 and n_iter_no_change to 5
  - the estimator should early stop somewhere
  - call fit again: the number of additional iterations should be 5 (or maybe only slightly higher) due to early stopping
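A rough sketch of such a test, drawing on the snippets discussed later in this thread (the tolerance and the exact assertion are illustrative, not the final test):

from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.datasets import make_regression


def test_warm_start_early_stopping_sketch():
    # First fit early-stops somewhere; the warm-started second fit should
    # only add a handful of extra iterations before stopping again.
    X, y = make_regression(random_state=0)
    n_iter_no_change = 5
    gb = HistGradientBoostingRegressor(
        max_iter=10000, n_iter_no_change=n_iter_no_change, tol=1e-3,
        warm_start=True, random_state=42)
    gb.fit(X, y)
    n_iter_first_fit = gb.n_iter_
    gb.fit(X, y)
    n_iter_second_fit = gb.n_iter_
    assert n_iter_second_fit - n_iter_first_fit < n_iter_no_change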
self.train_score_ = self.train_score_.tolist()
self.validation_score_ = self.validation_score_.tolist()

def _compute_subsample_indices(self, X_binned_train, rng,
Let's call it _compute_small_trainset_indices. Also, can you add a comment in the docstring:
"For efficiency, we need to subsample the training set to compute scores with scorers. Also note that the returned indices are not expected to be the same between different calls to fit() in a warm start context, since the rng may have been consumed."
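A minimal sketch of what such a helper could look like, based on the subsampling snippet quoted earlier (the standalone function and its signature are illustrative; in the PR this would be a method of the estimator):

import numpy as np


def compute_small_trainset_indices(n_samples, rng, subsample_size=10000):
    """Return indices of a small training subset used to compute scores.

    For efficiency, scores computed with scorers are evaluated on at most
    ``subsample_size`` samples. The returned indices are not expected to be
    identical between different calls to fit() in a warm start context,
    since the rng may have been consumed in between.
    """
    indices = np.arange(n_samples)
    if n_samples > subsample_size:
        # TODO: not critical but stratify using resample()
        indices = rng.choice(indices, subsample_size, replace=False)
    return indices


# Example usage with a consumable RandomState instance:
rng = np.random.RandomState(0)
print(compute_small_trainset_indices(25000, rng).shape)  # (10000,)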
# Convert array attributes to lists
self._tolist_state()
Since we only have one call, let's not make it a function and directly convert to lists here
@@ -578,6 +669,15 @@ class HistGradientBoostingRegressor(BaseHistGradientBoosting, RegressorMixin):
verbose : int, optional (default=0)
    The verbosity level. If not zero, print some information about the
    fitting process.
warm_start : bool, optional (default=False)
    When set to ``True``, reuse the solution of the previous call to fit
    and add more estimators to the ensemble, otherwise, just erase the
I would remove the "otherwise" part
Also I think that this would be enough (no need to go case-by-case): "For results to be valid, the estimator should be re-trained on the same data only."
Are you sure about this point? There will be cases where samples that were in the validation set for the first fit end up in the training set in the second fit, and vice versa. Since HistGBM uses boosting, this implies data leakage imho.
It's not an issue for subsampling the training set, which is what the new helper is used for. You're right that it's an issue for the train/val split, good catch! It looks like the previous gradient boosting has this issue too, but it doesn't seem to be documented anywhere. I'll get back to you on this.
OK, I opened #14034, let's wait for the others' feedback regarding this.
I made changes following your comments, except for the following points:
- I did not understand what you expect.
- I replied in the section, so I did not change this for the moment. I agree that […]
OK, let's keep the strict equality checks then.
Regarding test_warm_start_max_depth: it's a test in ensemble/tests/test_gradient_boosting.py, testing the "old" version of our gradient boosting estimators. You just need to adapt it to this new one. There are other tests there, feel free to also port those that you deem relevant.
Regarding the train/val split leak: we discussed it with @amueller, and maybe the best option would be to:
- store a seed attribute, e.g. _train_val_split_seed, that would be generated once, the first time fit is called (sketched below)
- pass this seed as the random_state parameter to train_test_split()
- add a small test making sure this parameter stays constant between different calls to fit
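A minimal sketch of that idea, assuming the seed is drawn once from the estimator's RNG and reused for every subsequent warm-started fit (the toy class and attribute names are illustrative, not the PR's actual code):

from sklearn.model_selection import train_test_split
from sklearn.utils import check_random_state


class _TrainValSplitSketch:
    """Toy illustration: keep the train/val split constant across warm starts."""

    def __init__(self, random_state=None, warm_start=False,
                 validation_fraction=0.1):
        self.random_state = random_state
        self.warm_start = warm_start
        self.validation_fraction = validation_fraction

    def _train_val_split(self, X, y):
        rng = check_random_state(self.random_state)
        # Draw the seed only once; warm-started fits reuse the stored value,
        # so train_test_split always produces the same partition.
        if not (self.warm_start and hasattr(self, '_train_val_split_seed')):
            self._train_val_split_seed = rng.randint(2 ** 31 - 1)
        return train_test_split(
            X, y, test_size=self.validation_fraction,
            random_state=self._train_val_split_seed)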
Please add an entry to the change log at doc/whats_new/v*.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself with :user:.
I think you can mark it [MRG] now :)
""" | ||
# if not warmstart - clear the estimator state | ||
if not self.warm_start: | ||
self._clear_state() |
I'm not sure this is still useful? We clear the state below
n_iter_first_fit = gb.n_iter_
gb.fit(X, y)
n_iter_second_fit = gb.n_iter_
assert n_iter_second_fit - n_iter_first_fit < 2 * n_iter_no_change
Out of curiosity, what's the actual value of n_iter_second_fit - n_iter_first_fit here?
I changed the value of tol (tol=1e-3 instead of the default value tol=1e-7) because it fails otherwise for the regression...:
- tol=1e-3: classification (6 -> 7), regression (103 -> 105)
- tol=1e-7: classification (6 -> 7), regression (105 -> 141)
It feels a bit tricky to change the default value of tol, but when the sample size is small (100 by default in make_classification and make_regression), the validation set is very small (10 by default) and the performance evaluation is a bit noisy (high variance).
Sounds good. Then please remove the factor 2.
I made some changes following your remarks. I added a seed for the train/val split and for the small trainset subsampling.
A few more comments. I'll mark it [MRG] so we can get other reviews
from sklearn.datasets import make_classification, make_regression
from sklearn.utils.testing import assert_equal, assert_not_equal
We're moving away from these, you can just use assert == and assert !=.
    (HistGradientBoostingRegressor, X_regression, y_regression)
])
def test_identical_train_val_split_int(GradientBoosting, X, y):
    # Test if identical splits are generated when random_state is an int.
I think we could factorize all the tests below with the following one:

@pytest.mark.parametrize('GradientBoosting, X, y', [
    (HistGradientBoostingClassifier, X_classification, y_classification),
    (HistGradientBoostingRegressor, X_regression, y_regression)
])
@pytest.mark.parametrize('rng_type', ('int', 'instance', None))
def test_random_seeds_warm_start(GradientBoosting, X, y, rng_type):
    # Make sure the seeds for train/val split and small trainset subsampling
    # are correctly set in a warm start context.

    def _get_rng(rng_type):
        # Helper to avoid consuming rngs
        if rng_type == 'int':
            return 42
        elif rng_type == 'instance':
            return np.random.RandomState(0)
        else:
            return None

    random_state = _get_rng(rng_type)
    gb_1 = GradientBoosting(n_iter_no_change=5, random_state=random_state)
    gb_1.fit(X, y)
    train_val_seed_1 = gb_1._train_val_split_seed
    small_trainset_seed_1 = gb_1._small_trainset_seed

    random_state = _get_rng(rng_type)
    gb_2 = GradientBoosting(n_iter_no_change=5, random_state=random_state,
                            warm_start=True)
    gb_2.fit(X, y)  # inits state
    train_val_seed_2 = gb_2._train_val_split_seed
    small_trainset_seed_2 = gb_2._small_trainset_seed
    gb_2.fit(X, y)  # clears old state and equals est
    train_val_seed_3 = gb_2._train_val_split_seed
    small_trainset_seed_3 = gb_2._small_trainset_seed

    # Check that all seeds are equal
    if rng_type is None:
        assert train_val_seed_1 != train_val_seed_2
        assert small_trainset_seed_1 != small_trainset_seed_2
    else:
        assert train_val_seed_1 == train_val_seed_2
        assert small_trainset_seed_1 == small_trainset_seed_2
    assert train_val_seed_2 == train_val_seed_3
    assert small_trainset_seed_2 == small_trainset_seed_3
Great idea! But we still need to use a large dataset (> 10k samples) to test the seed for the subsampled training set, so I think that keeping the large datasets (and setting max_iter and max_depth to low values so that the tests are still fast) is still relevant.
Thanks for your remarks, I made some changes according to them.
It looks like the tests are failing sometimes: when random_state is None, the two generated seeds can happen to be equal, so the inequality checks fail.
Yeah, the seeds are probably the same. Let's remove the inequality check for None then.
A few more comments, probably the last (or close ;) ).
Since there are quite a lot of new tests, I'd be in favor of putting them all in a new test_warm_start.py file.
Nice work @johannfaouzi and thanks for sticking to it!
    The scores at each iteration on the training data. The first entry
    is the score of the ensemble before the first iteration. Scores are
    computed according to the ``scoring`` parameter. If ``scoring`` is
    not 'loss', scores are computed on a subset of at most 10 000
    samples. Empty if no early stopping.
validation_score_ : ndarray, shape (max_iter + 1,)
    samples. This subset may vary between different calls to ``fit`` if
Not true anymore (same for Classifier below)
@pytest.mark.parametrize('GradientBoosting, X, y', [
    (HistGradientBoostingClassifier, X_classification_large,
Why do you need large datasets? The small ones should be enough?
Nevermind, the seed is always generated, even if there are fewer than 10k samples in the training set. I had in mind the fact that it was generated only if there are more than 10k samples.
# Check that all seeds are equal
if rng_type is None:
    assert train_val_seed_1 != train_val_seed_2
    assert small_trainset_seed_1 != small_trainset_seed_2
let's remove these 2 lines above since None can sometimes give the same seed
# Get the predictors from the previous fit
predictors = self._predictors

begin_at_stage = len(predictors)
Suggested change: begin_at_stage = len(predictors) -> begin_at_stage = self.n_iter_
(same but more explicit IMO)
@@ -761,13 +852,14 @@ class HistGradientBoostingClassifier(BaseHistGradientBoosting,
    The number of trees that are built at each iteration. This is equal to 1
    for binary classification, and to ``n_classes`` for multiclass
    classification.
train_score_ : ndarray, shape (max_iter + 1,)
train_score_ : ndarray, shape (n_iter_ + 1,)
I don't understand why our rendered html generates a (void) reference for this.
@adrinjalali could this be linked to the recent changes made to sphinx?
I really doubt that; the change was only to not raise a warning when a reference is not found, nothing else changed AFAIK.
    (HistGradientBoostingRegressor, X_regression, y_regression)
])
def test_warm_start_clear(GradientBoosting, X, y):
    # Test if fit clears state.
Please also check that train_scores and val_scores are the same, to directly check _clear_state().
I'm not sure that I understand what you mean. I will add assertions that check that all the attributes are the same in _assert_predictor_equal(); let me know if that's what you expected.
What I mean is that since _clear_state() only clears train_scores, val_scores and the _rng, that's what this test should check. You just need to set n_iter_no_change=5 (or something else) and make sure the attributes are the same. (Please revert the changes made to _assert_predictor_equal; it should only check predictors and predictions.)
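A sketch of what that check could look like (an assumed test, using n_iter_no_change to populate the score attributes; the exact assertions are illustrative, not the final test):

import numpy as np
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.datasets import make_regression


def test_warm_start_clear_sketch():
    # A non-warm-started estimator and a warm-started one whose state has
    # been cleared should end up with identical scores.
    X, y = make_regression(random_state=0)
    gb_1 = HistGradientBoostingRegressor(n_iter_no_change=5, random_state=42)
    gb_1.fit(X, y)

    gb_2 = HistGradientBoostingRegressor(n_iter_no_change=5, random_state=42,
                                         warm_start=True)
    gb_2.fit(X, y)  # inits state
    gb_2.set_params(warm_start=False)
    gb_2.fit(X, y)  # clears old state and refits from scratch

    np.testing.assert_allclose(gb_1.train_score_, gb_2.train_score_)
    np.testing.assert_allclose(gb_1.validation_score_, gb_2.validation_score_)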
Okay, I made some changes, let me know if it's alright.
X_classification_large, y_classification_large = make_classification(
    n_samples=20000, random_state=0)
X_regression_large, y_regression_large = make_regression(
    n_samples=20000, random_state=0)
I don't think we should need these
It's getting close ;) Thank you very much for your remarks! It was more time-consuming than I expected, but also a great learning experience :)
ping @adrinjalali @glemaitre @thomasjpfan if one of you wants to review this? LGTM already.
# Check identical nodes for each tree
for (pred_ith_1, pred_ith_2) in zip(gb_1._predictors, gb_2._predictors):
    for (predictor_1, predictor_2) in zip(pred_ith_1, pred_ith_2):
        np.testing.assert_array_equal(
@NicolasHug Did we agree on not using sklearn.utils.testing.assert_* and using the np.testing utils? I don't recall, but I might have missed the discussion.
I think we don't use assert_array_equal from sklearn.utils and use the one from np.testing now. But I don't know, I'm not sure. It has never been clear to me, cf. #13180. I have decided not to care about this anymore.
Before merging, I will just make an explicit import:
from numpy.testing import assert_array_equal
Like this, only the import will change if we decide to make it homogeneous.
I tried to address the remaining remarks, let me know if you think more changes are required.
Comments were addressed @glemaitre, wanna give this a last round?
Thanks @johannfaouzi
Reference Issues/PRs
Fixes #13967.
What does this implement/fix? Explain your changes.
This PR adds a warm_start parameter to sklearn.ensemble.HistGradientBoostingClassifier and sklearn.ensemble.HistGradientBoostingRegressor, similarly to sklearn.ensemble.GradientBoostingClassifier and sklearn.ensemble.GradientBoostingRegressor.
TODO list
- Add the warm_start parameter, with False as default value, in the __init__ method of both classes.
- Document the warm_start parameter in the docstring of both classes.
- Handle the warm_start parameter in BaseHistGradientBoosting.
- Add _is_initialized() to determine if the estimator is initialized or not.
- Add _clear_state() to clear the results (done, but changes probably needed).
- Add an if/else branch for the cases where the estimator is already initialized and where it is not.
Any other comments?
This is my first PR.
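To summarize the TODO list above, here is a rough, hypothetical skeleton of the warm start logic (simplified toy code, not the PR's actual implementation):

class _WarmStartSkeleton:
    """Toy outline of the fit() flow described in the TODO list."""

    def __init__(self, max_iter=100, warm_start=False):
        self.max_iter = max_iter
        self.warm_start = warm_start

    def _is_initialized(self):
        # The estimator is initialized once it holds fitted predictors.
        return hasattr(self, '_predictors') and len(self._predictors) > 0

    def _clear_state(self):
        # Drop the attributes accumulated by a previous fit.
        for attr in ('train_score_', 'validation_score_', '_rng'):
            if hasattr(self, attr):
                delattr(self, attr)

    def fit(self, X, y):
        if not (self.warm_start and self._is_initialized()):
            # Fresh fit: erase any previous state and start from scratch.
            self._clear_state()
            self._predictors = []
        elif self.max_iter < len(self._predictors):
            raise ValueError('max_iter must be >= the number of iterations '
                             'already fitted when warm_start is True')
        # ... boosting loop: grow trees from len(self._predictors)
        #     up to self.max_iter ...
        return self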