[WIP] Fixing data leak with warm starting in GBDT #15032

Open
wants to merge 14 commits into main

Conversation

johannfaouzi
Contributor

Reference Issues/PRs

Fixes #14034

What does this implement/fix? Explain your changes.

Instead of saving a RandomState instance, which is mutated after each use, an integer seed is saved. This ensures that the same seed is reused when warm starting, rather than a random state that has drifted. A small test is added to check that the random seed is the same or different, depending on the parameters.
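
To illustrate the underlying problem, here is a minimal sketch using train_test_split (not the estimator's internal code): a stored RandomState instance is mutated by every use, so a warm-started fit would draw a different train/validation split than the original fit, leaking former validation samples into training, while a stored integer seed reproduces the same split.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# A RandomState instance is consumed by each call, so two splits differ.
rng = np.random.RandomState(42)
split_1 = train_test_split(X, y, random_state=rng)   # mutates rng
split_2 = train_test_split(X, y, random_state=rng)   # most likely a different split

# An integer seed gives the same split every time.
seed = 42
split_a = train_test_split(X, y, random_state=seed)
split_b = train_test_split(X, y, random_state=seed)
assert all(np.array_equal(a, b) for a, b in zip(split_a, split_b))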

Any other comments?

There are currently 6 failing tests:

  • 4 are about performance, which is worse in settings without warm starting
  • 2 compare against hard-coded values

I also fixed the random seed for the random mask (out-of-bag samples; see the relevant lines). I was a bit hesitant after looking at the Cython function, so I don't know whether this is necessary or a mistake.
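
For reference, a rough sketch in pure NumPy (not the actual Cython implementation) of how an in-bag / out-of-bag mask can be drawn; with a fixed integer seed the same mask is reproduced on every call, which is what fixing the seed buys us:

import numpy as np

def sample_mask(n_samples, subsample, seed):
    # Draw an in-bag mask; ~mask marks the out-of-bag samples.
    rng = np.random.RandomState(seed)
    n_inbag = max(1, int(subsample * n_samples))
    mask = np.zeros(n_samples, dtype=bool)
    mask[rng.choice(n_samples, size=n_inbag, replace=False)] = True
    return mask

# Same seed -> same mask, so out-of-bag improvements are comparable across fits.
assert np.array_equal(sample_mask(100, 0.5, 0), sample_mask(100, 0.5, 0))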

Feedback is welcomed.

@johannfaouzi
Contributor Author

It seems like the number of estimators with early stopping is different between the 32-bit and 64-bit versions...
These tests are hard-coded; I am not sure what I can do about that.

@johannfaouzi
Contributor Author

Friendly ping @NicolasHug, I'm a bit surprised to see different results for different Python versions :(

@NicolasHug
Member

Thanks for the ping (don't hesitate to ping me more)

Can you check whether the seed is the same for all CIs?
(You can tweak TEST_CMD with e.g. -k your_test in build_tools/azure/test_script.sh to save time.)

That being said, maybe the check is just too strict. We typically are more permissive in the Hist GBDT tests.
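
For illustration only, a more permissive check could compare against a reference value with a loose relative tolerance rather than an exact hard-coded number; the values below are made up.

import pytest

observed_score = 0.9312   # value one CI run might produce (illustrative)
expected_score = 0.93     # reference value stored in the test
# A loose relative tolerance absorbs small platform-dependent randomness.
assert observed_score == pytest.approx(expected_score, rel=1e-1)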

@johannfaouzi
Contributor Author

johannfaouzi commented Oct 30, 2019

Thanks for the reply! I tried to print _random_seed for the successful tests, but it doesn't show up. However, the seed for the failing test (Python 3.5: 1608637542) is the same as the one I get on my local machine (Python 3.7, macOS: 1608637542), where the test succeeds.

@NicolasHug
Member

I think @thomasjpfan has better tricks, but you can assert seed == 1.5 and you should see the value in the error message.
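
Something like this (a rough sketch, assuming the _random_seed attribute added in this PR); the assert is chosen to always fail so pytest's failure report prints the actual value on every CI:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

def test_print_random_seed():
    X, y = make_classification(n_samples=100, random_state=0)
    gb = GradientBoostingClassifier(n_estimators=5).fit(X, y)
    # Deliberately impossible value: the failure message shows the real seed.
    assert gb._random_seed == 1.5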

@johannfaouzi
Contributor Author

johannfaouzi commented Oct 30, 2019

I created another virtual environment with Python 3.5 on my local machine and the test succeeds there...

(scikit-learn-temp) ✘ johann.faouzi ~/scikit-learn/sklearn/ensemble (warm_start_GBDT) python -m pytest -k test_gradient_boosting_early_stopping -s
==================================================================================================== test session starts =====================================================================================================
platform darwin -- Python 3.5.6, pytest-5.2.2, py-1.8.0, pluggy-0.13.0
rootdir: /Users/johann.faouzi/scikit-learn, inifile: setup.cfg
collected 718 items / 717 deselected / 2 skipped

tests/test_gradient_boosting.py 1608637542
1608637542
1608637542
1608637542
1608637542
1608637542
.

================================================================================================== short test summary info ===================================================================================================
SKIPPED [2] /Users/johann.faouzi/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/tests/test_compare_lightgbm.py:17: could not import 'lightgbm': No module named 'lightgbm'
================================================================================== 1 passed, 2 skipped, 717 deselected, 1 warnings in 3.51s ==================================================================================

@johannfaouzi
Contributor Author

Looking at the CI build, it only fails on Linux and Python 3.5. It succeeds on:

  • Windows py35_pip_openblas_32bit,
  • Linux pylatest_pip_openblas_pandas, and
  • Linux pylatest_conda_mkl

@johannfaouzi
Contributor Author

@NicolasHug
Member

Seems like the seed is 1791095845 for the last 2 ones but 1608637542 for the rest?

@johannfaouzi
Contributor Author

For Windows py37_conda_mkl, line 12227, I see 1608637542.
For Windows py35_pip_openblas_32bit, line 9724, I see 1608637542.

Am I reading the wrong lines?

@NicolasHug
Member

NicolasHug commented Oct 30, 2019

You're right, I didn't realize you were running all the tests.

If the seed is the same, maybe the discrepancy comes from the subsamples then?

@thomasjpfan we're again experiencing some weird differences in the randomness of the CIs :/

@johannfaouzi
Contributor Author

Maybe. I will try to print out self._rng:

self._rng = check_random_state(self.random_state)

This was originally created to preserve the random state with warm starting; however, self._rng is a RandomState instance, so it is mutated every time it is used. I'm not sure it is doing what it is supposed to do. It is used when fitting the stages:

n_stages = self._fit_stages(
    X, y, raw_predictions, sample_weight, self._rng, X_val, y_val,
    sample_weight_val, begin_at_stage, monitor, X_idx_sorted)

Also, if you look at this previous build, you can see that the seed is printed three times, which means that the test fails when the tolerance is lowered (it succeeds with tol=1e-1, but fails with tol=1e-3).
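
As a side note on check_random_state, a minimal sketch (independent of this PR): passing an integer returns a freshly seeded RandomState, so repeated calls reproduce the same stream, while passing an existing RandomState returns that very object, whose internal state keeps advancing across uses.

import numpy as np
from sklearn.utils import check_random_state

# Integer seed: two calls give independent generators with identical streams.
rng_a = check_random_state(42)
rng_b = check_random_state(42)
assert rng_a.randint(10**9) == rng_b.randint(10**9)

# RandomState instance: returned as-is, so every draw mutates the shared state
# and a later fit reusing it sees a different stream.
shared = np.random.RandomState(42)
assert check_random_state(shared) is shared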

Base automatically changed from master to main January 22, 2021 10:51
Successfully merging this pull request may close these issues.

data leak in GBDT due to warm start