[WIP] Fixing data leak with warm starting in GBDT #15032

Open
wants to merge 14 commits into main

Conversation

johannfaouzi
Contributor

Reference Issues/PRs

Fixes #14034

What does this implement/fix? Explain your changes.

Instead of saving a RandomState instance, which is mutated after each use, an integer seed is saved. This ensures that the same seed is reused when warm starting, rather than a random state that has drifted. A small test is added to check that the random seed is the same or different, depending on the parameters.
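
To illustrate the underlying problem, here is a minimal sketch using train_test_split (not the estimator's internal code): a stored RandomState instance is mutated by every use, so a warm-started fit would draw a different train/validation split than the original fit, leaking former validation samples into training, while a stored integer seed reproduces the same split.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# A RandomState instance is consumed by each call, so two splits differ.
rng = np.random.RandomState(42)
split_1 = train_test_split(X, y, random_state=rng)   # mutates rng
split_2 = train_test_split(X, y, random_state=rng)   # most likely a different split

# An integer seed gives the same split every time.
seed = 42
split_a = train_test_split(X, y, random_state=seed)
split_b = train_test_split(X, y, random_state=seed)
assert all(np.array_equal(a, b) for a, b in zip(split_a, split_b))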

Any other comments?

There are currently 6 failing tests:

  • 4 are about performance, which is worse in settings without warm starting
  • 2 compare against hard-coded values

I also fixed the random seed for the random mask (out-of-bag samples; see the relevant lines). I was a bit hesitant after looking at the Cython function, so I don't know whether this is necessary or a mistake.
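
For reference, a rough sketch in pure NumPy (not the actual Cython implementation) of how an in-bag / out-of-bag mask can be drawn; with a fixed integer seed the same mask is reproduced on every call, which is what fixing the seed buys us:

import numpy as np

def sample_mask(n_samples, subsample, seed):
    # Draw an in-bag mask; ~mask marks the out-of-bag samples.
    rng = np.random.RandomState(seed)
    n_inbag = max(1, int(subsample * n_samples))
    mask = np.zeros(n_samples, dtype=bool)
    mask[rng.choice(n_samples, size=n_inbag, replace=False)] = True
    return mask

# Same seed -> same mask, so out-of-bag improvements are comparable across fits.
assert np.array_equal(sample_mask(100, 0.5, 0), sample_mask(100, 0.5, 0))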

Feedback is welcomed.

@johannfaouzi
Contributor Author

It seems like the number of estimators with early stopping is different between the 32-bit and 64-bit versions...
These tests are hard-coded; I am not sure what I can do about that.

@johannfaouzi
Contributor Author

Friendly ping @NicolasHug, I'm a bit surprised to see different results for different Python versions :(

@NicolasHug
Member

Thanks for the ping (don't hesitate to ping me more)

Can you check whether the seed is the same for all CIs?
(You can tweak TEST_CMD with e.g. -k your_test in build_tools/azure/test_script.sh to save time.)

That being said, maybe the check is just too strict. We typically are more permissive in the Hist GBDT tests.
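
For illustration only, a more permissive check could compare against a reference value with a loose relative tolerance rather than an exact hard-coded number; the values below are made up.

import pytest

observed_score = 0.9312   # value one CI run might produce (illustrative)
expected_score = 0.93     # reference value stored in the test
# A loose relative tolerance absorbs small platform-dependent randomness.
assert observed_score == pytest.approx(expected_score, rel=1e-1)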

@johannfaouzi
Contributor Author

johannfaouzi commented Oct 30, 2019

Thanks for the reply! I tried to print _random_seed for the successful tests, but it doesn't show up. However, the seed for the failing test (Python 3.5: 1608637542) is the same as the one I get on my local machine (Python 3.7, macOS: 1608637542), where the test succeeds.

@NicolasHug
Member

I think @thomasjpfan has better tricks, but you can assert seed == 1.5 and you should see the value in the error message.
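
Something like this (a rough sketch, assuming the _random_seed attribute added in this PR); the assert is chosen to always fail so pytest's failure report prints the actual value on every CI:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

def test_print_random_seed():
    X, y = make_classification(n_samples=100, random_state=0)
    gb = GradientBoostingClassifier(n_estimators=5).fit(X, y)
    # Deliberately impossible value: the failure message shows the real seed.
    assert gb._random_seed == 1.5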

@johannfaouzi
Contributor Author

johannfaouzi commented Oct 30, 2019

I created another virtual environment with Python 3.5 on my local machine and the test succeeds there...

(scikit-learn-temp) ✘ johann.faouzi ~/scikit-learn/sklearn/ensemble (warm_start_GBDT) python -m pytest -k test_gradient_boosting_early_stopping -s
==================================================================================================== test session starts =====================================================================================================
platform darwin -- Python 3.5.6, pytest-5.2.2, py-1.8.0, pluggy-0.13.0
rootdir: /Users/johann.faouzi/scikit-learn, inifile: setup.cfg
collected 718 items / 717 deselected / 2 skipped

tests/test_gradient_boosting.py 1608637542
1608637542
1608637542
1608637542
1608637542
1608637542
.

================================================================================================== short test summary info ===================================================================================================
SKIPPED [2] /Users/johann.faouzi/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/tests/test_compare_lightgbm.py:17: could not import 'lightgbm': No module named 'lightgbm'
================================================================================== 1 passed, 2 skipped, 717 deselected, 1 warnings in 3.51s ==================================================================================

@johannfaouzi
Contributor Author

Looking at the CI build, it only fails on Linux and Python 3.5. It succeeds on:

  • Windows py35_pip_openblas_32bit,
  • Linux pylatest_pip_openblas_pandas, and
  • Linux pylatest_conda_mkl

@johannfaouzi
Contributor Author

@NicolasHug
Member

Seems like the seed is 1791095845 for the last 2 ones but 1608637542 for the rest?

@johannfaouzi
Contributor Author

For Windows py37_conda_mkl, line 12227, I see 1608637542.
For Windows py35_pip_openblas_32bit, line 9724, I see 1608637542.

Am I reading the wrong lines?

@NicolasHug
Member

NicolasHug commented Oct 30, 2019

You're right, I didn't realize you were running all the tests.

If the seed is the same, maybe the discrepancy comes from the subsamples then?

@thomasjpfan we're again experiencing some weird differences in the randomness of the CIs :/

@johannfaouzi
Contributor Author

Maybe. I will try to print out self._rng:

self._rng = check_random_state(self.random_state)

This was originally created to preserve the random state with warm starting; however, self._rng is a RandomState instance, so it is mutated every time it is used. I'm not sure it is doing what it is supposed to do. It is used when fitting the stages:

n_stages = self._fit_stages(
    X, y, raw_predictions, sample_weight, self._rng, X_val, y_val,
    sample_weight_val, begin_at_stage, monitor, X_idx_sorted)

Also, if you look at this previous build, you can see that the seed is printed three times, which means that the test fails when the tolerance is lowered (it succeeds with tol=1e-1, but fails with tol=1e-3).
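
As a side note on check_random_state, a minimal sketch (independent of this PR): passing an integer returns a freshly seeded RandomState, so repeated calls reproduce the same stream, while passing an existing RandomState returns that very object, whose internal state keeps advancing across uses.

import numpy as np
from sklearn.utils import check_random_state

# Integer seed: two calls give independent generators with identical streams.
rng_a = check_random_state(42)
rng_b = check_random_state(42)
assert rng_a.randint(10**9) == rng_b.randint(10**9)

# RandomState instance: returned as-is, so every draw mutates the shared state
# and a later fit reusing it sees a different stream.
shared = np.random.RandomState(42)
assert check_random_state(shared) is shared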

Base automatically changed from master to main January 22, 2021 10:51
Successfully merging this pull request may close these issues.

data leak in GBDT due to warm start