[WIP] Fixing data leak with warm starting in GBDT #15032
Conversation
It seems like the number of estimators with early stopping is different between the 32-bit and 64-bit versions...
Friendly ping @NicolasHug, I'm a bit surprised to see different results for different Python versions :(
Thanks for the ping (don't hesitate to ping me more). Can you check whether the seed is the same for all CIs? That being said, maybe the check is just too strict. We are typically more permissive in the Hist GBDT tests.
Thanks for the reply! I tried to print …
I think @thomasjpfan has better tricks, but you can add `assert seed == 1.5` (an impossible value) and you should then see the actual seed in the error message.
I created another virtual environment with Python 3.5 on my local machine and the test succeeds there:

```
(scikit-learn-temp) ✘ johann.faouzi ~/scikit-learn/sklearn/ensemble warm_start_GBDT python -m pytest -k test_gradient_boosting_early_stopping -s
==================== test session starts ====================
platform darwin -- Python 3.5.6, pytest-5.2.2, py-1.8.0, pluggy-0.13.0
rootdir: /Users/johann.faouzi/scikit-learn, inifile: setup.cfg
collected 718 items / 717 deselected / 2 skipped

tests/test_gradient_boosting.py 1608637542
1608637542
1608637542
1608637542
1608637542
1608637542
.

================== short test summary info ==================
SKIPPED [2] /Users/johann.faouzi/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/tests/test_compare_lightgbm.py:17: could not import 'lightgbm': No module named 'lightgbm'
===== 1 passed, 2 skipped, 717 deselected, 1 warnings in 3.51s =====
```
Looking at the CI build, it only fails on Linux with Python 3.5. It succeeds on:
It seems to be the same seed (1608637542):
Seems like the seed is 1791095845 for the last two builds but 1608637542 for the rest?
Am I reading the wrong lines?
You're right, I didn't realize you were running all the tests. If the seed is the same, maybe the discrepancy comes from the subsamples then? @thomasjpfan, we're again experiencing some weird differences in the randomness of the CIs :/
Maybe. I will try to print out `sklearn/ensemble/_gb.py`, line 1513 (at 3b374cc).
This was originally created to save the random state with warm starting; however, see `sklearn/ensemble/_gb.py`, lines 1535 to 1537 (at 3b374cc) …
Also, if you look at this previous build, you can see that the seed is printed three times, which means that the test fails when the tolerance is lowered (it succeeds with …).
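A tolerance check of the kind being discussed can be sketched like this (illustrative only, not the PR's actual test; the arrays and tolerances are made up): a permissive `rtol` absorbs small platform-dependent drift, while a stricter one turns the same drift into a failure.

```python
import numpy as np
from numpy.testing import assert_allclose

actual = np.array([100.0, 200.0])
# Simulate a tiny platform-dependent numerical drift.
expected = actual * (1 + 1e-6)

# A permissive tolerance absorbs the drift: |diff| = 1e-4 for the
# first entry, well within rtol * |expected| = 1e-3.
assert_allclose(actual, expected, rtol=1e-5)

# A stricter tolerance (rtol * |expected| = 1e-6) rejects it.
try:
    assert_allclose(actual, expected, rtol=1e-8)
    strict_passed = True
except AssertionError:
    strict_passed = False
```

This is why loosening the comparison (as the Hist GBDT tests do) can make a test pass on all platforms even when the underlying numbers differ slightly.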
Reference Issues/PRs
Fixes #14034
What does this implement/fix? Explain your changes.
Instead of saving a RandomState instance, which is mutated after each use, an integer seed is saved. This ensures that the train/validation split for early stopping is identical across warm-started fits, which fixes the data leak. A small test is added to check that the random seed is the same or different, depending on the parameters.
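The core of the fix can be illustrated with a minimal sketch (using `check_random_state` and a plain permutation as a stand-in for the actual validation split, not the real `_gb.py` code): a RandomState instance is consumed as it is used, so two successive fits see different splits, while an integer seed reproduces the same split every time.

```python
import numpy as np
from sklearn.utils import check_random_state

n_samples = 10

# A RandomState instance advances its internal state on every call, so
# two successive "fits" get different validation splits: validation
# samples from the first fit can leak into the training set of the
# warm-started second fit.
rng = check_random_state(np.random.RandomState(42))
perm_first_fit = rng.permutation(n_samples)
perm_second_fit = rng.permutation(n_samples)

# An integer seed is stateless: re-creating the RandomState from it
# reproduces the exact same split on every fit.
seed = 42
perm_a = check_random_state(seed).permutation(n_samples)
perm_b = check_random_state(seed).permutation(n_samples)

print(np.array_equal(perm_first_fit, perm_second_fit))  # instance: splits drift apart
print(np.array_equal(perm_a, perm_b))                   # integer: split is stable
```

Storing the integer (rather than the mutated instance) is what makes the early-stopping split stable across warm starts.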
Any other comments?
There are currently 6 failing tests:
I also fixed the random seed for the random mask (out-of-bag samples, relevant lines); I was a bit wary of touching the Cython function. I don't know whether this is necessary or whether it is a mistake.
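The idea behind seeding the random mask can be sketched as follows. This is an illustrative stand-in, not the actual `_gb.py` or Cython implementation (`sample_mask` is a hypothetical helper): with a fixed integer seed, the in-bag/out-of-bag split is reproducible across fits.

```python
import numpy as np
from sklearn.utils import check_random_state

def sample_mask(n_samples, subsample, seed):
    # In-bag mask for one boosting iteration; entries that are False
    # are the out-of-bag samples. Hypothetical sketch, not the actual
    # scikit-learn implementation.
    rng = check_random_state(seed)
    return rng.uniform(size=n_samples) < subsample

# Same seed -> same in-bag/OOB split on every (warm-started) fit.
mask_a = sample_mask(100, 0.5, seed=0)
mask_b = sample_mask(100, 0.5, seed=0)
assert np.array_equal(mask_a, mask_b)
```

Without the fixed seed, each fit would draw a fresh mask, so OOB estimates would not be comparable between an ordinary fit and a warm-started one.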
Feedback is welcome.