MNT support cross 32bit/64bit pickles for HGBT #28074
Conversation
Shall we add a non-regression test for this? One way would be to commit a few small pickle test files:
The generation of those models should be scripted (to allow reproducibility, e.g. to regenerate those files in case we refactor the internals of those estimators in the future), but the result would be committed to the repository. Then we would write a test to check that loading those pickles and calling the prediction methods works as expected.
Why does this work? If you are on a platform that does not support int64, I'd expect unpickling to fail: what type does the unpickling create when it reads a file that says "this is an int64" and there isn't one? Can someone explain where my thinking is going wrong? Agree that having a test would be good, though we'd need to switch platforms between CI runs, no? Or maybe we can include a pickle generated on a 32-bit and a 64-bit platform in the "static" data available to tests? That way you'd avoid having to re-generate it (and switch platforms) at the cost of having to store a binary file in the repo. edit: Olivier just suggested this :D
32-bit platforms support loading int64 arrays: numpy's fixed-width dtypes exist on every platform; it is only np.intp that maps to a different width depending on the platform's bitness.
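For illustration, the dtype mechanics can be sketched in plain numpy. This is a generic sketch, not scikit-learn code; np.intp merely stands in for the platform-dependent field type discussed in this PR:

```python
import pickle

import numpy as np

# An ndarray pickles with its concrete dtype (e.g. "<i8"), never with the
# alias np.intp, so a 32-bit platform unpickling a 64-bit pickle simply
# gets an int64 array: int64 exists everywhere, it just is not equal to
# np.intp there.
a = np.arange(3, dtype=np.intp)
b = pickle.loads(pickle.dumps(a))
print(b.dtype.itemsize)  # pointer size of the platform that pickled

# An explicit cast restores the platform-dependent dtype after loading:
c = b.astype(np.intp)
print(c.dtype == np.intp)  # True on any platform
```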
Has someone manually checked that this fixes the problem with pyodide? If so +1 for merging on my side (with a changelog entry targeting either 1.4.0 or 1.4.1 as this is a bugfix).
As I commented above, I think it would be great to have non-regression tests for this, but as writing such tests in a maintainable way might be significantly more complex than the fix itself, I am fine with merging without tests for a start and maybe designing a common test for cross-bitness model serialization later.
Thanks for the explanation!
I also prefer to merge without a test. Creating a pickle file once, checking it in via git and using it in a test also does not work, as we don't guarantee forward compatibility of saved models. One takeaway for me is that it shows a design flaw that we have: authoritative Python references tell you to validate inside __setstate__. Note that one alternative, a bit orthogonal to this PR, would be to use platform-independent dtypes only, i.e. int32 instead of intp (I guess int32 would be enough by far!).
Indeed, along with an inline comment to explain why we hardcode the use of int32 in such code.
Note that we added tests for the trees in #21552; something similar could be done here. About common tests, I don't think this is so easy for 32-bit/64-bit, but it could be doable for little-endian vs big-endian, which could surface some of these issues. I had an old branch about this but I would need to revive it: https://github.com/lesteve/scikit-learn/tree/test-common-cross-endianness-pickle
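A hedged sketch of the endianness side of this, in plain numpy (not tied to any scikit-learn estimator): byte order can be swapped in-memory to emulate data produced on a big-endian platform, without needing a big-endian CI machine.

```python
import numpy as np

# "<i8" is little-endian int64, ">i8" is big-endian int64. astype
# converts the storage while preserving the values, which is how a
# cross-endianness test could fabricate "foreign" arrays locally.
a = np.arange(3, dtype="<i8")
b = a.astype(">i8")

# The big-endian buffer is exactly the byte-swapped little-endian one:
print(b.tobytes() == a.byteswap().tobytes())  # True

# numpy comparisons account for byte order, so values still match:
print((a == b).all())  # True
```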
In particular: @lorentzenchr I can help to push such a non-regression test to your PR if you don't have the time to do it yourself.
I won't have time for it. I would prefer a common test for it - if possible - and merge here without a dedicated test.
It's hard to write a common test that does not involve the maintenance overhead of storing and managing platform-specific pickle files somewhere. The strategy used by @lesteve above, on the other hand, does not require storing any pickle file anywhere but requires an estimator-specific monkeypatch during the test execution. The test itself is complex to write but significantly lower maintenance.
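The no-stored-pickle strategy could look roughly like the following toy sketch. TinyPredictor, NODE_DTYPE and OTHER_BITNESS_DTYPE are hypothetical stand-ins, not scikit-learn API; a real test would instead patch the estimator's actual record dtype (PREDICTOR_RECORD_DTYPE) to fabricate a "foreign bitness" pickle locally:

```python
import pickle

import numpy as np

# The local platform's layout (intp field) and an emulated "other
# bitness" layout with an int32 field instead.
NODE_DTYPE = np.dtype([("feature_idx", np.intp), ("threshold", np.float64)])
OTHER_BITNESS_DTYPE = np.dtype([("feature_idx", np.int32), ("threshold", np.float64)])


class TinyPredictor:
    """Toy stand-in for a tree predictor holding a structured nodes array."""

    def __init__(self, nodes):
        self.nodes = nodes

    def __setstate__(self, state):
        # Cast-on-load: field-wise conversion to the local layout,
        # mirroring the kind of fix discussed in this PR.
        state["nodes"] = state["nodes"].astype(NODE_DTYPE)
        self.__dict__.update(state)


# Emulate "fitted on the other platform" by building nodes with the
# foreign layout, then pickle and reload locally.
foreign = TinyPredictor(np.zeros(2, dtype=OTHER_BITNESS_DTYPE))
local = pickle.loads(pickle.dumps(foreign))
print(local.nodes.dtype == NODE_DTYPE)  # True
```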
@ogrisel I'd highly appreciate your help with such a test.
I can try to take a look at adding a test, since this fix is useful in a Pyodide context! @lorentzenchr out of interest, did you work on this because you had a use case where the fix was needed, or mostly because you were curious?
Just curious and trying to remove bugs that matter. |
Here is what I have done:
I think we should target 1.4.0. I will do a review today.
LGTM (just a few suggestions to make the comments more explicit for people not too familiar with the interaction of numpy dtypes and platform bitness).
sklearn/ensemble/_hist_gradient_boosting/tests/test_gradient_boosting.py
…oosting.py Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
LGTM. Let's include it in 1.4.0. Thanks
# TODO: consider always using platform agnostic dtypes for fitted
# estimator attributes. For this particular estimator, this would
# mean replacing the intp field of PREDICTOR_RECORD_DTYPE by an int32
# field. Ideally this should be done consistently throughout
# scikit-learn along with a common test.
+1
Co-authored-by: Loïc Estève <loic.esteve@ymail.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Reference Issues/PRs
Fixes #27952.
What does this implement/fix? Explain your changes.
This PR makes it possible to fit and save (pickle dump) an HGBT model on a system with one bitness (e.g. 64-bit) and to load and apply the model on a system with a different bitness (e.g. 32-bit).
The crucial point is TreePredictor.nodes, an ndarray of PREDICTOR_RECORD_DTYPE. The field feature_idx is of dtype np.intp, which is platform dependent.
Any other comments?
A common test for this would be nice.
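To illustrate the crux described above, here is a minimal sketch of why an intp field breaks cross-bitness pickles and how a field-wise cast fixes it. The record dtypes below use an illustrative subset of fields, not the real PREDICTOR_RECORD_DTYPE:

```python
import numpy as np

# On a 64-bit platform np.intp is int64; on a 32-bit platform it is
# int32. A record containing it therefore has a different memory
# layout per platform, so raw node buffers are not interchangeable.
rec_64 = np.dtype([("feature_idx", np.int64), ("threshold", np.float64)])
rec_32 = np.dtype([("feature_idx", np.int32), ("threshold", np.float64)])
print(rec_64.itemsize, rec_32.itemsize)  # 16 12 -> incompatible layouts

# Field-wise casting, the approach taken on unpickling in this PR,
# converts one layout to the other value by value:
nodes = np.zeros(2, dtype=rec_64)
print(nodes.astype(rec_32).dtype == rec_32)  # True
```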