FIX: deprecate integer valued numerical features for PDP#30409
Merged
glemaitre merged 3 commits intoscikit-learn:mainfrom Dec 8, 2024
Merged
Conversation
thomasjpfan
approved these changes
Dec 6, 2024
Member
thomasjpfan
left a comment
There was a problem hiding this comment.
I'm okay with warning and erroring in the future here.
glemaitre
approved these changes
Dec 8, 2024
Member
glemaitre
left a comment
There was a problem hiding this comment.
Thanks @ogrisel, I should provide my finding, this could have maybe helped.
I agree with the general idea of the PR. I was leaning towards the same solution and either consider integer dtype as a categorical feature or otherwise, use explicit floating point.
Member
|
I just see that I merged this PR but it targets 1.6 indeed. We could always add it in 1.6.1 because this is kind of a fix + deprecation. @jeremiedbb do you think that we still have time to backport the commit? |
Member
|
I merged the release PR but I haven't pushed the tag yet so I guess it's doable. |
stefanogaspari
pushed a commit
to stefanogaspari/scikit-learn
that referenced
this pull request
Dec 9, 2024
virchan
pushed a commit
to virchan/scikit-learn
that referenced
this pull request
Dec 9, 2024
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This is a tentative fix for #30315 and #30378.
The problem is that new pandas versions refuse to assign floating-point values into integer dtyped columns. In PDP, such fractional floating-point values can be generated when creating the grid, even if the original numerical feature is integer valued.
In a first (and later abandoned) version of this PR, I tried to transparently change the dtype of
X_eval(or at least the columns where we want to insert those values whenXis a dataframe). However, this leads to very pandas and numpy specific code paths and this would need to be expanded for polars later. I have the feeling that this will become unmanageable quickly.Furthermore, changing the dtype of (some columns of)
X_evalmeans that we are calling the response method of the estimator with different dtypes than theX_trainused to fit the estimator. I have the feeling that this can cause weird bugs.So this PR tries to explicitly deprecate the support of integer values (used as numerical features) in PDPs and later raise a
ValueErrorinstead: the error message should be explicit enough to let the user know what to do to update their code. I thought this would be simple, but as you can see in this PR, this is a bit more complex than anticipated.After creating this PR I am thinking of a final alternative: instead of changing the dtype in
X_eval, we could round the fractionalThis last alternative will be less intrusive (and in particular will match the implicit behavior of numpy and old pandas versions). But it can be surprising: we generate a fine grid but then it is implicitly coarsified when computing the PDP value. This should be not to complex to implement in a somewhat container agnostic way (using
_safe_indexing/_safe_assignand.astype). However, this might be a bit magic. Not sure what is best.Would love to get feedback from @glemaitre or @lesteve for instance.