[WIP] FIX KBinsDiscretizer: allow nans #9341 #17179
base: main
Conversation
…do not produce an extra category-column for NaN)
You will need to add a tag to the estimator to change its behaviour around that test. See other preprocessing estimators that mark that they allow NaNs. You will need to add tests for the added functionality. And you will need to make the linter happy by ensuring your code is PEP8-compliant.
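For context on the tag mechanism mentioned here: at the time of this PR, NaN-tolerant preprocessing estimators in scikit-learn declared this through the estimator tags returned by `_more_tags`. Below is a minimal sketch; the `NaNTolerantTransformer` class is invented purely for illustration, and whether this exact hook is the right fit for KBinsDiscretizer is an assumption.

```python
from sklearn.base import BaseEstimator, TransformerMixin

class NaNTolerantTransformer(TransformerMixin, BaseEstimator):
    """Toy transformer, used only to show where the tag lives."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

    def _more_tags(self):
        # Marks the estimator as accepting NaN so the common estimator
        # checks adapt their NaN-input test accordingly.
        return {"allow_nan": True}
```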
I think this change is good; only tests are missing... add tests please. @PabloRMira
@jnothman, @KumarGanesha1996: thank you very much for your tips! I've just added:
and my linter is now happy as well :-)
Thanks for attempting this. I still wonder whether:
- ordinal encoding would be better off passing the NaN downstream rather than replacing it with a number?
- one-hot encoding should leave a blank row rather than adding another indicator.
I did it this way because then
I would like to keep the additional column for NaN in the onehot-encoding case as it makes the modelling more explicit. Especially if we would like to implement a
I think I'm okay with -1 in ordinal, at least in the absence of estimators
that handle NaN directly.
I'm not entirely convinced by the need for an additional column in OHE. The
question of coefficients can go either way: the presence of NaN always
implies the absence of the other OHEncoded values, and that interdependence
needs to be taken into account when looking at any of their coefficients,
even if NaN does not have a separate column. But a column representing NaN
does mean a model can select for it more easily than just representing the
absence of each other feature.
I think I'm okay with this design, but would be more confident from others'
input.
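For readers following along, here is a small numpy sketch of the two behaviours being weighed here (it only illustrates the idea, it is not the PR's code): NaN mapped to an extra -1 category in the ordinal output, and an extra indicator column in the one-hot output.

```python
import numpy as np

x = np.array([0.1, np.nan, 0.7, 0.9])
bins = np.array([0.5, 0.8])              # two inner bin edges -> 3 bins: 0, 1, 2

nan_mask = np.isnan(x)
ordinal = np.digitize(x, bins).astype(float)
ordinal[nan_mask] = -1                   # proposed: NaN becomes the extra category -1
print(ordinal)                           # [ 0. -1.  1.  2.]

# One-hot with an additional indicator column for NaN (the design discussed above)
onehot = np.zeros((len(x), 4))           # 3 bin columns + 1 NaN column
onehot[~nan_mask, ordinal[~nan_mask].astype(int)] = 1
onehot[nan_mask, -1] = 1
print(onehot)
```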
Hello! Checking in here, how do we push this forward?
I would still like to merge this, but since then we have got some conflicts with master :-(
If you give me the green light, I will try to resolve the conflicts so that we can merge the changes.
I think this looks sensible. My only concern is that it makes NaNs disappear, which can be bad for feature matrices that have been computed in a way that may produce NaNs in cases of error (which is our fault for adopting NaN as a missing value sentinel).
The new behaviour requires documentation.
Thank you for the feedback, @jnothman! I see your point, and if I understand you correctly, your concern is that the KBinsDiscretizer will not return NaNs in the case of some error in the bin assignment within scikit-learn/sklearn/preprocessing/_discretization.py, lines 282 to 295 at ee6896e.
If this is the case, I think we should not be too concerned, because my change only replaces NaNs if the input column has NaNs, and exactly in the positions where the input column has these NaNs (cf. line 294). Hence, computational errors in the bin assignment should still surface as some sort of error.
If my understanding is correct, I will then document the change in the docstrings and resolve the conflicts with master.
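A tiny numpy sketch of the argument just made (again an illustration, not the PR's actual code): because the -1 replacement is driven by a mask computed on the input column, a NaN that appeared in the bin assignments at a non-missing position would not be hidden.

```python
import numpy as np

col = np.array([0.2, np.nan, 1.4])
bin_ids = np.array([0.0, np.nan, np.nan])      # pretend the last NaN is a computation error
out = np.where(np.isnan(col), -1.0, bin_ids)   # mask comes from the *input* column
print(out)                                     # [ 0. -1. nan] -> the error NaN still surfaces
```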
I'm not talking about computational errors in the bin assignments, so much as NaNs introduced by division by zero, logarithm of a non-positive number, missing ID matches in table joins, etc. If feature engineering results in true NaNs, rather than their use as a sentinel for missingness, then we should endeavour to make sure that the user knows about those computation errors, rather than them being silently covered up. So far our preprocessing tools maintain NaN values, so if any downstream processing forbids NaNs, the user is given an error to debug, and might find corresponding anomalies in their feature engineering. I think this is valuable behaviour. My minor concern here is that the NaNs get replaced by numbers, without the user explicitly requesting this behaviour.
- One option would be to require the user to switch on missing value handling using a parameter (sketched below).
- Another is to always make sure that a NaN in input produces at least one NaN in the corresponding output, but this comes at some inconvenience to the user who then will need to post-process the data for missing values.
- The third is to accept that we make the user blind to feature engineering errors, but we document the fact that this will happen.
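To make the first option concrete, here is a purely hypothetical sketch of an explicit opt-in parameter; the name `handle_missing` and the toy `OptInDiscretizer` class are invented for illustration and are not part of this PR or of scikit-learn's API.

```python
import numpy as np

class OptInDiscretizer:
    """Toy discretizer showing the opt-in idea, not a real estimator."""

    def __init__(self, n_bins=3, handle_missing="error"):
        self.n_bins = n_bins
        self.handle_missing = handle_missing   # "error" (default) or "encode"

    def transform(self, x):
        x = np.asarray(x, dtype=float)
        if np.isnan(x).any() and self.handle_missing == "error":
            raise ValueError(
                "Input contains NaN; pass handle_missing='encode' to opt in."
            )
        # Quantile-style edges computed on the non-NaN values only
        edges = np.nanpercentile(x, np.linspace(0, 100, self.n_bins + 1)[1:-1])
        out = np.digitize(x, edges).astype(float)
        out[np.isnan(x)] = -1                  # only reached when explicitly requested
        return out

print(OptInDiscretizer(handle_missing="encode").transform([1.0, np.nan, 4.0, 9.0]))
# -> [ 0. -1.  1.  2.]
```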
Thank you for the examples and explanation! Now I can better follow what you are worried about. The design of my change, assigning NaN to an additional category -1 in the KBinsDiscretizer, was motivated by the fact that I only considered NaN to be a genuine missing value in numeric input features, not a product of (silent) errors in preceding preprocessors / transformations. But the points you made are certainly something to care about.
The problem is: if some transformation, say division by 0 or logarithm of a negative number in some other preprocessor, silently generates NaNs without raising an error or even a warning, then genuine missing values and these "error NaNs" get mixed. Once that preceding preprocessor passes the data on, the KBinsDiscretizer cannot distinguish between genuine missing values and silent error NaNs, because it only sees NaNs as input.
My point is therefore: even if we pass the input NaNs (genuine missing values and / or silent error NaNs) through as NaNs in the output, for the user it would be almost the same as if we pass the -1 category. This is because I expect the user to impute the NaNs in the next step (most naturally via simple imputation, say with -1 or whatever number) to make the feature usable for a model, which again mixes genuine missing values with potential "error NaNs" of preceding preprocessors and makes the NaN passthrough useless.
I think it should be the job of the other preprocessors to at least raise a warning (better, an error) in case they generate NaNs as a consequence of a disallowed calculation (e.g. division by 0 or logarithm of a negative number). But this is rather a broader topic beyond the scope of the proposed change.
Having said that, in order to push this further I can offer:
What do you think, @jnothman? |
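As a concrete version of the imputation step described above: if NaNs were passed through, a user would typically follow up with something like the snippet below (a sketch using scikit-learn's SimpleImputer; the sentinel -1 is just an example), which ends up equivalent to emitting the -1 category directly.

```python
import numpy as np
from sklearn.impute import SimpleImputer

Xt = np.array([[0.0], [np.nan], [2.0]])   # pretend this is discretizer output with NaN passed through
imputer = SimpleImputer(strategy="constant", fill_value=-1)
print(imputer.fit_transform(Xt))          # [[ 0.] [-1.] [ 2.]]
```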
@StefanieSenger, similar to your other work, if you're interested.
Reference Issues/PRs
This fixes #9341
What does this implement/fix? Explain your changes.
This fix extends the functionality of the KBinsDiscretizer to allow NaNs in the dataset. NaNs will be assigned to an additional category, -1, and this category will be propagated into the onehot-encoding if specified.
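A usage sketch of the intended behaviour, assuming this PR's branch is installed (released scikit-learn rejects NaN input to KBinsDiscretizer, and the behaviour noted in the comment is the one described above, not something verified against a release):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[1.0], [np.nan], [4.0], [9.0]])
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
Xt = disc.fit_transform(X)   # with this PR: the NaN row becomes the extra category -1
```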
Any other comments?
The new feature in KBinsDiscretizer did not pass the test
getting the following assertion error:
Since it does not need to check for NaN anymore, I'm just wondering if this test is still needed. If it is not needed anymore, then the pull request is complete :-)