
Conversation

PabloRMira

Reference Issues/PRs

This fixes #9341

What does this implement/fix? Explain your changes.

This fix extends the functionality of KBinsDiscretizer to allow NaNs in the dataset. NaNs are assigned to an additional category, -1, and this category is propagated into the one-hot encoding if specified.
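
A minimal sketch of the intended behaviour (this reflects the PR's proposal, not released scikit-learn, and assumes fit ignores NaNs when computing bin edges):

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.0], [1.0], [np.nan], [6.0]])
est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
Xt = est.fit_transform(X)
# With this change, the NaN row is encoded as the extra category -1:
# Xt -> [[0.], [0.], [-1.], [2.]]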

Any other comments?

The new feature in KBinsDiscretizer did not pass the test

pytest sklearn/tests/test_common.py -k KBinsDiscretizer

getting the following assertion error:

AssertionError: ("Estimator doesn't check for NaN and inf in fit.", KBinsDiscretizer())

Since the estimator no longer needs to check for NaN, I'm wondering whether this test is still needed. If it is not, then the pull request is complete :-)

@jnothman
Member

Since the estimator no longer needs to check for NaN, I'm wondering whether this test is still needed. If it is not, then the pull request is complete :-)

You will need to add a tag to the estimator to change its behaviour around that test. See other preprocessing estimators that mark that they allow NaNs.

You will need to add tests for the added functionality.

And you will need to make the linter happy by ensuring your code is PEP8-compliant.
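
For reference, the tag mechanism mentioned above is the estimator-tags hook; a sketch of how KBinsDiscretizer could declare it (not the PR's exact diff):

from sklearn.base import BaseEstimator, TransformerMixin

class KBinsDiscretizer(TransformerMixin, BaseEstimator):
    # ...

    def _more_tags(self):
        # Tells the common estimator checks that NaN input is allowed,
        # so the "doesn't check for NaN and inf in fit" test is skipped.
        return {'allow_nan': True}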

@KumarGanesha1996

I think this change is good; only the tests are missing. Please add tests, @PabloRMira.

@PabloRMira
Author

@jnothman, @KumarGanesha1996: thank you very much for your tips! I've just added:

  • an additional tag for KBinsDiscretizer (allow_nan=True)
  • a test for the new feature

and my linter is now happy as well :-)
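
For illustration, a test for this behaviour might look like the following (hypothetical sketch; the actual test added in the PR is not shown in this thread):

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

def test_kbinsdiscretizer_nan_ordinal():
    X = np.array([[1.0], [2.0], [np.nan], [4.0]])
    kbd = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='uniform')
    Xt = kbd.fit_transform(X)
    assert Xt[2, 0] == -1          # NaN mapped to the extra category
    assert not np.isnan(Xt).any()  # no NaNs remain in the output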

@jnothman left a comment
Member


Thanks for attempting this. I still wonder whether:

  • ordinal encoding should pass the NaN downstream rather than replace it with a number;
  • one-hot encoding should leave a blank row rather than add another indicator (illustrated below).
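
To make the one-hot question concrete, the two treatments of a NaN input row differ as follows (3 learned bins assumed):

# One-hot encoding of a NaN input row, with 3 learned bins:
#   extra indicator column for NaN (this PR):  [0, 0, 0, 1]
#   blank all-zero row, no extra column:       [0, 0, 0]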

@PabloRMira
Author

Thanks for attempting this. I still wonder whether:

  • ordinal encoding should pass the NaN downstream rather than replace it with a number;

I did it this way because then:

  • you have the transformed data ready to feed into a model without additional transformations (e.g. imputation)
    • even in the case of imputation, it is natural to impute the NaNs with an additional category, in this case -1
    • this kind of ordinal encoding can be handled by tree-based models in a very natural way: when the encoding is treated as a continuous feature, a tree can split on -0.5 if needed, distinguishing between non-NaNs and NaNs (see the sketch after this comment)

  • one-hot encoding should leave a blank row rather than add another indicator.

I would like to keep the additional column for NaN in the one-hot encoding case, as it makes the modelling more explicit, especially if we want to implement a get_feature_names() for KBinsDiscretizer later. Then you could see the effect of feature_x = NA explicitly, e.g. as a weight in a linear regression or a linear classifier.
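
A small sketch of the tree-splitting point above (illustrative, not code from this PR):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Ordinal-encoded feature where -1 is the proposed NaN category.
X = np.array([[-1.], [0.], [1.], [2.], [-1.], [1.]])
y = np.array([1, 0, 0, 0, 1, 0])  # target correlated with missingness

tree = DecisionTreeClassifier(max_depth=1).fit(X, y)
# The learned threshold is -0.5: the tree cleanly separates the NaN
# category from the real bins.
print(tree.tree_.threshold[0])  # -0.5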

@jnothman
Member

jnothman commented May 13, 2020 via email

@davzaman

Hello! Checking in here, how do we push this forward?

@PabloRMira
Author

Hello! Checking in here, how do we push this forward?

I would still like to merge this, but in the meantime we have got some conflicts with master :-(

@jnothman :

  • Is the proposed design still okay for you?
  • Should we still wait for others' input?

If you give me the green light, I will try to resolve the conflicts so that we can merge the changes.

@jnothman left a comment
Member


I think this looks sensible. My only concern is that it makes NaNs disappear, which can be bad for feature matrices that have been computed in a way that may produce NaNs in cases of error (which is our fault for adopting NaN as a missing value sentinel).

The new behaviour requires documentation.

@PabloRMira
Author

I think this looks sensible. My only concern is that it makes NaNs disappear, which can be bad for feature matrices that have been computed in a way that may produce NaNs in cases of error (which is our fault for adopting NaN as a missing value sentinel).

The new behaviour requires documentation.

Thank you for the feedback, @jnothman!

I see your point. If I understand you correctly, your concern is that KBinsDiscretizer will not return NaNs in the case of some error in the bin assignment within transform(), cf. the snippet below (especially the np.clip() call).

bin_edges = self.bin_edges_
for jj in range(Xt.shape[1]):
    column = Xt[:, jj]
    # Values which are close to a bin edge are susceptible to numeric
    # instability. Add eps to X so these values are binned correctly
    # with respect to their decimal truncation. See documentation of
    # numpy.isclose for an explanation of ``rtol`` and ``atol``.
    rtol = 1.e-5
    atol = 1.e-8
    eps = atol + rtol * np.abs(column)
    column = np.digitize(column + eps, bin_edges[jj][1:])
    np.clip(column, 0, self.n_bins_[jj] - 1, out=column)
    # New in this PR: positions that were NaN in the input column
    # are assigned the extra category -1.
    column[np.isnan(Xt[:, jj])] = -1
    Xt[:, jj] = column

If this is the case, I think we should not be too concerned, because my change only replaces values at exactly the positions where the input column has NaNs (cf. the column[np.isnan(Xt[:, jj])] = -1 line above). Hence, computational errors in the bin assignment should still surface as some sort of error.
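
A standalone illustration of that masking step (values are made up):

import numpy as np

column_in = np.array([0.5, np.nan, 2.3])  # input column
# Hypothetical result of the digitize/clip step; np.digitize sorts NaN
# past the last edge, so it would be clipped into the top bin.
binned = np.array([0, 2, 2])
binned[np.isnan(column_in)] = -1          # only position 1 is rewritten
# binned -> array([ 0, -1,  2])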

Therefore,

  • Would it be sufficient for approval if I document the replacement of NaNs in the docstrings of the transformer?

If so, I will document the change in the docstrings and resolve the conflicts with master.

@jnothman
Member

I'm not talking about computational errors in the bin assignments, so much as NaNs introduced by division by zero, logarithm of a non-positive number, missing ID matches in table joins, etc. If feature engineering results in true NaNs, rather than their use as a sentinel for missingness, then we should endeavour to make sure that the user knows about those computation errors, rather than them being silently covered up.
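
For concreteness, a couple of the ways NumPy silently produces such "error NaNs" (a standalone illustration):

import numpy as np

with np.errstate(divide='ignore', invalid='ignore'):
    zero_div = np.array([0.0]) / np.array([0.0])  # 0/0 -> array([nan])
    neg_log = np.log(np.array([-2.0]))            # log of a negative -> array([nan])
# Downstream, these NaNs are indistinguishable from genuine missing values.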

So far our preprocessing tools maintain NaN values, so if any downstream processing forbids NaNs, the user is given an error to debug, and might find corresponding anomalies in their feature engineering. I think this is valuable behaviour. My minor concern here is that the NaNs get replaced by numbers, without the user explicitly requesting this behaviour.

One option would be to require the user to switch on missing value handling using a parameter. Another is to always make sure that a NaN in input produces at least one NaN in the corresponding output, but this comes at some inconvenience to the user who then will need to post-process the data for missing values. The third is to accept that we make the user blind to feature engineering errors, but we document the fact that this will happen.

@PabloRMira
Author

I'm not talking about computational errors in the bin assignments, so much as NaNs introduced by division by zero, logarithm of a non-positive number, missing ID matches in table joins, etc. If feature engineering results in true NaNs, rather than their use as a sentinel for missingness, then we should endeavour to make sure that the user knows about those computation errors, rather than them being silently covered up.

So far our preprocessing tools maintain NaN values, so if any downstream processing forbids NaNs, the user is given an error to debug, and might find corresponding anomalies in their feature engineering. I think this is valuable behaviour. My minor concern here is that the NaNs get replaced by numbers, without the user explicitly requesting this behaviour.

One option would be to require the user to switch on missing value handling using a parameter. Another is to always make sure that a NaN in input produces at least one NaN in the corresponding output, but this comes at some inconvenience to the user who then will need to post-process the data for missing values. The third is to accept that we make the user blind to feature engineering errors, but we document the fact that this will happen.

Thank you for the examples and explanation! Now I can better follow what you are worrying about.

The design of my change, assigning NaN to an additional category -1 in KBinsDiscretizer, was motivated by the fact that I only considered NaNs to be genuine missing values in numeric input features, not the product of (silent) errors in preceding preprocessors / transformations.

But the points you made are certainly something to care about. The problem is:

If some transformation, e.g. division by 0 or the logarithm of a negative number in some other preprocessor, silently generates NaNs without raising an error or even a warning, then genuine missing values and these "error NaNs" get mixed. Once that preceding preprocessor passes the data on, KBinsDiscretizer cannot distinguish between genuine missing values and silent error NaNs, because it only sees NaNs as input.

My point is therefore: even if we pass the input NaNs (genuine missing values and / or silent error NaNs) through as NaNs in the output, for the user it would be almost the same as if we passed the -1 category. This is because I expect the user to impute the NaNs in the next step (most naturally via simple imputation, say with -1 or some other number) to make the feature usable for a model. Genuine missing values and potential "error NaNs" from preceding preprocessors would then be mixed again, making the NaN passthrough useless.

I think it should be the job of the other preprocessors to at least raise a warning (better, an error) when they generate NaNs as a consequence of a disallowed calculation (e.g. division by 0 or the logarithm of a negative number). But that is a broader topic beyond the scope of the proposed change.

Having said that, in order to push this further I can offer to either:

  • leave everything as is, resolve the conflicts, and document in the docstrings that NaNs will be assigned to the category -1, so that the user is aware they should not pass "error NaNs" from preceding transformations into the transformer; or
  • add an additional parameter to KBinsDiscretizer, say nan_repl=np.nan, defaulting to NaN, to leave it to the user how to treat NaNs (see the sketch below). As discussed above, this will not protect the user from "error NaNs" from preceding preprocessors, because missing-value NaNs and error NaNs get mixed whenever a preceding disallowed calculation occurs.
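
A rough sketch of what that second option could look like (nan_repl is only the name proposed in this comment, not an existing scikit-learn parameter):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class KBinsDiscretizer(TransformerMixin, BaseEstimator):
    def __init__(self, n_bins=5, encode='onehot', strategy='quantile',
                 nan_repl=np.nan):
        self.n_bins = n_bins
        self.encode = encode
        self.strategy = strategy
        self.nan_repl = nan_repl  # value NaN inputs are mapped to

    # ...in transform(), the masking line would become:
    #     column[np.isnan(Xt[:, jj])] = self.nan_repl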

What do you think, @jnothman?

@adrinjalali
Member

@StefanieSenger similar to your other work if you're interested.
