
Conversation

PabloRMira

Reference Issues/PRs

This fixes #9341

What does this implement/fix? Explain your changes.

This fix extends the functionality of KBinsDiscretizer to allow NaNs in the dataset. NaNs are assigned to an additional category, -1, and this category is propagated into the one-hot encoding if specified.
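
A minimal sketch of the intended behaviour (this reflects the PR's proposal, not released scikit-learn, and assumes fit ignores NaNs when computing bin edges):

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.0], [1.0], [np.nan], [6.0]])
est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
Xt = est.fit_transform(X)
# With this change, the NaN row is encoded as the extra category -1:
# Xt -> [[0.], [0.], [-1.], [2.]]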

Any other comments?

The new feature in KBinsDiscretizer did not pass the test

pytest sklearn/tests/test_common.py -k KBinsDiscretizer

getting the following assertion error:

AssertionError: ("Estimator doesn't check for NaN and inf in fit.", KBinsDiscretizer())

Since the estimator no longer needs to check for NaN, I'm wondering whether this test is still needed. If it is not, then the pull request is complete :-)

@jnothman
Member

Since the estimator no longer needs to check for NaN, I'm wondering whether this test is still needed. If it is not, then the pull request is complete :-)

You will need to add a tag to the estimator to change its behaviour around that test. See other preprocessing estimators that mark that they allow NaNs.

You will need to add tests for the added functionality.

And you will need to make the linter happy by ensuring your code is PEP8-compliant.
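
For reference, the tag mechanism mentioned above is the estimator-tags hook; a sketch of how KBinsDiscretizer could declare it (not the PR's exact diff):

from sklearn.base import BaseEstimator, TransformerMixin

class KBinsDiscretizer(TransformerMixin, BaseEstimator):
    # ...

    def _more_tags(self):
        # Tells the common estimator checks that NaN input is allowed,
        # so the "doesn't check for NaN and inf in fit" test is skipped.
        return {'allow_nan': True}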

@KumarGanesha1996

I think this change is good; only the tests are missing. Please add tests, @PabloRMira.

@PabloRMira
Author

@jnothman, @KumarGanesha1996: thank you very much for your tips! I've just added:

  • an additional tag for KBinsDiscretizer (allow_nan=True)
  • a test for the new feature

and my linter is now happy as well :-)
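
For illustration, a test for this behaviour might look like the following (hypothetical sketch; the actual test added in the PR is not shown in this thread):

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

def test_kbinsdiscretizer_nan_ordinal():
    X = np.array([[1.0], [2.0], [np.nan], [4.0]])
    kbd = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='uniform')
    Xt = kbd.fit_transform(X)
    assert Xt[2, 0] == -1          # NaN mapped to the extra category
    assert not np.isnan(Xt).any()  # no NaNs remain in the output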

@jnothman left a comment
Member


Thanks for attempting this. I still wonder whether:

  • ordinal encoding should pass the NaN downstream rather than replace it with a number;
  • one-hot encoding should leave a blank row rather than add another indicator (illustrated below).
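
To make the one-hot question concrete, the two treatments of a NaN input row differ as follows (3 learned bins assumed):

# One-hot encoding of a NaN input row, with 3 learned bins:
#   extra indicator column for NaN (this PR):  [0, 0, 0, 1]
#   blank all-zero row, no extra column:       [0, 0, 0]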

@PabloRMira
Author

Thanks for attempting this. I still wonder whether:

  • ordinal encoding should pass the NaN downstream rather than replace it with a number;

I did it this way because then:

  • you have the transformed data ready to feed into a model without additional transformations (e.g. imputation)
    • even in the case of imputation, it is natural to impute the NaNs with an additional category, in this case -1
    • this kind of ordinal encoding can be handled by tree-based models in a very natural way: when the encoding is treated as a continuous feature, a tree can split on -0.5 if needed, distinguishing between non-NaNs and NaNs (see the sketch after this comment)

  • one-hot encoding should leave a blank row rather than add another indicator.

I would like to keep the additional column for NaN in the one-hot encoding case, as it makes the modelling more explicit, especially if we want to implement a get_feature_names() for KBinsDiscretizer later. Then you could see the effect of feature_x = NA explicitly, e.g. as a weight in a linear regression or a linear classifier.
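
A small sketch of the tree-splitting point above (illustrative, not code from this PR):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Ordinal-encoded feature where -1 is the proposed NaN category.
X = np.array([[-1.], [0.], [1.], [2.], [-1.], [1.]])
y = np.array([1, 0, 0, 0, 1, 0])  # target correlated with missingness

tree = DecisionTreeClassifier(max_depth=1).fit(X, y)
# The learned threshold is -0.5: the tree cleanly separates the NaN
# category from the real bins.
print(tree.tree_.threshold[0])  # -0.5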

@jnothman
Member

jnothman commented May 13, 2020 via email

@davzaman

Hello! Checking in here, how do we push this forward?

@PabloRMira
Author

Hello! Checking in here, how do we push this forward?

I would still like to merge this, but in the meantime we have got some conflicts with master :-(

@jnothman :

  • Is the proposed design still okay for you?
  • Should we still wait for others' input?

If you give me the green light, I will try to resolve the conflicts so that we can merge the changes.

@jnothman left a comment
Member


I think this looks sensible. My only concern is that it makes NaNs disappear, which can be bad for feature matrices that have been computed in a way that may produce NaNs in cases of error (which is our fault for adopting NaN as a missing value sentinel).

The new behaviour requires documentation.

@PabloRMira
Author

I think this looks sensible. My only concern is that it makes NaNs disappear, which can be bad for feature matrices that have been computed in a way that may produce NaNs in cases of error (which is our fault for adopting NaN as a missing value sentinel).

The new behaviour requires documentation.

Thank you for the feedback, @jnothman!

I see your point. If I understand you correctly, your concern is that KBinsDiscretizer will not return NaNs in the case of some error in the bin assignment within transform(), cf. the snippet below (especially the np.clip() call).

bin_edges = self.bin_edges_
for jj in range(Xt.shape[1]):
    column = Xt[:, jj]
    # Values which are close to a bin edge are susceptible to numeric
    # instability. Add eps to X so these values are binned correctly
    # with respect to their decimal truncation. See documentation of
    # numpy.isclose for an explanation of ``rtol`` and ``atol``.
    rtol = 1.e-5
    atol = 1.e-8
    eps = atol + rtol * np.abs(column)
    column = np.digitize(column + eps, bin_edges[jj][1:])
    np.clip(column, 0, self.n_bins_[jj] - 1, out=column)
    # New in this PR: positions that were NaN in the input column
    # are assigned the extra category -1.
    column[np.isnan(Xt[:, jj])] = -1
    Xt[:, jj] = column

If this is the case, I think we should not be too concerned, because my change only replaces values at exactly the positions where the input column has NaNs (cf. the column[np.isnan(Xt[:, jj])] = -1 line above). Hence, computational errors in the bin assignment should still surface as some sort of error.
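
A standalone illustration of that masking step (values are made up):

import numpy as np

column_in = np.array([0.5, np.nan, 2.3])  # input column
# Hypothetical result of the digitize/clip step; np.digitize sorts NaN
# past the last edge, so it would be clipped into the top bin.
binned = np.array([0, 2, 2])
binned[np.isnan(column_in)] = -1          # only position 1 is rewritten
# binned -> array([ 0, -1,  2])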

Therefore,

  • Would it be sufficient for approval if I document the replacement of NaNs in the docstrings of the transformer?

If so, I will document the change in the docstrings and resolve the conflicts with master.

@jnothman
Member

I'm not talking about computational errors in the bin assignments, so much as NaNs introduced by division by zero, logarithm of a non-positive number, missing ID matches in table joins, etc. If feature engineering results in true NaNs, rather than their use as a sentinel for missingness, then we should endeavour to make sure that the user knows about those computation errors, rather than them being silently covered up.
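
For concreteness, a couple of the ways NumPy silently produces such "error NaNs" (a standalone illustration):

import numpy as np

with np.errstate(divide='ignore', invalid='ignore'):
    zero_div = np.array([0.0]) / np.array([0.0])  # 0/0 -> array([nan])
    neg_log = np.log(np.array([-2.0]))            # log of a negative -> array([nan])
# Downstream, these NaNs are indistinguishable from genuine missing values.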

So far our preprocessing tools maintain NaN values, so if any downstream processing forbids NaNs, the user is given an error to debug, and might find corresponding anomalies in their feature engineering. I think this is valuable behaviour. My minor concern here is that the NaNs get replaced by numbers, without the user explicitly requesting this behaviour.

One option would be to require the user to switch on missing value handling using a parameter. Another is to always make sure that a NaN in input produces at least one NaN in the corresponding output, but this comes at some inconvenience to the user who then will need to post-process the data for missing values. The third is to accept that we make the user blind to feature engineering errors, but we document the fact that this will happen.

@PabloRMira
Author

I'm not talking about computational errors in the bin assignments, so much as NaNs introduced by division by zero, logarithm of a non-positive number, missing ID matches in table joins, etc. If feature engineering results in true NaNs, rather than their use as a sentinel for missingness, then we should endeavour to make sure that the user knows about those computation errors, rather than them being silently covered up.

So far our preprocessing tools maintain NaN values, so if any downstream processing forbids NaNs, the user is given an error to debug, and might find corresponding anomalies in their feature engineering. I think this is valuable behaviour. My minor concern here is that the NaNs get replaced by numbers, without the user explicitly requesting this behaviour.

One option would be to require the user to switch on missing value handling using a parameter. Another is to always make sure that a NaN in input produces at least one NaN in the corresponding output, but this comes at some inconvenience to the user who then will need to post-process the data for missing values. The third is to accept that we make the user blind to feature engineering errors, but we document the fact that this will happen.

Thank you for the examples and explanation! Now I can better follow what you are worrying about.

The design of my change, assigning NaN to an additional category -1 in KBinsDiscretizer, was motivated by the fact that I only considered NaNs to be genuine missing values in numeric input features, not the product of (silent) errors in preceding preprocessors / transformations.

But the points you made are certainly something to care about. The problem is:

If some transformation, e.g. division by 0 or the logarithm of a negative number in some other preprocessor, silently generates NaNs without raising an error or even a warning, then genuine missing values and these "error NaNs" get mixed. Once that preceding preprocessor passes the data on, KBinsDiscretizer cannot distinguish between genuine missing values and silent error NaNs, because it only sees NaNs as input.

My point is therefore: even if we pass the input NaNs (genuine missing values and / or silent error NaNs) through as NaNs in the output, for the user it would be almost the same as if we passed the -1 category. This is because I expect the user to impute the NaNs in the next step (most naturally via simple imputation, say with -1 or some other number) to make the feature usable for a model. Genuine missing values and potential "error NaNs" from preceding preprocessors would then be mixed again, making the NaN passthrough useless.

I think it should be the job of the other preprocessors to at least raise a warning (better, an error) when they generate NaNs as a consequence of a disallowed calculation (e.g. division by 0 or the logarithm of a negative number). But that is a broader topic beyond the scope of the proposed change.

Having said that, in order to push this further I can offer to either:

  • leave everything as is, resolve the conflicts, and document in the docstrings that NaNs will be assigned to the category -1, so that the user is aware they should not pass "error NaNs" from preceding transformations into the transformer; or
  • add an additional parameter to KBinsDiscretizer, say nan_repl=np.nan, defaulting to NaN, to leave it to the user how to treat NaNs (see the sketch below). As discussed above, this will not protect the user from "error NaNs" from preceding preprocessors, because missing-value NaNs and error NaNs get mixed whenever a preceding disallowed calculation occurs.
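
A rough sketch of what that second option could look like (nan_repl is only the name proposed in this comment, not an existing scikit-learn parameter):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class KBinsDiscretizer(TransformerMixin, BaseEstimator):
    def __init__(self, n_bins=5, encode='onehot', strategy='quantile',
                 nan_repl=np.nan):
        self.n_bins = n_bins
        self.encode = encode
        self.strategy = strategy
        self.nan_repl = nan_repl  # value NaN inputs are mapped to

    # ...in transform(), the masking line would become:
    #     column[np.isnan(Xt[:, jj])] = self.nan_repl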

What do you think, @jnothman?

@adrinjalali
Member

@StefanieSenger similar to your other work if you're interested.
