
FIX Remove bins whose width <= 0 with a warning in KBinsDiscretizer #13165

Merged: 14 commits merged into scikit-learn:master from qinhanmin2014:redundant-bin Feb 20, 2019

Conversation

@qinhanmin2014 (Member) commented Feb 14, 2019

Closes #12774
Closes #13194
Closes #13195
(1) Remove bins whose width <= 0 with a warning.
(2) Tell users that KBinsDiscretizer might produce constant features (e.g., when encode = 'onehot' and certain bins do not contain any data) and that these features can be removed with feature selection algorithms (e.g., sklearn.feature_selection.VarianceThreshold).

As in #12893, I don't think it's the duty of a preprocessor to remove redundant features.

I think this is a bug (it seems Joel agrees in the issue), so we don't need to worry about backward compatibility.
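
With this fix, zero-width bins are dropped at fit time and a warning tells the user what happened. A minimal sketch of the post-fix behavior (the data is made up, and the warning text is paraphrased rather than quoted from the diff):

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# A heavily skewed feature: the 0th-80th percentiles all equal 0, so the
# quantile edges collapse and the zero-width bins are removed with a
# UserWarning.
X = np.array([[0.0]] * 90 + [[1.0]] * 10)
kb = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
kb.fit(X)          # warns that bins whose width <= 0 were removed
print(kb.n_bins_)  # [1]: a single bin survives for this feature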

qinhanmin2014 added some commits Feb 13, 2019

if self.strategy in ('quantile', 'kmeans'):
    bin_edges[jj] = np.unique(bin_edges[jj])
    if len(bin_edges[jj]) - 1 != n_bins[jj]:
        warnings.warn('Redundant bins (i.e., bins whose width = 0)'

@adrinjalali (Member) commented Feb 15, 2019

This means the user would get a UserWarning without being able or knowing how to fix it. We usually try to tell the user how to handle and remove the warnings, don't we?

@qinhanmin2014 (Author, Member) commented Feb 15, 2019

We've solved the problem for the user: we already raise a similar warning when a feature is constant.

@adrinjalali (Member) commented Feb 15, 2019

Then we can add something like "Setting the number of bins for feature %d to %d."

@qinhanmin2014 (Author, Member) commented Feb 15, 2019

> Then we can add something like "Setting the number of bins for feature %d to %d."

I don't think this will solve the problem when the data is skewed.

@jnothman (Member) commented Feb 19, 2019

We should make sure to remove bins with width < 0. See #13194

@qinhanmin2014 (Author, Member) commented Feb 20, 2019

@jnothman The PR now solves the new issue (tagged 0.20.3 by you) as well.

@jnothman (Member) commented Feb 20, 2019

Yes, I'd be okay to include this in 0.20.3 if others agree that that is the right place to fix up a design flaw in KBinsDiscretizer.

Resolved review thread: sklearn/preprocessing/_discretization.py (outdated)
bin_edges[jj] = np.array(
    [bin_edges[jj][0]] +
    [bin_edges[jj][i] for i in range(1, len(bin_edges[jj]))
     if bin_edges[jj][i] - bin_edges[jj][i - 1] > 1e-8])

@jnothman (Member) commented Feb 20, 2019

We would still be solving both issues if we had > 0 here. Are we sure we want a hard-coded tolerance of 1e-8?

@jnothman (Member) commented Feb 20, 2019

Is mask = np.ediff1d(bin_edges[jj], to_begin=np.inf) > 1e-8; bin_edges[jj] = bin_edges[jj][mask] clearer?

@qinhanmin2014 (Author, Member) commented Feb 20, 2019

I think we need 1e-8 here, see e.g.,

import numpy as np

# 279 copies of the same value: every percentile should be identical, but
# np.percentile returns values that differ in the last ulp.
a = np.ones(279) * 0.56758051638767337
ans = []
for p in range(0, 100, 5):
    ans.append(np.percentile(a, p))
# A strict > 0 comparison keeps the spurious second edge:
print([ans[0]] + [ans[i] for i in range(1, len(ans)) if ans[i] - ans[i - 1] > 0])
# [0.5675805163876734, 0.5675805163876735]
# A 1e-8 tolerance collapses it:
print([ans[0]] + [ans[i] for i in range(1, len(ans)) if ans[i] - ans[i - 1] > 1e-8])
# [0.5675805163876734]

Though we should blame numpy for this

@qinhanmin2014 (Author, Member) commented Feb 20, 2019

> Is mask = np.ediff1d(bin_edges[jj], to_begin=np.inf) > 1e-8; bin_edges[jj] = bin_edges[jj][mask] clearer?

At least not to me :) But I assume it can take advantage of numpy's optimizations.
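
For reference, a minimal standalone sketch of the np.ediff1d masking idiom (the edge values are invented for illustration):

import numpy as np

# Bin edges with a duplicated value, i.e. one zero-width bin in the middle.
edges = np.array([0.0, 1.0, 1.0, 2.0])

# np.ediff1d computes consecutive differences; to_begin=np.inf makes the
# first edge pass the test unconditionally (inf > 1e-8 is always True).
mask = np.ediff1d(edges, to_begin=np.inf) > 1e-8
edges = edges[mask]
print(edges)  # [0. 1. 2.]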

qinhanmin2014 added some commits Feb 20, 2019

@qinhanmin2014 (Author, Member) commented Feb 20, 2019

ping @jnothman ready for another review :)

@@ -102,6 +103,11 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
:class:`sklearn.compose.ColumnTransformer` if you only want to preprocess
part of the features.
``KBinsDiscretizer`` might produce constant features (e.g., when
``encode = 'onehot'`` and certain bins do not contain any data).

@jnothman (Member) commented Feb 20, 2019

Maybe mention strategy=uniform here

@qinhanmin2014 (Author, Member) commented Feb 20, 2019

> Maybe mention strategy=uniform here

Why? Other strategies also suffer from this problem.

@jnothman (Member) commented Feb 20, 2019

Do they? Since the others follow the empirical distribution of the feature and remove empty bins, how can they have no data at training time?

@qinhanmin2014 (Author, Member) commented Feb 20, 2019

> Do they? Since the others follow the empirical distribution of the feature and remove empty bins, how can they have no data at training time?

We only remove bins whose width <= 0, e.g.,

from sklearn.preprocessing import KBinsDiscretizer

# Two samples, five bins, default strategy='quantile': the percentile edges
# are 1.0, 1.2, ..., 2.0, so every bin has positive width and none is
# removed, yet the three middle bins contain no data.
X = [[1], [2]]
kb = KBinsDiscretizer(n_bins=5, encode='ordinal')
kb.fit_transform(X)
# array([[0.], [4.]])

Maybe I should use "bins whose width <= 0" instead of "redundant bins (i.e., bins whose width <= 0)", since it's difficult to define redundant bins?

@jnothman (Member) commented Feb 20, 2019

Ahh... Okay. A bit of an edge case, since there are fewer samples than bins, which maybe we should prohibit anyway.
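
To illustrate the constant-feature point from the PR description, a minimal sketch (the data is made up; VarianceThreshold is the feature selector the description mentions):

import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import KBinsDiscretizer

# With strategy='uniform' the bins are equally spaced regardless of the
# data, so the three middle bins stay empty and their one-hot columns
# come out constant zero.
X = np.array([[0.0], [0.1], [9.9], [10.0]])
kb = KBinsDiscretizer(n_bins=5, encode='onehot-dense', strategy='uniform')
Xt = kb.fit_transform(X)
print(Xt.shape)  # (4, 5)
# VarianceThreshold drops the zero-variance (constant) columns:
print(VarianceThreshold().fit_transform(Xt).shape)  # (4, 2)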

Two resolved review threads: sklearn/preprocessing/_discretization.py (outdated)
@@ -177,6 +183,16 @@ def fit(self, X, y=None):
        bin_edges[jj] = (centers[1:] + centers[:-1]) * 0.5
        bin_edges[jj] = np.r_[col_min, bin_edges[jj], col_max]

    # Remove redundant bins (i.e., bins whose width <= 0)
    if self.strategy in ('quantile', 'kmeans'):
        mask = np.ediff1d(bin_edges[jj], to_begin=np.inf) > 1e-8

@adrinjalali (Member) commented Feb 20, 2019

Don't we need to test this on a 32-bit system as well? It'd be nice to have a test that exercises this precision level, to see whether it works on all CI machines.

@jnothman (Member) commented Feb 20, 2019

I actually think we should make this 0 and not specify a tolerance for a negligible bin. With unary or ordinal encoding, small bins won't harm the system... For one-hot we could consider a user-specified tolerance.

@jnothman (Member) commented Feb 20, 2019

Sorry, fixed typo

@adrinjalali (Member) commented Feb 20, 2019

I see what you mean, but in general, comparing floats to 0 has proven unreliable on multiple occasions, at least in the past few months. The last one I remember is the test you commented on, where the two floats logically should have been identical, but they weren't.

math.isclose() has a default rel_tol=1e-9, which seems like a good value for 64-bit systems to me; I'm not sure it'd be as good on 32-bit ones.

@qinhanmin2014 (Author, Member) commented Feb 20, 2019

> Don't we need to test this on a 32-bit system as well?

We have 32-bit CI?

> I actually think we should make this 0 and not specify a tolerance for a negligible bin.

Is it good to do so when comparing floats? If we use 0, this PR won't solve the new issue completely (see #13165 (comment)), or do you want to ignore these extreme cases?

I chose 1e-8 because I've seen it several times in the repo and it's also the default atol of np.isclose; I'm not sure if there are better ways.

@adrinjalali (Member) commented Feb 20, 2019

> We have 32-bit CI?

We have at least one 32-bit Windows build.

> I chose 1e-8 because I've seen it several times in the repo and it's also the default atol of np.isclose; I'm not sure if there are better ways.

AFAIK, np.isclose is the way it is for backward compatibility. The new standard is the one adopted by Python >= 3.5 in PEP 485 (i.e. math.isclose), and it seems logical for new code to follow that one wherever possible/convenient.
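
A minimal sketch of the difference under discussion (the values are arbitrary, chosen only to show the default tolerances):

import math
import numpy as np

# np.isclose keeps a nonzero absolute tolerance (atol=1e-8) for backward
# compatibility, so tiny values compare as equal to zero:
print(np.isclose(0.0, 1e-12))  # True
# math.isclose (PEP 485) defaults to rel_tol=1e-9 and abs_tol=0.0, so any
# comparison against zero fails unless abs_tol is set explicitly:
print(math.isclose(0.0, 1e-12))  # False
print(math.isclose(0.0, 1e-12, abs_tol=1e-8))  # True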

@qinhanmin2014 changed the title from "FIX Remove redundant bins in KBinsDiscretizer" to "FIX Remove bins whose width <= 0 in KBinsDiscretizer" Feb 20, 2019

@qinhanmin2014 changed the title from "FIX Remove bins whose width <= 0 in KBinsDiscretizer" to "FIX Remove bins whose width <= 0 with a warning in KBinsDiscretizer" Feb 20, 2019

@qinhanmin2014 (Author, Member) commented Feb 20, 2019

I've removed the term "redundant bins" because it's hard to define: users might regard bins which do not contain any data as redundant. Will revert if someone disagrees.

@qinhanmin2014 (Author, Member) commented Feb 20, 2019

> Okay. Let's leave it at something like 1e-8, document it

@jnothman How to document it?

@adrinjalali (Member) commented Feb 20, 2019

> I would've thought 32-bit data was more relevant than a 32-bit processor... But perhaps both

I'm not sure, that's why I think having a test checking the boundary conditions here is a good idea.

@jnothman (Member) left a comment

This is fine by me. Re @adrinjalali's suggestion of testing platforms and datatypes, is that particularly with reference to the numpy bug in percentile?

@jnothman referenced this pull request Feb 20, 2019: Release 0.20.3 #13186 (merged)

@adrinjalali (Member) commented Feb 20, 2019

> This is fine by me. Re @adrinjalali's suggestion of testing platforms and datatypes, is that particularly with reference to the numpy bug in percentile?

No, not really. Regardless, I think this is an improvement on the status quo anyway, and I don't think we test these boundary cases in other similar situations.

@adrinjalali merged commit b40868d into scikit-learn:master Feb 20, 2019

11 checks passed

LGTM analysis: C/C++ (no code changes detected)
LGTM analysis: JavaScript (no code changes detected)
LGTM analysis: Python (no new or fixed alerts)
ci/circleci: deploy (passed)
ci/circleci: doc (passed)
ci/circleci: doc-min-dependencies (passed)
ci/circleci: lint (passed)
codecov/patch (100% of diff hit; target 92.43%)
codecov/project (92.47%, +0.04% compared to 1c8668b)
continuous-integration/appveyor/pr (build succeeded)
continuous-integration/travis-ci/pr (build passed)

@qinhanmin2014 (Author, Member) commented Feb 20, 2019

> No, not really. Regardless, I think this is an improvement on the status quo anyway, and I don't think we test these boundary cases in other similar situations.

Please open an issue if you think it's worth discussing. Honestly I'm unable to fully understand your point :)

@qinhanmin2014 deleted the qinhanmin2014:redundant-bin branch Feb 20, 2019

jnothman added a commit to jnothman/scikit-learn that referenced this pull request Feb 20, 2019

FIX Remove bins whose width <= 1e-8 with a warning in KBinsDiscretizer (scikit-learn#13165)

* remove redundant bins
* tests
* what's new
* issue number
* numeric issue
* move what's new
* Joel's comment
* forget something
* flake8
* more doc update
* Joel's comment
* redundant bins
* new message
* comment

Kiku-git added a commit to Kiku-git/scikit-learn that referenced this pull request Mar 4, 2019

FIX Remove bins whose width <= 1e-8 with a warning in KBinsDiscretizer (scikit-learn#13165)
