
FIX Remove bins whose width <= 0 with a warning in KBinsDiscretizer #13165

Merged: 14 commits merged into master from the redundant-bin branch on Feb 20, 2019

Conversation

@qinhanmin2014 (Member) commented Feb 14, 2019

Closes #12774
Closes #13194
Closes #13195
(1) Remove bins whose width <= 0 with a warning.
(2) Tell users that KBinsDiscretizer might produce constant features (e.g., when `encode='onehot'` and certain bins do not contain any data) and that these features can be removed with feature selection algorithms (e.g., `sklearn.feature_selection.VarianceThreshold`).

Similar to #12893, I don't think it's the duty of a preprocessor to remove redundant features.

I think this is a bug (seems that Joel agrees with me in the issue), so we don't need to worry about backward compatibility.
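As a hedged illustration of the documented advice (the data and parameter choices here are mine, not from the PR): with `encode='onehot-dense'` and `strategy='uniform'`, skewed data can leave some bins empty, producing all-zero columns that `VarianceThreshold` (which lives in `sklearn.feature_selection`) removes:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import VarianceThreshold

# Skewed toy data: with a uniform grid of 5 bins over [0, 10],
# bins 1-3 receive no samples, so their one-hot columns are all zeros.
X = np.array([[0.0], [0.1], [0.2], [10.0]])
kb = KBinsDiscretizer(n_bins=5, encode='onehot-dense', strategy='uniform')
Xt = kb.fit_transform(X)          # shape (4, 5), three constant columns

# VarianceThreshold (default threshold 0.0) drops zero-variance features.
Xt_reduced = VarianceThreshold().fit_transform(Xt)  # shape (4, 2)
```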

```python
if self.strategy in ('quantile', 'kmeans'):
    bin_edges[jj] = np.unique(bin_edges[jj])
    if len(bin_edges[jj]) - 1 != n_bins[jj]:
        warnings.warn('Redundant bins (i.e., bins whose width = 0)'
```
Reviewer (Member):

This means the user would get a UserWarning without being able or knowing how to fix it. We usually try to tell the user how to handle and remove the warnings, don't we?

@qinhanmin2014 (Member Author):

We've solved the problem for the user. We already raise a similar warning when a feature is constant.

Reviewer (Member):

Then we can add something like "Setting the number of bins for feature %d to %d."

@qinhanmin2014 (Member Author):

> Then we can add something like "Setting the number of bins for feature %d to %d."

I don't think this will solve the problem when data is skewed.

@jnothman (Member):

We should make sure to remove bins with width < 0. See #13194

@qinhanmin2014 (Member Author):

@jnothman The PR now solves the new issue (tagged 0.20.3 by you) as well.

@jnothman (Member):

Yes, I'd be okay to include this in 0.20.3 if others agree that that is the right place to fix up a design flaw in KBinsDiscretizer.

sklearn/preprocessing/_discretization.py (outdated, resolved)
```python
bin_edges[jj] = np.array(
    [bin_edges[jj][0]] +
    [bin_edges[jj][i] for i in range(1, len(bin_edges[jj]))
     if bin_edges[jj][i] - bin_edges[jj][i - 1] > 1e-8])
```
Reviewer (Member):

We would still be solving both issues with > 0 here. Are we sure we want a hard-coded tolerance of 1e-8?

Reviewer (Member):

Is `mask = np.ediff1d(bin_edges[jj], to_begin=np.inf) > 1e-8; bin_edges[jj] = bin_edges[jj][mask]` clearer?
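The suggested one-liner can be checked in isolation; a quick sketch with made-up edges (the array here is illustrative, not from the PR):

```python
import numpy as np

# Made-up edges containing one zero-width bin (2.0 to 2.0).
bin_edges = np.array([0.0, 1.0, 2.0, 2.0, 3.0])
# to_begin=np.inf makes the first diff element inf, so the leftmost
# edge always passes the > 1e-8 test and is kept.
mask = np.ediff1d(bin_edges, to_begin=np.inf) > 1e-8
bin_edges = bin_edges[mask]       # the duplicated edge is dropped
```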

@qinhanmin2014 (Member Author):

I think we need 1e-8 here, see e.g.:

```python
import numpy as np
a = np.ones(279) * 0.56758051638767337
ans = []
for p in range(0, 100, 5):
    ans.append(np.percentile(a, p))
print([ans[0]] + [ans[i] for i in range(1, len(ans)) if ans[i] - ans[i - 1] > 0])
# [0.5675805163876734, 0.5675805163876735]
print([ans[0]] + [ans[i] for i in range(1, len(ans)) if ans[i] - ans[i - 1] > 1e-8])
# [0.5675805163876734]
```

Though we should blame numpy for this.

@qinhanmin2014 (Member Author):

> Is `mask = np.ediff1d(bin_edges[jj], to_begin=np.inf) > 1e-8; bin_edges[jj] = bin_edges[jj][mask]` clearer?

At least not from my side :) But I assume this can make use of some optimization in numpy.

@qinhanmin2014 (Member Author):

ping @jnothman ready for another review :)

@@ -102,6 +103,11 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):

```rst
:class:`sklearn.compose.ColumnTransformer` if you only want to preprocess
part of the features.

``KBinsDiscretizer`` might produce constant features (e.g., when
``encode = 'onehot'`` and certain bins do not contain any data).
```
Reviewer (Member):

Maybe mention `strategy=uniform` here.

@qinhanmin2014 (Member Author):

> Maybe mention `strategy=uniform` here.

Why? Other strategies also suffer from this problem.

Reviewer (Member):

Do they? Since the others follow the empirical distribution of the feature and remove empty bins, how can they have no data at training time?

@qinhanmin2014 (Member Author):

> Do they? Since the others follow the empirical distribution of the feature, and remove empty bins how can they have no data at training time?

We only remove bins whose width <= 0, e.g.:

```python
from sklearn.preprocessing import KBinsDiscretizer
X = [[1], [2]]
kb = KBinsDiscretizer(n_bins=5, encode='ordinal')
kb.fit_transform(X)
# array([[0.], [4.]])
```

Maybe I should use "bins whose width <= 0" instead of "Redundant bins (i.e., bins whose width <= 0)", since it's difficult to define redundant bins?

Reviewer (Member):

Ahh... okay. A bit of an edge case, since there are fewer samples than bins, which maybe we should prohibit anyway.
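A sketch of the related collapse case under the post-PR behavior (the data and numbers are my own illustration, not from the thread): with the quantile strategy, duplicated percentile edges are removed and `n_bins_` shrinks, with a warning:

```python
import warnings
from sklearn.preprocessing import KBinsDiscretizer

# Skewed data: the quantile edges for n_bins=4 contain duplicates
# (roughly [0, 0, 0, 0.25, 1]), so the zero-width bins are removed
# and only 2 effective bins remain.
X = [[0], [0], [0], [1]]
kb = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='quantile')
with warnings.catch_warnings(record=True):
    warnings.simplefilter('always')   # the removal is reported as a UserWarning
    kb.fit(X)
print(kb.n_bins_)  # effective number of bins per feature
```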

sklearn/preprocessing/_discretization.py (outdated, resolved)
sklearn/preprocessing/_discretization.py (outdated, resolved)
@@ -177,6 +183,16 @@ def fit(self, X, y=None):

```python
bin_edges[jj] = (centers[1:] + centers[:-1]) * 0.5
bin_edges[jj] = np.r_[col_min, bin_edges[jj], col_max]

# Remove redundant bins (i.e., bins whose width <= 0)
if self.strategy in ('quantile', 'kmeans'):
    mask = np.ediff1d(bin_edges[jj], to_begin=np.inf) > 1e-8
```
Reviewer (Member):

Don't we need to test this on a 32-bit system as well? It'd be nice to have a test which exercises this precision level, to see if it works on all CI machines.

@jnothman (Member) commented Feb 20, 2019:

I actually think we should make this 0 and not specify a tolerance for a negligible bin. With unary or ordinal encoding small bins won't harm the system... For one hot we could consider a user-specified tolerance

Reviewer (Member):

Sorry, fixed typo

Reviewer (Member):

I see what you mean, but in general, comparing floats to 0 has proven unreliable on multiple occasions, at least in the past few months. The last one I remember is the test you commented on, where the two floats logically should have been identical, but they weren't.

`math.isclose()` has the default `rel_tol=1e-9`, which seems like a good value for 64-bit systems to me; not sure if it'd be as good on 32-bit ones.

@qinhanmin2014 (Member Author):

> Don't we need to test this on a 32-bit system as well?

We have 32-bit CI?

> I actually think we should make this 0 and not specify a tolerance for a negligible bin.

Is it good to do so when comparing floats? If we use 0, this PR won't solve the new issue completely (see #13165 (comment)), or do you want to ignore these extreme cases?

I chose 1e-8 because I saw it several times in the repo and it's also the default atol of `np.isclose`; not sure if there are better ways.

Reviewer (Member):

> We have 32-bit CI?

We have at least one 32-bit Windows machine.

> I chose 1e-8 because I saw it several times in the repo and it's also the default atol of `np.isclose`; not sure if there are better ways.

AFAIK, `np.isclose` is the way it is for backward compatibility. The new standard is the one adopted by Python >= 3.5 in PEP 485 (i.e. `math.isclose`), and it seems logical for new code to follow that one wherever possible/convenient.
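To make the defaults being compared concrete, here is a small sketch reusing the two percentile values from the example earlier in this thread:

```python
import math
import numpy as np

# Two doubles one ulp apart, as produced by np.percentile above.
a, b = 0.5675805163876734, 0.5675805163876735

# PEP 485 / math.isclose: rel_tol=1e-9, abs_tol=0.0 by default.
assert math.isclose(a, b)
# np.isclose keeps rtol=1e-5, atol=1e-8 for backward compatibility.
assert np.isclose(a, b)
# The PR's hard-coded tolerance also treats them as equal:
assert not (b - a > 1e-8)
```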

@jnothman (Member) commented Feb 20, 2019 via email

@qinhanmin2014 qinhanmin2014 changed the title FIX Remove redundant bins in KBinsDiscretizer FIX Remove bins whose width <= 0 in KBinsDiscretizer Feb 20, 2019
@qinhanmin2014 qinhanmin2014 changed the title FIX Remove bins whose width <= 0 in KBinsDiscretizer FIX Remove bins whose width <= 0 with a warning in KBinsDiscretizer Feb 20, 2019
@qinhanmin2014 (Member Author):

I've removed the term redundant bins because it's hard to define. Users might regard bins which do not contain any data as redundant bins. Will revert if someone disagrees.

@qinhanmin2014 (Member Author):

> Okay. Let's leave it at something like 1e-8, document it

@jnothman How to document it?

@jnothman (Member) commented Feb 20, 2019 via email

@jnothman (Member) commented Feb 20, 2019 via email

@adrinjalali (Member):

> I would've thought 32-bit data was more relevant than a 32-bit processor... but perhaps both

I'm not sure, that's why I think having a test checking the boundary conditions here is a good idea.

@jnothman (Member) left a review comment:

This is fine by me. Re @adrinjalali's suggestion of testing platforms and datatypes, is that particularly with reference to the numpy bug in percentile?

@jnothman mentioned this pull request Feb 20, 2019
@adrinjalali (Member):

> This is fine by me. Re @adrinjalali's suggestion of testing platforms and datatypes, is that particularly with reference to the numpy bug in percentile?

No, not really. Regardless, I think this is an improvement on the status-quo anyway; and I don't think we test these boundary cases in other similar situations.

@adrinjalali adrinjalali merged commit b40868d into scikit-learn:master Feb 20, 2019
@qinhanmin2014 (Member Author):

> No, not really. Regardless, I think this is an improvement on the status-quo anyway; and I don't think we test these boundary cases in other similar situations.

Please open an issue if you think it's worth discussing. Honestly I'm unable to fully understand your point :)

@qinhanmin2014 qinhanmin2014 deleted the redundant-bin branch February 20, 2019 14:43
@jnothman (Member) commented Feb 20, 2019 via email

jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Feb 20, 2019 (scikit-learn#13165):

* remove redundant bins

* tests

* what's new

* issue number

* numeric issue

* move what's new

* Joel's comment

* forget something

* flake8

* more doc update

* Joel's comment

* redundant bins

* new message

* comment
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019 (scikit-learn#13165)
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019 (scikit-learn#13165)