FIX Remove bins whose width <= 0 with a warning in KBinsDiscretizer #13165

Merged
14 commits merged on Feb 20, 2019
4 changes: 4 additions & 0 deletions doc/whats_new/v0.20.rst
@@ -64,6 +64,10 @@ Changelog
combination with ``handle_unknown='ignore'``.
:issue:`12881` by `Joris Van den Bossche`_.

- |Fix| Redundant bins (i.e., bins whose width <= 0) are removed with a warning
in :class:`preprocessing.KBinsDiscretizer`.
:issue:`13165` by :user:`Hanmin Qin <qinhanmin2014>`.

:mod:`sklearn.feature_extraction.text`
......................................

17 changes: 16 additions & 1 deletion sklearn/preprocessing/_discretization.py
@@ -56,7 +56,8 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
Attributes
----------
n_bins_ : int array, shape (n_features,)
Number of bins per feature.
Number of bins per feature. Redundant bins (i.e., bins whose
width <= 0) are removed with a warning.

bin_edges_ : array of arrays, shape (n_features, )
The edges of each bin. Contain arrays of varying shapes ``(n_bins_, )``
@@ -102,6 +103,11 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
:class:`sklearn.compose.ColumnTransformer` if you only want to preprocess
part of the features.

``KBinsDiscretizer`` might produce constant features (e.g., when
``encode = 'onehot'`` and certain bins do not contain any data).
Member:

Maybe mention strategy=uniform here

Member Author (qinhanmin2014):

> Maybe mention strategy=uniform here

Why? Other strategies also suffer from this problem.

Member:

Do they? Since the others follow the empirical distribution of the feature and remove empty bins, how can they have no data at training time?

Member Author (qinhanmin2014):

> Do they? Since the others follow the empirical distribution of the feature and remove empty bins, how can they have no data at training time?

We only remove bins whose width <= 0, e.g.,

from sklearn.preprocessing import KBinsDiscretizer
X = [[1], [2]]
kb = KBinsDiscretizer(n_bins=5, encode='ordinal')
kb.fit_transform(X)
# array([[0.], [4.]])
# All five bins have width > 0, so none is removed, even though bins 1-3 contain no samples.

Maybe I should say "bins whose width <= 0" instead of "Redundant bins (i.e., bins whose width <= 0)", since it's difficult to define "redundant" bins?

Member:

Ahh... Okay. A bit of an edge case, since there are fewer samples than bins, which maybe we should prohibit anyway.
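For reference, a small repro (adapted from the test_redundant_bins test added further down in this PR): with repeated values, strategy='quantile' produces a zero-width bin, which this fix removes with a warning.

from sklearn.preprocessing import KBinsDiscretizer
X = [[0], [0], [0], [0], [3], [3]]
kbd = KBinsDiscretizer(n_bins=3, strategy='quantile')
kbd.fit(X)
# UserWarning: Redundant bins (i.e., bins whose width <= 0) in feature 0 are removed.
kbd.bin_edges_[0]  # array([0., 1., 3.]): the zero-width bin between the two edges at 0 is gone
kbd.n_bins_        # array([2])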

These features can be removed with feature selection algorithms
(e.g., :class:`sklearn.feature_selection.VarianceThreshold`).
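A minimal sketch of that workflow (illustrative, not part of the diff): one-hot encode with KBinsDiscretizer, then drop the constant columns produced by empty bins.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import VarianceThreshold

X = [[0], [0], [0], [0], [3], [3]]
# With strategy='uniform', the middle bin of [0, 3] contains no samples,
# so its one-hot column is constant (all zeros).
pipe = make_pipeline(
    KBinsDiscretizer(n_bins=3, encode='onehot-dense', strategy='uniform'),
    VarianceThreshold())  # removes the zero-variance column
Xt = pipe.fit_transform(X)  # shape (6, 2) instead of (6, 3)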

See also
--------
sklearn.preprocessing.Binarizer : class used to bin values as ``0`` or
@@ -177,6 +183,15 @@ def fit(self, X, y=None):
bin_edges[jj] = (centers[1:] + centers[:-1]) * 0.5
bin_edges[jj] = np.r_[col_min, bin_edges[jj], col_max]

# Remove redundant bins (i.e., bins whose width <= 0)
if self.strategy in ('quantile', 'kmeans'):
mask = np.ediff1d(bin_edges[jj], to_begin=np.inf) > 1e-8
Member:

Don't we need to test this for a 32-bit system as well? It'd be nice to have a test that exercises this precision level, to see if it works on all CI machines.

Member (@jnothman, Feb 20, 2019):

I actually think we should make this 0 and not specify a tolerance for a negligible bin. With unary or ordinal encoding, small bins won't harm the system... For one-hot we could consider a user-specified tolerance.

Member:

Sorry, fixed typo

Member:

I see what you mean, but in general, comparing floats to 0 has kinda proven itself unreliable on multiple occasions at least in the past few months. The last one I remember is the test you commented on, where the two floats logically should have been identical, but they weren't.

math.isclose() has the default rel_tol=1e-9, which kinda seems like a good value for 64bit systems to me, not sure if it'd be as good on the 32bit ones.

Member Author (qinhanmin2014):

> Don't we need to test this for a 32-bit system as well?

We have 32bit CI?

> I actually think we should make this 0 and not specify a tolerance for a negligible bin.

Is it good to do so when comparing floats? If we use 0, this PR won't solve the new issue completely (see #13165 (comment)), or do you want to ignore these extreme cases?

I chose 1e-8 because I saw it several times in the repo and it's also the default atol of np.isclose; not sure if there are better ways.

Member:

> We have 32bit CI?

We have at least one 32-bit Windows CI build.

> I chose 1e-8 because I saw it several times in the repo and it's also the default atol of np.isclose; not sure if there are better ways.

AFAIK, np.isclose is the way it is for backward compatibility. The newer standard is the one adopted by Python >= 3.5 in PEP 485 (i.e., math.isclose), and it seems logical for new code to follow that one wherever possible/convenient.
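To make the trade-off concrete (an editorial sketch, not from the thread): math.isclose uses a relative tolerance by default, np.isclose adds an absolute atol=1e-8, and the check in this PR is a plain absolute cutoff on bin widths.

import math
import numpy as np

a, b = 0.1 + 0.2, 0.3
a == b                # False: the values differ by ~5.6e-17, so exact comparison is fragile
math.isclose(a, b)    # True: relative tolerance rel_tol=1e-9 (PEP 485)
np.isclose(a, b)      # True: rtol=1e-5 plus absolute atol=1e-8

# The 1e-8 threshold in this PR is an absolute cutoff on bin widths:
edges = np.array([0., 0., 1., 3.])
np.ediff1d(edges, to_begin=np.inf) > 1e-8   # array([ True, False,  True,  True])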

bin_edges[jj] = bin_edges[jj][mask]
if len(bin_edges[jj]) - 1 != n_bins[jj]:
warnings.warn('Redundant bins (i.e., bins whose width '
'<= 0) in feature %d are removed.' % jj)
n_bins[jj] = len(bin_edges[jj]) - 1

self.bin_edges_ = bin_edges
self.n_bins_ = n_bins

54 changes: 39 additions & 15 deletions sklearn/preprocessing/tests/test_discretization.py
@@ -7,6 +7,7 @@
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils.testing import (
assert_array_almost_equal,
assert_array_equal,
assert_raises,
assert_raise_message,
@@ -209,24 +210,22 @@ def test_nonuniform_strategies(
assert_array_equal(expected_5bins, Xt.ravel())


@pytest.mark.parametrize('strategy', ['uniform', 'kmeans', 'quantile'])
@pytest.mark.parametrize(
'strategy, expected_inv',
[('uniform', [[-1.5, 2., -3.5, -0.5], [-0.5, 3., -2.5, -0.5],
[0.5, 4., -1.5, 0.5], [0.5, 4., -1.5, 1.5]]),
('kmeans', [[-1.375, 2.125, -3.375, -0.5625],
[-1.375, 2.125, -3.375, -0.5625],
[-0.125, 3.375, -2.125, 0.5625],
[0.75, 4.25, -1.25, 1.625]]),
('quantile', [[-1.5, 2., -3.5, -0.75], [-0.5, 3., -2.5, 0.],
[0.5, 4., -1.5, 1.25], [0.5, 4., -1.5, 1.25]])])
@pytest.mark.parametrize('encode', ['ordinal', 'onehot', 'onehot-dense'])
def test_inverse_transform(strategy, encode):
X = np.random.RandomState(0).randn(100, 3)
def test_inverse_transform(strategy, encode, expected_inv):
kbd = KBinsDiscretizer(n_bins=3, strategy=strategy, encode=encode)
Xt = kbd.fit_transform(X)
X2 = kbd.inverse_transform(Xt)
X2t = kbd.fit_transform(X2)
if encode == 'onehot':
assert_array_equal(Xt.todense(), X2t.todense())
else:
assert_array_equal(Xt, X2t)
if 'onehot' in encode:
Xt = kbd._encoder.inverse_transform(Xt)
X2t = kbd._encoder.inverse_transform(X2t)

assert_array_equal(Xt.max(axis=0) + 1, kbd.n_bins_)
assert_array_equal(X2t.max(axis=0) + 1, kbd.n_bins_)
Xinv = kbd.inverse_transform(Xt)
assert_array_almost_equal(expected_inv, Xinv)


@pytest.mark.parametrize('strategy', ['uniform', 'kmeans', 'quantile'])
@@ -253,3 +252,28 @@ def test_overwrite():
Xinv = est.inverse_transform(Xt)
assert_array_equal(Xt, Xt_before)
assert_array_equal(Xinv, np.array([[0.5], [1.5], [2.5], [2.5]]))


@pytest.mark.parametrize(
'strategy, expected_bin_edges',
[('quantile', [0, 1, 3]), ('kmeans', [0, 1.5, 3])])
def test_redundant_bins(strategy, expected_bin_edges):
X = [[0], [0], [0], [0], [3], [3]]
kbd = KBinsDiscretizer(n_bins=3, strategy=strategy)
msg = ("Redundant bins (i.e., bins whose width <= 0) "
"in feature 0 are removed.")
assert_warns_message(UserWarning, msg, kbd.fit, X)
assert_array_almost_equal(kbd.bin_edges_[0], expected_bin_edges)


def test_percentile_numeric_stability():
X = np.array([0.05, 0.05, 0.95]).reshape(-1, 1)
bin_edges = np.array([0.05, 0.23, 0.41, 0.59, 0.77, 0.95])
Xt = np.array([0, 0, 4]).reshape(-1, 1)
kbd = KBinsDiscretizer(n_bins=10, encode='ordinal',
strategy='quantile')
msg = ("Redundant bins (i.e., bins whose width <= 0) "
"in feature 0 are removed.")
assert_warns_message(UserWarning, msg, kbd.fit, X)
assert_array_almost_equal(kbd.bin_edges_[0], bin_edges)
assert_array_almost_equal(kbd.transform(X), Xt)