
[MRG+2] Implement two non-uniform strategies for KBinsDiscretizer (discrete branch) #11272

Merged: 13 commits merged into scikit-learn:discrete from TomDLT:discrete_quantile on Jul 9, 2018

TomDLT (Member) commented Jun 15, 2018

Fixes #9338.

This PR implements two non-uniform strategies for KBinsDiscretizer (in the discrete branch, #9342), adding a strategy parameter:

  • KBinsDiscretizer(strategy='uniform'): previous behavior, constant-width bins.
  • KBinsDiscretizer(strategy='quantile'): new behavior (now the default), based on np.percentile.
  • KBinsDiscretizer(strategy='kmeans'): new behavior, based on KMeans.
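
For context, a minimal usage sketch of the new parameter (toy data chosen here for illustration; encode='ordinal' as in the docs example quoted further below):

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.], [1.], [2.], [3.], [9.], [10.]])  # single toy feature

for strategy in ('uniform', 'quantile', 'kmeans'):
    # encode='ordinal' returns integer bin codes rather than one-hot columns
    est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy=strategy)
    print(strategy, est.fit_transform(X).ravel())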

It adds the following example:
[Figure: the example compares the binning produced by the three strategies on several toy datasets.]

@TomDLT TomDLT force-pushed the TomDLT:discrete_quantile branch from 2f8ad7a to ded8c41 Jun 15, 2018

@TomDLT TomDLT force-pushed the scikit-learn:discrete branch from e305039 to 8aa62ab Jun 15, 2018

@TomDLT TomDLT force-pushed the TomDLT:discrete_quantile branch from ded8c41 to 2b8c99f Jun 15, 2018

@TomDLT TomDLT force-pushed the TomDLT:discrete_quantile branch 2 times, most recently from aaddc11 to 21b9356 Jun 15, 2018

@TomDLT TomDLT force-pushed the TomDLT:discrete_quantile branch from 21b9356 to 54522af Jun 15, 2018


jnothman (Member) left a comment

I think we should treat this as if it'll be released with KBinsDiscretizer. We should set the default strategy to 'quantile'. We should consider API changes to make this more consistent. We should update the what's new message rather than adding one.

I was a bit surprised to see that you've only added tests, not modified them. But the only thing I'm missing there is a check that 1-column and multi-column transformations are consistent.

@@ -576,6 +576,18 @@ Discretization is similar to constructing histograms for continuous data.
However, histograms focus on counting features which fall into particular
bins, whereas discretization focuses on assigning feature values to these bins.

:class:`KBinsDiscretizer` implement different binning strategies, which can be

jnothman (Member), Jun 16, 2018:

*implements

@@ -62,7 +82,9 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
Number of bins per feature. An ignored feature at index ``i``
will have ``n_bins_[i] == 0``.
bin_width_ : float array, shape (n_features,)
bin_width_ : array, shape (n_features,)
Contain floats with 'uniform' strategy, and arrays of varying shapes

jnothman (Member), Jun 16, 2018:

Do we now want to store widths, or edges? Do we lose a lot by storing edges in all cases?

kmeans
Widths are defined by a k-means on each features.
random_state : int, RandomState instance or None (default)

jnothman (Member), Jun 16, 2018:

Seeing as we're doing it in 1d, can't we just initialise kmeans at the midpoints of a uniform split, and make it deterministic?

n_clusters = self.n_bins_[jj]
km = KMeans(n_clusters=n_clusters,
random_state=self.random_state).fit(col)
centers = np.sort(km.cluster_centers_[:, 0],

jnothman (Member), Jun 16, 2018:

I don't think we'd need this sort if we initialised with linspace.
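
A sketch of the suggested deterministic initialisation (standalone, not the PR's exact code): seed the centers at the midpoints of a uniform split, so no random_state is needed and, in 1-D, the fitted centers come out already ordered.

import numpy as np
from sklearn.cluster import KMeans

col = np.array([0., 1., 2., 3., 9., 10.]).reshape(-1, 1)
n_clusters = 3

# Midpoints of a uniform split as fixed seeds: deterministic k-means.
uniform_edges = np.linspace(col.min(), col.max(), n_clusters + 1)
init = ((uniform_edges[1:] + uniform_edges[:-1]) * 0.5).reshape(-1, 1)
km = KMeans(n_clusters=n_clusters, init=init, n_init=1)
centers = km.fit(col).cluster_centers_[:, 0]  # already sorted in practice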


if self.strategy == 'quantile':
n_quantiles = self.n_bins_[jj] + 1
qt = QuantileTransformer(n_quantiles=n_quantiles).fit(col)

jnothman (Member), Jun 16, 2018:

This seems overkill. Just use boundaries = np.percentile(col, n_quantiles + 1)
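
For reference, the merged code (quoted further below) follows that idea, with evenly spaced quantile ranks; a sketch:

import numpy as np

col = np.array([0., 1., 2., 3., 9., 10.])
n_bins = 3
quantiles = np.linspace(0, 100, n_bins + 1)  # [0., 33.3..., 66.6..., 100.]
edges = np.percentile(col, quantiles)        # roughly equal-count bins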

valid_strategy = ('uniform', 'quantile', 'kmeans')
if self.strategy not in valid_strategy:
raise ValueError("Valid options for 'strategy' are {}. "
"Got 'strategy = {}' instead."

jnothman (Member), Jun 16, 2018:

use {!r} for quoting
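
For instance, {!r} interpolates the value's repr, quotes included (the offending value here is hypothetical):

>>> "Valid options for 'strategy' are {}. Got strategy={!r} instead.".format(
...     ('uniform', 'quantile', 'kmeans'), 'gaussian')
"Valid options for 'strategy' are ('uniform', 'quantile', 'kmeans'). Got strategy='gaussian' instead."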

def test_nonuniform_strategies():
X = np.array([0, 1, 2, 3, 9, 10]).reshape(-1, 1)

# with 2 bins

jnothman (Member), Jun 16, 2018:

Please also test the size of bin_widths_

TomDLT (Member, Author) commented Jun 18, 2018

Thanks for the review.

  • I changed to storing the edges in all cases, instead of widths. It avoids successive diff and cumsum in non-uniform strategies, and makes the API more consistent across strategies.
  • I removed the offset attribute, since the information is now in the edges.
  • I changed the default to 'quantile'.
  • I made 'kmeans' strategy deterministic with a uniform init.
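
A minimal sketch of why stored edges keep transform simple; np.searchsorted stands in here for the actual implementation:

import numpy as np

bin_edges = np.array([0., 2.5, 6., 10.])  # n_bins = 3 -> n_bins + 1 edges
col = np.array([0., 1., 2., 3., 9., 10.])

# Ordinal codes come straight from the interior edges; no diff/cumsum
# bookkeeping is needed for non-uniform widths.
codes = np.searchsorted(bin_edges[1:-1], col, side='right')
print(codes)  # [0 0 0 1 2 2]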

@TomDLT TomDLT force-pushed the TomDLT:discrete_quantile branch from 520fa7b to 785c14f Jun 18, 2018

jnothman (Member) commented Jun 18, 2018

> I changed to storing the edges in all cases, instead of widths. It avoids successive diff and cumsum in non-uniform strategies, and makes the API more consistent across strategies.

And I suppose we're not worried about a slow-down in the uniform case?

TomDLT (Member, Author) commented Jun 18, 2018

Good point. If you think it's important, I can add a branch in _transform to be faster for the uniform strategy.

jnothman (Member) left a comment

Great work. All nitpicks. LGTM!

X /= bin_width
bin_edges = self.bin_edges_[trans]
for jj in range(X.shape[1]):
# Values which are a multiple of the bin width are susceptible to

jnothman (Member), Jun 18, 2018:

"a multiple of the bin width" -> "close to the bin edge"

for jj in range(X.shape[1]):
# Values which are a multiple of the bin width are susceptible to
# numeric instability. Add eps to X so these values are binned
# correctly. See documentation for numpy.isclose for an explanation

jnothman (Member), Jun 18, 2018:

Maybe "correctly" -> "correctly with respect to their decimal truncation"???
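
The instability under discussion, as a standalone example (values chosen for illustration):

import numpy as np

X = np.array([0.3])
bin_width = 0.1
print(X / bin_width)            # [2.9999999999999996], not [3.]
print(np.floor(X / bin_width))  # [2.] -- one bin too low

# Adding a tiny eps before flooring nudges edge values into the right bin.
eps = 1e-8
print(np.floor(X / bin_width + eps))  # [3.]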

edges = np.r_[col.min(), edges, col.max()]

if edges[0] == edges[-1] and self.n_bins_[jj] > 2:
warnings.warn("Features %d is constant and will be "

jnothman (Member), Jun 18, 2018:

Features -> Feature

Widths are defined by a quantile transform, to have a uniform
number of samples in each bin.
kmeans
Widths are defined by a k-means on each features.

jnothman (Member), Jun 18, 2018:

features -> feature

Perhaps: "Values in each bin have the same nearest center of a k-means cluster."

uniform
All bins in each feature have identical widths.
quantile
Widths are defined by a quantile transform, to have a uniform

jnothman (Member), Jun 18, 2018:

"All bins have the same number of points"

jnothman (Member) commented Jun 18, 2018

> Good point. If you think it's important, I can add a branch in _transform to be faster for the uniform strategy.

I don't yet have reason to believe it's important... But yes, we could consider that now or in the future.

jnothman referenced this pull request Jun 18, 2018: [MRG+2] Merge discrete branch into master #9342 (merged; 4 of 7 tasks complete)

@TomDLT TomDLT force-pushed the TomDLT:discrete_quantile branch from e5586dc to 24a8745 Jun 18, 2018

@TomDLT TomDLT changed the title [MRG] Implement two non-uniform strategies for KBinsDiscretizer (discrete branch) [MRG+1] Implement two non-uniform strategies for KBinsDiscretizer (discrete branch) Jun 18, 2018

jnothman (Member) commented Jun 24, 2018

I think once this is merged, we should move the discretizer to _encoders.py. This would resolve conflicts with master.

jnothman (Member) commented Jul 1, 2018

Would @qinhanmin2014 like to give this a second look?

qinhanmin2014 (Member) commented Jul 1, 2018

I promise to give a review next weekend; a second review from someone else before then is welcome. Sorry for the delay, but I'm swamped with schoolwork at the end of the semester.


qinhanmin2014 (Member) left a comment

The code LGTM. Will look at the examples and tests in a couple of hours.

edges = np.asarray(np.percentile(col, quantiles))

elif self.strategy == 'kmeans':
from ..cluster import KMeans

qinhanmin2014 (Member), Jul 7, 2018:

Any specific reason to import here?

TomDLT (Author), Jul 9, 2018:

It was an import-loop problem, but it no longer seems to happen, so I moved the import to the top.

@@ -565,7 +563,7 @@ example, these intervals are defined as:
Based on these bin intervals, ``X`` is transformed as follows::

>>> est.transform(X) # doctest: +SKIP
array([[ 0., 2., 1.],
array([[ 0., 1., 1.],

qinhanmin2014 (Member), Jul 7, 2018:

If I understand correctly, the results changed because we're now using strategy='quantile', right?
If so, I think we need to modify the intervals above, otherwise they're not consistent:

  - feature 1: :math:`{[-\infty, 0), [0, 3), [3, \infty)}`
  - feature 2: :math:`{[-\infty, 4), [4, 5), [5, \infty)}`
  - feature 3: :math:`{[-\infty, 13), [13, \infty)}`

TomDLT (Author), Jul 9, 2018:

Good catch!

from ..utils.validation import check_array
from ..utils.validation import check_is_fitted
from ..utils.validation import column_or_1d
from ..utils.fixes import np_version


class KBinsDiscretizer(BaseEstimator, TransformerMixin):

qinhanmin2014 (Member), Jul 7, 2018:

Need to update the doc below: "Bins continuous data into k equal width intervals." is no longer the case, I think.

TomDLT (Author), Jul 9, 2018:

Thanks

qinhanmin2014 (Member) commented Jul 7, 2018

@jnothman @TomDLT Would it be better to change the default value of n_bins to, e.g., 3 or 5? n_bins=2 seems strange to me, even though I'm aware that we're going to provide ways to determine n_bins automatically.

jnothman (Member) commented Jul 7, 2018

Without looking at the code, 'auto' in np.histogram is:

q25, q75 = np.percentile(X, [25, 75], axis=0)
iqr = q75 - q25
fd = 2 * iqr / X.shape[0] ** (1/3)
n_bins_fd = np.ceil(np.ptp(X, axis=0) / fd)
n_bins_sturges = np.ceil(np.log2(X.shape[0])) + 1
n_bins = np.maximum(n_bins_fd, n_bins_sturges).astype(int)

so if we want a better default, it's not hard to implement....
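
A self-contained version of that sketch, assuming a 2-D float array X (the helper name is hypothetical, and there is no guard for a zero interquartile range):

import numpy as np

def auto_n_bins(X):
    # numpy.histogram's 'auto': the larger of the Freedman-Diaconis and
    # Sturges bin counts, computed here per feature (column).
    q25, q75 = np.percentile(X, [25, 75], axis=0)
    fd_width = 2 * (q75 - q25) / X.shape[0] ** (1 / 3)  # FD bin *width*
    n_bins_fd = np.ceil(np.ptp(X, axis=0) / fd_width)   # width -> count
    n_bins_sturges = np.ceil(np.log2(X.shape[0])) + 1
    return np.maximum(n_bins_fd, n_bins_sturges).astype(int)

X = np.random.RandomState(0).randn(100, 3)
print(auto_n_bins(X))  # typically around 8-10 bins per column here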

qinhanmin2014 (Member) left a comment

LGTM. I'm much more confident about merging the discrete branch after this PR.

assert_array_equal(expected_3bins, Xt.ravel())


def test_kmeans_strategy():

qinhanmin2014 (Member), Jul 7, 2018:

Could you explain the purpose of the test (maybe in a comment)? It seems a bit strange (though I understand how it passes) and honestly seems redundant to me.

TomDLT (Author), Jul 9, 2018:

It is indeed redundant, so I removed it.

==========================================================
This example presents the different strategies implemented in KBinsDiscretizer:
- 'uniform': The discretization is uniform in each feature, which means that

qinhanmin2014 (Member), Jul 7, 2018:

Formatting issue according to Circle. I guess a blank line will solve the problem.

@qinhanmin2014 qinhanmin2014 changed the title [MRG+1] Implement two non-uniform strategies for KBinsDiscretizer (discrete branch) [MRG+2] Implement two non-uniform strategies for KBinsDiscretizer (discrete branch) Jul 7, 2018

@qinhanmin2014 qinhanmin2014 added this to the 0.20 milestone Jul 7, 2018

jnothman (Member) commented Jul 7, 2018

Further regarding the default n_bins: showing that 'auto' is useful when not building a histogram might be hard. I'm +1 for making it more than 2.

TomDLT (Member, Author) commented Jul 9, 2018

Thanks for the review!

I changed the default from n_bins=2 to n_bins=5; indeed, only 2 bins is a bit odd.

@TomDLT TomDLT force-pushed the TomDLT:discrete_quantile branch from 7c228e4 to 33dec83 Jul 9, 2018

bin_edges[jj] = np.asarray(np.percentile(column, quantiles))

elif self.strategy == 'kmeans':
from ..cluster import KMeans # fixes import loops

qinhanmin2014 (Member), Jul 9, 2018:

So what's the final decision here? You said in #11272 (comment) that you'd move the import to the top, but you actually left it here.

jnothman (Member), Jul 9, 2018:

Please do leave it here. I don't think importing from sklearn.preprocessing should automatically result in importing DBSCAN, etc. Better to keep module loading as light as possible.
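
The pattern being defended, sketched as a standalone helper (name and exact logic hypothetical):

import numpy as np

def _kmeans_bin_edges(col, n_bins):
    # Local import: importing sklearn.preprocessing should not pull in all
    # of sklearn.cluster (KMeans, DBSCAN, ...) as a module-load side effect.
    from sklearn.cluster import KMeans

    km = KMeans(n_clusters=n_bins, n_init=10).fit(col.reshape(-1, 1))
    centers = np.sort(km.cluster_centers_[:, 0])
    mids = (centers[1:] + centers[:-1]) * 0.5
    return np.r_[col.min(), mids, col.max()]

edges = _kmeans_bin_edges(np.array([0., 1., 2., 3., 9., 10.]), 3)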

bin_edges[jj] = (centers[1:] + centers[:-1]) * 0.5
bin_edges[jj] = np.r_[col_min, bin_edges[jj], col_max]

if bin_edges[jj][0] == bin_edges[jj][-1] and n_bins[jj] > 2:

qinhanmin2014 (Member), Jul 9, 2018:

I'm fine with handling edge cases, but:
(1) Why don't you handle the situation when n_bins[jj] == 2?
(2) It seems strange that we reject n_bins = 1 yet manually set n_bins_ = 1 in edge cases. Would it be reasonable to accept n_bins = 1?

jnothman (Member), Jul 9, 2018:

It's not useful/meaningful to accept n_bins = 1...

qinhanmin2014 (Member), Jul 9, 2018:

Fine, then changing n_bins[jj] > 2 to n_bins[jj] >= 2 seems enough from my side; sorry if I'm wrong.

qinhanmin2014 (Member), Jul 9, 2018:

Also, maybe we need a test for the edge case.
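
A sketch of such a test, assuming the warn-and-collapse fallback discussed above (a constant column triggers the warning and falls back to a single bin):

import numpy as np
import pytest
from sklearn.preprocessing import KBinsDiscretizer

def test_constant_feature_collapses_to_one_bin():
    X = np.full((5, 1), 3.0)  # constant column: only one bin is meaningful
    est = KBinsDiscretizer(n_bins=3, encode='ordinal')
    with pytest.warns(UserWarning):
        est.fit(X)
    assert est.n_bins_[0] == 1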

>>> est = preprocessing.KBinsDiscretizer(n_bins=[3, 3, 2], encode='ordinal').fit(X)
>>> est.bin_width_
array([3., 1., 2.])
>>> est = preprocessing.KBinsDiscretizer(n_bins=[3, 4, 2], encode='ordinal').fit(X)

qinhanmin2014 (Member), Jul 9, 2018:

Is it good to have n_bins greater than the number of samples in the example? I might still prefer n_bins=[3, 3, 2].

def test_same_min_max():
@pytest.mark.parametrize('strategy', ['uniform', 'kmeans', 'quantile'])
@pytest.mark.parametrize('n_bins', [2, 3])
def test_same_min_max(strategy, n_bins):

qinhanmin2014 (Member), Jul 9, 2018:

Do we test n_bins_[0] = 1 somewhere? Or do you think we don't need such a test?
Also, maybe we don't need to test multiple values of n_bins here?

TomDLT (Author), Jul 9, 2018:

n_bins = 1 is tested both as an array entry (test_invalid_n_bins_array) and as an integer (test_invalid_n_bins).

TomDLT (Author), Jul 9, 2018:

Oh, I misread your comment; I'll add a line to test n_bins_.

@TomDLT TomDLT force-pushed the TomDLT:discrete_quantile branch from b5b6083 to d1d2369 Jul 9, 2018

qinhanmin2014 (Member) left a comment

LGTM if CIs are green. ping @jnothman for a final check.

@jnothman jnothman merged commit 5a61af9 into scikit-learn:discrete Jul 9, 2018

0 of 5 checks had passed at merge time: LGTM analysis (Python), CircleCI python2 and python3, AppVeyor, and Travis CI were still running.