# [MRG+2] Implement two non-uniform strategies for KBinsDiscretizer (discrete branch) #11272

Merged

merged 13 commits into scikit-learn:discrete from TomDLT:discrete_quantile on Jul 9, 2018

## Conversation

3 participants
Member

### TomDLT commented Jun 15, 2018 • edited

Fixes #9338. This PR implements two non-uniform strategies for KBinsDiscretizer (in the discrete branch, #9342), adding a `strategy` parameter:

- `KBinsDiscretizer(strategy='uniform')`: previous behavior, constant-width bins.
- `KBinsDiscretizer(strategy='quantile')`: new behavior (default), based on `np.percentile`.
- `KBinsDiscretizer(strategy='kmeans')`: new behavior, based on `KMeans`.

It adds the following example:
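The example image is not reproduced here; a minimal usage sketch of the three strategies, on the toy data used later in the thread's tests (expected bins follow the behavior described in this PR):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.], [1.], [2.], [3.], [9.], [10.]])
results = {}
for strategy in ('uniform', 'quantile', 'kmeans'):
    est = KBinsDiscretizer(n_bins=2, strategy=strategy, encode='ordinal')
    results[strategy] = est.fit_transform(X).ravel()

# uniform  -> [0, 0, 0, 0, 1, 1]  (edges at [0, 5, 10])
# quantile -> [0, 0, 0, 1, 1, 1]  (edge at the median, 2.5)
# kmeans   -> [0, 0, 0, 0, 1, 1]  (edge between the two cluster centers)
```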

### TomDLT force-pushed the TomDLT:discrete_quantile branch from ded8c41 to 2b8c99f on Jun 15, 2018

 Implement two new strategies for KBinsDiscretizer 
 9001878 

### TomDLT force-pushed the TomDLT:discrete_quantile branch 2 times, most recently from aaddc11 to 21b9356 on Jun 15, 2018

 FIX python viridis import 
 54522af 

Member

Nice!

Member

### jnothman left a comment

 I think we should treat this as if it'll be released with KBinsDiscretizer. We should set the default strategy to 'quantile'. We should consider API changes to make this more consistent. We should update the what's new message rather than adding one. I was a bit surprised to see you've only added, and not modified tests. But the only thing I'm missing there is checking the consistency of 1-column and multi-column transformations.
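The consistency check mentioned above could be sketched as follows (illustrative only, not the merged test): discretizing a single column should match the corresponding column of a multi-column transform.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0., 10.], [1., 11.], [2., 12.], [9., 19.]])
# fit on all columns at once, then on column 0 alone
full = KBinsDiscretizer(n_bins=2, encode='ordinal').fit_transform(X)
single = KBinsDiscretizer(n_bins=2, encode='ordinal').fit_transform(X[:, [0]])
# the first column of `full` should equal `single`
```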
 @@ -576,6 +576,18 @@ Discretization is similar to constructing histograms for continuous data. However, histograms focus on counting features which fall into particular bins, whereas discretization focuses on assigning feature values to these bins. :class:KBinsDiscretizer implement different binning strategies, which can be

#### jnothman Jun 16, 2018

Member

*implements

 @@ -62,7 +82,9 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin): Number of bins per feature. An ignored feature at index i will have n_bins_[i] == 0. bin_width_ : float array, shape (n_features,) bin_width_ : array, shape (n_features,) Contain floats with 'uniform' strategy, and arrays of varying shapes

#### jnothman Jun 16, 2018

Member

Do we now want to store widths, or edges? Do we lose a lot by storing edges in all cases?

 kmeans Widths are defined by a k-means on each features. random_state : int, RandomState instance or None (default)

#### jnothman Jun 16, 2018

Member

Seeing as we're doing it in 1d, can't we just initialise kmeans at the midpoints of a uniform split, and make it deterministic?
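A sketch of that suggestion (hypothetical code; a later commit in this PR adopts the idea): seed 1-D k-means at the midpoints of a uniform split, which makes the fit deterministic with no `random_state`.

```python
import numpy as np
from sklearn.cluster import KMeans

col = np.array([[0.], [1.], [2.], [3.], [9.], [10.]])
n_clusters = 2
# midpoints of a uniform split over [min, max] as deterministic init
uniform_edges = np.linspace(col.min(), col.max(), n_clusters + 1)
init = ((uniform_edges[1:] + uniform_edges[:-1]) * 0.5).reshape(-1, 1)
km = KMeans(n_clusters=n_clusters, init=init, n_init=1).fit(col)
centers = km.cluster_centers_[:, 0]  # stays sorted, since init was sorted
```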

 n_clusters = self.n_bins_[jj] km = KMeans(n_clusters=n_clusters, random_state=self.random_state).fit(col) centers = np.sort(km.cluster_centers_[:, 0],

#### jnothman Jun 16, 2018

Member

I don't think we'd need this sort if we initialised with linspace

 if self.strategy == 'quantile': n_quantiles = self.n_bins_[jj] + 1 qt = QuantileTransformer(n_quantiles=n_quantiles).fit(col)

#### jnothman Jun 16, 2018

Member

This seems overkill. Just use boundaries = np.percentile(col, n_quantiles + 1)
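The final code (quoted later in the thread) follows this idea; note that `np.percentile` takes the percentile values themselves, so the grid is built with `np.linspace`. A minimal sketch:

```python
import numpy as np

col = np.array([0., 1., 2., 3., 9., 10.])
n_bins = 2
quantiles = np.linspace(0, 100, n_bins + 1)  # [0., 50., 100.]
edges = np.percentile(col, quantiles)        # min, median, max
```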

 valid_strategy = ('uniform', 'quantile', 'kmeans') if self.strategy not in valid_strategy: raise ValueError("Valid options for 'strategy' are {}. " "Got 'strategy = {}' instead."

#### jnothman Jun 16, 2018

Member

use {!r} for quoting
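For illustration (the message text here is hypothetical), `{!r}` repr-quotes the value, so string arguments appear with quotes in the error message automatically:

```python
valid_strategy = ('uniform', 'quantile', 'kmeans')
# {!r} renders 'invalid' with its quotes, {} renders the tuple as-is
msg = ("Valid options for 'strategy' are {}. "
       "Got strategy={!r} instead.".format(valid_strategy, 'invalid'))
```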

 def test_nonuniform_strategies(): X = np.array([0, 1, 2, 3, 9, 10]).reshape(-1, 1) # with 2 bins

#### jnothman Jun 16, 2018

Member

Please also test the size of bin_widths_

### TomDLT added some commits Jun 18, 2018

 ENH store edges instead of widths 
 262813a 
 ENH change default to "quantile" and make "kmeans" deterministic 
 3b45b05 
 ENH improve tests and error messages 
 de0a349 
Member Author

### TomDLT commented Jun 18, 2018

Thanks for the review.

- I changed to storing the edges in all cases, instead of widths. It avoids successive diff and cumsum in non-uniform strategies, and makes the API more consistent across strategies.
- I removed the offset attribute, since the information is now in the edges.
- I changed the default to 'quantile'.
- I made the 'kmeans' strategy deterministic with a uniform init.
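The edges-based transform can be sketched like this (illustrative, not the literal merged code): a search against the interior edges yields the ordinal bin index directly, with no diff/cumsum.

```python
import numpy as np

bin_edges = np.array([0., 2.5, 10.])   # stored edges for one feature
col = np.array([0., 1., 2., 3., 9., 10.])
# values below the first interior edge get bin 0, and so on
indices = np.searchsorted(bin_edges[1:-1], col, side='right')
```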
 CLN simplify branching 
 785c14f 

Member

### jnothman commented Jun 18, 2018

> I changed to storing the edges in all cases, instead of widths. It avoids successive diff and cumsum in non-uniform strategies, and makes the API more consistent across strategies.

And I suppose we're not worried about a slow-down in the uniform case?
Member Author

### TomDLT commented Jun 18, 2018

 Good point, if you think it's important, I can add a branch in _transform to be faster on the uniform strategy.
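Such a fast uniform branch could look like this (hypothetical sketch, not code from this PR): with constant-width bins, a vectorized floor-divide replaces the per-edge search.

```python
import numpy as np

bin_edges = np.array([0., 2.5, 5., 7.5, 10.])  # uniform edges, one feature
col = np.array([0., 1., 4., 6., 10.])
width = bin_edges[1] - bin_edges[0]
idx = np.floor((col - bin_edges[0]) / width).astype(int)
idx = np.clip(idx, 0, len(bin_edges) - 2)      # right edge goes in last bin
```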

Member

### jnothman left a comment

 Great work. All nitpicks. LGTM!
 X /= bin_width bin_edges = self.bin_edges_[trans] for jj in range(X.shape[1]): # Values which are a multiple of the bin width are susceptible to

#### jnothman Jun 18, 2018

Member

"a multiple of the bin width" -> "close to the bin edge"

 for jj in range(X.shape[1]): # Values which are a multiple of the bin width are susceptible to # numeric instability. Add eps to X so these values are binned # correctly. See documentation for numpy.isclose for an explanation

#### jnothman Jun 18, 2018

Member

Maybe "correctly" -> "correctly with respect to their decimal truncation"???

 edges = np.r_[col.min(), edges, col.max()] if edges[0] == edges[-1] and self.n_bins_[jj] > 2: warnings.warn("Features %d is constant and will be "

#### jnothman Jun 18, 2018

Member

Features -> Feature

 Widths are defined by a quantile transform, to have a uniform number of samples in each bin. kmeans Widths are defined by a k-means on each features.

#### jnothman Jun 18, 2018

Member

features -> feature

Perhaps:
"Values in each bin have the same nearest center of a k-means cluster."

 uniform All bins in each feature have identical widths. quantile Widths are defined by a quantile transform, to have a uniform

#### jnothman Jun 18, 2018

Member

"All bins have the same number of points"

Member

### jnothman commented Jun 18, 2018

> Good point, if you think it's important, I can add a branch in _transform to be faster on the uniform strategy.

I don't yet have reason to believe it's important... But yes, we could consider that now or in the future.
 FIX example 
 b677ab3 

### jnothman referenced this pull request Jun 18, 2018

Merged

#### [MRG+2] Merge discrete branch into master #9342

 FIX compatibility with numpy 1.8 
 24a8745 

### TomDLT changed the title from "[MRG] Implement two non-uniform strategies for KBinsDiscretizer (discrete branch)" to "[MRG+1] Implement two non-uniform strategies for KBinsDiscretizer (discrete branch)" on Jun 18, 2018

 CLN just renaming variables 
 b204f46 
Member

### jnothman commented Jun 24, 2018

 I think once this is merged, we should move the discretizer to _encoders.py. This would resolve conflicts with master.
Member

### jnothman commented Jul 1, 2018

 Would @qinhanmin2014 like to give this a second look?
Member

### qinhanmin2014 commented Jul 1, 2018

I promise to give a review next weekend, and a second review from someone else before then is welcome. Sorry for the delay, but I'm swamped with school work at the end of the semester.
Member

### jnothman commented Jul 1, 2018

That's fine! Just thought you might be a good person to give it a go if available.

Member

### qinhanmin2014 left a comment

 The code LGTM. Will look at the examples and tests in a couple of hours.
 edges = np.asarray(np.percentile(col, quantiles)) elif self.strategy == 'kmeans': from ..cluster import KMeans

#### qinhanmin2014 Jul 7, 2018

Member

Any specific reason to import here?

#### TomDLT Jul 9, 2018

Author Member

An import-loop problem, but it seems not to happen anymore, so I moved it to the top.

 @@ -565,7 +563,7 @@ example, these intervals are defined as: Based on these bin intervals, X is transformed as follows:: >>> est.transform(X) # doctest: +SKIP array([[ 0., 2., 1.], array([[ 0., 1., 1.],

#### qinhanmin2014 Jul 7, 2018

Member

If I understand correctly, the results change because we're now using strategy='quantile', right?
If so, I think we need to modify the intervals above, otherwise they're not consistent:

- feature 1: :math:`{[-\infty, 0), [0, 3), [3, \infty)}`
- feature 2: :math:`{[-\infty, 4), [4, 5), [5, \infty)}`
- feature 3: :math:`{[-\infty, 13), [13, \infty)}`


Author Member

Good catch !

### qinhanmin2014 reviewed Jul 7, 2018

 from ..utils.validation import check_array from ..utils.validation import check_is_fitted from ..utils.validation import column_or_1d from ..utils.fixes import np_version class KBinsDiscretizer(BaseEstimator, TransformerMixin):

#### qinhanmin2014 Jul 7, 2018

Member

Need to update the doc below:
"Bins continuous data into k equal width intervals." is no longer the case, I think.

Author Member

Thanks

Member

### qinhanmin2014 commented Jul 7, 2018

@jnothman @TomDLT Would it be better to change the default value of n_bins to, e.g., 3 or 5? n_bins=2 seems strange from my side, even though I'm aware that we're going to provide ways to automatically determine the value of n_bins.
Member

### jnothman commented Jul 7, 2018

Without looking at the code, 'auto' in np.histogram is:

```python
q25, q75 = np.percentile(X, [25, 75], axis=0)
iqr = q75 - q25
fd = 2 * iqr / X.shape[0] ** (1 / 3)
n_bins_fd = np.ceil(np.ptp(X, axis=0) / fd)
n_bins_sturges = np.ceil(np.log2(X.shape[0])) + 1
n_bins = np.maximum(n_bins_fd, n_bins_sturges).astype(int)
```

so if we want a better default, it's not hard to implement...

Member

### qinhanmin2014 left a comment

 LGTM. I'm much more confident to merge discrete branch after this PR.
 assert_array_equal(expected_3bins, Xt.ravel()) def test_kmeans_strategy():

#### qinhanmin2014 Jul 7, 2018

Member

Could you explain the purpose of the test (maybe in a comment)? It seems a bit strange (though I understand how it passes) and, honestly, seems redundant from my side.

#### TomDLT Jul 9, 2018

Author Member

It is indeed redundant so I removed it.

 ========================================================== This example presents the different strategies implemented in KBinsDiscretizer: - 'uniform': The discretization is uniform in each feature, which means that

#### qinhanmin2014 Jul 7, 2018

Member

Formatting issue according to Circle. I guess a blank line will solve the problem.

Member

### jnothman commented Jul 7, 2018

 Further regarding default n_bins: showing that 'auto' is useful when not doing a histogram might be hard. I'm +1 for having it more than 2.
Member Author

### TomDLT commented Jul 9, 2018

Thanks for the review! I changed the default from n_bins=2 to n_bins=5; indeed, only 2 bins is a bit weird.
 address comments 
 33dec83 

### qinhanmin2014 reviewed Jul 9, 2018

 bin_edges[jj] = np.asarray(np.percentile(column, quantiles)) elif self.strategy == 'kmeans': from ..cluster import KMeans # fixes import loops

#### qinhanmin2014 Jul 9, 2018

Member

So what's the final decision here? You said in #11272 (comment) that you'd move the import to the top, but you actually left it here.

#### jnothman Jul 9, 2018

Member

Please do leave it here. I don't think importing from sklearn.preprocessing should automatically result in importing DBSCAN, etc. Better off keeping the loading as light as possible.

 bin_edges[jj] = (centers[1:] + centers[:-1]) * 0.5 bin_edges[jj] = np.r_[col_min, bin_edges[jj], col_max] if bin_edges[jj][0] == bin_edges[jj][-1] and n_bins[jj] > 2:

#### qinhanmin2014 Jul 9, 2018

Member

I'm fine with handling edge cases, but:
(1) Why don't you handle situations when n_bins[jj] == 2?
(2) It seems strange that we reject n_bins = 1 and manually set n_bins_ = 1 in edge cases. Will it be reasonable to accept n_bins = 1?

#### jnothman Jul 9, 2018

Member

It's not useful/meaningful to accept n_bins = 1...

#### qinhanmin2014 Jul 9, 2018

Member

Fine, then changing n_bins[jj] > 2 to n_bins[jj] >= 2 seems enough from my side, or sorry if I'm wrong.

#### qinhanmin2014 Jul 9, 2018

Member

Also, maybe we need a test for the edge case.
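A hypothetical sketch of such an edge-case test (illustrative names and values, not the merged test): a constant feature collapses to a single bin and transforms to all zeros.

```python
import warnings
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.ones((4, 1))  # constant feature
est = KBinsDiscretizer(n_bins=3, strategy='quantile', encode='ordinal')
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # "Feature 0 is constant ..." warning
    Xt = est.fit_transform(X)
# expected: every sample lands in bin 0, and n_bins_ is reduced to 1
```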

 >>> est = preprocessing.KBinsDiscretizer(n_bins=[3, 3, 2], encode='ordinal').fit(X) >>> est.bin_width_ array([3., 1., 2.]) >>> est = preprocessing.KBinsDiscretizer(n_bins=[3, 4, 2], encode='ordinal').fit(X)

#### qinhanmin2014 Jul 9, 2018

Member

Is it good to have n_bins > number of samples in the example? I might still prefer n_bins=[3, 3, 2].

 move constant feature check and improve logic 
 c01d3dc 

### qinhanmin2014 reviewed Jul 9, 2018

 def test_same_min_max(): @pytest.mark.parametrize('strategy', ['uniform', 'kmeans', 'quantile']) @pytest.mark.parametrize('n_bins', [2, 3]) def test_same_min_max(strategy, n_bins):

#### qinhanmin2014 Jul 9, 2018

Member

Do we test n_bins_[0] = 1 somewhere? Or do you think we don't need such a test?
Also, maybe we don't need to test multiple n_bins here?

#### TomDLT Jul 9, 2018

Author Member

n_bins = 1 is tested in arrays (test_invalid_n_bins_array) and as an integer (test_invalid_n_bins).

#### TomDLT Jul 9, 2018

Author Member

Oh I misread your comment, I'll add a line to test n_bins_.

 simpler test 
 d1d2369 

Member

### qinhanmin2014 left a comment

 LGTM if CIs are green. ping @jnothman for a final check.
 Update v0.20.rst 
 9d9f3f9 

### jnothman merged commit 5a61af9 into scikit-learn:discrete on Jul 9, 2018 (0 of 5 checks passed)

#### 0 of 5 checks passed

- LGTM analysis: Python (running analyses for revisions)
- ci/circleci: python2 (CircleCI is running your tests)
- ci/circleci: python3 (CircleCI is running your tests)
- continuous-integration/appveyor/pr (waiting for AppVeyor build to complete)
- continuous-integration/travis-ci/pr (the Travis CI build is in progress)
