ENH Add dtype parameter to KBinsDiscretizer to manage the output data type #16335
Conversation
Thanks, a few comments below.
Please add a what's new entry to doc/whats_new/v0.23.rst
LGTM besides my nitpick
dtype : data-type, default=None
    The desired data-type for the output. If None, output dtype is
    consistent with input dtype. Only np.float32 and np.float64 are
    supported.
As the behavior now matches the dtype param in OneHotEncoder, I would make the docstrings uniform by also updating the OneHotEncoder one, as suggested here. Provided it's correct.
It doesn't match the default behavior of OneHotEncoder; there was some confusion about that in the original issue. By default this preserves the dtype when it's float, while OneHotEncoder typically has categorical/string input dtype and will always produce float64 by default.
It behaves similarly to, say, StandardScaler, but the user can explicitly specify the dtype in addition.
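A quick illustration of that distinction (a standalone sketch, not code from this PR):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = np.array([[0.0], [1.0]], dtype=np.float32)

# StandardScaler preserves a float32 input dtype...
print(StandardScaler().fit_transform(X).dtype)   # float32

# ...while OneHotEncoder produces float64 by default, whatever the input.
print(OneHotEncoder().fit_transform(X).dtype)    # float64
```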
Currently, I do not touch the docstrings of OneHotEncoder because the behaviour is not exactly the same. If you want to do so, we would need a specific PR for OneHotEncoder, no?
A last batch of comments, should be good to go then. Thanks!
self.n_bins = n_bins
self.encode = encode
self.strategy = strategy
self.dtype = dtype if dtype in FLOAT_DTYPES[:2] else None
Here we should just assign it without doing any validation. We can validate it in fit.
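A minimal sketch of that pattern, using an abridged, hypothetical signature (not the PR's actual code):

```python
class KBinsDiscretizerSketch:
    """Abridged sketch: __init__ only stores parameters, fit validates them."""

    def __init__(self, n_bins=5, encode='onehot', strategy='quantile',
                 dtype=None):
        # Store the parameters verbatim; no validation at construction time.
        self.n_bins = n_bins
        self.encode = encode
        self.strategy = strategy
        self.dtype = dtype  # validated later, in fit()
```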
Done.
@@ -138,6 +145,7 @@ def fit(self, X, y=None):
        self
        """
        X = check_array(X, dtype='numeric')
        output_dtype = self.dtype if self.dtype is not None else X.dtype
Maybe let's raise a ValueError if self.dtype is not in [np.float64, np.float32, None] to say that it's not supported, instead of silently discarding it. Add a corresponding test, using for instance pytest.raises(ValueError, match="dtype=.* not supported"), to check that it is indeed raised.
After that validation we can assume that self.dtype is one of these 3 values.
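A self-contained sketch of what that fit-time validation and test could look like (hypothetical stand-in class and error message; the real PR may word things differently):

```python
import numpy as np
import pytest


class DiscretizerSketch:
    """Minimal stand-in showing the suggested fit-time dtype check."""

    def __init__(self, dtype=None):
        self.dtype = dtype

    def fit(self, X, y=None):
        # Reject anything other than the two supported float dtypes or None.
        if self.dtype not in (np.float64, np.float32, None):
            raise ValueError(
                "dtype={} not supported. Only np.float32 and np.float64 "
                "are supported.".format(self.dtype))
        # ... actual binning logic would follow here ...
        return self


def test_discretizer_rejects_unsupported_dtype():
    with pytest.raises(ValueError, match="dtype=.* not supported"):
        DiscretizerSketch(dtype=np.int32).fit(np.zeros((3, 1)))
```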
Currently, I raise an error if the dtype defined in the discretizer is wrong. However, if the dtype of the input data is "wrong" (numeric, but different from float32 or float64), we silently cast it to float64. Do you want to raise an error for the input dtype as well?
  # Fit the OneHotEncoder with toy datasets
  # so that it's ready for use after the KBinsDiscretizer is fitted
- self._encoder.fit(np.zeros((1, len(self.n_bins_)), dtype=int))
+ self._encoder.fit(np.zeros((1, len(self.n_bins_))))
I think we still may want this, not sure.
It is the dtype of the toy dataset used to fit the encoder. It only influences the dtype of the private attributes of OneHotEncoder when we fit it (if I am right). When we apply `self._encoder.transform(Xt)` (i.e. `OneHotEncoder.transform(Xt)`), the output is cast with the `dtype` parameter of the encoder. So `dtype=int` seems useless here. Plus, the tests are still green. Do you want it back?
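A quick standalone check of that claim, relying only on the public OneHotEncoder API (not a test from this PR):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# The output dtype follows the encoder's own ``dtype`` parameter,
# not the dtype of the toy data it was fitted on.
enc = OneHotEncoder(dtype=np.float32)
enc.fit(np.zeros((1, 3)))             # fitted on float64 data
out = enc.transform(np.zeros((1, 3)))
print(out.dtype)                      # float32
```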
It might be best to pass `dtype=X.dtype` for consistency.
Another review but this is just nitpicking.
Hi @Henley13, one approval already... do you think you can find some time to sync with upstream and address the comments? Thanks for your work!
I solved the conflicts and applied my suggestion. Thanks @Henley13
… type (scikit-learn#16335) Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Reference Issues/PRs
Fixes #15409.

What does this implement/fix? Explain your changes.
Add a `dtype` parameter to the class `KBinsDiscretizer` in order to cast the output of the `transform` and `inverse_transform` methods.

Any other comments?
The dtype of the output is independent of the dtype of the data used to fit `KBinsDiscretizer`. Default is `np.float64`.
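For context, a usage sketch of the proposed parameter (assuming it is exposed as a `dtype` keyword argument of `KBinsDiscretizer`, as described above):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-2.0], [-1.0], [0.5], [2.0]])  # float64 input

# Request float32 output regardless of the input dtype.
disc = KBinsDiscretizer(n_bins=2, encode='ordinal', dtype=np.float32)
Xt = disc.fit_transform(X)
print(Xt.dtype)  # float32
```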