
ENH Add dtype parameter to KBinsDiscretizer to manage the output data type #16335


Merged
merged 10 commits into scikit-learn:master on Jun 16, 2020

Conversation

Henley13
Contributor

Reference Issues/PRs

Fixes #15409.

What does this implement/fix? Explain your changes.

Add a dtype parameter to the class KBinsDiscretizer in order to cast the output of the transform and inverse_transform methods.

Any other comments?

The dtype of the output is independent of the dtype of the data used to fit KBinsDiscretizer. Default is np.float64.
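
For context, a minimal usage sketch of the behaviour this PR adds (assuming a scikit-learn release that includes the new parameter; the data and settings below are purely illustrative):

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-2.0], [-1.0], [0.0], [1.0], [2.0]])  # float64 input

# Request float32 output explicitly, regardless of the input dtype.
est = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform",
                       dtype=np.float32)
Xt = est.fit_transform(X)
print(Xt.dtype)  # float32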

Member
@rth rth left a comment

Thanks, a few comments below.

Please add a what's new entry to doc/whats_new/v0.23.rst.

Member
@agramfort agramfort left a comment

LGTM besides my nitpick

dtype : data-type, default=None
    The desired data-type for the output. If None, output dtype is
    consistent with input dtype. Only np.float32 and np.float64 are
    supported.
Member

As the behavior now matches the dtype param in OneHotEncoder, I would uniformize the docstrings by updating the OneHotEncoder one as well, as said here. Provided it's correct.

Member

It doesn't match the default behavior of OneHotEncoder; there was some confusion about that in the original issue. By default this preserves the dtype when it's float, while OneHotEncoder typically has categorical/string input dtype and will always produce float64 by default.

It behaves similarly to, say, StandardScaler, but the user can explicitly specify the dtype in addition.
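
A hedged illustration of that default (dtype=None) behaviour, assuming a release that includes this PR: a float32 input stays float32, much like with StandardScaler.

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler

X32 = np.array([[-2.0], [-1.0], [0.0], [1.0], [2.0]], dtype=np.float32)

print(StandardScaler().fit_transform(X32).dtype)  # float32
print(KBinsDiscretizer(n_bins=3, encode="ordinal",
                       strategy="uniform").fit_transform(X32).dtype)  # float32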

Contributor Author

Currently, I do not touch the docstrings of OneHotEncoder because the behaviour is not exactly the same. If we want to do so, we would need a specific PR for OneHotEncoder, no?

Member
@rth rth left a comment

A last batch of comments, should be good to go then. Thanks!

         self.n_bins = n_bins
         self.encode = encode
         self.strategy = strategy
+        self.dtype = dtype if dtype in FLOAT_DTYPES[:2] else None
Member

Here we should just assign it without doing any validation. We can validate it in fit.
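
A minimal sketch (not the PR's exact code) of the convention being suggested here: the constructor only stores the parameter, and any validation is deferred to fit.

import numpy as np

class DiscretizerSketch:
    def __init__(self, n_bins=5, encode="onehot", strategy="quantile", dtype=None):
        # Store constructor arguments as-is; no validation in __init__.
        self.n_bins = n_bins
        self.encode = encode
        self.strategy = strategy
        self.dtype = dtype

    def fit(self, X, y=None):
        # Validate `dtype` only here, at fit time.
        if self.dtype not in (None, np.float32, np.float64):
            raise ValueError(f"dtype={self.dtype} is not supported; "
                             "expected np.float32, np.float64 or None.")
        # ... actual binning logic would follow ...
        return self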

Contributor Author

Done.

@@ -138,6 +145,7 @@ def fit(self, X, y=None):
             self
         """
         X = check_array(X, dtype='numeric')
+        output_dtype = self.dtype if self.dtype is not None else X.dtype
Member

Maybe let's raise a ValueError if self.dtype is not in [np.float64, np.float32, None] to say that it's not supported, instead of silently discarding it. With a corresponding test, using for instance pytest.raises(ValueError, match="dtype=.* not supported") to check that it is indeed raised.

After that validation we can assume that self.dtype is one of these 3 values.
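
A hedged sketch of the kind of test being requested; the exact error message and test name are illustrative, not necessarily what was merged.

import numpy as np
import pytest
from sklearn.preprocessing import KBinsDiscretizer

def test_kbinsdiscretizer_wrong_dtype():
    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    est = KBinsDiscretizer(n_bins=2, encode="ordinal", dtype=np.int32)
    with pytest.raises(ValueError, match="dtype"):
        est.fit(X)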

Contributor Author

Currently, I raise an error if the dtype defined in the discretizer is wrong. However, if the dtype of the input data is "wrong" (numeric, but different from float32 or float64), we silently cast it to float64. Do you want to raise an error for the input dtype as well?
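
A hedged illustration of the asymmetry described above, assuming a release with this PR: a bad dtype parameter raises, while non-float numeric input is silently cast to float64.

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X_int = np.arange(6).reshape(-1, 1)  # int64: numeric, but neither float32 nor float64

Xt = KBinsDiscretizer(n_bins=3, encode="ordinal",
                      strategy="uniform").fit_transform(X_int)
print(Xt.dtype)  # float64: the integer input is cast without an error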

         # Fit the OneHotEncoder with toy datasets
         # so that it's ready for use after the KBinsDiscretizer is fitted
-        self._encoder.fit(np.zeros((1, len(self.n_bins_)), dtype=int))
+        self._encoder.fit(np.zeros((1, len(self.n_bins_))))
Member

I think we still may want this, not sure.

Contributor Author

It is the dtype of the toy dataset used to fit the encoder. It only influences the dtype of the private attributes of OneHotEncoder when we fit it (if I am right). When we apply self._encoder.transform(Xt) (i.e. OneHotEncoder.transform(Xt)), the output is cast with the dtype parameter of the encoder. So dtype=int seems useless here. Plus, the tests are still green. Do you want it back?
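
A hedged example of the point above: the output dtype of OneHotEncoder.transform is governed by the encoder's own dtype parameter, not by the dtype of the toy array it was fitted on.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(dtype=np.float32)
enc.fit(np.zeros((1, 2), dtype=int))              # fitted on an integer toy array
out = enc.transform(np.zeros((3, 2), dtype=int))  # sparse output
print(out.dtype)  # float32, set by the encoder's dtype parameter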

Member

It might be best to pass dtype=X.dtype for consistency

@rth rth added the Needs work label Feb 5, 2020
Member
@glemaitre glemaitre left a comment

Another review, but this is just nitpicking.

         # Fit the OneHotEncoder with toy datasets
         # so that it's ready for use after the KBinsDiscretizer is fitted
-        self._encoder.fit(np.zeros((1, len(self.n_bins_)), dtype=int))
+        self._encoder.fit(np.zeros((1, len(self.n_bins_))))
Member

It might be best to pass dtype=X.dtype for consistency

@cmarmo cmarmo removed the Needs Info label Mar 3, 2020
Contributor
@cmarmo commented Mar 28, 2020

Hi @Henley13, one approval already... do you think you can find some time to sync with upstream and address the comments? Thanks for your work!

@glemaitre glemaitre self-assigned this Jun 4, 2020
@glemaitre glemaitre changed the title from "[MRG] Add dtype parameter to KBinsDiscretizer" to "ENH Add dtype parameter to KBinsDiscretizer to manage the output data type" Jun 16, 2020
@glemaitre glemaitre merged commit 846d517 into scikit-learn:master Jun 16, 2020
@glemaitre
Member

I solved the conflicts and applied my suggestion. Thanks @Henley13

rubywerman pushed a commit to MLH-Fellowship/scikit-learn that referenced this pull request Jun 24, 2020
… type (scikit-learn#16335)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
viclafargue pushed a commit to viclafargue/scikit-learn that referenced this pull request Jun 26, 2020
… type (scikit-learn#16335)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
… type (scikit-learn#16335)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

Successfully merging this pull request may close these issues.

Add dtype parameter to KBinsDiscretizer
8 participants