ENH Add dtype parameter to KBinsDiscretizer to manage the output data type #16335
Conversation
Thanks, a few comments below.
Please add a what's new entry to doc/whats_new/v0.23.rst
LGTM besides my nitpick
dtype : data-type, default=None
    The desired data-type for the output. If None, output dtype is
    consistent with input dtype. Only np.float32 and np.float64 are
    supported.
As the behavior now matches the dtype param in OneHotEncoder, I would make the docstrings uniform by also updating the OneHotEncoder one, as suggested here. Provided it's correct.
It doesn't match the default behavior of OneHotEncoder; there was some confusion about that in the original issue. By default this preserves the dtype when it's float, while OneHotEncoder typically has categorical/string input dtype and will always produce float64 by default.
It behaves similarly to, say, StandardScaler, but the user can explicitly specify the dtype in addition.
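A quick illustration of that distinction (a standalone sketch, not code from this PR):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = np.array([[0.0], [1.0]], dtype=np.float32)

# StandardScaler preserves a float32 input dtype...
print(StandardScaler().fit_transform(X).dtype)   # float32

# ...while OneHotEncoder produces float64 by default, whatever the input.
print(OneHotEncoder().fit_transform(X).dtype)    # float64
```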
Currently, I do not touch the docstrings of OneHotEncoder because the behaviour is not exactly the same. If you want to do so, we would need a specific PR for OneHotEncoder, no?
A last batch of comments, should be good to go then. Thanks!
self.n_bins = n_bins
self.encode = encode
self.strategy = strategy
self.dtype = dtype if dtype in FLOAT_DTYPES[:2] else None
Here we should just assign it without doing any validation. We can validate it in fit.
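A minimal sketch of that pattern, using an abridged, hypothetical signature (not the PR's actual code):

```python
class KBinsDiscretizerSketch:
    """Abridged sketch: __init__ only stores parameters, fit validates them."""

    def __init__(self, n_bins=5, encode='onehot', strategy='quantile',
                 dtype=None):
        # Store the parameters verbatim; no validation at construction time.
        self.n_bins = n_bins
        self.encode = encode
        self.strategy = strategy
        self.dtype = dtype  # validated later, in fit()
```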
Done.
@@ -138,6 +145,7 @@ def fit(self, X, y=None):
        self
        """
        X = check_array(X, dtype='numeric')
        output_dtype = self.dtype if self.dtype is not None else X.dtype
Maybe let's raise a ValueError if self.dtype is not in [np.float64, np.float32, None] to say that it's not supported, instead of silently discarding it. Add a corresponding test, using for instance pytest.raises(ValueError, match="dtype=.* not supported"), to check that it is indeed raised.
After that validation we can assume that self.dtype is one of these 3 values.
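A self-contained sketch of what that fit-time validation and test could look like (hypothetical stand-in class and error message; the real PR may word things differently):

```python
import numpy as np
import pytest


class DiscretizerSketch:
    """Minimal stand-in showing the suggested fit-time dtype check."""

    def __init__(self, dtype=None):
        self.dtype = dtype

    def fit(self, X, y=None):
        # Reject anything other than the two supported float dtypes or None.
        if self.dtype not in (np.float64, np.float32, None):
            raise ValueError(
                "dtype={} not supported. Only np.float32 and np.float64 "
                "are supported.".format(self.dtype))
        # ... actual binning logic would follow here ...
        return self


def test_discretizer_rejects_unsupported_dtype():
    with pytest.raises(ValueError, match="dtype=.* not supported"):
        DiscretizerSketch(dtype=np.int32).fit(np.zeros((3, 1)))
```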
Currently, I raise an error if the dtype defined in the discretizer is wrong. However, if the dtype of the input data is "wrong" (numeric, but different from float32 or float64), we silently cast it to float64. Do you want to raise an error for the input dtype as well?
  # Fit the OneHotEncoder with toy datasets
  # so that it's ready for use after the KBinsDiscretizer is fitted
- self._encoder.fit(np.zeros((1, len(self.n_bins_)), dtype=int))
+ self._encoder.fit(np.zeros((1, len(self.n_bins_))))
I think we still may want this, not sure.
It is the dtype of the toy dataset used to fit the encoder. It only influences the dtype of the private attributes of OneHotEncoder when we fit it (if I am right). When we apply `self._encoder.transform(Xt)` (i.e. `OneHotEncoder.transform(Xt)`), the output is cast with the `dtype` parameter of the encoder. So `dtype=int` seems useless here. Plus, the tests are still green. Do you want it back?
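A quick standalone check of that claim, relying only on the public OneHotEncoder API (not a test from this PR):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# The output dtype follows the encoder's own ``dtype`` parameter,
# not the dtype of the toy data it was fitted on.
enc = OneHotEncoder(dtype=np.float32)
enc.fit(np.zeros((1, 3)))             # fitted on float64 data
out = enc.transform(np.zeros((1, 3)))
print(out.dtype)                      # float32
```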
It might be best to pass `dtype=X.dtype` for consistency.
Another review but this is just nitpicking.
Hi @Henley13, one approval already... do you think you can find some time to sync with upstream and address the comments? Thanks for your work!
I solved the conflicts and applied my suggestion. Thanks @Henley13
… type (scikit-learn#16335) Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Reference Issues/PRs
Fixes #15409.

What does this implement/fix? Explain your changes.
Add a `dtype` parameter to the class `KBinsDiscretizer` in order to cast the output of the `transform` and `inverse_transform` methods.

Any other comments?
The dtype of the output is independent of the dtype of the data used to fit `KBinsDiscretizer`. Default is `np.float64`.
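For context, a usage sketch of the proposed parameter (assuming it is exposed as a `dtype` keyword argument of `KBinsDiscretizer`, as described above):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-2.0], [-1.0], [0.5], [2.0]])  # float64 input

# Request float32 output regardless of the input dtype.
disc = KBinsDiscretizer(n_bins=2, encode='ordinal', dtype=np.float32)
Xt = disc.fit_transform(X)
print(Xt.dtype)  # float32
```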