
[MRG+2] KBinsDiscretizer : inverse_transform for ohe encoder #11505

Merged
merged 4 commits into scikit-learn:master on Jul 21, 2018

5 participants
@ggc87
Contributor

ggc87 commented Jul 13, 2018

This PR introduces inverse_transform in KBinsDiscretizer for encoders
other than ordinal.

KBinsDiscretizer : inverse_transform for ohe encoder [See #11489]
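
For context, a minimal usage sketch of the behavior this PR enables (the data and bin count are illustrative; the exact inverted values depend on the fitted bin edges):

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy data: a single continuous feature.
X = np.array([[-2.0], [-1.0], [0.5], [2.0]])

kbd = KBinsDiscretizer(n_bins=3, encode='onehot-dense')
Xt = kbd.fit_transform(X)   # one-hot indicator matrix, shape (4, 3)

# Previously this raised for one-hot encodings; with this PR it decodes
# the indicators back to the corresponding bin centers, shape (4, 1).
Xinv = kbd.inverse_transform(Xt)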
@ggc87


Contributor

ggc87 commented Jul 13, 2018

See #11489.

I'm not totally convinced about the testing though!

@qinhanmin2014


Member

qinhanmin2014 commented Jul 14, 2018

So the problem is that you assign self.ohe_encoder_ in transform, not in fit.

Fix testing.
Creating ohe_encoder_ in fit and preventing test_non_meta_estimators
from failing.
@ggc87


Contributor

ggc87 commented Jul 14, 2018

Sorry, this was actually very very easy to solve :).

raise ValueError("inverse_transform only supports "
                 "'encode = ordinal'. Got encode={!r} instead."
                 .format(self.encode))
Xt = self.ohe_encoder_.inverse_transform(Xt)


@qinhanmin2014

qinhanmin2014 Jul 14, 2018

Member

For inverse transform, we need to use check_array to do input validation and then check the number of features. So you'll have something like

Xinv = check_array(...)
if self.encode != 'ordinal':
    if ...:
        raise ValueError("Incorrect number of features....")
    Xinv = self.ohe_encoder_.inverse_transform(Xinv)
else:
    ...

Also, maybe we should make ohe_encoder_ a private attribute; otherwise we need to document it.


@qinhanmin2014

qinhanmin2014 Jul 14, 2018

Member

Maybe a better way is to rely on the input validation provided by OneHotEncoder.inverse_transform, so you'll have something like

if self.encode != 'ordinal':
    self.ohe_encoder_.inverse_transform
else:
    existing input validation
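
For reference, a sketch of how these two suggestions could fit together. This is hypothetical, not the merged code; it assumes the ohe_encoder_ attribute from the diff under review and reuses the bin-center logic of the existing ordinal path:

import numpy as np
from sklearn.utils import check_array

def inverse_transform(self, Xt):
    # Delegate one-hot decoding (and its input validation) to the
    # fitted OneHotEncoder; afterwards Xt holds ordinal bin indices.
    if 'onehot' in self.encode:
        Xt = self.ohe_encoder_.inverse_transform(Xt)

    Xinv = check_array(Xt, copy=True, dtype=np.float64)
    n_features = self.n_bins_.shape[0]
    if Xinv.shape[1] != n_features:
        raise ValueError("Incorrect number of features. Expecting {}, "
                         "received {}.".format(n_features, Xinv.shape[1]))

    # Map each bin index to the center of its bin.
    for jj in range(n_features):
        bin_edges = self.bin_edges_[jj]
        bin_centers = (bin_edges[1:] + bin_edges[:-1]) * 0.5
        Xinv[:, jj] = bin_centers[np.int_(Xinv[:, jj])]
    return Xinv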


@ggc87

ggc87 Jul 14, 2018

Contributor

Yes, OneHotEncoder already runs input validation, but doesn't the output of the OHE inverse_transform need to be validated as well? I mean, is there some case in which the input can pass validation for a OneHotEncoder but not for a KBinsDiscretizer? As it is now, both validations are run.

I was thinking about the private variable as well, but I have a doubt. It could be a very stupid question, but how can I run all the tests if the ohe_encoder is private? In particular:

if encode != 'ordinal':
    Xt_tmp = kbd.ohe_encoder_.inverse_transform(X2t)
else:
    Xt_tmp = X2t
assert_array_equal(Xt_tmp.max(axis=0) + 1, kbd.n_bins_)

Should I write a different test?
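
For what it's worth, a hypothetical self-contained version of that round-trip check (the data, bin count, and helper name are illustrative; ohe_encoder_ is the attribute name from the diff under review):

import numpy as np
from numpy.testing import assert_array_equal
from sklearn.preprocessing import KBinsDiscretizer

def check_bin_indices_roundtrip(encode):
    # Enough uniform samples that every bin is populated.
    X = np.random.RandomState(0).uniform(size=(20, 3))
    kbd = KBinsDiscretizer(n_bins=3, encode=encode)
    Xt = kbd.fit_transform(X)
    # Decode one-hot output back to ordinal bin indices before comparing.
    if encode != 'ordinal':
        Xt = kbd.ohe_encoder_.inverse_transform(Xt)
    assert_array_equal(Xt.max(axis=0) + 1, kbd.n_bins_)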

@@ -188,6 +188,12 @@ def fit(self, X, y=None):
        self.bin_edges_ = bin_edges
        self.n_bins_ = n_bins
        if self.encode != 'ordinal':
            encode_sparse = self.encode == 'onehot'
            self.ohe_encoder_ = OneHotEncoder(


@glemaitre

glemaitre Jul 14, 2018

Contributor

I don't think that we want this attribute to be public.
So it should be named _ohe_encoder_


@ggc87

ggc87 Jul 18, 2018

Contributor

As mentioned, I agree, and I was actually thinking of making it private. _ohe_encoder would not be a "real" private; would that be enough? If yes, that would actually solve my previous question.

@@ -188,6 +188,12 @@ def fit(self, X, y=None):
        self.bin_edges_ = bin_edges
        self.n_bins_ = n_bins
        if self.encode != 'ordinal':


@glemaitre

glemaitre Jul 14, 2018

Contributor

This condition is weird.

I would expect something like:

if 'onehot' in self.encode:
   ...


@ggc87

ggc87 Jul 18, 2018

Contributor

This if statement was actually already in the code (see line 291). Should I change it as well, to keep consistency?

Xt = kbd.fit_transform(X)
assert_array_equal(Xt.max(axis=0) + 1, kbd.n_bins_)
if encode != 'ordinal':
    Xt_tmp = kbd.ohe_encoder_.inverse_transform(Xt)


@glemaitre

glemaitre Jul 14, 2018

Contributor

Xt_tmp is not a good name :)

You can also make it inline:

Xt = kbd.ohe_encoder_.inverse_transform(Xt) if 'onehot' in encode else Xt


@ggc87

ggc87 Jul 18, 2018

Contributor

I perfectly agree on the naming; it is actually bad. I'll probably need to refactor this code a bit, since Xt is required to compute X2.
I'm not a big fan of inline statements, though, in particular when they end up as long lines.

Is it really beneficial in this case?

X2 = kbd.inverse_transform(Xt)
X2t = kbd.fit_transform(X2)
assert_array_equal(X2t.max(axis=0) + 1, kbd.n_bins_)
if encode != 'ordinal':


@glemaitre

glemaitre Jul 14, 2018

Contributor

Same as above.



@@ -188,6 +188,11 @@ def fit(self, X, y=None):
        self.bin_edges_ = bin_edges
        self.n_bins_ = n_bins
        if 'onehot' in self.encode:
            self._ohe_encoder = OneHotEncoder(


@jnothman

jnothman Jul 19, 2018

Member

Let's just call it _encoder, so that we might use the same inverse_transform code when unary encoding is available.
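
To illustrate the point (a hypothetical sketch, assuming fit stores whatever encoder it built in self._encoder): the decode step stays identical no matter which encoder produced the output.

# Works for OneHotEncoder today and e.g. a unary encoder later,
# as long as the encoder implements inverse_transform.
if self.encode != 'ordinal':
    Xt = self._encoder.inverse_transform(Xt)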

@qinhanmin2014

LGTM, please add an entry to what's new (maybe use the existing entry for KBinsDiscretizer)

@qinhanmin2014 qinhanmin2014 changed the title from KBinsDiscretizer : inverse_transform for ohe encoder [See #11489] to [MRG+2] KBinsDiscretizer : inverse_transform for ohe encoder Jul 20, 2018

@qinhanmin2014 qinhanmin2014 added this to the 0.20 milestone Jul 20, 2018

@glemaitre


Contributor

glemaitre commented Jul 20, 2018

maybe use the existing entry for KBinsDiscretizer

+1

@qinhanmin2014

Will merge when green.

@qinhanmin2014 qinhanmin2014 merged commit 61547de into scikit-learn:master Jul 21, 2018

4 of 5 checks passed

continuous-integration/appveyor/pr - Waiting for AppVeyor build to complete
ci/circleci: deploy - Your tests passed on CircleCI!
ci/circleci: python2 - Your tests passed on CircleCI!
ci/circleci: python3 - Your tests passed on CircleCI!
continuous-integration/travis-ci/pr - The Travis CI build passed
@qinhanmin2014


Member

qinhanmin2014 commented Jul 21, 2018

thx @ggc87

@ggc87


Contributor

ggc87 commented Jul 22, 2018

Thank you for your review guys :)
