
[MRG] ENH Remove ignored_features in KBinsDiscretizer #11467

Merged: 9 commits, merged Jul 11, 2018

3 participants

qinhanmin2014 commented Jul 10, 2018

Now we have ColumnTransformer, so we don't need to support ignored_features in KBinsDiscretizer (as we've done in OneHotEncoder). See:
#9342 (comment)
#9342 (comment)

I think this solution will work, but there are still things to consider:
(1) Is it good to use _transform_selected? I think it's acceptable: with the default selected="all", it simply validates the input with check_array and returns directly.
(2) Should we support inverse_transform for encoders other than ordinal? One way is to store the fitted OneHotEncoder (or simply build a new one and set categories_, so that it can support inverse_transform without fitting). The other is to borrow some code from OneHotEncoder's inverse_transform.
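To illustrate point (2): inverting a one-hot encoding does not strictly need a fitted encoder, since an argmax over each feature's block of columns recovers the bin index. A minimal numpy sketch (the categories and values here are hypothetical, not from the PR):

```python
import numpy as np

# Hypothetical example: one feature with 3 bins, one-hot encoded.
categories = np.array([0, 1, 2])
ordinal = np.array([2, 0, 1])                 # ordinal-encoded bin indices
onehot = np.eye(len(categories))[ordinal]     # one-hot matrix, shape (3, 3)

# Inverse without a fitted encoder: argmax recovers each bin index.
recovered = categories[onehot.argmax(axis=1)]
assert np.array_equal(recovered, ordinal)
```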

qinhanmin2014 added some commits Jul 10, 2018

scikit-learn deleted a comment from sklearn-lgtm Jul 10, 2018

qinhanmin2014 commented Jul 10, 2018

ping @jnothman @TomDLT ready for review :)

jnothman commented Jul 10, 2018

What is the benefit of using _transform_selected?

jnothman commented Jul 10, 2018

I'd be fine with storing the fitted encoder to inverse_transform. But that's an enhancement to consider later

qinhanmin2014 commented Jul 11, 2018

What is the benefit of using _transform_selected?

No apparent benefit (maybe just avoiding some duplicate code).

Xt = _transform_selected(X, self._transform, self.dtype, copy=True, retain_order=True)

is the same as

X = check_array(X, accept_sparse='csc', copy=True, dtype=FLOAT_DTYPES)
Xt = self._transform(X)
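For readers unfamiliar with the helper, the idea behind `_transform_selected` can be sketched in plain numpy (simplified, hypothetical signature; the real helper also handles sparse input and column ordering):

```python
import numpy as np

def transform_selected(X, transform, selected="all"):
    # Simplified, hypothetical sketch of the private helper: validate the
    # input, apply `transform` to the selected columns only, and pass the
    # remaining columns through unchanged.
    X = np.asarray(X, dtype=float)       # stands in for check_array
    if selected == "all":
        return transform(X)              # degenerate case: validate + transform
    mask = np.zeros(X.shape[1], dtype=bool)
    mask[selected] = True
    Xt = X.copy()
    Xt[:, mask] = transform(X[:, mask])
    return Xt

X = np.array([[1.0, 10.0], [2.0, 20.0]])
double = lambda A: 2 * A
out_all = transform_selected(X, double)                 # every column doubled
out_sel = transform_selected(X, double, selected=[0])   # only column 0 doubled
```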

qinhanmin2014 added some commits Jul 11, 2018

qinhanmin2014 left a comment

ping @jnothman ready for review

Xt = _transform_selected(X, self._transform, self.dtype,
                         self.transformed_features_, copy=True,
                         retain_order=True)
X = check_array(X, dtype=FLOAT_DTYPES)

qinhanmin2014 (Author) Jul 11, 2018

We use X = check_array(X, accept_sparse='csc', copy=True, dtype=FLOAT_DTYPES) in _transform_selected.
Here I removed accept_sparse because we don't support sparse input, and copy because I didn't find it useful. I can't remove dtype because the result would change.

TomDLT (Member) Jul 11, 2018

check_array is already called in _validate_X_post_fit.

Also, copy=True is necessary since X is modified in place.
This should probably be tested.
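Such a non-regression test could look like the following numpy-only sketch, where `discretize` is a hypothetical stand-in for the fitted transformer's `transform`:

```python
import numpy as np

def discretize(X, bin_edges):
    # Hypothetical stand-in for a fitted transform: copy first, then write
    # bin indices into the copy, so the caller's array stays untouched.
    Xt = X.copy()                        # the copy=True behaviour under test
    for j, edges in enumerate(bin_edges):
        Xt[:, j] = np.digitize(X[:, j], edges[1:-1])
    return Xt

X = np.array([[0.0], [1.5], [3.0]])
X_before = X.copy()
Xt = discretize(X, [np.array([0.0, 1.0, 2.0, 3.0])])

# transform must not modify its input in place
assert np.array_equal(X, X_before)
```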

TomDLT (Member) Jul 11, 2018

I also realize that self.dtype is not used anymore.
We can probably remove it completely.


jnothman left a comment

LGTM. Thank you

@@ -117,16 +109,18 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
np.concatenate([-np.inf, bin_edges_[i][1:-1], np.inf])
You can combine ``KBinsDiscretizer`` with ``ColumnTransformer`` if you

jnothman (Member) Jul 11, 2018

Probably best to use a :class: reference here


qinhanmin2014 commented Jul 11, 2018

ping @TomDLT for a second review if you have time :)

@qinhanmin2014 qinhanmin2014 added this to the 0.20 milestone Jul 11, 2018


TomDLT left a comment

This is much cleaner, thanks!
Just a few changes are necessary.


qinhanmin2014 added some commits Jul 11, 2018

TomDLT approved these changes Jul 11, 2018

LGTM


qinhanmin2014 commented Jul 11, 2018

Thanks @TomDLT for the review. (I was waiting for CI and wasn't expecting such a quick response :))

check_array is already called in _validate_X_post_fit

I chose to remove _validate_X_post_fit. First, this avoids a duplicate check_array in transform. Second, _validate_X_post_fit blocks us from supporting inverse_transform for encoders other than ordinal.

Also, copy=True is necessary since X is modified inplace. This should probably be tested.

Thanks, updated with a test. (It seems there's no such test in the common tests, though I'm not 100% sure.)

I also realize that self.dtype is not used anymore.

Thanks, removed.

I also let KBinsDiscretizer go through the common tests :)

qinhanmin2014 commented Jul 11, 2018

CIs are green. ping @jnothman for a final check or @TomDLT if you're confident enough to merge directly. Thanks :)

if self.encode != 'ordinal':
    raise ValueError("inverse_transform only supports "
                     "'encode = ordinal'. Got encode={!r} instead."
                     .format(self.encode))

Xt = self._validate_X_post_fit(Xt)
trans = self.transformed_features_
Xt = check_array(Xt, dtype='numeric')
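For context, the ordinal inverse_transform maps each bin index back to the centre of its bin. A numpy sketch with hypothetical bin edges for a single feature:

```python
import numpy as np

# Hypothetical edges for one feature; bin centres are the midpoints.
bin_edges = np.array([0.0, 1.0, 2.0, 3.0])
centers = (bin_edges[:-1] + bin_edges[1:]) / 2   # [0.5, 1.5, 2.5]

Xt = np.array([[0.0], [2.0], [1.0]])             # ordinal-encoded column
Xinv = centers[Xt[:, 0].astype(int)]             # [0.5, 2.5, 1.5]
```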

TomDLT (Member) Jul 11, 2018

We should probably use FLOAT_DTYPES here, since we modify Xt in place and may want to put floats in it.
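The dtype concern is concrete: assigning a float into an integer array silently truncates it, which is why validating to a float dtype matters before in-place writes. A small sketch:

```python
import numpy as np

Xt_int = np.array([[1], [2]])        # integer dtype, e.g. raw user input
Xt_int[0, 0] = 0.5                   # silently truncated to 0

Xt_float = np.array([[1.0], [2.0]])  # validated to a float dtype
Xt_float[0, 0] = 0.5                 # stored exactly as 0.5
```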


bin_edges = self.bin_edges_[trans]
for jj in range(X.shape[1]):
Xt = X.copy()

TomDLT (Member) Jul 11, 2018

Why not use the copy parameter of check_array? It would avoid a double copy when X is not a float array.

This also applies to inverse_transform.
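The double-copy point can be sketched in numpy: a dtype-changing conversion already allocates a fresh array, so an unconditional `.copy()` afterwards duplicates work. `as_float_copy` below is a hypothetical helper illustrating the copy-only-if-needed idea behind `check_array(..., copy=True)`:

```python
import numpy as np

X_int = np.arange(4).reshape(2, 2)       # not float: conversion must copy
X_float = X_int.astype(float)            # already float: no conversion needed

# asarray with a dtype change already returns a fresh array ...
A = np.asarray(X_int, dtype=float)
assert not np.shares_memory(A, X_int)

# ... so an unconditional X.copy() afterwards would be a second copy.
# Copy-only-if-needed, the idea behind check_array(..., copy=True):
def as_float_copy(X):
    Y = np.asarray(X, dtype=float)
    return Y.copy() if np.shares_memory(Y, X) else Y

B = as_float_copy(X_float)               # exactly one copy either way
assert not np.shares_memory(B, X_float)
```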

TomDLT approved these changes Jul 11, 2018


qinhanmin2014 commented Jul 11, 2018

Thanks @TomDLT. Comments addressed. I need to have a more thorough understanding of our check_array :)

@TomDLT TomDLT merged commit e4089d1 into scikit-learn:discrete Jul 11, 2018

8 checks passed:

- LGTM analysis: Python: no alert changes
- ci/circleci: deploy: tests passed on CircleCI
- ci/circleci: python2: tests passed on CircleCI
- ci/circleci: python3: tests passed on CircleCI
- codecov/patch: 96.96% of diff hit (target 95.33%)
- codecov/project: 95.33% (+<.01%) compared to d1e2615
- continuous-integration/appveyor/pr: AppVeyor build succeeded
- continuous-integration/travis-ci/pr: the Travis CI build passed

qinhanmin2014 deleted the qinhanmin2014:remove-ignored-features branch Jul 12, 2018
