
[MRG+1] Feature: Implement PowerTransformer #10210

Merged
56 commits merged into scikit-learn:master on Dec 5, 2017

Conversation

8 participants
@chang
Contributor

chang commented Nov 27, 2017

Reference Issues/PRs

Fixes #6675
Fixes #6781

What does this implement/fix? Explain your changes.

This PR implements sklearn.preprocessing.PowerTransformer. Power transforms are a family of monotonic, parametric transformations used to map skewed distributions to as close to Gaussian as possible. This could be useful for models that require homoscedasticity, or in any other situation where normality is desirable.

At the moment, only the Box-Cox transform is supported, which requires strictly positive data. The optimal parameters for stabilizing variance and minimizing skewness are determined using maximum likelihood, and the transformation is applied to the dataset feature-wise.
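A minimal usage sketch of what this provides (assuming the PowerTransformer API as merged, with a method='box-cox' keyword and a fitted lambdas_ attribute - both names are assumptions for illustration):

```python
# Box-Cox maps x -> (x**lmbda - 1) / lmbda for lmbda != 0, and log(x) for
# lmbda == 0; lmbda is estimated per feature by maximum likelihood.
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(100, 2))    # skewed, strictly positive data

pt = PowerTransformer(method='box-cox')
X_gauss = pt.fit_transform(X)       # each column gets its own fitted lambda

print(pt.lambdas_)                  # one lambda per feature
```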

Any other comments?

We will consider implementing the Yeo-Johnson transform - a power transformation that can be applied to negative data - in a future PR.

Thanks to @maniteja123 for kicking it off!

maniteja123 and others added some commits May 14, 2016

TST: Fix CI test errors by making data nonzero in check_estimators.py when Box-Cox is being tested. Fix docstring test failure.

@chang changed the title from [MRG] Implement BoxCoxTransformer to [WIP] Implement BoxCoxTransformer on Nov 27, 2017

@jnothman
Member

jnothman commented Dec 4, 2017

Actually a comment of my own on plot_power_transformer: it might be useful to put all the inputs onto roughly the same scale (perhaps using minmax_scale(feature_range=(1e-12, 1)))...?

@jnothman
Member

jnothman commented Dec 4, 2017

Also, I wonder whether there's any use in reporting the model likelihood as a goodness-of-fit measure...
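Not part of this PR, but as a sketch of how such a number could be obtained, scipy already exposes the Box-Cox log-likelihood directly:

```python
# Hypothetical goodness-of-fit report: evaluate the Box-Cox log-likelihood at
# the lambda that maximum likelihood selected for a (single) feature.
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
x = rng.lognormal(size=500)          # one strictly positive feature

x_bc, lmbda = stats.boxcox(x)        # lambda fitted by maximum likelihood
print(stats.boxcox_llf(lmbda, x))    # log-likelihood of that lambda
```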

@chang
Contributor

chang commented Dec 5, 2017

Thanks for the doc fix @glemaitre. Good suggestion on normalizing the distributions, Joel - I used minmax_scale(X, feature_range=(1e-10, 10)).
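For illustration, a sketch of the rescaling mentioned above (the data here is made up; only the minmax_scale call matches the comment):

```python
# Rescale each feature to a common, strictly positive range before Box-Cox,
# since the transform requires positive inputs and the example features have
# very different scales.
import numpy as np
from sklearn.preprocessing import minmax_scale

rng = np.random.RandomState(0)
X = np.c_[rng.lognormal(size=100), 50 + 5 * rng.chisquare(df=3, size=100)]

X_scaled = minmax_scale(X, feature_range=(1e-10, 10))
print(X_scaled.min(axis=0), X_scaled.max(axis=0))  # each column in [1e-10, 10]
```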


@chang
Contributor

chang commented Dec 5, 2017

Fixed the issues - thanks!

@glemaitre

My last nitpicks. @jnothman I am fine to merge.

@amueller
Member

amueller commented Dec 5, 2017

Great work @ericchang00!

@amueller
Member

amueller commented Dec 5, 2017

I'm still confused as to how maximum likelihood relates to skewness. The Wikipedia article on Box-Cox doesn't mention skew... Is it just that empirically it decreases skew, or is there some more formal statement?

@chang
Contributor

chang commented Dec 5, 2017

Added the final tweaks.

@amueller, I think the 'skewness' vocabulary came from an earlier review. It's more of an empirical observation - the main purpose of Box-Cox is to make the data normal and stabilize variance. Skewness does not necessarily imply higher variance, but it does imply non-normality, so the description still makes sense, IMO.

edit: fixed flake8 error
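A quick empirical illustration of that point (a sketch using scipy directly, not code from this PR):

```python
# The MLE-fitted Box-Cox transform of a skewed sample typically ends up with
# skewness close to zero, even though skewness is not what the likelihood
# optimizes directly.
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
x = rng.lognormal(size=1000)             # strongly right-skewed, positive

x_bc, lmbda = stats.boxcox(x)            # lambda chosen by maximum likelihood
print(stats.skew(x), stats.skew(x_bc))   # large positive skew vs. ~0
```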

@amueller
Member

amueller commented Dec 5, 2017

Maybe not in this PR, but a direct comparison against quantile transformer would be nice, right?
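A rough sketch of what such a comparison could look like (hypothetical example, not part of this PR): both transformers aim at Gaussian-looking output, but QuantileTransformer does so non-parametrically.

```python
# Transform the same skewed feature with both estimators for side-by-side
# inspection (e.g. via histograms or probability plots).
import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(1000, 1))

X_pt = PowerTransformer(method='box-cox').fit_transform(X)
X_qt = QuantileTransformer(output_distribution='normal',
                           n_quantiles=100).fit_transform(X)
```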

@amueller

LGTM. Green button on green CI?

@chang
Contributor

chang commented Dec 5, 2017

Agreed - comparison with quantile transformer + a linear model example for a future PR. Looks like we're good to go :)

@jnothman merged commit 62e9bb8 into scikit-learn:master on Dec 5, 2017

6 checks passed

ci/circleci - Your tests passed on CircleCI!
codecov/patch - 100% of diff hit (target 96.13%)
codecov/project - 96.14% (+<.01%) compared to 78ccdd1
continuous-integration/appveyor/pr - AppVeyor build succeeded
continuous-integration/travis-ci/pr - The Travis CI build passed
lgtm analysis: Python - No alert changes
@jnothman
Member

jnothman commented Dec 5, 2017

Congrats @ericchang00 and @maniteja123

@amueller
Member

amueller commented Dec 5, 2017

Sweeeet!

@chang
Contributor

chang commented Dec 5, 2017

Awesome thanks so much guys! This is very exciting :)

@jnothman
Member

jnothman commented Dec 5, 2017

Btw, @amueller, what's your opinion on having a standardize parameter to centre and scale the output of PowerTransformer? After all, #6675 does describe Box-Cox as reshaping the data into a standard normal.
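For context, a sketch of what such a standardize option would amount to, assuming it simply centres and scales the Box-Cox output (equivalent to composing with StandardScaler; the option name and default here are assumptions):

```python
# Hypothetical illustration of the proposed behaviour: power-transform, then
# remove the mean and scale to unit variance, feature by feature.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler

rng = np.random.RandomState(0)
X = rng.lognormal(size=(200, 3))

pipe = make_pipeline(PowerTransformer(method='box-cox'), StandardScaler())
X_std = pipe.fit_transform(X)
print(X_std.mean(axis=0).round(3), X_std.std(axis=0).round(3))  # ~0 and ~1
```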

@amueller
Member

amueller commented Dec 8, 2017

I think it would be good, and we might have it on by default. I don't think it'll surprise anyone, and it'll make things easier. I can't really imagine a case when it would be a bad idea.
