[MRG+1] Feature: Implement PowerTransformer #10210
Conversation
…_ to store Box-Cox parameters
… when Box-Cox is being tested. Fix docstring test failure.
Thanks for the doc fix @glemaitre. Good suggestion on normalizing the distributions, Joel - I used minmax_scale(X, feature_range=(1e-10, 10)).
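As a minimal sketch of that normalization step (the random stand-in data here is an assumption; the actual test setup is in the PR diff):

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

# Box-Cox requires strictly positive input, so squash each feature
# into a small positive range before fitting the transformer.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))                       # arbitrary stand-in data
X_pos = minmax_scale(X, feature_range=(1e-10, 10))  # strictly positive now
assert X_pos.min() > 0
```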
There are a couple of other small things @glemaitre requested that are unaddressed as far as I can tell.
…s in plot example
Force-pushed from 3b0d703 to 234d69d
Fixed the issues - thanks!
My last nitpicks. @jnothman I am fine to merge.
sklearn/preprocessing/data.py
Outdated
power_transform : Equivalent function without the estimator API.

QuantileTransformer : Maps data to a standard normal distribution with
    the parameter output_distribution='normal'.
`output_distribution='normal'`
Are you suggesting backticks? That's not obvious from the rendering ;)
Oops ... thanks for pointing this out :)
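For reference, a minimal sketch of the alternative that the docstring cross-references (the exponential test data is an assumption):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)
X = rng.exponential(size=(1000, 1))  # heavily skewed input

# Non-parametric, rank-based mapping to a standard normal output,
# in contrast to the parametric Box-Cox transform.
qt = QuantileTransformer(output_distribution='normal')
X_gauss = qt.fit_transform(X)
```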
sklearn/preprocessing/data.py
Outdated
API (as part of a preprocessing :class:`sklearn.pipeline.Pipeline`).

quantile_transform : Maps data to a standard normal distribution with
    the parameter output_distribution='normal'.
`output_distribution='normal'`
Great work @ericchang00!
    'font.size': 6,
    'hist.bins': 150
}
matplotlib.rcParams.update(params)
Is that a good idea? Depending on how careful sphinx-gallery is with global state, I feel this could go wrong? Or for people copy-pasting the example?
Removed the global parameter setting.
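A sketch of the non-global alternative (plain matplotlib; the lognormal sample is a stand-in), so sphinx-gallery and copy-pasters are unaffected:

```python
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

x = np.random.RandomState(0).lognormal(size=1000)  # stand-in data

# Scope the style to this block instead of mutating the global rcParams.
with matplotlib.rc_context({'font.size': 6, 'hist.bins': 150}):
    plt.hist(x)  # picks up hist.bins from the scoped rc settings
    plt.show()
```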
params = {
    'font.size': 6,
    'hist.bins': 150
Maybe slightly fewer bins would make it clearer?
doc/modules/preprocessing.rst
Outdated
In many modeling scenarios, normality of the features in a dataset is desirable.
Power transforms are a family of parametric, monotonic transformations that aim
to map data from any distribution to as close to a Gaussian distribution as
possible in order to minimize skewness.
Do all power transformations aim to minimize skewness? (I actually don't know)
Good point - it might be clearer as 'minimize skewness and stabilize variance'.
sklearn/preprocessing/data.py
Outdated
that are applied to make data more Gaussian-like. This is useful for
modeling issues related to heteroscedasticity (non-constant variance),
or other situations where normality is desired. Note that power
transforms do not result in standard normal distributions.
(i.e. mean might be far from zero and standard deviation not one?)
Exactly! added
I meant maybe say that explicitly ;)
I'm still confused as to how maximum likelihood relates to skewness. The Wikipedia article on Box-Cox doesn't mention skew... Is it just that empirically it decreases skew, or is there some more formal statement?
Added the final tweaks. @amueller, I think the 'skewness' vocabulary came from an earlier review. It's more of an empirical observation - the main purpose of Box-Cox is to make data normal and stabilize variance. Skewness does not necessarily imply higher variance, but it does imply non-normality, so the description still makes sense, IMO. (edit: fixed flake8 error)
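One way to check that empirical claim with scipy directly (`boxcox` fits lambda by maximum likelihood and returns it alongside the transformed data; the lognormal sample is a stand-in):

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
x = rng.lognormal(size=1000)             # strongly right-skewed sample

x_bc, lmbda = stats.boxcox(x)            # lambda chosen by maximum likelihood
print(stats.skew(x), stats.skew(x_bc))   # skew moves toward 0 after the transform
```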
Force-pushed from e67e5f6 to 6653100
Force-pushed from 2b55109 to c303ae6
Maybe not in this PR, but a direct comparison against quantile transformer would be nice, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Green button on green CI?
Agreed - comparison with quantile transformer + a linear model example for a future PR. Looks like we're good to go :)
Congrats @ericchang00 and @maniteja123
Sweeeet!
Awesome, thanks so much guys! This is very exciting :)
I think it would be good, and we might have it on by default. I don't think it'll surprise anyone, and it'll make things easier. I can't really imagine a case when it would be a bad idea.
So let us try to get a default standardisation in before the next release.
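Until a default standardisation lands, a sketch of the explicit version, chaining with StandardScaler in a Pipeline as the docstring suggests (the `method='box-cox'` spelling and the random data are assumptions):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler

rng = np.random.RandomState(0)
X = rng.lognormal(size=(100, 2))  # strictly positive, as Box-Cox requires

# Box-Cox output is Gaussian-like but not standard normal, so
# standardize explicitly after the power transform.
pipe = Pipeline([
    ('power', PowerTransformer(method='box-cox')),
    ('scale', StandardScaler()),
])
X_std = pipe.fit_transform(X)
print(X_std.mean(axis=0), X_std.std(axis=0))  # ~0 and ~1 per feature
```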
Reference Issues/PRs
Fixes #6675
Fixes #6781
What does this implement/fix? Explain your changes.
This PR implements `sklearn.preprocessing.PowerTransformer`. Power transforms are a family of monotonic, parametric transformations used to transform skewed distributions to as close to Gaussian as possible. This could be useful for models that require homoscedasticity, or any other situations where normality is desirable.

At the moment, only the Box-Cox transform is supported, which requires strictly positive data. The optimal parameters for stabilizing variance and minimizing skewness are determined using maximum likelihood, and the transformation is applied to the dataset feature-wise.
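A minimal usage sketch (the random data, the `method='box-cox'` spelling, and the `lambdas_` attribute name for the stored Box-Cox parameters are assumptions here):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(100, 3))  # Box-Cox needs strictly positive data

pt = PowerTransformer(method='box-cox')
X_gauss = pt.fit_transform(X)     # one lambda fitted per feature by MLE
print(pt.lambdas_)                # fitted Box-Cox parameters, shape (n_features,)
```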
Any other comments?
We will consider implementing the Yeo-Johnson transform - a power transformation that can be applied to negative data - in a future PR.
Thanks to @maniteja123 for kicking it off!