[MRG] Add Yeo-Johnson transform to PowerTransformer #11520
Conversation
The issue was from an error in the log-likelihood function.
Lambda param estimation should be fixed now, thanks @amueller. Replication of this example with Yeo-Johnson instead of Box-Cox:
Need to write inverse_transform to continue
# get rid of them to compute them.
_, lmbda = stats.boxcox(col[~np.isnan(col)], lmbda=None)
col_trans = boxcox(col, lmbda)
else:  # neo-johnson
amueller
Jul 15, 2018
Member
yeo?
ogrisel
Jul 15, 2018
Member
Follow the white rabbit.
amueller
Jul 15, 2018
Member
this took me a while.
We think it's working now, right? So we need a test for the optimization, and then documentation and adding it to an example?
# when x >= 0
if lmbda < 1e-19:
    out[pos] = np.log(x[pos] + 1)
else:  #lmbda != 0
amueller
Jul 15, 2018
Member
space after #
n = x.shape[0]

# Estimated mean and variance of the normal distribution
mu = psi.sum() / n
amueller
Jul 15, 2018
Member
do we need `from __future__ import division`?
NicolasHug
Jul 15, 2018
Author
Member
it's here already
amueller
Jul 15, 2018
Member
thanks, was hard to see from the diff and I was lazy ;)
@@ -2076,7 +2078,7 @@ def test_power_transformer_strictly_positive_exception():
                          pt.fit, X_with_negatives)

     assert_raise_message(ValueError, not_positive_message,
-                         power_transform, X_with_negatives)
+                         power_transform, X_with_negatives, 'box-cox')
amueller
Jul 15, 2018
Member
why is this needed? The default value shouldn't change, right? Or do we want to start a cycle to change the default to yeo-johnson?
NicolasHug
Jul 15, 2018
Author
Member
I find it clearer and more explicit?
I don't know if we'll change the default, but it should still be fine as PowerTransformer hasn't been released yet AFAIK
amueller
Jul 15, 2018
Member
good point. We should discuss before the release. I think yeo-johnson would make more sense.
ogrisel
Jul 15, 2018
Member
The fact that Yeo-Johnson accepts negative values while Box-Cox does not makes me feel like we should use it by default. From a usability point of view, it's nicer to our users.
NicolasHug
Jul 15, 2018
Author
Member
I have the same feeling. Plus, it is designed to be a generalization of Box-Cox, even though that's not strictly the case.
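For reference, a minimal sketch of the Yeo-Johnson forward transform as defined in Yeo & Johnson (2000); the function name and zero/two thresholds are illustrative choices, not necessarily the PR's implementation:

import numpy as np

def yeo_johnson(x, lmbda):
    # Piecewise in the sign of x; for x >= 0 it matches Box-Cox applied
    # to x + 1, which is why it is seen as a generalization of Box-Cox.
    out = np.empty_like(x, dtype=float)
    pos = x >= 0
    if abs(lmbda) > 1e-19:  # lambda != 0
        out[pos] = ((x[pos] + 1) ** lmbda - 1) / lmbda
    else:  # lambda == 0
        out[pos] = np.log1p(x[pos])
    if abs(lmbda - 2) > 1e-19:  # lambda != 2
        out[~pos] = -((1 - x[~pos]) ** (2 - lmbda) - 1) / (2 - lmbda)
    else:  # lambda == 2
        out[~pos] = -np.log1p(-x[~pos])
    return out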
amueller
Jul 15, 2018
Member
+1
NicolasHug
Jul 15, 2018
Author
Member
shall I change the default then?
amueller
Jul 15, 2018
Member
think so.
NicolasHug
Jul 15, 2018
Author
Member
Done
Also added a related test
pt = PowerTransformer(method=method, standardize=False)
pt.lambdas_ = [lmbda]
X_inv = pt.inverse_transform(X)
pt.lambdas_ = [9999]  # just to make sure
ogrisel
Jul 15, 2018
Member
Why not:
del pt.lambdas_
ogrisel
Jul 15, 2018
Member
Alternatively, create a new `pt` object from scratch to make the motivation of the test easier to read:
ground_truth_transform = PowerTransformer(method=method, standardize=False)
ground_truth_transform.lambdas_ = [lmbda]
X_inv = ground_truth_transform.inverse_transform(X)
estimated_transform = PowerTransformer(method=method, standardize=False)
X_inv_trans = estimated_transform.fit_transform(X_inv)
X_inv_trans = pt.fit_transform(X_inv)

assert_almost_equal(0, np.linalg.norm(X - X_inv_trans) / n_samples,
                    decimal=2)
ogrisel
Jul 15, 2018
Member
Please also add an assertion that checks that `X_inv_trans.mean(axis=0)` is close to `[0.]` and `X_inv_trans.std(axis=0)` is close to `[1.]`.
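A sketch of the suggested assertions, assuming `assert_almost_equal` from `numpy.testing` and a loose illustrative tolerance:

from numpy.testing import assert_almost_equal

# X was drawn from a standard normal, so the round-tripped data should
# have a sample mean close to 0 and a sample std close to 1.
assert_almost_equal(X_inv_trans.mean(axis=0), [0.], decimal=1)
assert_almost_equal(X_inv_trans.std(axis=0), [1.], decimal=1)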
rng = np.random.RandomState(0)
n_samples = 1000
X = rng.normal(size=(n_samples, 1))
ogrisel
Jul 15, 2018
Member
to make the test more explicit you can write: `X = rng.normal(loc=0., scale=1., size=(n_samples, 1))`
If we want this to be the default then this is a blocker, right?
lmbda_no_nans = pt.lambdas_[0]

# concat nans at the end and check lambda stays the same
X = np.concatenate([X, np.full_like(X, np.nan)])
ogrisel
Jul 15, 2018
Member
To make sure that the location of the NaNs does not impact the estimation:
from sklearn.utils import shuffle
...
X = np.concatenate([X, np.full_like(X, np.nan)])
X = shuffle(X, random_state=0)
NicolasHug
Jul 15, 2018
Author
Member
Done
@pytest.mark.parametrize("method, lmbda", [('box-cox', .5),
                                           ('yeo-johnson', .1)])
ogrisel
Jul 15, 2018
Member
Could you add more values for `lmbda` for each method? E.g.:
[
    ('box-cox', .1),
    ('box-cox', .5),
    ('yeo-johnson', .1),
    ('yeo-johnson', .5),
    ('yeo-johnson', 1.),
]
applied to six different probability distributions: Lognormal, Chi-squared,
Weibull, Gaussian, Uniform, and Bimodal.
The power transform is useful as a transformation in modeling problems where
homoscedasticity and normality are desired. Below are examples of Box-Cox and
ogrisel
Jul 15, 2018
Member
I don't understand what "modeling problems where homoscedasticity is desired" means in this context: to me, heteroscedasticity is a property of the noise of the output variable that is not the same for different regions of the input space in a conditional model.
It does not seem trivial how a power transform can improve homoscedasticity.
ogrisel
Jul 15, 2018
Member
Actually, this statement seems to be correct:
http://article.sapub.org/10.5923.j.ajms.20180801.02.html
It might be interesting to try to come up with a good example to show this corrective effect in a (maybe synthetic) linear regression problem. However, this is probably outside of the scope of the current PR.
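For the record, one possible synthetic setup along the lines ogrisel describes; the data-generating choices here are assumptions for illustration, not part of the PR:

import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
X = rng.uniform(1, 10, size=(1000, 1))
# Multiplicative noise: residual spread around the trend grows with X,
# so the raw target is heteroscedastic.
y = X.ravel() * np.exp(rng.normal(scale=0.5, size=1000))
# Power-transforming the target compresses the large values, making the
# residual spread more uniform across the range of X.
y_trans = PowerTransformer(method='yeo-johnson').fit_transform(
    y.reshape(-1, 1)).ravel()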
Tagged for 0.20 and added the blocker label. I don't like that we keep adding stuff, but if we want to make it the default we should do it now.
A couple of open comments.
The power transform method. Currently, 'box-cox' (Box-Cox transform)
is the only option available.
method : str, (default='yeo-johnson')
The power transform method. Available methods are 'box-cox' and
glemaitre
Jul 16, 2018
Contributor
We can maybe have a bullet point list for each method referring to the reference section.
@@ -2490,12 +2494,18 @@ def fit(self, X, y=None):
    self.lambdas_ = []
    transformed = []

    opt_fun = {'box-cox': self._box_cox_optimize,
glemaitre
Jul 16, 2018
Contributor
I would have expected `func` instead of `fun` :)
glemaitre
Jul 16, 2018
Contributor
optim_function
opt_fun = {'box-cox': self._box_cox_optimize,
           'yeo-johnson': self._yeo_johnson_optimize
           }[self.method]
trans_fun = {'box-cox': boxcox,
glemaitre
Jul 16, 2018
Contributor
probably `transform_function` is not too long a name to use
return x_inv

def _yeo_johnson_transform(self, x, lmbda):
glemaitre
Jul 16, 2018
Contributor
can we not just define the forward transform and take `1 / _yeo_johnson_transform` for the inverse?
NicolasHug
Jul 16, 2018
Author
Member
The inverse here means f^{-1}, not 1 / f
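A minimal sketch of that functional inverse, solving each branch of the forward transform for x (names and thresholds are illustrative, not the PR's code):

import numpy as np

def yeo_johnson_inverse(x, lmbda):
    # The forward transform preserves sign, so split on the sign of x.
    out = np.empty_like(x, dtype=float)
    pos = x >= 0
    if abs(lmbda) > 1e-19:  # inverts ((x + 1)**lmbda - 1) / lmbda
        out[pos] = (x[pos] * lmbda + 1) ** (1 / lmbda) - 1
    else:  # inverts log1p(x)
        out[pos] = np.expm1(x[pos])
    if abs(lmbda - 2) > 1e-19:  # inverts -((1 - x)**(2 - lmbda) - 1) / (2 - lmbda)
        out[~pos] = 1 - (1 - (2 - lmbda) * x[~pos]) ** (1 / (2 - lmbda))
    else:  # inverts -log1p(-x)
        out[~pos] = -np.expm1(-x[~pos])
    return out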
"""Return the negative log likelihood of the observed data x as a | ||
function of lambda.""" | ||
psi = self._yeo_johnson_transform(x, lmbda) | ||
n = x.shape[0] |
glemaitre
Jul 16, 2018
Contributor
n_samples instead
"""Return the negative log likelihood of the observed data x as a | ||
function of lambda.""" | ||
psi = self._yeo_johnson_transform(x, lmbda) | ||
n = x.shape[0] |
glemaitre
Jul 16, 2018
Contributor
Uhm, missing x most probably
glemaitre
Jul 16, 2018
Contributor
Oh I see, can we pass x as an argument as well as in the optimize function?
NicolasHug
Jul 16, 2018
Author
Member
yes, this is a nested function so x is implicitly passed anyway
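For context, a sketch of the quantity being minimized, i.e. the negative Yeo-Johnson profile log-likelihood; `yeo_johnson` refers to the forward-transform sketch earlier in the thread, and constants independent of lambda are dropped:

import numpy as np

def yeo_johnson_nll(x, lmbda):
    n_samples = x.shape[0]
    psi = yeo_johnson(x, lmbda)  # transformed data, assumed ~ Gaussian
    var = np.var(psi)            # MLE of the Gaussian variance
    loglike = -n_samples / 2 * np.log(var)
    # Jacobian term of the transform.
    loglike += (lmbda - 1) * np.sum(np.sign(x) * np.log1p(np.abs(x)))
    return -loglike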
# Estimated mean and variance of the normal distribution
mu = psi.sum() / n
sig_sq = np.power(psi - mu, 2).sum() / n
glemaitre
Jul 16, 2018
Contributor
Stupid question: is `sig_sq` the variance? If this is the case, you might want to call it `var`
NicolasHug
Jul 16, 2018
Author
Member
I was following the paper's notation. Should I also use `mean` (or `mean_`) then?
I made a quick example to illustrate the use of Yeo-Johnson vs. Box-Cox + offset. As Box-Cox only accepts positive data, one solution is to shift the data so its minimum becomes a small positive eps (i.e. add -min(data) + eps). One thing we see is that the "after offset and Box-Cox" result isn't as symmetric as the Yeo-Johnson one, and most importantly the values are much higher. Is it worth adding this as an example @amueller? TBH I wouldn't be able to mathematically or intuitively explain those results.
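A sketch of the offset workaround being compared (the data and eps are arbitrary illustrative choices):

import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 1))  # contains negative values
eps = 1e-8
offset = -X.min() + eps  # shift so every value is strictly positive
X_bc = PowerTransformer(method='box-cox').fit_transform(X + offset)
X_yj = PowerTransformer(method='yeo-johnson').fit_transform(X)  # no shift needed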
Thanks for the added tests.
self._scaler = StandardScaler()
if force_compute_transform:
    transformed = self._scaler.fit_transform(transformed)
self._scaler = StandardScaler(copy=self.copy)
TomDLT
Jul 17, 2018
Member
actually you should be able to use `copy=False` here, since a copy has already been done just before.
Couple of changes
"""Return inverse-transformed input x following Yeo-Johnson inverse | ||
transform with parameter lambda. | ||
Note |
glemaitre
Jul 17, 2018
Contributor
Notes
"""Return transformed input x following Yeo-Johnson transform with | ||
parameter lambda. | ||
Note |
glemaitre
Jul 17, 2018
Contributor
Notes
@@ -2566,7 +2720,8 @@ def _check_input(self, X, check_positive=False, check_shape=False,
    X : array-like, shape (n_samples, n_features)
    check_positive : bool
        If True, check that all data is positive and non-zero.
        If True, check that all data is positive and non-zero (only if
        self.method is box-cox).
glemaitre
Jul 17, 2018
Contributor
only if self.method=='box-cox'
I am waiting to check the example in the documentation
…into yeojohnson
@glemaitre @ogrisel, I think the plot looks pretty OK now.
I'm finding the plots in Also, having the transformations go from left to right and the datasets from top to bottom doesn't look like it would be infeasible, and would be more familiar from plot_cluster_comparison etc.
Can I clarify why
Personally I find it easier to compare the transformations when they're stacked on each other, especially since the axes limits are uniform across the plots. I don't have anything against having the transformation names on the left. It would also make sense to me to have one dataset per column (limiting the plot to 4 rows instead of 8), but that would make the plot wider which can be annoying on mobile. Thanks for mentioning
…into yeojohnson
Looks like 14e7c32 broke
Should I open an issue for this? I'm not sure if this comes from my env (I created a new one from scratch, still the same). Doesn't the CI check that all the examples are passing?
@NicolasHug it's now "fixed" but you need to remove your
Thanks, just updated
The matplotlib rendering of the 2 examples is good enough for now: https://29575-843222-gh.circle-artifacts.com/0/doc/auto_examples/index.html#preprocessing Merging. Thanks @NicolasHug for this nice contribution!
yay!
I agree with @jnothman (#11520 (comment)) that using a layout similar to the cluster comparison plot would improve the readability even further, but I don't want to delay the release for this.
Yey!!
Reference Issues/PRs
Closes #10261
What does this implement/fix? Explain your changes.
This PR implements the Yeo-Johnson transform as part of the PowerTransformer class.
PowerTransformer currently only supports Box-Cox, which only works for positive values; Yeo-Johnson works on the whole real line.
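A quick usage sketch of the new option (input values are arbitrary):

import numpy as np
from sklearn.preprocessing import PowerTransformer

X = np.array([[-1.0], [0.5], [3.0]])  # negatives are fine with Yeo-Johnson
pt = PowerTransformer(method='yeo-johnson')
X_trans = pt.fit_transform(X)  # pt.lambdas_ holds the estimated parameter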
Original paper: link.
TODO:
Any other comments?
The lambda parameter estimation is a bit tricky and currently does not work (should be OK now, see below). Unlike for Box-Cox, there's no scipy built-in that we can rely on. I'm having a hard time finding decent guidelines; I tried to implement likelihood maximization with the brent optimizer (just like for Box-Cox) but ran into overflow issues. The transform code seems to work though:
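A sketch of that estimation path, using Brent's method as for Box-Cox; the bracketing interval is an assumption borrowed from scipy's boxcox_normmax default, and yeo_johnson_nll refers to the negative log-likelihood sketched earlier in the thread:

from scipy import optimize

# Minimize the negative log-likelihood over lambda with Brent's method.
lmbda = optimize.brent(lambda lm: yeo_johnson_nll(x, lm), brack=(-2, 2))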
which is a reproduction of a figure from "Quantile regression via vector generalized additive models" by Thomas W. Yee.
Code for figure (hacky):