
# [MRG] Add Yeo-Johnson transform to PowerTransformer #11520

Merged
merged 44 commits into scikit-learn:master from NicolasHug:yeojohnson on Jul 20, 2018

## Conversation

### NicolasHug commented Jul 14, 2018 • edited

Closes #10261

#### What does this implement/fix? Explain your changes.

This PR implements the Yeo-Johnson transform as part of the PowerTransformer class.

PowerTransformer currently only supports Box-Cox, which only works for strictly positive values; Yeo-Johnson works on the whole real line.

TODO:

• Write transform
• Fix lambda param estimation
• Write inverse transform
• Write docs
• Write tests
• Update examples

The lambda parameter estimation is a bit tricky and currently does not work (should be OK now, see below). Unlike for Box-Cox, there's no scipy built-in that we can rely on. I'm having a hard time finding decent guidelines; I tried to implement likelihood maximization with the brent optimizer (just like for Box-Cox) but ran into overflow issues.
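For context, here is a minimal standalone sketch of that likelihood-maximization approach (an illustration only, not the PR's actual code; the helper names are made up):

```python
import numpy as np
from scipy import optimize

def yeo_johnson(x, lmbda):
    # Elementwise Yeo-Johnson transform (psi in the paper's notation).
    out = np.zeros_like(x)
    pos = x >= 0
    if abs(lmbda) < 1e-19:
        out[pos] = np.log1p(x[pos])
    else:
        out[pos] = ((x[pos] + 1) ** lmbda - 1) / lmbda
    if abs(lmbda - 2) < 1e-19:
        out[~pos] = -np.log1p(-x[~pos])
    else:
        out[~pos] = -((-x[~pos] + 1) ** (2 - lmbda) - 1) / (2 - lmbda)
    return out

def neg_log_likelihood(lmbda, x):
    # Negative log-likelihood (up to constants) of x under the model
    # "yeo_johnson(x, lmbda) is Gaussian", following Yeo & Johnson (2000).
    n = x.shape[0]
    psi = yeo_johnson(x, lmbda)
    var = psi.var()  # MLE of the variance of the transformed data
    penalty = (lmbda - 1) * (np.sign(x) * np.log1p(np.abs(x))).sum()
    return n / 2. * np.log(var) - penalty

# skewed data taking both signs
x = np.random.RandomState(0).lognormal(size=100) - 1
lmbda = optimize.brent(neg_log_likelihood, brack=(-2, 2), args=(x,))
```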

The transform code seems to work though:

[figure: Box-Cox (left) and Yeo-Johnson (right) transforms for several values of lambda]

which is a reproduction of a figure from *Quantile regression via vector generalized additive models* by Thomas W. Yee.

Code for figure (hacky):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PowerTransformer

yj = PowerTransformer(method='yeo-johnson', standardize=False)
bc = PowerTransformer(method='box-cox', standardize=False)

X = np.arange(-4, 4, .1).reshape(-1, 1)
fig, axes = plt.subplots(ncols=2)

for lmbda in (0, .5, 1, 1.5, 2):
    # Box-Cox is only defined for strictly positive values
    X_pos = X[X > 0].reshape(-1, 1)
    bc.fit(X_pos)
    bc.lambdas_ = [lmbda]  # override the fitted lambda
    X_trans = bc.transform(X_pos)
    axes[0].plot(X_pos, X_trans, label=r'$\lambda = {}$'.format(lmbda))

    yj.fit(X)
    yj.lambdas_ = [lmbda]
    X_trans = yj.transform(X)
    axes[1].plot(X, X_trans, label=r'$\lambda = {}$'.format(lmbda))

axes[0].set_title('Box-Cox')
axes[1].set_title('Yeo-Johnson')
for ax in axes:
    ax.set(xlim=[-4, 4], ylim=[-5, 5], aspect='equal')
    ax.legend()
    ax.grid()

plt.show()
```
added 2 commits Jul 14, 2018
 WIP - First draft on Yeo-Johnson transform 
 06891eb 
 Fixed lambda param optimization 
 a88d168 
The issue came from an error in the log-likelihood function

### NicolasHug commented Jul 14, 2018

Lambda param estimation should be fixed now, thanks @amueller. Replication of this example with Yeo-Johnson instead of Box-Cox:

[figure: replication of the linked Box-Cox example, using Yeo-Johnson]
added 2 commits Jul 15, 2018
 Some first tests 
 ee09d7f 
Need to write inverse_transform to continue
 Put helper method for yeo-johnson at the end 
 aea0842 
reviewed
```
# get rid of them to compute them.
_, lmbda = stats.boxcox(col[~np.isnan(col)], lmbda=None)
col_trans = boxcox(col, lmbda)
else:  # neo-johnson
```

yeo?

#### amueller Jul 15, 2018 Member

this took me a while.

reviewed

### amueller left a comment

 We think it's working now, right? So we need a test for the optimization, and then documentation and adding it to an example?
```
# when x >= 0
if lmbda < 1e-19:
    out[pos] = np.log(x[pos] + 1)
else:  #lmbda != 0
```

#### amueller Jul 15, 2018 Member

space after #

```
n = x.shape[0]

# Estimated mean and variance of the normal distribution
mu = psi.sum() / n
```

#### amueller Jul 15, 2018 Member

do we need from __future__ import division?
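For readers unfamiliar with the Python 2 concern, a tiny illustration (not from the PR):

```python
from __future__ import division  # makes `/` true division on Python 2

n = 4
print(7 / n)  # 1.75 with the import; 1 on Python 2 without it
# In the quoted code psi.sum() is a NumPy float, so the division is
# float division either way, and the import is not strictly needed.
```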

#### amueller Jul 15, 2018 Member

thanks, was hard to see from the diff and I was lazy ;)

```
@@ -2076,7 +2078,7 @@ def test_power_transformer_strictly_positive_exception():
                          pt.fit, X_with_negatives)
     assert_raise_message(ValueError, not_positive_message,
-                         power_transform, X_with_negatives)
+                         power_transform, X_with_negatives, 'box-cox')
```

#### amueller Jul 15, 2018 Member

why is this needed? The default value shouldn't change, right? Or do we want to start a cycle to change the default to yeo-johnson?

#### NicolasHug Jul 15, 2018 Author Member

I find it clearer and more explicit?

I don't know if we'll change the default, but it should still be fine as PowerTransformer hasn't been released yet AFAIK.

#### amueller Jul 15, 2018 Member

good point. We should discuss before the release. I think yeo-johnson would make more sense.

#### ogrisel Jul 15, 2018 Member

The fact that Yeo-Johnson accepts negative values while Box-Cox does not makes me feel like we should use it by default. From a usability point of view, it's nicer to our users.
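As a quick standalone illustration of that usability point (a sketch, not part of the PR):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

X = np.array([[-1.], [0.], [2.]])  # contains non-positive values

PowerTransformer(method='yeo-johnson').fit_transform(X)  # works
PowerTransformer(method='box-cox').fit_transform(X)      # raises ValueError
```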

#### NicolasHug Jul 15, 2018 Author Member

I have the same feeling. Plus, it is designed to be a generalization of Box-Cox, even though that's not strictly the case.

+1

#### NicolasHug Jul 15, 2018 Author Member

shall I change the default then?

I think so.

#### NicolasHug Jul 15, 2018 Author Member

Done

added 7 commits Jul 15, 2018
 Added inverse transform + some tests 
 fba12eb 
 Added test for the optimization procedures 
 ed5a411 
 Created _box_cox_optimize method for better code symmetry 
 8bab32e 
 Opt for yeo-johnson not influenced by Nan 
 0525bab 
Also added related test
 Added doc 
 8e187c4 
 Better test for nan in transform() 
 4173df3 
 Updated more docs and example 
 61e2183 
reviewed
```
pt = PowerTransformer(method=method, standardize=False)
pt.lambdas_ = [lmbda]
X_inv = pt.inverse_transform(X)
pt.lambdas_ = [9999]  # just to make sure
```

#### ogrisel Jul 15, 2018 Member

Why not:

del pt.lambdas_

#### ogrisel Jul 15, 2018 Member

Alternatively, create a new pt object from scratch to make the motivation of the test easier to read:

```python
ground_truth_transform = PowerTransformer(method=method, standardize=False)
ground_truth_transform.lambdas_ = [lmbda]
X_inv = ground_truth_transform.inverse_transform(X)

estimated_transform = PowerTransformer(method=method, standardize=False)
X_inv_trans = estimated_transform.fit_transform(X_inv)
```
 updated test 
 b1ac8d4 
reviewed
```
X_inv_trans = pt.fit_transform(X_inv)
assert_almost_equal(0, np.linalg.norm(X - X_inv_trans) / n_samples,
                    decimal=2)
```

#### ogrisel Jul 15, 2018 Member

Please also add an assertion that checks that X_inv_trans.mean(axis=0) is close to [0.] and X_inv_trans.std(axis=0) is close to [1.].
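A self-contained sketch of the suggested assertions (variable names are assumed, mirroring the quoted test; not the final test code):

```python
import numpy as np
from numpy.testing import assert_almost_equal
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(1000, 1))

# standardize=True (the default) should give zero mean and unit variance
X_trans = PowerTransformer(method='yeo-johnson').fit_transform(X)
assert_almost_equal(X_trans.mean(axis=0), [0.], decimal=2)
assert_almost_equal(X_trans.std(axis=0), [1.], decimal=2)
```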

reviewed
```
rng = np.random.RandomState(0)
n_samples = 1000
X = rng.normal(size=(n_samples, 1))
```

#### ogrisel Jul 15, 2018 Member

to make the test more explicit you can write: X = rng.normal(loc=0., scale=1., size=(n_samples, 1))

### amueller commented Jul 15, 2018

 If we want this to be the default then this is a blocker, right?
reviewed
```
lmbda_no_nans = pt.lambdas_[0]

# concat nans at the end and check lambda stays the same
X = np.concatenate([X, np.full_like(X, np.nan)])
```

#### ogrisel Jul 15, 2018 Member

To make sure that the location of the NaNs does not impact the estimation:

```
from sklearn.utils import shuffle
...

X = np.concatenate([X, np.full_like(X, np.nan)])
X = shuffle(X, random_state=0)
```

#### NicolasHug Jul 15, 2018 Author Member

Done

reviewed
 @pytest.mark.parametrize("method, lmbda", [('box-cox', .5), ('yeo-johnson', .1)])

#### ogrisel Jul 15, 2018 Member

Could you add more values for lmbda for each method? E.g.:

```
[
    ('box-cox', .1),
    ('box-cox', .5),
    ('yeo-johnson', .1),
    ('yeo-johnson', .5),
    ('yeo-johnson', 1.),
]
```
added 2 commits Jul 15, 2018
 Modified tests according to reviews 
 489bc70 
 Changed default method from cox-box to yeo-johnson 
 6783e3a 
reviewed
 applied to six different probability distributions: Lognormal, Chi-squared, Weibull, Gaussian, Uniform, and Bimodal. The power transform is useful as a transformation in modeling problems where homoscedasticity and normality are desired. Below are examples of Box-Cox and

#### ogrisel Jul 15, 2018 • edited Member

I don't understand what "modeling problems where homoscedasticity is desired" means in this context: to me, heteroscedasticity is a property of the noise of the output variable that is not the same for different regions of the input space in a conditional model.

It does not seem obvious how a power transform can improve homoscedasticity.

#### ogrisel Jul 15, 2018 Member

Actually, this statement seems to be correct:

http://article.sapub.org/10.5923.j.ajms.20180801.02.html

It might be interesting to try to come up with a good example to show this corrective effect in a (maybe synthetic) linear regression problem. However, this is probably outside of the scope of the current PR.
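For the record, a tiny synthetic sketch of that corrective effect (hypothetical data, outside the PR): multiplicative log-normal noise makes the target's spread grow with its level, and a power transform brings the spread back to roughly constant.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
x = rng.uniform(1, 10, size=1000)
y = x * rng.lognormal(sigma=0.5, size=1000)  # heteroscedastic target

y_trans = PowerTransformer(method='yeo-johnson').fit_transform(
    y.reshape(-1, 1)).ravel()

# spread grows with the level of x before the transform...
print(y[x < 4].std(), y[x > 7].std())
# ...and is much more comparable after it
print(y_trans[x < 4].std(), y_trans[x > 7].std())
```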

added the label Jul 15, 2018
added this to the 0.20 milestone Jul 15, 2018

### amueller commented Jul 15, 2018

Tagged for 0.20 and added the blocker label. I don't like that we keep adding stuff, but if we want to make it the default we should do it now.
approved these changes

### glemaitre left a comment

Couple of open comments. If I am not wrong, we should have something in the common estimator_checks which forces the input to be positive to work with box-cox. We probably want to change this behavior when we change the default.
```
        The power transform method. Currently, 'box-cox' (Box-Cox
        transform) is the only option available.

    method : str, (default='yeo-johnson')
        The power transform method. Available methods are 'box-cox' and
```

#### glemaitre Jul 16, 2018 Contributor

We can maybe have a bullet point list for each method referring to the reference section.

```
@@ -2490,12 +2494,18 @@ def fit(self, X, y=None):
        self.lambdas_ = []
        transformed = []

        opt_fun = {'box-cox': self._box_cox_optimize,
```

#### glemaitre Jul 16, 2018 Contributor

I would have expected func instead of fun :)

#### glemaitre Jul 16, 2018 Contributor

optim_function

```
opt_fun = {'box-cox': self._box_cox_optimize,
           'yeo-johnson': self._yeo_johnson_optimize
           }[self.method]
trans_fun = {'box-cox': boxcox,
```

#### glemaitre Jul 16, 2018 Contributor

transform_function is probably not too long a name to use

```
    return x_inv

def _yeo_johnson_transform(self, x, lmbda):
```

#### glemaitre Jul 16, 2018 Contributor

Can't we just define the forward transform and take 1 / _yeo_johnson_transform for the inverse?

#### NicolasHug Jul 16, 2018 Author Member

The inverse here means f^{-1}, not 1 / f
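For reference, the forward transform from Yeo & Johnson (2000) is piecewise, and the inverse is obtained by solving each branch for $x$:

$$
\psi(x;\lambda) =
\begin{cases}
\left((x+1)^{\lambda}-1\right)/\lambda & x \ge 0,\ \lambda \ne 0\\
\log(x+1) & x \ge 0,\ \lambda = 0\\
-\left((-x+1)^{2-\lambda}-1\right)/(2-\lambda) & x < 0,\ \lambda \ne 2\\
-\log(-x+1) & x < 0,\ \lambda = 2
\end{cases}
$$

On the first branch, for instance, $x = (\lambda\psi+1)^{1/\lambda}-1$, which is quite different from $1/\psi$.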

 """Return the negative log likelihood of the observed data x as a function of lambda.""" psi = self._yeo_johnson_transform(x, lmbda) n = x.shape[0]


#### glemaitre Jul 16, 2018 Contributor

Uhm missing x most probably

#### glemaitre Jul 16, 2018 Contributor

Oh I see. Can we pass x as an argument, as in the optimize function?

#### NicolasHug Jul 16, 2018 Author Member

Yes, this is a nested function, so x is implicitly passed anyway.
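A toy illustration of that closure point (purely illustrative, not the PR's code):

```python
import numpy as np
from scipy import optimize

def optimize_lmbda(x):
    # x is not an argument of the inner function: it is captured from the
    # enclosing scope (a closure), i.e. "implicitly passed".
    def neg_log_likelihood(lmbda):
        return (lmbda - x.mean()) ** 2  # toy objective, for illustration only

    return optimize.brent(neg_log_likelihood, brack=(-2, 2))

print(optimize_lmbda(np.array([0.2, 0.6, 1.0])))  # approximately 0.6
```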

```
# Estimated mean and variance of the normal distribution
mu = psi.sum() / n
sig_sq = np.power(psi - mu, 2).sum() / n
```

#### glemaitre Jul 16, 2018 Contributor

Stupid question: is sig_sq the variance? If this is the case, you might want to call it var

#### NicolasHug Jul 16, 2018 Author Member

I was following the paper's notation. Should I use mean (or mean_) also then?
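For context, these are the quantities in the paper's notation (Yeo & Johnson, 2000), writing $\psi_i = \psi(x_i;\lambda)$ for the transformed values:

$$
\hat\mu = \frac{1}{n}\sum_{i=1}^{n}\psi_i,
\qquad
\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(\psi_i-\hat\mu)^2,
$$

and the log-likelihood being maximized is, up to an additive constant,

$$
\ell(\lambda) = -\frac{n}{2}\log\hat\sigma^2
+ (\lambda-1)\sum_{i=1}^{n}\operatorname{sign}(x_i)\log(|x_i|+1).
$$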

added this to PRs tagged in scikit-learn 0.20 Jul 16, 2018

### NicolasHug commented Jul 16, 2018 • edited

I made a quick example to illustrate the use of Yeo-Johnson vs. Box-Cox + offset. As Box-Cox only accepts positive data, one solution is to shift the data by a fixed offset value (typically min(data) + eps):

One thing we see is that the "after offset and Box-Cox" result isn't as symmetric as the Yeo-Johnson one, and most importantly the values are much higher. Is it worth adding this as an example @amueller? TBH I wouldn't be able to mathematically or intuitively explain those results.
approved these changes

### TomDLT left a comment

Thanks for the added tests. We might want to add them as common tests at some point, but that might be for another pull request.
```
self._scaler = StandardScaler()
if force_compute_transform:
    transformed = self._scaler.fit_transform(transformed)
self._scaler = StandardScaler(copy=self.copy)
```

#### TomDLT Jul 17, 2018 Member

actually you should be able to use copy=False here, since a copy has already been done just before.
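A minimal standalone illustration of the suggestion (assumed setup, not the PR's code): when the array passed in is already a fresh copy, StandardScaler(copy=False) standardizes it in place instead of allocating a second copy.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.RandomState(0).lognormal(size=(5, 1))
transformed = np.log(X)  # stand-in for the freshly allocated transform output

scaler = StandardScaler(copy=False)
out = scaler.fit_transform(transformed)
print(out is transformed)  # True: scaled in place, no second copy
```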

added 2 commits Jul 17, 2018
 set copy to False for the scaler 
 597a85d 
 Merge branch 'master' of https://github.com/scikit-learn/scikit-learn … 
 593c818 
…into yeojohnson
requested changes

### glemaitre left a comment

 Couple of changes
 """Return inverse-transformed input x following Yeo-Johnson inverse transform with parameter lambda. Note

#### glemaitre Jul 17, 2018 Contributor

Notes

 """Return transformed input x following Yeo-Johnson transform with parameter lambda. Note

#### glemaitre Jul 17, 2018 Contributor

Notes

```
@@ -2566,7 +2720,8 @@ def _check_input(self, X, check_positive=False, check_shape=False,
        X : array-like, shape (n_samples, n_features)

        check_positive : bool
-           If True, check that all data is positive and non-zero.
+           If True, check that all data is positive and non-zero (only if
+           self.method is box-cox).
```

#### glemaitre Jul 17, 2018 Contributor

only if self.method=='box-cox'

 Addressed comments from glemaitre 
 8022cc3 

### glemaitre commented Jul 17, 2018

 I am waiting to check the example in the documentation
 Merge branch 'master' of https://github.com/scikit-learn/scikit-learn … 
 0c120fb 
…into yeojohnson

### NicolasHug commented Jul 18, 2018 • edited

 @glemaitre @ogrisel, I think the plot looks pretty OK now.

### jnothman commented Jul 19, 2018

 I'm finding the plots in plot_map_data_to_normal relatively hard to navigate intuitively. It's not a blocker, but I think it needs to look more tabular: at the moment it takes some effort to see that each row is a different transformation; a label on the left of the row would be more helpful. Also, having the transformations go from left to right and the datasets from top to bottom doesn't look like it would be infeasible, and would be more familiar from plot_cluster_comparison etc.
reviewed

### jnothman left a comment

 Can I clarify why plot_all_scaling still only shows box-cox?

### NicolasHug commented Jul 19, 2018

> Also, having the transformations go from left to right and the datasets from top to bottom doesn't look like it would be infeasible, and would be more familiar from plot_cluster_comparison etc.

Personally I find it easier to compare the transformations when they're stacked on each other, especially since the axes limits are uniform across the plots. I don't have anything against having the transformation names on the left. It would also make sense to me to have one dataset per column (limiting the plot to 4 rows instead of 8), but that would make the plot wider, which can be annoying on mobile.

Thanks for mentioning plot_all_scaling, I missed that one.
 Merge branch 'master' of https://github.com/scikit-learn/scikit-learn … 
 8234a3e 
…into yeojohnson

### NicolasHug commented Jul 19, 2018 • edited

Looks like 14e7c32 broke plot_all_scaling on master:

```
Traceback (most recent call last):
  File "examples/preprocessing/plot_all_scaling.py", line 71, in <module>
    dataset = fetch_california_housing()
  File "/home/nico/dev/sklearn/sklearn/datasets/california_housing.py", line 128, in fetch_california_housing
    cal_housing = joblib.load(filepath)
  File "/home/nico/dev/sklearn/sklearn/externals/joblib/numpy_pickle.py", line 578, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/home/nico/dev/sklearn/sklearn/externals/joblib/numpy_pickle.py", line 508, in _unpickle
    obj = unpickler.load()
  File "/usr/lib64/python3.6/pickle.py", line 1050, in load
    dispatch[key[0]](self)
  File "/usr/lib64/python3.6/pickle.py", line 1338, in load_global
    klass = self.find_class(module, name)
  File "/usr/lib64/python3.6/pickle.py", line 1388, in find_class
    __import__(module, level=0)
ModuleNotFoundError: No module named 'sklearn.externals._joblib.numpy_pickle'
```

Should I open an issue for this? I'm not sure if this comes from my env (I created a new one from scratch, still the same). Doesn't the CI check that all the examples are passing?

### amueller commented Jul 20, 2018

 @NicolasHug it's now "fixed" but you need to remove your scikit_learn_data folder in your home folder.
added 2 commits Jul 20, 2018
 Merge branch 'master' of https://github.com/scikit-learn/scikit-learn … 
 7d529df 
…into yeojohnson
 Updated plot_all_scaling.py example 
 c0a01df 

### NicolasHug commented Jul 20, 2018

 Thanks, just updated plot_all_scaling

### ogrisel commented Jul 20, 2018 • edited

The matplotlib rendering of the 2 examples is good enough for now: https://29575-843222-gh.circle-artifacts.com/0/doc/auto_examples/index.html#preprocessing

Merging. Thanks @NicolasHug for this nice contribution!
merged commit 2d232ac into scikit-learn:master Jul 20, 2018
7 checks passed
ci/circleci: deploy Your tests passed on CircleCI!
ci/circleci: python2 Your tests passed on CircleCI!
ci/circleci: python3 Your tests passed on CircleCI!
codecov/patch 98.74% of diff hit (target 95.3%)
codecov/project 95.3% (+<.01%) compared to 1984dac
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
scikit-learn 0.20 automation moved this from Blockers to Done Jul 20, 2018

 yay!

### ogrisel commented Jul 20, 2018 • edited

 I agree with @jnothman (#11520 (comment)) that using a layout similar to the cluster comparison plot would improve the readability even further but I don't want to delay the release for this.
deleted the NicolasHug:yeojohnson branch Jul 20, 2018

### GaelVaroquaux commented Jul 20, 2018

 Yey!!