# [MRG] Add Yeo-Johnson transform to PowerTransformer #11520

Merged
merged 44 commits into from Jul 20, 2018

## Conversation

7 participants
Contributor

### NicolasHug commented Jul 14, 2018 • edited

Closes #10261

#### What does this implement/fix? Explain your changes.

This PR implements the Yeo-Johnson transform as part of the PowerTransformer class.

PowerTransformer currently only supports Box-Cox, which only works for positive values; Yeo-Johnson works on the whole real line.
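For reference, the Yeo-Johnson transform is defined piecewise around zero, which is what lets it handle negative inputs. A minimal NumPy sketch of the forward transform (an illustrative helper, not the code in this PR):

```python
import numpy as np

def yeo_johnson(x, lmbda):
    """Element-wise Yeo-Johnson transform (illustrative, not the PR's code)."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0  # the transform is piecewise in the sign of the input

    if abs(lmbda) > 1e-12:
        out[pos] = ((x[pos] + 1) ** lmbda - 1) / lmbda
    else:  # lmbda == 0 reduces to log(1 + x)
        out[pos] = np.log1p(x[pos])

    if abs(lmbda - 2) > 1e-12:
        out[~pos] = -((1 - x[~pos]) ** (2 - lmbda) - 1) / (2 - lmbda)
    else:  # lmbda == 2 reduces to -log(1 - x)
        out[~pos] = -np.log1p(-x[~pos])
    return out
```

With `lmbda = 1` the transform is the identity on the whole real line, which is why it behaves as a (near) generalization of Box-Cox.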

TODO:

• Write transform
• Fix lambda param estimation
• Write inverse transform
• Write docs
• Write tests
• Update examples

The lambda parameter estimation is a bit tricky and currently does not work (should be OK now, see below). Unlike for Box-Cox, there's no scipy built-in that we can rely on. I'm having a hard time finding decent guidelines; I tried to implement likelihood maximization with the Brent optimizer (just like for Box-Cox) but ran into overflow issues.
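For what it's worth, the likelihood maximization can be sketched as follows. This is an illustrative reconstruction, not the PR's final code: the exact log-likelihood expression (Gaussian profile likelihood plus the log-Jacobian of the transform) and the `(-2, 2)` bracket are assumptions here.

```python
import numpy as np
from scipy.optimize import brent

def _yj(x, lmbda):
    # piecewise Yeo-Johnson forward transform (illustrative helper)
    out = np.empty_like(x)
    pos = x >= 0
    if abs(lmbda) > 1e-12:
        out[pos] = ((x[pos] + 1) ** lmbda - 1) / lmbda
    else:
        out[pos] = np.log1p(x[pos])
    if abs(lmbda - 2) > 1e-12:
        out[~pos] = -((1 - x[~pos]) ** (2 - lmbda) - 1) / (2 - lmbda)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out

def neg_log_likelihood(lmbda, x):
    # profile log-likelihood of a Gaussian model on the transformed data,
    # plus the log of the Jacobian of the transform
    y = _yj(x, lmbda)
    n = x.shape[0]
    llf = -n / 2 * np.log(np.var(y))
    llf += (lmbda - 1) * np.sum(np.sign(x) * np.log1p(np.abs(x)))
    return -llf  # brent minimizes

rng = np.random.RandomState(0)
x = rng.lognormal(size=1000) - 1  # right-skewed data including negative values
lmbda_hat = brent(neg_log_likelihood, args=(x,), brack=(-2, 2))
```

The overflow issue mentioned above plausibly comes from `(x + 1) ** lmbda` blowing up for large `|lmbda|` during the search; computing the power in log-space (e.g. `np.exp(lmbda * np.log1p(x))`) is one common mitigation.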

The transform code seems to work though:

[figure omitted: Box-Cox and Yeo-Johnson curves for several values of lambda]

which is a reproduction of a figure from *Quantile regression via vector generalized additive models* by Thomas W. Yee.

Code for the figure (hacky):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer
import matplotlib.pyplot as plt

yj = PowerTransformer(method='yeo-johnson', standardize=False)
bc = PowerTransformer(method='box-cox', standardize=False)

X = np.arange(-4, 4, .1).reshape(-1, 1)
fig, axes = plt.subplots(ncols=2)

for lmbda in (0, .5, 1, 1.5, 2):
    X_pos = X[X > 0].reshape(-1, 1)
    bc.fit(X_pos)
    bc.lambdas_ = [lmbda]
    X_trans = bc.transform(X_pos)
    axes[0].plot(X_pos, X_trans, label=r'$\lambda = {}$'.format(lmbda))
    axes[0].set_title('Box-Cox')

    yj.fit(X)
    yj.lambdas_ = [lmbda]
    X_trans = yj.transform(X)
    axes[1].plot(X, X_trans, label=r'$\lambda = {}$'.format(lmbda))
    axes[1].set_title('Yeo-Johnson')

for ax in axes:
    ax.set(xlim=[-4, 4], ylim=[-5, 5], aspect='equal')
    ax.legend()
    ax.grid()

plt.show()
```

### NicolasHug added some commits Jul 14, 2018

 WIP - First draft on Yeo-Johnson transform 
 06891eb 
 Fixed lambda param optimization 
The issue was from an error in the log likelihood function
 a88d168 
Contributor

### NicolasHug commented Jul 14, 2018

 Lambda param estimation should be fixed now, thanks @amueller. Replication of this example with Yeo-Johnson instead of Box-Cox:

### NicolasHug added some commits Jul 15, 2018

 Some first tests 
Need to write inverse_transform to continue
 ee09d7f 
 Put helper method for yeo-johnson at the end 
 aea0842 

### amueller reviewed Jul 15, 2018

sklearn/preprocessing/data.py

### amueller reviewed Jul 15, 2018

We think it's working now, right? So we need a test for the optimization, and then documentation and adding it to an example?

sklearn/preprocessing/data.py
sklearn/preprocessing/data.py
```diff
@@ -2076,7 +2078,7 @@ def test_power_transformer_strictly_positive_exception():
                          pt.fit, X_with_negatives)
     assert_raise_message(ValueError, not_positive_message,
-                         power_transform, X_with_negatives)
+                         power_transform, X_with_negatives, 'box-cox')
```

#### amueller Jul 15, 2018

Member

why is this needed? The default value shouldn't change, right? Or do we want to start a cycle to change the default to yeo-johnson?

#### NicolasHug Jul 15, 2018

Contributor

I find it clearer and more explicit?

I don't know if we'll change the default, but it should still be fine as PowerTransformer hasn't been released yet AFAIK.

#### amueller Jul 15, 2018

Member

good point. We should discuss before the release. I think yeo-johnson would make more sense.

#### ogrisel Jul 15, 2018

Member

The fact that Yeo-Johnson accepts negative values while Box-Cox does not makes me feel like we should use it by default. From a usability point of view, it's nicer to our users.

#### NicolasHug Jul 15, 2018

Contributor

I have the same feeling. Plus, it is designed to be a generalization of Box-Cox, even though that's not strictly the case.

Member

+1

#### NicolasHug Jul 15, 2018

Contributor

shall I change the default then?

Member

think so.

Contributor

Done

### NicolasHug added some commits Jul 15, 2018

 Added inverse transform + some tests 
 fba12eb 
 Added test for the optimization procedures 
 ed5a411 
 Created _box_cox_optimize method for better code symmetry 
 8bab32e 
 Opt for yeo-johnson not influenced by Nan 
Also added related test
 0525bab 
 Added doc 
 8e187c4 
 Better test for nan in transform() 
 4173df3 
 Updated more docs and example 
 61e2183 

### ogrisel reviewed Jul 15, 2018

sklearn/preprocessing/tests/test_data.py
 updated test 
 b1ac8d4 

### ogrisel reviewed Jul 15, 2018

sklearn/preprocessing/tests/test_data.py

### ogrisel reviewed Jul 15, 2018

sklearn/preprocessing/tests/test_data.py
Member

### amueller commented Jul 15, 2018

 If we want this to be the default then this is a blocker, right?

### ogrisel reviewed Jul 15, 2018

```python
lmbda_no_nans = pt.lambdas_[0]

# concat nans at the end and check lambda stays the same
X = np.concatenate([X, np.full_like(X, np.nan)])
```

#### ogrisel Jul 15, 2018

Member

To make sure that the location of the NaNs does not impact the estimation:

```python
from sklearn.utils import shuffle
...

X = np.concatenate([X, np.full_like(X, np.nan)])
X = shuffle(X, random_state=0)
```
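ogrisel's snippet, expanded into a self-contained version of the test (names here are illustrative; the PR's actual test lives in `test_data.py`, and this assumes `PowerTransformer` ignores NaNs during fitting, as this PR implements):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer
from sklearn.utils import shuffle

rng = np.random.RandomState(0)
X = rng.randn(100, 1)

pt = PowerTransformer(method='yeo-johnson')
pt.fit(X)
lmbda_no_nans = pt.lambdas_[0]

# concatenate NaNs and shuffle them in, so their location cannot
# influence the lambda estimation
X_nan = np.concatenate([X, np.full_like(X, np.nan)])
X_nan = shuffle(X_nan, random_state=0)
pt.fit(X_nan)

assert np.isclose(pt.lambdas_[0], lmbda_no_nans)
```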

Contributor

Done

### ogrisel reviewed Jul 15, 2018

sklearn/preprocessing/tests/test_data.py

### NicolasHug added some commits Jul 15, 2018

 Modified tests according to reviews 
 489bc70 
 Changed default method from cox-box to yeo-johnson 
 6783e3a 

Member

### amueller commented Jul 15, 2018

 tagged for 0.20 and added blocker label. I don't like that we keep adding stuff but if we want to make it default we should do it now.

### glemaitre approved these changes Jul 16, 2018

If I am not wrong, we should have something in the common estimator_checks which forces the input to be positive to work with box-cox. We probably want to change this behavior when we change the default.

sklearn/preprocessing/data.py
sklearn/preprocessing/data.py
sklearn/preprocessing/data.py
```python
        return x_inv

    def _yeo_johnson_transform(self, x, lmbda):
```

#### glemaitre Jul 16, 2018

Contributor

Can't we just define the forward transform and take `1 / _yeo_johnson_transform` for the inverse?

#### NicolasHug Jul 16, 2018

Contributor

The inverse here means f^{-1}, not 1 / f
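Indeed; since the forward transform is monotone and maps 0 to 0, the functional inverse can be solved branch by branch. A sketch with illustrative helpers (not the PR's code) and a round-trip check:

```python
import numpy as np

def yj_transform(x, lmbda):
    # forward Yeo-Johnson, piecewise in the sign of x
    out = np.empty_like(x)
    pos = x >= 0
    out[pos] = (np.log1p(x[pos]) if abs(lmbda) < 1e-12
                else ((x[pos] + 1) ** lmbda - 1) / lmbda)
    out[~pos] = (-np.log1p(-x[~pos]) if abs(lmbda - 2) < 1e-12
                 else -((1 - x[~pos]) ** (2 - lmbda) - 1) / (2 - lmbda))
    return out

def yj_inverse(y, lmbda):
    # functional inverse f^{-1}: solve each branch for x;
    # the transform preserves sign, so y >= 0 iff x >= 0
    out = np.empty_like(y)
    pos = y >= 0
    out[pos] = (np.expm1(y[pos]) if abs(lmbda) < 1e-12
                else (lmbda * y[pos] + 1) ** (1 / lmbda) - 1)
    out[~pos] = (-np.expm1(-y[~pos]) if abs(lmbda - 2) < 1e-12
                 else 1 - (1 - (2 - lmbda) * y[~pos]) ** (1 / (2 - lmbda)))
    return out

# round-trip: inverse(forward(x)) should recover x for any lambda
x = np.linspace(-3, 3, 25)
for lmbda in (0.0, 0.5, 1.0, 1.5, 2.0):
    assert np.allclose(yj_inverse(yj_transform(x, lmbda), lmbda), x)
```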

sklearn/preprocessing/data.py
sklearn/preprocessing/data.py
sklearn/preprocessing/data.py

Contributor

### NicolasHug commented Jul 16, 2018 • edited

I made a quick example to illustrate the use of Yeo-Johnson vs. Box-Cox + offset. As Box-Cox only accepts positive data, one solution is to shift the data by a fixed offset value (typically min(data) + eps):

One thing we see is that the "after offset and Box-Cox" result isn't as symmetric as the Yeo-Johnson one, and most importantly the values are much higher. Is it worth adding this as an example @amueller? TBH I wouldn't be able to mathematically or intuitively explain those results.
Member

### GaelVaroquaux commented Jul 16, 2018

+1 on comments by @glemaitre. Also, the tests are failing.

### NicolasHug added some commits Jul 16, 2018

 Addressed most comments from @glemaitre, fixed flake8 
 dfd1ecc 
 Removed box-cox specific checks in estimator_checks 
The default is now Yeo-Johnson which can handle negative data
 78169f6 
 More explicit variable names for mean and variance 
 f48a17b 

### glemaitre requested changes Jul 16, 2018

doc/modules/preprocessing.rst
examples/preprocessing/plot_power_transformer.py
sklearn/preprocessing/data.py
sklearn/preprocessing/data.py
sklearn/preprocessing/data.py
sklearn/preprocessing/data.py
sklearn/preprocessing/data.py
sklearn/preprocessing/tests/test_data.py
sklearn/preprocessing/tests/test_data.py
sklearn/preprocessing/tests/test_data.py

### NicolasHug added some commits Jul 16, 2018

 Addressed comments from glemaitre 
 67eaa98 
 Changed number of bins in plots to auto 
 7a2bce7 
 Fixed Nan issues (ignored warnings) 
Also removed other checks about positive data now that Yeo-Johnson is the
default
 2b56a9d 

### NicolasHug added some commits Jul 17, 2018

 Should fix test for Python 2.7 
 a0d97ee 
 Should fix example plot 
 23c3ddd 
 Addressed comments from TomDLT 
 0c3b268 
Contributor

### NicolasHug commented Jul 17, 2018 • edited

> Suggestion: When calling fit and transform on the same data, the transform is done twice. We could implement a fit_transform method to do it only once.

Good idea. I tried to implement it, but I don't think it's possible. Currently, when calling fit_transform, the pipeline goes like this:

    X -> transform -> scaler_fit -> scaler_transform -> transform

So transform is indeed called twice, but not on the same data. I don't think avoiding one of them is possible, but I may be missing something? In any case, I changed fit so that transform is only called when needed, that is when standardize is True.
Member

### ogrisel commented Jul 17, 2018 • edited

It's fine to call just transform on the test set. The goal here is to call fit_transform on the training set instead of fit and then transform on the training set, as there is a redundant computation that happens when subsequently calling fit and then transform on the same data.
Member

### TomDLT commented Jul 17, 2018

What I have in mind is:

```python
def fit(self, X):
    _ = self._fit_transform(X, transform=False)
    return self

def fit_transform(self, X):
    Xt = self._fit_transform(X, transform=True)
    return Xt

def transform(self, X):
    check_is_fitted(self, 'lambdas_')
    X = self._check_input(X, check_positive=True, check_shape=True,
                          copy=self.copy)
    Xt = self._transform_inplace(X)
    return Xt

def _fit_transform(self, X, transform=True):
    X = self._check_input(X, check_positive=True, check_method=True,
                          copy=self.copy or not transform)
    optim_function = ...
    self.lambdas_ = []
    for col in X.T:
        with np.errstate(invalid='ignore'):  # hide NaN warnings
            lmbda = optim_function(col)
        self.lambdas_.append(lmbda)
    self.lambdas_ = np.array(self.lambdas_)

    Xt = None
    if self.standardize or transform:
        Xt = self._transform_inplace(X)

    if self.standardize:
        self._scaler = StandardScaler()
        self._scaler.fit(Xt)
        if transform:
            Xt = self._scaler.transform(Xt)

    return Xt

def _transform_inplace(self, X):
    transform_function = ...
    for i, lmbda in enumerate(self.lambdas_):
        with np.errstate(invalid='ignore'):  # hide NaN warnings
            X[:, i] = transform_function(X[:, i], lmbda)
    return X
```

Therefore we reuse the transformed data computed for the standardization.
Member

### TomDLT commented Jul 17, 2018

In any case, we probably need a test to verify that:

- The input is not changed by fit, transform or fit_transform when copy=True.
- The input is not changed by fit, when copy=False.
- The input is changed by transform or fit_transform, when copy=False.
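A sketch of such a test, assuming the PowerTransformer API as merged in this PR. Only the copy=True cases and the in-place transform case are checked here; the fit-with-copy=False case depends on implementation details:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

X = np.abs(np.random.RandomState(0).randn(30, 2)) + 0.5
X_orig = X.copy()

# copy=True: the input must never be modified
pt = PowerTransformer(method='yeo-johnson', standardize=True, copy=True)
pt.fit(X)
pt.transform(X)
pt.fit_transform(X)
assert np.array_equal(X, X_orig)

# copy=False: transform is allowed to work in place
pt = PowerTransformer(method='yeo-johnson', standardize=True, copy=False)
pt.fit(X.copy())              # fit on a scratch copy
X_inplace = X.copy()
pt.transform(X_inplace)
assert not np.array_equal(X_inplace, X_orig)  # modified in place
```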
Contributor

### NicolasHug commented Jul 17, 2018

 Isn't there a general test for all estimators for that?

### ogrisel and others added some commits Jul 17, 2018

 OPTIM implement fit_transform 
 7be0376 
 Added test fit_transform() == fit().transform() 
 420476c 
Member

### ogrisel commented Jul 17, 2018

 I pushed a quick implementation for fit_transform but indeed it does not handle in place scaling properly, although it was also wrong when standardize=True: we also need to pass self._scaler = StandardScaler(copy=self.copy).

### ogrisel reviewed Jul 17, 2018

sklearn/preprocessing/data.py

### NicolasHug added some commits Jul 17, 2018

 Added tests for the copy parameter 
 1287f94 
 Fixed flake8 issues in example plot 
 0ce4b36 

### TomDLT approved these changes Jul 17, 2018

We might want to add them as common tests at some point, but it might be for another pull-request.

sklearn/preprocessing/data.py

### NicolasHug added some commits Jul 17, 2018

 set copy to False for the scaler 
 597a85d 
 Merge branch 'master' of https://github.com/scikit-learn/scikit-learn … 
…into yeojohnson
 593c818 

### glemaitre requested changes Jul 17, 2018

Couple of changes

sklearn/preprocessing/data.py
sklearn/preprocessing/data.py
sklearn/preprocessing/data.py
 Addressed comments from glemaitre 
 8022cc3 
Contributor

### glemaitre commented Jul 17, 2018

 I am waiting to check the example in the documentation
 Merge branch 'master' of https://github.com/scikit-learn/scikit-learn … 
…into yeojohnson
 0c120fb 
Contributor

### NicolasHug commented Jul 18, 2018 • edited

 @glemaitre @ogrisel, I think the plot looks pretty OK now.
Member

### jnothman commented Jul 19, 2018

 I'm finding the plots in plot_map_data_to_normal relatively hard to navigate intuitively. It's not a blocker, but I think it needs to look more tabular: at the moment it takes some effort to see that each row is a different transformation; a label on the left of the row would be more helpful. Also, having the transformations go from left to right and the datasets from top to bottom doesn't look like it would be infeasible, and would be more familiar from plot_cluster_comparison etc.

### jnothman reviewed Jul 19, 2018

Can I clarify why plot_all_scaling still only shows box-cox?

Contributor

### NicolasHug commented Jul 19, 2018

> Also, having the transformations go from left to right and the datasets from top to bottom doesn't look like it would be infeasible, and would be more familiar from plot_cluster_comparison etc.

Personally, I find it easier to compare the transformations when they're stacked on top of each other, especially since the axes limits are uniform across the plots. I don't have anything against having the transformation names on the left.

It would also make sense to me to have one dataset per column (limiting the plot to 4 rows instead of 8), but that would make the plot wider, which can be annoying on mobile.

Thanks for mentioning plot_all_scaling, I missed that one.
 Merge branch 'master' of https://github.com/scikit-learn/scikit-learn … 
…into yeojohnson
 8234a3e 
Contributor

### NicolasHug commented Jul 19, 2018 • edited

Looks like 14e7c32 broke plot_all_scaling on master:

```
Traceback (most recent call last):
  File "examples/preprocessing/plot_all_scaling.py", line 71, in <module>
    dataset = fetch_california_housing()
  File "/home/nico/dev/sklearn/sklearn/datasets/california_housing.py", line 128, in fetch_california_housing
    cal_housing = joblib.load(filepath)
  File "/home/nico/dev/sklearn/sklearn/externals/joblib/numpy_pickle.py", line 578, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/home/nico/dev/sklearn/sklearn/externals/joblib/numpy_pickle.py", line 508, in _unpickle
    obj = unpickler.load()
  File "/usr/lib64/python3.6/pickle.py", line 1050, in load
    dispatch[key[0]](self)
  File "/usr/lib64/python3.6/pickle.py", line 1338, in load_global
    klass = self.find_class(module, name)
  File "/usr/lib64/python3.6/pickle.py", line 1388, in find_class
    __import__(module, level=0)
ModuleNotFoundError: No module named 'sklearn.externals._joblib.numpy_pickle'
```

Should I open an issue for this? I'm not sure if this comes from my env (I created a new one from scratch, still the same). Doesn't the CI check that all the examples are passing?
Member

### amueller commented Jul 20, 2018

 @NicolasHug it's now "fixed" but you need to remove your scikit_learn_data folder in your home folder.

### NicolasHug added some commits Jul 20, 2018

 Merge branch 'master' of https://github.com/scikit-learn/scikit-learn … 
…into yeojohnson
 7d529df 
 Updated plot_all_scaling.py example 
 c0a01df 
Contributor

### NicolasHug commented Jul 20, 2018

 Thanks, just updated plot_all_scaling
Member

### ogrisel commented Jul 20, 2018 • edited

The matplotlib rendering of the 2 examples is good enough for now: https://29575-843222-gh.circle-artifacts.com/0/doc/auto_examples/index.html#preprocessing

Merging. Thanks @NicolasHug for this nice contribution!

### ogrisel merged commit 2d232ac into scikit-learn:master Jul 20, 2018 7 checks passed

#### 7 checks passed

ci/circleci: deploy Your tests passed on CircleCI!
ci/circleci: python2 Your tests passed on CircleCI!
ci/circleci: python3 Your tests passed on CircleCI!
codecov/patch 98.74% of diff hit (target 95.3%)
codecov/project 95.3% (+<.01%) compared to 1984dac
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed

Member

 yay!
Member

### ogrisel commented Jul 20, 2018 • edited

 I agree with @jnothman (#11520 (comment)) that using a layout similar to the cluster comparison plot would improve the readability even further but I don't want to delay the release for this.

Member

 Yey!!
