# [MRG] Add Yeo-Johnson transform to PowerTransformer #11520

## Conversation

Contributor

NicolasHug commented Jul 14, 2018

Closes #10261

#### What does this implement/fix? Explain your changes.

This PR implements the Yeo-Johnson transform as part of the PowerTransformer class.

PowerTransformer currently only supports Box-Cox, which only works for positive values; Yeo-Johnson works for the whole real line.

TODO:

• Write transform
• Fix lambda param estimation
• Write inverse transform
• Write docs
• Write tests
• Update examples

The lambda parameter estimation is a bit tricky and currently does not work (should be OK now, see below). Unlike for Box-Cox, there's no scipy built-in that we can rely on. I'm having a hard time finding decent guidelines. I tried to implement likelihood maximization with the Brent optimizer (just like for Box-Cox) but ran into overflow issues.
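For readers following the estimation discussion, here is a minimal sketch of maximum-likelihood estimation of lambda for Yeo-Johnson, in pure Python. It is not the code from this PR: the function names are hypothetical, the log-likelihood is the usual Gaussian profile log-likelihood for the transform (assumed here, not copied from the diff), and a plain golden-section search stands in for the Brent optimizer mentioned above.

```python
import math


def yeo_johnson(x, lmbda, eps=1e-12):
    """Yeo-Johnson transform of one float; defined for any real x."""
    if x >= 0:
        if abs(lmbda) < eps:
            return math.log1p(x)
        return ((x + 1.0) ** lmbda - 1.0) / lmbda
    if abs(lmbda - 2.0) < eps:
        return -math.log1p(-x)
    return -((1.0 - x) ** (2.0 - lmbda) - 1.0) / (2.0 - lmbda)


def log_likelihood(data, lmbda):
    """Profile log-likelihood of lmbda under a Gaussian model (up to a constant)."""
    n = len(data)
    y = [yeo_johnson(x, lmbda) for x in data]
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / n
    # Log-Jacobian term of the transform: sign(x) * log(1 + |x|), summed.
    jac = sum(math.copysign(1.0, x) * math.log1p(abs(x)) for x in data)
    return -n / 2.0 * math.log(var) + (lmbda - 1.0) * jac


def estimate_lambda(data, lo=-2.0, hi=2.0, tol=1e-6):
    """Golden-section search for the maximizer on [lo, hi].

    Assumes the log-likelihood is unimodal on the bracket.
    """
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if log_likelihood(data, c) > log_likelihood(data, d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2.0


# Small right-skewed sample that includes negative values.
data = [-1.5, -0.3, 0.1, 0.4, 0.9, 1.7, 3.2, 6.5, 12.0]
lmbda_hat = estimate_lambda(data)
print(lmbda_hat, log_likelihood(data, lmbda_hat))
```

Restricting the search to a bounded bracket and to small inputs keeps the powers well within floating-point range; with large values or a wide bracket, `(x + 1.0) ** lmbda` can overflow, which is the kind of issue described above.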

The transform code seems to work though. The resulting plot (figure not preserved here) is a reproduction of a figure from Quantile regression via vector generalized additive models by Thomas W. Yee.

Code for figure (hacky):

import numpy as np
from sklearn.preprocessing import PowerTransformer
import matplotlib.pyplot as plt

yj = PowerTransformer(method='yeo-johnson', standardize=False)
bc = PowerTransformer(method='box-cox', standardize=False)

X = np.arange(-4, 4, .1).reshape(-1, 1)
fig, axes = plt.subplots(ncols=2)

for lmbda in (0, .5, 1, 1.5, 2):
    X_pos = X[X > 0].reshape(-1, 1)
    bc.fit(X_pos)
    bc.lambdas_ = [lmbda]
    X_trans = bc.transform(X_pos)
    axes[0].plot(X_pos, X_trans, label=r'$\lambda = {}$'.format(lmbda))
    axes[0].set_title('Box-Cox')

    yj.fit(X)
    yj.lambdas_ = [lmbda]
    X_trans = yj.transform(X)
    axes[1].plot(X, X_trans, label=r'$\lambda = {}$'.format(lmbda))
    axes[1].set_title('Yeo-Johnson')

for ax in axes:
    ax.set(xlim=[-4, 4], ylim=[-5, 5], aspect='equal')
    ax.legend()
    ax.grid()

plt.show()

 WIP - First draft on Yeo-Johnson transform 
 Fixed lambda param optimization 
The issue was from an error in the log likelihood function
Contributor

NicolasHug commented Jul 14, 2018

 Lambda param estimation should be fixed now, thanks @amueller. Replication of this example with Yeo-Johnson instead of Box-Cox:

 Some first tests 
Need to write inverse_transform to continue
 Put helper method for yeo-johnson at the end 
amueller reviewed Jul 15, 2018

sklearn/preprocessing/data.py

amueller reviewed Jul 15, 2018

We think it's working now, right? So we need a test for the optimization, and then documentation and adding it to an example?

sklearn/preprocessing/data.py
sklearn/preprocessing/data.py
@@ -2076,7 +2078,7 @@ def test_power_transformer_strictly_positive_exception():
                         pt.fit, X_with_negatives)

    assert_raise_message(ValueError, not_positive_message,
-                        power_transform, X_with_negatives)
+                        power_transform, X_with_negatives, 'box-cox')

amueller Jul 15, 2018

Member

why is this needed? The default value shouldn't change, right? Or do we want to start a cycle to change the default to yeo-johnson?

NicolasHug Jul 15, 2018

Contributor

I find it clearer and explicit?

I don't know if we'll change the default, but it should still be fine as PowerTransformer hasn't been released yet AFAIK.

amueller Jul 15, 2018

Member

good point. We should discuss before the release. I think yeo-johnson would make more sense.

ogrisel Jul 15, 2018

Member

The fact that Yeo-Johnson accepts negative values while Box-Cox does not makes me feel like we should use it by default. From a usability point of view, it's nicer to our users.

NicolasHug Jul 15, 2018

Contributor

I have the same feeling. Plus, it is designed to be a generalization of Box-Cox, even though that's not strictly the case.

Member

+1

NicolasHug Jul 15, 2018

Contributor

shall I change the default then?

Member

think so.

Contributor

Done

 Added inverse transform + some tests 
 Added test for the optimization procedures 
 Created _box_cox_optimize method for better code symmetry 
 Opt for yeo-johnson not influenced by Nan 
 0525bab 
 8e187c4 
 4173df3 
 Updated more docs and example 
ogrisel reviewed Jul 15, 2018

sklearn/preprocessing/tests/test_data.py
 updated test 
ogrisel reviewed Jul 15, 2018

sklearn/preprocessing/tests/test_data.py

ogrisel reviewed Jul 15, 2018

sklearn/preprocessing/tests/test_data.py
Member

amueller commented Jul 15, 2018

 If we want this to be the default then this is a blocker, right?

### ogrisel reviewed Jul 15, 2018

lmbda_no_nans = pt.lambdas_[0]

# concat nans at the end and check lambda stays the same
X = np.concatenate([X, np.full_like(X, np.nan)])

ogrisel Jul 15, 2018

Member

To make sure that the location of the NaNs does not impact the estimation:

from sklearn.utils import shuffle
...

X = np.concatenate([X, np.full_like(X, np.nan)])
X = shuffle(X, random_state=0)

Contributor

Done

ogrisel reviewed Jul 15, 2018

sklearn/preprocessing/tests/test_data.py

 Modified tests according to reviews 
 Changed default method from cox-box to yeo-johnson 
Member

amueller commented Jul 15, 2018

 tagged for 0.20 and added blocker label. I don't like that we keep adding stuff but if we want to make it default we should do it now.

glemaitre approved these changes Jul 16, 2018

If I am not wrong, we should have something in the common estimator_checks which forces the input to be positive to work with box-cox. We probably want to change this behavior if we change the default.

sklearn/preprocessing/data.py
sklearn/preprocessing/data.py
sklearn/preprocessing/data.py
    return x_inv

def _yeo_johnson_transform(self, x, lmbda):

glemaitre Jul 16, 2018

Contributor

we cannot just define the forward transform and take 1 / _yeo_johnson_transform for the inverse?

NicolasHug Jul 16, 2018

Contributor

The inverse here means f^{-1}, not 1 / f
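To make the distinction concrete, here is a small pure-Python sketch of the Yeo-Johnson transform f and its functional inverse f^{-1}, checked by a round trip. The function names are hypothetical and the code is scalar-only for readability; it is an illustration of the formulas, not the implementation in sklearn/preprocessing/data.py.

```python
import math

EPS = 1e-12  # tolerance used to detect the special cases lambda = 0 and lambda = 2


def yj_forward(x, lmbda):
    """y = f(x): Yeo-Johnson transform of a single float."""
    if x >= 0:
        if abs(lmbda) < EPS:
            return math.log1p(x)
        return ((x + 1.0) ** lmbda - 1.0) / lmbda
    if abs(lmbda - 2.0) < EPS:
        return -math.log1p(-x)
    return -((1.0 - x) ** (2.0 - lmbda) - 1.0) / (2.0 - lmbda)


def yj_inverse(y, lmbda):
    """x = f^{-1}(y): solves yj_forward(x, lmbda) == y for x.

    f is increasing with f(0) = 0, so the sign of y selects the branch.
    """
    if y >= 0:
        if abs(lmbda) < EPS:
            return math.expm1(y)
        return (lmbda * y + 1.0) ** (1.0 / lmbda) - 1.0
    if abs(lmbda - 2.0) < EPS:
        return -math.expm1(-y)
    return 1.0 - (1.0 - (2.0 - lmbda) * y) ** (1.0 / (2.0 - lmbda))


for lmbda in (0, 0.5, 1, 1.5, 2):
    for x in (-3.0, -0.5, 0.0, 0.7, 4.0):
        y = yj_forward(x, lmbda)
        assert abs(yj_inverse(y, lmbda) - x) < 1e-9

# The reciprocal 1 / f(x) is a different quantity and does not recover x.
x, lmbda = 4.0, 2
y = yj_forward(x, lmbda)      # ((4 + 1)**2 - 1) / 2 == 12.0
print(yj_inverse(y, lmbda))   # recovers x
print(1.0 / y)                # reciprocal, unrelated to x
```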

sklearn/preprocessing/data.py
sklearn/preprocessing/data.py
sklearn/preprocessing/data.py

Contributor

NicolasHug commented Jul 16, 2018

 I made a quick example to illustrate the use of Yeo-Johnson vs. Box-Cox + offset. As Box-Cox only accepts positive data, one solution is to shift the data by a fixed offset value (typically min(data) + eps):

 (figure not preserved)

 One thing we see is that the "after offset and Box-Cox" result isn't as symmetric as the Yeo-Johnson one and, most importantly, the values are much higher.

 Is it worth adding this as an example @amueller? TBH I wouldn't be able to mathematically or intuitively explain those results.
Member

GaelVaroquaux commented Jul 16, 2018

 +1 on comments by @glemaitre. Also, the tests are failing.

 Addressed most comments from @glemaitre, fixed flake8 
 Removed box-cox specific checks in estimator_checks 
glemaitre requested changes Jul 16, 2018

doc/modules/preprocessing.rst
examples/preprocessing/plot_power_transformer.py
sklearn/preprocessing/data.py
sklearn/preprocessing/data.py
sklearn/preprocessing/data.py
sklearn/preprocessing/data.py
sklearn/preprocessing/data.py
sklearn/preprocessing/tests/test_data.py
sklearn/preprocessing/tests/test_data.py
sklearn/preprocessing/tests/test_data.py

 Addressed comments from glemaitre 
 Changed number of bins in plots to auto 
 Fixed Nan issues (ignored warnings) 
 Should fix test for Python 2.7 
 Should fix example plot 
 Addressed comments from TomDLT 
Contributor

NicolasHug commented Jul 17, 2018

 > Suggestion: When calling fit and transform on the same data, the transform is done twice. We could implement a fit_transform method to do it only once.

 Good idea, I tried to implement it but I don't think it's possible. Currently, when calling fit_transform, the pipeline goes like this:

 X -> transform -> scaler_fit -> scaler_transform -> transform

 So transform is indeed called twice, but not on the same data. I don't think avoiding one of them is possible, but I may be missing something?

 In any case, I changed fit so that transform is only called when needed, that is, when standardize is True.
Member

ogrisel commented Jul 17, 2018

 It's fine to call just transform on the test set. The goal here is to call fit_transform on the training set instead of fit and then transform, because calling fit and then transform on the same data repeats the same computation.
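The redundancy can be illustrated with a toy, pure-Python sketch. `ToyPowerScaler` is a hypothetical stand-in, not the PowerTransformer code: its "power" step just cubes the values (no lambda is estimated) and a counter records how often that step runs.

```python
import math


class ToyPowerScaler:
    """Hypothetical stand-in: cube each value, then optionally standardize."""

    def __init__(self, standardize=True):
        self.standardize = standardize
        self.n_power_calls = 0  # how many times the costly step has run

    def _power(self, X):
        self.n_power_calls += 1
        return [x ** 3 for x in X]

    def _scale(self, Xt):
        return [(v - self.mean_) / self.std_ for v in Xt]

    def _fit(self, X, force_transform):
        Xt = None
        # The power step is needed either to fit the scaler or to return output.
        if self.standardize or force_transform:
            Xt = self._power(X)
        if self.standardize:
            self.mean_ = sum(Xt) / len(Xt)
            self.std_ = math.sqrt(
                sum((v - self.mean_) ** 2 for v in Xt) / len(Xt))
            if force_transform:
                Xt = self._scale(Xt)  # reuse the values computed above
        return Xt

    def fit(self, X):
        self._fit(X, force_transform=False)
        return self

    def fit_transform(self, X):
        return self._fit(X, force_transform=True)

    def transform(self, X):
        Xt = self._power(X)
        return self._scale(Xt) if self.standardize else Xt


X = [-2.0, -0.5, 0.0, 1.0, 3.0]

two_step = ToyPowerScaler()
out_two_step = two_step.fit(X).transform(X)

one_step = ToyPowerScaler()
out_one_step = one_step.fit_transform(X)

print(two_step.n_power_calls, one_step.n_power_calls)  # prints: 2 1
```

Both paths return the same values; only the number of power-step evaluations on the training data differs.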
Member

TomDLT commented Jul 17, 2018

 What I have in mind is:

def fit(self, X):
    _ = self._fit_transform(X, transform=False)
    return self

def fit_transform(self, X):
    Xt = self._fit_transform(X, transform=True)
    return Xt

def transform(self, X):
    check_is_fitted(self, 'lambdas_')
    X = self._check_input(X, check_positive=True, check_shape=True,
                          copy=self.copy)
    Xt = self._transform_inplace(X)
    return Xt

def _fit_transform(self, X, transform=True):
    X = self._check_input(X, check_positive=True, check_method=True,
                          copy=self.copy or not transform)
    optim_function = ...

    self.lambdas_ = []
    for col in X.T:
        with np.errstate(invalid='ignore'):  # hide NaN warnings
            lmbda = optim_function(col)
            self.lambdas_.append(lmbda)
    self.lambdas_ = np.array(self.lambdas_)

    Xt = None
    if self.standardize or transform:
        Xt = self._transform_inplace(X)

    if self.standardize:
        self._scaler = StandardScaler()
        self._scaler.fit(Xt)
        if transform:
            Xt = self._scaler.transform(Xt)

    return Xt

def _transform_inplace(self, X):
    transform_function = ...
    for i, lmbda in enumerate(self.lambdas_):
        with np.errstate(invalid='ignore'):  # hide NaN warnings
            X[:, i] = transform_function(X[:, i], lmbda)
    return X

 Therefore we reuse the transformed data computed for the standardization.
Member

TomDLT commented Jul 17, 2018

 In any case, we probably need a test to verify that:

• The input is not changed by fit, transform or fit_transform when copy=True.
• The input is not changed by fit when copy=False.
• The input is changed by transform or fit_transform when copy=False.
Contributor

NicolasHug commented Jul 17, 2018

 Isn't there a general test for all estimators for that?

 OPTIM implement fit_transform 
 Added test fit_transform() == fit().transform() 
Member

ogrisel commented Jul 17, 2018

 I pushed a quick implementation for fit_transform but indeed it does not handle in place scaling properly, although it was also wrong when standardize=True: we also need to pass self._scaler = StandardScaler(copy=self.copy).

ogrisel reviewed Jul 17, 2018

sklearn/preprocessing/data.py

 Added tests for the copy parameter 
 Fixed flake8 issues in example plot 
TomDLT approved these changes Jul 17, 2018

We might want to add them as common tests at some point, but it might be for another pull-request.

sklearn/preprocessing/data.py

 set copy to False for the scaler 
glemaitre requested changes Jul 17, 2018

Couple of changes

sklearn/preprocessing/data.py
sklearn/preprocessing/data.py
sklearn/preprocessing/data.py
 8022cc3 
glemaitre commented Jul 17, 2018

 I am waiting to check the example in the documentation
Contributor

NicolasHug commented Jul 18, 2018

 @glemaitre @ogrisel, I think the plot looks pretty OK now.
Member

jnothman commented Jul 19, 2018

 I'm finding the plots in plot_map_data_to_normal relatively hard to navigate intuitively. It's not a blocker, but I think it needs to look more tabular: at the moment it takes some effort to see that each row is a different transformation; a label on the left of the row would be more helpful. Also, having the transformations go from left to right and the datasets from top to bottom doesn't look like it would be infeasible, and would be more familiar from plot_cluster_comparison etc.

jnothman reviewed Jul 19, 2018

Can I clarify why plot_all_scaling still only shows box-cox?

Contributor

NicolasHug commented Jul 19, 2018

 > Also, having the transformations go from left to right and the datasets from top to bottom doesn't look like it would be infeasible, and would be more familiar from plot_cluster_comparison etc.

 Personally I find it easier to compare the transformations when they're stacked on each other, especially since the axes limits are uniform across the plots. I don't have anything against having the transformation names on the left. It would also make sense to me to have one dataset per column (limiting the plot to 4 rows instead of 8), but that would make the plot wider, which can be annoying on mobile.

 Thanks for mentioning plot_all_scaling, I missed that one.
Contributor

NicolasHug commented Jul 19, 2018

 Looks like 14e7c32 broke plot_all_scaling on master:

Traceback (most recent call last):
  File "examples/preprocessing/plot_all_scaling.py", line 71, in <module>
    dataset = fetch_california_housing()
  File "/home/nico/dev/sklearn/sklearn/datasets/california_housing.py", line 128, in fetch_california_housing
    cal_housing = joblib.load(filepath)
  File "/home/nico/dev/sklearn/sklearn/externals/joblib/numpy_pickle.py", line 578, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/home/nico/dev/sklearn/sklearn/externals/joblib/numpy_pickle.py", line 508, in _unpickle
    obj = unpickler.load()
  File "/usr/lib64/python3.6/pickle.py", line 1050, in load
    dispatch[key[0]](self)
  File "/usr/lib64/python3.6/pickle.py", line 1338, in load_global
    klass = self.find_class(module, name)
  File "/usr/lib64/python3.6/pickle.py", line 1388, in find_class
    __import__(module, level=0)
ModuleNotFoundError: No module named 'sklearn.externals._joblib.numpy_pickle'

 Should I open an issue for this? I'm not sure if this comes from my env (I created a new one from scratch, still the same). Doesn't the CI check that all the examples are passing?
Member

amueller commented Jul 20, 2018

 @NicolasHug it's now "fixed" but you need to remove your scikit_learn_data folder in your home folder.

 Merge branch 'master' of https://github.com/scikit-learn/scikit-learn … 
 7d529df 
 c0a01df 
NicolasHug commented Jul 20, 2018

 Thanks, just updated plot_all_scaling
Member

ogrisel commented Jul 20, 2018

 The matplotlib rendering of the 2 examples is good enough for now: https://29575-843222-gh.circle-artifacts.com/0/doc/auto_examples/index.html#preprocessing Merging. Thanks @NicolasHug for this nice contribution!

Member

 yay!
Member

ogrisel commented Jul 20, 2018

 I agree with @jnothman (#11520 (comment)) that using a layout similar to the cluster comparison plot would improve the readability even further but I don't want to delay the release for this.

Member

 Yey!!

