
ENH: Added Yeo-Johnson power transformation #9305

Merged Nov 5, 2018 (20 commits)

Conversation

@NicolasHug (Contributor) commented Sep 24, 2018

Closes #6141.

This PR adds support for the Yeo-Johnson power transform. Unlike the (already implemented) Box-Cox transform, Yeo-Johnson is able to deal with negative values.

I based the implementation on that of scikit-learn (scikit-learn/scikit-learn#11520) and tried to mirror that of Box-Cox.

Done:

  • yeojohnson
  • yeojohnson_llf (log likelihood function)
  • yeojohnson_normmax (optimization function, using the brent optimizer just like boxcox)
  • yeojohnson_normplot
  • tests
  • docstrings
  • versionadded tag
  • updated release notes
  • updated Thanks.txt

Differences from the boxcox implementation:

  • Here the code is pure numpy (see the sketch after this list). The actual boxcox transformation is Cythonized.
  • Only MLE is available to optimize the lambda parameter. Boxcox also supports pearsonr.
  • yeojohnson does not support the alpha parameter for returning a confidence interval.
  • yeojohnson_llf now behaves like boxcox_llf (it previously supported only 1D arrays, as yeojohnson and boxcox do anyway).
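For reference, the transform has four branches, depending on the sign of each value and on whether lambda hits the special values 0 and 2. A minimal pure-NumPy sketch of the transformation (illustrative only, not the PR's exact code):

```python
import numpy as np

def yeojohnson_sketch(x, lmbda):
    """Minimal pure-NumPy sketch of the Yeo-Johnson transform."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # x >= 0: ((x + 1)**lmbda - 1) / lmbda, or log(x + 1) when lmbda == 0
    if abs(lmbda) < np.spacing(1.):
        out[pos] = np.log1p(x[pos])
    else:
        out[pos] = ((x[pos] + 1) ** lmbda - 1) / lmbda
    # x < 0: -((1 - x)**(2 - lmbda) - 1) / (2 - lmbda),
    # or -log(1 - x) when lmbda == 2
    if abs(lmbda - 2) < np.spacing(1.):
        out[~pos] = -np.log1p(-x[~pos])
    else:
        out[~pos] = -((1 - x[~pos]) ** (2 - lmbda) - 1) / (2 - lmbda)
    return out
```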

if lb <= la:
    raise ValueError("`lb` has to be larger than `la`.")
if lmbda is not None:
    return _yeojohnson_transform(x, lmbda)
Contributor Author

The diff is messed up and can mostly be ignored: I simply moved the code from boxcox_normplot into a more general function _normplot that is used in both boxcox_normplot and yeojohnson_normplot. boxcox_normplot and yeojohnson_normplot do the exact same thing (only the transformation changes, of course).

@chrisb83 added the "enhancement (A new feature or improvement)" label on Sep 25, 2018
@chrisb83 (Member)

Thanks, this looks like a very nice feature. I will try to review it soon

@chrisb83 (Member) left a comment

This already looks in good shape; I will review some parts in more detail soon. I think a few more tests should be added.

lmbdas = np.linspace(la, lb, num=N)
ppcc = lmbdas * 0.0
for i, val in enumerate(lmbdas):
    # Determine for each lmbda the correlation coefficient of transformed x
Member

According to the documentation of probplot, "r is the square root of the coefficient of determination", so the comment and the notation r2 are a bit misleading.
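For context, a quick illustration of what probplot returns when fit=True (the sample data here is just an assumption for demonstration):

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
x = rng.normal(size=100)
(osm, osr), (slope, intercept, r) = stats.probplot(x, dist='norm', fit=True)
# r is the correlation coefficient of the probability plot;
# the coefficient of determination discussed above is r**2.
print(r ** 2)
```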

We now use `yeojohnson` to transform the data so it's closest to normal:

>>> ax2 = fig.add_subplot(212)
>>> xt, _ = stats.yeojohnson(x)
Member

I think it would be nice to replace `_` by `lmbda` and say "the optimal lambda used to transform the data is ..." to explain this value.

Contributor Author

You mean to add a print statement?

Member

As if you wanted to see the values in the terminal output. I am not sure the explanation is relevant, since this is the documentation of the same function anyway:

>>> xt, lmbda = stats.yeojohnson(x)
>>> lmbda
<value of lmbda ...>

pos = x >= 0 # binary mask

# when x >= 0
if abs(lmbda) < 1e-19:
Member

why 1e-19?

Contributor Author

Because that's the threshold used in boxcox.

To be fair I don't entirely understand the rationale behind this so I just used the same value.

Member

That is not a robust rule where regular floats are concerned. You can use np.spacing(1.) instead.
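A quick illustration of the suggestion: np.spacing(1.) is the gap between 1.0 and the next representable double, i.e. the float64 machine epsilon.

```python
import numpy as np

eps = np.spacing(1.)       # 2.220446049250313e-16, same as np.finfo(float).eps
lmbda = 1e-17
# lmbda is below the resolution of doubles around 1, so treat it as zero:
print(abs(lmbda) < eps)    # True
print(abs(lmbda) < 1e-19)  # False -- the hardcoded threshold misses it
```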


# when x >= 0
if abs(lmbda) < 1e-19:
    out[pos] = np.log(x[pos] + 1)
Member

use log1p for better accuracy for small x, same in line 1363
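A short demonstration of why log1p matters here:

```python
import numpy as np

x = 1e-12
# Forming 1 + x first rounds away most of x's significant digits;
# np.log(1 + x) is correct to only about four digits for this x,
# while np.log1p(x) is accurate to full double precision.
print(np.log(1 + x))
print(np.log1p(x))
```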


class TestYeojohnson(object):

    def test_fixed_lmbda(self):
Member

Can you add further test cases for negative x, with lmbda == 2 and lmbda != 2?

And also a vector x with both negative and positive numbers?

Contributor Author

Just did in 1c898bb
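For illustration, tests along the lines requested could look like the sketch below, written directly against the two negative-x branches of the Yeo-Johnson definition (hypothetical test code, not the commit's actual contents):

```python
import numpy as np
from numpy.testing import assert_allclose
from scipy import stats

x = np.array([-2.0, -1.0, -0.5])  # all-negative input
# lmbda == 2 uses the logarithmic branch: -log(1 - x)
assert_allclose(stats.yeojohnson(x, lmbda=2.0), -np.log1p(-x))
# lmbda != 2 uses the power branch: -((1 - x)**(2 - lmbda) - 1) / (2 - lmbda)
lmbda = 0.5
assert_allclose(stats.yeojohnson(x, lmbda=lmbda),
                -((1 - x) ** (2 - lmbda) - 1) / (2 - lmbda))
# mixed-sign input exercises both branches at once
xm = np.array([-1.0, 0.0, 2.5])
assert np.all(np.isfinite(stats.yeojohnson(xm, lmbda=lmbda)))
```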


def test_mle(self):
    maxlog = stats.yeojohnson_normmax(self.x)
    assert_allclose(maxlog, 1.876393, rtol=1e-6)
Member

why is this the expected result?

Contributor Author

There's no particular reason. The tests for boxcox also use hardcoded values. I assume the main use is that if some future modification changes the output of yeojohnson, at least we'll know it.

llf = -N/2 \log(\sigma^2) + (\lambda - 1)
      \sum_i \text{sign}(x_i) \log(|x_i| + 1)

where :math:`\sigma^2` is the estimated variance of the Yeo-Johnson
Member

typically denoted by sigma hat, but not sure if this is more clear
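As a sanity check on the formula, it can be evaluated directly; a sketch (scipy exposes this computation as stats.yeojohnson_llf, used here only for comparison):

```python
import numpy as np
from scipy import stats

def yj_llf_sketch(lmbda, x):
    # llf = -N/2 * log(sigma^2) + (lmbda - 1) * sum(sign(x) * log(|x| + 1)),
    # with sigma^2 the variance of the transformed data (constants dropped).
    x = np.asarray(x, dtype=float)
    trans = stats.yeojohnson(x, lmbda=lmbda)
    return (-x.shape[0] / 2 * np.log(trans.var())
            + (lmbda - 1) * np.sum(np.sign(x) * np.log1p(np.abs(x))))

x = np.array([-1.5, -0.5, 1.0, 2.0, 4.0])
print(yj_llf_sketch(1.3, x), stats.yeojohnson_llf(1.3, x))  # should agree
```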

trans = _yeojohnson_transform(data, lmb)

# Estimated mean and variance of the normal distribution
est_mean = trans.sum() / n_samples
Member

Slightly shorter: trans.mean(), which can be inserted directly into the line below; or use x.var() directly.

@chrisb83 (Member) commented Oct 1, 2018

Regarding tests: the original paper by Yeo and Johnson contains an example at the end.

@NicolasHug (Contributor, Author)

Thanks for the review, I think I've addressed all the comments.

Issues closed
-------------

* `#6141 <https://github.com/scipy/scipy/issues/6141>`__: Request: transformation functions - Yeo-Johnson
Member

This and the authors list are generated during the release, so you don't need to populate them just now.

----------
x : ndarray
    Input array. Should be 1-dimensional.
lmbda : {None, scalar}, optional
Member

None is implicit in the signature. float, optional is sufficient here.

x : ndarray
    Input array. Should be 1-dimensional.
lmbda : {None, scalar}, optional
    If `lmbda` is not None, do the transformation for that value.
Member

These paragraphs can be combined. Note that you need double backticks for code formatting.

If ``lmbda`` is None, find the lambda that maximizes the log-likelihood function and return it as the second output argument. Otherwise the transformation is done for the given value.

@ilayn (Member) commented Oct 3, 2018

I've left some minor comments, but in general I think you can reorder the examples such that the Yeo-Johnson code and the plotting parts are separated, to make it more streamlined.

@NicolasHug (Contributor, Author)

I addressed most of the comments; however, I'm not sure what you mean about the examples? Those were almost entirely copy-pasted from the boxcox examples.

@chrisb83 (Member) commented Oct 4, 2018

Cross-ref bug report for boxcox in #6873 and proposed fix in #9271. Not sure if also relevant here.

@NicolasHug (Contributor, Author)

Yes, I could reproduce the issue in #6873 with yeojohnson. We should probably wait for #9271 to be merged then?

@chrisb83 (Member) left a comment

All my comments have been addressed; testing looks good now. I added one more comment on the input format (1d). Otherwise I would be in favor of adding this function (even before the known issues of boxcox are resolved; that might still take a while, and this feature should be useful without the fix).

We still need a decision regarding output format. @rgommers, what is your view?

Parameters
----------
x : array_like
    Input array.
@chrisb83 (Member) commented Oct 11, 2018

This must be 1d as well, right?

General question: would it be better to raise an error if the input is not 1d?

I just noted: boxcox uses axis=0.

Contributor Author

Right, I just pushed a change so that yeojohnson_llf now accepts multi-dimensional arrays and behaves like boxcox_llf, so we can keep the docstring as it is.

@chrisb83 (Member)

Looks like there are no major objections to keeping the output consistent with boxcox (i.e. a variable number of outputs)? I think this PR can then go into release 1.2.0.

@rgommers (Member)

We still need a decision regarding output format. @rgommers, what is your view?

Looks like that's about the conversation that @ev-br resolved, where using Bunches is the ideal outcome (but for another PR). I commented on that; the current status of this PR seems fine.

I think this PR can then go into release 1.2.0.

+1

@rgommers added this to the 1.2.0 milestone on Oct 29, 2018
@rgommers (Member)

TravisCI test failures are real

@NicolasHug (Contributor, Author) commented Oct 30, 2018

I have no idea what's going on :/

The only difference between 6cd0370 (passing) and dce2ea0 (failing) is a comment.

Also, the failing version of scipy being tested isn't in sync with mine: the logs show an error on line 1308, which is not the same as in my file.

I've merged master, let's see...

@chrisb83 (Member)

The tests fail when you check the inverse transformation. The results look good (i.e. close to the expected result) but apparently do not pass the criterion.

Arrays are not almost equal to 2 decimals
E ACTUAL: 0.0070733797505410281
E DESIRED: 0

@NicolasHug (Contributor, Author)

But why now? Those exact tests were passing before.

For some reason I can't install numpy 1.8.2 (I get a compilation error), so I can't reproduce the error for now.

@chrisb83 (Member) commented Nov 2, 2018

I don't know why it fails in the 3.5 build. As a workaround you could try assert_equal(abs(expected-desired)<0.0xy, True)

(not very nice, maybe someone has an insight why the current test fails...)

@ilayn (Member) commented Nov 5, 2018

Let's try to get this one in before the 1.2 branch.

@NicolasHug Can you please switch to assert_allclose anyway, instead of assert_equal? In the past there were a lot of issues about it (I didn't track the status), so maybe the old NumPy version is interfering.
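For reference, a sketch of the suggested change. Note that with a desired value of exactly 0 the default relative tolerance does nothing, so an explicit absolute tolerance is needed:

```python
from numpy.testing import assert_allclose

actual = 0.0070733797505410281  # the value from the failing CI run above
# assert_allclose checks abs(actual - desired) <= atol + rtol * abs(desired);
# desired == 0 makes the rtol term vanish, so pass an explicit atol.
assert_allclose(actual, 0, atol=1e-2)
```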

@chrisb83 merged commit fc1fa2c into scipy:master on Nov 5, 2018
@chrisb83 (Member) commented Nov 5, 2018

@NicolasHug, all: thanks, all checks are green, merged.

@ilayn (Member) commented Nov 5, 2018

This one probably should go into the release notes too.

@rgommers (Member) commented Nov 5, 2018

This one probably should go into the release notes too.

Agreed. @NicolasHug would you mind adding a note to https://github.com/scipy/scipy/wiki/Release-note-entries-for-SciPy-1.2.0?

@NicolasHug (Contributor, Author)

Done :)

@rgommers (Member) commented Nov 5, 2018

great, thanks
