
ENH: Added Yeo-Johnson power transformation #9305

Merged Nov 5, 2018 (20 commits)

Conversation

@NicolasHug (Contributor) commented Sep 24, 2018

Closes #6141.

This PR adds support for the Yeo-Johnson power transform. Unlike the (already implemented) Box-Cox transform, Yeo-Johnson is able to deal with negative values.

I based the implementation on that of scikit-learn (scikit-learn/scikit-learn#11520) and tried to mirror that of Box-Cox.

Done:

  • yeojohnson
  • yeojohnson_llf (log likelihood function)
  • yeojohnson_normmax (optimization function, using the brent optimizer just like boxcox)
  • yeojohnson_normplot
  • tests
  • docstrings
  • versionadded tag
  • updated release notes
  • updated Thanks.txt

Differences from the boxcox implementation:

  • Here the code is pure numpy (see the sketch after this list). The actual boxcox transformation is Cythonized.
  • Only MLE is available to optimize the lambda parameter. Boxcox also supports pearsonr.
  • yeojohnson does not support the alpha parameter for returning a confidence interval.
  • yeojohnson_llf now behaves like boxcox_llf (it previously supported only 1D arrays, as yeojohnson and boxcox do anyway).
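For reference, the transform has four branches, depending on the sign of each value and on whether lambda hits the special values 0 and 2. A minimal pure-NumPy sketch of the transformation (illustrative only, not the PR's exact code):

```python
import numpy as np

def yeojohnson_sketch(x, lmbda):
    """Minimal pure-NumPy sketch of the Yeo-Johnson transform."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # x >= 0: ((x + 1)**lmbda - 1) / lmbda, or log(x + 1) when lmbda == 0
    if abs(lmbda) < np.spacing(1.):
        out[pos] = np.log1p(x[pos])
    else:
        out[pos] = ((x[pos] + 1) ** lmbda - 1) / lmbda
    # x < 0: -((1 - x)**(2 - lmbda) - 1) / (2 - lmbda),
    # or -log(1 - x) when lmbda == 2
    if abs(lmbda - 2) < np.spacing(1.):
        out[~pos] = -np.log1p(-x[~pos])
    else:
        out[~pos] = -((1 - x[~pos]) ** (2 - lmbda) - 1) / (2 - lmbda)
    return out
```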

if lb <= la:
    raise ValueError("`lb` has to be larger than `la`.")
if lmbda is not None:
    return _yeojohnson_transform(x, lmbda)
Contributor Author

The diff is messed up and can mostly be ignored: I simply moved the code from boxcox_normplot into a more general function _normplot that is used in both boxcox_normplot and yeojohnson_normplot. boxcox_normplot and yeojohnson_normplot do the exact same thing (only the transformation changes, of course).

@chrisb83 added the "enhancement (A new feature or improvement)" label on Sep 25, 2018
@chrisb83 (Member)

Thanks, this looks like a very nice feature. I will try to review it soon

@chrisb83 (Member) left a comment

This already looks in good shape; I will review some parts in more detail soon. I think a few more tests should be added.

lmbdas = np.linspace(la, lb, num=N)
ppcc = lmbdas * 0.0
for i, val in enumerate(lmbdas):
    # Determine for each lmbda the correlation coefficient of transformed x
Member

According to the documentation of probplot, "r is the square root of the coefficient of determination", so the comment and the notation r2 are a bit misleading.
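For context, a quick illustration of what probplot returns when fit=True (the sample data here is just an assumption for demonstration):

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
x = rng.normal(size=100)
(osm, osr), (slope, intercept, r) = stats.probplot(x, dist='norm', fit=True)
# r is the correlation coefficient of the probability plot;
# the coefficient of determination discussed above is r**2.
print(r ** 2)
```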

We now use `yeojohnson` to transform the data so it's closest to normal:

>>> ax2 = fig.add_subplot(212)
>>> xt, _ = stats.yeojohnson(x)
Member

I think it would be nice to replace `_` by `lmbda` and say "the optimal lambda used to transform the data is ..." to explain this value.

Contributor Author

You mean to add a print statement?

Member

As if you wanted to see the values in the terminal output. I am not sure the explanation is relevant, since this is the documentation of the same function anyway:

>>> xt, lmbda = stats.yeojohnson(x)
>>> lmbda
<value of lmbda ...>

pos = x >= 0 # binary mask

# when x >= 0
if abs(lmbda) < 1e-19:
Member

why 1e-19?

Contributor Author

Because that's the threshold used in boxcox.

To be fair I don't entirely understand the rationale behind this so I just used the same value.

Member

That is not a robust rule where regular floats are concerned. You can use np.spacing(1.) instead.
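A quick illustration of the suggestion: np.spacing(1.) is the gap between 1.0 and the next representable double, i.e. the float64 machine epsilon.

```python
import numpy as np

eps = np.spacing(1.)       # 2.220446049250313e-16, same as np.finfo(float).eps
lmbda = 1e-17
# lmbda is below the resolution of doubles around 1, so treat it as zero:
print(abs(lmbda) < eps)    # True
print(abs(lmbda) < 1e-19)  # False -- the hardcoded threshold misses it
```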


# when x >= 0
if abs(lmbda) < 1e-19:
    out[pos] = np.log(x[pos] + 1)
Member

use log1p for better accuracy for small x, same in line 1363
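A short demonstration of why log1p matters here:

```python
import numpy as np

x = 1e-12
# Forming 1 + x first rounds away most of x's significant digits;
# np.log(1 + x) is correct to only about four digits for this x,
# while np.log1p(x) is accurate to full double precision.
print(np.log(1 + x))
print(np.log1p(x))
```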


class TestYeojohnson(object):

    def test_fixed_lmbda(self):
Member

Can you add further test cases for negative x, with lmbda == 2 and lmbda != 2?

And also a vector x with both negative and positive numbers?

Contributor Author

Just did in 1c898bb
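For illustration, tests along the lines requested could look like the sketch below, written directly against the two negative-x branches of the Yeo-Johnson definition (hypothetical test code, not the commit's actual contents):

```python
import numpy as np
from numpy.testing import assert_allclose
from scipy import stats

x = np.array([-2.0, -1.0, -0.5])  # all-negative input
# lmbda == 2 uses the logarithmic branch: -log(1 - x)
assert_allclose(stats.yeojohnson(x, lmbda=2.0), -np.log1p(-x))
# lmbda != 2 uses the power branch: -((1 - x)**(2 - lmbda) - 1) / (2 - lmbda)
lmbda = 0.5
assert_allclose(stats.yeojohnson(x, lmbda=lmbda),
                -((1 - x) ** (2 - lmbda) - 1) / (2 - lmbda))
# mixed-sign input exercises both branches at once
xm = np.array([-1.0, 0.0, 2.5])
assert np.all(np.isfinite(stats.yeojohnson(xm, lmbda=lmbda)))
```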


def test_mle(self):
    maxlog = stats.yeojohnson_normmax(self.x)
    assert_allclose(maxlog, 1.876393, rtol=1e-6)
Member

why is this the expected result?

Contributor Author

There's no particular reason. The tests for boxcox also use hardcoded values. I assume the main use is that if some future modification changes the output of yeojohnson, at least we'll know it.

llf = -N/2 \log(\sigma^2) + (\lambda - 1)
      \sum_i \text{sign}(x_i) \log(|x_i| + 1)

where :math:`\sigma^2` is the estimated variance of the Yeo-Johnson
Member

typically denoted by sigma hat, but not sure if this is more clear
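As a sanity check on the formula, it can be evaluated directly; a sketch (scipy exposes this computation as stats.yeojohnson_llf, used here only for comparison):

```python
import numpy as np
from scipy import stats

def yj_llf_sketch(lmbda, x):
    # llf = -N/2 * log(sigma^2) + (lmbda - 1) * sum(sign(x) * log(|x| + 1)),
    # with sigma^2 the variance of the transformed data (constants dropped).
    x = np.asarray(x, dtype=float)
    trans = stats.yeojohnson(x, lmbda=lmbda)
    return (-x.shape[0] / 2 * np.log(trans.var())
            + (lmbda - 1) * np.sum(np.sign(x) * np.log1p(np.abs(x))))

x = np.array([-1.5, -0.5, 1.0, 2.0, 4.0])
print(yj_llf_sketch(1.3, x), stats.yeojohnson_llf(1.3, x))  # should agree
```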

trans = _yeojohnson_transform(data, lmb)

# Estimated mean and variance of the normal distribution
est_mean = trans.sum() / n_samples
Member

Slightly shorter: trans.mean(), which can be inserted directly into the line below; or use x.var() directly.

@chrisb83 (Member) commented Oct 1, 2018

Regarding tests: the original paper by Yeo and Johnson contains an example at the end.

@NicolasHug (Contributor, Author)

Thanks for the review, I think I've addressed all the comments.

Issues closed
-------------

* `#6141 <https://github.com/scipy/scipy/issues/6141>`__: Request: transformation functions - Yeo-Johnson
Member

This and the authors list are generated during the release, so you don't need to populate them just now.

----------
x : ndarray
    Input array. Should be 1-dimensional.
lmbda : {None, scalar}, optional
Member

None is implicit in the signature. float, optional is sufficient here.

x : ndarray
    Input array. Should be 1-dimensional.
lmbda : {None, scalar}, optional
    If `lmbda` is not None, do the transformation for that value.
Member

These paragraphs can be combined. Note that you need double backticks for code formatting.

If ``lmbda`` is None, find the lambda that maximizes the log-likelihood function and return it as the second output argument. Otherwise the transformation is done for the given value.

@ilayn (Member) commented Oct 3, 2018

I've left some minor comments, but in general I think you can reorder the examples such that the Yeo-Johnson code and the plotting parts are separated, to make it more streamlined.

@NicolasHug (Contributor, Author)

I addressed most of the comments; however, I'm not sure what you mean about the examples? Those were almost entirely copy-pasted from the boxcox examples.

@chrisb83 (Member) commented Oct 4, 2018

Cross-ref bug report for boxcox in #6873 and proposed fix in #9271. Not sure if also relevant here.

@NicolasHug (Contributor, Author)

Yes, I could reproduce the issue in #6873 with yeojohnson. We should probably wait for #9271 to be merged then?

@chrisb83 (Member) left a comment

All my comments have been addressed; testing looks good now. I added one more comment on the input format (1d). Otherwise I would be in favor of adding this function (even before the known issues of boxcox are resolved; that might still take a while, and this feature should be useful without the fix).

We still need a decision regarding output format. @rgommers, what is your view?

Parameters
----------
x : array_like
    Input array.
@chrisb83 (Member) commented Oct 11, 2018

This must be 1d as well, right?

General question: would it be better to raise an error if the input is not 1d?

I just noted: boxcox uses axis=0.

Contributor Author

Right, I just pushed a change so that yeojohnson_llf now accepts multi-dimensional arrays and behaves like boxcox_llf, so we can keep the docstring as it is.

@chrisb83 (Member)

Looks like there are no major objections to keeping the output consistent with boxcox (i.e. a variable number of outputs)? I think this PR can then go into release 1.2.0.

@rgommers (Member)

We still need a decision regarding output format. @rgommers, what is your view?

Looks like that's about the conversation that @ev-br resolved, where using Bunches is the ideal outcome (but for another PR). I commented on that; the current status of this PR seems fine.

I think this PR can then go into release 1.2.0.

+1

@rgommers added this to the 1.2.0 milestone on Oct 29, 2018
@rgommers (Member)

TravisCI test failures are real

@NicolasHug (Contributor, Author) commented Oct 30, 2018

I have no idea what's going on :/

The only difference between 6cd0370 (passing) and dce2ea0 (failing) is a comment.

Also, the failing version of scipy being tested isn't in sync with mine: the logs show an error on line 1308, which is not the same as in my file.

I've merged master, let's see...

@chrisb83 (Member)

The tests fail when you check the inverse transformation. The results look good (i.e. close to the expected result) but apparently do not pass the criterion.

Arrays are not almost equal to 2 decimals
E ACTUAL: 0.0070733797505410281
E DESIRED: 0

@NicolasHug (Contributor, Author)

But why now? Those exact tests were passing before.

For some reason I can't install numpy 1.8.2 (I get a compilation error), so I can't reproduce the error for now.

@chrisb83 (Member) commented Nov 2, 2018

I don't know why it fails in the 3.5 build. As a workaround you could try assert_equal(abs(expected-desired)<0.0xy, True)

(not very nice, maybe someone has an insight why the current test fails...)

@ilayn (Member) commented Nov 5, 2018

Let's try to get this one in before the 1.2 branch.

@NicolasHug Can you please switch to assert_allclose anyway, instead of assert_equal? In the past there were a lot of issues about it (I didn't track the status), so maybe the old NumPy version is interfering.
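For reference, a sketch of the suggested change. Note that with a desired value of exactly 0 the default relative tolerance does nothing, so an explicit absolute tolerance is needed:

```python
from numpy.testing import assert_allclose

actual = 0.0070733797505410281  # the value from the failing CI run above
# assert_allclose checks abs(actual - desired) <= atol + rtol * abs(desired);
# desired == 0 makes the rtol term vanish, so pass an explicit atol.
assert_allclose(actual, 0, atol=1e-2)
```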

@chrisb83 merged commit fc1fa2c into scipy:master on Nov 5, 2018
@chrisb83 (Member) commented Nov 5, 2018

@NicolasHug, all: thanks, all checks are green, merged.

@ilayn (Member) commented Nov 5, 2018

This one probably should go into the release notes too.

@rgommers (Member) commented Nov 5, 2018

This one probably should go into the release notes too.

Agreed. @NicolasHug would you mind adding a note to https://github.com/scipy/scipy/wiki/Release-note-entries-for-SciPy-1.2.0?

@NicolasHug (Contributor, Author)

Done :)

@rgommers (Member) commented Nov 5, 2018

great, thanks
