ENH: Added Yeo-Johnson power transformation #9305
Conversation
if lb <= la:
    raise ValueError("`lb` has to be larger than `la`.")
if lmbda is not None:
    return _yeojohnson_transform(x, lmbda)
The diff is messing up and can be mostly ignored: I simply moved the code from `boxcox_normplot` into a more general function `_normplot` that is used in both `boxcox_normplot` and `yeojohnson_normplot`. `boxcox_normplot` and `yeojohnson_normplot` do the exact same thing (only the transformation changes, of course).
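For orientation, here is a minimal sketch of what that refactor could look like; the dispatch on a method string and the exact signature are assumptions, not quoted from the diff:

```python
import numpy as np
from scipy import stats

def _normplot(method, x, la, lb, plot=None, N=80):
    # Shared helper: sweep N lambda values in [la, lb], transform x with the
    # requested method, and record the probability-plot correlation
    # coefficient (ppcc) for each lambda.
    if lb <= la:
        raise ValueError("`lb` has to be larger than `la`.")
    transform = stats.boxcox if method == 'boxcox' else stats.yeojohnson
    lmbdas = np.linspace(la, lb, num=N)
    ppcc = np.empty_like(lmbdas)
    for i, val in enumerate(lmbdas):
        z = transform(np.asarray(x), lmbda=val)
        _, (_, _, r) = stats.probplot(z, dist='norm', fit=True)
        ppcc[i] = r
    if plot is not None:
        plot.plot(lmbdas, ppcc, 'x')
    return lmbdas, ppcc

def boxcox_normplot(x, la, lb, plot=None, N=80):
    return _normplot('boxcox', x, la, lb, plot, N)

def yeojohnson_normplot(x, la, lb, plot=None, N=80):
    return _normplot('yeojohnson', x, la, lb, plot, N)
```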
Thanks, this looks like a very nice feature. I will try to review it soon.
Looks already in good shape; I will review some parts in more detail soon. I think a few more tests should be added.
scipy/stats/morestats.py
Outdated
lmbdas = np.linspace(la, lb, num=N)
ppcc = lmbdas * 0.0
for i, val in enumerate(lmbdas):
    # Determine for each lmbda the correlation coefficient of transformed x
According to the documentation of `probplot`, "r is the square root of the coefficient of determination", so the comment and the notation `r2` are a bit misleading.
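For reference, the relevant return value of `probplot` (with illustrative data, not taken from the PR):

```python
import numpy as np
from scipy import stats

x = np.random.normal(size=100)
# probplot returns ((osm, osr), (slope, intercept, r)); r is the square
# root of the coefficient of determination, so r**2 would be "r squared".
_, (slope, intercept, r) = stats.probplot(x, dist='norm', fit=True)
print(r, r**2)
```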
scipy/stats/morestats.py
Outdated
We now use `yeojohnson` to transform the data so it's closest to normal:

>>> ax2 = fig.add_subplot(212)
>>> xt, _ = stats.yeojohnson(x)
I think it would be nice to replace `_` by `lmbda` and say "the optimal lambda used to transform the data is …" to explain this value.
You mean to add a `print` statement?
As in, you want to see the values in the terminal output? I am not sure the explanation is relevant, since this is the documentation of the same function anyway.
>>> xt, lmbda = stats.yeojohnson(x)
>>> lmbda
<value of lmbda ...>
scipy/stats/morestats.py
Outdated
pos = x >= 0  # binary mask

# when x >= 0
if abs(lmbda) < 1e-19:
Why `1e-19`?
Because that's the threshold used in `boxcox`. To be fair, I don't entirely understand the rationale behind it, so I just used the same value.
That won't make a strict rule where regular floats are concerned. You can use `np.spacing(1.)` instead.
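For context, `np.spacing(1.)` is the float64 machine epsilon, so a threshold based on it tracks actual float precision rather than an arbitrary constant:

```python
import numpy as np

# Gap between 1.0 and the next representable float64, i.e. machine epsilon.
print(np.spacing(1.))                          # 2.220446049250313e-16
print(np.spacing(1.) == np.finfo(float).eps)   # True

# A hardcoded 1e-19 sits far below machine epsilon, so it never triggers
# for lambdas that are merely "numerically zero" at float64 precision.
```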
scipy/stats/morestats.py
Outdated
# when x >= 0
if abs(lmbda) < 1e-19:
    out[pos] = np.log(x[pos] + 1)
Use `log1p` for better accuracy for small x; same in line 1363.
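A quick illustration of the accuracy difference (the value of x here is chosen purely for demonstration):

```python
import numpy as np

x = 1e-18
# log(x + 1) first rounds x + 1 to exactly 1.0 in float64, losing x entirely.
print(np.log(x + 1))   # 0.0
print(np.log1p(x))     # 1e-18, correct to machine precision
```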
class TestYeojohnson(object):

    def test_fixed_lmbda(self):
Can you add further test cases for negative x with `lmbda == 2` / `lmbda != 2`? And also a vector x with negative and positive numbers?
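A minimal sketch of such cases, assuming the standard Yeo-Johnson definition; the `yj_reference` helper below is hypothetical and serves only as a hand-written reference to test against:

```python
import numpy as np
from numpy.testing import assert_allclose
from scipy import stats

def yj_reference(x, lmbda):
    # Direct transcription of the Yeo-Johnson definition, branching on the
    # sign of x and on whether lmbda hits the special values 0 and 2.
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    if lmbda != 0:
        out[pos] = ((x[pos] + 1) ** lmbda - 1) / lmbda
    else:
        out[pos] = np.log1p(x[pos])
    if lmbda != 2:
        out[~pos] = -((-x[~pos] + 1) ** (2 - lmbda) - 1) / (2 - lmbda)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out

x = np.array([-2.5, -1.0, 0.0, 1.5, 4.0])   # mixed negative and positive
for lmbda in [0.0, 0.5, 2.0, 3.0]:
    assert_allclose(stats.yeojohnson(x, lmbda=lmbda), yj_reference(x, lmbda))
```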
Just did in 1c898bb
def test_mle(self):
    maxlog = stats.yeojohnson_normmax(self.x)
    assert_allclose(maxlog, 1.876393, rtol=1e-6)
Why is this the expected result?
There's no particular reason. The tests for `boxcox` also use hardcoded values. I assume the main use case is that if some future modification changes the output of `yeojohnson`, at least we'll know it.
scipy/stats/morestats.py
Outdated
llf = N/2 \log(\sigma^2) + (\lambda - 1)
      \sum_i \text{sign}(x_i) \log(|x_i| + 1)

where :math:`\sigma^2` is the estimated variance of the Yeo-Johnson
Typically denoted by sigma hat, but I'm not sure whether that is clearer.
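For orientation, a sketch of how this llf can be evaluated, under the assumptions that :math:`\sigma^2` is the biased (MLE) variance of the transformed data and that the log-variance term enters negatively, as in the usual profile log-likelihood:

```python
import numpy as np
from scipy import stats

def yeojohnson_llf_sketch(lmbda, x):
    # sigma^2 is taken as the MLE (ddof=0) variance of the transformed data;
    # the second term is the log of the Jacobian of the transformation.
    x = np.asarray(x, dtype=float)
    n = x.size
    trans = stats.yeojohnson(x, lmbda=lmbda)
    sigma_sq = trans.var()
    jacobian = (lmbda - 1) * np.sum(np.sign(x) * np.log1p(np.abs(x)))
    return -n / 2 * np.log(sigma_sq) + jacobian
```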
scipy/stats/morestats.py
Outdated
trans = _yeojohnson_transform(data, lmb)

# Estimated mean and variance of the normal distribution
est_mean = trans.sum() / n_samples
Slightly shorter: `trans.mean()`, which can be inserted directly into the line below; or use `x.var()` directly.
Regarding tests: the original paper of Yeo and Johnson contains an example at the end.
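A sketch of the shortened version, with placeholder data standing in for the transformed array:

```python
import numpy as np

trans = np.array([0.2, 1.5, -0.3, 0.8])  # placeholder for the transformed data
est_mean = trans.mean()                  # equivalent to trans.sum() / n_samples
est_var = trans.var()                    # biased (ddof=0) variance
```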
Thanks for the review, I think I've addressed all comments.
doc/release/1.2.0-notes.rst
Outdated
Issues closed
-------------

* `#6141 <https://github.com/scipy/scipy/issues/6141>`__: Request: transformation functions - Yeo-Johnson
This and the authors list are generated during the release, so you don't need to populate them just now.
scipy/stats/morestats.py
Outdated
----------
x : ndarray
    Input array. Should be 1-dimensional.
lmbda : {None, scalar}, optional
`None` is implicit in the signature; `float, optional` is sufficient here.
scipy/stats/morestats.py
Outdated
x : ndarray
    Input array. Should be 1-dimensional.
lmbda : {None, scalar}, optional
    If `lmbda` is not None, do the transformation for that value.
These paragraphs can be combined. Note that you need double backticks for code formatting.
If ``lmbda`` is None, find the lambda that maximizes the log-likelihood function and return it as the second output argument. Otherwise the transformation is done for the given value.
I've left some minor comments, but in general I think you can reorder the examples such that the Yeo-Johnson code and the plotting parts are separated, to make it more streamlined.
I addressed most of the comments; however, I'm not sure what you mean about the examples?
All my comments have been addressed; testing looks good now. I added one more comment on the input format (1d). Otherwise I would be in favor of adding this function (even before the known issues of boxcox are resolved; this might still take a while, and this feature should be useful without the fix).
We still need a decision regarding the output format. @rgommers, what is your view?
Parameters
----------
x : array_like
    Input array.
This must be 1d as well, right? General question: would it be better to raise an error if the input is not 1d? I just noted: `boxcox` uses axis=0.
Right, I just pushed a change so that `yeojohnson_llf` now accepts multi-dimensional arrays and behaves like `boxcox_llf`, so we can keep the docstring like that.
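A quick check of the axis-0 convention this change mirrors, with illustrative data (assuming a SciPy version where `yeojohnson_llf` exists):

```python
import numpy as np
from scipy import stats

# For 2-D input, boxcox_llf evaluates the log-likelihood along the first
# axis, returning one value per column; yeojohnson_llf now matches this.
rng = np.random.RandomState(0)
data = rng.exponential(size=(100, 3))     # positive, so boxcox_llf is valid
print(stats.boxcox_llf(1.5, data).shape)      # (3,)
print(stats.yeojohnson_llf(1.5, data).shape)  # (3,)
```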
Looks like there are no major objections to keep the output consistent to
Looks like that's about the conversation that @ev-br resolved, where using Bunches is the ideal outcome (but for another PR). Commented on that; the current status of this PR seems fine.
+1
TravisCI test failures are real.
The tests fail when you check the inverse transformation. Results look good (i.e., close to the expected result) but apparently do not pass the criterion: "Arrays are not almost equal to 2 decimals".
But why now? Those exact tests were passing before. For some reason I can't install numpy 1.8.2 (I get a compilation error), so I can't reproduce the error for now.
I don't know why it fails in the 3.5 build. As a workaround you could try (not very nice; maybe someone has an insight into why the current test fails...)
Let's try to get this one in before the 1.2 branch. @NicolasHug Can you please switch to
@NicolasHug, all: thanks, all checks are green, merged.
This one probably should go into the release notes too.
Agreed. @NicolasHug would you mind adding a note to https://github.com/scipy/scipy/wiki/Release-note-entries-for-SciPy-1.2.0?
Done :)
Great, thanks!
Closes #6141.
This PR adds support for the Yeo-Johnson power transform. Unlike the (already implemented) Box-Cox transform, Yeo-Johnson is able to deal with negative values.
I based the implementation on that of scikit-learn (scikit-learn/scikit-learn#11520) and tried to mirror that of Box-Cox.
Done:

* `yeojohnson`
* `yeojohnson_llf` (log likelihood function)
* `yeojohnson_normmax` (optimization function, using the brent optimizer just like boxcox)
* `yeojohnson_normplot`

Differences with the boxcox implementation:

* … `pearsonr`.
* `yeojohnson` does not support the `alpha` parameter for returning a confidence interval.
* `yeojohnson_llf` only supports 1D arrays (same for `yeojohnson` and `boxcox` anyway).
* `yeojohnson_llf` now behaves like `boxcox_llf`.