mannwhitneyu breaks backward compatibility in 0.17.0 #6062

Closed
tavinathanson opened this Issue Apr 13, 2016 · 9 comments

Projects

None yet

3 participants

@tavinathanson
Contributor

I recently found that mannwhitneyu changed from a one-sided to two-sided test, by default, in 0.17.0. I see that this PR cleaned up the documentation a bit to clarify the change.

However, my understanding from this thread is that mannwhitneyu should not have broken backward compatibility in the first place; rather, it'll be deprecated in favor of a new mann_whitney_u.

It looks like the change in functionality happened here.

I was quite surprised to see different p-values in my results on different machines, and it took me a greater part of a day to track down the issue to scipy.

What do you think about changing the default back until the function gets deprecated, issuing a warning, or some other solution to help ease any potential confusion?

@ev-br ev-br added the scipy.stats label Apr 14, 2016
@ev-br ev-br added this to the 0.17.1 milestone Apr 14, 2016
@ev-br
Member
ev-br commented Apr 14, 2016

Sigh. In scipy 0.14.1,

In [4]: from scipy.stats import mannwhitneyu

In [5]: np.random.seed(1234)

In [6]: from scipy.stats import norm

In [7]: x = norm.rvs(size=140)

In [8]: mannwhitneyu(x, x)
Out[8]: (9800.0, 0.49970555935013988)

while on master (ditto on 0.17.0),

In [38]: stats.mannwhitneyu(x, x)
Out[38]: MannwhitneyuResult(statistic=9800.0, pvalue=0.99941111870027977)

In [39]: stats.mannwhitneyu(x, x, alternative='less')
Out[39]: MannwhitneyuResult(statistic=9800.0, pvalue=0.50029444064986017)

Maybe we should restore the default to be alternative='less' instead of two-sided. @rgommers @josef-pkt @WarrenWeckesser thoughts?

@josef-pkt
Member

The problem is that the old behavior is "half of two-sided" or "one-sided" assuming the statistic/hypothesis is in the small tail. (AFAIR)

The new behavior is correct, but there is no way to detect the change in behavior except if you know it and do an explicit scipy version check.
I think reverting to old behavior in 0.17.1 is the better solution consistent with scipy deprecation policy, and hope 0.17.0 doesn't stay in usage.

Aside: It's difficult to see changes to unit tests if they involve moving test around and the tests have subtle adjustments. What I did in the past for one of our point releases in statsmodels in order to check if or where we break backwards compatibility, is to run on the old, previous version unit tests with the new code and go through the list of errors and failures.

@tavinathanson
Contributor

How about switching the alternative default value to None, and in that case using the old behavior of assuming the statistic is in the small tail (and issuing a warning that an alternative value should really be specified).

I'd be happy to submit a PR once this is decided.

@ev-br
Member
ev-br commented Apr 14, 2016

I'd rather keep the range of possible values of the alternatiev keyword. In this case, backwards compatible is satisfied if the default is switched to 'less', no?

Then of course, it's be very nice to finish up gh-4933 in the 0.18 time frame, so that we do provide a non-broken mann-whitney test... But I digress.

@josef-pkt
Member

@ev-br It's not equal to 'less'. It's either less or greater depending on which tail is smaller.
The old is one-sided if the Null goes in the direction of the data, but wrong if we are in the wrong tail (which won't usually happen).

@tavinathanson
Contributor
tavinathanson commented Apr 14, 2016 edited

To clarify the current (0.17.0) behavior:

from scipy.stats import mannwhitneyu
import numpy as np

np.random.seed(1234)

from scipy.stats import norm

x = norm.rvs(size=140)
y = norm.rvs(size=150)

mannwhitneyu(x, y)
MannwhitneyuResult(statistic=10714.0, pvalue=0.76480281221832502)

mannwhitneyu(y, x)
MannwhitneyuResult(statistic=10286.0, pvalue=0.76480281221832502)

mannwhitneyu(x, y, alternative='less')
MannwhitneyuResult(statistic=10714.0, pvalue=0.61813305588378875)

mannwhitneyu(x, y, alternative='greater')
MannwhitneyuResult(statistic=10714.0, pvalue=0.38240140610916251)

mannwhitneyu(y, x, alternative='less')
MannwhitneyuResult(statistic=10286.0, pvalue=0.38240140610916251)

mannwhitneyu(y, x, alternative='greater')
MannwhitneyuResult(statistic=10286.0, pvalue=0.61813305588378875)

Versus old (0.16.1) behavior:

from scipy.stats import mannwhitneyu
import numpy as np

np.random.seed(1234)

from scipy.stats import norm

x = norm.rvs(size=140)
y = norm.rvs(size=150)

mannwhitneyu(x, y)
MannwhitneyuResult(statistic=10286.0, pvalue=0.38240140610916251)

mannwhitneyu(y, x)
MannwhitneyuResult(statistic=10286.0, pvalue=0.38240140610916251)
@ev-br
Member
ev-br commented Apr 15, 2016

OK then, I stand corrected, thanks Josef. All in all, I'm +1 for changing the default to None, which is supposed to mean "whatever scipy < 0.17.0 was doing". No warning should be raised, but it'd be worth documenting the behavior.

@tavinathanson
Contributor

Sounds good; I'll aim for a PR this week.

@ev-br ev-br added defect and removed needs-decision labels May 5, 2016
@ev-br
Member
ev-br commented May 5, 2016

Fixed in gh-6100, closing.

@ev-br ev-br closed this May 5, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment