Unnecessary approximation in scipy.stats.wilcoxon(x, y)
#17530
I'm not immediately seeing why the p-value calculated with `zero_method='zsplit'` should need the approximation. As a sanity check, you can compute the exact p-value using `permutation_test`. If you reliably get the same p-value with your modified function, that would support the change.
Yeah, me neither 🙃, I was just following the paper's recommendation. Is this how you meant?

```python
def wilcoxon_demsar(x, y):
    """Calculate the Wilcoxon signed-rank test.

    Simplified copy of scipy.stats.wilcoxon (version 1.9.3) with the
    following args fixed:
    - zero_method="zsplit"
    - alternative="two-sided"
    - correction=False
    """
```
```python
import numpy as np
from scipy import stats
from scipy.stats import permutation_test, wilcoxon
from scipy.stats._morestats import WilcoxonResult

def compare_statistics(num_samples=6):
    rng = np.random.default_rng(0)
    num_tests = 10
    x = stats.uniform.rvs(size=(num_tests, num_samples), loc=0, random_state=rng)
    y = stats.uniform.rvs(size=(num_tests, num_samples), loc=0.05, random_state=rng)

    def get_statistic(x_, y_):
        return wilcoxon(x_, y_, method='exact', zero_method='zsplit',
                        alternative='two-sided').statistic

    kwds = {
        'vectorized': False,
        'axis': 1,
        'permutation_type': 'samples',
        'random_state': rng,
        'n_resamples': np.inf,
    }
    res_permutation = permutation_test((x, y), get_statistic, **kwds)
    res_demsar = [wilcoxon_demsar(x[i, :], y[i, :], method="exact") for i in range(num_tests)]
    res_demsar = WilcoxonResult(*map(np.array, zip(*res_demsar)))
    print('num_samples =', num_samples)
    print('pvalue permutation')
    print(res_permutation.pvalue)
    print('pvalue demsar')
    print(res_demsar.pvalue)
    print('abs(diff)')
    print(abs(res_permutation.pvalue - res_demsar.pvalue))

compare_statistics(num_samples=3)
compare_statistics(num_samples=6)
compare_statistics(num_samples=9)
# compare_statistics(num_samples=12)
```

If yes, then indeed the adjustment suggested by Demsar is actually quite different :/
I think I didn't quite understand what you meant @mdhaber, because I replaced my modified function with the official one:

```python
res_permutation = permutation_test((x, y), get_statistic, **kwds)
res_wilcoxon = [wilcoxon(x[i, :], y[i, :], method="exact") for i in range(num_tests)]
res_wilcoxon = WilcoxonResult(*map(np.array, zip(*res_wilcoxon)))
res_demsar = [wilcoxon_demsar(x[i, :], y[i, :], method="exact") for i in range(num_tests)]
res_demsar = WilcoxonResult(*map(np.array, zip(*res_demsar)))
```

and I got the same differences.

EDIT 1: Ok, it's because I was using [...]. Although, is something wrong with my test? Because then the official code is also yielding those differences relative to the permutation method.

EDIT 2: I changed the test to use [...]. Demsar's version has a lower average absolute error:
Code:

```python
import numpy as np
from scipy import stats
from scipy.stats import permutation_test, wilcoxon
from scipy.stats._morestats import WilcoxonResult
from anomasota.stats.wilcoxon import wilcoxon_simplified as wilcoxon_demsar

def compare_statistics(num_samples=6):
    rng = np.random.default_rng(0)
    num_tests = 300
    # x = stats.uniform.rvs(size=(num_tests, num_samples), loc=0, random_state=rng)
    # y = stats.uniform.rvs(size=(num_tests, num_samples), loc=0.05, random_state=rng)
    x = rng.integers(0, 30, size=(num_tests, num_samples))
    y = rng.integers(0, 30, size=(num_tests, num_samples))

    def get_statistic(x_, y_):
        return wilcoxon(x_, y_, method='exact', zero_method='zsplit',
                        alternative='two-sided').statistic

    kwds = {
        'vectorized': False, 'axis': 1, 'permutation_type': 'samples',
        'random_state': rng, 'n_resamples': np.inf,
    }
    res_permutation = permutation_test((x, y), get_statistic, **kwds)
    res_wilcoxon = [wilcoxon(x[i, :], y[i, :], method="exact") for i in range(num_tests)]
    res_wilcoxon = WilcoxonResult(*map(np.array, zip(*res_wilcoxon)))
    res_demsar = [wilcoxon_demsar(x[i, :], y[i, :], method="exact") for i in range(num_tests)]
    res_demsar = WilcoxonResult(*map(np.array, zip(*res_demsar)))
    print('num_samples =', num_samples)
    print('sum(abs(diff)) permutation & demsar')
    print(sum(abs(res_permutation.pvalue - res_demsar.pvalue)))
    print('sum(abs(diff)) permutation & wilcoxon')
    print(sum(abs(res_permutation.pvalue - res_wilcoxon.pvalue)))

compare_statistics(num_samples=3)
compare_statistics(num_samples=6)
compare_statistics(num_samples=9)
```
I believe it's your [...]. Here is what I would do to test this. Note that since [...]:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng()
n = 7
x = rng.normal(size=n)
# x[1] = 0  # uncomment to add a zero
# x[2] = x[3]  # uncomment to add a tie
zero_method = 'zsplit'

def statistic(x):
    return stats.wilcoxon(x, zero_method=zero_method,
                          alternative='greater').statistic

res = stats.permutation_test((x,), statistic=statistic, n_resamples=np.inf,
                             permutation_type='samples')
print(res.pvalue, stats.wilcoxon(x, zero_method=zero_method).pvalue)
```
Ok, so I put [...]:

```python
import numpy as np
import scipy.stats
from scipy.stats._morestats import WilcoxonResult

def compare_statistics_2(num_tests=10, num_samples=6):
    rng = np.random.default_rng(0)
    x = rng.normal(size=(num_tests, num_samples))
    x[1] = 0  # add a zero
    x[2] = x[3]  # add a tie

    def get_statistic(x_):
        return scipy.stats.wilcoxon(x_, method='exact', zero_method='zsplit',
                                    alternative='greater').pvalue

    kwds = {
        'vectorized': False, 'axis': 1, 'permutation_type': 'samples',
        'random_state': rng, 'n_resamples': np.inf,
    }
    res_permutation = scipy.stats.permutation_test((x,), get_statistic, **kwds)
    res_wilcoxon = [
        scipy.stats.wilcoxon(x[i, :], method="exact", zero_method='zsplit', alternative="greater")
        for i in range(num_tests)
    ]
    res_wilcoxon = WilcoxonResult(*map(np.array, zip(*res_wilcoxon)))
    print('num_samples =', num_samples)
    print('permutation pvalue')
    print(np.round(res_permutation.pvalue, 4))
    print('wilcoxon pvalue')
    print(np.round(res_wilcoxon.pvalue, 4))
    print('\n')

compare_statistics_2(num_tests=5, num_samples=3)
compare_statistics_2(num_tests=5, num_samples=6)
compare_statistics_2(num_tests=5, num_samples=9)
```
Use `.statistic`, not `.pvalue`, inside `get_statistic`. The thing that shouldn't change is [...]. I did mess up one thing, though (which probably wouldn't change my results but is confusing): in [...].
Ha, ok, now it looks better. So, the official implementation does seem to work as expected, with some differences from the permutation method when there is a zero, which AFAIK is also expected because [...]. The [...]:

```python
import numpy as np
import scipy.stats
from scipy.stats._morestats import WilcoxonResult
from anomasota.stats.wilcoxon import wilcoxon_simplified as wilcoxon_demsar

def compare_statistics_3(num_tests=10, num_samples=6, tested_method=scipy.stats.wilcoxon):
    rng = np.random.default_rng(0)
    x = rng.normal(size=(num_tests, num_samples))
    x[0, 0] = 0  # add a zero
    x[1, 1] = x[1, 2]  # add a tie
    common_kwargs_inside_permutation = dict(zero_method='zsplit', method='exact')

    def get_statistic(x_):
        return scipy.stats.wilcoxon(x_, alternative='greater',
                                    **common_kwargs_inside_permutation).statistic

    common_kwargs_permutation = dict(alternative='two-sided')
    kwds = {
        'vectorized': False, 'axis': 1, 'permutation_type': 'samples',
        'random_state': rng, 'n_resamples': np.inf,
        **common_kwargs_permutation,
    }
    res_permutation = scipy.stats.permutation_test((x,), get_statistic, **kwds)
    # the zip-map just packs the results into a WilcoxonResult holding arrays of shape (num_tests,)
    res_wilcoxon = WilcoxonResult(*map(np.array, zip(*[
        tested_method(x[i, :], **{**common_kwargs_inside_permutation, **common_kwargs_permutation})
        for i in range(num_tests)
    ])))
    print('num_samples =', num_samples)
    print('permutation pvalue')
    print(np.round(res_permutation.pvalue, 4))
    print(f'{tested_method.__name__} pvalue')
    print(np.round(res_wilcoxon.pvalue, 4))
    print('abs(diff)')
    print(np.round(abs(res_permutation.pvalue - res_wilcoxon.pvalue), 4))
    print('\n')

compare_statistics_3(num_tests=8, num_samples=6)
compare_statistics_3(num_tests=8, num_samples=9)
compare_statistics_3(num_tests=8, num_samples=6, tested_method=wilcoxon_demsar)
compare_statistics_3(num_tests=8, num_samples=9, tested_method=wilcoxon_demsar)
```
(Answered my own question: only the first "test" has a zero. That's why we see so many zeros in [...].)

What I see is that [...]. Maybe your original suggestion (#17530 (comment)) can be rephrased: when there are zeros or ties, it is still appropriate to use the same procedure as [...]. I've written about this sort of thing before in gh-13690 (although some of what I wrote was wrong at the time). Someone suggested that we should also switch to the normal approximation when there are ties (not just zeros):

But a smaller error in the p-value in two tests doesn't immediately suggest to me that we should change which approximation we use. Sometime in the next year or so, I'll be working on adding [...]. If you perform larger-scale tests, I might be convinced, though.
Ha, nice to know. It would be great indeed to have the permutation method everywhere :D Sure, I can run more tests! Can you give me a suggestion of how large "large enough" could be? Or at least a good start?
Sorry for changing the test function every time; I think it won't be needed anymore.

Summary: the absolute error relative to the permutation method seems to be lower using Demsar's method than the current implementation of the Wilcoxon test with [...].

So I started testing [...]. For each method I tested the three alternatives with 10 seeds for each of these scenarios:

You can check the results here: https://gist.github.com/jpcbertoldo/da22764eed915ee1f8a1d79e32aa72f6

Recall:
So the main changes are that the switch from [...]. This snippet:

```python
if num_zeros % 2 == 1:
    # discard one zero difference
    # first [0] is for the tuple returned by np.where,
    # second [0] is for the first element of the tuple
    where_idx = np.where(d == 0)[0][0]
    x = np.delete(x, where_idx)
    if y is not None:
        y = np.delete(y, where_idx)
    d = np.delete(d, where_idx)
    num_samples -= 1
    num_zeros -= 1
```

replaces this line:

scipy/scipy/stats/_morestats.py, line 3335 in de80faf
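For readers unfamiliar with the procedure being modified here, this is a rough, self-contained sketch (plain Python, independent of scipy; the function name and the choice of min(W+, W-) as the returned statistic are illustrative, not scipy's implementation): rank the absolute differences with midranks, split the zeros' ranks evenly between the positive and negative rank sums, and, per Demsar's recommendation, drop one zero difference when their count is odd.

```python
def zsplit_statistic(d):
    """Wilcoxon signed-rank statistic with a Demsar-style 'zsplit' (sketch).

    d: paired differences x[i] - y[i].
    Returns min(W+, W-) after splitting the zeros' ranks between W+ and W-.
    """
    d = list(d)
    zeros = [i for i, v in enumerate(d) if v == 0]
    if len(zeros) % 2 == 1:  # Demsar: discard one zero difference
        d.pop(zeros[0])
    # midranks of |d| (tied values share the average of their ranks)
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    i = 0
    while i < len(d):
        j = i
        while j + 1 < len(d) and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        midrank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    r_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    r_minus = sum(r for r, v in zip(ranks, d) if v < 0)
    half_zero = sum(r for r, v in zip(ranks, d) if v == 0) / 2
    return min(r_plus + half_zero, r_minus + half_zero)
```

For example, `zsplit_statistic([0, 0, 1, -2])` keeps both zeros (even count), gives them midrank 1.5 each, and splits their total rank of 3 evenly between the two sides.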
Test code:

```python
from itertools import product

import numpy as np
import pandas as pd
import scipy.stats
from progressbar import progressbar

# wilcoxon_twosided_zsplitdemsar: the modified wilcoxon discussed above
# (Demsar's zsplit behavior without the exact-to-approx switch), defined elsewhere

def compare_statistics_4(tested_method, alternative, num_samples, num_zeros, num_ties, seed):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(num_samples,))
    # add zeros
    x[:num_zeros] = 0
    # add ties
    ties_start_idx = num_zeros
    ties_end_idx = ties_start_idx + num_ties
    x[ties_start_idx:ties_end_idx] = x[ties_start_idx]
    common_kwargs_inside_permutation = dict(zero_method='zsplit', method='exact')

    def get_statistic(x_):
        return scipy.stats.wilcoxon(x_, alternative='greater',
                                    **common_kwargs_inside_permutation).statistic

    common_kwargs_permutation = dict(alternative=alternative)
    kwds = {
        'vectorized': False, 'permutation_type': 'samples',
        'random_state': rng, 'n_resamples': np.inf,
        **common_kwargs_permutation,
    }
    res_permutation = scipy.stats.permutation_test((x,), get_statistic, **kwds)
    res_wilcoxon = tested_method(x, **{**common_kwargs_inside_permutation, **common_kwargs_permutation})
    return {
        "tested_method": tested_method.__name__,
        "alternative": alternative,
        "num_samples": num_samples,
        "num_zeros": num_zeros,
        "num_ties": num_ties,
        "seed": seed,
        "pvalue_permutation": res_permutation.pvalue,
        "pvalue_tested": res_wilcoxon.pvalue,
    }

METHODS = [scipy.stats.wilcoxon, wilcoxon_twosided_zsplitdemsar]
ALTERNATIVES = ['two-sided', 'greater', 'less']
NUMS = [
    (3, 0, 0), (3, 1, 0), (3, 0, 1), (3, 1, 1),
    (6, 0, 0),
    (6, 1, 0), (6, 0, 1), (6, 1, 1),
    (6, 3, 0), (6, 0, 3), (6, 2, 2),
    (9, 0, 0),
    (9, 1, 0), (9, 0, 1), (9, 1, 1),
    (9, 4, 0), (9, 0, 4), (9, 3, 3),
]
SEEDS = list(range(10))

ALL_COMBINATIONS = list(product(METHODS, ALTERNATIVES, NUMS, SEEDS))
print(f"Number of combinations: {len(ALL_COMBINATIONS)}")

records = [None] * len(ALL_COMBINATIONS)
for idx, (tested_method, alternative, (num_samples, num_zeros, num_ties), seed) in progressbar(enumerate(ALL_COMBINATIONS)):
    records[idx] = dict(
        idx=idx,
        **compare_statistics_4(tested_method, alternative, num_samples, num_zeros, num_ties, seed)
    )

df = pd.DataFrame.from_records(records).set_index("idx")
df.to_csv("wilcoxon.csv")
```
I updated the notebook gist with more tests: https://gist.github.com/jpcbertoldo/da22764eed915ee1f8a1d79e32aa72f6

I now included all possible combinations of [...]. Conclusion is still the same: Demsar's version does seem to have lower absolute error relative to the permutation method. [example plot omitted; see the gist]

The update in the combinations:

```python
from itertools import product

NUMS = [
    (3, num_zeros, num_ties)
    for num_zeros, num_ties in product(range(3), range(3))
    if num_zeros + num_ties <= 3
] + [
    (6, num_zeros, num_ties)
    for num_zeros, num_ties in product(range(6), range(6))
    if num_zeros + num_ties <= 6
] + [
    (9, num_zeros, num_ties)
    for num_zeros, num_ties in product(range(9), range(9))
    if num_zeros + num_ties <= 9
]
SEEDS = list(range(30))
```
Ok. I haven't looked carefully yet at how the tests are set up, but this is the sort of thing I'm looking for. In which direction is the difference in p-value: is the new p-value typically larger or smaller? We'd want larger, to prevent false positives. If they are larger, go ahead and submit a PR. Do you see the same thing for the other zero methods? It would be nice if we could change them all.
Hmm. Didn't look at that. |
https://gist.github.com/jpcbertoldo/da22764eed915ee1f8a1d79e32aa72f6

@mdhaber I updated the analysis with more tests, could you take a look? A summary here: https://gist.github.com/jpcbertoldo/da22764eed915ee1f8a1d79e32aa72f6?permalink_comment_id=4423840#gistcomment-4423840 You can skip [...].

It becomes larger than the permutation p-value more often than before (so 👍).

No, the errors get bigger. Good to go for a PR? :)
Wow, you've done your homework. Ok! Go for it!

Do they get more conservative (larger p-values)?
Didn't check that.
Cool! Just some thoughts before that:
Thanks!
If I properly understand the Wilcoxon test, it depends (more generally) on the groups of ties* that you can have inside the sorted absolute differences. Differences of value zero are a special case of that: all the zeros are tied together, and they always appear at the first positions of the sorted absolute differences.

So I tried to figure out by brute force (hehe 😅) how many cases would have to be covered. A "config" is a combination of the possibilities mentioned above. Source: https://gist.github.com/jpcbertoldo/4c6f13294d7d7e89082e6d765c8ad7db

Is it all that much to be (for instance) stored in a file?
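To get a feel for how fast the number of such "configs" can grow, here is one simple, assumed model (illustrative only; the gist may count cases differently): if a config is an ordered split of the n sorted absolute differences into maximal tied groups, the configs correspond to compositions of n, of which there are 2^(n-1).

```python
from itertools import combinations

def tie_group_configs(n):
    """Enumerate ordered splits (compositions) of n sorted values into
    maximal tied groups; each config is a tuple of consecutive group sizes."""
    configs = []
    # a composition is determined by which of the n - 1 gaps between
    # consecutive sorted values are "cuts" between tied groups
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0,) + cuts + (n,)
            configs.append(tuple(b - a for a, b in zip(bounds, bounds[1:])))
    return configs

print(len(tie_group_configs(4)))  # 2 ** (4 - 1) == 8
```

Under this model the count doubles with every extra sample, which is one way to see why pre-computing every case could add up quickly.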
Hm, I see. Adding another one about the API: 3. Your suggestion is to just use Demsar's strategy by default for [...]?
Note: "engagement" was supposed to be "enhancement". I edited my post above.
I thought that originally and then confused myself. I think you're right.
Might be. Depends on the impact to the wheel size. I'd say something on the scale of 100kb would be noticed. See, e.g. #17726 (comment). My hesitation would be for a different reason. I always try to think of
I'm not sure I understand the question. I know that [...]. I don't think we need to add a [...].

You have shown that when [...], we can change this line:

scipy/scipy/stats/_morestats.py, line 3335 in de80faf

so that when zero_method='zsplit', it would no longer force method='approx'. I can agree with that.
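The agreed-upon change can be sketched as a small tweak to the mode-selection logic (hypothetical helper and names for illustration; this is not the actual scipy source):

```python
def choose_mode(mode, n_zero, zero_method):
    """Sketch of wilcoxon's mode selection with the proposed exception.

    Current behavior: any zero difference forces the normal approximation.
    Proposed: keep the exact method when zero_method == 'zsplit'.
    """
    if n_zero > 0 and mode == "exact" and zero_method != "zsplit":
        mode = "approx"
    return mode
```

The point of the exception is that, with zsplit, zeros are handled by splitting their ranks rather than by a correction that requires the approximation.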
When I asked "do they get more conservative?" (#17530 (comment)), I was hoping we could make a similar change for the other zero methods. Was that an answer to the right question?
🤔 Ah, yes, and I think for that many cases even for the [...].
Ok, I see your point. As a non-maintainer, my bias is to be more tolerant of those heterogeneities in the code, because I initially just wanted the easy-to-use interface.
Not sure, but I think I got the info I wanted haha: I will just change the behavior to use Demsar's strategy and avoid the exact-to-approx switch.
Sounds good. Let's start there!
Context:

scipy.stats.wilcoxon(x, y) -- x and y being 1D arrays of the same size -- has two modes, approx and exact, and three policies for treating ties (i.e. when x[i] == y[i]), of which I believe there is an issue when zero_method="zsplit".

If there is at least one tie, the mode is switched to approx independently of the chosen zero_method. This happens on this line:

scipy/scipy/stats/_morestats.py, line 3335 in de80faf

Based on [1], I believe it shouldn't be necessary to use the approx method with zero_method="zsplit". In other words, the Gaussian approximation, which normally should only be needed when x.size > 50, is causing unnecessary imprecision in the test computation.

[1] J. Demšar, "Statistical Comparisons of Classifiers over Multiple Data Sets," Journal of Machine Learning Research, vol. 7, no. 1, pp. 1–30, 2006. URL: http://jmlr.org/papers/v7/demsar06a.html

Paragraph extracted from [1] (section "3.1.3 WILCOXON SIGNED-RANKS TEST", page 7):

I already made a (dirty) fix with the recommendations above on a local copy so I can open a (to-be-cleaned) PR, but I'd appreciate some validation from the maintainers that this makes sense before going through the review process.
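For context on what "exact" means above, here is a minimal, self-contained sketch (plain Python; not scipy's implementation, and restricted to the simple case of no zeros and no tied |d| values): the null distribution of the positive rank sum is obtained by enumerating all 2^n sign assignments of the ranks, which is precisely what the normal approximation replaces.

```python
from itertools import product

def exact_signed_rank_pvalue(d):
    """One-sided ('greater') exact p-value of the Wilcoxon signed-rank test,
    by brute-force enumeration of all 2^n sign assignments of the ranks.
    Assumes no zeros and no tied |d| values (the simple case)."""
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    rank_of = {i: r + 1 for r, i in enumerate(order)}
    w_plus = sum(rank_of[i] for i in range(n) if d[i] > 0)
    # under H0, each rank is attached to a positive or negative difference
    # with probability 1/2, independently
    count = sum(1 for signs in product((0, 1), repeat=n)
                if sum((r + 1) * s for r, s in enumerate(signs)) >= w_plus)
    return count / 2 ** n
```

For example, with d = [1, 2, 3] (all positive, the most extreme case for n = 3), the p-value is 1/8 = 0.125. The enumeration is exponential in n, which is why the normal approximation exists for large samples; the issue is about small samples, where the enumeration is cheap.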