
Unnecessary approximation in scipy.stats.wilcoxon(x, y) #17530

Closed
jpcbertoldo opened this issue Dec 3, 2022 · 21 comments · Fixed by #19770
Labels: enhancement (A new feature or improvement), scipy.stats

Comments

@jpcbertoldo commented:

Context: scipy.stats.wilcoxon(x, y) -- x and y being 1D arrays of the same size -- has two modes, approx and exact, and three policies for treating zero differences (i.e., when x[i] == y[i]). I believe there is an issue when zero_method="zsplit".

If there is at least one zero difference, the mode is switched to approx regardless of the chosen zero_method.

This happens on this line:

if n_zero > 0 and mode == "exact":

Based on [1], I believe the approx method shouldn't be necessary with zero_method="zsplit".

In other words, the Gaussian approximation, which is normally meant for x.size > 50, is causing unnecessary imprecision in the test computation.

[1] J. Demšar, “Statistical Comparisons of Classifiers over Multiple Data Sets,” Journal of Machine Learning Research, vol. 7, no. 1, pp. 1–30, 2006. URL: http://jmlr.org/papers/v7/demsar06a.html


Paragraph extracted from [1] (section "3.1.3 WILCOXON SIGNED-RANKS TEST", page 7):

> Let di again be the difference between the performance scores of the two classifiers on the i-th out of N data sets. The differences are ranked according to their absolute values; average ranks are assigned in case of ties. Let R+ be the sum of ranks for the data sets on which the second algorithm outperformed the first, and R− the sum of ranks for the opposite. Ranks of di = 0 are split evenly among the sums; if there is an odd number of them, one is ignored:
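To make the recommendation concrete, here is a minimal sketch in code (my paraphrase of the quoted procedure, not SciPy's implementation):

import numpy as np
from scipy.stats import rankdata

def signed_rank_sums_zsplit(d):
    """Demsar's zsplit: average ranks for ties, ranks of zero differences
    split evenly between R+ and R-, and one zero dropped if their count is odd."""
    d = np.asarray(d, dtype=float)
    if (d == 0).sum() % 2 == 1:
        d = np.delete(d, np.flatnonzero(d == 0)[0])  # ignore one zero
    r = rankdata(np.abs(d))  # average ranks in case of ties
    r_plus = r[d > 0].sum() + 0.5 * r[d == 0].sum()
    r_minus = r[d < 0].sum() + 0.5 * r[d == 0].sum()
    return r_plus, r_minus

# three zeros: one is dropped, the remaining two are split between the sums
print(signed_rank_sums_zsplit([0.0, 0.0, 0.0, 1.5, -2.0, 3.0]))  # (9.5, 5.5)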


I already made a (dirty) fix with the recommendations above on a local copy so I can open a (to-be-cleaned) PR, but I'd appreciate some validation from the maintainers that this makes sense before going through the review process.

@j-bowhay added the scipy.stats and enhancement labels on Dec 3, 2022
@mdhaber (Contributor) commented Dec 6, 2022

I'm not immediately seeing why the p-value calculated with method='exact' would still truly be exact in case of ties resolved with split.

> I'd appreciate some validation from the maintainers that this makes sense before going through the review process.

As a sanity check, you can compute the exact p-value using permutation_test (e.g., see here or here). That is, use wilcoxon(x, y, method='approx', zero_method='zsplit') just to define the statistic, and permutation_test will compute the p-value. Keep the sample sizes small (less than 10) and make n_resamples infinite to ensure that you get an exact p-value.

If you reliably get the same p-value with permutation_test as you get directly from wilcoxon (using your adjusted copy), you can submit a PR with your changes. Either way, please post your permutation_test code here.

@jpcbertoldo (Author) commented:

> I'm not immediately seeing why the p-value calculated with method='exact' would still truly be exact in case of ties resolved with split.

Yeah, me neither 🙃; I was just following the paper's recommendation.
But I guess it's just a hotfix for this edge case, no?


Is this what you meant?

# def wilcoxon_demsar(x, y):
#     """Calculate the Wilcoxon signed-rank test.
#     Simplified copy of scipy.stats.wilcoxon (version 1.9.3) with the following changes.
#     The args below are fixed:
#         - zero_method="zsplit"
#         - alternative="two-sided"
#         - correction=False
#     """

import numpy as np
from scipy import stats
from scipy.stats import permutation_test, wilcoxon
from scipy.stats._morestats import WilcoxonResult  # result container (private API)

# wilcoxon_demsar is my local modified copy described in the docstring above

def compare_statistics(num_samples=6):

    rng = np.random.default_rng(0)

    num_tests = 10
    x = stats.uniform.rvs(size=(num_tests, num_samples), loc=0, random_state=rng)
    y = stats.uniform.rvs(size=(num_tests, num_samples), loc=0.05, random_state=rng)

    def get_statistic(x_, y_):
        return wilcoxon(x_, y_, method='exact', zero_method='zsplit', alternative="two-sided").statistic

    kwds = {
        'vectorized': False,
        'axis': 1,
        'permutation_type': 'samples',
        'random_state': rng,
        'n_resamples': np.inf,
    }

    res_permutation = permutation_test((x, y), get_statistic, **kwds)

    res_demsar = [wilcoxon_demsar(x[i, :], y[i, :], method="exact") for i in range(num_tests)]
    # pack the per-test results into a WilcoxonResult holding arrays of shape (num_tests,)
    res_demsar = WilcoxonResult(*map(np.array, zip(*res_demsar)))

    print('num_samples =', num_samples)
    print('pvalue permutation')
    print(res_permutation.pvalue)
    print('pvalue demsar')
    print(res_demsar.pvalue)
    print('abs(diff)')
    print(abs(res_permutation.pvalue - res_demsar.pvalue))

compare_statistics(num_samples=3)
compare_statistics(num_samples=6)
compare_statistics(num_samples=9)
# compare_statistics(num_samples=12)

If yes, then indeed the adjustment suggested by Demšar is actually quite different :/
The prints above give:

num_samples = 3
pvalue permutation
[0.5 0.5 1.  1.  0.5 1.  1.  0.5 0.5 0.5]
pvalue demsar
[0.25 1.   0.5  0.75 1.   0.75 0.5  1.   0.25 0.25]
abs(diff)
[0.25 0.5  0.5  0.25 0.5  0.25 0.5  0.5  0.25 0.25]

num_samples = 6
pvalue permutation
[0.875  0.625  1.     0.1875 0.875  0.125  0.3125 0.625  0.3125 0.875 ]
pvalue demsar
[0.6875  0.84375 0.5625  0.09375 0.6875  0.0625  1.      0.3125  0.15625
 0.4375 ]
abs(diff)
[0.1875  0.21875 0.4375  0.09375 0.1875  0.0625  0.6875  0.3125  0.15625
 0.4375 ]

num_samples = 9
pvalue permutation
[0.1484375 0.6953125 1.        0.8515625 0.53125   1.        0.6015625
 0.6953125 0.1796875 0.6015625]
pvalue demsar
[0.07421875 0.734375   0.5703125  0.42578125 0.8203125  0.5703125
 0.30078125 0.734375   1.         0.30078125]
abs(diff)
[0.07421875 0.0390625  0.4296875  0.42578125 0.2890625  0.4296875
 0.30078125 0.0390625  0.8203125  0.30078125]

@jpcbertoldo (Author) commented Dec 6, 2022

I think I didn't quite understand what you meant, @mdhaber, because I swapped my modified function for the official one

    res_permutation = permutation_test((x, y), get_statistic, **kwds)

    res_wilcoxon = [wilcoxon(x[i, :], y[i, :], method="exact") for i in range(num_tests)]
    res_wilcoxon = WilcoxonResult(*map(np.array, zip(*res_wilcoxon)))
    
    res_demsar = [wilcoxon_demsar(x[i, :], y[i, :], method="exact") for i in range(num_tests)]
    res_demsar = WilcoxonResult(*map(np.array, zip(*res_demsar)))

and I got the same differences.


EDIT 1

Ok, it's because I was using stats.uniform.rvs, so there are no ties.

Still, is something wrong with my test? Because then the official code is also yielding those differences relative to the permutation method.


EDIT 2

I changed the test to use rng.integers(0, 30) and computed the differences between the (permutation, official) p-values and the same for (permutation, Demšar). Then I checked the sum of absolute differences over 300 tests of sample sizes 3, 6, and 9.

Demšar's version has a lower average absolute error:

num_samples = 3
sum(abs(diff)) permutation & demsar
110.5
sum(abs(diff)) permutation & wilcoxon
122.31971928242785
num_samples = 6
sum(abs(diff)) permutation & demsar
103.71875
sum(abs(diff)) permutation & wilcoxon
105.01349359109997
num_samples = 9
sum(abs(diff)) permutation & demsar
99.3828125
sum(abs(diff)) permutation & wilcoxon
99.99268338872903

Code

import numpy as np
from scipy import stats
from scipy.stats import permutation_test, wilcoxon
from scipy.stats._morestats import WilcoxonResult  # result container (private API)

# my local modified copy implementing Demsar's recommendation
from anomasota.stats.wilcoxon import wilcoxon_simplified as wilcoxon_demsar

def compare_statistics(num_samples=6):
    rng = np.random.default_rng(0)
    num_tests = 300
    # x = stats.uniform.rvs(size=(num_tests, num_samples), loc=0, random_state=rng)
    # y = stats.uniform.rvs(size=(num_tests, num_samples), loc=0.05, random_state=rng)
    x = rng.integers(0, 30, size=(num_tests, num_samples))
    y = rng.integers(0, 30, size=(num_tests, num_samples))

    def get_statistic(x_, y_):
        return wilcoxon(x_, y_, method='exact', zero_method='zsplit', alternative="two-sided").statistic

    kwds = {
        'vectorized': False, 'axis': 1, 'permutation_type': 'samples',
        'random_state': rng, 'n_resamples': np.inf,
    }

    res_permutation = permutation_test((x, y), get_statistic, **kwds)

    res_wilcoxon = [wilcoxon(x[i, :], y[i, :], method="exact") for i in range(num_tests)]
    res_wilcoxon = WilcoxonResult(*map(np.array, zip(*res_wilcoxon)))

    res_demsar = [wilcoxon_demsar(x[i, :], y[i, :], method="exact") for i in range(num_tests)]
    res_demsar = WilcoxonResult(*map(np.array, zip(*res_demsar)))

    print('num_samples =', num_samples)
    print('sum(abs(diff)) permutation & demsar')
    print(sum(abs(res_permutation.pvalue - res_demsar.pvalue)))
    print('sum(abs(diff)) permutation & wilcoxon')
    print(sum(abs(res_permutation.pvalue - res_wilcoxon.pvalue)))

compare_statistics(num_samples=3)
compare_statistics(num_samples=6)
compare_statistics(num_samples=9)

@mdhaber (Contributor) commented Dec 6, 2022

I believe it's your get_statistic function that is not quite right. The wilcoxon result's statistic attribute is, unfortunately, the minimum of two possible statistics. (Changing that default - like we did for mannwhitneyu - to always return the same statistic is arduous due to backward compatibility requirements.) As explained in the Gotchas section of the tutorial I linked, you need to set alternative to either 'less' or 'greater' in the get_statistic function, regardless of the alternative you want from the permutation test.
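A toy illustration of that gotcha (relying on the documented behavior that the two-sided statistic is min(R+, R-), while the one-sided statistic is always R+):

import numpy as np
from scipy import stats

d = np.array([1.0, 2.0, -3.0, 4.0, 5.0])  # R+ = 1 + 2 + 4 + 5 = 12, R- = 3

# default (two-sided): statistic is min(R+, R-) = 3.0
print(stats.wilcoxon(d).statistic)

# alternative='greater': statistic is R+ = 12.0, the same quantity
# for every permutation, which is what permutation_test needs
print(stats.wilcoxon(d, alternative='greater').statistic)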

Here is what I would do to test this. Note that since wilcoxon is really a one-sample test, if two samples are provided, it simply works on the difference between the two. Let's provide just one sample to make it easier to introduce ties and zeros.

import numpy as np
from scipy import stats
rng = np.random.default_rng()

n = 7
x = rng.normal(size=n)
# x[1] = 0  # uncomment to add a zero
# x[2] = x[3]  # uncomment to add a tie

zero_method = 'zsplit'
def statistic(x):
    return stats.wilcoxon(x, zero_method=zero_method, 
                          alternative='greater').statistic


res = stats.permutation_test((x,), statistic=statistic, n_resamples=np.inf,
                             permutation_type='samples')

print(res.pvalue, stats.wilcoxon(x, zero_method=zero_method).pvalue)
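For reference, the one-sample equivalence mentioned above is easy to check with continuous data (no zeros or ties):

import numpy as np
from scipy import stats

rng = np.random.default_rng(12345)
x, y = rng.normal(size=7), rng.normal(size=7)
# wilcoxon(x, y) computes the differences x - y internally
assert np.isclose(stats.wilcoxon(x, y).pvalue, stats.wilcoxon(x - y).pvalue)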

@jpcbertoldo (Author) commented:

Ok, so I put 'greater' in both methods (permutation and plain wilcoxon) and I got doubled values (as in the Gotchas section, which should be the case, right?), but not always.

import numpy as np
import scipy.stats
from scipy.stats._morestats import WilcoxonResult  # result container (private API)

def compare_statistics_2(num_tests=10, num_samples=6):

    rng = np.random.default_rng(0)

    x = rng.normal(size=(num_tests, num_samples))
    x[1] = 0  # add a zero (a whole row of zeros)
    x[2] = x[3]  # add a tie (two identical rows)

    def get_statistic(x_):
        # NB: as pointed out in the follow-up below, this should return .statistic, not .pvalue
        return scipy.stats.wilcoxon(x_, method='exact', zero_method='zsplit', alternative="greater").pvalue

    kwds = {
        'vectorized': False, 'axis': 1, 'permutation_type': 'samples',
        'random_state': rng, 'n_resamples': np.inf,
    }
    res_permutation = scipy.stats.permutation_test((x,), get_statistic, **kwds)

    res_wilcoxon = [
        scipy.stats.wilcoxon(x[i, :], method="exact", zero_method='zsplit', alternative="greater")
        for i in range(num_tests)
    ]
    res_wilcoxon = WilcoxonResult(*map(np.array, zip(*res_wilcoxon)))

    print('num_samples =', num_samples)
    print('permutation pvalue')
    print(np.round(res_permutation.pvalue, 4))
    print('wilcoxon pvalue')
    print(np.round(res_wilcoxon.pvalue, 4))
    print('\n')

compare_statistics_2(num_tests=5, num_samples=3)
compare_statistics_2(num_tests=5, num_samples=6)
compare_statistics_2(num_tests=5, num_samples=9)

Output:
num_samples = 3
permutation pvalue
[0.75 1.   0.5  0.5  0.25]
wilcoxon pvalue
[0.375 0.5   0.875 0.875 1.   ]


num_samples = 6
permutation pvalue
[0.6875 1.     0.3125 0.3125 0.6875]
wilcoxon pvalue
[0.3438 0.5    0.1562 0.1562 0.7188]


num_samples = 9
permutation pvalue
[0.4258 1.     0.7344 0.7344 0.1641]
wilcoxon pvalue
[0.2129 0.5    0.6738 0.6738 0.082 ]

@mdhaber (Contributor) commented Dec 7, 2022

Use alternative='greater' in the call to permutation_test if that's what you're using in the call to wilcoxon to get res_wilcoxon. In my example, I used the default - two-sided - for both, which also works. You could try alternative='less' for both, too. Those just need to match.

The thing that shouldn't change is alternative='greater' in the wilcoxon call in get_statistic. It's correct right now, so it can be left alone when you change the alternative of the others.

I did mess up one thing, though (which probably wouldn't change my results but is confusing): in get_statistic, it should be .statistic instead of .pvalue - return the statistic attribute, not the p-value attribute. I corrected my code above.

@jpcbertoldo (Author) commented:

Ha, ok, now it looks better.

So the official implementation does seem to work as expected, with some differences from the permutation method when there is a zero, which AFAIK is also expected because scipy.stats.wilcoxon switches to the approx mode.

wilcoxon_twosided_zsplitdemsar (my implementation of Demšar's recommendation) also seems to work like the official one, and with lower absolute error relative to the permutation method.

import numpy as np
import scipy.stats
from scipy.stats._morestats import WilcoxonResult  # result container (private API)

# wilcoxon_demsar is my local modified copy implementing Demsar's recommendation

def compare_statistics_3(num_tests=10, num_samples=6, tested_method=scipy.stats.wilcoxon):

    rng = np.random.default_rng(0)

    x = rng.normal(size=(num_tests, num_samples))
    x[0, 0] = 0  # add a zero
    x[1, 1] = x[1, 2]  # add a tie

    common_kwargs_inside_permutation = dict(zero_method='zsplit', method='exact')

    def get_statistic(x_):
        return scipy.stats.wilcoxon(x_, alternative='greater', **common_kwargs_inside_permutation).statistic

    common_kwargs_permutation = dict(alternative='two-sided')
    kwds = {
        'vectorized': False, 'axis': 1, 'permutation_type': 'samples', 'random_state': rng, 'n_resamples': np.inf,
        **common_kwargs_permutation,
    }

    res_permutation = scipy.stats.permutation_test((x,), get_statistic, **kwds)

    # the zip-map just packs the results into a WilcoxonResult holding arrays of shape (num_tests,)
    res_wilcoxon = WilcoxonResult(*map(np.array, zip(*[
        tested_method(x[i, :], **{**common_kwargs_inside_permutation, **common_kwargs_permutation})
        for i in range(num_tests)
    ])))

    print('num_samples =', num_samples)
    print('permutation pvalue')
    print(np.round(res_permutation.pvalue, 4))
    print(f'{tested_method.__name__} pvalue')
    print(np.round(res_wilcoxon.pvalue, 4))
    print('abs(diff)')
    print(np.round(abs(res_permutation.pvalue - res_wilcoxon.pvalue), 4))
    print('\n')

compare_statistics_3(num_tests=8, num_samples=6)
compare_statistics_3(num_tests=8, num_samples=9)
compare_statistics_3(num_tests=8, num_samples=6, tested_method=wilcoxon_demsar)
compare_statistics_3(num_tests=8, num_samples=9, tested_method=wilcoxon_demsar)

Output:
num_samples = 6
permutation pvalue
[0.75   0.5625 0.0312 0.3125 0.6875 0.8438 0.5625 0.0938]
wilcoxon pvalue
[0.675  0.5625 0.0312 0.3125 0.6875 0.8438 0.5625 0.0938]
abs(diff)
[0.075 0.    0.    0.    0.    0.    0.    0.   ]

num_samples = 9
permutation pvalue
[0.4609 0.0195 0.3008 0.7344 0.1641 0.0977 0.5703 0.4258]
wilcoxon pvalue
[0.4069 0.0195 0.3008 0.7344 0.1641 0.0977 0.5703 0.4258]
abs(diff)
[0.054 0.    0.    0.    0.    0.    0.    0.   ]

num_samples = 6
permutation pvalue
[0.75   0.5625 0.0312 0.3125 0.6875 0.8438 0.5625 0.0938]
wilcoxon_twosided_zsplitdemsar pvalue
[0.8125 0.5625 0.0312 0.3125 0.6875 0.8438 0.5625 0.0938]
abs(diff)
[0.0625 0.     0.     0.     0.     0.     0.     0.    ]

num_samples = 9
permutation pvalue
[0.4609 0.0195 0.3008 0.7344 0.1641 0.0977 0.5703 0.4258]
wilcoxon_twosided_zsplitdemsar pvalue
[0.4609 0.0195 0.3008 0.7344 0.1641 0.0977 0.5703 0.4258]
abs(diff)
[0. 0. 0. 0. 0. 0. 0. 0.]

@mdhaber (Contributor) commented Dec 7, 2022

Can you explain why there is only ever a difference between the first element of res_permutation.pvalue and res_wilcoxon.pvalue - never any of the other elements? If all of the eight "tests" are generated at random, it would be quite a coincidence if the discrepancies only occur in the first of eight independent tests.

(answered my question: only the first "test" has a zero. That's why we see so many zeros in abs(diff).)

What I see is that wilcoxon_twosided_zsplitdemsar is not exact, so some approximation is required when there are zeros.

Maybe your original suggestion (#17530 (comment)) can be rephrased: when there are zeros or ties, it is still appropriate to use the same procedure as method='exact' to generate p-values, even if the resulting p-value is not truly exact.

I've written about this sort of thing before in gh-13690 (although some of what I wrote was wrong at the time). Someone suggested that we should also switch to the normal approximation when there are ties (not just zeros):

> For this test (and others), Hollander and Wolfe's "Nonparametric Statistical Methods" recommends: "If there are ties among the N observations, use average ranks to compute W, and carry on as in [the case without ties]" (unless the sample is large, in which case it suggests the asymptotic approximation with tie correction). So for small samples there is precedent to use the same null distribution (computed without considering ties) even in the presence of ties. I believe this is acceptable because ties tend to reduce the variance of the null distribution, so the returned p-values are conservative (more chance of Type II error, less chance of Type I error).
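To illustrate the variance point, here is a small sketch (my illustration, not from the book): enumerating all 2^n sign assignments gives the exact null distribution of R+, and a tie group handled with average ranks shrinks its variance.

from itertools import product

import numpy as np
from scipy.stats import rankdata

def exact_null_rplus(abs_d):
    """Exact null distribution of R+ for fixed |d| (average ranks for ties)."""
    r = rankdata(abs_d)
    return np.array([r[np.array(signs) > 0].sum()
                     for signs in product((-1, 1), repeat=len(r))])

print(exact_null_rplus([1, 2, 3, 4]).var())  # no ties: variance 7.5
print(exact_null_rplus([1, 2, 2, 4]).var())  # one tied pair: variance 7.375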

But a smaller error in the p-value in two tests doesn't immediately suggest to me that we should change which approximation we use. Sometime in the next year or so, I'll be working on adding permutation_test capabilities to most hypothesis tests. That will provide truly exact p-values for small samples (even with ties/zeros), and for large samples it can perform a randomized test that exactly controls the long-run Type I error rate.

If you perform larger-scale tests, I might be convinced, though. $n = 2$ is just not enough for me.

@jpcbertoldo (Author) commented:

Ha, nice to know. It would be great indeed to have the permutation method everywhere :D

Sure, I can run more tests!

Can you give me a suggestion of how large "large enough" should be? Or at least a good starting point?

@jpcbertoldo (Author) commented Dec 8, 2022

Sorry for changing the test function every time; I think it won't be needed anymore.


Summary: the absolute error relative to the permutation method seems to be lower using Demšar's method than with the current implementation of the Wilcoxon test with zsplit, for every alternative and considering several combinations of zeros and ties in samples of sizes 3, 6, and 9.


So I started testing scipy.stats.wilcoxon and wilcoxon_twosided_zsplitdemsar by looking at the absolute difference between the p-value yielded by each and the ideal p-value yielded by permutation_test.

For each method I tested the three alternatives with 10 seeds for each of these scenarios:

# (num_samples, num_zeros, num_ties), where num_zeros, num_ties are forced like in the previous tests
(3, 0, 0), (3, 1, 0), (3, 0, 1), (3, 1, 1),
(6, 0, 0), 
(6, 1, 0), (6, 0, 1), (6, 1, 1),
(6, 3, 0), (6, 0, 3), (6, 2, 2),
(9, 0, 0), 
(9, 1, 0), (9, 0, 1), (9, 1, 1),
(9, 4, 0), (9, 0, 4), (9, 3, 3),

You can check the results here: https://gist.github.com/jpcbertoldo/da22764eed915ee1f8a1d79e32aa72f6


Recall: wilcoxon_twosided_zsplitdemsar is a modified version of scipy.stats.wilcoxon:

  1. the options zero_method="zsplit" and correction=False are hardcoded;
  2. the "twosided" in the name is misleading: I first hardcoded that option, but alternative="greater" and alternative="less" are now enabled as well;
  3. there is no automatic mode switch;
  4. the policy for dealing with zero differences is the recommendation from [1].

[1] J. Demšar, “Statistical Comparisons of Classifiers over Multiple Data Sets,” Journal of Machine Learning Research, vol. 7, no. 1, pp. 1–30, 2006. URL: http://jmlr.org/papers/v7/demsar06a.html

The main change is that the switch from "exact" to "approx" does not happen in my function.

This snippet (taken from the body of my modified function, where d = x - y and num_zeros counts the zero differences):

    if num_zeros % 2 == 1:
        # discard one zero difference
        # the first [0] is for the tuple returned by np.where,
        # the second [0] is for the first element of that array
        where_idx = np.where(d == 0)[0][0]
        x = np.delete(x, where_idx)
        if y is not None:
            y = np.delete(y, where_idx)
        d = np.delete(d, where_idx)
        num_samples -= 1
        num_zeros -= 1

replaces this if:

    if n_zero > 0 and mode == "exact":


Test code

import numpy as np
import scipy.stats

# wilcoxon_twosided_zsplitdemsar is my local modified copy described above

def compare_statistics_4(tested_method, alternative, num_samples, num_zeros, num_ties, seed):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(num_samples,))

    # add zeros
    x[:num_zeros] = 0

    # add ties
    ties_start_idx = num_zeros
    ties_end_idx = ties_start_idx + num_ties
    x[ties_start_idx:ties_end_idx] = x[ties_start_idx]

    common_kwargs_inside_permutation = dict(zero_method='zsplit', method='exact')

    def get_statistic(x_):
        return scipy.stats.wilcoxon(x_, alternative='greater', **common_kwargs_inside_permutation).statistic

    common_kwargs_permutation = dict(alternative=alternative)
    kwds = {
        'vectorized': False, 'permutation_type': 'samples', 'random_state': rng, 'n_resamples': np.inf,
        **common_kwargs_permutation,
    }

    res_permutation = scipy.stats.permutation_test((x,), get_statistic, **kwds)
    res_wilcoxon = tested_method(x, **{**common_kwargs_inside_permutation, **common_kwargs_permutation})

    return {
        "tested_method": tested_method.__name__,
        "alternative": alternative,
        "num_samples": num_samples,
        "num_zeros": num_zeros,
        "num_ties": num_ties,
        "seed": seed,
        "pvalue_permutation": res_permutation.pvalue,
        "pvalue_tested": res_wilcoxon.pvalue,
    }


METHODS = [scipy.stats.wilcoxon, wilcoxon_twosided_zsplitdemsar]
ALTERNATIVES = ['two-sided', 'greater', 'less']
NUMS = [
    (3, 0, 0), (3, 1, 0), (3, 0, 1), (3, 1, 1),
    (6, 0, 0),
    (6, 1, 0), (6, 0, 1), (6, 1, 1),
    (6, 3, 0), (6, 0, 3), (6, 2, 2),
    (9, 0, 0),
    (9, 1, 0), (9, 0, 1), (9, 1, 1),
    (9, 4, 0), (9, 0, 4), (9, 3, 3),
]
SEEDS = list(range(10))

from itertools import product
ALL_COMBINATIONS = list(product(METHODS, ALTERNATIVES, NUMS, SEEDS))
print(f"Number of combinations: {len(ALL_COMBINATIONS)}")

from progressbar import progressbar
records = [None] * len(ALL_COMBINATIONS)
for idx, (tested_method, alternative, (num_samples, num_zeros, num_ties), seed) in progressbar(enumerate(ALL_COMBINATIONS), max_value=len(ALL_COMBINATIONS)):
    records[idx] = dict(
        idx=idx,
        **compare_statistics_4(tested_method, alternative, num_samples, num_zeros, num_ties, seed)
    )

import pandas as pd
df = pd.DataFrame.from_records(records).set_index("idx")
df.to_csv("wilcoxon.csv")

@jpcbertoldo (Author) commented:

I updated the notebook gist with more tests.

https://gist.github.com/jpcbertoldo/da22764eed915ee1f8a1d79e32aa72f6

I now included all possible combinations of num_zeros and num_ties for each num_samples (recall num_zeros + num_ties <= num_samples), increased the number of seeds to 30 (for each (num_samples, num_zeros, num_ties) tuple), plotted some statistics of the absolute error, and added more detailed tables.

The conclusion is still the same: the Demšar version does seem to have a lower absolute error relative to the permutation method.

Example plot (alternative='two-sided'): [figure: statistics of the absolute error]

The updated combinations:

from itertools import product

NUMS = [
    (3, num_zeros, num_ties)
    for num_zeros, num_ties in product(range(3), range(3))
    if num_zeros + num_ties <= 3
] + [
    (6, num_zeros, num_ties)
    for num_zeros, num_ties in product(range(6), range(6))
    if num_zeros + num_ties <= 6
] +  [
    (9, num_zeros, num_ties)
    for num_zeros, num_ties in product(range(9), range(9))
    if num_zeros + num_ties <= 9
]

SEEDS = list(range(30))

@mdhaber (Contributor) commented Dec 9, 2022

Ok. I haven't looked carefully yet at how the tests are set up, but this is the sort of thing I'm looking for.

In which direction is the difference in p-value - is the new p-value typically larger or smaller? We'd want larger, to prevent false positives. If they are larger, go ahead and submit a PR.

Do you see the same thing for the other zero methods? It would be nice if we could change them all.

@jpcbertoldo (Author) commented:

> Ok. I haven't looked carefully yet at how the tests are set up, but this is the sort of thing I'm looking for.
>
> In which direction is the difference in p-value - is the new p-value typically larger or smaller? We'd want larger, to prevent false positives. If they are larger, go ahead and submit a PR.
>
> Do you see the same thing for the other zero methods? It would be nice if we could change them all.

Hmm. Didn't look at that.
I'll check.

@jpcbertoldo (Author) commented Jan 4, 2023

https://gist.github.com/jpcbertoldo/da22764eed915ee1f8a1d79e32aa72f6

@mdhaber I updated the analysis with more tests, could you take a look?

A summary here: https://gist.github.com/jpcbertoldo/da22764eed915ee1f8a1d79e32aa72f6?permalink_comment_id=4423840#gistcomment-4423840

You can skip the procedure, it's the same I was doing before. I increased the number of seeds to 50, considered all alternative values, and considered all zero_method values.

> In which direction is the difference in p-value - is the new p-value typically larger or smaller?

It becomes larger than the permutation p-value more often than before (so 👍)

> Do you see the same thing for the other zero methods?

No, the errors get bigger.

Good to go for a PR? :)

@mdhaber (Contributor) commented Jan 5, 2023

Wow, you've done your homework. Ok! Go for it!

> No, the errors get bigger.

Do they get more conservative (larger p-values)?

@jpcbertoldo (Author) commented Jan 5, 2023

> > No, the errors get bigger.
>
> Do they get more conservative (larger p-values)?

Didn't check that.

> Wow, you've done your homework. Ok! Go for it!

Cool!

Just some thoughts before that:

  1. Some p-value errors are quite big on the right side of the distribution (around 10% up to 20%), so couldn't it actually be reasonable to have a hardcoded table for low numbers of samples?

  2. Or to actually do the permutation? (As long as the time to compute is comparable, of course.)

@mdhaber (Contributor) commented Jan 5, 2023

  1. That would grow quite quickly since it depends on the number and location of zeros, right?
  2. Yes, I suggested that earlier. A PR in which I already implemented that sort of approach is ENH: stats.anderson_ksamp: add permutation version of test #16996. Let's start with a small PR though, just making your originally proposed enhancement, and we'll take it from there.

Thanks!

@jpcbertoldo (Author) commented:

> 1. That would grow quite quickly since it depends on the number and location of zeros, right?

If I understand the Wilcoxon test properly, it depends (more generally) on the groups of ties you can have inside the abs(d) vector (d being the differences of performances) and on the sign to which each element of d is assigned.

When there are differences of value zero, it is a special case of that: all the zeros are tied together, and they always appear at the first positions of the sorted abs(d).

So I tried to figure out by brute force (hehe 😅) how many cases would have to be covered.

[figure: table of "config" counts per number of samples]

A "config" is a combination of the possibilities mentioned above.

Source: https://gist.github.com/jpcbertoldo/4c6f13294d7d7e89082e6d765c8ad7db

Is it too much to be (for instance) stored in a file?

> 1. Yes, I suggested that earlier. A PR in which I already implemented that sort of approach is ENH: stats.anderson_ksamp: add permutation version of test #16996. Let's start with a small PR though, just making your originally proposed engagement, and we'll take it from there.

Hm, I see.

Adding another one about the API: 3. is your suggestion to just use Demšar's strategy by default for zsplit, or to make it optional?

@mdhaber (Contributor) commented Jan 13, 2023

Note: "engagement" was supposed to be "enhancement". I edited my post above.

When there differences of value zero it is a special case of that: all the zeros are tied together and the they always show at the first positions of the sorted abs(d).

I thought that originally and then confused myself. I think you're right.

Is it all that much to be (for instance) stocked in a file?

Might be. Depends on the impact to the wheel size. I'd say something on the scale of 100kb would be noticed. See, e.g. #17726 (comment).

My hesitation would be for a different reason. I always try to think of scipy.stats as a whole, and I ask whether we'd want to do something similar for all other tests at this time. If not, then why does wilcoxon get this treatment when there is so much other work to be done? Stats maintainers are very active but still spread thin. Personally, I think it is more impactful - and simpler - to solve the problem in general with permutation_test than to rely on a table only for the smallest cases (when permutation_test will be fast anyway). I have code that is essentially ready for that, I am just waiting on your PR to avoid merge conflicts.

Adding another one about the API

I'm not sure I understand the question. I know that I don't think we need to add a zero_method in addition to 'zplit'. I think it's just a matter of what method is used when we notice that there is a zero.

You have shown that when zero_method='zsplit', using the exact p-value is more accurate yet still conservative compared to approx. So my understanding was that you planned to change the condition in

if n_zero > 0 and mode == "exact":

so that when zero_method='zsplit', it would no longer force method='approx'. I can agree with that.
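For concreteness, a minimal sketch of that rule as I read it (a hypothetical helper with made-up names, not scipy's actual internals):

def choose_mode(mode, n_zero, zero_method):
    """Sketch: zero differences force the normal approximation only
    for zero methods other than 'zsplit'."""
    if n_zero > 0 and mode == "exact" and zero_method != "zsplit":
        return "approx"
    return mode

assert choose_mode("exact", n_zero=2, zero_method="zsplit") == "exact"   # stays exact
assert choose_mode("exact", n_zero=2, zero_method="wilcox") == "approx"  # falls back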

When I asked "do they get more conservative?" (#17530 (comment)), I was hoping we could make a similar change for the other zero_method options. That would help keep the documentation reasonable, rather than having to include a complicated decision tree - depending on sample size, presence of zeros, and the value of method - to determine which method actually gets used.

Was that an answer to the right question?

@jpcbertoldo (Author) commented:

> > Is it too much to be (for instance) stored in a file?
>
> Might be. It depends on the impact on the wheel size. I'd say something on the scale of 100 kB would be noticed. See, e.g., #17726 (comment).

🤔 Ah yes, and I think that with that many cases, even num_samples=8 alone would already require something like 1000 kB.

> My hesitation would be for a different reason. I always try to think of scipy.stats as a whole [...] I am just waiting on your PR to avoid merge conflicts.

Ok, I see your point.

As a non-maintainer, my bias is to be more tolerant of such heterogeneities in the code, because I initially just wanted the (easy-to-use) scipy.stats.wilcoxon interface to give the best result possible, but I get that it makes the maintainers' lives harder hehe.

> Was that an answer to the right question?

Not sure, but I think I got the info I wanted haha: I will just change the behavior to use Demšar's strategy and avoid the exact-to-approx switch.

@mdhaber (Contributor) commented Jan 14, 2023

Sounds good. Let's start there!
