ENH: The exact p-value calculation in stats.cramervonmises_2samp can be improved #15685

Closed
fiveseven-lambda opened this issue Mar 2, 2022 · 31 comments · Fixed by #15709
Labels: Documentation (issues related to the SciPy documentation; also check https://github.com/scipy/scipy.org), enhancement (a new feature or improvement), scipy.stats

@fiveseven-lambda
Contributor

Is your feature request related to a problem? Please describe.

We have the stats.cramervonmises_2samp() function for the two-sample Cramér-von Mises test, but the exact p-value calculation takes a long time. The comment in the source code says:

The exact calculation will be very slow even for moderate sample sizes as the number of combinations increases rapidly with the size of the samples. If method=='auto', the exact approach is used if both samples contain less than 10 observations, otherwise the asymptotic distribution is used.

In fact, the following code runs in about 3 seconds on my computer. (The actual code uses the exact approach if both samples contain 10 or fewer observations, not fewer than 10 as stated above.)

import numpy as np
from scipy import stats
x = np.random.normal(0, 1, 10)
y = np.random.normal(0, 1, 10)
print(stats.cramervonmises_2samp(x, y))

Describe the solution you'd like.

The following paper seems to show a more efficient algorithm to calculate the exact p-value: Xiao, Y., Gordon, A., & Yakovlev, A. (2006). A C++ Program for the Cramér-Von Mises Two-Sample Test. Journal of Statistical Software, 17(8), 1–15. https://doi.org/10.18637/jss.v017.i08

Implementing this would significantly improve the calculation time of stats.cramervonmises_2samp().

Describe alternatives you've considered.

No response

Additional context (e.g. screenshots, GIFs)

No response

@fiveseven-lambda added the enhancement label on Mar 2, 2022
@tupui
Member

tupui commented Mar 2, 2022

Hi @fiveseven-lambda, thanks for reporting.

The exact calculation will be very slow even for moderate sample sizes as the number of combinations increases rapidly with the size of the samples. If method=='auto', the exact approach is used if both samples contain less than 10 observations, otherwise the asymptotic distribution is used.

Looking at the code itself, it should read less than or equal to 10. We should update the documentation here.

The following paper seems to show a more efficient algorithm to calculate the exact p-value: Xiao, Y., Gordon, A., & Yakovlev, A. (2006). A C++ Program for the Cramér-Von Mises Two-Sample Test. Journal of Statistical Software, 17(8), 1–15. doi.org/10.18637/jss.v017.i08

Implementing this would significantly improve the calculation time of stats.cramervonmises_2samp().

Are you proposing to do this? Before any work is done, my suggestion would be to time the current code. Maybe there are parts that can be moved to Cython/Pythran before thinking about another algorithm.

@tupui added the scipy.stats and Documentation labels on Mar 2, 2022
@fiveseven-lambda
Contributor Author

Thank you for your advice. I am now trying what you suggested; here is the current progress:

Timing the Whole Code

The function _pval_cvm_2samp_exact() is called inside cramervonmises_2samp() to calculate the exact p-value, and the complete definition of _pval_cvm_2samp_exact() is as follows:

import numpy as np
from itertools import combinations

# L673-L686 in stats/_hypotests.py
def _all_partitions(nx, ny):
    """
    Partition a set of indices into two fixed-length sets in all possible ways
    Partition a set of indices 0 ... nx + ny - 1 into two sets of length nx and
    ny in all possible ways (ignoring order of elements).
    """
    z = np.arange(nx+ny)
    for c in combinations(z, nx):
        x = np.array(c)
        mask = np.ones(nx+ny, bool)
        mask[x] = False
        y = z[mask]
        yield x, y

# L1265-L1288 in stats/_hypotests.py
def _pval_cvm_2samp_exact(s, nx, ny):
    """
    Compute the exact p-value of the Cramer-von Mises two-sample test
    for a given value s (float) of the test statistic by enumerating
    all possible combinations. nx and ny are the sizes of the samples.
    """
    rangex = np.arange(nx)
    rangey = np.arange(ny)

    us = []

    # x and y are all possible partitions of ranks from 0 to nx + ny - 1
    # into two sets of length nx and ny
    # Here, ranks are from 0 to nx + ny - 1 instead of 1 to nx + ny, but
    # this does not change the value of the statistic.
    for x, y in _all_partitions(nx, ny):
        # compute the statistic
        u = nx * np.sum((x - rangex)**2)
        u += ny * np.sum((y - rangey)**2)
        us.append(u)

    # compute the values of u and the frequencies
    u, cnt = np.unique(us, return_counts=True)
    return np.sum(cnt[u >= s]) / np.sum(cnt)

I measured the time it takes to evaluate _pval_cvm_2samp_exact(s, nx, ny) where nx = ny = 10 and s = 7000, 7500, 8000 with the following code:

import time

s = 7000 # edit here to specify the value of s

time_sum = 0
for _ in range(10):
    begin = time.perf_counter()
    _pval_cvm_2samp_exact(s, 10, 10)
    end = time.perf_counter()
    diff = end - begin
    print('%.3f' % diff)
    time_sum += diff
# the average of 10 trials
print('%.3f' % (time_sum / 10))

The result (times in seconds) was:

s 1 2 3 4 5 6 7 8 9 10 average
7000 1.949 1.930 1.904 1.882 1.877 1.886 1.885 1.923 1.956 1.947 1.914
7500 1.951 1.964 1.973 2.018 1.966 1.961 1.946 1.951 1.945 1.956 1.963
8000 2.009 1.987 2.008 1.983 1.955 1.974 1.983 1.983 1.969 1.976 1.983

It takes about 2 seconds for all cases, and this is what I want to improve.

Since the value of s seems to have little effect on the computation time, I only measured the case s = 8000 below.

Timing the Slow Part

Next, I measured the part

    for x, y in _all_partitions(nx, ny):
        u = nx * np.sum((x - rangex)**2)
        u += ny * np.sum((y - rangey)**2)
        us.append(u)

by rewriting the code as

# the definition of _all_partitions() omitted

# the sum of 10 trials
time_sum = 0

def _pval_cvm_2samp_exact(s, nx, ny):
    global time_sum

    rangex = np.arange(nx)
    rangey = np.arange(ny)
    us = []

    begin = time.perf_counter()

    for x, y in _all_partitions(nx, ny):
        u = nx * np.sum((x - rangex)**2)
        u += ny * np.sum((y - rangey)**2)
        us.append(u)

    end = time.perf_counter()
    diff = end - begin
    print('%.3f' % diff)
    time_sum += diff

    u, cnt = np.unique(us, return_counts=True)
    return np.sum(cnt[u >= s]) / np.sum(cnt)

for _ in range(10):
    _pval_cvm_2samp_exact(8000, 10, 10)

# the average of 10 trials
print('%.3f' % (time_sum / 10))

Then I got

s 1 2 3 4 5 6 7 8 9 10 average
8000 1.792 1.814 1.798 1.818 1.806 1.816 1.861 1.847 1.774 1.801 1.813

These four lines take 1.813 / 1.983 = 91.4% of the total computation time.

@tupui
Member

tupui commented Mar 2, 2022

Thank you for the details @fiveseven-lambda! Note that you would have an easier debugging session using tools like https://github.com/pyutils/line_profiler.
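
For reference, a minimal line_profiler workflow looks like this (a sketch only, assuming the package is installed; the file name is illustrative, and kernprof injects the profile decorator at run time):

# timing_demo.py -- minimal line_profiler sketch
# run with: kernprof -l -v timing_demo.py

@profile            # provided by kernprof; no import needed
def slow_sum(n):
    total = 0
    for i in range(n):
        total += i * i   # per-line hit counts and timings are reported here
    return total

if __name__ == '__main__':
    slow_sum(10_000_000)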

@fiveseven-lambda
Contributor Author

@tupui Thank you. I didn't know about such tools. I will use it from the next comment onward, though in this comment I keep the manual timing so that the conditions match those above.

Trying Cython

Now I try using Cython in the following steps:

  1. Copy the definitions of _all_partitions() and _pval_cvm_2samp_exact() into a file named _pval_cvm_2samp_exact.pyx.
  2. Create setup.py and write:
    from setuptools import setup
    from Cython.Build import cythonize
    
    setup(
        ext_modules = cythonize("_pval_cvm_2samp_exact.pyx")
    )
  3. Run python setup.py build_ext --inplace.
  4. Run
    from _pval_cvm_2samp_exact import _pval_cvm_2samp_exact
    
    import time
    
    time_sum = 0
    for _ in range(10):
        begin = time.perf_counter()
        _pval_cvm_2samp_exact(8000, 10, 10)
        end = time.perf_counter()
        diff = end - begin
        print('%.3f' % diff)
        time_sum += diff
    # the average of 10 trials
    print('%.3f' % (time_sum / 10))

Then, the result was:

s 1 2 3 4 5 6 7 8 9 10 average
8000 1.996 1.984 1.988 1.986 2.001 1.990 1.996 1.991 2.036 2.069 2.004

Simply compiling with Cython did not seem to improve it much.

@tupui
Member

tupui commented Mar 2, 2022

If you just copied the functions without edits, it will not do much, I am afraid. Going to Cython means that you have to rewrite the code using C-like syntax, so no use of NumPy functions. If you want to do this, you can try Pythran, which has some support for NumPy. We have some examples in stats using Pythran. See https://pythran.readthedocs.io/en/latest/SUPPORT.html; it seems like everything needed is there.
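
To illustrate what "C-like" means here, a Cython rewrite of the inner statistic computation could look like the following sketch (my illustration only, with a hypothetical function name): typed C variables and a plain loop instead of NumPy calls.

# cython: language_level=3
def squared_rank_sum(long[:] x):
    # computes nx * sum((x[k] - k)**2) with C-typed variables
    cdef Py_ssize_t k, nx = x.shape[0]
    cdef long d, u = 0
    for k in range(nx):
        d = x[k] - k
        u += d * d
    return nx * u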

@fiveseven-lambda
Contributor Author

@tupui Thank you. I tried Pythran, but failed:

Trying Pythran (failed)

I added Pythran export annotations before the function definitions:

# pythran export _all_partitions(int, int)
def _all_partitions(nx, ny):
# pythran export _pval_cvm_2samp_exact(int, int, int)
def _pval_cvm_2samp_exact(s, nx, ny):

I ran pythran _pval_cvm_2samp_exact.py and got an error:

WARNING: Compilation error, trying hard to find its origin...
WARNING: Nop, I'm going to flood you with C++ errors!
CRITICAL: Cover me Jack. Jack? Jaaaaack!!!!
E: error: Command "gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -march=x86-64 -mtune=generic -O3 -pipe -fno-plt -fexceptions -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security -fstack-clash-protection -fcf-protection -flto -ffat-lto-objects -march=x86-64 -mtune=generic -O3 -pipe -fno-plt -fexceptions -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security -fstack-clash-protection -fcf-protection -flto -march=x86-64 -mtune=generic -O3 -pipe -fno-plt -fexceptions -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security -fstack-clash-protection -fcf-protection -flto -fPIC -DENABLE_PYTHON_MODULE -D__PYTHRAN__=3 -DPYTHRAN_BLAS_BLAS -I/usr/lib/python3.10/site-packages/pythran -I/usr/lib/python3.10/site-packages/numpy/core/include -I/usr/local/include -I/usr/include/python3.10 -c /tmp/tmpe8_rnn4k.cpp -o /tmp/tmpbwpjnz5o/tmp/tmpe8_rnn4k.o -std=c++11 -fno-math-errno -fvisibility=hidden -fno-wrapv -Wno-unused-function -Wno-int-in-bool-context -Wno-unknown-warning-option" failed with exit status 1

@tupui
Member

tupui commented Mar 2, 2022

I suppose it's because of yield. I don't think Pythran supports this Python keyword, so you would need to avoid the generator and return a list or array instead.

@fiveseven-lambda
Contributor Author

Thank you @tupui. You were right:

Trying Pythran (successful)

I rewrote _pval_cvm_2samp_exact.py without using yield:

import numpy as np
from itertools import combinations

# pythran export _pval_cvm_2samp_exact(int, int, int)
def _pval_cvm_2samp_exact(s, nx, ny):
    rangex = np.arange(nx)
    rangey = np.arange(ny)

    us = []

    z = np.arange(nx+ny)
    for c in combinations(z, nx):
        x = np.array(c)
        mask = np.ones(nx+ny, bool)
        mask[x] = False
        y = z[mask]
        # compute the statistic
        u = nx * np.sum((x - rangex)**2)
        u += ny * np.sum((y - rangey)**2)
        us.append(u)

    u, cnt = np.unique(us, return_counts=True)
    return np.sum(cnt[u >= s]) / np.sum(cnt)

Then I ran pythran _pval_cvm_2samp_exact.py and got:

1 2 3 4 5 6 7 8 9 10 average
0.031 0.030 0.030 0.031 0.031 0.032 0.031 0.030 0.031 0.030 0.031

It works! Thank you.

Another Problem of the Current Code

As mentioned above, stats.cramervonmises_2samp(x, y) uses the asymptotic method if either len(x) or len(y) is greater than 10, which means that the returned p-value can differ from the true p-value when len(x) > 10 or len(y) > 10, especially when len(x) and len(y) are close to 10. In statistical testing, an asymptotic method should be used only when the sample size (i.e. the number of observations) is large enough that the error in the result is negligible.

But the following result shows that the current threshold of 10 is too small for stats.cramervonmises_2samp().

Running this code

from scipy import stats
x = [1, 2, 3, 6, 7, 8, 11, 12, 14, 15, 16]
y = [4, 5, 9, 10, 13, 17, 18, 19, 20, 21, 22]
# auto (asymptotic)
print(stats.cramervonmises_2samp(x, y))
# exact
print(stats.cramervonmises_2samp(x, y, method = 'exact'))

we have

CramerVonMisesResult(statistic=0.4690082644628095, pvalue=0.04778422596827481)
CramerVonMisesResult(statistic=0.4690082644628095, pvalue=0.05102405334603477)

The difference between the two results (about 0.003) is not negligible, because the p-value crosses 0.05, which is widely used as the significance level.

It seems the threshold of 10 has to be raised, but I have no idea what value suffices.

@tupui
Member

tupui commented Mar 2, 2022

Great, thanks for pushing this! It sounds like you have a case to make a PR (this function would fit in _hypotests_pythran.py).

As for the low threshold, maybe it was just chosen for performance reasons. We could do this in two steps: make the performance improvement and quick-fix the doc, then change the threshold. Or if you find a previous PR/issue on this code, maybe it explains why it was set to 10.

@fiveseven-lambda
Contributor Author

I found a conversation in this PR:

Also, there are more efficient methods for computing this, right? Prefer to save that for future work? This gives a short summary of some methods.

yes, the calculation is very similar to the KS test exact method and there are more efficient path counting methods comparable to the code implemented by @pvanmulbregt . I just wanted to add a simple way to compute the exact values for small samples.

OK.

It seems that a larger threshold or a more efficient method simply wasn't their goal at the time.

@tupui
Member

tupui commented Mar 2, 2022

OK, but then it seems like we can update the value as we want until performance becomes a concern again. During the review we will ping both Matt and Christoph.

@mdhaber
Contributor

mdhaber commented Mar 2, 2022

Yes, the current exact calculation is equivalent to permutation_test. It is not really worth moving that to Cython right now. At best, it might allow you to calculate the null distribution with one or two more observations, but the number of permutations grows too quickly as the sample size increases for a constant speedup to help much. The intent is to add efficient methods for calculating exact distributions for all test statistics where possible, but there are lots of other things to work on, and reviewer time is limited, so not all are done yet.

Slow speed is the only reason why the threshold was set so low. Check the literature for a recommendation about where to set the cutoff. If you can't find one, do some experiments to see when you think the error between the two is small enough.

I'd suggest starting with pure Python.
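
One way to run such an experiment (an illustrative sketch only; note that each exact call takes about 2 seconds at this sample size) is to compare the two methods over random draws:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 10
errors = []
for _ in range(20):  # each exact call takes ~2 s at n = 10
    x = rng.normal(size=n)
    y = rng.normal(size=n)
    p_exact = stats.cramervonmises_2samp(x, y, method='exact').pvalue
    p_asymp = stats.cramervonmises_2samp(x, y, method='asymptotic').pvalue
    errors.append(abs(p_exact - p_asymp))
print(max(errors))  # worst observed error at this sample size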

@tupui
Member

tupui commented Mar 2, 2022

@mdhaber I don't think I understand. Going to Pythran is making the code go from 2s to 30ms. Why do you propose to stick with pure Python here?

@mdhaber
Contributor

mdhaber commented Mar 2, 2022

I think increasing the number of observations per sample by 10 increases the number of computations by a factor of about a million. I don't think we can get very far with this brute force algorithm before Pythran gets slow.
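
A quick check of that growth rate (my own arithmetic, using math.comb):

import math

# number of partitions enumerated by the brute-force method
print(math.comb(20, 10))                       # nx = ny = 10:       184756
print(math.comb(40, 20))                       # nx = ny = 20: 137846528820
print(math.comb(40, 20) // math.comb(20, 10))  # ratio ~ 7.5e5, about a million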

@tupui
Member

tupui commented Mar 2, 2022

Ah yes, indeed... So you advise not taking this, in favour of another algorithm?

@tupui
Member

tupui commented Mar 2, 2022

@fiveseven-lambda, I am summarising a brief discussion @mdhaber and I had.

Matt makes a good point that the current algorithm is not scalable due to the combinatorial part (I skipped over this too quickly). Yes, migrating this part to Pythran would still add some speed, but at the cost of writing and reviewing it, and most importantly maintaining it, because it's no longer pure Python and opens the door to new potential issues.

What would be preferable instead would be to implement the new algorithm you proposed in pure Python (at least at first); I did not do a literature review, so I personally cannot say whether there would be better approaches. If you do that, first of all thank you! Second, please keep it as close to the paper as possible at first (no need to be creative with naming or to look for tricks). mannwhitneyu should be a good example of the desired end state.

@fiveseven-lambda
Contributor Author

I am reading the paper, and so far I have implemented algorithm 1 on page 6. Further improvements are described in the paper's later sections.

So the function _pval_cvm_2samp_exact_new() defined as

import math
import itertools

def _pval_cvm_2samp_exact_new(s, m, n):
    # [1] Y. Xiao, A. Gordon, and A. Yakovlev, “A C++ Program for the Cramér-Von Mises Two-Sample Test”, J. Stat. Soft., vol. 17, no. 8, pp. 1–15, Dec. 2006.
    # [2] T. W. Anderson "On the Distribution of the Two-Sample Cramer-von Mises Criterion," The Annals of Mathematical Statistics, Ann. Math. Statist. 33(3), 1148-1159, (September, 1962)

    # compute T (eq. 9 in [2])
    # same as the code in cramervonmises_2samp(), this can be omitted if cramervonmises_2samp() passes the value of t instead of u to _pval_cvm_2samp_exact()
    k, N = m * n, m + n
    t = s / (k * N) - (4 * k - 1)/(6 * N)

    # [1, p. 3]
    l = math.lcm(m, n)
    # [1, p. 4], below eq. 3
    a = l // m
    b = l // n
    # eq. 2 in [1]
    zeta = t * (m + n) ** 2 * l ** 2 / (m * n) - 1e-6
    # Each dictionary g[u][v] is the frequency table of $g_{u, v}^+$ defined in [1, p. 6]
    g = [[{} for _ in range(m + 1)] for _ in range(n + 1)]
    for u in range(n + 1):
        for v in range(m + 1):
            if u == 0:
                # eq. 13 in [1]
                g[u][v][a * a * v * (v + 1) * (2 * v + 1) // 6] = 1
                continue
            if v == 0:
                # eq. 12 in [1]
                g[u][v][b * b * u * (u + 1) * (2 * u + 1) // 6] = 1
                continue
            # eq. 11 in [1]
            d = a * v - b * u
            for (value, frequency) in itertools.chain(g[u - 1][v].items(), g[u][v - 1].items()):
                g[u][v].setdefault(value + d * d, 0)
                g[u][v][value + d * d] += frequency
    return sum(frequency for (value, frequency) in g[n][m].items() if value >= zeta) / math.comb(m + n, m)

is pretty fast:

1 2 3 4 5 6 7 8 9 10 average
0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001

But conversions between several statistics occur:

# from s (or u in cramervonmises_2samp()) to t
t = s / (k * N) - (4 * k - 1) / (6 * N)
# from t to zeta
zeta = t * (m + n) ** 2 * l ** 2 / (m * n) - 1e-6

and these cause floating-point error. The current code subtracts a small number (1e-6) from zeta to avoid a wrong result, but maybe the calculation can be done without floating-point numbers at all.

@tupui
Member

tupui commented Mar 3, 2022

Looks like a good start. Does this scale? The data structure g seems pretty huge from a quick test.

@fiveseven-lambda
Contributor Author

The computation time for larger m and n was as follows:

m n s time (seconds)
10 10 0 0.001
15 15 0 0.011
20 20 0 0.053
25 25 0 0.178
30 30 0 0.476

I also counted the total size of the dicts in g by adding a line:

print(sum(sum(len(table) for table in row) for row in g))

result:

m n total size of g
10 10 4428
15 15 34376
20 20 149200
25 25 464279
30 30 1170635

We only have to count the values greater than or equal to zeta, so I think the dicts can be replaced with plain lists.

@mdhaber
Contributor

mdhaber commented Mar 3, 2022

Looks like a good start @fiveseven-lambda. I haven't checked too carefully, but one thing to think about is whether you can vectorize at least one of those for loops.

@fiveseven-lambda
Contributor Author

Sorry I'm late. Here is the current progress:

Eliminating floating-point arithmetic

I took seriously the fact that a floating-point error occurs while converting from u to zeta, even though u and zeta are both integers. So I went back to the papers and derived an exact formula for the conversion: substituting t = u / (mn * N) - (4 * mn - 1) / (6 * N) (eq. 9 in [2]) into zeta = t * N ** 2 * l ** 2 / mn (eq. 2 in [1]) gives zeta = l ** 2 * N * (6 * u - mn * (4 * mn - 1)) // (6 * mn ** 2). The following program checks that it is correct:

import numpy as np
import math

for N in range(2, 15):
    for xy in range(1, (1 << N) - 1):
        # array of 1 (represents X) and 0 (represents Y) of length N
        xy = xy >> np.arange(N) & 1
        # number of X's
        m = np.count_nonzero(xy)
        # number of Y's
        n = N - m
        # lcm
        l = math.lcm(m, n)
        a = l // m
        b = l // n
        # empirical distribution functions
        f = np.insert(xy.cumsum(), 0, 0)
        g = np.arange(N + 1) - f
        # zeta, must be an integer
        zeta = sum((a * f - b * g) ** 2)
        # ranks of x in xy
        r = np.arange(N)[xy == 1]
        # ranks of y in xy
        s = np.arange(N)[xy == 0]
        # u, must be an integer
        u = m * sum((r - np.arange(m)) ** 2) + n * sum((s - np.arange(n)) ** 2)
        # conversion from u to zeta
        mn = m * n
        zeta2 = l ** 2 * N * (6 * u - mn * (4 * mn - 1)) / (6 * mn ** 2)
        if zeta != zeta2:
            print(m, n, xy, zeta, u, zeta2)

This shows that the value of zeta2, calculated as l ** 2 * N * (6 * u - mn * (4 * mn - 1)) / (6 * mn ** 2), equals zeta. Moreover, zeta must be an integer because it is defined as sum((a * f - b * g) ** 2), where a, b, f and g are all integers or integer arrays. Therefore, the division / (6 * mn ** 2) can be replaced by the exact integer division // (6 * mn ** 2).

Now we don't have to worry about floating-point errors.
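
For reference, the integer-only conversion can be packaged as a small standalone helper (a sketch of the formula above; the name u_to_zeta is mine):

import math

def u_to_zeta(u, m, n):
    # exact integer conversion from the statistic u to zeta;
    # the floor division is exact because zeta is an integer by construction
    l = math.lcm(m, n)
    mn = m * n
    return l ** 2 * (m + n) * (6 * u - mn * (4 * mn - 1)) // (6 * mn ** 2)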

Faster algorithm described in the paper

The paper (Xiao, Y., Gordon, A., & Yakovlev, A. 2006) seems to introduce a faster algorithm (algorithm 2) than the one I implemented two days ago (algorithm 1). But I cannot insist that we use algorithm 2, because it involves a convolution and doesn't seem much faster than algorithm 1 when m and n are not large (and the asymptotic method will do when they are large). So I think the best approach is to improve the current algorithm 1 code (for example, by vectorizing some loops, as @mdhaber suggested).

@fiveseven-lambda
Contributor Author

I have refactored the code. Just replacing _pval_cvm_2samp_exact() with this _pval_cvm_2samp_exact_new() allows us to raise the threshold to at least 20: _pval_cvm_2samp_exact_new(146600, 20, 20) was calculated in 0.005 seconds.

import math
import numpy as np

def _pval_cvm_2samp_exact_new(s, m, n):
    # [1] Y. Xiao, A. Gordon, and A. Yakovlev, “A C++ Program for the Cramér-Von Mises Two-Sample Test”, J. Stat. Soft., vol. 17, no. 8, pp. 1–15, Dec. 2006.
    # [2] T. W. Anderson "On the Distribution of the Two-Sample Cramer-von Mises Criterion," The Annals of Mathematical Statistics, Ann. Math. Statist. 33(3), 1148-1159, (September, 1962)

    # [1, p. 3]
    l = math.lcm(m, n)
    # [1, p. 4], below eq. 3
    a = l // m
    b = l // n
    # eq. 9 in [2] and eq. 2 in [1]
    mn = m * n
    zeta = l ** 2 * (m + n) * (6 * s - mn * (4 * mn - 1)) // (6 * mn ** 2)
    # each g[u][v] is the cumulative frequency table of $g_{u, v}^+$ defined in [1, p. 6]
    g = [[np.zeros(zeta, int) for _ in range(m + 1)] for _ in range(n + 1)]
    g[0][0] = np.ones(zeta, int)
    for u in range(n + 1):
        for v in range(m + 1):
            # calculate g[u][v] by g[u - 1][v] and g[u][v - 1] with eq. 11 in [1]
            d = a * v - b * u
            if d * d >= zeta:
                continue
            if u > 0:
                g[u][v][d * d:] += g[u - 1][v][:zeta - d * d]
            if v > 0:
                g[u][v][d * d:] += g[u][v - 1][:zeta - d * d]
    # the number of all combinations
    total = math.comb(m + n, m)
    # >= zeta is the complement of <= zeta - 1
    return (total - g[n][m][zeta - 1]) / total
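
As a quick sanity check (my own snippet, assuming both functions from this thread are defined), the two implementations can be printed side by side on inputs where the brute force is still feasible:

# compare the brute-force and the new exact p-values at nx = ny = 10
for s in (7000, 7500, 8000):
    print(s,
          _pval_cvm_2samp_exact(s, 10, 10),      # brute force, ~2 s per call
          _pval_cvm_2samp_exact_new(s, 10, 10))  # Xiao et al., milliseconds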

@fiveseven-lambda
Contributor Author

fiveseven-lambda commented Mar 7, 2022

There was a problem in the code I wrote two days ago. It uses a 3-dimensional array of size zeta * (m + 1) * (n + 1), but the value of zeta can be very large when lcm(m, n) is large (see Table 1). On the other hand, the previous version uses (m + 1) * (n + 1) dicts, and the size of each dict is much smaller than zeta, even when m and n are coprime (see Table 2).

Table 1: The maximum value of zeta for each m, n
m\n 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 1 5 14 30 55 91 140 204 285 385 506 650 819 1015 1240 1496 1785 2109 2470 2870
2 5 6 65 34 245 100 609 220 1221 410 2145 686 3445 1064 5185 1560 7429 2190 10241 2970
3 14 65 19 350 620 111 1505 2156 330 3965 5159 730 8216 10115 1365 14744 17510 2289 24035 27830
4 30 34 350 44 1230 490 2926 260 5694 1890 9790 776 15470 4746 22990 1720 32606 9570 44574 3220
5 55 245 620 1230 85 3355 4970 7020 9555 505 16280 20570 25545 31255 1510 45080 53295 62445 72580 3350
6 91 100 111 490 3355 146 7735 2716 1635 4840 24871 870 38779 11830 6335 16984 80155 2604 108775 31330
7 140 609 1505 2926 4970 7735 231 15820 21336 27965 35805 44954 55510 1379 81235 96600 113764 132825 153881 177030
8 204 220 2156 260 7020 2716 15820 344 29580 9660 49324 3860 76076 23100 110860 2056 154700 45084 208620 14980
9 285 1221 330 5694 9555 1635 21336 29580 489 51585 65670 9114 100815 122199 16260 173400 203541 2925 273714 314070
10 385 410 3965 1890 505 4840 27965 9660 51585 670 85085 26510 130065 39340 7525 55640 260865 75810 349885 4010
11 506 2145 5159 9790 16280 24871 35805 49324 65670 85085 891 134090 164164 198275 236665 279576 327250 379929 437855 501270
12 650 686 730 776 20570 870 44954 3860 9114 26510 134090 1156 203450 61334 32490 21560 403274 12990 538346 38480
13 819 3445 8216 15470 25545 38779 55510 76076 100815 130065 164164 203450 1469 298935 355810 419224 489515 567021 652080 745030
14 1015 1064 10115 4746 31255 11830 1379 23100 122199 39340 198275 61334 298935 1834 427315 125720 586551 169680 779779 222530
15 1240 5185 1365 22990 1510 6335 81235 110860 16260 7525 236665 32490 355810 427315 2255 596440 694960 89265 922165 42070
16 1496 1560 14744 1720 45080 16984 96600 2056 173400 55640 279576 21560 419224 125720 596440 2736 815320 235416 1079960 76920
17 1785 7429 17510 32606 53295 80155 113764 154700 203541 260865 327250 403274 489515 586551 694960 815320 3281 1094205 1253886 1427830
18 2109 2190 2289 9570 62445 2604 132825 45084 2925 75810 379929 12990 567021 169680 89265 235416 1094205 3894 1444665 410970
19 2470 10241 24035 44574 72580 108775 153881 208620 273714 349885 437855 538346 652080 779779 922165 1079960 1253886 1444665 4579 1879670
20 2870 2970 27830 3220 3350 31330 177030 14980 314070 4010 501270 38480 745030 222530 42070 76920 1427830 410970 1879670 5340
Table 2: The maximum size of dicts
m\n 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11
2 2 2 6 6 11 8 19 15 28 17 38 27 51 29 64 43 78 45 97 62
3 2 6 4 17 18 18 33 59 26 89 80 65 115 175 71 236 194 146 244 378
4 3 6 17 8 43 42 94 46 159 130 241 69 353 263 500 208 643 458 838 197
5 3 11 18 43 14 108 95 207 169 102 268 529 379 753 140 1021 703 1337 893 466
6 4 8 18 42 108 25 237 197 188 190 687 195 1010 397 584 936 1885 261 2432 1540
7 4 19 33 94 95 237 42 469 372 803 589 1247 845 332 1171 2397 1537 3141 1975 3979
8 5 15 59 46 207 197 469 63 867 640 1396 533 2063 1358 2869 521 3808 2368 4900 1611
9 5 28 26 159 169 188 372 867 92 1480 1043 885 1516 3235 773 4398 2738 773 3497 7249
10 6 17 89 130 102 190 803 640 1480 130 2378 1597 3479 1214 1140 3027 6442 2091 8250 1093
11 6 38 80 241 268 687 589 1396 1043 2378 177 3644 2362 5159 3263 6977 4275 9086 5451 11484
12 7 27 65 69 529 195 1247 533 885 1597 3644 233 5309 3362 2702 2374 9719 2086 12450 2001
13 7 51 115 353 379 1010 845 2063 1516 3479 2362 5309 305 7537 4698 10178 6147 13208 7831 16651
14 8 29 175 263 753 397 332 1358 3235 1214 5159 3362 7537 385 10407 6314 13722 4278 17537 10245
15 8 64 71 500 140 584 1171 2869 773 1140 3263 2702 4698 10407 478 14005 8401 6462 10678 4971
16 9 43 236 208 1021 936 2397 521 4398 3027 6977 2374 10178 6314 14005 591 18446 10897 23532 7052
17 9 78 194 643 703 1885 1537 3808 2738 6442 4275 9719 6147 13722 8401 18446 715 23895 13969 30000
18 10 45 146 458 1337 261 3141 2368 773 2091 9086 2086 13208 4278 6462 10897 23895 854 30407 17600
19 10 97 244 838 893 2432 1975 4900 3497 8250 5451 12450 7831 17537 10678 23532 13969 30407 1016 38185
20 11 62 378 197 466 1540 3979 1611 7249 1093 11484 2001 16651 10245 4971 7052 30000 17600 38185 1192

The previous one (using dicts) was better:

import math
from collections import Counter

def _pval_cvm_2samp_exact_new(s, m, n):
    # [1] Y. Xiao, A. Gordon, and A. Yakovlev, “A C++ Program for the Cramér-Von Mises Two-Sample Test”, J. Stat. Soft., vol. 17, no. 8, pp. 1–15, Dec. 2006.
    # [2] T. W. Anderson "On the Distribution of the Two-Sample Cramer-von Mises Criterion," The Annals of Mathematical Statistics, Ann. Math. Statist. 33(3), 1148-1159, (September, 1962)

    # [1, p. 3]
    l = math.lcm(m, n)
    # [1, p. 4], below eq. 3
    a = l // m
    b = l // n
    # eq. 9 in [2] and eq. 2 in [1]
    mn = m * n
    zeta = l ** 2 * (m + n) * (6 * s - mn * (4 * mn - 1)) // (6 * mn ** 2)
    # each g[u][v] is the cumulative frequency table of $g_{u, v}^+$ defined in [1, p. 6]
    g = [[Counter() for _ in range(m + 1)] for _ in range(n + 1)]
    g[0][0][0] = 1
    for u in range(n + 1):
        for v in range(m + 1):
            tmp = Counter()
            if u > 0:
                tmp += g[u - 1][v]
            if v > 0:
                tmp += g[u][v - 1]
            d = a * v - b * u
            for value, frequency in tmp.items():
                g[u][v][value + d * d] = frequency
    return sum(frequency for value, frequency in g[n][m].items() if value >= zeta) / math.comb(m + n, m)

But I'm sorry, I don't know how to vectorize the for loops in this code. Could you please suggest how to improve it?

@tupui
Member

tupui commented Mar 7, 2022

I am not an expert in vectorization. What you want to do is change the data structure: elements need to be in a NumPy array so you can use reductions along some axis.

@fiveseven-lambda
Contributor Author

fiveseven-lambda commented Mar 7, 2022

To use NumPy arrays, we need an equivalent of collections.Counter's + operation.
Given 2 arrays

[[10  5]
 [50  3]]

and

[[30  9]
 [50  4]
 [60  2]]

we want to merge these and sort:

[[10  5]
 [30  9]
 [50  7]
 [60  2]]

(Note that [50 3] and [50 4] are added into [50 7].)

Is there a fast way to do this?

@tupui
Copy link
Member

tupui commented Mar 7, 2022

I am not sure. Maybe a combination of np.unique, np.argwhere and summing for the reduction part.
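
A sketch of that idea (my illustration): concatenate the two tables, then use np.unique with return_inverse and np.add.at to accumulate the frequencies of duplicate values:

import numpy as np

a = np.array([[10, 5], [50, 3]])
b = np.array([[30, 9], [50, 4], [60, 2]])

merged = np.concatenate([a, b])
values, inverse = np.unique(merged[:, 0], return_inverse=True)
freqs = np.zeros(len(values), dtype=int)
np.add.at(freqs, inverse, merged[:, 1])  # sum frequencies of equal values
print(np.column_stack([values, freqs]))
# [[10  5]
#  [30  9]
#  [50  7]
#  [60  2]]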

@fiveseven-lambda
Contributor Author

I've come up with this one, using np.intersect1d:

import numpy as np
a = np.array([[10, 5], [50, 3]])
b = np.array([[30, 9], [50, 4], [60, 2]])
av, af = a.T
bv, bf = b.T
intersection_v, a_index, b_index = np.intersect1d(av, bv, return_indices = True)
intersection = np.array([intersection_v, af[a_index] + bf[b_index]]).T
result = np.concatenate([intersection, np.delete(a, a_index, axis = 0), np.delete(b, b_index, axis = 0)])
result = result[result[:, 0].argsort()]  # sort rows by value; result.sort(axis=0) would sort each column independently and scramble the (value, frequency) pairs
print(result)

Is there a better way? If not, I will use this.

@fiveseven-lambda
Contributor Author

So this is the code using NumPy arrays:

import math
import numpy as np

def _pval_cvm_2samp_exact_new(s, m, n):
    # [1] Y. Xiao, A. Gordon, and A. Yakovlev, “A C++ Program for the Cramér-Von Mises Two-Sample Test”, J. Stat. Soft., vol. 17, no. 8, pp. 1–15, Dec. 2006.
    # [2] T. W. Anderson "On the Distribution of the Two-Sample Cramer-von Mises Criterion," The Annals of Mathematical Statistics, Ann. Math. Statist. 33(3), 1148-1159, (September, 1962)

    # [1, p. 3]
    l = math.lcm(m, n)
    # [1, p. 4], below eq. 3
    a = l // m
    b = l // n
    # eq. 9 in [2] and eq. 2 in [1]
    mn = m * n
    zeta = l ** 2 * (m + n) * (6 * s - mn * (4 * mn - 1)) // (6 * mn ** 2)
    # each g[u][v] is the cumulative frequency table of $g_{u, v}^+$ defined in [1, p. 6]
    g = [[None for _ in range(m + 1)] for _ in range(n + 1)]
    for u in range(n + 1):
        for v in range(m + 1):
            # calculate g[u][v] by g[u - 1][v] and g[u][v - 1] with eq. 11 in [1]
            if u == 0:
                if v == 0:
                    g[u][v] = np.array([[0, 1]])
                else:
                    g[u][v] = g[u][v - 1].copy()
            else:
                if v == 0:
                    g[u][v] = g[u - 1][v].copy()
                else:
                    v0, f0 = g[u][v - 1].T
                    v1, f1 = g[u - 1][v].T
                    vi, i0, i1 = np.intersect1d(v0, v1, return_indices = True)
                    g[u][v] = np.concatenate([
                        np.array([vi, f0[i0] + f1[i1]]).T,
                        np.delete(g[u][v - 1], i0, axis = 0),
                        np.delete(g[u - 1][v], i1, axis = 0)
                    ])
            d = a * v - b * u
            g[u][v][:, 0] += d * d
    last_g = g[n][m]
    return np.sum(last_g[last_g[:, 0] >= zeta][:, 1]) / math.comb(m + n, m)

I think this is good enough to create a pull request. Is there anything to do other than the two items below?

  • Update the code of _pval_cvm_2samp_exact_new()
  • Update the code and the doc comment of cramervonmises_2samp() (about the threshold)

@tupui
Member

tupui commented Mar 7, 2022

Did the NumPy code help compared to the other version you had?

@fiveseven-lambda
Contributor Author

Yes. I compared the version with the cumulative frequency table, the version with collections.Counter, and the version with NumPy arrays.

Timer unit: 1e-06 s

Total time: 2.36472 s
File: a.py
Function: _pval_cvm_2samp_exact_cumulative_frequency_table at line 5

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     5                                           @profile
     6                                           def _pval_cvm_2samp_exact_cumulative_frequency_table(s, m, n):
     7                                               # [1] Y. Xiao, A. Gordon, and A. Yakovlev, “A C++ Program for the Cramér-Von Mises Two-Sample Test”, J. Stat. Soft., vol. 17, no. 8, pp. 1–15, Dec. 2006.
     8                                               # [2] T. W. Anderson "On the Distribution of the Two-Sample Cramer-von Mises Criterion," The Annals of Mathematical Statistics, Ann. Math. Statist. 33(3), 1148-1159, (September, 1962)
     9
    10                                               # [1, p. 3]
    11         1          2.0      2.0      0.0      l = math.lcm(m, n)
    12                                               # [1, p. 4], below eq. 3
    13         1          1.0      1.0      0.0      a = l // m
    14         1          1.0      1.0      0.0      b = l // n
    15                                               # eq. 9 in [2] and eq. 2 in [1]
    16         1          1.0      1.0      0.0      mn = m * n
    17         1          2.0      2.0      0.0      zeta = l ** 2 * (m + n) * (6 * s - mn * (4 * mn - 1)) // (6 * mn ** 2)
    18                                               # each g[u][v] is the cumulative frequency table of $g_{u, v}^+$ defined in [1, p. 6]
    19         1        826.0    826.0      0.0      g = [[np.zeros(zeta, int) for _ in range(m + 1)] for _ in range(n + 1)]
    20         1       1560.0   1560.0      0.1      g[0][0] = np.ones(zeta, int)
    21        22         14.0      0.6      0.0      for u in range(n + 1):
    22       441        366.0      0.8      0.0          for v in range(m + 1):
    23                                                       # calculate g[u][v] by g[u - 1][v] and g[u][v - 1] with eq. 11 in [1]
    24       420        319.0      0.8      0.0              d = a * v - b * u
    25       420        290.0      0.7      0.0              if d * d >= zeta:
    26                                                           continue
    27       420        170.0      0.4      0.0              if u > 0:
    28       400    1877266.0   4693.2     79.4                  g[u][v][d * d:] += g[u - 1][v][:zeta - d * d]
    29       420        354.0      0.8      0.0              if v > 0:
    30       399     483540.0   1211.9     20.4                  g[u][v][d * d:] += g[u][v - 1][:zeta - d * d]
    31                                               # the number of all combinations
    32         1          6.0      6.0      0.0      total = math.comb(m + n, m)
    33                                               # >= zeta is the complement of <= zeta - 1
    34         1          4.0      4.0      0.0      return (total - g[n][m][zeta - 1]) / total

Total time: 6.6444 s
File: a.py
Function: _pval_cvm_2samp_exact_counter at line 36

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    36                                           @profile
    37                                           def _pval_cvm_2samp_exact_counter(s, m, n):
    38                                               # [1] Y. Xiao, A. Gordon, and A. Yakovlev, “A C++ Program for the Cramér-Von Mises Two-Sample Test”, J. Stat. Soft., vol. 17, no. 8, pp. 1–15, Dec. 2006.
    39                                               # [2] T. W. Anderson "On the Distribution of the Two-Sample Cramer-von Mises Criterion," The Annals of Mathematical Statistics, Ann. Math. Statist. 33(3), 1148-1159, (September, 1962)
    40
    41                                               # [1, p. 3]
    42         1          4.0      4.0      0.0      l = math.lcm(m, n)
    43                                               # [1, p. 4], below eq. 3
    44         1          1.0      1.0      0.0      a = l // m
    45         1          0.0      0.0      0.0      b = l // n
    46                                               # eq. 9 in [2] and eq. 2 in [1]
    47         1          0.0      0.0      0.0      mn = m * n
    48         1          3.0      3.0      0.0      zeta = l ** 2 * (m + n) * (6 * s - mn * (4 * mn - 1)) // (6 * mn ** 2)
    49                                               # each g[u][v] is the frequency table of $g_{u, v}^+$ defined in [1, p. 6]
    50         1        474.0    474.0      0.0      g = [[Counter() for _ in range(m + 1)] for _ in range(n + 1)]
    51         1          3.0      3.0      0.0      g[0][0][0] = 1
    52        22         10.0      0.5      0.0      for u in range(n + 1):
    53       441        216.0      0.5      0.0          for v in range(m + 1):
    54       420       6938.0     16.5      0.1              tmp = Counter()
    55       420        223.0      0.5      0.0              if u > 0:
    56       400    1639626.0   4099.1     24.7                  tmp += g[u - 1][v]
    57       420        201.0      0.5      0.0              if v > 0:
    58       399    1334903.0   3345.6     20.1                  tmp += g[u][v - 1]
    59       420        286.0      0.7      0.0              d = a * v - b * u
    60   3708506    1492975.0      0.4     22.5              for value, frequency in tmp.items():
    61   3708086    2165124.0      0.6     32.6                  g[u][v][value + d * d] = frequency
    62         1       3415.0   3415.0      0.1      return sum(frequency for value, frequency in g[n][m].items() if value >= zeta) / math.comb(m + n, m)

Total time: 0.208749 s
File: a.py
Function: _pval_cvm_2samp_exact_numpy at line 64

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    64                                           @profile
    65                                           def _pval_cvm_2samp_exact_numpy(s, m, n):
    66                                               # [1] Y. Xiao, A. Gordon, and A. Yakovlev, “A C++ Program for the Cramér-Von Mises Two-Sample Test”, J. Stat. Soft., vol. 17, no. 8, pp. 1–15, Dec. 2006.
    67                                               # [2] T. W. Anderson "On the Distribution of the Two-Sample Cramer-von Mises Criterion," The Annals of Mathematical Statistics, Ann. Math. Statist. 33(3), 1148-1159, (September, 1962)
    68
    69                                               # [1, p. 3]
    70         1          3.0      3.0      0.0      l = math.lcm(m, n)
    71                                               # [1, p. 4], below eq. 3
    72         1          1.0      1.0      0.0      a = l // m
    73         1          1.0      1.0      0.0      b = l // n
    74                                               # eq. 9 in [2] and eq. 2 in [1]
    75         1          1.0      1.0      0.0      mn = m * n
    76         1          3.0      3.0      0.0      zeta = l ** 2 * (m + n) * (6 * s - mn * (4 * mn - 1)) // (6 * mn ** 2)
    77                                               # each g[u][v] is the frequency table of $g_{u, v}^+$ defined in [1, p. 6]
    78         1         46.0     46.0      0.0      g = [[None for _ in range(m + 1)] for _ in range(n + 1)]
    79        22          8.0      0.4      0.0      for u in range(n + 1):
    80       441        261.0      0.6      0.1          for v in range(m + 1):
    81                                                       # calculate g[u][v] by g[u - 1][v] and g[u][v - 1] with eq. 11 in [1]
    82       420        218.0      0.5      0.1              if u == 0:
    83        20          7.0      0.3      0.0                  if v == 0:
    84         1         11.0     11.0      0.0                      g[u][v] = np.array([[0, 1]])
    85                                                           else:
    86        19         23.0      1.2      0.0                      g[u][v] = g[u][v - 1].copy()
    87                                                       else:
    88       400        164.0      0.4      0.1                  if v == 0:
    89        20         36.0      1.8      0.0                      g[u][v] = g[u - 1][v].copy()
    90                                                           else:
    91       380        893.0      2.4      0.4                      v0, f0 = g[u][v - 1].T
    92       380        432.0      1.1      0.2                      v1, f1 = g[u - 1][v].T
    93       380     149054.0    392.2     71.4                      vi, i0, i1 = np.intersect1d(v0, v1, return_indices = True)
    94       760      11093.0     14.6      5.3                      g[u][v] = np.concatenate([
    95       380       8128.0     21.4      3.9                          np.array([vi, f0[i0] + f1[i1]]).T,
    96       380      17743.0     46.7      8.5                          np.delete(g[u][v - 1], i0, axis = 0),
    97       380      16579.0     43.6      7.9                          np.delete(g[u - 1][v], i1, axis = 0)
    98                                                               ])
    99       420        302.0      0.7      0.1              d = a * v - b * u
   100       420       3643.0      8.7      1.7              g[u][v][:, 0] += d * d
   101         1          1.0      1.0      0.0      last_g = g[n][m]
   102         1         98.0     98.0      0.0      return np.sum(last_g[last_g[:, 0] >= zeta][:, 1]) / math.comb(m + n, m)

@fiveseven-lambda
Contributor Author

I am very sorry for creating a pull request with a wrong file. My local environment setup had a problem, and I failed to detect the error. I am now trying to run the tests correctly on my computer, but it may take a while. Sorry for the trouble.

@mdhaber added this to the 1.10.0 milestone on Aug 2, 2022