ENH: The exact p-value calculation in stats.cramervonmises_2samp can be improved #15685
Comments
Hi @fiveseven-lambda, thanks for reporting.
Looking at the code itself, it should read less than or equal to 10. We should update the documentation here.
Are you proposing to do this? Before any work is done, my suggestion would be to time the current code. Maybe there are parts that can be moved to Cython/Pythran before thinking about another algorithm. |
Thank you for your advice. I am now trying what you said, and the current progress is here:

Timing the Whole Code

The function _pval_cvm_2samp_exact() (together with its helper _all_partitions()) is currently defined as follows:

import numpy as np
from itertools import combinations
# L673-L686 in stats/_hypotests.py
def _all_partitions(nx, ny):
"""
Partition a set of indices into two fixed-length sets in all possible ways
Partition a set of indices 0 ... nx + ny - 1 into two sets of length nx and
ny in all possible ways (ignoring order of elements).
"""
z = np.arange(nx+ny)
for c in combinations(z, nx):
x = np.array(c)
mask = np.ones(nx+ny, bool)
mask[x] = False
y = z[mask]
yield x, y
# L1265-L1288 in stats/_hypotests.py
def _pval_cvm_2samp_exact(s, nx, ny):
"""
Compute the exact p-value of the Cramer-von Mises two-sample test
for a given value s (float) of the test statistic by enumerating
all possible combinations. nx and ny are the sizes of the samples.
"""
rangex = np.arange(nx)
rangey = np.arange(ny)
us = []
# x and y are all possible partitions of ranks from 0 to nx + ny - 1
# into two sets of length nx and ny
# Here, ranks are from 0 to nx + ny - 1 instead of 1 to nx + ny, but
# this does not change the value of the statistic.
for x, y in _all_partitions(nx, ny):
# compute the statistic
u = nx * np.sum((x - rangex)**2)
u += ny * np.sum((y - rangey)**2)
us.append(u)
# compute the values of u and the frequencies
u, cnt = np.unique(us, return_counts=True)
return np.sum(cnt[u >= s]) / np.sum(cnt)

I measured the time it takes to evaluate _pval_cvm_2samp_exact(s, 10, 10) with this code:

import time
s = 7000 # edit here to specify the value of s
time_sum = 0
for _ in range(10):
begin = time.perf_counter()
_pval_cvm_2samp_exact(s, 10, 10)
end = time.perf_counter()
diff = end - begin
print('%.3f' % diff)
time_sum += diff
# the average of 10 trials
print('%.3f' % (time_sum / 10))

The result was:
It takes about 2 seconds for all cases, and this is what I want to improve. Since the value of …

Timing the Slow Part

Next, I measured the part

for x, y in _all_partitions(nx, ny):
u = nx * np.sum((x - rangex)**2)
u += ny * np.sum((y - rangey)**2)
us.append(u)

by rewriting the code as:

# the definition of _all_partitions() omitted
# the sum of 10 trials
time_sum = 0
def _pval_cvm_2samp_exact(s, nx, ny):
global time_sum
rangex = np.arange(nx)
rangey = np.arange(ny)
us = []
begin = time.perf_counter()
for x, y in _all_partitions(nx, ny):
u = nx * np.sum((x - rangex)**2)
u += ny * np.sum((y - rangey)**2)
us.append(u)
end = time.perf_counter()
diff = end - begin
print('%.3f' % diff)
time_sum += diff
u, cnt = np.unique(us, return_counts=True)
return np.sum(cnt[u >= s]) / np.sum(cnt)
for _ in range(10):
_pval_cvm_2samp_exact(8000, 10, 10)
# the average of 10 trials
print('%.3f' % (time_sum / 10))

Then I got:
These 4 lines take 1.813 / 1.983 = 91.4% of the whole computation time. |
Thank you for the details @fiveseven-lambda! Note that you would have an easier debugging session using tools like https://github.com/pyutils/line_profiler. |
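As a rough sketch (the wrapper-based invocation below is an assumption about how one might drive line_profiler here, and the private import path matches the version of scipy/stats/_hypotests.py discussed in this thread):

```python
from line_profiler import LineProfiler

from scipy.stats._hypotests import _pval_cvm_2samp_exact

# wrap the private function so each line gets timed
profiler = LineProfiler()
profiled = profiler(_pval_cvm_2samp_exact)

profiled(7000, 10, 10)

# per-line hit counts and timings
profiler.print_stats()
```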
@tupui Thank you. I didn't know such tools. I will use it from the next comment on, though in this comment I use the same manual timing as above in order to keep the conditions comparable.

Trying Cython

I tried using Cython in the following steps:
Then, the result was:
It didn't seem to be much improved by using Cython. |
If you just copied the functions without edits, it will not do much, I am afraid. Going to Cython means that you have to re-write the code using C-like syntax, so no use of NumPy functions. If you want to do this, you can try Pythran, which has some support for NumPy. We have some examples in stats using Pythran. See https://pythran.readthedocs.io/en/latest/SUPPORT.html# (it seems like everything needed would be there). |
@tupui Thank you. I tried Pythran, but failed:

Trying Pythran (failed)

I added type hints before the function definitions as:

# pythran export _all_partitions(int, int)
def _all_partitions(nx, ny):

# pythran export _pval_cvm_2samp_exact(int, int, int)
def _pval_cvm_2samp_exact(s, nx, ny):

I ran
|
I suppose it's because of … |
Thank you @tupui. You were right:

Trying Pythran (successful)

I rewrote the two functions into a single one:

import numpy as np
from itertools import combinations
# pythran export _pval_cvm_2samp_exact(int, int, int)
def _pval_cvm_2samp_exact(s, nx, ny):
rangex = np.arange(nx)
rangey = np.arange(ny)
us = []
z = np.arange(nx+ny)
for c in combinations(z, nx):
x = np.array(c)
mask = np.ones(nx+ny, bool)
mask[x] = False
y = z[mask]
# compute the statistic
u = nx * np.sum((x - rangex)**2)
u += ny * np.sum((y - rangey)**2)
us.append(u)
u, cnt = np.unique(us, return_counts=True)
return np.sum(cnt[u >= s]) / np.sum(cnt)

Then
It works! Thank you.

Another Problem of the Current Code

As mentioned above, the exact p-value is currently computed only when both samples contain 10 or fewer observations. But the following result shows that the current threshold 10 is too small.

Running this code

from scipy import stats
x = [1, 2, 3, 6, 7, 8, 11, 12, 14, 15, 16]
y = [4, 5, 9, 10, 13, 17, 18, 19, 20, 21, 22]
# auto (asymptotic)
print(stats.cramervonmises_2samp(x, y))
# exact
print(stats.cramervonmises_2samp(x, y, method = 'exact'))

we have:
The difference between the two results (0.003) is not negligible, because it crosses the 0.05 line, which is widely used as the significance level. It seems the threshold value 10 has to be raised, but I have no idea what number suffices. |
Great, thanks for pushing this! It sounds like you have a case to make a PR (this function would fit in …). For the low value, maybe it was just done for performance reasons. We could do this in 2 steps: do the performance improvement and quickly fix the doc, then change the threshold. Or if you find previous PRs/issues on this code, maybe they explain the reason why it was set to 10. |
I found a conversation in this PR:
It seems that a larger threshold or a more efficient method just wasn't their goal at the time. |
Ok, but then it seems like we can update the value as we want until performance is a concern again. During the review we will ping both Matt and Christoph. |
Yes, the current exact calculation is equivalent to … Slow speed is the only reason why the threshold was set so low. Check the literature for a recommendation about where to set the cutoff. If you can't find one, do some experiments to see when you think the error between the two is small enough. I'd suggest starting with pure Python. |
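A small experiment along those lines might look like the sketch below (random normal samples are an arbitrary choice here, and the exact method quickly becomes slow as the sample size grows):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
for n in range(5, 11):
    x = rng.standard_normal(n)
    y = rng.standard_normal(n)
    # compare the exact and asymptotic p-values for the same data
    p_exact = stats.cramervonmises_2samp(x, y, method='exact').pvalue
    p_asymp = stats.cramervonmises_2samp(x, y, method='asymptotic').pvalue
    print(n, p_exact, p_asymp, abs(p_exact - p_asymp))
```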
@mdhaber I don't think I understand. Going to Pythran is making the code go from 2s to 30ms. Why do you propose to stick with pure Python here? |
I think increasing the number of observations per sample by 10 increases the number of computations by a factor of about a million. I don't think we can get very far with this brute force algorithm before Pythran gets slow. |
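The factor of about a million follows directly from the number of partitions, comb(m + n, m), that the brute-force approach enumerates:

```python
import math

print(math.comb(20, 10))                      # 184756 partitions for nx = ny = 10
print(math.comb(40, 20))                      # 137846528820 partitions for nx = ny = 20
print(math.comb(40, 20) // math.comb(20, 10)) # ~746000, roughly a million times more
```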
Ah yes indeed... So you advise not taking this, in favour of using another algorithm? |
@fiveseven-lambda, I am summarising a small talk @mdhaber and I had. Matt makes a good point that the current algorithm is not scalable due to the combinations part (I skipped over this too fast). Yes, migrating this part to Pythran would still add a small speed benefit, but at the cost of writing and reviewing it and, most importantly, maintaining it, because it's not pure Python anymore and it opens the door to new potential issues. What would be preferable instead is to implement the new algorithm you proposed in pure Python (at first at least) (I did not do a literature review, so I personally cannot say whether there would be better approaches at that point). If you do that, first of all thank you! Second, please keep it as close to the paper as possible at first (no need to be creative with naming nor to look for tricks at first). |
I am reading the paper, and I have implemented algorithm 1 on page 6 so far. Further improvements are described in the following sections of the paper. The new function

import math
import itertools
def _pval_cvm_2samp_exact_new(s, m, n):
# [1] Y. Xiao, A. Gordon, and A. Yakovlev, “A C++ Program for the Cramér-Von Mises Two-Sample Test”, J. Stat. Soft., vol. 17, no. 8, pp. 1–15, Dec. 2006.
# [2] T. W. Anderson "On the Distribution of the Two-Sample Cramer-von Mises Criterion," The Annals of Mathematical Statistics, Ann. Math. Statist. 33(3), 1148-1159, (September, 1962)
# compute T (eq. 9 in [2])
# same as the code in cramervonmises_2samp(), this can be omitted if cramervonmises_2samp() passes the value of t instead of u to _pval_cvm_2samp_exact()
k, N = m * n, m + n
t = s / (k * N) - (4 * k - 1)/(6 * N)
# [1, p. 3]
l = math.lcm(m, n)
# [1, p. 4], below eq. 3
a = l // m
b = l // n
# eq. 2 in [1]
zeta = t * (m + n) ** 2 * l ** 2 / (m * n) - 1e-6
# Each dictionary g[u][v] is the frequency table of $g_{u, v}^+$ defined in [1, p. 6]
g = [[{} for _ in range(m + 1)] for _ in range(n + 1)]
for u in range(n + 1):
for v in range(m + 1):
if u == 0:
# eq. 13 in [1]
g[u][v][a * a * v * (v + 1) * (2 * v + 1) // 6] = 1
continue
if v == 0:
# eq. 12 in [1]
g[u][v][b * b * u * (u + 1) * (2 * u + 1) // 6] = 1
continue
# eq. 11 in [1]
d = a * v - b * u
for (value, frequency) in itertools.chain(g[u - 1][v].items(), g[u][v - 1].items()):
g[u][v].setdefault(value + d * d, 0)
g[u][v][value + d * d] += frequency
return sum(frequency for (value, frequency) in g[n][m].items() if value >= zeta) / math.comb(m + n, m)

is pretty fast:
But the conversion between several statistics occurs:

# from s (or u in cramervonmises_2samp()) to t
t = s / (k * N) - (4 * k - 1) / (6 * N)
# from t to zeta
zeta = t * (m + n) ** 2 * l ** 2 / (m * n) - 1e-6

and this causes floating-point error. The code above subtracts a small number (1e-6) as a workaround. |
Looks like a good start. Does this scale? The data structure … |
The computation time for larger m and n:
I also counted the total size of the dicts in g:

print(sum(sum(len(table) for table in row) for row in g))

result:
We only have to count the total number of the values equal to or greater than zeta. |
Looks like a good start @fiveseven-lambda. I haven't checked too carefully, but one thing to think about is whether you can vectorize at least one of those loops. |
Sorry I'm late. The current progress:

Eliminating floating-point arithmetic

I took seriously that a floating-point error can occur while converting from u to zeta, and checked the conversion with the following code:

import numpy as np
import math
for N in range(2, 15):
for xy in range(1, (1 << N) - 1):
# array of 1 (represents X) and 0 (represents Y) of length N
xy = xy >> np.arange(N) & 1
# number of X's
m = np.count_nonzero(xy)
# number of Y's
n = N - m
# lcm
l = math.lcm(m, n)
a = l // m
b = l // n
# empirical distribution functions
f = np.insert(xy.cumsum(), 0, 0)
g = np.arange(N + 1) - f
# zeta, must be an integer
zeta = sum((a * f - b * g) ** 2)
# ranks of x in xy
r = np.arange(N)[xy == 1]
# ranks of y in xy
s = np.arange(N)[xy == 0]
# u, must be an integer
u = m * sum((r - np.arange(m)) ** 2) + n * sum((s - np.arange(n)) ** 2)
# conversion from u to zeta
mn = m * n
zeta2 = l ** 2 * N * (6 * u - mn * (4 * mn - 1)) / (6 * mn ** 2)
if zeta != zeta2:
print(m, n, xy, zeta, u, zeta2)

This shows that the value of zeta converted from u always matches the integer zeta computed directly, so we don't have to worry about the floating-point errors.

Faster algorithm described in the paper

The paper (Xiao, Y., Gordon, A., & Yakovlev, A. 2006) seems to introduce a faster algorithm (algorithm 2) than the one I implemented 2 days ago (algorithm 1). But I cannot insist that we should use algorithm 2, because it involves convolution and doesn't seem much faster than algorithm 1 when m and n are not so large (and the asymptotic method will do when they are large). So what I think is best is to improve the current algorithm 1 code (by vectorizing some loops, as @mdhaber said, for example). |
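For reference, a sketch of the conversion behind that integer formula, obtained by combining eq. 9 in [2] with eq. 2 in [1] and writing k = mn, N = m + n, l = lcm(m, n):

$$
T = \frac{U}{kN} - \frac{4k - 1}{6N},
\qquad
\zeta = \frac{N^2 l^2}{k} T = \frac{l^2 N \left(6U - k(4k - 1)\right)}{6 k^2},
$$

and the check above suggests that the right-hand side is always an integer, so it can be evaluated in exact integer arithmetic.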
I have refactored the code, replacing the dicts with cumulative frequency tables stored in NumPy arrays:

import math
import numpy as np
def _pval_cvm_2samp_exact_new(s, m, n):
# [1] Y. Xiao, A. Gordon, and A. Yakovlev, “A C++ Program for the Cramér-Von Mises Two-Sample Test”, J. Stat. Soft., vol. 17, no. 8, pp. 1–15, Dec. 2006.
# [2] T. W. Anderson "On the Distribution of the Two-Sample Cramer-von Mises Criterion," The Annals of Mathematical Statistics, Ann. Math. Statist. 33(3), 1148-1159, (September, 1962)
# [1, p. 3]
l = math.lcm(m, n)
# [1, p. 4], below eq. 3
a = l // m
b = l // n
# eq. 9 in [2] and eq. 2 in [1]
mn = m * n
zeta = l ** 2 * (m + n) * (6 * s - mn * (4 * mn - 1)) // (6 * mn ** 2)
# each g[u][v] is the cumulative frequency table of $g_{u, v}^+$ defined in [1, p. 6]
g = [[np.zeros(zeta, int) for _ in range(m + 1)] for _ in range(n + 1)]
g[0][0] = np.ones(zeta, int)
for u in range(n + 1):
for v in range(m + 1):
# calculate g[u][v] by g[u - 1][v] and g[u][v - 1] with eq. 11 in [1]
d = a * v - b * u
if d * d >= zeta:
continue
if u > 0:
g[u][v][d * d:] += g[u - 1][v][:zeta - d * d]
if v > 0:
g[u][v][d * d:] += g[u][v - 1][:zeta - d * d]
# the number of all combinations
total = math.comb(m + n, m)
# >= zeta is the complement of <= zeta - 1
return (total - g[n][m][zeta - 1]) / total |
There was a problem in the code I wrote 2 days ago. It uses a 3-dimensional array of size (n + 1) * (m + 1) * zeta, and zeta becomes very large as m and n grow.

Table 1: The maximum value of zeta for each m, n
Table 2: The maximum size of dicts
The previous one (using dicts) was better:

import math
from collections import Counter
def _pval_cvm_2samp_exact_new(s, m, n):
# [1] Y. Xiao, A. Gordon, and A. Yakovlev, “A C++ Program for the Cramér-Von Mises Two-Sample Test”, J. Stat. Soft., vol. 17, no. 8, pp. 1–15, Dec. 2006.
# [2] T. W. Anderson "On the Distribution of the Two-Sample Cramer-von Mises Criterion," The Annals of Mathematical Statistics, Ann. Math. Statist. 33(3), 1148-1159, (September, 1962)
# [1, p. 3]
l = math.lcm(m, n)
# [1, p. 4], below eq. 3
a = l // m
b = l // n
# eq. 9 in [2] and eq. 2 in [1]
mn = m * n
zeta = l ** 2 * (m + n) * (6 * s - mn * (4 * mn - 1)) // (6 * mn ** 2)
# each g[u][v] is the cumulative frequency table of $g_{u, v}^+$ defined in [1, p. 6]
g = [[Counter() for _ in range(m + 1)] for _ in range(n + 1)]
g[0][0][0] = 1
for u in range(n + 1):
for v in range(m + 1):
tmp = Counter()
if u > 0:
tmp += g[u - 1][v]
if v > 0:
tmp += g[u][v - 1]
d = a * v - b * u
for value, frequency in tmp.items():
g[u][v][value + d * d] = frequency
return sum(frequency for value, frequency in g[n][m].items() if value >= zeta) / math.comb(m + n, m)

But I'm sorry, I don't know how to vectorize this. |
I am not an expert in vectorization. What you want to do is change the data structure: elements need to be in a NumPy array so you can use a reduction on some axis. |
To use NumPy arrays, we need a way equivalent to adding two Counters. For example, given

[[10 5]
 [50 3]]

and

[[30 9]
 [50 4]
 [60 2]]

we want to merge these and sort:

[[10 5]
 [30 9]
 [50 7]
 [60 2]]

(Note that the frequencies of the common value 50 are added: 3 + 4 = 7.)

Is there any way to do this in a short time? |
I am not sure. Maybe a combination of … |
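One possible combination (a sketch, not necessarily the fastest option) is np.unique together with np.bincount, which merges two value/frequency tables in one pass:

```python
import numpy as np

a = np.array([[10, 5], [50, 3]])
b = np.array([[30, 9], [50, 4], [60, 2]])

values = np.concatenate([a[:, 0], b[:, 0]])
counts = np.concatenate([a[:, 1], b[:, 1]])

# map every value to its slot among the sorted unique values,
# then sum the frequencies that fall into the same slot
uniq, inverse = np.unique(values, return_inverse=True)
merged = np.column_stack([uniq, np.bincount(inverse, weights=counts).astype(int)])
print(merged)  # [[10 5], [30 9], [50 7], [60 2]]
```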
I've come up with this one, using np.intersect1d:

import numpy as np
a = np.array([[10, 5], [50, 3]])
b = np.array([[30, 9], [50, 4], [60, 2]])
av, af = a.T
bv, bf = b.T
intersection_v, a_index, b_index = np.intersect1d(av, bv, return_indices = True)
intersection = np.array([intersection_v, af[a_index] + bf[b_index]]).T
result = np.concatenate([intersection, np.delete(a, a_index, axis = 0), np.delete(b, b_index, axis = 0)])
# sort rows by value; sorting each column independently would break the (value, frequency) pairs
result = result[result[:, 0].argsort()]
print(result)

Does there seem to be a better way? If not, I will use this. |
So this is the code using NumPy arrays:

import math
import numpy as np
def _pval_cvm_2samp_exact_new(s, m, n):
# [1] Y. Xiao, A. Gordon, and A. Yakovlev, “A C++ Program for the Cramér-Von Mises Two-Sample Test”, J. Stat. Soft., vol. 17, no. 8, pp. 1–15, Dec. 2006.
# [2] T. W. Anderson "On the Distribution of the Two-Sample Cramer-von Mises Criterion," The Annals of Mathematical Statistics, Ann. Math. Statist. 33(3), 1148-1159, (September, 1962)
# [1, p. 3]
l = math.lcm(m, n)
# [1, p. 4], below eq. 3
a = l // m
b = l // n
# eq. 9 in [2] and eq. 2 in [1]
mn = m * n
zeta = l ** 2 * (m + n) * (6 * s - mn * (4 * mn - 1)) // (6 * mn ** 2)
# each g[u][v] is the cumulative frequency table of $g_{u, v}^+$ defined in [1, p. 6]
g = [[None for _ in range(m + 1)] for _ in range(n + 1)]
for u in range(n + 1):
for v in range(m + 1):
# calculate g[u][v] by g[u - 1][v] and g[u][v - 1] with eq. 11 in [1]
if u == 0:
if v == 0:
g[u][v] = np.array([[0, 1]])
else:
g[u][v] = g[u][v - 1].copy()
else:
if v == 0:
g[u][v] = g[u - 1][v].copy()
else:
v0, f0 = g[u][v - 1].T
v1, f1 = g[u - 1][v].T
vi, i0, i1 = np.intersect1d(v0, v1, return_indices = True)
g[u][v] = np.concatenate([
np.array([vi, f0[i0] + f1[i1]]).T,
np.delete(g[u][v - 1], i0, axis = 0),
np.delete(g[u - 1][v], i1, axis = 0)
])
d = a * v - b * u
g[u][v][:, 0] += d * d
last_g = g[n][m]
return np.sum(last_g[last_g[:, 0] >= zeta][:, 1]) / math.comb(m + n, m)

I think this is good enough to create a pull request. Is there anything to do other than the two below?
|
Did the NumPy code help compared to the other version you had? |
Yes. I compared the one with the cumulative frequency table, the one with Counter, and the one with NumPy arrays, using line_profiler:

Timer unit: 1e-06 s
Total time: 2.36472 s
File: a.py
Function: _pval_cvm_2samp_exact_cumulative_frequency_table at line 5
Line # Hits Time Per Hit % Time Line Contents
==============================================================
5 @profile
6 def _pval_cvm_2samp_exact_cumulative_frequency_table(s, m, n):
7 # [1] Y. Xiao, A. Gordon, and A. Yakovlev, “A C++ Program for the Cramér-Von Mises Two-Sample Test”, J. Stat. Soft., vol. 17, no. 8, pp. 1–15, Dec. 2006.
8 # [2] T. W. Anderson "On the Distribution of the Two-Sample Cramer-von Mises Criterion," The Annals of Mathematical Statistics, Ann. Math. Statist. 33(3), 1148-1159, (September, 1962)
9
10 # [1, p. 3]
11 1 2.0 2.0 0.0 l = math.lcm(m, n)
12 # [1, p. 4], below eq. 3
13 1 1.0 1.0 0.0 a = l // m
14 1 1.0 1.0 0.0 b = l // n
15 # eq. 9 in [2] and eq. 2 in [1]
16 1 1.0 1.0 0.0 mn = m * n
17 1 2.0 2.0 0.0 zeta = l ** 2 * (m + n) * (6 * s - mn * (4 * mn - 1)) // (6 * mn ** 2)
18 # each g[u][v] is the cumulative frequency table of $g_{u, v}^+$ defined in [1, p. 6]
19 1 826.0 826.0 0.0 g = [[np.zeros(zeta, int) for _ in range(m + 1)] for _ in range(n + 1)]
20 1 1560.0 1560.0 0.1 g[0][0] = np.ones(zeta, int)
21 22 14.0 0.6 0.0 for u in range(n + 1):
22 441 366.0 0.8 0.0 for v in range(m + 1):
23 # calculate g[u][v] by g[u - 1][v] and g[u][v - 1] with eq. 11 in [1]
24 420 319.0 0.8 0.0 d = a * v - b * u
25 420 290.0 0.7 0.0 if d * d >= zeta:
26 continue
27 420 170.0 0.4 0.0 if u > 0:
28 400 1877266.0 4693.2 79.4 g[u][v][d * d:] += g[u - 1][v][:zeta - d * d]
29 420 354.0 0.8 0.0 if v > 0:
30 399 483540.0 1211.9 20.4 g[u][v][d * d:] += g[u][v - 1][:zeta - d * d]
31 # the number of all combinations
32 1 6.0 6.0 0.0 total = math.comb(m + n, m)
33 # >= zeta is the complement of <= zeta - 1
34 1 4.0 4.0 0.0 return (total - g[n][m][zeta - 1]) / total
Total time: 6.6444 s
File: a.py
Function: _pval_cvm_2samp_exact_counter at line 36
Line # Hits Time Per Hit % Time Line Contents
==============================================================
36 @profile
37 def _pval_cvm_2samp_exact_counter(s, m, n):
38 # [1] Y. Xiao, A. Gordon, and A. Yakovlev, “A C++ Program for the Cramér-Von Mises Two-Sample Test”, J. Stat. Soft., vol. 17, no. 8, pp. 1–15, Dec. 2006.
39 # [2] T. W. Anderson "On the Distribution of the Two-Sample Cramer-von Mises Criterion," The Annals of Mathematical Statistics, Ann. Math. Statist. 33(3), 1148-1159, (September, 1962)
40
41 # [1, p. 3]
42 1 4.0 4.0 0.0 l = math.lcm(m, n)
43 # [1, p. 4], below eq. 3
44 1 1.0 1.0 0.0 a = l // m
45 1 0.0 0.0 0.0 b = l // n
46 # eq. 9 in [2] and eq. 2 in [1]
47 1 0.0 0.0 0.0 mn = m * n
48 1 3.0 3.0 0.0 zeta = l ** 2 * (m + n) * (6 * s - mn * (4 * mn - 1)) // (6 * mn ** 2)
49 # each g[u][v] is the frequency table of $g_{u, v}^+$ defined in [1, p. 6]
50 1 474.0 474.0 0.0 g = [[Counter() for _ in range(m + 1)] for _ in range(n + 1)]
51 1 3.0 3.0 0.0 g[0][0][0] = 1
52 22 10.0 0.5 0.0 for u in range(n + 1):
53 441 216.0 0.5 0.0 for v in range(m + 1):
54 420 6938.0 16.5 0.1 tmp = Counter()
55 420 223.0 0.5 0.0 if u > 0:
56 400 1639626.0 4099.1 24.7 tmp += g[u - 1][v]
57 420 201.0 0.5 0.0 if v > 0:
58 399 1334903.0 3345.6 20.1 tmp += g[u][v - 1]
59 420 286.0 0.7 0.0 d = a * v - b * u
60 3708506 1492975.0 0.4 22.5 for value, frequency in tmp.items():
61 3708086 2165124.0 0.6 32.6 g[u][v][value + d * d] = frequency
62 1 3415.0 3415.0 0.1 return sum(frequency for value, frequency in g[n][m].items() if value >= zeta) / math.comb(m + n, m)
Total time: 0.208749 s
File: a.py
Function: _pval_cvm_2samp_exact_numpy at line 64
Line # Hits Time Per Hit % Time Line Contents
==============================================================
64 @profile
65 def _pval_cvm_2samp_exact_numpy(s, m, n):
66 # [1] Y. Xiao, A. Gordon, and A. Yakovlev, “A C++ Program for the Cramér-Von Mises Two-Sample Test”, J. Stat. Soft., vol. 17, no. 8, pp. 1–15, Dec. 2006.
67 # [2] T. W. Anderson "On the Distribution of the Two-Sample Cramer-von Mises Criterion," The Annals of Mathematical Statistics, Ann. Math. Statist. 33(3), 1148-1159, (September, 1962)
68
69 # [1, p. 3]
70 1 3.0 3.0 0.0 l = math.lcm(m, n)
71 # [1, p. 4], below eq. 3
72 1 1.0 1.0 0.0 a = l // m
73 1 1.0 1.0 0.0 b = l // n
74 # eq. 9 in [2] and eq. 2 in [1]
75 1 1.0 1.0 0.0 mn = m * n
76 1 3.0 3.0 0.0 zeta = l ** 2 * (m + n) * (6 * s - mn * (4 * mn - 1)) // (6 * mn ** 2)
77 # each g[u][v] is the frequency table of $g_{u, v}^+$ defined in [1, p. 6]
78 1 46.0 46.0 0.0 g = [[None for _ in range(m + 1)] for _ in range(n + 1)]
79 22 8.0 0.4 0.0 for u in range(n + 1):
80 441 261.0 0.6 0.1 for v in range(m + 1):
81 # calculate g[u][v] by g[u - 1][v] and g[u][v - 1] with eq. 11 in [1]
82 420 218.0 0.5 0.1 if u == 0:
83 20 7.0 0.3 0.0 if v == 0:
84 1 11.0 11.0 0.0 g[u][v] = np.array([[0, 1]])
85 else:
86 19 23.0 1.2 0.0 g[u][v] = g[u][v - 1].copy()
87 else:
88 400 164.0 0.4 0.1 if v == 0:
89 20 36.0 1.8 0.0 g[u][v] = g[u - 1][v].copy()
90 else:
91 380 893.0 2.4 0.4 v0, f0 = g[u][v - 1].T
92 380 432.0 1.1 0.2 v1, f1 = g[u - 1][v].T
93 380 149054.0 392.2 71.4 vi, i0, i1 = np.intersect1d(v0, v1, return_indices = True)
94 760 11093.0 14.6 5.3 g[u][v] = np.concatenate([
95 380 8128.0 21.4 3.9 np.array([vi, f0[i0] + f1[i1]]).T,
96 380 17743.0 46.7 8.5 np.delete(g[u][v - 1], i0, axis = 0),
97 380 16579.0 43.6 7.9 np.delete(g[u - 1][v], i1, axis = 0)
98 ])
99 420 302.0 0.7 0.1 d = a * v - b * u
100 420 3643.0 8.7 1.7 g[u][v][:, 0] += d * d
101 1 1.0 1.0 0.0 last_g = g[n][m]
102 1 98.0 98.0 0.0 return np.sum(last_g[last_g[:, 0] >= zeta][:, 1]) / math.comb(m + n, m) |
I am very sorry for creating a pull request with a wrong file. My local environment setup had a problem, and I failed to detect the error. I am now retrying to run the tests correctly on my computer, but it may take a while. I am sorry to bother you. |
Is your feature request related to a problem? Please describe.
We have the stats.cramervonmises_2samp() function for the 2-sample Cramér-von Mises test, but the exact p-value calculation takes a long time. The comment in the source code says:

In fact, the following code runs in about 3 seconds on my computer (the actual code uses the exact approach if both samples contain 10 or fewer observations, not fewer than 10 as mentioned above).
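A minimal sketch of such a call, with two arbitrary samples of 10 observations each (the specific values here are for illustration only, not the original snippet):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(12345)
x = rng.standard_normal(10)
y = rng.standard_normal(10)

# both samples have at most 10 observations, so the exact method is used
print(stats.cramervonmises_2samp(x, y))
```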
Describe the solution you'd like.
The following paper seems to show a more efficient algorithm to calculate the exact p-value: Xiao, Y., Gordon, A., & Yakovlev, A. (2006). A C++ Program for the Cramér-Von Mises Two-Sample Test. Journal of Statistical Software, 17(8), 1–15. https://doi.org/10.18637/jss.v017.i08
Implementing this would significantly improve the calculation time of stats.cramervonmises_2samp().

Describe alternatives you've considered.
No response
Additional context (e.g. screenshots, GIFs)
No response