Skip to content

ENH: _mannwhitneyu.py computation of exact MWU statistics may be simplified #17559

@rkompass

Description

@rkompass

Describe your issue.

This is not really a bug but an invitation for improvement and simplification. Imho the _MWU class in _mannwhitneyu.py may be abolished.
In pmf_recursive() or pmf_iterative() a 3-dimensional array is used, which stores fmnk s for all combinations of m, n, k i.e. stores k-distributions for all values of m and n.
This is not necessary as formula (5) of DiBucchianico:Combinatorics_computer_algebra_and_wilcoxon_mann_whitney_test
allows for recursive updating from all f(m)nk to f(m+1)nk by adding a k-shifted f(m)(n-1)k to f(m)nk.
An example: f2,0,k = (1,0,..) f2,1,k = (1,1,1,0,..) f2,2,k=(1,1,2,1,1,0,..),..
Now f3,1,k=(1,1,1,1,0,..) = f2,1,k + rightshift(f3,0,k|3) and f3,2,k=(1,1,2,2,2,1,1,0,..) = f2,2,k + rightshift( f3,1,k |3).
This is faster and requires only a 2-dimensional array. This way fmnk may be computed for higher m and n values.
Code for this is below, free for your adaptation.

Reproducing Code Example

import numpy as np
from scipy import special

# Computes f_mnk distribution needed for exact Mann-Whitney U statistics using formula (5) of
# DiBucchianico (1996): "Combinatorics, computer algebra and Wilcoxon-Mann-Whitney test".
#
# We use unsigned quads with max value of 2**64-1 = 18446744073709551615
#    as long as a safety test to check if math_comb(m+n,n) > 2**64-1 is fulfilled
#
def fmnk(m, n, k=None):  # if k is None, return full dist: (fmn0, fmn1,.., fmn(m*n)), else only (fmn0, fmn1,.., fmnk)
    dtyp = np.longdouble if special.binom(n+m, n) > 2**64 else np.uint64  # overflow protection: sum of all densities fits -> every entry fits
    n, m = sorted((m, n))  # make n the smaller of both (n requires memory), we iterate through m
    if k is None:          # v-- original: l = m*n+1   # return array with full fmn distribution for k  = 0,..,m*n
        l =  (m*n+1)//2 if m*n % 2 else (m*n)//2 + 1   # we compute only half of all distributions (including center), for symmetry reasons
    else:
        l = k+1            # compute and return only k-densities for k = 0,..,k
    fm = np.zeros((n+1, l), dtype=dtyp)
    fm[:,0] = 1
    for j in range(1, m+1):       # update all fjik until fmik in place: j=1,..., m
        for i in range(1, n+1):   # fj0k remains (1,0,..0); fjik will be updated for i >= 1
            fm[i,j:]+=fm[i-1,:-j] # add right shifted (by m value which is j) previous (n-1-th) k-distributrion
    if k is not None:             # this is formula (5) of DiBucchianico:Combinatorics_computer_algebra_and_wilcoxon_mann_whitney_test
        return fm[n,:]     # return f_mn which is k-density for k = 0,..,k
    r = np.zeros((m*n+1), dtype=dtyp)  # k was None and we had only computed half of the k distribution
    r[:l] = fm[n,:l]
    print(f'')
    r[l:] = fm[n,l-1::-1] if m*n % 2 else fm[n,l-2::-1]  # so we return a new array and fill it symmetrically                 
    return r          # return full f_mn which is k-density for k = 0,..,m*n

print(fmnk(3,3))

Error message

[1 1 2 3 3 3 3 2 1 1]

SciPy/NumPy/Python version information

1.9.3 1.23.4 sys.version_info(major=3, minor=10, micro=8, releaselevel='final', serial=0)

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions