multipletests wasn't written with memory consumption in mind, and it wasn't optimized for speed either. Parts of it are still a mixture of research and production code.
A bigger change would be to work only with the part of the p-value array that is close to a decision threshold.
One problem with this is that step-up and step-down methods, and the fdr and two-stage fdr corrections, are all accessible through the same function. My guess is that step-up methods (?, those starting with the smallest p-values) would be easy to truncate, but not the step-down methods. Also, we might not be able to determine in advance where the truncation should be.
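For intuition, the BH step-up decision itself only needs the largest index where the sorted p-values cross the threshold line alpha * i / n, so corrected values far from the crossing never affect the reject/accept outcome. A hypothetical sketch (not the statsmodels implementation):

```python
import numpy as np

def bh_reject_only(pvals_sorted, alpha=0.05):
    # hypothetical: BH rejection decisions without building the full
    # corrected p-value array; only the largest crossing index matters
    n = len(pvals_sorted)
    thresh = alpha * np.arange(1, n + 1) / n
    below = np.nonzero(pvals_sorted <= thresh)[0]
    k = below[-1] + 1 if below.size else 0
    reject = np.zeros(n, dtype=bool)
    reject[:k] = True  # step-up: reject everything up to the last crossing
    return reject
```

This still touches the whole array once, but it shows why only the neighborhood of the crossing point carries information for the decision.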
Working in batches on the sorted p-values looks difficult because of the adjustment with minimum.accumulate or maximum.accumulate. (What happens in a later batch could/will influence the numbers in an earlier batch, which is a feature of step-up and step-down methods.)
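To make the directionality concrete, here is a minimal sketch of the two accumulate patterns (simplified holm- and BH-style adjustments, not the exact library code):

```python
import numpy as np

pvals_sorted = np.sort(np.random.rand(10))
n = len(pvals_sorted)

# step-down (holm-style): the cumulative maximum runs from the smallest
# p-value forward, so earlier values propagate into later ones
holm_adj = np.maximum.accumulate(pvals_sorted * np.arange(n, 0, -1))

# step-up (BH-style): the cumulative minimum runs from the largest
# p-value backward, so a small value late in the array can lower the
# corrected values of everything in an earlier batch
bh_adj = np.minimum.accumulate(
    (pvals_sorted * n / np.arange(1, n + 1))[::-1])[::-1]
```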
see #1392 for using out-of-memory computation instead of tweaking the algorithm to save a bit more memory.
related: can we do anything about memory fragmentation? numpy needs contiguous memory to create a new array.
a bit of timing:
most of the time is spent in argsort; for a 5300**2 1d-array, approx. 28 million p-values:
approximately 1-4 seconds if is_sorted=True
approximately 11-15 seconds if is_sorted=False (9 to 10 seconds for argsort)
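So when the same p-values are processed more than once, the argsort cost can be paid once up front. A sketch, assuming the is_sorted keyword and the method aliases used in the script below:

```python
import numpy as np
import statsmodels.stats.multitest as multi

pvals = np.random.rand(5300**2)

# pay the sort cost once and reuse the sorted array
pvals_sorted = np.sort(pvals)

for method in ['hs', 'h', 'fdr_i']:
    # with is_sorted=True the function skips its own argsort; the
    # results correspond to the sorted input order
    reject, pvals_corr = multi.multipletests(
        pvals_sorted, method=method, is_sorted=True)[:2]
```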
here is the test script that I use on the command line:
# -*- coding: utf-8 -*-
"""
Created on Sun Feb 16 09:52:36 2014

Author: Josef Perktold
"""

import numpy as np
import statsmodels.stats.multitest as multi

# exclude ['b', 's'] which are always calculated
all_methods = ['sh', 'hs', 'h', 'hommel', 'fdr_i', 'fdr_n',
               'fdr_tsbky', 'fdr_tsbh', 'fdr_gbs']
fail = ['hommel']

n = 5300
pvals = np.random.rand(n**2)

for method in all_methods:
    if method in fail:
        # skip known failures
        continue
    pm = multi.multipletests(pvals, method=method)
REF: try using less memory in multipletests 'holms' see #1394
I'm using 32-bit Python, numpy 1.6.1. I restarted my computer and have 3 GB of memory available that should be only slightly fragmented.
After a bit of cleanup for 'holm' in c86e08b, it returns results for 5300**2 p-values.
Memory usage goes up to about 1.2 GB.
I get segfaults (in umath.pyd) instead of a MemoryError when trying other methods, running the script on the command line.
fdrcorrection works after deleting one intermediate array, with around 1.3 GB of peak memory.
(Calling fdrcorrection through multipletests duplicates some work.)
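The kind of cleanup involved is roughly the following pattern (a hypothetical sketch, not the actual change): drop references to intermediates as soon as possible and use in-place operations so fewer full-size temporaries are alive at once. With 5300**2 float64 values, every full-size array is about 225 MB, so each avoided copy matters on a 32-bit build.

```python
import numpy as np

def fdr_bh_lowmem(pvals_sorted):
    # hypothetical low-memory BH adjustment for already sorted p-values
    n = len(pvals_sorted)
    ecdffactor = np.arange(1, n + 1, dtype=np.float64)
    ecdffactor /= n                        # in place, reuses the arange buffer
    corrected = pvals_sorted / ecdffactor  # one full-size temporary
    del ecdffactor                         # release the factor array early
    rev = corrected[::-1]                  # a view, no copy
    np.minimum.accumulate(rev, out=rev)    # cumulative minimum in place
    np.clip(corrected, 0, 1, out=corrected)  # clip without a new array
    return corrected
```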
REF: allow is_sorted in fdrcorrection see #1394
closing, reopen if necessary (still more improvements are possible)
largely fixed in PR #1396