multipletests: reducing memory consumption #1394

Closed
josef-pkt opened this Issue Feb 15, 2014 · 3 comments


@josef-pkt
Member

multipletests wasn't written with memory consumption in mind, nor was it optimized for speed. Parts of it are still a mixture of research and production code.

  • holding on to intermediate arrays that are no longer needed DONE PARTIALLY
  • we work with a copy of the original array, and the original array stays in memory
    • allow overwrite_input=True (numpy keyword) NOT DONE
  • arrays are kept in memory for later usage DONE PARTIALLY
    • example: sortrevind is calculated at the beginning but only needed at the end, save it to disk temporarily? (note: we need to keep either sortind or sortrevind around) DONE
  • wrapper code (sorting and putting results back) and the core calculations are in the same function -> allow is_sorted=True so the sort wrapping can be factored out (see the sketch after this list) DONE
  • optimizing calls to numpy, for example np.take should be more efficient than fancy indexing DONE PARTIALLY
  • some duplicate computation is left over because the code makes the reject/not-reject test decision in addition to the pure p-value correction
    (mostly because the source articles were about multiple testing, not p-value correction, and it was left in as a cross-check) NOT DONE
  • using fdrcorrection through multipletests duplicates the wrapping code (sort, reverse index) DONE
  • fdrcorrection_twostage needs to loop over fdrcorrection but does redundant work, because the wrapping (sorting p-values, reverse index) only needs to happen once DONE
  • ... ?
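
To make the wrapping items above concrete, here is a minimal sketch of the sort / reverse-index pattern; the helper name _apply_sorted and its core_func argument are hypothetical illustrations, not the actual statsmodels code:

import numpy as np

def _apply_sorted(pvals, core_func):
    # hypothetical wrapper: sort once, run the core correction on the
    # sorted values, then put the results back into the original order
    sortind = np.argsort(pvals)
    pvals_sorted = np.take(pvals, sortind)   # np.take instead of fancy indexing
    corrected = core_func(pvals_sorted)      # the core can assume is_sorted=True
    # argsort of the sort index gives the reverse index; only one of
    # sortind / sortrevind needs to stay in memory at a time
    sortrevind = sortind.argsort()
    del sortind
    return np.take(corrected, sortrevind)

With an is_sorted=True keyword, the core function can be called directly on pre-sorted data, and the wrapper cost (one argsort plus two take calls) is paid only once even when several corrections are chained.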

A bigger change would be to work only with the part of the p-value array that is close to a decision threshold.
One problem with this is that step-up methods, step-down methods, fdr, and two-stage fdr are all accessible through the same function. My guess is that step-up methods (?, those starting with the smallest p-values) would be easy to truncate, but step-down methods would not. Also, we might not be able to determine in advance where the truncation should be.

Working in batches on the sorted p-values looks difficult because of the adjustment with minimum.accumulate or maximum.accumulate: what happens in a later batch can influence the numbers in an earlier batch, which is by design in step-up and step-down methods.
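
To make the batching problem concrete, here is a minimal sketch of the step-up (Benjamini-Hochberg) adjustment on already sorted p-values, using the same accumulate pattern as fdrcorrection:

import numpy as np

pvals_sorted = np.sort(np.random.rand(10))
n = len(pvals_sorted)
# raw BH adjustment: p_(i) * n / i for the i-th smallest p-value
pvals_raw = pvals_sorted * n / np.arange(1, n + 1)
# step-up: enforce monotonicity with a cumulative minimum running from
# the largest p-value down
pvals_adj = np.minimum.accumulate(pvals_raw[::-1])[::-1]

Because the accumulate runs over the reversed array, the adjusted value at any position depends on every larger p-value after it, so no batch could be finalized before the last batch has been seen.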

see #1392 for using out-of-memory computation instead of tweaking the algorithm to save a bit more memory.
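
For reference, a hypothetical sketch of the out-of-core direction: keep the p-value array on disk via a numpy memmap and touch it in chunks (file name and chunk size are made up):

import numpy as np

n = 5300**2
pvals = np.memmap('pvals.dat', dtype=np.float64, mode='w+', shape=(n,))
chunk = 10**6
for start in range(0, n, chunk):
    stop = min(start + chunk, n)
    pvals[start:stop] = np.random.rand(stop - start)  # fill in chunks
pvals.flush()
# note: an argsort over the memmap still builds full-size index arrays
# in memory, so the sort itself would need a different strategy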

related: can we do anything about memory fragmentation? numpy needs a contiguous block of memory to create a new array.

a bit of timing:
most of the time is spent in argsort; for a 5300**2 1d-array, approx. 28 million p-values:
approximately 1-4 seconds if is_sorted=True
approximately 11-15 seconds if is_sorted=False (9 to 10 seconds for argsort)
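
A sketch of how such timings can be measured (numbers depend on hardware; method='h' (holm) is just one example, and is_sorted is the keyword added in the commits referenced below):

import time
import numpy as np
import statsmodels.stats.multitest as multi

pvals = np.random.rand(5300**2)   # needs a few GB of RAM at this size

t0 = time.time()
sortind = np.argsort(pvals)       # the dominant cost in the full call
print('argsort: %.1f s' % (time.time() - t0))

pvals_sorted = np.take(pvals, sortind)
t0 = time.time()
multi.multipletests(pvals_sorted, method='h', is_sorted=True)
print('holm, is_sorted=True: %.1f s' % (time.time() - t0))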


here is the test script that I use on the command line:

# -*- coding: utf-8 -*-
"""
Created on Sun Feb 16 09:52:36 2014

Author: Josef Perktold
"""

import gc
import numpy as np
import statsmodels.stats.multitest as multi

# exclude ['b', 's'] which are always calculated
all_methods = ['sh', 'hs', 'h', 'hommel', 'fdr_i', 'fdr_n',
               'fdr_tsbky', 'fdr_tsbh', 'fdr_gbs']

fail = ['hommel']

n = 5300
pvals = np.random.rand(n**2)

for method in all_methods:
    if method in fail:
        # skip known failures
        continue
    print(method)
    pm = multi.multipletests(pvals, method=method)

    print(len(pm[0]))
    del pm
    gc.collect()
josef-pkt added a commit to josef-pkt/statsmodels that referenced this issue on Feb 16, 2014:
REF: try using less memory in multipletests 'holms' see #1394 (c86e08b)
@josef-pkt
Member

I'm using 32-bit Python, numpy 1.6.1. I restarted my computer and have 3 GB of memory available, which should be only slightly fragmented.

After a bit of cleanup for 'holms' in c86e08b, it returns results for 5300**2 pvalues.
Memory usage goes up to about 1.2 GB.

Trying the other methods by running a script on the command line, I get segfaults (in umath.pyd) instead of a MemoryError.

@josef-pkt
Member

fdrcorrection works after deleting one intermediate array, with around 1.3 GB of peak memory.

(calling fdrcorrection through multipletests still does duplicate work)
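
A hypothetical fragment showing the kind of change involved, i.e. dropping references to intermediates as soon as possible; fdr_bh_lowmem is an illustrative name, not the statsmodels function, and it uses the same BH core as the sketch earlier in this issue:

import numpy as np

def fdr_bh_lowmem(pvals_sorted):
    # hypothetical low-memory variant of the BH core on sorted p-values
    n = len(pvals_sorted)
    pvals_raw = pvals_sorted * n / np.arange(1., n + 1)
    pvals_adj = np.minimum.accumulate(pvals_raw[::-1])[::-1]
    del pvals_raw                             # release the intermediate early
    np.clip(pvals_adj, 0, 1, out=pvals_adj)   # in-place, avoids another copy
    return pvals_adj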

josef-pkt added a commit to josef-pkt/statsmodels that referenced this issue on Feb 16, 2014:
REF: allow is_sorted in fdrcorrection see #1394 (9b415ee)
@josef-pkt
Member

closing, reopen if necessary (more improvements are still possible)

largely fixed in PR #1396

@josef-pkt josef-pkt closed this Feb 19, 2014
PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this issue on Sep 2, 2014:
REF: try using less memory in multipletests 'holms' see #1394 (d009d17)
PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this issue on Sep 2, 2014:
REF: allow is_sorted in fdrcorrection see #1394 (0fae278)