multipletests: reducing memory consumption #1394

Closed
josef-pkt opened this issue Feb 15, 2014 · 3 comments

@josef-pkt commented Feb 15, 2014

multipletests wasn't written with memory consumption in mind, and it wasn't optimized for speed either. Parts of it are still a mixture of research and production code.

  • holding on to intermediate arrays that are not needed anymore DONE PARTIALLY
  • we work with a copy of the original array, and the original array stays in memory
    • allow overwrite_input=True (numpy keyword) NOT DONE
  • arrays are kept in memory for later use DONE PARTIALLY
    • example: sortrevind is calculated at the beginning but is only needed at the end; save it to disk temporarily? (note: we need to keep either sortind or sortrevind around) DONE
  • wrapper code (sorting and putting back) and the core calculations are in the same function -> allow is_sorted=True so we can factor out the sort wrapping (see the sketch after this list) DONE
  • optimizing calls to numpy, for example np.take should be more efficient than fancy indexing DONE PARTIALLY
  • some duplication is left over because we compute the test decision reject/not-reject instead of only the pure p-value correction
    (mostly because the articles were about multiple testing, not p-value correction, and it was left in as a cross-check) NOT DONE
  • using fdrcorrection through multipletests duplicates the wrapping code (sort, reverse index) DONE
  • fdrcorrection_twostage needs to loop over fdrcorrection but does redundant work because the wrapping (sort pvalues, reverse index) is only needed once DONE
  • ... ?
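
To make the sort-wrapping and np.take points above concrete, here is a minimal sketch (not the statsmodels code; correct_sorted is a hypothetical placeholder for any correction that assumes sorted input):

import numpy as np

def _wrap_sorted(pvals, correct_sorted):
    # sketch of the sort wrapping: sort pvals, apply a correction that
    # assumes sorted input, then put the result back in the original order
    pvals = np.asarray(pvals)
    sortind = np.argsort(pvals)
    pvals_sorted = np.take(pvals, sortind)   # np.take instead of fancy indexing
    corrected = correct_sorted(pvals_sorted)
    # sortrevind is only needed here at the end; it could be recomputed from
    # sortind (or stored temporarily) instead of being kept alive throughout
    sortrevind = np.argsort(sortind)
    return np.take(corrected, sortrevind)

# with is_sorted=True the caller does this wrapping once and the core
# correction can skip it, e.g.
# corrected = _wrap_sorted(pvals, lambda p: np.minimum(p * len(p), 1))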

A bigger change would be to work only with the part of the p-value array that is close to a decision threshold.
One problem with this is that step-up and step-down methods, fdr, and fdr two-stage are all accessible through the same function. My guess is that step-up methods (?, those starting with the smallest p-values) would be easy to truncate, but not the step-down methods. Also, we might not be able to determine in advance where the truncation should be.
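
As one illustration of why truncation can work for the reject/not-reject decision, here is a sketch (not the statsmodels implementation) of the Benjamini-Hochberg step-up decision on sorted p-values; every index that can be rejected has p_(i) <= alpha, so the decision could in principle be made from a truncated array (adjusted p-values are a different story because of the accumulate step below):

import numpy as np

def bh_reject_sketch(pvals_sorted, alpha=0.05):
    # Benjamini-Hochberg step-up decision: reject hypotheses 1..k where
    # k is the largest i with p_(i) <= i/m * alpha
    m = len(pvals_sorted)
    crit = alpha * np.arange(1, m + 1) / m
    below = np.nonzero(pvals_sorted <= crit)[0]
    reject = np.zeros(m, dtype=bool)
    if below.size > 0:
        reject[:below[-1] + 1] = True
    return reject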

Working in batches on the sorted p-values looks difficult because of the adjustment with minimum.accumulate or maximum.accumulate. (What happens in a later batch could/will influence the numbers in an earlier batch, which is a feature of step-up and step-down methods.)
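
A small numerical illustration of that point, assuming an fdr_bh-style adjustment (a running minimum taken from the large-p end):

import numpy as np

pvals_sorted = np.array([0.010, 0.020, 0.030, 0.032, 0.200])
m = len(pvals_sorted)
raw = pvals_sorted * m / np.arange(1, m + 1)
# running minimum from the right, i.e. starting at the largest p-value
adjusted = np.minimum.accumulate(raw[::-1])[::-1]
print(raw)       # [ 0.05  0.05  0.05  0.04  0.2 ]
print(adjusted)  # [ 0.04  0.04  0.04  0.04  0.2 ]
# the fourth p-value lowers the adjusted values of the first three, so a
# batch processed later would have to revise results already produced for
# an earlier batch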

see #1392 for using out-of-core computation (working on data that does not fit in memory) instead of tweaking the algorithm for a little more memory savings.

related: can we do anything about memory fragmentation? numpy needs contiguous memory to create a new array.

a bit of timing:
most of the time is spent in argsort; for a 5300**2 1d-array, approx. 28 million p-values:
approximately 1-4 seconds if is_sorted=True
approximately 11-15 seconds if is_sorted=False (9 to 10 seconds for argsort)
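
A sketch of how these numbers can be reproduced, using the is_sorted keyword discussed above (actual times depend on the machine and numpy version):

import time
import numpy as np
import statsmodels.stats.multitest as multi

pvals = np.random.rand(5300**2)

t0 = time.time()
pvals_sorted = np.sort(pvals)   # roughly the sorting cost, paid up front
t1 = time.time()
multi.multipletests(pvals_sorted, method='fdr_bh', is_sorted=True)
t2 = time.time()
multi.multipletests(pvals, method='fdr_bh', is_sorted=False)
t3 = time.time()
print('sort            %.1f s' % (t1 - t0))
print('is_sorted=True  %.1f s' % (t2 - t1))
print('is_sorted=False %.1f s' % (t3 - t2))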


here is the test script that I use on the command line

# -*- coding: utf-8 -*-
"""
Created on Sun Feb 16 09:52:36 2014

Author: Josef Perktold
"""

import gc
import numpy as np
import statsmodels.stats.multitest as multi

# exclude ['b', 's'] which are always calculated
all_methods = ['sh', 'hs', 'h', 'hommel', 'fdr_i', 'fdr_n',
               'fdr_tsbky', 'fdr_tsbh', 'fdr_gbs']

fail = ['hommel']

n = 5300
pvals = np.random.rand(n**2)

for method in all_methods:
    if method in fail:
        # skip known failures
        continue
    print(method)
    pm = multi.multipletests(pvals, method=method)

    print(len(pm[0]))
    # free the result before the next method to keep peak memory down
    del pm
    gc.collect()

josef-pkt added a commit to josef-pkt/statsmodels that referenced this issue Feb 16, 2014

@josef-pkt commented Feb 16, 2014

I'm using 32-bit Python, numpy 1.6.1. I restarted my computer and have 3 GB of memory available, which should be only slightly fragmented.

After a bit of cleanup for Holm's method in c86e08b, it returns results for 5300**2 pvalues.
Memory usage goes up to about 1.2 GB.

I get segfaults (in umath.pyd) instead of a MemoryError when trying other methods and running the script on the command line.

@josef-pkt commented Feb 16, 2014

fdrcorrection works after deleting one intermediate array, with around 1.3 GB of peak memory.

(calling fdrcorrection through multipletests does duplicate work)
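
For illustration, here is a sketch of the kind of change involved (not the actual statsmodels code): drop intermediates as soon as they are no longer needed so that peak memory stays lower.

import numpy as np

def fdr_bh_adjust_sketch(pvals_sorted):
    # BH adjustment on sorted p-values, releasing intermediates early
    m = len(pvals_sorted)
    ecdffactor = np.arange(1, m + 1) / float(m)
    raw = pvals_sorted / ecdffactor
    del ecdffactor                 # not needed any more, frees ~m floats
    adjusted = np.minimum.accumulate(raw[::-1])[::-1]
    del raw                        # accumulate made a new array, raw can go
    np.clip(adjusted, 0, 1, out=adjusted)   # in place, avoids another copy
    return adjusted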

@josef-pkt commented Feb 19, 2014

closing, reopen if necessary (still more improvements are possible)

largely fixed in PR #1396

josef-pkt closed this Feb 19, 2014

PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this issue Sep 2, 2014
