Memory scalability issues in univariate feature selection #1651

ogrisel · 2013-02-03T11:34:15Z

Gensim has a re-implementation of selectkbest to fix memory scalability issues. Someone should review them and it integrate the fix back into the original sklearn codebase.

https://github.com/piskvorky/gensim/blob/a73c84e21aecd3cc77ba2d752912f73b712bc60a/gensim/models/selectkbest.py

Maybe @agramfort you can have a look?

The text was updated successfully, but these errors were encountered:

agramfort · 2013-02-03T13:35:09Z

there implementation does a for loop to avoid a copy of X

basically it avoids this line:

args = [X[safe_mask(X, y == k)] for k in np.unique(y)]

the good approach would be cython but since our implementation works for
sparse data it's more work ...

jeffvanvoorst · 2013-03-22T13:12:53Z

I have been looking at the sparse matrix code (scipy.sparse). For large matrices, it is my opinion that CSC is the best sparse format with respect to memory access at a low level. However, if memory is the primary concern, then all sparse formats should be supported.

I am guessing that "how low level to go" would depend on benchmarks. I have downloaded some of the larger matrices from The University of Florida Sparse Matrix Collection that I have estimated will fit in memory for a machine with 4GB of RAM. I have tried to stay away from the diagonal ones as it is my opinion that they are less representative of typical sparse matrices for machine learning. If someone wants to comment on which matrix I should be using instead, I could look at those.

I have briefly looked at scipy with respect to computing the "element-wise" norms for a matrix (http://en.wikipedia.org/wiki/Matrix_norm), but according to the SciPy Dev Wiki (http://projects.scipy.org/scipy/ticket/18) there is little interest in this area. Also, column or row sum of powers of matrix entries will be less efficient than summing the entire matrix into one sum (for a large matrix).

My conclusion is I need to benchmark multiple ways of computing:
args = [X[safe_mask(X, y == k)] for k in np.unique(y)]
for large sparse matrices. If anyone has good matrices to use, please let me know, otherwise I will use the ones I have downloaded from The University of Florida Sparse Matrix Collection.

amueller · 2013-03-22T13:25:50Z

I think it is enough to support csr and convert all sparse input to csr before.
I don't think it would be that big a deal to rewrite. I think the problem is rather the current structure. I don't see a way to make a call to f_one_way if we avoid the copy. So we need to refactor the functions and have one specifically for the multi-class case, which can then be reused from f_one_way.
There are already some sparse matrix utilities in the utils.sparsefuncs module.

jeffvanvoorst · 2013-03-22T18:08:35Z

Ok, I will have to look at which functions call f_one_way. I was assuming that f_one_way could be modified as a "black box", but that appears to be incorrect.

jeffvanvoorst · 2013-03-23T22:12:57Z

I was looking in the wrong function with respect to:
args = [X[safe_mask(X, y == k)] for k in np.unique(y)]

It would still be useful to benchmark the code to determine the tradeoffs (if any) of computation versus space for a large enough sample size that the copy is a problem.

amueller · 2013-03-31T16:10:25Z

I think it is not really a tradeoff. It is a more a matter of refactoring the code. Either we special case the multivariate version or we implement it in such a way that it can be called by the univariate function.

jnothman · 2013-06-24T09:19:02Z

Am I correct in thinking this issue pertains to f_classif rather than SelectKBest?

amueller · 2013-07-26T09:14:23Z

who tagged that for the release? was it me? I feel it is a tad non-trival and I would like to not do this for the release. If someone else tagged this, please ping me.

amueller modified the milestones: 0.15.1, 0.15 Jul 18, 2014

amueller removed this from the 0.16 milestone Feb 25, 2015

amueller modified the milestone: 0.19 Sep 29, 2016

amueller removed this from the 0.19 milestone Oct 27, 2016

amueller added Enhancement Need Contributor labels Oct 27, 2016

lesteve added help wanted and removed Need Contributor labels Oct 18, 2017

cmarmo added module:feature_selection and removed Bug labels Dec 3, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Memory scalability issues in univariate feature selection #1651

Memory scalability issues in univariate feature selection #1651

ogrisel commented Feb 3, 2013

agramfort commented Feb 3, 2013

jeffvanvoorst commented Mar 22, 2013

amueller commented Mar 22, 2013

jeffvanvoorst commented Mar 22, 2013

jeffvanvoorst commented Mar 23, 2013

amueller commented Mar 31, 2013

jnothman commented Jun 24, 2013

amueller commented Jul 26, 2013

Memory scalability issues in univariate feature selection #1651

Memory scalability issues in univariate feature selection #1651

Comments

ogrisel commented Feb 3, 2013

agramfort commented Feb 3, 2013

jeffvanvoorst commented Mar 22, 2013

amueller commented Mar 22, 2013

jeffvanvoorst commented Mar 22, 2013

jeffvanvoorst commented Mar 23, 2013

amueller commented Mar 31, 2013

jnothman commented Jun 24, 2013

amueller commented Jul 26, 2013