Memory scalability issues in univariate feature selection #1651

ogrisel opened this Issue Feb 3, 2013 · 8 comments



ogrisel commented Feb 3, 2013

Gensim has a re-implementation of SelectKBest that fixes memory scalability issues. Someone should review it and integrate the fix back into the original sklearn codebase.

Maybe @agramfort you can have a look?


Their implementation uses a for loop to avoid copying X.

Basically it avoids this line:

args = [X[safe_mask(X, y == k)] for k in np.unique(y)]

The right approach would be Cython, but since our implementation works for
sparse data it's more work ...
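A minimal sketch of that loop-based idea (the helper name is hypothetical, not the Gensim code): visit one class at a time and accumulate feature-wise sums, so only one masked block of X is alive at any moment instead of one copy per class held simultaneously by the list comprehension:

```python
import numpy as np

def per_class_sums(X, y):
    """Accumulate per-class feature sums and squared sums.

    Only one masked block X[y == k] is materialized per iteration,
    instead of a list holding one copy per class at the same time.
    """
    classes = np.unique(y)
    sums = np.zeros((len(classes), X.shape[1]))
    sq_sums = np.zeros_like(sums)
    counts = np.zeros(len(classes))
    for i, k in enumerate(classes):
        block = X[y == k]          # transient copy, freed each iteration
        counts[i] = block.shape[0]
        sums[i] = block.sum(axis=0)
        sq_sums[i] = (block ** 2).sum(axis=0)
    return classes, counts, sums, sq_sums
```

These summary statistics are all a one-way ANOVA needs per feature, so X never has to be sliced into a persistent per-class list.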


I have been looking at the sparse matrix code (scipy.sparse). For large matrices, it is my opinion that CSC is the best sparse format with respect to memory access at a low level. However, if memory is the primary concern, then all sparse formats should be supported.

I am guessing that "how low level to go" would depend on benchmarks. I have downloaded some of the larger matrices from The University of Florida Sparse Matrix Collection that I have estimated will fit in memory for a machine with 4GB of RAM. I have tried to stay away from the diagonal ones as it is my opinion that they are less representative of typical sparse matrices for machine learning. If someone wants to comment on which matrix I should be using instead, I could look at those.

I have briefly looked at SciPy with respect to computing the "element-wise" norms for a matrix, but according to the SciPy Dev Wiki there is little interest in this area. Also, computing column or row sums of powers of matrix entries will be less efficient than summing the entire matrix into one value (for a large matrix).
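For what it's worth, column-wise sums of powers can be computed on a sparse matrix without densifying it; a sketch using scipy.sparse (CSR assumed):

```python
import numpy as np
from scipy import sparse

X = sparse.random(1000, 50, density=0.01, format="csr", random_state=0)

# column-wise sum of squares via an element-wise product; stays sparse
col_sq = np.asarray(X.multiply(X).sum(axis=0)).ravel()

# same result straight from the CSR arrays: indices are column ids
col_sq_fast = np.bincount(X.indices, weights=X.data ** 2,
                          minlength=X.shape[1])
```

Both variants avoid a dense temporary; the second reads the CSR data array directly.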

My conclusion is that I need to benchmark multiple ways of computing:
args = [X[safe_mask(X, y == k)] for k in np.unique(y)]
for large sparse matrices. If anyone has good matrices to use, please let me know, otherwise I will use the ones I have downloaded from The University of Florida Sparse Matrix Collection.


I think it is enough to support CSR and convert all sparse input to CSR beforehand.
I don't think it would be that big a deal to rewrite. The problem is rather the current structure: I don't see a way to make a call to f_oneway if we avoid the copy. So we need to refactor the functions and have one specifically for the multi-class case, which can then be reused from f_oneway.
There are already some sparse matrix utilities in the utils.sparsefuncs module.
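One possible shape for that refactoring (a sketch; the function names are hypothetical, not sklearn API): compute the one-way F statistic from per-class summary statistics, which a CSR-specific accumulator can produce without any X[y == k] slices:

```python
import numpy as np

def f_oneway_from_stats(counts, sums, sq_sums):
    """One-way ANOVA F per feature from per-class counts/sums/sq-sums."""
    n = counts.sum()
    grand = sums.sum(axis=0)
    ss_between = (sums ** 2 / counts[:, None]).sum(axis=0) - grand ** 2 / n
    ss_total = sq_sums.sum(axis=0) - grand ** 2 / n
    ss_within = ss_total - ss_between
    df_between = len(counts) - 1
    df_within = n - len(counts)
    return (ss_between / df_between) / (ss_within / df_within)

def csr_class_stats(X, y):
    """Per-class counts, sums, and squared sums from a CSR matrix,
    scattering rows into accumulators instead of copying X per class."""
    classes, y_idx = np.unique(y, return_inverse=True)
    counts = np.bincount(y_idx).astype(float)
    sums = np.zeros((len(classes), X.shape[1]))
    sq_sums = np.zeros_like(sums)
    for row in range(X.shape[0]):
        lo, hi = X.indptr[row], X.indptr[row + 1]
        cols = X.indices[lo:hi]
        vals = X.data[lo:hi]
        np.add.at(sums[y_idx[row]], cols, vals)
        np.add.at(sq_sums[y_idx[row]], cols, vals ** 2)
    return counts, sums, sq_sums
```

The multi-class path would then be `f_oneway_from_stats(*csr_class_stats(X, y))`, and a dense accumulator could feed the same helper, which is the reuse the comment above is asking for.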


OK, I will have to look at which functions call f_oneway. I was assuming that f_oneway could be modified as a "black box", but that appears to be incorrect.


I was looking in the wrong function with respect to:
args = [X[safe_mask(X, y == k)] for k in np.unique(y)]

It would still be useful to benchmark the code to determine the tradeoffs (if any) of computation versus space for a large enough sample size that the copy is a problem.
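A quick way to make the space side of that comparison visible (a sketch; tracemalloc tracks NumPy's array allocations on any recent NumPy): compare peak memory of the list-of-copies pattern against the one-block-at-a-time loop:

```python
import numpy as np
import tracemalloc

rng = np.random.default_rng(0)
X = rng.standard_normal((20000, 200))   # ~32 MB dense
y = rng.integers(0, 10, size=20000)
classes = np.unique(y)

tracemalloc.start()
args = [X[y == k] for k in classes]     # every class copy alive at once
peak_copies = tracemalloc.get_traced_memory()[1]
del args
tracemalloc.stop()

tracemalloc.start()
sums = np.zeros((len(classes), X.shape[1]))
for i, k in enumerate(classes):
    sums[i] = X[y == k].sum(axis=0)     # one transient block at a time
peak_loop = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()

print(peak_copies, peak_loop)
```

The list holds roughly a full copy of X at its peak, while the loop's peak is about one class block, so the gap widens with the number of classes.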


I think it is not really a tradeoff; it is more a matter of refactoring the code. Either we special-case the multi-class version, or we implement it in such a way that it can be called by the univariate function.


Am I correct in thinking this issue pertains to f_classif rather than SelectKBest?


Who tagged this for the release? Was it me? I feel it is a tad non-trivial and I would like not to do this for the release. If someone else tagged this, please ping me.

@amueller amueller modified the milestone: 0.15.1, 0.15 Jul 18, 2014
@amueller amueller removed this from the 0.16 milestone Feb 25, 2015
@amueller amueller modified the milestone: 0.19 Sep 29, 2016
@amueller amueller removed this from the 0.19 milestone Oct 27, 2016