Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory scalability issues in univariate feature selection #1651

Open
ogrisel opened this issue Feb 3, 2013 · 8 comments
Open

Memory scalability issues in univariate feature selection #1651

ogrisel opened this issue Feb 3, 2013 · 8 comments

Comments

@ogrisel
Copy link
Member

ogrisel commented Feb 3, 2013

Gensim has a re-implementation of selectkbest to fix memory scalability issues. Someone should review them and it integrate the fix back into the original sklearn codebase.

https://github.com/piskvorky/gensim/blob/a73c84e21aecd3cc77ba2d752912f73b712bc60a/gensim/models/selectkbest.py

Maybe @agramfort you can have a look?

@agramfort
Copy link
Member

there implementation does a for loop to avoid a copy of X

basically it avoids this line:

args = [X[safe_mask(X, y == k)] for k in np.unique(y)]

the good approach would be cython but since our implementation works for
sparse data it's more work ...

@jeffvanvoorst
Copy link

I have been looking at the sparse matrix code (scipy.sparse). For large matrices, it is my opinion that CSC is the best sparse format with respect to memory access at a low level. However, if memory is the primary concern, then all sparse formats should be supported.

I am guessing that "how low level to go" would depend on benchmarks. I have downloaded some of the larger matrices from The University of Florida Sparse Matrix Collection that I have estimated will fit in memory for a machine with 4GB of RAM. I have tried to stay away from the diagonal ones as it is my opinion that they are less representative of typical sparse matrices for machine learning. If someone wants to comment on which matrix I should be using instead, I could look at those.

I have briefly looked at scipy with respect to computing the "element-wise" norms for a matrix (http://en.wikipedia.org/wiki/Matrix_norm), but according to the SciPy Dev Wiki (http://projects.scipy.org/scipy/ticket/18) there is little interest in this area. Also, column or row sum of powers of matrix entries will be less efficient than summing the entire matrix into one sum (for a large matrix).

My conclusion is I need to benchmark multiple ways of computing:
args = [X[safe_mask(X, y == k)] for k in np.unique(y)]
for large sparse matrices. If anyone has good matrices to use, please let me know, otherwise I will use the ones I have downloaded from The University of Florida Sparse Matrix Collection.

@amueller
Copy link
Member

I think it is enough to support csr and convert all sparse input to csr before.
I don't think it would be that big a deal to rewrite. I think the problem is rather the current structure. I don't see a way to make a call to f_one_way if we avoid the copy. So we need to refactor the functions and have one specifically for the multi-class case, which can then be reused from f_one_way.
There are already some sparse matrix utilities in the utils.sparsefuncs module.

@jeffvanvoorst
Copy link

Ok, I will have to look at which functions call f_one_way. I was assuming that f_one_way could be modified as a "black box", but that appears to be incorrect.

@jeffvanvoorst
Copy link

I was looking in the wrong function with respect to:
args = [X[safe_mask(X, y == k)] for k in np.unique(y)]

It would still be useful to benchmark the code to determine the tradeoffs (if any) of computation versus space for a large enough sample size that the copy is a problem.

@amueller
Copy link
Member

I think it is not really a tradeoff. It is a more a matter of refactoring the code. Either we special case the multivariate version or we implement it in such a way that it can be called by the univariate function.

@jnothman
Copy link
Member

Am I correct in thinking this issue pertains to f_classif rather than SelectKBest?

@amueller
Copy link
Member

who tagged that for the release? was it me? I feel it is a tad non-trival and I would like to not do this for the release. If someone else tagged this, please ping me.

@amueller amueller modified the milestones: 0.15.1, 0.15 Jul 18, 2014
@amueller amueller removed this from the 0.16 milestone Feb 25, 2015
@amueller amueller modified the milestone: 0.19 Sep 29, 2016
@amueller amueller removed this from the 0.19 milestone Oct 27, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants