Memory scalability issues in univariate feature selection #1651
Their implementation does a for loop to avoid a copy of X; basically it avoids this line: `args = [X[safe_mask(X, y == k)] for k in np.unique(y)]`. The good approach would be Cython, but since our implementation works for
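As a rough illustration of the loop-based idea above, here is a hedged sketch (not sklearn's actual code, and `f_oneway_streaming` is a hypothetical name) of computing one-way ANOVA F-statistics per feature while only ever materializing one class's rows at a time, instead of building the full `args` list of per-class copies at once:

```python
import numpy as np

def f_oneway_streaming(X, y):
    """One-way ANOVA F-statistic per column, one class copy at a time.

    Peak extra memory is the largest single class, not all classes.
    """
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y)
    classes = np.unique(y)
    n_samples, n_features = X.shape

    sums = np.zeros((len(classes), n_features))
    counts = np.zeros(len(classes))
    for i, k in enumerate(classes):
        mask = y == k
        sums[i] = X[mask].sum(axis=0)  # copies only this class's rows
        counts[i] = mask.sum()

    grand_mean = sums.sum(axis=0) / n_samples
    # between-group sum of squares: sum_k n_k * (mean_k - grand_mean)^2
    ss_between = (sums**2 / counts[:, None]).sum(axis=0) - n_samples * grand_mean**2
    # total sum of squares, then within-group by subtraction
    ss_total = (X**2).sum(axis=0) - n_samples * grand_mean**2
    ss_within = ss_total - ss_between

    df_between = len(classes) - 1
    df_within = n_samples - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)
```

The same accumulation could be pushed further (e.g. into Cython, as suggested) to avoid even the per-class row copy.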
I have been looking at the sparse matrix code (scipy.sparse). For large matrices, it is my opinion that CSC is the best sparse format with respect to memory access at a low level. However, if memory is the primary concern, then all sparse formats should be supported. I am guessing that "how low level to go" would depend on benchmarks.

I have downloaded some of the larger matrices from The University of Florida Sparse Matrix Collection that I have estimated will fit in memory on a machine with 4GB of RAM. I have tried to stay away from the diagonal ones, as it is my opinion that they are less representative of typical sparse matrices for machine learning. If someone wants to comment on which matrices I should be using instead, I could look at those.

I have briefly looked at scipy with respect to computing the "element-wise" norms for a matrix (http://en.wikipedia.org/wiki/Matrix_norm), but according to the SciPy Dev Wiki (http://projects.scipy.org/scipy/ticket/18) there is little interest in this area. Also, a column or row sum of powers of matrix entries will be less efficient than summing the entire matrix into one value (for a large matrix). My conclusion is I need to benchmark multiple ways of computing:
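To make the benchmarking question concrete, here is a minimal sketch of three candidate ways to compute per-column sums of squared entries of a scipy.sparse matrix; the function names are illustrative, not part of scipy or sklearn, and which variant wins would depend on the benchmarks discussed above:

```python
import numpy as np
import scipy.sparse as sp

def colsum_sq_power(X):
    # element-wise power, then a sparse column sum
    return np.asarray(X.power(2).sum(axis=0)).ravel()

def colsum_sq_multiply(X):
    # element-wise self-product; same result, different temporaries
    return np.asarray(X.multiply(X).sum(axis=0)).ravel()

def colsum_sq_data(X):
    # operate directly on the CSR .data/.indices arrays:
    # no sparse temporary matrix is allocated at all
    X = X.tocsr()
    return np.bincount(X.indices, weights=X.data**2, minlength=X.shape[1])
```

The first two allocate an intermediate sparse matrix the same size as X; the third works on the raw CSR arrays, which is the kind of memory difference that only a benchmark on large matrices would quantify.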
I think it is enough to support CSR and convert all sparse input to CSR beforehand.
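The "convert everything to CSR up front" convention could look like the following sketch; the helper name `as_csr` is hypothetical and not sklearn's actual input-validation code:

```python
import scipy.sparse as sp

def as_csr(X):
    """Return X as a CSR matrix, converting only when necessary."""
    if sp.issparse(X):
        return X.tocsr()       # returns X itself if it is already CSR
    return sp.csr_matrix(X)    # build a CSR matrix from dense input
```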
Ok, I will have to look at which functions call f_one_way. I was assuming that f_one_way could be modified as a "black box", but that appears to be incorrect.
I was looking in the wrong function. It would still be useful to benchmark the code to determine the tradeoffs (if any) of computation versus space, for a sample size large enough that the copy is a problem.
I think it is not really a tradeoff. It is more a matter of refactoring the code: either we special-case the multivariate version, or we implement it in such a way that it can be called by the univariate function.
Am I correct in thinking this issue pertains to
Who tagged that for the release? Was it me? I feel it is a tad non-trivial and I would like to not do this for the release. If someone else tagged this, please ping me.
Gensim has a re-implementation of `selectkbest` to fix memory scalability issues. Someone should review it and integrate the fix back into the original sklearn codebase: https://github.com/piskvorky/gensim/blob/a73c84e21aecd3cc77ba2d752912f73b712bc60a/gensim/models/selectkbest.py
Maybe @agramfort you can have a look?