[MRG+2] Use fused types in sparse mean variance functions #6593
This is a follow-up PR to #6588, which tries to make functions in `sklearn/utils/sparsefuncs_fast.pyx` support fused types. In this PR, I focus on the sparse mean and variance functions.
As the profiling results below show, the memory usage in the bracketed region indeed decreases.
Hello @MechCoder,

About memory profiling, I profiled the script below with memory_profiler's `@profile` decorator.
Here is the memory usage over time:

As you can see, the memory usage in the bracketed region decreases drastically.
Here is my test script:
```python
import numpy as np
import scipy.sparse as sp
from sklearn.utils.sparsefuncs import mean_variance_axis

X = np.random.rand(5000000, 20)
X = X.astype(np.float32)
X_csr = sp.csr_matrix(X)

@profile
def test():
    X_means, X_vars = mean_variance_axis(X_csr, axis=0)
    print X_means.dtype

test()
```
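(A note on the setup, which the comment doesn't spell out: a bare `@profile` decorator like this is the one injected by `memory_profiler`, so a usage-over-time plot such as the one above can be produced with `mprof run script.py` followed by `mprof plot`; the filename here is just a placeholder.)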
I think the peak memory usage appears when I initialize `X`: `np.random.rand(5000000, 20)` first allocates a float64 array of roughly 800 MB before the float32 copy is made.
It's not really an issue with this PR, but I suspect we should have a test there to ensure the result is sensible for integer dtypes. You could just have something like:
```python
for input_dtype, expected_dtype in [(np.float32, np.float32),
                                    (np.int32, np.float64), ...]
```
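Expanding that fragment into a fuller sketch (my own illustration, not the PR's actual test; it assumes `mean_variance_axis` upcasts integer input to float64 output, which is exactly the behavior under discussion here):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.utils.sparsefuncs import mean_variance_axis

def test_mean_variance_axis_dtypes():
    rng = np.random.RandomState(0)
    X = (rng.rand(40, 5) * 10).astype(np.float64)
    # Expected output dtype per input dtype; the integer cases assume
    # the upcasting behavior proposed in this thread.
    for input_dtype, expected_dtype in [(np.float32, np.float32),
                                        (np.float64, np.float64),
                                        (np.int32, np.float64),
                                        (np.int64, np.float64)]:
        X_typed = X.astype(input_dtype)
        X_csr = sp.csr_matrix(X_typed)
        means, variances = mean_variance_axis(X_csr, axis=0)
        assert means.dtype == expected_dtype
        assert variances.dtype == expected_dtype
        # Results should agree with the dense computation.
        np.testing.assert_allclose(means, X_typed.mean(axis=0), rtol=1e-5)
        np.testing.assert_allclose(variances, X_typed.var(axis=0), rtol=1e-4)
```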
In fact, the
Thinking about it further: this change still copies the data for integer inputs, which we could avoid with a more generic fused type while still producing float output. Can we assume that the mean of explicitly integer features is not something we're often interested in, so it's not worth the additional compilation time?
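To make that idea concrete, here is a minimal sketch of such a "more generic fused type" (entirely hypothetical names, not the PR's code): integer data is read in place, so no `astype` copy is needed, while the accumulator and output stay float64.

```cython
# Hypothetical sketch, not the PR's code: a generic fused type that
# lets the CSR routine read integer data in place (no astype() copy)
# while always accumulating into, and returning, float64.
import numpy as np
cimport numpy as np

ctypedef fused data_t:
    np.float32_t
    np.float64_t
    np.int32_t
    np.int64_t

def csr_mean_axis0(np.ndarray[data_t, ndim=1] data,
                   np.ndarray[np.int32_t, ndim=1] indices,
                   unsigned long n_samples, unsigned long n_features):
    """Column means of a CSR matrix; output is float64 for any input."""
    cdef np.ndarray[np.float64_t, ndim=1] means = np.zeros(n_features)
    cdef Py_ssize_t i
    for i in range(data.shape[0]):
        means[indices[i]] += <np.float64_t>data[i]
    means /= n_samples
    return means
```

The compilation-time cost mentioned above is real: every dtype added to the fused type multiplies the number of specializations Cython generates and compiles.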
We also need a what's new entry boasting what we've enhanced.
Hello @jnothman, thanks for your review.

Maybe it's my lack of knowledge, but why is the mean of integer features often not something we're interested in?