[WIP] out-of-core SVD #2749

Open
alimuldal wants to merge 1 commit into scikit-learn:master from alimuldal:out_of_core_svd


@alimuldal

Hi all,

I've had a crack at implementing a function for out-of-core SVD on large memory-mapped arrays (as per issue #2661). It's based on the algorithm in Halko et al., 2011, which is essentially identical to the one currently used by randomized_svd (originally from Halko et al., 2009). The main difference is that I've implemented a function for computing dot products on pre-cached blocks of memory-mapped arrays. Doing manual buffering in this way is many times faster than calling np.dot directly on an np.memmap array (see my SO question/answer here).
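The core idea is roughly the following (a simplified sketch, not the exact code in the PR; the actual _blockwise_dot also blocks over columns and supports an out argument):

import numpy as np

def blockwise_dot_sketch(A, B, max_rows):
    # A is an (m, n) memory-mapped array, B an (n, o) in-memory array.
    # Read A in horizontal strips, pull each strip into core memory,
    # then multiply strip by strip.
    m, n = A.shape
    out = np.empty((m, B.shape[1]), dtype=np.result_type(A, B))
    for start in range(0, m, max_rows):
        stop = min(start + max_rows, m)
        A_block = A[start:stop].copy()  # copy to force an actual read into RAM
        out[start:stop] = np.dot(A_block, B)
    return out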

For the time being I've just added another user option to randomized_svd to force buffering, with an additional keyword argument to control the size of the buffer in bytes. If people think this would be a potentially useful avenue to go down, I could have a look at integrating it into the RandomizedPCA class. However, there may be issues arising from the fact that the PCA classes force a copy of the input array (in order to do centering etc.), which would be problematic for very large memory-mapped arrays.

Let me know what you think.

@coveralls

Coverage Status

Coverage remained the same when pulling 0b9fee6 on alimuldal:out_of_core_svd into cdebc18 on scikit-learn:master.


@ogrisel
Member
ogrisel commented Jan 14, 2014

Interesting. Have you tried to benchmark your randomized_svd with buffer=True on a dataset that does not fit in main memory?

As for the centering, I think it should be possible to add an option to lazily center the input data as part of the randomized_range_finder loop: compute X_mean = X.mean(axis=0) once, pass it as an additional optional argument to _blockwise_dot, and leverage the identity:

np.dot(A - A.mean(axis=0), B) == np.dot(A, B) - np.dot(A.mean(axis=0), B)
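For instance, a quick numerical check of that identity (illustrative sketch only):

import numpy as np

rng = np.random.RandomState(0)
A = rng.rand(50, 20)
B = rng.rand(20, 10)

# Centering A explicitly vs. correcting the product afterwards;
# A.mean(axis=0) has shape (20,), so the second term broadcasts
# across the rows of np.dot(A, B).
assert np.allclose(np.dot(A - A.mean(axis=0), B),
                   np.dot(A, B) - np.dot(A.mean(axis=0), B))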
@alimuldal

@ogrisel

> Interesting. Have you tried to benchmark your randomized_svd with buffer=True on a dataset that does not fit in main memory?

I'll try to do some overnight benchmarking whenever I can find the time. It's a bit painful to benchmark, since working on arrays that are too big to fit in core memory is intrinsically a pretty slow process!

> As for the centering, I think it should be possible to add an option to lazily center the input data as part of the randomized_range_finder loop: compute X_mean = X.mean(axis=0) once, pass it as an additional optional argument to _blockwise_dot, and leverage the identity:
>
> np.dot(A - A.mean(axis=0), B) == np.dot(A, B) - np.dot(A.mean(axis=0), B)

Good suggestion.

@larsmans
Member

Ping myself, this is interesting.

@ogrisel
Member
ogrisel commented Jan 16, 2014

@alimuldal you could make this more practical by using a virtual machine (e.g. VirtualBox) to limit the RAM size to 2GB, for instance, optionally via Vagrant to automate its configuration.

@datnamer

Maybe the recent work on out-of-core Blaze/NumPy can help: http://matthewrocklin.com/blog/work/2014/12/30/Towards-OOC-Frontend/

@ogrisel ogrisel commented on the diff Feb 19, 2015
sklearn/utils/extmath.py
+ processed simultaneously.
+ """
+ m, n = A.shape
+ n1, o = B.shape
+ if n1 != n:
+ raise ValueError('matrices are not aligned')
+
+ if out is None:
+ out = np.empty((m, o), dtype=np.result_type(A, B))
+ elif out.shape != (m, o):
+ raise ValueError('output array has incorrect dimensions')
+
+ for mm in _block_slices(m, max_rows):
+ out[mm, :] = 0
+ for nn in _block_slices(n, max_cols):
+ A_block = A[mm, nn].copy() # copy to force a read
@ogrisel
ogrisel Feb 19, 2015 scikit-learn member

Why is this needed? Have you benchmarked the impact of this line? Can't the prefetch heuristics of the OS do a good job in that case? We could spare an unnecessary malloc here.
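For instance, a micro-benchmark along these lines would settle it (a hypothetical setup; big.dat and the shapes are made up, and the file should be larger than the page cache for the comparison to be meaningful):

import numpy as np
from timeit import timeit

# Hypothetical on-disk array (~800 MB); mode='w+' creates the file.
A = np.memmap('big.dat', dtype=np.float64, mode='w+', shape=(100000, 1000))
B = np.random.rand(1000, 50)
mm, nn = slice(0, 10000), slice(0, 1000)

with_copy = timeit(lambda: np.dot(A[mm, nn].copy(), B), number=10)
no_copy = timeit(lambda: np.dot(A[mm, nn], B), number=10)
print('with copy: %.3fs, without: %.3fs' % (with_copy, no_copy))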

@ogrisel ogrisel commented on the diff Feb 19, 2015
sklearn/utils/extmath.py
+ while True:
+ yield slice(count, count + block_size, 1)
+ count += block_size
+ if count > dim_size:
+ raise StopIteration
+
+
+def _blockwise_dot(A, B, max_rows, max_cols, out=None):
+ """Computes the dot product of two matrices in a block-wise fashion. Only
+ blocks of `A` with a maximum size of `(max_rows, max_cols)` will be
+ processed simultaneously.
+ """
+ m, n = A.shape
+ n1, o = B.shape
+ if n1 != n:
+ raise ValueError('matrices are not aligned')
@ogrisel
ogrisel Feb 19, 2015 scikit-learn member

Please provide informative error messages with the actual values, for instance:

raise ValueError('Incompatible input array shapes to compute dot product: %r and %r'
                 % (A.shape, B.shape))
@ogrisel ogrisel commented on the diff Feb 19, 2015
sklearn/utils/extmath.py
+
+
+def _blockwise_dot(A, B, max_rows, max_cols, out=None):
+ """Computes the dot product of two matrices in a block-wise fashion. Only
+ blocks of `A` with a maximum size of `(max_rows, max_cols)` will be
+ processed simultaneously.
+ """
+ m, n = A.shape
+ n1, o = B.shape
+ if n1 != n:
+ raise ValueError('matrices are not aligned')
+
+ if out is None:
+ out = np.empty((m, o), dtype=np.result_type(A, B))
+ elif out.shape != (m, o):
+ raise ValueError('output array has incorrect dimensions')
@ogrisel
ogrisel Feb 19, 2015 scikit-learn member

Same comment here: please state the observed shape of out and the expected shape (m, o) in the error message.

@ogrisel
Member
ogrisel commented Feb 19, 2015

Could you please add tests for the buffer=True case and check that the results match the output of the same call with buffer=False.

For integration with RandomizedPCA one could just add buffer=False as an additional kwarg and explain in the docstring that setting buffer=True is only interesting when computing the PCA of a large, dense numpy.memmap array that does not necessarily fit in physical memory all at once.
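Something along these lines, perhaps (a sketch, assuming the buffer kwarg lands as proposed in this PR; the array here is small only to keep the test fast):

import numpy as np
from numpy.testing import assert_array_almost_equal
from sklearn.utils.extmath import randomized_svd

def test_randomized_svd_buffered():
    # The buffered and unbuffered code paths should agree to numerical
    # precision when given the same random_state.
    rng = np.random.RandomState(42)
    X = rng.rand(200, 80)

    U, s, V = randomized_svd(X, n_components=10, random_state=0)
    # buffer=True is the option proposed in this PR (not in master).
    Ub, sb, Vb = randomized_svd(X, n_components=10, random_state=0,
                                buffer=True)

    assert_array_almost_equal(s, sb)
    # Compare reconstructions rather than U and V directly, to be
    # robust to potential sign flips of singular vector pairs.
    assert_array_almost_equal(np.dot(U, V), np.dot(Ub, Vb))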
