
Reduce memory footprint for large datasets #90

Open
willwerscheid opened this issue Oct 24, 2018 · 4 comments
@willwerscheid (Member)

Some ideas:

  1. We do not need to store both Y and Yorig in the flash data object.
  2. We should probably store tau as a vector when var_type = by_column or by_row. This could be tricky, but it's probably worth it since flash fit objects are frequently copied.
  3. It shouldn't be too difficult to allow Y to be a dgCMatrix, and likewise for S.

If we do the above, then the only large dense matrices will be the matrices of residuals and squared residuals. (Or rather, R2, Rk, and R2k for the greedy step.) So, optimistically, we might be able to shoot for a memory requirement of 5x the size of the original data (measured as a dense matrix) when Y is sparse and 6-8x otherwise.
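To get a rough sense of the potential savings from point 3: a dgCMatrix stores only the nonzero entries plus their index vectors, so for a mostly-zero Y the footprint shrinks dramatically. A minimal illustration (not flashr code, just the Matrix package):

```r
library(Matrix)

set.seed(1)
n <- 1000
Y <- matrix(0, n, n)
Y[sample(n^2, 1000)] <- rnorm(1000)  # ~0.1% nonzero entries

Ysp <- Matrix(Y, sparse = TRUE)      # coerce to a dgCMatrix

object.size(Y)    # dense: n * n * 8 bytes, about 8 MB
object.size(Ysp)  # sparse: proportional to the number of nonzeros
```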

@pcarbo (Member) commented Oct 24, 2018

@willwerscheid If the data matrix is sparse, but you also have other matrices of the same dimensions that are dense, there really isn't much benefit to allowing the data matrix to be sparse, and it may complicate your code (e.g., you may have to convert the results of matrix-vector products back to dense).

The key is to reduce the number of times you modify large matrices within nested functions. For example, if I am doing

f <- function (x) {
   g(x)
}
out <- f(x)

and both f and g modify x, then this will create two extra copies of x.
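To make this concrete, here is a small illustration of R's copy-on-modify semantics (hypothetical functions, not flashr code): each function that modifies its argument works on its own copy, which is why the caller's x is untouched but peak memory usage grows.

```r
g <- function(x) {
  x[1] <- 0   # g modifies its own copy of x
  x
}

f <- function(x) {
  x[1] <- -1  # f modifies another copy before passing it on
  g(x)
}

x <- c(5, 6)
out <- f(x)

x    # still c(5, 6): the caller's x was never modified
out  # c(0, 6): two extra copies of x were made along the way
```

On builds with memory profiling enabled, `tracemem(x)` in an interactive session will print a message each time such a copy occurs.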

@willwerscheid (Member, Author)

@pcarbo Exactly. I think the main advantage of point 3 is actually that datasets are often downloadable as dgCMatrix objects, so if this is easy to implement (it should be!), we save the user the trouble of converting. I don't know why I included S in that comment; S will of course almost never be sparse.

@pcarbo (Member) commented Oct 24, 2018

@willwerscheid Up to you. An alternative is to only accept matrices of class "matrix" and tell the user to do the conversion themselves (which can easily be done with as.matrix).
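For reference, the conversion is a one-liner. A small made-up dgCMatrix standing in for a downloaded dataset:

```r
library(Matrix)

# A small example dgCMatrix (stand-in for a downloaded dataset):
Ysp <- sparseMatrix(i = c(1, 3), j = c(2, 1), x = c(1.5, 2), dims = c(3, 3))

Y <- as.matrix(Ysp)  # coerce to a dense base-R matrix
is.matrix(Y)         # TRUE
```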

@willwerscheid (Member, Author)

After looking more closely, I don't think any of these are worth doing, at least for now:

  1. This would indeed save memory equal to the size of Y, but it's only a one-time savings, and it would require a lot of code changes with potentially subtle consequences.
  2. R is smarter about memory than I had assumed: we currently only ever have two copies of tau in memory (and we need both in case we discard the latest update). So we would only save up to 2x the size of tau, and each var_type would require a separate implementation of the factor/loading updates, which would get very messy.
  3. This is also trickier than I thought it would be, mostly because subsetting with sparse matrices is not straightforward.
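As one example of the subsetting quirks mentioned in point 3 (a minimal illustration, not flashr code): whether a subset of a dgCMatrix stays sparse depends on the drop argument, so code that subsets Y would need to handle both cases.

```r
library(Matrix)

M <- Matrix(c(0, 0, 3, 0, 2, 0, 1, 0, 0), nrow = 3, sparse = TRUE)

class(M[, 1:2])              # still a sparse Matrix
class(M[, 1])                # drops to a plain numeric vector
class(M[, 1, drop = FALSE])  # stays sparse
```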
