
Reduce memory footprint for large datasets #90

Open
willwerscheid opened this issue Oct 24, 2018 · 4 comments
@willwerscheid (Member)

Some ideas:

  1. We do not need to store both Y and Yorig in the flash data object.
  2. We should probably store tau as a vector when var_type = by_column or by_row. This could be tricky, but it's probably worth it since flash fit objects are frequently copied.
  3. It shouldn't be too difficult to allow Y to be a dgCMatrix, and likewise for S.

If we do the above, then the only large dense matrices will be the matrices of residuals and squared residuals. (Or rather, R2, Rk, and R2k for the greedy step.) So, optimistically, we might be able to shoot for a memory requirement of 5x the size of the original data (measured as a dense matrix) when Y is sparse and 6-8x otherwise.
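To get a rough sense of the potential savings from point 3: a dgCMatrix stores only the nonzero entries plus their index vectors, so for a mostly-zero Y the footprint shrinks dramatically. A minimal illustration (not flashr code, just the Matrix package):

```r
library(Matrix)

set.seed(1)
n <- 1000
Y <- matrix(0, n, n)
Y[sample(n^2, 1000)] <- rnorm(1000)  # ~0.1% nonzero entries

Ysp <- Matrix(Y, sparse = TRUE)      # coerce to a dgCMatrix

object.size(Y)    # dense: n * n * 8 bytes, about 8 MB
object.size(Ysp)  # sparse: proportional to the number of nonzeros
```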

@pcarbo (Member) commented Oct 24, 2018

@willwerscheid If the data matrix is sparse, but you also have other matrices of the same dimensions that are dense, there really isn't much benefit to allowing the data matrix to be sparse, and it may complicate your code (e.g., you may have to convert the results of matrix-vector products back to dense).

The key is to reduce the number of times you modify large matrices within nested functions. For example, if I am doing

f <- function (x) {
   g(x)
}
out <- f(x)

and both f and g modify x, then this will create two extra copies of x.
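To make this concrete, here is a small illustration of R's copy-on-modify semantics (hypothetical functions, not flashr code): each function that modifies its argument works on its own copy, which is why the caller's x is untouched but peak memory usage grows.

```r
g <- function(x) {
  x[1] <- 0   # g modifies its own copy of x
  x
}

f <- function(x) {
  x[1] <- -1  # f modifies another copy before passing it on
  g(x)
}

x <- c(5, 6)
out <- f(x)

x    # still c(5, 6): the caller's x was never modified
out  # c(0, 6): two extra copies of x were made along the way
```

On builds with memory profiling enabled, `tracemem(x)` in an interactive session will print a message each time such a copy occurs.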

@willwerscheid (Member, Author)

@pcarbo Exactly. I think the main advantage of point 3 is actually that datasets are often downloadable as dgCMatrix objects, so if this is easy to implement (it should be!), we save the user the trouble of converting. I don't know why I included S in that comment; S will of course almost never be sparse.

@pcarbo (Member) commented Oct 24, 2018

@willwerscheid Up to you. An alternative is to only accept matrices of class "matrix" and tell the user to do the conversion themselves (which can easily be done with as.matrix).
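For reference, the conversion is a one-liner. A small made-up dgCMatrix standing in for a downloaded dataset:

```r
library(Matrix)

# A small example dgCMatrix (stand-in for a downloaded dataset):
Ysp <- sparseMatrix(i = c(1, 3), j = c(2, 1), x = c(1.5, 2), dims = c(3, 3))

Y <- as.matrix(Ysp)  # coerce to a dense base-R matrix
is.matrix(Y)         # TRUE
```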

@willwerscheid (Member, Author)

After looking more closely, I don't think any of these are worth doing, at least for now:

  1. This would indeed save memory equal to the size of Y, but it's only a one-time savings, and it would require a lot of code changes with potentially subtle consequences.
  2. R is smarter about memory than I had assumed: we currently only ever have two copies of tau in memory (and we need both in case we discard the latest update). So we would only save up to 2x the size of tau, and each var_type would require a separate implementation of the factor/loading updates, which would get very messy.
  3. This is also trickier than I thought it would be, mostly because subsetting with sparse matrices is not straightforward.
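As one example of the subsetting quirks mentioned in point 3 (a minimal illustration, not flashr code): whether a subset of a dgCMatrix stays sparse depends on the drop argument, so code that subsets Y would need to handle both cases.

```r
library(Matrix)

M <- Matrix(c(0, 0, 3, 0, 2, 0, 1, 0, 0), nrow = 3, sparse = TRUE)

class(M[, 1:2])              # still a sparse Matrix
class(M[, 1])                # drops to a plain numeric vector
class(M[, 1, drop = FALSE])  # stays sparse
```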
