Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

using sparse matrix #70

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Conversation

paoloinglese
Copy link
Contributor

I've made some changes to use sparse matrices from "Matrix".

I think it may be better in terms of memory usage, especially for imaging data, where data are often quite sparse.

@sgibb sgibb self-assigned this Oct 20, 2021
@sgibb sgibb self-requested a review October 20, 2021 09:27
@sgibb sgibb added this to the 2.0 milestone Oct 20, 2021
@sgibb
Copy link
Owner

sgibb commented Oct 20, 2021

Thanks a lot for this great contribution! I will need some time to ensure it doesn't have any negative side effects to (my) workflows and downstream dependencies.

@@ -1,5 +1,12 @@
import("methods")

importFrom("Matrix",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add Matrix to Depends in the DESCRIPTION file to make R CMD check working.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I'll do! Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've committed the fixed DESCRIPTION file.

@sgibb sgibb linked an issue Oct 20, 2021 that may be closed by this pull request
@paoloinglese
Copy link
Contributor Author

I'm having a look at the errors and fixing them.

Added support to sparse matrices and tests.
@paoloinglese
Copy link
Contributor Author

paoloinglese commented Oct 21, 2021

Hi,

I've made some modifications to the code.
Since you make a distinction between 0 and NA, I've used a simple hack to drop the NAs and keep the 0.
Basically, the 0s are transformed into .Machine$double.xmin and the NAs are set to 0 before making the sparse matrix (see sparseMatrix-functions.R).
In this way, you can continue to keep the zeros and the NAs continue to play the role of missing values.

I've also left the support to dense matrices, so you can easily adapt the code in case you need some legacy support.

I had to write new C++ functions for the colMeans and colMedians (/src/colMediansArma.cpp), where the zeros of the sparse matrix behave like NAs, as expected.
Please also have a look at the new tests for the sparse matrices. I wish there's no huge mistake.
I've tested the code on a Mac, I couldn't check if it runs on Win and Linux, unfortunately.

Of course, the support to the algebra of the sparseMatrixNA is missing (if you sum a "missing" value and number you get a number instead of the "missing"), so you can have some "weird" behaviours for downstream analysis. But, that would require a new class. Another solution would be dropping the difference between NAs and zeros, in that way the implementation would be straightforward.

@codecov
Copy link

codecov bot commented Oct 21, 2021

Codecov Report

Merging #70 (aa59b28) into master (b1ca2ff) will decrease coverage by 0.66%.
The diff coverage is 84.05%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #70      +/-   ##
==========================================
- Coverage   88.44%   87.78%   -0.67%     
==========================================
  Files          81       82       +1     
  Lines        1411     1514     +103     
==========================================
+ Hits         1248     1329      +81     
- Misses        163      185      +22     
Impacted Files Coverage Δ
R/as.matrix-functions.R 80.95% <63.63%> (-19.05%) ⬇️
src/colMediansArma.cpp 73.84% <73.84%> (ø)
R/intensityMatrix-functions.R 93.33% <75.00%> (-6.67%) ⬇️
R/colMedians-functions.R 100.00% <100.00%> (ø)
R/filterPeaks-functions.R 100.00% <100.00%> (ø)
R/merge-functions.R 100.00% <100.00%> (ø)
R/sparseMatrix-functions.R 100.00% <100.00%> (ø)
src/init.c

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b1ca2ff...aa59b28. Read the comment docs.

@sgibb
Copy link
Owner

sgibb commented Oct 21, 2021

Wow, what a great PR!

Since you make a distinction between 0 and NA, I've used a simple hack to drop the NAs and keep the 0.
Basically, the 0s are transformed into .Machine$double.xmin and the NAs are set to 0 before making the sparse matrix

Could you please explain why this is necessary? NA means missing (which should be automatically handled by the sparse matrix, because there is no value), and zero means zero (could happen after baseline correction or transformation; should be a zero in the sparse matrix). When you replace all NA by zero and 0 by .Machine$double.xmin isn't the sparse matrix completely dense (and require 3x times the memory of a matrix)?

In MSnbase we often have problems with Rcpp (mostly different complied versions). While it is a great package I don't want to add this dependency if it is not really necessary. So I would maybe replace the .colMedians with apply(x, 2, median) instead (would be slower but IMHO it isn't heavily used by the users anyway) making maintaining easier.

@paoloinglese
Copy link
Contributor Author

paoloinglese commented Oct 21, 2021

Wow, what a great PR!

Thanks! :)

Since you make a distinction between 0 and NA, I've used a simple hack to drop the NAs and keep the 0.
Basically, the 0s are transformed into .Machine$double.xmin and the NAs are set to 0 before making the sparse matrix

Could you please explain why this is necessary? NA means missing (which should be automatically handled by the sparse matrix, because there is no value), and zero means zero (could happen after baseline correction or transformation; should be a zero in the sparse matrix). When you replace all NA by zero and 0 by .Machine$double.xmin isn't the sparse matrix completely dense (and require 3x times the memory of a matrix)?

In sparseMatrix, NAs occupy memory unfortunately. Otherwise, it would be straightforward to implement it.

Basically, the idea is to convert the zeros to .Machine$double.xmin and treat the default values of the sparseMatrix as missing. In this way, you preserve the true zeros, and the NAs don't occupy the memory anymore. The tricky part is taking care of this when you do algebra with this matrix, because you want to get the right results (the zeros of the sparse matrix should behave like NAs). For this reason, I wrote the C++ versions colMeans and colMedians. Basically, these two give you the correct result, treating the zeros as NAs.
When you define the matrix, you have by default NAs, and you fill the elements with some values. In this case, you just need to convert the values equal to zero to .Machine$double.xmin and fill a sparse matrix. The zeros will take the role of the old NAs.

In MSnbase we often have problems with Rcpp (mostly different complied versions). While it is a great package I don't want to add this dependency if it is not really necessary. So I would maybe replace the .colMedians with apply(x, 2, median) instead (would be slower but IMHO it isn't heavily used by the users anyway) making maintaining easier.

That's easily solved by writing the C versions. Unfortunately, apply wouldn't work because it treats the zeros as zeros while they need to be treated as NAs.

@sgibb
Copy link
Owner

sgibb commented Oct 21, 2021

In sparseMatrix, NAs occupy memory unfortunately. Otherwise, it would be straightforward to implement it.

Thanks for clarification. I wasn't aware of this detail of the Matrix implementation. I am wondering if we need to reimplement our own SparseMatrix or just treat NA and 0 (which isn't present very often) as missing.
I have to think about it a while. Finally the intensityMatrix should have a format that everybody can work with as expected (empty cells should be missing (NA) or imputed by spectra information (in that case it would be a dense matrix anyway).

@paoloinglese
Copy link
Contributor Author

In sparseMatrix, NAs occupy memory unfortunately. Otherwise, it would be straightforward to implement it.

Thanks for clarification. I wasn't aware of this detail of the Matrix implementation. I am wondering if we need to reimplement our own SparseMatrix or just treat NA and 0 (which isn't present very often) as missing. I have to think about it a while. Finally the intensityMatrix should have a format that everybody can work with as expected (empty cells should be missing (NA) or imputed by spectra information (in that case it would be a dense matrix anyway).

Another completely different route is using something like bigmemory.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Massive memory usage while using mergeMassPeaks and FilterPeaks
2 participants