using sparse matrix #70

paoloinglese · 2021-10-16T15:11:13Z

I've made some changes to use sparse matrices from "Matrix".

I think it may be better in terms of memory usage, especially for imaging data, where data are often quite sparse.

sgibb · 2021-10-20T09:28:52Z

Thanks a lot for this great contribution! I will need some time to ensure it doesn't have any negative side effects to (my) workflows and downstream dependencies.

sgibb · 2021-10-20T09:33:28Z

NAMESPACE

@@ -1,5 +1,12 @@
 import("methods")

+importFrom("Matrix",


Please add Matrix to Depends in the DESCRIPTION file to make R CMD check working.

Ok I'll do! Thanks!

I've committed the fixed DESCRIPTION file.

paoloinglese · 2021-10-20T13:20:15Z

I'm having a look at the errors and fixing them.

Added support to sparse matrices and tests.

paoloinglese · 2021-10-21T00:11:52Z

Hi,

I've made some modifications to the code.
Since you make a distinction between 0 and NA, I've used a simple hack to drop the NAs and keep the 0.
Basically, the 0s are transformed into .Machine$double.xmin and the NAs are set to 0 before making the sparse matrix (see sparseMatrix-functions.R).
In this way, you can continue to keep the zeros and the NAs continue to play the role of missing values.

I've also left the support to dense matrices, so you can easily adapt the code in case you need some legacy support.

I had to write new C++ functions for the colMeans and colMedians (/src/colMediansArma.cpp), where the zeros of the sparse matrix behave like NAs, as expected.
Please also have a look at the new tests for the sparse matrices. I wish there's no huge mistake.
I've tested the code on a Mac, I couldn't check if it runs on Win and Linux, unfortunately.

Of course, the support to the algebra of the sparseMatrixNA is missing (if you sum a "missing" value and number you get a number instead of the "missing"), so you can have some "weird" behaviours for downstream analysis. But, that would require a new class. Another solution would be dropping the difference between NAs and zeros, in that way the implementation would be straightforward.

codecov · 2021-10-21T06:55:20Z

Codecov Report

Merging #70 (aa59b28) into master (b1ca2ff) will decrease coverage by 0.66%.
The diff coverage is 84.05%.

@@            Coverage Diff             @@
##           master      #70      +/-   ##
==========================================
- Coverage   88.44%   87.78%   -0.67%     
==========================================
  Files          81       82       +1     
  Lines        1411     1514     +103     
==========================================
+ Hits         1248     1329      +81     
- Misses        163      185      +22

Impacted Files	Coverage Δ
R/as.matrix-functions.R	`80.95% <63.63%> (-19.05%)`	⬇️
src/colMediansArma.cpp	`73.84% <73.84%> (ø)`
R/intensityMatrix-functions.R	`93.33% <75.00%> (-6.67%)`	⬇️
R/colMedians-functions.R	`100.00% <100.00%> (ø)`
R/filterPeaks-functions.R	`100.00% <100.00%> (ø)`
R/merge-functions.R	`100.00% <100.00%> (ø)`
R/sparseMatrix-functions.R	`100.00% <100.00%> (ø)`
src/init.c

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b1ca2ff...aa59b28. Read the comment docs.

sgibb · 2021-10-21T07:47:42Z

Wow, what a great PR!

Since you make a distinction between 0 and NA, I've used a simple hack to drop the NAs and keep the 0.
Basically, the 0s are transformed into .Machine$double.xmin and the NAs are set to 0 before making the sparse matrix

Could you please explain why this is necessary? NA means missing (which should be automatically handled by the sparse matrix, because there is no value), and zero means zero (could happen after baseline correction or transformation; should be a zero in the sparse matrix). When you replace all NA by zero and 0 by .Machine$double.xmin isn't the sparse matrix completely dense (and require 3x times the memory of a matrix)?

In MSnbase we often have problems with Rcpp (mostly different complied versions). While it is a great package I don't want to add this dependency if it is not really necessary. So I would maybe replace the .colMedians with apply(x, 2, median) instead (would be slower but IMHO it isn't heavily used by the users anyway) making maintaining easier.

paoloinglese · 2021-10-21T08:45:30Z

Wow, what a great PR!

Thanks! :)

Since you make a distinction between 0 and NA, I've used a simple hack to drop the NAs and keep the 0.
Basically, the 0s are transformed into .Machine$double.xmin and the NAs are set to 0 before making the sparse matrix

Could you please explain why this is necessary? NA means missing (which should be automatically handled by the sparse matrix, because there is no value), and zero means zero (could happen after baseline correction or transformation; should be a zero in the sparse matrix). When you replace all NA by zero and 0 by .Machine$double.xmin isn't the sparse matrix completely dense (and require 3x times the memory of a matrix)?

In sparseMatrix, NAs occupy memory unfortunately. Otherwise, it would be straightforward to implement it.

Basically, the idea is to convert the zeros to .Machine$double.xmin and treat the default values of the sparseMatrix as missing. In this way, you preserve the true zeros, and the NAs don't occupy the memory anymore. The tricky part is taking care of this when you do algebra with this matrix, because you want to get the right results (the zeros of the sparse matrix should behave like NAs). For this reason, I wrote the C++ versions colMeans and colMedians. Basically, these two give you the correct result, treating the zeros as NAs.
When you define the matrix, you have by default NAs, and you fill the elements with some values. In this case, you just need to convert the values equal to zero to .Machine$double.xmin and fill a sparse matrix. The zeros will take the role of the old NAs.

In MSnbase we often have problems with Rcpp (mostly different complied versions). While it is a great package I don't want to add this dependency if it is not really necessary. So I would maybe replace the .colMedians with apply(x, 2, median) instead (would be slower but IMHO it isn't heavily used by the users anyway) making maintaining easier.

That's easily solved by writing the C versions. Unfortunately, apply wouldn't work because it treats the zeros as zeros while they need to be treated as NAs.

sgibb · 2021-10-21T11:25:56Z

In sparseMatrix, NAs occupy memory unfortunately. Otherwise, it would be straightforward to implement it.

Thanks for clarification. I wasn't aware of this detail of the Matrix implementation. I am wondering if we need to reimplement our own SparseMatrix or just treat NA and 0 (which isn't present very often) as missing.
I have to think about it a while. Finally the intensityMatrix should have a format that everybody can work with as expected (empty cells should be missing (NA) or imputed by spectra information (in that case it would be a dense matrix anyway).

paoloinglese · 2021-10-21T12:04:51Z

In sparseMatrix, NAs occupy memory unfortunately. Otherwise, it would be straightforward to implement it.

Thanks for clarification. I wasn't aware of this detail of the Matrix implementation. I am wondering if we need to reimplement our own SparseMatrix or just treat NA and 0 (which isn't present very often) as missing. I have to think about it a while. Finally the intensityMatrix should have a format that everybody can work with as expected (empty cells should be missing (NA) or imputed by spectra information (in that case it would be a dense matrix anyway).

Another completely different route is using something like bigmemory.

using sparse matrix

0b59f34

sgibb self-assigned this Oct 20, 2021

sgibb self-requested a review October 20, 2021 09:27

sgibb added the enhancement label Oct 20, 2021

sgibb added this to the 2.0 milestone Oct 20, 2021

sgibb requested changes Oct 20, 2021

View reviewed changes

Added Matrix dep

d2aef0c

sgibb linked an issue Oct 20, 2021 that may be closed by this pull request

Massive memory usage while using mergeMassPeaks and FilterPeaks #64

Open

support to sparse matrices

aa59b28

Added support to sparse matrices and tests.

sgibb mentioned this pull request Nov 8, 2021

Reduced memory requirement for filterPeaks; closes #71 #72

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

using sparse matrix #70

using sparse matrix #70

paoloinglese commented Oct 16, 2021

sgibb commented Oct 20, 2021

sgibb Oct 20, 2021

paoloinglese Oct 20, 2021

paoloinglese Oct 20, 2021

paoloinglese commented Oct 20, 2021

paoloinglese commented Oct 21, 2021 •

edited

codecov bot commented Oct 21, 2021

sgibb commented Oct 21, 2021

paoloinglese commented Oct 21, 2021 •

edited

sgibb commented Oct 21, 2021

paoloinglese commented Oct 21, 2021

using sparse matrix #70

Are you sure you want to change the base?

using sparse matrix #70

Conversation

paoloinglese commented Oct 16, 2021

sgibb commented Oct 20, 2021

sgibb Oct 20, 2021

Choose a reason for hiding this comment

paoloinglese Oct 20, 2021

Choose a reason for hiding this comment

paoloinglese Oct 20, 2021

Choose a reason for hiding this comment

paoloinglese commented Oct 20, 2021

paoloinglese commented Oct 21, 2021 • edited

codecov bot commented Oct 21, 2021

Codecov Report

sgibb commented Oct 21, 2021

paoloinglese commented Oct 21, 2021 • edited

sgibb commented Oct 21, 2021

paoloinglese commented Oct 21, 2021

paoloinglese commented Oct 21, 2021 •

edited

paoloinglese commented Oct 21, 2021 •

edited