Memory issues and (potential) chunk-wise processing of peaks data #304

Closed
jorainer opened this issue Nov 15, 2023 · 8 comments
Labels: documentation, enhancement (New feature or request)

@jorainer

Using a MsBackendSql for a very large MS experiment we ran into memory issues while processing the data (see issue #303). All operations working on peaks data use the internal .peaksapply function, which by default splits the processing based on dataStorage(x) and performs the operations in parallel (even if processing might only mean loading the data from the raw data files). For in-memory backends and e.g. the MsBackendSql, dataStorage has only a single value, thus no splitting/parallel processing is performed.
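
A minimal conceptual sketch of this splitting strategy (not the actual .peaksapply implementation; the helper function below and the use of BiocParallel's bplapply are illustrative only):

## Conceptual sketch only - not the actual .peaksapply implementation.
## Spectra indices are split by their dataStorage value (one group per raw
## data file for e.g. MsBackendMzR) and each group is processed separately,
## optionally in parallel via BiocParallel.
library(Spectra)
library(BiocParallel)

peak_counts_by_file <- function(x) {             # hypothetical helper
    idx <- split(seq_along(x), dataStorage(x))   # one index group per file
    res <- bplapply(idx, function(i) lengths(x[i]), BPPARAM = bpparam())
    unsplit(res, dataStorage(x))                 # restore the original order
}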

Advantages of splitting and parallel processing

  • performance: for computationally intensive processing steps.
  • lower memory demand: only the data currently being processed would be loaded into memory.

These advantages matter mostly for very large data sets. The downside of splitting and parallel processing is, however, the overhead of splitting and combining the data. Thus, in many cases (e.g. also for in-memory backends) it might actually be better not to do it.

Maybe it would make sense to add a parameter, or let the user choose, to enable parallel/split processing even if a backend other than MsBackendMzR is used.

@jorainer

A possibility would be to add a dedicated parameter for this to each of the user-facing functions: mz, intensity, peaksData, bin, ionCount, lengths etc.

A parameter chunkSize = integer() could be used. If length(chunkSize) is greater than zero, a factor could be created and passed to parameter f of the internal .peaksapply, thus forcing split (and eventually parallel) processing. With the default chunkSize = integer(), per-file parallel processing would be performed for Spectra with different levels of dataStorage (as is currently the default).
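
A sketch of how such a chunk factor could be derived (the helper name is hypothetical and not part of Spectra):

## Hypothetical helper deriving a chunk factor from a chunkSize parameter,
## suitable to be passed to parameter `f` of the internal .peaksapply.
.chunk_factor <- function(n, chunkSize = integer()) {
    if (length(chunkSize))
        as.factor(ceiling(seq_len(n) / chunkSize))  # chunks of `chunkSize` spectra
    else NULL                                       # default: keep current behaviour
}

## 10 spectra, chunkSize 4: chunks of size 4, 4 and 2
table(.chunk_factor(10, 4))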

@jorainer

An example of the high memory requirement: determine the number of peaks per spectrum of a Spectra using a MsBackendSql backend, limiting to 500,000 spectra to avoid running out of memory:

> peakRAM(tmp <- lengths(sps[1:500000]))
               Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1 tmp<-lengths(sps[1:5e+05])           63.094                6.4           32650.3

Thus, to complete this simple operation about 32 GB of memory are required, because the full peaks data is first loaded into memory before the number of peaks is determined.
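
For comparison, a manual chunk-wise version of the same operation would keep only one chunk of peaks data in memory at a time (a sketch; the chunk size of 100,000 is an arbitrary choice):

## Manual chunk-wise calculation of the number of peaks per spectrum; only the
## peaks data of the current chunk of 100,000 spectra is loaded at any time.
idx <- split(seq_len(500000), ceiling(seq_len(500000) / 100000))
tmp <- unlist(lapply(idx, function(i) lengths(sps[i])), use.names = FALSE)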

@jorainer

After some tests and evaluation of the code: passing parameters to the individual functions might not be the ideal solution (also because some functions, such as lengths, don't accept additional parameters).

An alternative would be to have a global setting processingChunkSize for a Spectra that allows the user to set a chunk size for any peaks data access function:

processingChunkSize(sps) <- 10000

Any subsequent operation that calculates something on the peaks data will then load and process the data in chunks of 10,000 spectra. We just have to ensure that functions like combineSpectra, for which the user defines a factor to split and process the data, ignore this global setting.
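
For illustration, a usage sketch continuing the example above:

processingChunkSize(sps) <- 10000
tic <- ionCount(sps)    # peaks data is loaded and processed 10,000 spectra at a time
npeaks <- lengths(sps)  # the same chunk-wise processing applies here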

jorainer self-assigned this Nov 17, 2023
@jorainer

For some functions the chunk-wise processing needs to be disabled:

  • combineSpectra: the Spectra needs to be split by the factor provided by the user, because these sets of spectra need to be combined. Thus, any chunk size setting should be ignored. Note that the function has a parameter p that allows defining how to split the data for parallel processing (and hence also processes the data in chunks).
  • compareSpectra: uses a different (internal) chunk processing. We need to compare all spectra against all, so independent per-chunk processing as proposed here would (likely) not work; see the sketch below.
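
A rough illustration of why independent per-chunk processing does not fit an all-against-all comparison: each block of spectra still has to be compared against the full data set, so at most one dimension of the pairwise matrix can be chunked (purely illustrative; compareSpectra's internal chunking may work differently):

## Row-block-wise all-against-all comparison: every block of rows is compared
## against the *full* Spectra, hence both dimensions can not be chunked
## independently. Illustrative only; compareSpectra's internals may differ.
n <- length(sps)
row_chunks <- split(seq_len(n), ceiling(seq_len(n) / 1000))
sim <- do.call(rbind, lapply(row_chunks,
                             function(i) compareSpectra(sps[i], sps)))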

jorainer added a commit that referenced this issue Nov 21, 2023
- Refactor the code to decide how to split `Spectra` for parallel processing:
  it's no longer done automatically by `dataStorage`.
- Add a slot to `Spectra` allowing to set a processing chunk size. Issue #304
- With chunk-wise processing, only the data of one chunk is realized into memory
  in each iteration. This also enables processing the data in parallel.
jorainer added a commit that referenced this issue Nov 22, 2023
- Add and modify functions to enable default chunk-wise processing of peaks
  data (issue #304).
- Split the documentation for chunk-wise (parallel) processing into a separate
  documentation entry.
@jorainer

This is currently being implemented in the processingChunkSize branch.

@jorainer

It seems impossible to check/benchmark the memory usage properly from within R (at least I did not find any option). So, I'll perform some more tests locally and check if/when I run out of memory and how to prevent that.

@jorainer

Evaluation and performance comparison of chunk-wise processing

Here we evaluate how chunk-wise processing performs on a large data set (300 samples, 516,299 spectra). The memory estimates might not be fully accurate; still, we report run time and (approximate/estimated) memory usage as measured with peakRAM. The code was run (within a Docker container) on a computer with 64 GB of main memory (plus an additional 64 GB of swap). As backend, the MsBackendOfflineSql was used, with the MS data stored in an SQLite database (file).
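
The construction of the Spectra object is not shown here; a hypothetical setup could look like the following (the database file name and the connection parameters drv/dbname are assumptions, see the MsBackendSql package documentation for the actual arguments):

## Hypothetical setup; the file name and connection parameters are assumptions.
library(Spectra)
library(MsBackendSql)
library(RSQLite)

be <- backendInitialize(MsBackendOfflineSql(), drv = SQLite(),
                        dbname = "ms_experiment.sqlite")
a <- Spectra(be)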

Performance is tested on two Spectra objects, a and b: the former uses a MsBackendOfflineSql without chunk-wise processing, the latter a MsBackendOfflineSql with processingChunkSize set to 100,000:

> length(a)
[1] 516299
> b <- a
> processingChunkSize(b) <- 100000

Next we filter the data, removing peaks with an intensity below 300 - this operation is only added to the processing queue (no peaks data is loaded yet).

> library(peakRAM)
> peakRAM(
+ a <- filterIntensity(a, intensity = 300),
+ b <- filterIntensity(b, intensity = 300)
+ )
                        Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1 a<-filterIntensity(a,intensity=300)            0.127                  0               0.1
2 b<-filterIntensity(b,intensity=300)            0.062                  0               0.1

Next we bin the data. This retrieves the range of the m/z values from the full data set and then adds the binning function to the processing queue.

> peakRAM(
+ a <- bin(a, binSize = 0.1),
+ b <- bin(b, binSize = 0.2)
+ )
          Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1 a<-bin(a,binSize=0.1)          507.889                4.1           58706.8
2 b<-bin(b,binSize=0.2)          301.128                0.8           35783.2

Here, the Spectra with chunk-wise processing performed the step faster and (if we can indeed trust the memory estimates) with lower memory usage.

Finally, we calculate the total ion count for each spectrum. This call requires loading the peaks data and executing the processing queue.

> peakRAM(
+ ionCount(a),
+ ionCount(b)
+ )
   Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1 ionCount(a)          552.087                3.8           55898.9
2 ionCount(b)          366.852                3.9           26741.5

Chunk-wise processing thus reduced both run time and memory demand. As a side effect, chunk-wise processing also enables parallel processing, as a set of chunks can be processed in parallel.

@jorainer commented Dec 7, 2023

This is now implemented and activated by default. Closing issue.

jorainer closed this as completed Dec 7, 2023