Memory issues and (potential) chunk-wise processing of peaks data #304

Closed
jorainer opened this issue Nov 15, 2023 · 8 comments
Labels: documentation, enhancement (New feature or request)

@jorainer

Using a MsBackendSql for a very large MS experiment we ran into memory issues while processing the data (see issue #303). All operations working on peaks data use the internal .peaksapply function, which by default splits the processing based on dataStorage(x) and performs the operations in parallel (even if processing might only mean loading the data from the raw data files). For in-memory backends and e.g. the MsBackendSql, dataStorage has only a single value, thus no splitting/parallel processing is performed.
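
A minimal conceptual sketch of this splitting strategy (not the actual .peaksapply implementation; the helper function below and the use of BiocParallel's bplapply are illustrative only):

## Conceptual sketch only - not the actual .peaksapply implementation.
## Spectra indices are split by their dataStorage value (one group per raw
## data file for e.g. MsBackendMzR) and each group is processed separately,
## optionally in parallel via BiocParallel.
library(Spectra)
library(BiocParallel)

peak_counts_by_file <- function(x) {             # hypothetical helper
    idx <- split(seq_along(x), dataStorage(x))   # one index group per file
    res <- bplapply(idx, function(i) lengths(x[i]), BPPARAM = bpparam())
    unsplit(res, dataStorage(x))                 # restore the original order
}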

Advantages of splitting and parallel processing

  • performance: for computationally intensive processing steps.
  • lower memory demand: only the data currently being processed would be loaded into memory.

These advantages matter mostly for very large data sets. The downside of splitting and parallel processing is, however, the overhead of splitting and combining the data. Thus, in many cases (e.g. also for in-memory backends) it might actually be better not to do it.

Maybe it would make sense to add a parameter, or let the user choose, to enable parallel/split processing even if a backend other than MsBackendMzR is used.

@jorainer

A possibility would be to add a dedicated parameter for this to each of the user-facing functions: mz, intensity, peaksData, bin, ionCount, lengths etc.

A parameter chunkSize = integer() could be used. If length(chunkSize) is greater than zero, a factor could be created and passed to parameter f of the internal .peaksapply, thus forcing split (and eventually parallel) processing. With the default chunkSize = integer(), per-file parallel processing would be performed for Spectra with different levels of dataStorage (as is currently the default).
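
A sketch of how such a chunk factor could be derived (the helper name is hypothetical and not part of Spectra):

## Hypothetical helper deriving a chunk factor from a chunkSize parameter,
## suitable to be passed to parameter `f` of the internal .peaksapply.
.chunk_factor <- function(n, chunkSize = integer()) {
    if (length(chunkSize))
        as.factor(ceiling(seq_len(n) / chunkSize))  # chunks of `chunkSize` spectra
    else NULL                                       # default: keep current behaviour
}

## 10 spectra, chunkSize 4: chunks of size 4, 4 and 2
table(.chunk_factor(10, 4))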

@jorainer

An example of the high memory requirement: determine the number of peaks per spectrum of a Spectra using a MsBackendSql backend, limiting to 500,000 spectra to avoid running out of memory:

> peakRAM(tmp <- lengths(sps[1:500000]))
               Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1 tmp<-lengths(sps[1:5e+05])           63.094                6.4           32650.3

Thus, to complete this simple operation about 32 GB of memory are required, because the full peaks data is first loaded into memory before the number of peaks is determined.
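
For comparison, a manual chunk-wise version of the same operation would keep only one chunk of peaks data in memory at a time (a sketch; the chunk size of 100,000 is an arbitrary choice):

## Manual chunk-wise calculation of the number of peaks per spectrum; only the
## peaks data of the current chunk of 100,000 spectra is loaded at any time.
idx <- split(seq_len(500000), ceiling(seq_len(500000) / 100000))
tmp <- unlist(lapply(idx, function(i) lengths(sps[i])), use.names = FALSE)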

@jorainer

After some tests and evaluation of the code: passing parameters to the individual functions might not be the ideal solution (also because some functions, such as lengths, don't accept additional parameters).

An alternative would be to have a global setting processingChunkSize for a Spectra that allows the user to set a chunk size for any peaks data access function:

processingChunkSize(sps) <- 10000

Any subsequent operation that calculates something on the peaks data will then load and process the data in chunks of 10,000 spectra. We just have to ensure that functions like combineSpectra, for which the user defines a factor to split and process the data, ignore this global setting.
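
For illustration, a usage sketch continuing the example above:

processingChunkSize(sps) <- 10000
tic <- ionCount(sps)    # peaks data is loaded and processed 10,000 spectra at a time
npeaks <- lengths(sps)  # the same chunk-wise processing applies here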

jorainer self-assigned this Nov 17, 2023
@jorainer

For some functions the chunk-wise processing needs to be disabled:

  • combineSpectra: the Spectra needs to be split by the factor provided by the user, because these sets of spectra need to be combined. Thus, any chunk size setting should be ignored. Note that the function has a parameter p that allows defining how to split the data for parallel processing (and hence also processes the data in chunks).
  • compareSpectra: uses a different (internal) chunk processing. We need to compare all spectra against all, so independent per-chunk processing as proposed here would (likely) not work; see the sketch below.
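
A rough illustration of why independent per-chunk processing does not fit an all-against-all comparison: each block of spectra still has to be compared against the full data set, so at most one dimension of the pairwise matrix can be chunked (purely illustrative; compareSpectra's internal chunking may work differently):

## Row-block-wise all-against-all comparison: every block of rows is compared
## against the *full* Spectra, hence both dimensions can not be chunked
## independently. Illustrative only; compareSpectra's internals may differ.
n <- length(sps)
row_chunks <- split(seq_len(n), ceiling(seq_len(n) / 1000))
sim <- do.call(rbind, lapply(row_chunks,
                             function(i) compareSpectra(sps[i], sps)))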

jorainer added a commit that referenced this issue Nov 21, 2023
- Refactor the code to decide how to split `Spectra` for parallel processing:
  it's no longer done automatically by `dataStorage`.
- Add a slot to `Spectra` allowing to set a processing chunk size. Issue #304
- With chunk-wise processing, only the data of one chunk is realized into memory
  in each iteration. This also enables processing the data in parallel.
jorainer added a commit that referenced this issue Nov 22, 2023
- Add and modify functions to enable default chunk-wise processing of peaks
  data (issue #304).
- Split the documentation for chunk-wise (parallel) processing into a separate
  documentation entry.
@jorainer

This is currently being implemented in the processingChunkSize branch.

@jorainer

It seems impossible to check/benchmark the memory usage properly from within R (at least I did not find any option). So, I'll perform some more tests locally and check if/when I run out of memory and how to prevent that.

@jorainer

Evaluation and performance comparison of chunk-wise processing

Here we evaluate how chunk-wise processing performs on a large data set (300 samples, 516,299 spectra). The memory estimates might not be fully accurate; still, we report run time and (approximate/estimated) memory usage as measured with peakRAM. The code was run (within a Docker container) on a computer with 64 GB of main memory (plus an additional 64 GB of swap). As backend, the MsBackendOfflineSql was used, with the MS data stored in an SQLite database (file).
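
The construction of the Spectra object is not shown here; a hypothetical setup could look like the following (the database file name and the connection parameters drv/dbname are assumptions, see the MsBackendSql package documentation for the actual arguments):

## Hypothetical setup; the file name and connection parameters are assumptions.
library(Spectra)
library(MsBackendSql)
library(RSQLite)

be <- backendInitialize(MsBackendOfflineSql(), drv = SQLite(),
                        dbname = "ms_experiment.sqlite")
a <- Spectra(be)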

Performance is tested on two Spectra objects, a and b: the former uses a MsBackendOfflineSql without chunk-wise processing, the latter a MsBackendOfflineSql with processingChunkSize set to 100,000:

> length(a)
[1] 516299
> b <- a
> processingChunkSize(b) <- 100000

Next we filter the data, removing peaks with an intensity below 300 - this operation is only added to the processing queue (no peaks data is loaded yet).

> library(peakRAM)
> peakRAM(
+ a <- filterIntensity(a, intensity = 300),
+ b <- filterIntensity(b, intensity = 300)
+ )
                        Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1 a<-filterIntensity(a,intensity=300)            0.127                  0               0.1
2 b<-filterIntensity(b,intensity=300)            0.062                  0               0.1

Next we bin the data. This retrieves the range of the m/z values from the full data set and then adds the binning function to the processing queue.

> peakRAM(
+ a <- bin(a, binSize = 0.1),
+ b <- bin(b, binSize = 0.2)
+ )
          Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1 a<-bin(a,binSize=0.1)          507.889                4.1           58706.8
2 b<-bin(b,binSize=0.2)          301.128                0.8           35783.2

Here, the Spectra with chunk-wise processing performed the step faster and (if we can indeed trust the memory estimates) with lower memory usage.

Finally, we calculate the total ion count for each spectrum. This call requires loading the peaks data and executing the processing queue.

> peakRAM(
+ ionCount(a),
+ ionCount(b)
+ )
   Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1 ionCount(a)          552.087                3.8           55898.9
2 ionCount(b)          366.852                3.9           26741.5

Chunk-wise processing thus reduced both run time and memory demand. As a side effect, chunk-wise processing also enables parallel processing, as a set of chunks can be processed in parallel.

@jorainer commented Dec 7, 2023

This is now implemented and activated by default. Closing issue.

jorainer closed this as completed Dec 7, 2023