Enable chunk-wise processing for all peaks data functions #306

Merged: 10 commits merged into main on Nov 30, 2023

Conversation

jorainer
Member

This PR fixes issue #304. In brief, it adds the possibility for the user to enable and configure chunk-wise processing of a Spectra object. This affects all functions working on peaks data (e.g. lengths, mz, peaksData) and ensures that even large-scale data can be handled, reducing out-of-memory errors.

What this PR adds:

  • Spectra gains a new slot @processingChunkSize.
  • New functions processingChunkSize and processingChunkSize<- get or set the size of the chunks for chunk-wise processing. The default is Inf, hence no chunk-wise processing is performed (important e.g. for small data sets or in-memory backends).
  • A new backendParallelFactor,MsBackend method allows backends to suggest a preferred splitting of the data into chunks. The default is to return factor() (i.e. no preferred splitting); MsBackendMzR, on the other hand, returns a factor based on the "dataStorage" spectra variable (hence suggesting to split by original data file).
  • The internal peaksapply function uses the chunks defined through processingChunkSize or, if that is not set, the splitting suggested by the backend (through backendParallelFactor).
  • The user can check whether and how the Spectra will be split using the processingChunkFactor function: it returns a factor representing the chunks (defined through processingChunkSize) or, if that is not set, the splitting suggested by the backend (through backendParallelFactor), or factor(), in which case no chunk-wise processing is performed.

This processing is used for all Spectra methods that access (or process) peaks data. To avoid performance loss for small data sets or in-memory backends it is not performed by default; if enabled by the user, it also allows processing of very large data sets.
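To make the user-facing side concrete, here is a minimal usage sketch (the mzML file names are placeholders; processingChunkSize(), processingChunkSize<-() and processingChunkFactor() are the functions introduced by this PR):

```r
## Minimal usage sketch; the mzML file names are placeholders.
library(Spectra)

fls <- c("sample_01.mzML", "sample_02.mzML")
sps <- Spectra(fls, source = MsBackendMzR())

processingChunkSize(sps)    # Inf: no chunk size configured (the default)
processingChunkFactor(sps)  # splitting suggested by the MsBackendMzR backend

## Enable chunk-wise processing in chunks of 5000 spectra; all peaks data
## functions (lengths(), mz(), peaksData(), ...) will then load and process
## the data chunk by chunk.
processingChunkSize(sps) <- 5000
mzs <- mz(sps)
```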

I think this is a very important improvement, allowing the analysis of large (on-disk) data, for which we ran into unexpected issues (see #304).

Happy to discuss, @sgibb @lgatto @philouail.

- Refactor the code that decides how to split a `Spectra` for parallel processing:
  splitting is no longer done automatically by `dataStorage`.
- Add a slot to `Spectra` that allows setting a processing chunk size (issue #304).
- With chunk-wise processing only the data of one chunk is realized in memory in
  each iteration. This also enables processing the data in parallel (see the
  sketch after this list).
- Add and modify functions to enable default chunk-wise processing of peaks
  data (issue #304).
- Split the documentation for chunk-wise (parallel) processing into a separate
  documentation entry.
- Use `processingChunkFactor` instead of `dataStorage` in functions to define
  splitting and processing.
- Remove unnecessary functions.
- Add a vignette on parallel/chunk-wise processing.
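As a purely conceptual illustration of the chunk-wise iteration these commits implement (an assumed helper, not the package's internal peaksapply()): the spectra are split by the chunk factor and only one chunk is subset and processed per iteration, optionally in parallel via BiocParallel.

```r
## Conceptual sketch only; not the package's internal peaksapply().
library(Spectra)
library(BiocParallel)

chunk_apply <- function(sps, FUN, f = processingChunkFactor(sps),
                        BPPARAM = SerialParam()) {
    if (!length(f))                      # factor(): no chunk-wise processing
        return(FUN(sps))
    idx <- split(seq_along(sps), f)      # one set of indices per chunk
    res <- bplapply(idx, function(i) FUN(sps[i]), BPPARAM = BPPARAM)
    unsplit(res, f)                      # restore the original spectra order
}

## Example: number of peaks per spectrum, calculated one chunk at a time.
## npeaks <- chunk_apply(sps, function(s) vapply(peaksData(s), nrow, integer(1L)))
```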
@jorainer
Member Author

@philouail, I've fixed some more things, can you please take another careful look? Any questions, concerns, comments or change requests are highly welcome!

I've now also added a vignette describing the parallel processing settings. Please have a look at that too.

@philouail (Collaborator) left a comment

This looks super good to me.
The processingChunkSize and processingChunkFactor descriptions are really, really good.
The vignette with the tips for large data sets is super good and straight to the point; I like it.

I made very few comments, and there was one thing I was confused about.

Review comments on R/Spectra-functions.R, R/Spectra.R and vignettes/Spectra-large-scale.Rmd (resolved).
jorainer marked this pull request as ready for review on November 24, 2023, 14:14
@andreavicini (Collaborator) left a comment

Seems very good to me!

```diff
@@ -41,7 +41,9 @@ This vignette provides general examples and descriptions for the *Spectra*
 package. Additional information and tutorials are available, such as
 [SpectraTutorials](https://jorainer.github.io/SpectraTutorials/),
 [MetaboAnnotationTutorials](https://jorainer.github.io/MetaboAnnotationTutorials),
-or also in [@rainer_modular_2022].
+or also in [@rainer_modular_2022]. For information how to handle and (parallel)
```
Collaborator

"For information on how" maybe

Member Author

thanks!

@jorainer
Member Author

Thanks for the reviews @andreavicini and @philouail! I will merge now, after taking another look myself.

jorainer merged commit 9ac911b into main on Nov 30, 2023
6 checks passed
@sgibb
Member

sgibb commented Nov 30, 2023

@jorainer sorry, I didn't review the code, but a small suggestion anyway: adding a new slot to the Spectra class breaks backward compatibility, so I would suggest incrementing the "version" of the class and bumping the minor number of the Version field in the DESCRIPTION file.

@jorainer
Member Author

Hi Sebastian @sgibb, thanks for the suggestion. I'll increment the class version. I ensured backward compatibility through the accessor function, which checks whether the object has the slot and, if not, returns Inf (the default slot value). Also, Spectra methods will (automatically) call updateObject if required. So backward compatibility should be guaranteed.
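For reference, a sketch of the accessor pattern described above (assumed code; the actual implementation in the package may differ):

```r
## Backward-compatibility sketch: old serialized Spectra objects that lack
## the new slot still return the default Inf.
processingChunkSize <- function(x) {
    if (methods::.hasSlot(x, "processingChunkSize"))
        x@processingChunkSize
    else Inf    # default for objects created before the slot was added
}
```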

I would, however, maybe not bump the minor version of the package, to avoid interfering with the Bioconductor versioning?
