ENH: canonicalize dataset preparation (further) such that datasets can always be in-memory / memory mapped #668

NickleDave · 2023-06-07T16:44:06Z

As discussed in #667, the current dataset classes conflate the underlying source data (e.g. audio files, array files with spectrograms, annotation files).

This also results in hard-to-reason-about logic inside the dataset classes, and in data-loading logic that is tightly coupled to specific tasks without it being obvious why.

Again, as discussed in #667, one way around this is to make dataset classes that are task-specific, and then make sure that the data is as close as possible to the final form needed for that task.

We should canonicalize the file types that a dataset can consist of.
Basically, once prepared, every file in the dataset should be one of two things:

- For array data, such as audio and spectrograms, all data should be in array files that can be memory-mapped, e.g. a zarr store. For the frame classification task, this should also include pre-computed labeled timebin vectors, so that we are not computing these on the fly for every batch.
- For other data types, such as annotations, the data should be in some lightweight flat format, like a csv or json file, that can be loaded into memory all at once.
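To make the two cases concrete, here is a rough sketch of what the prepared files could look like. It is not the actual implementation: it swaps in plain NumPy `.npy` files (loaded with `mmap_mode="r"`) in place of a zarr store, and the function names, file names, and CSV columns (`onset_s`, `offset_s`, `label`) are all hypothetical, chosen just for illustration.

```python
import csv
from pathlib import Path

import numpy as np


def label_timebins(onsets_s, offsets_s, labels, timebin_dur, n_timebins, background=0):
    """Compute one integer label per time bin, once, at prep time.

    Hypothetical helper: a bin gets a segment's label when the bin's center
    falls in [onset, offset); all other bins get ``background``.
    """
    frame_labels = np.full(n_timebins, background, dtype=np.int64)
    bin_centers = (np.arange(n_timebins) + 0.5) * timebin_dur
    for onset, offset, label in zip(onsets_s, offsets_s, labels):
        frame_labels[(bin_centers >= onset) & (bin_centers < offset)] = label
    return frame_labels


def save_prepared(dst, spect, frame_labels, annot_rows):
    """Write array data as .npy files and annotations as a flat CSV file."""
    dst = Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    # array data: files that can later be memory-mapped
    np.save(dst / "spect.npy", spect)
    np.save(dst / "frame_labels.npy", frame_labels)
    # other data: lightweight flat format, small enough to load all at once
    with open(dst / "annot.csv", "w", newline="") as fp:
        writer = csv.DictWriter(fp, fieldnames=["onset_s", "offset_s", "label"])
        writer.writeheader()
        writer.writerows(annot_rows)


def load_prepared(src):
    """Memory-map the arrays and read the annotations fully into memory."""
    src = Path(src)
    spect = np.load(src / "spect.npy", mmap_mode="r")
    frame_labels = np.load(src / "frame_labels.npy", mmap_mode="r")
    with open(src / "annot.csv", newline="") as fp:
        annot_rows = list(csv.DictReader(fp))  # values come back as strings
    return spect, frame_labels, annot_rows
```

With something like this, a dataset class would only index into the memory-mapped arrays to build a batch, with no per-batch label computation.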

This will require us to make a .zarr store for every split of the dataset at the end of the prep stage. For now, I favor just doing this post hoc to get an MVP in, although there is probably room for refactoring later.
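A minimal sketch of what that post hoc step might look like, under some assumptions: it writes one memory-mappable `.npy` file per split using `numpy.lib.format.open_memmap` as a stand-in for a per-split .zarr store, it assumes each source array is 2-D with shape (frequency bins, time bins), and `write_split_stores` is a hypothetical name, not an existing function.

```python
from pathlib import Path

import numpy as np


def write_split_stores(dst, arrays, splits):
    """Concatenate each split's arrays along the time axis into one file.

    arrays: list of 2-D arrays, each (n_freqbins, n_timebins_i)
    splits: list of split names (e.g. "train", "val"), one per array
    Returns a dict mapping split name to the path of its .npy file.
    """
    dst = Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    paths = {}
    for split in sorted(set(splits)):
        members = [a for a, s in zip(arrays, splits) if s == split]
        n_freq = members[0].shape[0]
        n_time = sum(a.shape[1] for a in members)
        path = dst / f"{split}.npy"
        # create the output file on disk and fill it piece by piece,
        # so the whole split never has to sit in memory at once
        out = np.lib.format.open_memmap(
            path, mode="w+", dtype=members[0].dtype, shape=(n_freq, n_time)
        )
        start = 0
        for a in members:
            out[:, start:start + a.shape[1]] = a
            start += a.shape[1]
        out.flush()
        del out  # release the memory map before moving on
        paths[split] = path
    return paths
```

Running this once at the end of prep keeps the existing prep logic untouched, which fits the "get an MVP in first" approach; a later refactor could write into the per-split store directly.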

This was referenced Jun 8, 2023

ENH: Add task-specific dataset prep / classes #667

Closed

Refactor frame classification models to use single WindowedFramesDatapipe #574

Closed
