ENH: canonicalize dataset preparation (further) such that datasets can always be in-memory / memory mapped #668

Open
3 tasks
NickleDave opened this issue Jun 7, 2023 · 0 comments

NickleDave commented Jun 7, 2023

As discussed in #667, the current dataset classes conflate the different types of underlying source data (e.g., audio files, array files with spectrograms, annotation files).

This also makes the logic inside the dataset classes hard to reason about, and it leaves the data loading logic tightly coupled to specific tasks without it being obvious why.

Again, as discussed in #667, one way around this is to make dataset classes that are task-specific, and then make sure that the data is as close as possible to the final form needed for the specified task.
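To make the idea concrete, here is a minimal sketch of what a task-specific dataset class could look like when the data is already in its final form. Everything here is an assumption for illustration: the class name `FrameClassificationWindows`, the file names, and the use of NumPy `.npy` files opened with `mmap_mode="r"` instead of whatever array format we end up choosing.

```python
import tempfile
from pathlib import Path

import numpy as np


class FrameClassificationWindows:
    """Hypothetical task-specific dataset: fixed-size windows of frames and labels.

    Both arrays are opened memory-mapped, so indexing is just a slice
    plus a small copy; nothing is computed on the fly.
    """

    def __init__(self, frames_path, frame_labels_path, window_size):
        # frames: (n_freqbins, n_timebins); frame_labels: (n_timebins,)
        self.frames = np.load(frames_path, mmap_mode="r")
        self.frame_labels = np.load(frame_labels_path, mmap_mode="r")
        self.window_size = window_size

    def __len__(self):
        # number of valid window start positions
        return self.frames.shape[-1] - self.window_size + 1

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        window = slice(idx, idx + self.window_size)
        # np.array copies only this window into memory
        return np.array(self.frames[:, window]), np.array(self.frame_labels[window])


# Stand-in for the output of a prep step: a random "spectrogram" and labels.
tmp_dir = Path(tempfile.mkdtemp())
rng = np.random.default_rng(0)
np.save(tmp_dir / "frames.npy", rng.random((4, 20)))
np.save(tmp_dir / "frame_labels.npy", rng.integers(0, 3, size=20))

dataset = FrameClassificationWindows(
    tmp_dir / "frames.npy", tmp_dir / "frame_labels.npy", window_size=8
)
x, y = dataset[5]
```

The point of the sketch is that the class knows nothing about audio files or annotation formats; it only slices arrays that the prep step already wrote.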

We should canonicalize the file types that a dataset can consist of.
Basically, once prepared, the dataset should consist of only two kinds of files:

  • for array data, such as audio and spectrograms, all data should be in array files that can be memory-mapped, e.g., a zarr store.
      • for the frame classification task, this should also include pre-computed labeled timebin vectors, so that we do not compute these on the fly for every batch
  • for other data types, such as annotations, the data should be in some lightweight flat format, like a csv or json file, that can be loaded into memory all at once
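A rough sketch of the two kinds of files, under stated assumptions: the function `label_timebins`, the csv column names, and the label map are all made up for illustration, and a NumPy `.npy` file opened with `mmap_mode="r"` stands in for a zarr store. The annotations live in a flat csv that is read into memory in one go, and the labeled timebin vector is computed once and saved so that it can be memory-mapped later.

```python
import csv
import tempfile
from pathlib import Path

import numpy as np


def label_timebins(onsets_s, offsets_s, labels, timebin_centers_s, labelmap, unlabeled=0):
    """Return one integer label per time bin, given segment annotations."""
    frame_labels = np.full(timebin_centers_s.shape, unlabeled, dtype=np.int64)
    for onset, offset, label in zip(onsets_s, offsets_s, labels):
        in_segment = (timebin_centers_s >= onset) & (timebin_centers_s < offset)
        frame_labels[in_segment] = labelmap[label]
    return frame_labels


dataset_dir = Path(tempfile.mkdtemp())

# Hypothetical annotation for one file: two segments labeled "a" and "b".
annot_rows = [
    {"onset_s": 0.010, "offset_s": 0.030, "label": "a"},
    {"onset_s": 0.050, "offset_s": 0.070, "label": "b"},
]
with open(dataset_dir / "annot.csv", "w", newline="") as fp:
    writer = csv.DictWriter(fp, fieldnames=["onset_s", "offset_s", "label"])
    writer.writeheader()
    writer.writerows(annot_rows)

# Flat annotation file: small enough to load into memory all at once.
with open(dataset_dir / "annot.csv", newline="") as fp:
    annots = list(csv.DictReader(fp))

# Eight 10 ms time bins, with centers at 5 ms, 15 ms, ..., 75 ms.
timebin_centers_s = np.arange(8) * 0.010 + 0.005
frame_labels = label_timebins(
    [float(a["onset_s"]) for a in annots],
    [float(a["offset_s"]) for a in annots],
    [a["label"] for a in annots],
    timebin_centers_s,
    labelmap={"a": 1, "b": 2},
)

# Computed once during prep, then memory-mapped whenever the dataset is used.
np.save(dataset_dir / "frame_labels.npy", frame_labels)
frame_labels_mm = np.load(dataset_dir / "frame_labels.npy", mmap_mode="r")
```

With these inputs the saved vector is `[0, 1, 1, 0, 0, 2, 2, 0]`, where `0` marks unlabeled time bins.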

This will require us to make a .zarr store for every split of the dataset at the end of the prep stage. For now, I favor just doing this post hoc to get an MVP in, although there is probably room for refactoring later.
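The post hoc step could look something like the sketch below. It is only an illustration under assumptions: `write_split_arrays` is a hypothetical helper, the row keys `"split"` and `"frames_path"` are made up, and one `.npy` file per split (written through `numpy.lib.format.open_memmap`) stands in for one .zarr store per split.

```python
import tempfile
from pathlib import Path

import numpy as np
from numpy.lib.format import open_memmap


def write_split_arrays(rows, dst_dir):
    """Concatenate per-file arrays along time into one ``.npy`` file per split.

    ``rows`` is a list of dicts with hypothetical keys "split" and "frames_path".
    """
    dst_dir = Path(dst_dir)
    split_paths = {}
    for split in sorted({row["split"] for row in rows}):
        paths = [row["frames_path"] for row in rows if row["split"] == split]
        # Open sources memory-mapped so only their shapes are needed up front.
        sources = [np.load(path, mmap_mode="r") for path in paths]
        n_freqbins = sources[0].shape[0]
        n_timebins = sum(src.shape[1] for src in sources)
        out_path = dst_dir / f"{split}.frames.npy"
        out = open_memmap(
            out_path, mode="w+", dtype=sources[0].dtype, shape=(n_freqbins, n_timebins)
        )
        start = 0
        for src in sources:
            stop = start + src.shape[1]
            out[:, start:stop] = src  # copy one source file at a time
            start = stop
        out.flush()
        split_paths[split] = out_path
    return split_paths


# Stand-in for what prep produces today: one array file per source file.
prep_dir = Path(tempfile.mkdtemp())
rows = []
for i, (split, n_timebins) in enumerate([("train", 5), ("train", 7), ("val", 3)]):
    path = prep_dir / f"file{i}.npy"
    np.save(path, np.full((4, n_timebins), float(i)))
    rows.append({"split": split, "frames_path": path})

split_paths = write_split_arrays(rows, prep_dir)
train = np.load(split_paths["train"], mmap_mode="r")
val = np.load(split_paths["val"], mmap_mode="r")
```

Doing it this way leaves the existing prep logic untouched and just adds one pass at the end, which is why it seems like the quickest route to an MVP.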
