As discussed in #667, the current dataset classes conflate the underlying source data (e.g. audio files, array files with spectrograms, annotation files) with the task-specific datasets built from it.
This also results in logic inside the dataset classes that is hard to reason about, and data loading logic that is tightly coupled to specific tasks without it being obvious why.
As also discussed in #667, one way around this is to make dataset classes that are task specific, and then make sure the data is as close as possible to the final form needed for the specified task.
We should canonicalize the file types that a dataset can consist of.
Basically, once prepared, the dataset should consist of two kinds of files:

- For array data, such as audio and spectrograms, all data should be in array files that can be memory-mapped, e.g. a zarr store. For the frame classification task, this should also include pre-computed labeled timebin vectors, so that we are not computing these on the fly for every batch.
- For other data types, such as annotations, the data should be in some lightweight flat format, like a csv or json file, that can be loaded into memory all at once.
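A minimal sketch of what a prepared split could look like under this layout. All file names here (`spect.npy`, `lbl_tb.npy`, `annots.json`) are hypothetical, and `np.save`/`np.load(mmap_mode=...)` stands in for the eventual zarr store; the point is just that array data is memory-mapped while annotations are a small flat file loaded whole:

```python
import json
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical prepared split directory (names are illustrative, not vak's).
split_dir = Path(tempfile.mkdtemp())

# Array data: a spectrogram plus a pre-computed labeled-timebin vector,
# saved as .npy so they can be opened with mmap_mode; a zarr store would
# serve the same mem-mappable role.
spect = np.random.rand(513, 1000).astype(np.float32)
lbl_tb = np.zeros(1000, dtype=np.int64)  # one frame label per timebin
np.save(split_dir / "spect.npy", spect)
np.save(split_dir / "lbl_tb.npy", lbl_tb)

# Annotation data: lightweight flat JSON, loaded into memory all at once.
annots = [{"onset_s": 0.5, "offset_s": 1.2, "label": "a"}]
(split_dir / "annots.json").write_text(json.dumps(annots))

# Loading: arrays are memory-mapped, so a batch only reads the slices
# it actually indexes; annotations are read in one shot.
spect_mm = np.load(split_dir / "spect.npy", mmap_mode="r")
lbl_tb_mm = np.load(split_dir / "lbl_tb.npy", mmap_mode="r")
loaded_annots = json.loads((split_dir / "annots.json").read_text())

print(spect_mm.shape, lbl_tb_mm.shape, len(loaded_annots))
# → (513, 1000) (1000,) 1
```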
This will require us to make a .zarr store for every split of the dataset at the end of the prep stage. For now, I favor doing this post hoc to get an MVP in, although there is probably room for refactoring later.
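The post-hoc step could be as simple as a function that, given the arrays for each split, writes one store per split. This is only a sketch under assumed names (`make_split_stores`, a directory of `.npy` files standing in for a `.zarr` store), not the actual prep implementation:

```python
import tempfile
from pathlib import Path

import numpy as np


def make_split_stores(splits: dict, root: Path) -> dict:
    """Write each split's arrays into its own store directory.

    `splits` maps split name -> {array name -> ndarray}. A directory of
    .npy files stands in here for the eventual .zarr store per split.
    """
    store_paths = {}
    for split_name, arrays in splits.items():
        store = root / f"{split_name}.store"
        store.mkdir(parents=True, exist_ok=True)
        for name, arr in arrays.items():
            np.save(store / f"{name}.npy", arr)
        store_paths[split_name] = store
    return store_paths


root = Path(tempfile.mkdtemp())
splits = {
    "train": {"spect": np.zeros((10, 5)), "lbl_tb": np.zeros(5, dtype=int)},
    "val": {"spect": np.ones((10, 5)), "lbl_tb": np.ones(5, dtype=int)},
}
paths = make_split_stores(splits, root)
print(sorted(p.name for p in paths.values()))
# → ['train.store', 'val.store']
```

Keeping this as a separate step after the existing prep logic (rather than rewriting prep itself) is what makes the MVP cheap, and it leaves the refactoring question open.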