
Introduce new DatasetWriter class and other related classes #61

Closed
gvanhoy opened this issue Apr 4, 2023 · 1 comment · Fixed by #67
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)


gvanhoy (Collaborator) commented Apr 4, 2023

To support parallel generation of datasets (#41, #35, #36), partial generation of datasets (#31), and storing datasets in multiple formats, it makes sense to have a class dedicated to these responsibilities.

I think the DatasetCreator class can have the single responsibility of connecting a DatasetLoader and a DatasetWriter.

The DatasetCreator can take care of seeding, naming, and handling dataset metadata.
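The issue contains no code, so here is a minimal, hypothetical sketch of how a DatasetCreator with those responsibilities might look. The constructor signature, the `create` method, and the `_ListWriter` stand-in are all assumptions for illustration; the real torchsig API may differ.

```python
import random
from typing import Any, Dict

class DatasetCreator:
    """Hypothetical sketch: wires a loader to a writer and owns
    seeding, naming, and dataset-level metadata."""

    def __init__(self, loader, writer, name: str, seed: int = 1234):
        self.loader = loader    # any iterable yielding (data, metadata) pairs
        self.writer = writer    # any object with write()/write_metadata()
        self.name = name
        self.seed = seed

    def create(self) -> None:
        random.seed(self.seed)  # reproducible generation
        self.writer.write_metadata({"name": self.name, "seed": self.seed})
        for data, meta in self.loader:
            self.writer.write(data, meta)

class _ListWriter:
    """Minimal stand-in writer that collects everything in memory,
    just to demonstrate the wiring."""

    def __init__(self):
        self.metadata: Dict[str, Any] = {}
        self.records = []

    def write_metadata(self, meta: Dict[str, Any]) -> None:
        self.metadata.update(meta)

    def write(self, data: Any, meta: Dict[str, Any]) -> None:
        self.records.append((data, meta))

# Lazy generator: samples are drawn only inside create(), after seeding.
loader = (([random.random()], {"index": i}) for i in range(3))
writer = _ListWriter()
DatasetCreator(loader, writer, name="demo", seed=0).create()
```

Keeping seeding and metadata in one place means any loader/writer pair produces a reproducible, self-describing dataset.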

The DatasetLoader is probably a Dataset wrapped by a torch.utils.data.DataLoader, but the data could come from anything that generates it, such as a Python generator.
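Since the comment allows any generator of (data, metadata) pairs, a torch-free stand-in for the DatasetLoader might look like the following. The function name, the placeholder data, and the metadata fields are invented for illustration only.

```python
from typing import Any, Dict, Iterator, List, Tuple

def synthetic_loader(num_samples: int) -> Iterator[Tuple[List[float], Dict[str, Any]]]:
    """Hypothetical generator standing in for a torch DataLoader:
    yields (data, metadata) pairs one at a time."""
    for i in range(num_samples):
        data = [float(i), float(i) * 2.0]                 # placeholder signal
        meta = {"index": i, "class_name": f"class_{i % 2}"}  # placeholder metadata
        yield data, meta

pairs = list(synthetic_loader(4))
```

Because the creator only iterates, it cannot tell whether the pairs come from a DataLoader, a generator, or a file.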

The DatasetWriter is just a strategy to store metadata and data pairs.
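A strategy for storing metadata/data pairs could be expressed as an abstract base class with interchangeable concrete backends. The interface below, and the pickle-per-sample backend, are illustrative assumptions, not the actual torchsig design; a real backend would more likely target LMDB, HDF5, or SigMF.

```python
import pickle
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Dict

class DatasetWriter(ABC):
    """Hypothetical strategy interface: how (data, metadata) pairs
    are persisted is up to the concrete subclass."""

    @abstractmethod
    def write(self, data: Any, meta: Dict[str, Any]) -> None:
        ...

class PickleWriter(DatasetWriter):
    """One possible concrete strategy: one pickle file per sample."""

    def __init__(self, root) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.count = 0

    def write(self, data: Any, meta: Dict[str, Any]) -> None:
        path = self.root / f"sample_{self.count}.pkl"
        with open(path, "wb") as fh:
            pickle.dump({"data": data, "meta": meta}, fh)
        self.count += 1

# Demo: write two samples into a temporary directory.
tmp = tempfile.mkdtemp()
w = PickleWriter(tmp)
w.write([1.0, 2.0], {"index": 0})
w.write([3.0, 4.0], {"index": 1})
```

Swapping storage formats then only means swapping the writer, without touching generation code.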

This also means we can refactor named datasets like Sig53 so that they no longer carry any responsibility for generating or storing themselves.

gvanhoy (Collaborator, Author) commented Apr 19, 2023

In local tests for Sig53, these classes decreased generation time from four minutes to under one. YMMV.

A few other issues came up along the way, though.

  1. What's the best way to provide dataset metadata for inspecting datasets during training? I'm not asking about the storage format, such as HDF5 or SigMF; the question is which fields we keep in the database.
  2. Do we need or want an abstracted interface to our data so that we can leverage the trade-offs of different database/storage technologies? I feel this is out of scope, so for now I simply chose a faster store for writing during generation and a faster store for reading during training :/.
