
Introduce new DatasetWriter class and other related classes #61

Closed
gvanhoy opened this issue Apr 4, 2023 · 1 comment · Fixed by #67
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)


gvanhoy (Collaborator) commented Apr 4, 2023

To support parallel generation of datasets (#41, #35, #36), partial generation of datasets (#31), and storing datasets in multiple formats, it makes sense to have a class dedicated to these responsibilities.

I think the DatasetCreator class can have the single responsibility of connecting a DatasetLoader and a DatasetWriter.

The DatasetCreator can take care of seeding, naming, and handling dataset metadata.
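The issue contains no code, so here is a minimal, hypothetical sketch of how a DatasetCreator with those responsibilities might look. The constructor signature, the `create` method, and the `_ListWriter` stand-in are all assumptions for illustration; the real torchsig API may differ.

```python
import random
from typing import Any, Dict

class DatasetCreator:
    """Hypothetical sketch: wires a loader to a writer and owns
    seeding, naming, and dataset-level metadata."""

    def __init__(self, loader, writer, name: str, seed: int = 1234):
        self.loader = loader    # any iterable yielding (data, metadata) pairs
        self.writer = writer    # any object with write()/write_metadata()
        self.name = name
        self.seed = seed

    def create(self) -> None:
        random.seed(self.seed)  # reproducible generation
        self.writer.write_metadata({"name": self.name, "seed": self.seed})
        for data, meta in self.loader:
            self.writer.write(data, meta)

class _ListWriter:
    """Minimal stand-in writer that collects everything in memory,
    just to demonstrate the wiring."""

    def __init__(self):
        self.metadata: Dict[str, Any] = {}
        self.records = []

    def write_metadata(self, meta: Dict[str, Any]) -> None:
        self.metadata.update(meta)

    def write(self, data: Any, meta: Dict[str, Any]) -> None:
        self.records.append((data, meta))

# Lazy generator: samples are drawn only inside create(), after seeding.
loader = (([random.random()], {"index": i}) for i in range(3))
writer = _ListWriter()
DatasetCreator(loader, writer, name="demo", seed=0).create()
```

Keeping seeding and metadata in one place means any loader/writer pair produces a reproducible, self-describing dataset.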

The DatasetLoader is probably a Dataset wrapped by a torch.utils.data.DataLoader, but the data could come from anything that generates it, such as a Python generator.
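Since the comment allows any generator of (data, metadata) pairs, a torch-free stand-in for the DatasetLoader might look like the following. The function name, the placeholder data, and the metadata fields are invented for illustration only.

```python
from typing import Any, Dict, Iterator, List, Tuple

def synthetic_loader(num_samples: int) -> Iterator[Tuple[List[float], Dict[str, Any]]]:
    """Hypothetical generator standing in for a torch DataLoader:
    yields (data, metadata) pairs one at a time."""
    for i in range(num_samples):
        data = [float(i), float(i) * 2.0]                 # placeholder signal
        meta = {"index": i, "class_name": f"class_{i % 2}"}  # placeholder metadata
        yield data, meta

pairs = list(synthetic_loader(4))
```

Because the creator only iterates, it cannot tell whether the pairs come from a DataLoader, a generator, or a file.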

The DatasetWriter is just a strategy to store metadata and data pairs.
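A strategy for storing metadata/data pairs could be expressed as an abstract base class with interchangeable concrete backends. The interface below, and the pickle-per-sample backend, are illustrative assumptions, not the actual torchsig design; a real backend would more likely target LMDB, HDF5, or SigMF.

```python
import pickle
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Dict

class DatasetWriter(ABC):
    """Hypothetical strategy interface: how (data, metadata) pairs
    are persisted is up to the concrete subclass."""

    @abstractmethod
    def write(self, data: Any, meta: Dict[str, Any]) -> None:
        ...

class PickleWriter(DatasetWriter):
    """One possible concrete strategy: one pickle file per sample."""

    def __init__(self, root) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.count = 0

    def write(self, data: Any, meta: Dict[str, Any]) -> None:
        path = self.root / f"sample_{self.count}.pkl"
        with open(path, "wb") as fh:
            pickle.dump({"data": data, "meta": meta}, fh)
        self.count += 1

# Demo: write two samples into a temporary directory.
tmp = tempfile.mkdtemp()
w = PickleWriter(tmp)
w.write([1.0, 2.0], {"index": 0})
w.write([3.0, 4.0], {"index": 1})
```

Swapping storage formats then only means swapping the writer, without touching generation code.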

This also means we can refactor named datasets like Sig53 so that they no longer carry any responsibility for generating or storing themselves.

gvanhoy (Collaborator, Author) commented Apr 19, 2023

In local tests for Sig53, these classes decreased generation time from four minutes to under one. YMMV.

A few other issues came up along the way, though.

  1. What's the best way to provide dataset metadata for inspecting datasets during training? I'm not asking about the storage format, such as HDF5 or SigMF; the question is which fields we keep in the database.
  2. Do we need or want an abstracted interface to our data so that we can leverage the trade-offs of different database/storage technologies? I feel this is out of scope, so for now I simply chose a faster store for writing during generation and a faster store for reading during training :/.
