To support parallel generation of datasets (#41, #35, #36), partial generation of datasets (#31), and storing datasets in multiple formats, it makes sense to have a class dedicated to this.
I think the `DatasetCreator` class can just have the responsibility of connecting a `DatasetLoader` and a `DatasetWriter`. Something like this.
The `DatasetCreator` can take care of seeding, naming, and handling dataset metadata.
The `DatasetLoader` is probably a `Dataset` wrapped by a `torch.utils.data.DataLoader`, but it could be anything that generates the data, such as a Python generator.
The `DatasetWriter` is just a strategy for storing metadata and data pairs.
This also means we can refactor named datasets like `Sig53` so they no longer carry any responsibility for generating/storing themselves.
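The split described above could be sketched roughly as follows. This is only an illustration, not a proposed implementation: all class and field names besides `torch.utils.data.DataLoader` are hypothetical, and the toy generator/writer stand in for a real loader and storage backend.

```python
import random


class DatasetWriter:
    """Strategy interface: a concrete subclass persists (data, metadata)
    pairs in some storage format (e.g. HDF5 or SigMF)."""

    def write(self, index: int, data, metadata: dict) -> None:
        raise NotImplementedError


class ListWriter(DatasetWriter):
    """Toy writer that keeps records in memory, for illustration only."""

    def __init__(self):
        self.records = []

    def write(self, index, data, metadata):
        self.records.append((index, data, metadata))


class DatasetCreator:
    """Connects a loader (any iterable of (data, metadata) pairs, e.g. a
    torch.utils.data.DataLoader or a plain generator) to a writer, and
    takes care of seeding and naming."""

    def __init__(self, loader, writer, name="dataset", seed=None):
        self.loader = loader
        self.writer = writer
        self.name = name
        self.seed = seed

    def create(self):
        if self.seed is not None:
            random.seed(self.seed)  # seed any RNG the loader relies on
        for index, (data, metadata) in enumerate(self.loader):
            self.writer.write(index, data, metadata)


def fake_signal_generator(n):
    """Stand-in for a real signal dataset: yields (data, metadata) pairs."""
    for i in range(n):
        data = [random.gauss(0.0, 1.0) for _ in range(4)]
        yield data, {"class_index": i % 2}


writer = ListWriter()
DatasetCreator(fake_signal_generator(3), writer, name="toy", seed=42).create()
```

Because the loader is just an iterable, swapping the toy generator for a `DataLoader` (or a multiprocessing-backed generator for the parallel case) would not change the `DatasetCreator` at all.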
In local tests for Sig53, these classes decreased generation time from 4 minutes to less than one. YMMV.
There are a few other issues that came up here, though.
What's the best way to provide dataset metadata for inspecting datasets during training? I'm not talking about the storage format, like HDF5 or SigMF; the question is which fields we keep in the database.
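For concreteness, here is one possible shape for a per-sample record. Every field name here is a suggestion for discussion, not a decision, and `SampleMetadata` is a hypothetical name:

```python
from dataclasses import dataclass, asdict


@dataclass
class SampleMetadata:
    """One possible minimal per-sample record for inspection during training."""

    index: int           # position of the sample in the dataset
    class_name: str      # e.g. one of the Sig53 modulation classes
    snr_db: float        # signal-to-noise ratio used during generation
    num_iq_samples: int  # length of the stored IQ vector


meta = SampleMetadata(index=0, class_name="bpsk", snr_db=10.0, num_iq_samples=4096)
record = asdict(meta)  # plain dict, ready for any storage backend
```

Keeping the record a plain dataclass/dict keeps the field question separate from the storage-format question.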
Do we need or want an abstracted way of interfacing with our data so that we can leverage the trade-offs of various database/storage technologies? I feel like this is out of scope, so I just chose a faster storage format for writing during generation and a faster one for reading during training :/.
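The write-fast/read-fast split can be illustrated with a toy sketch. Stdlib `pickle` stands in here for the real formats (a real implementation might pick HDF5, LMDB, etc.), and both function names are hypothetical:

```python
import os
import pickle
import tempfile


def write_during_generation(samples, out_dir):
    """Write-optimized path: one small pickle per sample, so parallel
    workers can write without coordinating on a shared file."""
    for i, (data, metadata) in enumerate(samples):
        with open(os.path.join(out_dir, f"{i}.pkl"), "wb") as f:
            pickle.dump((data, metadata), f)


def pack_for_training(out_dir, packed_path):
    """Read-optimized path: consolidate the per-sample files into a
    single file that training can read sequentially."""
    names = sorted(os.listdir(out_dir), key=lambda n: int(n.split(".")[0]))
    records = []
    for name in names:
        with open(os.path.join(out_dir, name), "rb") as f:
            records.append(pickle.load(f))
    with open(packed_path, "wb") as f:
        pickle.dump(records, f)


with tempfile.TemporaryDirectory() as tmp:
    write_during_generation([([0.1, 0.2], {"class": "qpsk"})], tmp)
    pack_for_training(tmp, os.path.join(tmp, "packed.bin"))
```

The point of the sketch is only the separation: generation never blocks on a shared file, and training reads one consolidated artifact.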