Framing
How does tinygrad plan to handle datasets and dataloading in general (to the extent those concepts generalize)?
I figure that actually implementing and optimizing the distributed routing of samples to model replicas is more of an adjacent problem than a core one for tinygrad. Is that right?
I'd argue this is one of those things that gets hairier than expected once you sweat the details, especially as expectations creep toward Pareto optimality on the convergence-throughput frontier.
Ask
I would like to brazenly suggest MosaicML Streaming/StreamingDataset, which was used to train MPT-30B among others. tl;dr: the repo is built around a dataset class that provides all the distributed and serialization machinery that PyTorch's IterableDataset leaves for you to solve yourself, plus some other nice things, i.e. Dataset > IterableDataset > StreamingDataset, and DataLoader > StreamingDataLoader if you want mid-epoch checkpoint/resume. I suspect it would be easily adaptable to non-PyTorch frameworks.
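For flavor, a minimal sketch of what that hierarchy looks like in use; the bucket path, cache directory, and loader settings below are placeholders, and exact keyword arguments may differ between streaming releases:

```python
# Hypothetical sketch: consuming an already-sharded dataset with mosaicml-streaming.
# The remote/local paths and batch settings are placeholders.
from streaming import StreamingDataset, StreamingDataLoader

dataset = StreamingDataset(
    remote="s3://my-bucket/my-dataset",   # shards are streamed from here...
    local="/tmp/my-dataset-cache",        # ...and cached here on each node
    shuffle=True,
    batch_size=32,
)

# StreamingDataLoader is a drop-in for torch's DataLoader, with resumable state.
loader = StreamingDataLoader(dataset, batch_size=32, num_workers=4)

for batch in loader:
    ...  # training step goes here
```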
Details
The point of StreamingDataset is to solve scaling. A few highlights:
Compute vs. data: (a) streams and caches samples when the data is remote, (b) otherwise behaves like a local dataset, with prefetched/lazy global random access.
Serialization: (a) handled via its own sharded serialization formats, (b) shards live on disk, not RAM, (c) a cached getitem takes tens of microseconds (a writer sketch follows this list).
Combining many datasets at run time: (a) balancing component sub-datasets, (b) on-the-fly shuffling robust enough to handle different data distributions, heavy upsampling without hurting convergence, and heavy downsampling while still using ingress efficiently, (c) all the ways to batch, shuffle, and jive known to science, plus a few more (see the mixing sketch below).
Hardware failures: (a) adds mid-epoch checkpointing/resumption, which requires deterministic distributed everything (table stakes?), (b) and it is also elastically deterministic across changes in node count between checkpoint/resume (see the resume sketch below).
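As a concrete (and hypothetical) example of the serialization point above, here is roughly what writing a dataset into Streaming's sharded format looks like; the output directory, column schema, and sample contents are made up, and column type strings may differ by version:

```python
# Hypothetical sketch: serializing samples into Streaming's sharded on-disk format.
# Output path, column schema, and sample contents are placeholders.
import numpy as np
from streaming import MDSWriter

columns = {"tokens": "ndarray", "label": "int"}  # per-sample fields and their encodings

with MDSWriter(out="/tmp/my-dataset", columns=columns, compression="zstd") as writer:
    for i in range(1000):
        writer.write({
            "tokens": np.random.randint(0, 50_000, size=(512,)),  # fake token ids
            "label": i % 10,
        })

# The result is a directory of shard files plus an index. Shards stay on disk and
# StreamingDataset pulls individual samples out of them lazily, which is how a
# cached getitem stays in the tens-of-microseconds range.
```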
None of this blows your hair back for single-node training on a dataset that fits on the node, but once you start thinking about scale and performance you hit a wall.
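And a hypothetical sketch of the two scaling features called out above, blending sub-datasets at run time and resuming mid-epoch; the remotes, proportions, and checkpoint plumbing are illustrative only:

```python
# Hypothetical sketch: mixing sub-datasets and resuming a run mid-epoch.
# Remotes and proportions are placeholders.
from streaming import Stream, StreamingDataset, StreamingDataLoader

# Each Stream is one sharded sub-dataset; `proportion` controls how heavily it
# is sampled when the streams are blended into a single epoch.
streams = [
    Stream(remote="s3://bucket/web-text", proportion=0.7),
    Stream(remote="s3://bucket/code", proportion=0.3),
]
dataset = StreamingDataset(streams=streams, shuffle=True, batch_size=32)
loader = StreamingDataLoader(dataset, batch_size=32, num_workers=4)

# Mid-epoch checkpoint/resume: the loader exposes its position as a plain dict,
# so it can be saved next to the model weights and restored after a failure.
state = loader.state_dict()
# ... save `state` with the training checkpoint, crash, restart ...
loader.load_state_dict(state)
for batch in loader:
    ...  # continues from where the previous run left off in the epoch
```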