Framing
How does tinygrad plan to handle datasets and dataloading in general (to the extent those concepts generalize)?
I figure that actually implementing and optimizing the distributed routing of samples to model replicas is more of an adjacent problem than a core one for tinygrad. Is that right?
I'd argue this is one of those things that gets hairier than expected once you sweat the details, especially as expectations creep toward Pareto optimality on the convergence-throughput frontier.
Ask
I would like to brazenly suggest MosaicML Streaming/StreamingDataset, which was used to train MPT-30B among others. tl;dr: the repo is built around a dataset class that provides all the distributed and serialization machinery that PyTorch's IterableDataset leaves for you to solve yourself, plus some other nice things, i.e. Dataset > IterableDataset > StreamingDataset, and DataLoader > StreamingDataLoader if you want mid-epoch checkpoint/resume. I suspect it would be easily adaptable to non-PyTorch frameworks.
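For flavor, a minimal sketch of what that hierarchy looks like in use; the bucket path, cache directory, and loader settings below are placeholders, and exact keyword arguments may differ between streaming releases:

```python
# Hypothetical sketch: consuming an already-sharded dataset with mosaicml-streaming.
# The remote/local paths and batch settings are placeholders.
from streaming import StreamingDataset, StreamingDataLoader

dataset = StreamingDataset(
    remote="s3://my-bucket/my-dataset",   # shards are streamed from here...
    local="/tmp/my-dataset-cache",        # ...and cached here on each node
    shuffle=True,
    batch_size=32,
)

# StreamingDataLoader is a drop-in for torch's DataLoader, with resumable state.
loader = StreamingDataLoader(dataset, batch_size=32, num_workers=4)

for batch in loader:
    ...  # training step goes here
```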
Details
The point of StreamingDataset is to solve scaling. A few highlights:
Compute vs. data: (a) streams and caches samples when the data is remote, (b) otherwise behaves like a local dataset, with prefetched/lazy global random access.
Serialization: (a) handled via its own sharded serialization formats, (b) shards live on disk, not RAM, (c) a cached getitem takes tens of microseconds (a writer sketch follows this list).
Combining many datasets at run time: (a) balancing component sub-datasets, (b) on-the-fly shuffling robust enough to handle different data distributions, heavy upsampling without hurting convergence, and heavy downsampling while still using ingress efficiently, (c) all the ways to batch, shuffle, and jive known to science, plus a few more (see the mixing sketch below).
Hardware failures: (a) adds mid-epoch checkpointing/resumption, which requires deterministic distributed everything (table stakes?), (b) and it is also elastically deterministic across changes in node count between checkpoint/resume (see the resume sketch below).
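As a concrete (and hypothetical) example of the serialization point above, here is roughly what writing a dataset into Streaming's sharded format looks like; the output directory, column schema, and sample contents are made up, and column type strings may differ by version:

```python
# Hypothetical sketch: serializing samples into Streaming's sharded on-disk format.
# Output path, column schema, and sample contents are placeholders.
import numpy as np
from streaming import MDSWriter

columns = {"tokens": "ndarray", "label": "int"}  # per-sample fields and their encodings

with MDSWriter(out="/tmp/my-dataset", columns=columns, compression="zstd") as writer:
    for i in range(1000):
        writer.write({
            "tokens": np.random.randint(0, 50_000, size=(512,)),  # fake token ids
            "label": i % 10,
        })

# The result is a directory of shard files plus an index. Shards stay on disk and
# StreamingDataset pulls individual samples out of them lazily, which is how a
# cached getitem stays in the tens-of-microseconds range.
```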
None of this blows your hair back for single-node training on a dataset that fits on the node, but once you start thinking about scale and performance you hit a wall.
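And a hypothetical sketch of the two scaling features called out above, blending sub-datasets at run time and resuming mid-epoch; the remotes, proportions, and checkpoint plumbing are illustrative only:

```python
# Hypothetical sketch: mixing sub-datasets and resuming a run mid-epoch.
# Remotes and proportions are placeholders.
from streaming import Stream, StreamingDataset, StreamingDataLoader

# Each Stream is one sharded sub-dataset; `proportion` controls how heavily it
# is sampled when the streams are blended into a single epoch.
streams = [
    Stream(remote="s3://bucket/web-text", proportion=0.7),
    Stream(remote="s3://bucket/code", proportion=0.3),
]
dataset = StreamingDataset(streams=streams, shuffle=True, batch_size=32)
loader = StreamingDataLoader(dataset, batch_size=32, num_workers=4)

# Mid-epoch checkpoint/resume: the loader exposes its position as a plain dict,
# so it can be saved next to the model weights and restored after a failure.
state = loader.state_dict()
# ... save `state` with the training checkpoint, crash, restart ...
loader.load_state_dict(state)
for batch in loader:
    ...  # continues from where the previous run left off in the epoch
```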