
[RFC] Add tar-based IterableDataset implementation to PyTorch #38419

Open
tmbdev opened this issue May 13, 2020 · 31 comments
Labels
feature A request for a proper, new feature. module: dataloader Related to torch.utils.data.DataLoader and Sampler triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@tmbdev

tmbdev commented May 13, 2020

Problem

As datasets become larger and larger, storing training samples as individual files becomes impractical and inefficient. This can be addressed using sequential storage formats and sharding (see "Alternatives Considered" for other implementations). PyTorch lacks such a common storage format right now.

Proposal

WebDataset provides an implementation of IterableDataset based on sharded tar archives. This format provides efficient I/O for very large datasets, makes migration from file-based I/O easy, and works well locally, with cloud storage, and web servers. It also provides a simple, standard format in which large datasets can be distributed easily and used directly without unpacking.

The implementation is small (1000-1800 LOC) and has no external dependencies. The proposal is to incorporate the webdataset.Dataset class into the PyTorch base distribution. The source repository is here:

http://github.com/tmbdev/webdataset

My suggestion would be to incorporate the library with minimal changes into its own subpackage. I can perform the integration and generate a PR once there is general agreement.

More Background

For a general introduction to how we handle large scale training with WebDataset, see this YouTube playlist

The WebDataset library (github.com/tmbdev/webdataset) provides an implementation of IterableDataset that uses POSIX tar archives as its native storage format. The format itself is based on a simple convention:

  • datasets are POSIX tar archives
  • each training sample consists of adjacent files with the same basename
  • shards are numbered consecutively

For example, ImageNet is stored in 147 separate 1 Gbyte shards with names imagenet-train-0000.tar to imagenet-train-0147.tar; the contents of the first shard are:

-r--r--r-- bigdata/bigdata      3 2020-05-08 21:23 n03991062_24866.cls
-r--r--r-- bigdata/bigdata 108611 2020-05-08 21:23 n03991062_24866.jpg
-r--r--r-- bigdata/bigdata      3 2020-05-08 21:23 n07749582_9506.cls
-r--r--r-- bigdata/bigdata 129044 2020-05-08 21:23 n07749582_9506.jpg
-r--r--r-- bigdata/bigdata      3 2020-05-08 21:23 n03425413_23604.cls
-r--r--r-- bigdata/bigdata 106255 2020-05-08 21:23 n03425413_23604.jpg
-r--r--r-- bigdata/bigdata      3 2020-05-08 21:23 n02795169_27274.cls

Datasets in WebDataset format can be used directly after downloading without unpacking; they can also be mounted as a file system. Content in WebDataset format can use any file-based compression scheme. In addition, the tar file itself can be compressed, and WebDataset will transparently decompress it.

WebDatasets can be used directly from local disk, from web servers (hence the name), from cloud storage, and from object stores, just by changing a URL.

WebDataset readers/writers are easy to implement (we have Python, Golang, and C++ implementations).
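
To show how simple a reader for this convention can be, here is a minimal sketch (not the WebDataset code itself) that streams through a shard with Python's built-in tarfile module and groups adjacent entries by basename:

import os
import tarfile

def read_samples(path):
    # Walk the archive sequentially; adjacent entries sharing a basename form
    # one training sample, keyed here by file extension.
    sample, key = {}, None
    with tarfile.open(path, mode="r|*") as archive:  # "r|*": pure streaming read
        for member in archive:
            if not member.isfile():
                continue
            base, ext = os.path.splitext(member.name)
            if base != key and sample:
                yield sample
                sample = {}
            key = base
            sample[ext.lstrip(".")] = archive.extractfile(member).read()
    if sample:
        yield sample

# for sample in read_samples("imagenet-train-0000.tar"):
#     image_bytes, label = sample["jpg"], sample["cls"]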

WebDataset performs shuffling both at the shard level and at the sample level. Splitting of data across multiple workers is performed at the shard level using a user-provided shard_selection function, which defaults to splitting shards across workers based on get_worker_info. (WebDataset can be combined with the tensorcom library to offload decompression/data augmentation and provide RDMA and direct-to-GPU loading; see below.)
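
To make the splitting concrete, here is a minimal sketch of a custom shard_selection function (the name shard_selection comes from the description above; the exact signature is an assumption and should be checked against the WebDataset docs):

import torch.utils.data

def my_shard_selection(urls):
    # Each DataLoader worker keeps every num_workers-th shard, offset by its
    # worker id; this mirrors the default get_worker_info-based behavior.
    info = torch.utils.data.get_worker_info()
    if info is None:        # single-process loading: keep all shards
        return urls
    return urls[info.id::info.num_workers]

# dataset = wds.Dataset(shardurl, shard_selection=my_shard_selection)  # hypothetical usage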

We are storing and processing petabytes of training data as tar archives and are using the format for unsupervised learning from video, OCR and scene text recognition, and large scale object recognition experiments. The same code and I/O pipelines work efficiently on the desktop, on a local cluster, or in the cloud. Our benchmarks show scalability and the ability to take advantage of the full I/O bandwidth of each disk drive across very large training jobs and storage clusters.

Code Sample

This shows how to use WebDataset with ImageNet.

import torch
import webdataset as wds
from torchvision import transforms

shardurl = "/imagenet/imagenet-train-{0000..0147}.tar"

normalize = transforms.Normalize(
    mean=[0.485, 0.456, 0.406],
    std=[0.229, 0.224, 0.225])

preproc = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    normalize,
])

dataset = (
    wds.Dataset(shardurl)
    .shuffle(1000)
    .decode("pil")
    .rename(image="jpg;png", data="cls")
    .map_dict(image=preproc)
    .to_tuple("image", "data")
)

loader = torch.utils.data.DataLoader(dataset, batch_size=64, num_workers=8)

for inputs, targets in loader:
    ...

WebDataset uses a fluent API for configuration that internally builds up a processing pipeline. Without any added processing stages, WebDataset just iterates through each training sample as a dictionary:

# load from a web server using a separate client process
shardurl = "pipe:curl -s http://server/imagenet/imagenet-train-{0000..0147}.tar"

dataset = wds.Dataset(shardurl)

for sample in dataset:
    # sample["jpg"] contains the raw image data
    # sample["cls"] contains the class
    ...

Alternatives Considered

Keeping WebDataset as a Separate Library

WebDataset is perfectly usable as a separate library, so why not keep it that way?

  • Users of deep learning libraries expect an efficient data format that avoids the "many small file" problem; TensorFlow provides TFRecord/tf.Example. Making WebDataset part of PyTorch itself provides a straightforward solution for most users and reduces dependencies.
  • Providing a format reader in the standard library encourages dataset standardization, tool building, and adoption of the same format by other libraries.
  • Having a realistic implementation of IterableDataset in PyTorch provides a common reference against which to address issues and provide test cases in the DataLoader implementation and its use of IterableDataset.

Many of the larger datasets distributed in torchvision could be distributed easily in WebDataset format. This would allow users to either train directly against web-hosted data, or to train on the datasets immediately after downloading without unpacking. The way data is arranged in WebDataset also allows users to download just a few shards for testing code locally, and then use the entire dataset when running on a cluster. Furthermore, because of the way webdataset.Dataset works, in most cases no special code is needed to read them; many training jobs can be retargeted to different datasets simply by using a different URL for the dataset.

TFRecord+protobuf, Parquet

These formats are suitable for large-scale data processing in machine learning and deep learning applications; some datasets already exist in these formats, and more will continue to be generated for the TensorFlow ecosystem. However, they are not good candidates for incorporation into PyTorch as a core feature because:

  • TFRecord+protobuf (tf.Example) and Parquet have significant external library dependencies; in contrast, WebDataset is pure Python, using the built-in libraries for tar decoding.
  • Protobuf and Parquet represent serialized data structures, while WebDataset uses file-based representations. File based representations make creation of WebDatasets easy, result in bit-identical representations of data in WebDatasets, and allow existing file-based codecs and I/O pipelines to be reused easily.
  • Outside of their native frameworks, there is not much third-party support for these formats; in contrast, POSIX tar archive libraries exist in all major programming languages, and tar archives are easily processed, either by unpacking shards or by using tools like tarp.

Note that WebDataset supports usage scenarios similar to TFRecord+protobuf, since serialized data structures can be incorporated as files, and WebDataset will decode them automatically. For example, OpenImages multi-instance data is simply stored in a .json file accompanying each .jpg file:

$ curl -s http://storage.googleapis.com/nvdata-openimages/openimages-train-000554.tar | tar tf -
0b946d1d5201c06c.jpg
0b946d1d5201c06c.json
05c9d72ff9e64010.jpg
05c9d72ff9e64010.json
...
$
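
A hedged sketch of consuming such a shard (method names follow the ImageNet example earlier in this RFC; as noted above, the .json entry is decoded automatically when a decoding stage is enabled):

import webdataset as wds

url = "pipe:curl -s http://storage.googleapis.com/nvdata-openimages/openimages-train-000554.tar"
dataset = wds.Dataset(url).decode("pil")

for sample in dataset:
    image = sample["jpg"]         # decoded PIL image
    annotations = sample["json"]  # the per-sample .json annotations
    ...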

zip instead of tar

The zip format is another archival format. Unlike tar format, which is just a sequence of records, zip format stores a file index at the very end of the file, making it unsuitable for streaming. Tar files can be made random access (and, in fact, can be mounted as file systems), but they use a separate index file to support that functionality.

LMDB, HDF5, Databases

These formats are not suitable for streaming and require the entire dataset to fit onto local disk. In addition, while they nominally solve the "many small files" problem, they don't address the fact that indexing into the dataset still results in expensive seek operations.

Local File System Caches

An approach for extending file-system based I/O pipelines to large distributed storage systems is to use some form of "pre-caching" or "staging" on a local NVMe drive. Generally, there is little advantage to this. For large datasets, it does not increase throughput. Input pipelines still need to be modified to schedule the pre-caching. And generally, this requires volume plugins or virtual file system support. A similar effect can be achieved with WebDataset by simply unpacking shards to the local file system when direct file access is required.

Related Software

AIStore is an open source object store capable of full bandwidth disk-to-GPU data delivery (meaning that if you have 1000 rotational drives with 200 MB/s read speed, AIStore actually delivers an aggregate bandwidth of 200 GB/s to the GPUs). AIStore is fully compatible with WebDataset as a client, and in addition understands the WebDataset format, permitting it to perform shuffling, sorting, ETL, and some map-reduce operations directly in the storage system.

tarp is a small command line program for splitting, merging, shuffling, and processing tar archives and WebDataset datasets.

tensorcom is a library supporting distributed data augmentation and RDMA to GPU.

webdataset-examples contains an example (and soon more examples) of how to use WebDataset in practice.

Bigdata 2019 Paper with Benchmarks

cc @ssnl

@chauhang
Contributor

Thanks @tmbdev. We will review and get back to you.

@soumith
Member

soumith commented May 13, 2020

this seems like a very appealing RFC. I am generally supportive.

Can you talk about how you plan to integrate it into PyTorch core?

  • What's the outline of where you want to put it, what files / folders do you plan to add?

Can you talk about whether you or others are volunteering to maintain it, responding to feature requests and issues wrt this Dataset?

@ssnl
Collaborator

ssnl commented May 13, 2020

Thanks for the RFC. I haven't read enough details yet but I just want to say thank you for the detailed request and for making us aware of a great library and a great use of IterableDataset.

I may look at this in more depth and express my thoughts later. But either way, I'd be happy to assist review if this makes its way into core.

@cpuhrsch
Contributor

cpuhrsch commented May 13, 2020

Thank you for this RFC! This looks great! In particular I'm very happy to see that it interacts so well with abstractions we already have. Here are some initial comments:

In general this seems like adding a component that could, maybe with a few more extensions (indexing for constrained random access), represent a revamp of PyTorch's data functionalities. There are data source abstractions (path, cloud storage, web, etc.), generic python functionalities (reading and decoding tar files on the fly), shuffling schemas (inter and intra-shard shuffling), transforms (map_dict, etc.) and a fully-fledged interface (fluent API).

If we want to integrate this into our current abstractions, I think we need to pry these components apart and land them as separate abstractions that can be considered orthogonal to what we currently have. Otherwise, we should consider a revamp of our data abstractions and could presumably use much of the code here to supplement that.

I'm in favor of keeping this a separate library until we have come up with a full proposal for a revamp. Otherwise this seems like adding an independent, large component to PyTorch that can easily stand on its own. Having said that, I still agree with the reasons for why we would want to add something like this to the core. I absolutely agree that we lack a system that provides e.g. efficient data storage for many small files or interfaces well with internet based, high-latency, read-only storage solutions.

@tmbdev
Author

tmbdev commented May 14, 2020

this seems like a very appealing RFC. I am generally supportive.

Great!

Can you talk about how you plan to integrate it into PyTorch core? What's the outline of where you want to put it, what files / folders do you plan to add?

I haven't looked at it in detail, but from a user's point of view, torch.utils.webdataset might be a good place. What do you think?

Can you talk about whether you or others are volunteering to maintain it, responding to feature requests and issues wrt this Dataset?

I'm happy to maintain and support it (and I'm using the library for a lot of my work). In general, NVIDIA sees fast access to storage as an important enabler for large scale deep learning (cf Mellanox and SwiftStack acquisitions and the AIStore server), so there is corporate backing, and strong backing for open source efforts.

@tmbdev
Author

tmbdev commented May 14, 2020

I'm in favor of keeping this a separate library until we have come up with a full proposal for a revamp. Otherwise this seems like adding an independent, large component to PyTorch that can easily stand on its own.

WebDataset is actually derived from a previous library called dlinputs that was based around processing pipelines and represented such a revamp. Experience and feedback from users suggested that refactoring and an incremental approach was better than an all-at-once replacement. The refactoring of dlinputs created several separate libraries:

  • WebDataset provides general Python processing pipelines with an interface familiar to PyTorch users. It's mature enough to be incorporated now, and it is completely usable on its own, since it works with DataLoader.
  • Tensorcom handles parallelization of preprocessing pipelines, distributed augmentation, RDMA, and direct-to-GPU. It works great as is and has significant advantages over DataLoader. We mostly use it with large jobs right now where data loading is defined and launched completely separately from the training jobs themselves; we still need an API that makes it more convenient for small/simple jobs, where data loading and launching is part of a single program.
  • Tarp replaces the command line programs. It's a Go program with its own WebDataset-like pipeline library. Its development is independent from PyTorch, but it operates on the WebDataset file formats.
  • Objectio is a small library for I/O from object stores. It works fine, and WebDataset can use it but doesn't require it. There are some benefits from using it (or something like it), but I'm not sure they are compelling enough, which is why it's not an integral part of WebDataset.

My suggestion for a roadmap would be the following:

  • incorporate the WebDataset library as is, exposing only a few select classes; users can use it with DataLoader; this doesn't close any future options, but it solves an immediate need and defines a standard that people can build on
  • afterwards, in parallel
    • to help current users, iron out rough spots in the existing DataLoader class for use with IterableDataset as much as possible (I have some ideas for that, we can discuss them separately)
    • if we identify parts that are useful elsewhere in PyTorch (e.g., gopen), gradually refactor them into their own packages and expose/document them
    • develop the tensorcom library (or some new library) to the point where it can be a replacement for DataLoader in new code

So, the long and the short is: WebDataset is useful and mature enough as is, and I think there are substantial benefits from incorporating it into PyTorch now and defining a dataset format standard that people can build on.

In addition, we also have a roadmap for evolving from here to a "revamped" I/O system; we can get there in steps and without breaking existing code. Once both WebDataset and Tensorcom are in place, we can deprecate and eventually remove old classes that aren't needed anymore.

What do you think? Does that sound like a good plan?

@megaserg

megaserg commented May 15, 2020

Could you please point out in the code how random access (to individual training examples) and len() are supported in WebDataset?

I believe most published datasets do not have an index file generated for their .tars, so this is something the user would need to do?

In my understanding, without random access one can't use DistributedSampler, and without DistributedSampler I'm not sure how to partition the dataset for the multi-machine/multi-GPU setup in a way that each GPU gets an equal number of examples. Is my understanding correct?

I see WebDataset has a concept of "shard" as a separate tar file, how does it play together with multi-GPU setup?

@mjunyent

Thanks for the RFC. This would provide a good speedup even on not-so-big datasets. And it wouldn't only be useful for training but also for updating an existing dataset, where the "many small file" problem is doubled, as we have to both download/upload and unpack/pack.

@tmbdev
Author

tmbdev commented May 15, 2020

Could you please point out in the code how random access (to individual training examples) and len() are supported in WebDataset?
I believe most published datasets do not have an index file generated for their .tars, so this is something the user would need to do?

WebDataset is set up so that for regular training on shuffled data, you don't need either random access or len. Instead, it relies on a combination of shard shuffling and inline shuffling. That's what allows it to provide shuffled training data to GPUs while using streaming, sequential I/O.

In my understanding, without random access one can't use DistributedSampler, and without DistributedSampler I'm not sure how to partition the dataset for the multi-machine/multi-GPU setup in a way that each GPU gets an equal number of examples. Is my understanding correct?

DistributedSampler and other classes like it don't apply to datasets defined with IterableDataset. If you want to use DataLoader with any IterableDataset (not just WebDataset), you need other ways of sampling and batching, and coping with variable numbers of samples per node. The recommended way seems to be to let the IterableDataset do the batching, which WebDataset supports. WebDataset offers other possible strategies as well. I think there are also some small fixes to DataLoader to make it work better with IterableDataset.
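
A hedged sketch of in-dataset batching (the .batched() stage is an assumption taken from the WebDataset documentation rather than from this RFC; shardurl and preproc are as in the ImageNet example above):

import torch
import webdataset as wds

dataset = (
    wds.Dataset(shardurl)
    .shuffle(1000)
    .decode("pil")
    .rename(image="jpg;png", label="cls")
    .map_dict(image=preproc)
    .to_tuple("image", "label")
    .batched(64)                 # the dataset itself emits ready-made batches
)

# batch_size=None tells DataLoader to pass the pre-batched samples through unchanged
loader = torch.utils.data.DataLoader(dataset, batch_size=None, num_workers=8)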

You can also avoid this problem by not using DataLoader at all; the tensorcom library supports not just multiprocess and multinode preprocessing and augmentation, but will also support RDMA and direct-to-GPU in the next release. I think something like tensorcom is the future, at least for large scale training.

(We're currently looking into modifying tensorcom to make it as much of a drop-in replacement for DataLoader as possible, and maybe it can become a one stop solution. But that's a separate issue from this one.)

I see WebDataset has a concept of "shard" as a separate tar file, how does it play together with multi-GPU setup?

Shards are used for three purposes: to support parallel map-reduce operations during preprocessing/ETL, to allow large scale data shuffling, and to permit fast parallel I/O.

Let's look at a big problem to illustrate how this works. Let's say you have a 300TB dataset (like YT8m at HD resolution) stored on rotational drives each capable of 150 Mbytes/s bandwidth, and you want to run a training job using 100 GPUs, each requiring 450 Mbytes/s of data. This means that in order to keep the GPUs busy, you need at least 3 dedicated drives per GPU, or 300 drives total, and you need at least 300 shards in order to enable parallel reading across all drives. You would usually pick a larger number of shards and a somewhat larger number of drives, both to get better shuffling and to be more robust to drives that are temporarily underperforming (e.g., because of file system maintenance).

This may sound complicated, but it really isn't in practice. A server like AIStore will automate distribution of shards; for smaller datasets with high I/O bandwidth requirements, AIStore will also transparently replicate shards to meet those requirements. If you're running in the cloud, cloud storage providers implicitly do something similar. As a user, you usually just operate on a set of shards as a whole, with URLs like gs://bucket/imagenet-train-{0000..0147}.tar. Since this is the same brace syntax used by the shell, many command line operations will also "just work" with these shard specs.
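
For illustration, a small sketch of how such a shard spec expands (the braceexpand PyPI package used here mirrors shell brace expansion and is not required by WebDataset itself):

from braceexpand import braceexpand

urls = list(braceexpand("gs://bucket/imagenet-train-{0000..0147}.tar"))
print(len(urls))   # 148 shard URLs
print(urls[0])     # gs://bucket/imagenet-train-0000.tar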

There is more information in these videos

@tmbdev
Author

tmbdev commented May 15, 2020

Thanks for the RFC. This would provide a good speedup even on not-so-big datasets. And it wouldn't only be useful for training but also for updating an existing dataset, where the "many small file" problem is doubled, as we have to both download/upload and unpack/pack.

Yes, it's also useful on smaller problems. For example, on my desktop, I store ImageNet and other datasets as shards on a 10TB external rotational drive, and WebDataset allows that drive to operate at full I/O bandwidth.

@jerryzh168 jerryzh168 added feature A request for a proper, new feature. module: dataloader Related to torch.utils.data.DataLoader and Sampler triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels May 15, 2020
@soumith
Member

soumith commented May 18, 2020

@tmbdev some updates.

We are pretty sure we want this in PyTorch core. The only point we are discussing is when.

In the short term, the first set of items we should take up is to immediately point some resources to your repository to let it gain some steam and adoption; it is fairly featureful and very useful to many.

I'll give you a more detailed update later this week.

@maxluk

maxluk commented May 20, 2020

@tmbdev One question on design. The notation below creates separate client processes for download. How does it reconcile with the num_workers parameter in DataLoader? It sounds like two ways to control subprocesses for data loading; is that really needed?

# load from a web server using a separate client process
shardurl = "pipe:curl -s http://server/imagenet/imagenet-train-{0000..0147}.tar"

dataset = wds.Dataset(shardurl)

@jotaf98

jotaf98 commented Jun 8, 2020

WebDataset has a lot of nice abstractions, and I'm keen to use them all, but the core functionality (90% of why people would want it) is simply a Dataset that reads jpg/png images (and possibly associated json/text files - turned into dict and str, respectively) from a tar file.

With this in mind, wouldn't it be possible to include this simple (extremely useful!) class first, and worry about an overhaul of PyTorch's data abstractions afterwards?

It just doesn't seem necessary to consider the data abstractions as a blocker for the Tar dataset functionality.

@tmbdev
Author

tmbdev commented Jun 12, 2020

@maxluk The "pipe:" notation creates a single subprocess purely for asynchronous I/O operations for a Dataset instance. All the decoding and augmentation are still handled by whatever thread/process the Dataset instance runs in. That is, usually, you still use multiple Dataset instances with DataLoader, and each gets its own I/O subprocess.

Things are handled this way for a couple of reasons. Let's look at HTTP. The alternative to using pipe:curl would be to use Python's built-in HTTP client library. But that is not only less efficient than curl, it blocks. So you really have to use something like asyncio (aiohttp), but that doesn't integrate well with PyTorch. Overall, the use of I/O subprocesses is the fastest way we have found of performing I/O from Python. And as an added bonus, it is also very flexible and works with all major object stores, cloud providers, and network file systems.

However, you can fully customize I/O for any URL scheme using the webdataset.gopen_schemes dictionary; it maps schemes to Python functions.
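
A hedged sketch of registering a custom scheme (the gopen_schemes dictionary is described above, but the handler signature and my_fetch are assumptions for illustration):

import io
import webdataset as wds

def gopen_myproto(url, mode="rb", bufsize=8192):
    # hypothetical handler for "myproto://..." URLs; my_fetch is a
    # user-supplied fetch function, not part of WebDataset
    data = my_fetch(url)
    return io.BytesIO(data)

wds.gopen_schemes["myproto"] = gopen_myproto

# dataset = wds.Dataset("myproto://bucket/imagenet-train-{0000..0147}.tar")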

There are/will be other implementations. Down the line, for PyTorch, you will be able to use WebDataset with NVIDIA's DALI, which performs all I/O and data augmentation using multithreading. For Golang, we already have a multithreaded implementation of most of WebDataset. The companion Tensorcom library uses ZMQ or RDMA for communicating with PyTorch and requires neither multiprocessing nor multithreading.

@tmbdev
Author

tmbdev commented Jun 12, 2020

@jotaf98 The WebDataset library does not attempt to overhaul PyTorch's abstractions; you can pretty much substitute WebDataset for Dataset in any DataLoader, since WebDataset is just an IterableDataset implementation. WebDataset will also do something sensible with DataLoader multiprocessing (it splits up the shards among workers). All you need to know about is webdataset.Dataset.

There are a few utility classes: ResizedDataset changes the epoch size and nominal size of a dataset, and ShardWriter lets you write tar files.
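
A hedged sketch of writing shards with ShardWriter (the maxcount argument and the %04d shard-numbering pattern are taken from the WebDataset docs; my_samples() is a hypothetical data source):

import webdataset as wds

with wds.ShardWriter("imagenet-train-%04d.tar", maxcount=10000) as sink:
    for key, image_bytes, label in my_samples():
        sink.write({
            "__key__": key,                     # shared basename for the sample
            "jpg": image_bytes,                 # written as <key>.jpg
            "cls": str(label).encode("utf-8"),  # written as <key>.cls
        })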

@maxluk

maxluk commented Jun 12, 2020

@tmbdev Thank you for the detailed response. It makes sense: you use processes for I/O pipelining. In the context of large tar files I am sure the extra processes don't create significant overhead, so it's an efficient solution. Quite ingenious. :-)

In terms of forward-looking architecture, I agree with you that there is a place for something more sophisticated; creating N * M processes (N dataloaders, M files per dataloader) is archaic. There are many examples of data frameworks where this problem is solved without subprocesses: PySpark, Dask, and (it sounds like) DALI as well. I am looking forward to a native solution for PyTorch based on a similar approach.

@jotaf98

jotaf98 commented Jun 13, 2020

@tmbdev :

The WebDataset library does not attempt to overhaul PyTorch's abstractions

Yes, I understood that, thanks! I was responding to @cpuhrsch 's suggestion that

We need to pry these components apart and land them as separate abstractions. (...)
I'm in favor of keeping this a separate library until we have come up with a full proposal for a revamp.

Which would slow things down a lot unnecessarily, when the main feature of reading TAR files works just fine as a Dataset, and could be added as-is today.

@byronyi

byronyi commented Jul 2, 2020

PyTorch DataLoader is a pain in the ass for any data that does not reside in a mounted file system. It encourages users to store a huge number of small files, which is effectively a DoS attack on the storage system itself.

The whole multi-worker pre-processing design wastes countless CPU cycles and even GPU hours because of frequent OOMs in worker processes.

It’s a nightmare for infrastructure SREs to keep the training platform functioning for all these broken data pipelines written by PyTorch users.

Maybe an average PyTorch user only uses low to medium end GPUs (so no need to feed 10k image/s into an 8-V100 DGX-1) and their datasets fit in a local NVMe SSD.

@ngimel
Collaborator

ngimel commented Jul 10, 2020

cc @VitalyFedyunin

@jotaf98

jotaf98 commented Apr 19, 2021

In the short term, the first set of items we should take up is to immediately point some resources to your repository to let it gain some steam and adoption; it is fairly featureful and very useful to many.

Wondering if this made it to the resources/docs?

@tmbdev
Author

tmbdev commented Apr 19, 2021

@maxluk Just some more thoughts about this.

In terms of forward-looking architecture, I agree with you that there is a place for something more sophisticated; creating N * M processes (N dataloaders, M files per dataloader) is archaic.

A technically better solution is to implement a multithreaded, single-process loader, but you can't do that in Python. We have something in C++ that works with Tensorflow 1.x but haven't ported it yet.

There are many examples of data frameworks where this problem is solved without subprocesses: PySpark, Dask, and (it sounds like) DALI as well. I am looking forward to a native solution for PyTorch based on a similar approach.

PySpark uses Java multithreading, a different language. Dask is hamstrung by Python in the same way PyTorch is. DALI solves this by writing a C++ extension (and will support WebDataset). None of those are ideal solutions.

Python is a great language with extensive libraries; the price you pay for using it right now is that you have to work around the GIL by using multiprocessing. For all the ugliness of DataLoader, it can give excellent performance.

You also have the option of using Tensorcom, which lets you factor out I/O pipelines and augmentation completely from your DL jobs. This gives you dynamic scaling, better scalability, easier debugging, and distributed data augmentation.

@carlthome

Maybe an average PyTorch user only uses low to medium end GPUs (so no need to feed 10k image/s into an 8-V100 DGX-1) and their datasets fit in a local NVMe SSD.

@byronyi, I think that's precisely right, but that's a good thing. That PyTorch prioritized gigabyte scale rather than terabyte scale seems smart to me, considering the vast bulk of useful ML tasks just waiting for a developer.

Recent NLP research makes it sound like one would need a distributed GPU cluster to do anything meaningful today, but that's not the case. Many models turn out fine from on-disk datasets on commodity hardware, and hopefully algorithm researchers consider the long-term effects of not being mindful of compute needs, lest training ML models become prohibitively expensive. ML gets better if its development can be made accessible to everyone. 🤞

With that said, this RFC looks amazing and as a very happy tf.data user who streams TFDS datasets from GCS buckets without having to worry about GPU starvation, I really hope PyTorch gets a similar feature (the lack of which is holding me back from migrating workloads from TensorFlow to PyTorch). 🥳

@jotaf98

jotaf98 commented Jun 7, 2021

Adding to @carlthome's point that on-disk datasets are important for accessibility (outside of the most well-provisioned labs), I thought I'd mention that I implemented the same idea but as a map-style Dataset:

https://github.com/jotaf98/simple-tar-dataset

This supports random access, as opposed to IterableDataset's sequential access.

This is more relevant for the single-machine/local-disk access use case mentioned above, and many simplifications are possible (the code itself is only a few dozen lines and is very flexible).

Despite the superficial similarity (reading from TAR archives), it's very complementary to this proposal, since sequential and random access patterns require very different implementations with almost no overlap.
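
For readers curious what the random-access variant looks like, here is a minimal sketch of the idea (not the simple-tar-dataset code; a real implementation also has to reopen the archive per DataLoader worker):

import io
import tarfile
from PIL import Image
from torch.utils.data import Dataset

class TarImageDataset(Dataset):
    """Index a local tar archive once, then serve samples by position."""

    def __init__(self, path, transform=None):
        self.tar = tarfile.open(path)  # seekable, random-access mode
        self.members = [m for m in self.tar.getmembers() if m.name.endswith(".jpg")]
        self.transform = transform

    def __len__(self):
        return len(self.members)

    def __getitem__(self, index):
        data = self.tar.extractfile(self.members[index]).read()
        image = Image.open(io.BytesIO(data)).convert("RGB")
        return self.transform(image) if self.transform else image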

@tadejsv

tadejsv commented Oct 28, 2021

@VitalyFedyunin Any progress on this? At what point can we expect this to become part of Pytorch?

@rom1504

rom1504 commented Dec 1, 2021

How would you say https://github.com/NVIDIA/AIStore and tar compare with https://github.com/uber/petastorm and parquet ?
it seems parquet is supported by more tools (spark, pandas, dask) than tar

@tmbdev
Author

tmbdev commented Dec 6, 2021

AIStore can store and process any format data efficiently, including tar files, TFRecord files, and Parquet.

Parquet is fine for what it is: a store for columnar data, and if that's what your data is like, you may want to use it.

The use of .tar files is convenient and natural for datasets where samples are individual files and/or involve multimedia data. By storing such data inside .tar files, you get compression, metadata standards, and tons of tools.

Because .tar files are so ubiquitous, you don't need special library support. You can easily write map-reduce-style jobs with just command line tools, or with Dask or Ray using only standard Python libraries. The ubiquity of .tar files and tar libraries is another thing that makes them attractive.

@dimentary

Is this still planned to be integrated?

@austinmw

Hi, also curious what the future plans are for this.

@Luciennnnnnn

@tmbdev Hi, any progress?

@tmbdev
Author

tmbdev commented May 9, 2023

The main reader is integrated into torchdata. The torchdata team still plans on integrating some utility functions for decoding and renaming fields, but none of those are necessary; they are just for convenience. You can simply put that functionality in a map function.
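
A hedged sketch of what reading the format with torchdata datapipes might look like (FileLister, FileOpener, load_from_tar, and groupby are taken from the torchdata docs of the time; check the current documentation, since torchdata has evolved):

import os
from torchdata.datapipes.iter import FileLister, FileOpener

def sample_key(item):
    path, _stream = item
    return os.path.splitext(os.path.basename(path))[0]

dp = FileLister("/imagenet", masks="imagenet-train-*.tar")
dp = FileOpener(dp, mode="b").load_from_tar()  # yields (path, stream) per tar member
dp = dp.groupby(sample_key, group_size=2, guaranteed_group_size=2)

for group in dp:
    sample = {os.path.splitext(path)[1]: stream.read() for path, stream in group}
    # sample[".jpg"] holds the raw image bytes, sample[".cls"] the class label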

@jotaf98

jotaf98 commented May 18, 2023

@tmbdev That's great, can you point us to where this is documented? Thanks!
