[RFC] Add tar-based IterableDataset implementation to PyTorch #38419
Comments
Thanks @tmbdev. We will review and get back to you.
This seems like a very appealing RFC. I am generally supportive. Can you talk about how you plan to integrate it into PyTorch core?
Can you talk about whether you or others are volunteering to maintain it, responding to feature requests and issues wrt this Dataset?
Thanks for the RFC. I haven't read enough details yet, but I just want to say thank you for the detailed request and for making us aware of a great library and a great use of IterableDataset. I may look at this in more depth and express my thoughts later. But either way, I'd be happy to assist review if this makes its way into core.
Thank you for this RFC! This looks great! In particular I'm very happy to see that it interacts so well with abstractions we already have. Here are some initial comments:

In general this seems like adding a component that could, maybe with a few more extensions (indexing for constrained random access), represent a revamp of PyTorch's data functionalities. There are data source abstractions (path, cloud storage, web, etc.), generic Python functionalities (reading and decoding tar files on the fly), shuffling schemes (inter- and intra-shard shuffling), transforms (map_dict, etc.) and a fully-fledged interface (fluent API).

If we want to integrate this into our current abstractions, I think we need to pry these components apart and land them as separate abstractions that can be considered orthogonal to what we currently have. Otherwise, we should consider a revamp of our data abstractions and could presumably use much of the code here to supplement that. I'm in favor of keeping this a separate library until we have come up with a full proposal for a revamp. Otherwise this seems like adding an independent, large component to PyTorch that can easily stand on its own.

Having said that, I still agree with the reasons why we would want to add something like this to the core. I absolutely agree that we lack a system that provides e.g. efficient data storage for many small files or interfaces well with internet-based, high-latency, read-only storage solutions.
Great!
I haven't looked at it in detail, but from a user's point of view,
I'm happy to maintain and support it (and I'm using the library for a lot of my work). In general, NVIDIA sees fast access to storage as an important enabler for large scale deep learning (cf. the Mellanox and SwiftStack acquisitions and the AIStore server), so there is corporate backing, and strong backing for open source efforts.
WebDataset is actually derived from a previous library called
My suggestion for a roadmap would be the following:
So, the long and the short of it is: WebDataset is useful and mature enough as is, and I think there are substantial benefits to incorporating it into PyTorch now and defining a dataset format standard that people can build on. In addition, we also have a roadmap for evolving from here to a "revamped" I/O system; we can get there in steps and without breaking existing code. Once both WebDataset and Tensorcom are in place, we can deprecate and eventually remove old classes that aren't needed anymore. What do you think? Does that sound like a good plan?
Could you please point out in the code how random access (to individual training examples) and len are handled? I believe most published datasets do not have an index file generated for their tar archives. In my understanding, without random access one can't use a shuffling sampler. I see
Thanks for the RFC. This would provide a good speedup even on not-so-big datasets. And it wouldn't only be useful for training but also for updating an existing dataset, where the "many small files" problem is doubled, as we have to download/upload or unpack/pack.
WebDataset is set up so that for regular training on shuffled data, you don't need either random access or len. Instead, it relies on a combination of shard shuffling and inline shuffling. That's what allows it to provide shuffled training data to GPUs while using streaming, sequential I/O.
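To make the inline-shuffling idea concrete, here is a minimal, library-independent sketch of a streaming shuffle buffer (not WebDataset's actual code): it holds a bounded buffer of samples and yields a random element as new samples stream in, giving approximate shuffling without any random access to storage.

```python
import random

def shuffle_buffer(iterator, bufsize=1000):
    """Approximately shuffle a stream using a bounded buffer."""
    buf = []
    for sample in iterator:
        buf.append(sample)
        if len(buf) >= bufsize:
            # swap a random element to the end and yield it
            i = random.randrange(len(buf))
            buf[i], buf[-1] = buf[-1], buf[i]
            yield buf.pop()
    # drain the remaining buffered samples in random order
    random.shuffle(buf)
    yield from buf
```

Combined with shuffling the order of the shards themselves, this is what lets purely sequential readers deliver shuffled training data.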
You can also avoid this problem by not using (We're currently looking into modifying
Shards are used for three purposes: to support parallel map-reduce operations during preprocessing/ETL, to allow large scale data shuffling, and to permit fast parallel I/O.

Let's look at a big problem to illustrate how this works. Let's say you have a 300 TB dataset (like YT8m at HD resolution) stored on rotational drives, each capable of 150 MB/s bandwidth, and you want to run a training job using 100 GPUs, each requiring 450 MB/s of data. This means that in order to keep the GPUs busy, you need at least 3 dedicated drives per GPU, or 300 drives total, and you need at least 300 shards in order to enable parallel reading across all drives. You would usually pick a larger number of shards and a somewhat larger number of drives, both to get better shuffling and to be more robust to drives that are temporarily underperforming (e.g., because of file system maintenance).

This may sound complicated, but it really isn't in practice. A server like AIStore will automate distribution of shards; for smaller datasets with high I/O bandwidth requirements, AIStore will also transparently replicate shards for you to meet the I/O bandwidth requirements. If you're running in the cloud, cloud storage providers implicitly do something similar. As a user, you usually just operate on a set of shards as a whole with URLs like
There is more information in these videos
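For reference, here is the arithmetic from the example above, spelled out (the numbers come from the paragraph, not from measurements):

```python
# All rates in MB/s; figures taken from the example above.
gpu_rate = 450        # data rate each GPU needs to stay busy
drive_rate = 150      # sequential read bandwidth per rotational drive
num_gpus = 100

drives_per_gpu = gpu_rate / drive_rate      # 3.0
total_drives = num_gpus * drives_per_gpu    # 300.0
min_shards = total_drives                   # one shard per drive for fully parallel reads
print(drives_per_gpu, total_drives, min_shards)
```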
Yes, it's also useful on smaller problems. For example, on my desktop, I store ImageNet and other datasets as shards on a 10 TB external rotational drive, and WebDataset allows that drive to operate at its full I/O bandwidth.
@tmbdev some updates. We are pretty sure we want this in PyTorch core. The only point we are discussing is when. In the short term, the first set of items we should take up is to immediately point some resources to your repository to let it gain some steam and adoption; it is fairly featureful and very useful to many. I'll give you a more detailed update later this week.
@tmbdev One question on design. The notation below creates separate client processes for download. How does it reconcile with the num_workers parameter in DataLoader? It sounds like two ways to control sub-processes for data loading? Is it really needed?

```python
# load from a web server using a separate client process
shardurl = "pipe:curl -s http://server/imagenet/imagenet-train-{0000..0147}.tar"
dataset = wds.Dataset(shardurl)
```
WebDataset has a lot of nice abstractions, and I'm keen to use them all, but the core functionality (90% of why people would want it) is simply a Dataset that reads jpg/png images (and possibly associated json/text files - turned into dict and str, respectively) from a tar file. With this in mind, wouldn't it be possible to include this simple (extremely useful!) class first, and worry about an overhaul of PyTorch's data abstractions afterwards? It just doesn't seem necessary to consider the data abstractions as a blocker for the Tar dataset functionality.
@maxluk The "pipe:" notation creates a single subprocess purely for asynchronous I/O operations for a Dataset instance. All the decoding and augmentation are still handled by whatever thread/process the Dataset instance runs in. That is, usually, you still use multiple Dataset instances with DataLoader, and each gets its own I/O subprocess. Things are handled this way for a couple of reasons. Let's look at HTTP. The alternative to using pipe:curl would be to use Python's built-in HTTP client library. But that is not only less efficient than curl, it blocks. So you really have to use something like asyncio (aiohttp), but that doesn't integrate well with PyTorch. Overall, the use of I/O subprocesses is the fastest way we have found of performing I/O from Python. And as an added bonus, it is also very flexible and works with all major object stores, cloud providers, and network file systems. However, you can fully customize I/O for any URL scheme using the webdataset.gopen_schemes dictionary; it maps schemes to Python functions. There are/will be other implementations. Down the line, for PyTorch, you will be able to use WebDataset with NVIDIA's DALI, which performs all I/O and data augmentation using multithreading. For Golang, we already have a multithreaded implementation of most of WebDataset. The companion Tensorcom library uses ZMQ or RDMA for communicating with PyTorch and requires neither multiprocessing nor multithreading. |
@jotaf98 The WebDataset library does not attempt to overhaul PyTorch's abstractions; you can pretty much substitute WebDataset for Dataset in any DataLoader, since it is an IterableDataset. There are a few utility classes:
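A minimal sketch of that substitution (the URL and parameters are placeholders; the wds.Dataset name follows the RFC-era API):

```python
import webdataset as wds
from torch.utils.data import DataLoader

url = "pipe:curl -s http://server/imagenet/imagenet-train-{0000..0147}.tar"
dataset = wds.Dataset(url)

# batch_size=None: iterate raw samples without collation; the library
# splits shards across the 4 workers itself (see shard_selection below).
loader = DataLoader(dataset, batch_size=None, num_workers=4)
for sample in loader:
    break
```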
@tmbdev Thank you for the detailed response. It makes sense; you use processes for I/O pipelining. In the context of large tar files I am sure extra processes don't create significant overhead, so it's an efficient solution. Quite ingenious. :-)

In terms of forward-looking architecture, I agree with you that there is a place for something more sophisticated; creating N * M processes (N dataloaders, M files per dataloader) is archaic. There are many examples of data frameworks where this problem is solved without subprocesses: PySpark, Dask (and, it sounds like, DALI as well). I am looking forward to a native solution for PyTorch based on a similar approach.
@tmbdev :
Yes, I understood that, thanks! I was responding to @cpuhrsch's suggestion that
Which would slow things down a lot unnecessarily, when the main feature of reading TAR files works just fine as a Dataset, and could be added as-is today.
PyTorch's DataLoader is a pain in the ass for any data not residing in a mounted file system. It encourages users to store a huge number of small files, which is effectively a DoS attack on the storage system itself. The whole multi-worker pre-processing design causes countless wasted CPU cycles and even GPU hours because of frequent OOMs of worker processes. It's a nightmare for infrastructure SREs to keep the training platform functioning for all these broken data pipelines written by PyTorch users. Maybe an average PyTorch user only uses low- to medium-end GPUs (so there is no need to feed 10k images/s into an 8-V100 DGX-1) and their datasets fit on a local NVMe SSD.
Wondering if this made it to the resources/docs?
@maxluk Just some more thoughts about this
A technically better solution is to implement a multithreaded, single-process loader, but you can't do that in Python. We have something in C++ that works with Tensorflow 1.x but haven't ported it yet.
PySpark uses Java multithreading, a different language. Dask is hamstrung by Python in the same way PyTorch is. DALI solves this by writing a C++ extension (and will support WebDataset). None of those are ideal solutions. Python is a great language with extensive libraries; the price you pay for using it right now is that you have to work around the GIL by using multiprocessing. For all the ugliness of DataLoader, it can give excellent performance.

You also have the option of using Tensorcom, which lets you factor out I/O pipelines and augmentation completely from your DL jobs. This gives you dynamic scaling, better scalability, easier debugging, and distributed data augmentation.
@byronyi, I think that's precisely right, but that's a good thing. That PyTorch prioritized gigabyte scale rather than terabyte scale seems smart to me, considering the vast bulk of useful ML tasks just waiting for its developers. Recent NLP research makes it sound like one would need a distributed GPU cluster to do anything meaningful today, but that's not the case. Many models turn out fine from on-disk datasets on commodity hardware, and hopefully algorithm researchers consider the long-term effects of not being mindful of compute needs, lest training ML models turn prohibitively expensive. ML gets better if its development can be made accessible to everyone. 🤞 With that said, this RFC looks amazing and as a very happy
Adding to @carlthome's mention that on-disk datasets are important for accessibility (outside of the most well-provisioned labs), I thought I'd mention that I implemented the same idea, but as a map-style Dataset: https://github.com/jotaf98/simple-tar-dataset

This supports random access, as opposed to sequential (iterable) access. This is more relevant for the mentioned single-machine/local disk access use case, and many large simplifications are possible (the code itself is only a few dozen lines and is very flexible).

Despite the superficial similarity (reading from TAR archives), it's very complementary to this proposal, since sequential and random access patterns require very different implementations with almost no overlap.
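For contrast with the streaming approach, here is a minimal sketch of a random-access (map-style) tar dataset in the spirit of simple-tar-dataset; this is not its actual code, and it assumes single-process use, since an open tarfile handle is not safely shareable across DataLoader workers.

```python
import tarfile
from PIL import Image
from torch.utils.data import Dataset

class TarImageDataset(Dataset):
    """Random access into a local, uncompressed tar of images."""

    def __init__(self, path):
        self.tar = tarfile.open(path)  # seekable, non-streaming mode
        # building this member index is exactly what streaming readers avoid
        self.members = [m for m in self.tar.getmembers() if m.isfile()]

    def __len__(self):
        return len(self.members)

    def __getitem__(self, idx):
        f = self.tar.extractfile(self.members[idx])
        return Image.open(f).convert("RGB")
```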
@VitalyFedyunin Any progress on this? At what point can we expect this to become part of PyTorch?
How would you say https://github.com/NVIDIA/AIStore and tar compare with https://github.com/uber/petastorm and parquet ?
AIStore can store and process data in any format efficiently, including tar files, TFRecord files, and Parquet. Parquet is fine for what it is: a store for columnar data, and if that's what your data is like, you may want to use it.

The use of .tar files is convenient and natural for datasets where samples are individual files and/or involve multimedia data. By storing such data inside .tar files, you get compression, metadata standards, and tons of tools. Because .tar files are so ubiquitous, you don't need special library support: you can write map-reduce-style jobs easily using just command line tools, or with Dask or Ray using standard Python libraries. That ubiquity, of both .tar files and tar libraries, is another thing that makes them attractive.
Is this still planned to be integrated?
Hi, also curious what the future plans are for this.
@tmbdev Hi, any progress?
The main reader is integrated in torchdata. The torchdata team still plans on integrating some utility functions for decoding and renaming fields, but none of those are necessary; they are just for convenience. You can just put that functionality in a map function.
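For example, a sketch of doing the decoding in a map function, using the standalone webdataset library's WebDataset/.map API; the shard URL is a placeholder and the field names assume the usual .jpg/.json sample layout:

```python
import io
import json
import webdataset as wds
from PIL import Image

def decode_sample(sample):
    # decode/rename the raw byte fields ourselves instead of relying
    # on library convenience helpers
    image = Image.open(io.BytesIO(sample["jpg"])).convert("RGB")
    meta = json.loads(sample["json"])
    return image, meta

dataset = wds.WebDataset("shards-{0000..0009}.tar").map(decode_sample)
```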
@tmbdev That's great, can you point us to where this is documented? Thanks!
Problem
As datasets become larger and larger, storing training samples as individual files becomes impractical and inefficient. This can be addressed using sequential storage formats and sharding (see "Alternatives Considered" for other implementations). PyTorch lacks such a common storage format right now.
Proposal
WebDataset provides an implementation of IterableDataset based on sharded tar archives. This format provides efficient I/O for very large datasets, makes migration from file-based I/O easy, and works well locally, with cloud storage, and with web servers. It also provides a simple, standard format in which large datasets can be distributed easily and used directly without unpacking. The implementation is small (1000-1800 LOC) and has no external dependencies.

The proposal is to incorporate the webdataset.Dataset class into the PyTorch base distribution. The source repository is here: http://github.com/tmbdev/webdataset
My suggestion would be to incorporate the library with minimal changes into its own subpackage. I can perform the integration and generate a PR once there is general agreement.
More Background
For a general introduction to how we handle large scale training with WebDataset, see this YouTube playlist
The WebDataset library (github.com/tmbdev/webdataset) provides an implementation of IterableDataset that uses POSIX tar archives as its native storage format. The format itself is based on a simple convention: files that belong to the same training sample share the same basename and are stored adjacently in the archive. For example, ImageNet is stored in 147 separate 1 GB shards with names imagenet-train-0000.tar to imagenet-train-0147.tar; the contents of the first shard are:
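The shard listing itself did not survive extraction; assuming the usual pairing of a class file and an image file per sample, an excerpt might look like this (file names hypothetical):

```
$ tar tf imagenet-train-0000.tar | head -6
n01440764_10026.cls
n01440764_10026.jpg
n01440764_10027.cls
n01440764_10027.jpg
n01440764_10029.cls
n01440764_10029.jpg
```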
; the contents of the first shard are:Datasets in WebDataset format can be used directly after downloading without unpacking; they can also be mounted as a file system. Content in WebDataset format can used any file-based compression scheme. In addition, the tar file itself can also be compressed and WebDataset will transparently decompress it.
WebDatasets can be used directly from local disk, from web servers (hence the name), from cloud storage, and from object stores, just by changing a URL.
WebDataset readers/writers are easy to implement (we have Python, Golang, and C++ implementations).
WebDataset performs shuffling both at the shard level and at the sample level. Splitting of data across multiple workers is performed at the shard level using a user-provided shard_selection function that defaults to a function that splits based on get_worker_info. (WebDataset can be combined with the tensorcom library to offload decompression/data augmentation and provide RDMA and direct-to-GPU loading; see below.)

We are storing and processing petabytes of training data as tar archives and are using the format for unsupervised learning from video, OCR and scene text recognition, and large scale object recognition experiments. The same code and I/O pipelines work efficiently on the desktop, on a local cluster, or in the cloud. Our benchmarks show scalability and the ability to take advantage of the full I/O bandwidth of each disk drive across very large training jobs and storage clusters.
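A sketch of what such a shard_selection function can look like; the function body mirrors the default described above (splitting by get_worker_info), but the exact keyword name and default are assumptions:

```python
import torch.utils.data

def split_by_worker(urls):
    # Give each DataLoader worker a disjoint subset of the shard URLs,
    # so workers do not duplicate each other's data.
    info = torch.utils.data.get_worker_info()
    if info is None:
        return urls  # single-process loading: keep all shards
    return urls[info.id::info.num_workers]

# hypothetical usage: dataset = wds.Dataset(urls, shard_selection=split_by_worker)
```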
Code Sample
This shows how to use WebDataset with ImageNet.
WebDataset uses a fluent API for configuration that internally builds up a processing pipeline. Without any added processing stages, WebDataset just iterates through each training sample as a dictionary:
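The original code sample was lost in extraction; here is a sketch in the spirit of the RFC-era API (stage names like shuffle/decode/to_tuple follow early webdataset usage and may differ in later releases; the URL is a placeholder):

```python
import webdataset as wds

url = "imagenet-train-{0000..0147}.tar"

# Without any added stages, each sample is a dict keyed by file suffix:
for sample in wds.Dataset(url):
    print(sample["__key__"], sorted(sample.keys()))
    break

# A fluent pipeline: inline shuffle, decode images, select fields.
dataset = (
    wds.Dataset(url)
    .shuffle(1000)                # streaming shuffle buffer
    .decode("pil")                # decode images to PIL
    .to_tuple("jpg;png", "cls")   # pick fields out of the sample dict
)
```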
Alternatives Considered
Keeping WebDataset as a Separate Library
WebDataset is perfectly usable as a separate library, so why not keep it that way?
A tar-based IterableDataset in PyTorch provides a common reference against which to address issues and provide test cases in the DataLoader implementation and its use of IterableDataset.
Many of the larger datasets distributed in torchvision could be distributed easily in WebDataset format. This would allow users to either train directly against web-hosted data, or to train on the datasets immediately after downloading, without unpacking. The way data is arranged in WebDataset also allows users to download just a few shards for testing code locally, and then use the entire dataset when running on a cluster. Furthermore, because of the way webdataset.Dataset works, in most cases no special code is needed in order to read them; many training jobs can be retargeted to different datasets simply by using a different URL for the dataset.

TFRecord+protobuf, Parquet
These formats are suitable for large scale data processing for machine learning and deep learning applications; some datasets exist in this format and more will continue to be generated for the Tensorflow ecosystem. However, they are not good candidates for incorporation into PyTorch as a core feature because:
tarp.

Note that WebDataset supports usage scenarios similar to TFRecord+protobuf, since serialized data structures can be incorporated as files, and WebDataset will decode them automatically. For example, OpenImages multi-instance data is simply stored in a .json file accompanying each .jpg file:
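The inline example was lost in extraction; a hypothetical sample pairing might look like this (key and fields invented for illustration):

```
0a1b2c3d.jpg
0a1b2c3d.json    # e.g. {"objects": [{"label": "...", "bbox": [x0, y0, x1, y1]}, ...]}
```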
zip instead of tar
The zip format is another archival format. Unlike the tar format, which is just a sequence of records, the zip format stores a file index at the very end of the file, making it unsuitable for streaming. Tar files can be made random access (and, in fact, can be mounted as file systems), but they use a separate index file to support that functionality.
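The record-sequence property is easy to see with Python's standard tarfile module, which can iterate over a non-seekable stream in "r|" mode; the zip format's end-of-file index rules this pattern out. A minimal sketch (file name is a placeholder):

```python
import tarfile

# Stream a tar archive member-by-member without ever seeking.
with open("imagenet-train-0000.tar", "rb") as stream:
    with tarfile.open(fileobj=stream, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                data = tar.extractfile(member).read()
```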
LMDB, HDF5, Databases
These formats are not suitable for streaming and require the entire dataset to fit onto local disk. In addition, while they nominally solve the "many small files" problem, they don't solve the problem that indexing into the dataset still results in expensive seek operations.
Local File System Caches
An approach for extending file-system based I/O pipelines to large distributed storage systems is to use some form of "pre-caching" or "staging" on a local NVMe drive. Generally, there is little advantage to this. For large datasets, it does not increase throughput. Input pipelines still need to be modified to schedule the pre-caching. And generally, this requires volume plugins or virtual file system support. A similar effect can be achieved with WebDataset by simply unpacking shards to the local file system when direct file access is required.
Related Software
AIStore is an open source object store capable of full bandwidth disk-to-GPU data delivery (meaning that if you have 1000 rotational drives with 200 MB/s read speed, AIStore actually delivers an aggregate bandwidth of 200 GB/s to the GPUs). AIStore is fully compatible with WebDataset as a client, and in addition understands the WebDataset format, permitting it to perform shuffling, sorting, ETL, and some map-reduce operations directly in the storage system.
tarp is a small command line program for splitting, merging, shuffling, and processing tar archives and WebDataset datasets.
tensorcom is a library supporting distributed data augmentation and RDMA to GPU.
webdataset-examples contains an example (and soon more examples) of how to use WebDataset in practice.
Bigdata 2019 Paper with Benchmarks
cc @ssnl