[RFC] DataLoader architecture updates and TarDataset implementation #49440
Comments
@VitalyFedyunin Is it possible to update the RFC with a simple example of what the dataloader would look like while training with DDP where we need to shard data? |
Reminder that https://github.com/pytorch/rfcs/ is a thing, and it might be easier to make in-depth comments on the RFC there |
Answering the sharding question. Let's say we have a dataset with access to some database of N elements.

```python
class IterateableDataset:
    pass

class NumbersIteratableDataset(IterateableDataset):
    def __init__(self, range = 100):
        self.range = range

    def is_shardable(self):
        return True

    def sharding_settings(self, total_num_of_shards, id_of_shard, seed = 0):
        # Called before __iter__
        self.total_num_of_shards = total_num_of_shards
        self.id_of_shard = id_of_shard
        self.seed = seed

    def __iter__(self):
        # Each shard walks the dataset with a stride equal to the shard count
        for i in range(self.id_of_shard, self.range, self.total_num_of_shards):
            yield (i + self.seed) % self.range
```

Now let's assume X == 3, so we have three consumers (and it doesn't matter how they are separated by physical/logical nodes).

```python
# First
number_of_shards = 3
shard = NumbersIteratableDataset()
shard.sharding_settings(number_of_shards, 0)

# Second
number_of_shards = 3
shard_0 = NumbersIteratableDataset()
shard_0.sharding_settings(number_of_shards, 1)
shard_1 = NumbersIteratableDataset()
shard_1.sharding_settings(number_of_shards, 2)
```

So we basically need to share between machines only the total number of shards and some cross-system incremental id.

```python
for i in iter(shard_1):
    print(i)
# 2, 5, ...., 98
```

It is likely that we will need access to other data on the next epoch; for this we can introduce a seed or epoch number in sharding_settings.

```python
# Next epoch: `shards` is the list of per-consumer datasets,
# `epoch_number` the current epoch
for i, shard in enumerate(shards):
    shard.sharding_settings(number_of_shards, i, seed = 777 * epoch_number)

for i in iter(shard_1):
    print(i)
# 79, 82 .... 97, 0, 3, ... 75
``` |
The alternative option, available when we have only a legacy dataset without sharding support, is to nest it in something like a sharding filter:

```python
class OldDataset:
    def __iter__(self):
        for x in range(100):
            yield "str%d" % x

class ShardedFilterIterableDataset:
    def __init__(self, source_ds):
        self.source_ds = source_ds

    def sharding_settings(self, total_num_of_shards, id_of_shard, seed = 0):
        # Called before __iter__
        self.total_num_of_shards = total_num_of_shards
        self.id_of_shard = id_of_shard
        self.seed = seed

    def __iter__(self):
        # Keep only every total_num_of_shards-th element, offset by shard id
        for i, value in enumerate(self.source_ds):
            if i % self.total_num_of_shards == self.id_of_shard:
                yield value

number_of_shards = 3
legacy_ds = OldDataset()
sharded = ShardedFilterIterableDataset(legacy_ds)
sharded.sharding_settings(number_of_shards, 2)
for x in sharded:
    print(x)
# str2, str5 ....
``` |
Two features I’d suggest:
|
Can you please elaborate on the "Uneven end of data handling (for distributed training)" topic. What exactly are you looking for? Good point about checkpointing; just to be clear, we want to be able to stop training in the 'middle' of an epoch (data loop) and resume it after a cold restart, right? |
Say you have n data loader instances; in many distributed training scenarios, if one instance runs out of data then all have to run out of data. Otherwise the training hangs. |
Similarly, supporting #25162 (comment) |
cc @jph00 for viz and any inputs. Also anyone else from the fast.ai community you think should comment here? |
nit: IteratableDataset -> IterableDataset |
Here are a few thoughts to contribute to this discussion:
I have made a few experiments with data loading before, here is my modest feedback and sorry for the self-advertisement:
Extra minor remarks:
|
@VitalyFedyunin The same question. What if the shards cannot be evenly dispatched to different workers (maybe GPUs), or the number of items in the shards is different? I saw that in pytorch 1.7 the join op can be used to address the problem of uneven datasets. But in the case of sharding, the last worker or GPU may still have a whole shard (which may contain several items) to consume while the others are idle. Does this need more consideration? |
any update? @VitalyFedyunin |
Reposting this sharding question from the google doc version of this RFC:
|
Can you share the link? Is it public? |
Hello! Doc URL is https://docs.google.com/document/d/1ILJQl1QFUWFRbmFW50askG5l4t_8xcvSMbIwHFK6DoU/edit?usp=sharing |
It looks like an ambitious redesign; I'm looking forward to its implementation. In particular, cleaning up the DataLoader and multiprocessing facilities will be very useful. I want to point people at related resources that we have developed for these kinds of representations and processing pipelines.
|
A comment about the design and objectives of the project in general. I think from a software engineering point of view, it is nice to have support for all the Python I/O and concurrency facilities: async, threading, queues, and multiprocessing, and to allow for arbitrary sources and sinks. If you have the software development resources, by all means, it's good to implement that. Keep in mind, though, that any I/O pipeline you write as part of the deep learning job is limited to the CPU cores and PCI bandwidth on the machine with the GPU cards. On 8 or 16 GPU nodes, that greatly limits what you can do. What we do for large scale training jobs is to run the I/O pipelines on separate nodes from the GPU pipelines. There are two common and easy ways of doing that:
Note that this stuff works today and scales linearly (limited only by available hardware). AIStore data augmentation pipelines are very simple for users, since they really just look like opening a dataset; the fact that the dataset is processed on the fly doesn't concern the DL job. We generally schedule Tensorcom-based pipelines using Kubernetes. So, while I'm glad that async and threading are making it into the redesigned I/O pipelines, high performance I/O solutions for large scale deep learning probably just end up running a large number of distributed jobs, each of which is fairly simple and sequential. |
Thank you for the write-up! It moves the design of DataLoader forward and addresses some common problems, so it's a great effort. It's hard to draw conclusions about the solution, though, without end-to-end examples. The proposal is heavy on architectural principles, which is good, but it lacks a description of how the end user will interact with the system and what the end solution will look like. Can you include examples of how user code will look for at least some of the problems you state in the beginning?
|
Thanks, @tmbdev. We are going to put in some time to review the potential of adding AIStore and/or Tensorcom as DataPipes. @maxluk we are currently working on the next iteration of the RFC with extra attention to usage examples; your feedback is important and we will take it into account as well. Answering other questions:
|
I saw the recent updates in torch.utils.data.datapipes and tried to utilize that for my own project. The distributed sharding solution is to add a sampler:

```python
import torch
import torch.distributed as dist
from torch.utils.data import Sampler, get_worker_info

class DistributedPartitionSampler(Sampler[int]):
    def __init__(self, data_source, *, shuffle=True, seed=0):
        super().__init__(data_source)
        self.data_source = data_source
        self.shuffle = shuffle
        self.epoch = 0
        self.seed = seed

    def __iter__(self):
        if self.shuffle:
            # deterministically shuffle based on epoch and seed
            g = torch.Generator()
            g.manual_seed(self.seed + self.epoch)
            indices = torch.randperm(len(self.data_source), generator=g).tolist()
        else:
            indices = list(range(len(self.data_source)))

        worker_info = get_worker_info()
        if worker_info is None:
            num_workers = 1
            worker_rank = 0
        else:
            num_workers = worker_info.num_workers
            worker_rank = worker_info.id

        if not dist.is_initialized():
            gpu_rank = 0
            gpu_size = 1
        else:
            gpu_rank = dist.get_rank()
            gpu_size = dist.get_world_size()

        # Stride over the permuted indices: one slice per (gpu, worker) pair
        skip = max(1, gpu_size * num_workers)
        offset = gpu_rank * num_workers + worker_rank
        shared_index = indices[offset::skip]
        for index in shared_index:
            yield self.data_source[index]

    def set_epoch(self, epoch: int) -> None:
        self.epoch = epoch
```

Here the data source needs to support random access (map-style).
Now I could get different shards on different workers (carefully setting the epoch when training).
|
@zhuzilin
Yep, these are the facts for iter-style DataPipe. But there are still lots of benefits to iter-style datapipes, like reduced memory overhead, better multiprocessing/threading support, etc. And I do think there are ways to work around the problems you just pointed out. For shuffling, you can still have global shuffling by using a buffer that can hold the whole dataset, even though it's not recommended. And, just like you said, you could do global shuffling or sharding or skipping at the beginning of the whole pipeline (over filenames, file handles, web requests, etc.); then you don't necessarily need to do the operation over the actual data.
We have received lots of requests for a map-style datapipe, and we are gathering more feedback about it. We would really appreciate it if you could elaborate on the use cases that you think only map-style can achieve, but iter-style cannot. |
I am actually quite curious about why |
Yes, it will. In the future, you will be able to define when you want batching, before or after merging from shards. |
I second @zhuzilin's comment. The data preprocessing API should not get in the way of what we want to implement, like tf.data does; random access to data should remain readily available. Whether or not it leads to atrocious IO and subpar performance is secondary. Ideally, the new API should allow random access and transparently take advantage of a more IO-friendly sampling strategy, which would be documented in tutorials and selected by default in the dataloader, for example. |
@ejguan Thank you for your reply:) I understand there are great advantages to using iter-style datasets. And it will be much easier to deal with all kinds of data formats or data sources if we switch to it. There are two use cases on my mind at the moment: sharding and caching.
I've been working on improving the performance of
The workaround by the tensorflow community is to shard the file names and read each file without further sharding:

```python
d = tf.data.Dataset.from_tensor_slices(filenames)
d = d.shard(num_workers, worker_index)
d = d.flat_map(lambda x: tf.data.TFRecordDataset(x))
```

instead of the intuitive:

```python
d = tf.data.TFRecordDataset(filenames)
d = d.shard(num_workers, worker_index)
```

This workaround would fail if the user has uneven-sized tfrecord files, or maybe only one large file, which are common in practice. On the other hand, a map-style dataset can easily achieve sharding with
In our scenario, users save some large dataset on a remote file system like ceph, and the latency fluctuation of the FS may affect the training speed a lot. Therefore, we need to cache part of the dataset to local disk to make IO more stable. However, it's hard to implement caching if we only need to store part of the data (the whole dataset is too large (TBs) to be held on the local disk (GBs)), because we cannot check whether a data entry is cached or not. Even if we could, we cannot get the entries that are not cached, since we don't have random access to the dataset. Therefore, we only implement the caching functionality for pytorch, but not tf, at the moment. Again, I need to emphasize that I agree iter-style is a really nice model for most use cases. I just hope we could have some support for the map-style as well. The current dataset and dataloader model supports map-style and I really loved that. Hope we can find a way to keep supporting at least part of it. There are some comments on your reply.
For the iter-style, one needs a pool the same size as the dataset to do global shuffling, while the map-style would only need a function to produce a permutation of the index or directory. An image is several hundred KBs, while a directory is only a couple of bytes. I don't think they have the same memory overhead.
I agree. But I think most datasets do not belong to this category... I think it's reasonable to assume most users are facing datasets at the scale of GBs to TBs instead of gigantic ones that would take GBs just for the index or directory. Even in this case, the map-style can still fall back to a pool of indices, which could hold a lot more records than the data pool of the iter-style. Thank you for your time on this looong comment~ |
@ejguan ListDirFilesIterDataPipe uses a generator to yield shard file names:

```python
def __iter__(self) -> Iterator[str]:
    yield from get_file_pathnames_from_root(self.root, self.masks, self.recursive, self.abspath)

def __len__(self):
    if self.length == -1:
        raise NotImplementedError
    return self.length
```

So my question is how sharding works if all the |
Re: Uneven end of data handling
A common solution is an extra layer of indirection between "shards" and workers (which I guess would be instances of
you might have something like:
I.e. you might have 100,000 rows, 100 shards, and 3 "workers." Each worker handles about 33 shards, which is ~33,000 rows. Then you just have some extra communication overhead: if one worker is idle, you need to move some shards from a busy node to that worker. You can add a rebalancing mechanism so that shards end up being moved to faster workers until they're allocated according to the speed of the hardware (see the sketch below). Going back to the original example, the same thing can probably be achieved just by having a larger number of shards per machine.
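A hypothetical sketch of that indirection, with helper names assumed: a shared queue hands shards to whichever worker is free, which naturally rebalances toward faster workers.

```python
import multiprocessing as mp
import queue

def fill_shard_queue(num_shards):
    # All shard ids go into one shared queue.
    q = mp.Queue()
    for shard_id in range(num_shards):
        q.put(shard_id)
    return q

def iter_worker(shard_queue, read_shard):
    # read_shard(shard_id) -> iterable of rows for that shard (assumed helper).
    # Each worker pulls a new shard whenever it finishes one, so faster
    # workers simply consume more shards and no worker runs dry early.
    while True:
        try:
            shard_id = shard_queue.get_nowait()
        except queue.Empty:
            return
        yield from read_shard(shard_id)
``` |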
@zhuzilin We really appreciate your feedback. For sharding, what's your opinion about the following pipeline?

```python
listfiles_dp = ListFiles(root=data_dir)
shard_dp = Shard(listfiles_dp, num_workers, worker_index)  # shard over file names
loadfiles_dp = LoadFiles(shard_dp)
decoded_images_dp = DecodeImages(loadfiles_dp)
...
```

It definitely requires more knowledge about the architecture from users to add the sharding pipe at the front of the pipeline. We probably need to provide factory pipelines so users can construct them without worrying about where they should attach the sharding pipe. And that's the reason I said the memory overhead would be similar for iter and map in this case: for map-style, even if you do the permutation over indexes, you still have a dict mapping index to filepath or whatever.
It's correct, but it relies heavily on the metadata of the TFRecord file to do the random access. It's not a common case for users when they are using their own file formats like tar, etc.
I am sorry, I am not sure I understand the exact need. Could you elaborate on why an iter-style datapipe doesn't work in this case? |
@npuichigo After |
@ejguan In

```python
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.cache()
dataset = dataset.repeat()
```

The inner logic of

```python
class CacheDataset(Dataset):
    def __init__(self, input_dataset):
        self.cache = Cache()
        self.input_dataset = input_dataset
        self.index = 0

    def __next__(self):
        if self.index < len(self.input_dataset):
            # first epoch: read from the source and fill the cache
            data = next(self.input_dataset)
            self.cache[self.index] = data
        else:
            # subsequent epochs: serve from the cache
            data = self.cache[self.index % len(self.input_dataset)]
        self.index += 1
        return data
```

It will start reusing the cache only after the whole dataset is cached into memory. And that is why we need to put the
However, we cannot cache the whole dataset in most cases, so we hope to cache only part of it, and for the remains that are not cached, we could still read from the

```python
class CacheDataset(Dataset):
    def __init__(self, input_dataset, limit):
        # add a size limit to the cache
        self.cache = Cache(limit=limit)
        self.input_dataset = input_dataset
        self.index = 0

    def __next__(self):
        idx = self.index % len(self.input_dataset)
        if idx in self.cache:
            data = self.cache[idx]
        else:
            # here we need to have random access to the input_dataset
            data = self.input_dataset[idx]
            if not self.cache.full():
                self.cache[idx] = data
        self.index += 1
        return data
```

As the comment in the above code snippet says, when trying to get data directly from the input dataset, we need to rely on random access. (I just found it is still about skipping some records...) And in the original comment, our team was trying to add one more hierarchy to the caching mechanism. To deal with the remote storage system, we will have:
And this will also need the map-style. |
I strongly agree on this and that's why I asked about the |
The work in |
Yes, it will change, as we are going to support mini-batches and will be able to concatenate data from all 'streams'. |
Oh, I see my wording is confusing and needs to be corrected in the document. When I wrote 'shards' I meant the number of consumers, not the number of partitions in the source dataset. For sure, having the number of shards divisible by the number of consumers is preferred for even load, but it is not necessary. |
Thanks, this is exactly where my documentation is incomplete and requires more details to separate shards from workers (and make the API cleaner). |
Just curious what you guys think of InfiniBatch. Does it apply here at all? What's different or missing compared to your vision (I admit, couldn't follow the discussion and idea doc all that well 😅). |
InfiniBatch looks interesting but doesn't fit our criteria of full backward compatibility and support of map-style datasets |
# Add new data pipeline
* Add writers for zip/tar/chunk based shard formats to be compatible with webdataset and TorchSpeech
* Abstract shard writer to support sharding when combined with zip/tar/chunk writers
* Make **FeatureInfo** clearer, so it behaves as a feature encoder/decoder before writing to or decoding from shards
* Redesign the data pipeline to be compatible with PyTorch's latest RFC (pytorch/pytorch#49440)
* Abstract distributed shard sampling into **SamplerIterDataPipe** with **DistributedPartitionSampler**

Related work items: #3234727
DataLoader architecture updates and TarDataset implementation
Problem statement
This proposal aims to construct a modular, user-friendly, and performant toolset to address the ambiguous activity referred to as “dataloading” within PyTorch, a simplification attributable to the indivisibility of the DataLoader abstraction prescribed today. In reality, “dataloading” is a diverse set of operations that should be supported by extensible building blocks, out of which the present abstractions, and far more, could be easily built. Some typical needs which are scarcely supported in the present implementation include:
There are hundreds of ways to store a single structured dataset, each requiring a custom or highly configured DataLoader. Users want to take advantage of modularity and not reimplement complete DataSets over and over again. Suppose we have a simple dataset of sample:label pairs of images and ints. There are a number of dimensions whose outer product enumerates the possible storage formats for this data, each requiring a distinct (or specifically configured) DataLoader:
The above example is only for an extremely simple data structure case. The reality of data is often dramatically more heterogeneous and complex (e.g. variable-length lists of bounding box points and strings in object detection, highly nested structures of user or product features in ranking).
Further, users want to find or contribute ops to decode specific file types (e.g. HDF5) and accelerated kernels (e.g. GPU mp3 decoding).
Given that PyTorch maintains decoders for many storage formats, users want more powerful top-level abstractions, such as simply pointing PyTorch to a local or remote directory and receiving an iterator over best-effort deserializations of the files within.
Tensorflow addresses many of the above needs with their TFRecord, dramatically simplifying the problem by taking a strong opinion on the data format with which Tensorflow works best. This has been extremely successful from a performance perspective. However, by prescribing a single storage format, all others are demoted, and the diversity of data needs and entrenched formats has made ubiquitous adoption of TFRecord for storage practically impossible. We’ve heard directly from users that they do not want to be forced into a single first-class format, and the public datasets (which Google rehosts in TFRecord) tend to agree (by completely disagreeing on format). For this reason, we prefer extensibility over prescription, wherein we provide performant support for a basic set of formats in-tree (e.g. Hive and Manifold internally, tar shard and Arrow externally), but users can plug in modular extensions for new formats easily.
Underlying DataLoader Issues
Beyond the needs described above, the existing DataLoader is also a frequent source of user requests and GitHub issues. Such feedback includes, but is not limited to:
Finally, the ubiquity of the DataLoader necessitates strong backward compatibility. For this reason we do not plan to deprecate any existing functionality, but in some cases may offer a more modern way of doing things.
Solution
Break down Dataset into smaller components
DataPipe-s reduce logic to a queue of data-in and a queue of data-out. DataLoader observes the acyclic graph of DataPipe-s and provides the necessary level of parallelism using multiprocessing and multithreading.
Bear in mind that even if we use IterDataPipe in the examples below, all of this is also applicable to MapDataPipe.
Separating by smaller DataPipe-s and connecting them together
This will allow us to simplify DataPipe code and make it reusable across various implementations (for example ImageFolder and TarDataset). It is also necessary in case of moving a memory-consuming DataPipe into separate processes.
Turning IterDataPipe (or IterableDataset) and MapDataPipe (or MapDataset) into NonBlockingIterDataPipe and NonBlockingMapDataPipe
Multiprocessing/threading support makes us prefer nonblocking_next over the __next__ function. The key difference is that nonblocking_next might throw a NotAvailable exception, meaning that data is not yet available and should be requested again with nonblocking_next.
A DataPipe (or older Dataset) which implements only nonblocking_next can be easily used as a standard DataPipe, because the parent class provides the necessary API. An existing synchronous DataPipe (or older Dataset) can be turned into a non-blocking DataPipe using a helper function. The combination of the two approaches will allow a mix of old-style DataPipes (and datasets) and new non-blocking datapipes.
As nonblocking_next does not guarantee that a result is returned, it can also be used to schedule requests ahead. A similar approach will be applied to MapDataPipe with nonblocking_get(id).
Connecting blocks with queues
Having all datapipes as non-blocking (asynchronous) allows connecting them with a couple of queues. For example, in the multiprocessing version, the sub-process main loop serves one item per request, while the main process transparently accesses the datapipe through a simple wrapper; this allows sending a DataPipe into a separate process with a few lines of code (sketches of both pieces follow at the end of this section). Please note that allowing only one request in the queue at a time is an implementation restriction, not enforced by design.
DataLoaderQueue
The above examples use the standard multiprocessing Queue, but it is not the best choice (performance-wise) in some cases, and it does not work in others. Instead we suggest replacing it with a higher abstraction, DataLoaderQueue.
DataLoaderQueue is used to pass data between elements of a pipeline inside a single thread, between threads, between processes, or in a distributed environment. DataLoader will replace the queue with the best implementation for the situation, but all implementations should follow these requirements:
API:
def get(blocking=True) - returns any Python structure, or raises NotAvailableException, or raises QueueError
def put(data, blocking=True) - data is any Python structure; may raise QueueError
A DataLoaderQueue implementation also defines the ‘serialization’ technique, from simply passing an object reference inside the same thread, to IPC calls and full object serialization to be passed via the network.
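The RFC's original inline code blocks did not survive the copy here. As a stand-in, a minimal sketch of the non-blocking API described above; everything except nonblocking_next and NotAvailable is an assumed name, not the RFC's actual code:

```python
import time

class NotAvailable(Exception):
    """Raised by nonblocking_next when data is not ready yet."""

class NonBlockingIterDataPipe:
    # Parent class: a pipe that implements only nonblocking_next still works
    # as a standard iterator, because __next__ retries until data arrives.
    def __iter__(self):
        return self

    def __next__(self):
        while True:
            try:
                return self.nonblocking_next()
            except NotAvailable:
                time.sleep(0.001)  # back off, then ask again

    def nonblocking_next(self):
        raise NotImplementedError

def ensure_non_blocking(datapipe):
    # Helper in the spirit of the text above: a synchronous DataPipe is
    # already a valid non-blocking one, since its next simply never raises
    # NotAvailable (it blocks instead).
    class _Wrapper(NonBlockingIterDataPipe):
        def __init__(self):
            self._it = iter(datapipe)
        def nonblocking_next(self):
            return next(self._it)
    return _Wrapper()
```

Similarly, a sketch of the queue-connected version: the sub-process main loop and the main-process wrapper (names and the one-request-at-a-time protocol are illustrative):

```python
def subprocess_main_loop(datapipe, req_queue, res_queue):
    # Runs in the worker process: serve exactly one item per request.
    it = iter(datapipe)
    while True:
        if req_queue.get() is None:         # shutdown sentinel
            break
        try:
            res_queue.put(next(it))
        except StopIteration:
            res_queue.put(StopIteration())  # forward end-of-data

class QueueWrappedIterDataPipe:
    # Runs in the main process and looks like an ordinary iterator,
    # proxying every __next__ through the request/response queues.
    def __init__(self, req_queue, res_queue):
        self.req_queue, self.res_queue = req_queue, res_queue

    def __iter__(self):
        return self

    def __next__(self):
        self.req_queue.put("next")
        value = self.res_queue.get()
        if isinstance(value, StopIteration):
            raise StopIteration
        return value
```

And a sketch of the DataLoaderQueue contract itself, with the simplest "pass a reference inside the same thread" implementation (only get/put and the exception names come from the text above):

```python
from collections import deque

class NotAvailableException(Exception):
    pass

class QueueError(Exception):
    pass

class InThreadDataLoaderQueue:
    # 'Serialization' here is just passing object references around.
    def __init__(self):
        self._items = deque()

    def get(self, blocking=True):
        if not self._items:
            raise NotAvailableException()  # nothing to wait on in one thread
        return self._items.popleft()

    def put(self, data, blocking=True):
        self._items.append(data)
```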
Users API
DataPipe should work as standard iterators (or implement __getitem__) outside of DataLoader. DataLoader output should be exactly the same, but different pieces of the graph might be executed as separate threads/processes.
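For illustration only, a toy sketch of that contract (the class and wiring are assumptions, not code from the RFC):

```python
from torch.utils.data import DataLoader, IterableDataset

class SquaresIterDataPipe(IterableDataset):
    # Toy stand-in for any iterable DataPipe.
    def __init__(self, n=10):
        self.n = n

    def __iter__(self):
        return (i * i for i in range(self.n))

dp = SquaresIterDataPipe()

# Outside of DataLoader: the DataPipe is a plain Python iterable.
print(list(dp))                        # [0, 1, 4, ..., 81]

# Through DataLoader: the same elements come out, even though parts of the
# graph may be executed in separate threads/processes.
for batch in DataLoader(dp, batch_size=5):
    print(batch)
```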
Naming
There are a number of concepts which we would like to take this refactoring opportunity to clarify, though we also emphasize the importance of backward compatibility. We propose the following naming scheme for the components described within the scope of this doc, including typical end-user code samples.
Sharding
Sharding should be implemented at the framework level and hidden from DataPipe users. DataPipe developers will get control over sharding settings and running configurations. DataLoader will decide how to split a DataPipe into shards and the running configuration.
DataPipe blocks will provide information to the DataLoader on whether they support sharding via datapipe.is_shardable(). If the function is not defined, the DataPipe will be considered non-shardable. DataLoader will call back DataPipe objects with sharding settings using datapipe.sharding_settings(total_num_of_shards, id_of_shard).
Example (sketch below):
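A minimal sketch under the is_shardable/sharding_settings protocol above, echoing the example given earlier in this thread (the class name is assumed):

```python
class RangeIterDataPipe:
    # Toy shardable pipe: each shard walks the range with a stride.
    def __init__(self, size=100):
        self.size = size
        self.total_num_of_shards, self.id_of_shard = 1, 0

    def is_shardable(self):
        return True

    def sharding_settings(self, total_num_of_shards, id_of_shard):
        self.total_num_of_shards = total_num_of_shards
        self.id_of_shard = id_of_shard

    def __iter__(self):
        yield from range(self.id_of_shard, self.size, self.total_num_of_shards)

dp = RangeIterDataPipe()
dp.sharding_settings(total_num_of_shards=4, id_of_shard=1)  # normally done by DataLoader
print(list(dp)[:3])   # [1, 5, 9]
```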
Individual Process (Thread)
Situations like prefetching and large non-forkable arrays require spawning separate processes for a DataPipe. DataPipe blocks will provide information to the DataLoader on whether they are recommended to be executed as a separate process via datapipe.is_separate_process().
Lazy Initialization
In some cases it is inefficient to initialize DataPipe data before usage. For example, we need to postpone loading the full list of files before forking out a file scanner. For this purpose, a lazy_init function will be called prior to any __len__, __getitem__, __iter__ operators.
Functional DataPipe
DataLoader should not care about any data logic (including sampling, shuffle, and collate).
Moving Sampler from DataLoader into separate DataPipe
We are planning to create a Sampler DataPipe for each existing sampling logic, as well as a wrapper around existing Sampler classes.
PR: #49363
Note:
All SamplerDataPipes can be replaced by another Iterable DataPipe, and a Sampler is not required in the data pipeline.
In general, the sampler datapipe is not suggested for use in the new pipeline; we keep it to avoid BC-breaking changes.
Example for the replacement of SubsetSampler:
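The original example did not survive the copy; a hypothetical sketch of the replacement (names assumed):

```python
class SubsetIterDataPipe:
    # Replaces SubsetSampler: yields only the elements at the given positions.
    def __init__(self, source_dp, indices):
        self.source_dp = source_dp
        self.indices = set(indices)

    def __iter__(self):
        for i, item in enumerate(self.source_dp):
            if i in self.indices:
                yield item

# e.g. keep positions 0, 2, 4 of any iterable pipe:
# list(SubsetIterDataPipe(range(10), [0, 2, 4])) == [0, 2, 4]
```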
Moving Collate functions from DataLoader into separate DataPipes
We are going to move the collate logic out of DataLoader and implement it as an IterDataPipe; it will accept old collate functions as an argument, or can be rewritten entirely.
PR: #48933
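The PR's code is not shown here; a minimal sketch of such a pipe, assuming it batches items and wraps a classic collate_fn (the class name is an assumption):

```python
from torch.utils.data.dataloader import default_collate

class CollateIterDataPipe:
    # Groups items into fixed-size batches and applies a classic collate_fn.
    def __init__(self, source_dp, batch_size, collate_fn=default_collate):
        self.source_dp = source_dp
        self.batch_size = batch_size
        self.collate_fn = collate_fn

    def __iter__(self):
        batch = []
        for item in self.source_dp:
            batch.append(item)
            if len(batch) == self.batch_size:
                yield self.collate_fn(batch)
                batch = []
        if batch:                      # trailing partial batch
            yield self.collate_fn(batch)
```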
Moving Shuffle from DataLoader into separate Datasets
See pytorch/torch/utils/data/dataset.py, lines 257 to 312 at commit 7729581.
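The referenced lines implement buffer-based shuffling for iterable datasets; a condensed sketch of that idea (not the verbatim source):

```python
import random

class BufferedShuffleIterDataPipe:
    # Keeps a bounded buffer; each incoming item evicts (yields) a random
    # buffered item, giving approximate global shuffling in O(buffer) memory.
    def __init__(self, source_dp, buffer_size):
        self.source_dp = source_dp
        self.buffer_size = buffer_size

    def __iter__(self):
        buf = []
        for item in self.source_dp:
            if len(buf) == self.buffer_size:
                idx = random.randint(0, self.buffer_size - 1)
                yield buf[idx]
                buf[idx] = item
            else:
                buf.append(item)
        random.shuffle(buf)
        yield from buf
```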
Other functional DataPipes
In order to provide a more versatile API, we plan to add more functional DataPipes for users.
Reproducibility and randomness
This should be part of the DataLoader implementation, to be able to define the random seed under various parallelization techniques.
Async (non-blocking) operations also introduce non-determinism of order, so we would need to implement a DataLoader attribute to order the fulfillment of non-blocking calls and guarantee order determinism.
To Do
This document doesn’t touch the problem of varying batch size for different phases of processing. It is achievable by passing a list of objects into the queue and will be considered at the queue implementation phase. However, it would be better to put a code example here.
This document doesn't cover distributed training in detail. We are going to expand on this topic using additional sharding parameters and queue implementations.
Considerations
User-defined sharding was considered unnecessary at the early stages; however, nothing in the proposed architecture prevents implementing it later.
A C++ implementation was considered insufficiently flexible. However, nothing prevents users from creating DataPipes with C++ internals.
Torchscript can be used inside of DataPipes, but we are not limited to it.
Arrow/Proto/… can be used to pass data between DataPipes.
Error Tracing?
C++
cc @ssnl @VitalyFedyunin @ejguan