Future of torchdata and dataloading #1196

laurencer · 2023-07-17T14:49:34Z

As of July 2023, we have paused active development on TorchData and have paused new releases. We have learnt a lot from building it and hearing from users, but also believe we need to re-evaluate the technical design and approach given how much the industry has changed since we began the project. During the rest of 2023 we will be re-evaluating our plans in this space.

We want to hear from users on their use-cases and the pain-points they have (with data loading in general or torchdata specifically). Please reply on this issue to help inform our future roadmap.

erip · 2023-07-17T17:56:51Z

Thanks for the update, @laurencer. Does this mean torchdata as a domain library or the entire concept of datapipes/DataLoader2 is under review?

laurencer · 2023-07-17T19:44:21Z

The short answer is we need to look at both. More holistically, there are lots of benefits to datapipes & DataloaderV2; however, we've seen some limitations in a few use-cases which indicate we may need to tweak them a bit (or that they're not the one-stop solution we were hoping for). Overall the data loading space is really important and we hear about a lot of pain-points, so we want to make sure we get the core abstractions right.

josiahls · 2023-07-18T00:00:27Z

I've personally liked the idea behind data pipes and the newer data loader. It would be helpful to know some examples of use cases where this concept/API broke down.

I think it would help people about to jump into torchdata to know the breaking limitations, because they are not obvious, to me at least.

andrew-bydlon · 2023-07-18T01:32:02Z

I too would like to hear what limitations you are referencing.

If it is performance oriented, I believe there's an argument. You could make something compatible with a compiled framework, especially something like torch.compile. Someone recently talked about speedups from loading tar files with Rust, as an example.

My question is: what do you recommend as an alternative platform to torchdata for flexible and fast dataloading in PyTorch?

nairbv · 2023-07-18T15:42:30Z

Does the design re-evaluation apply only to TorchData, or also to the portion of the datapipes API that was upstreamed to PyTorch core?

https://github.com/pytorch/pytorch/tree/main/torch/utils/data/datapipes

BlueskyFR · 2023-07-20T16:39:53Z

IMO the first thing that comes to mind is TensorFlow's tf.data.Dataset API, which is super cool to use. It is deeply integrated into the framework and with Keras, and the different operations you "pipe"/chain together can be fused at runtime so that the input pipeline is more optimized.
However, there are still some solutions such as NVIDIA DALI that are waaay faster if they apply to your use case.

I don't know TorchData in much detail, but I'd say building a nice-looking pipeline is easy, while making a pipeline optimized for high performance and still making it look cool is the real challenge.
By cool-looking I mean nice/easy to read as code, but also easily extensible, like users could share "databricks" together or something.

I hope this helps!

BlueskyFR · 2023-07-21T12:23:38Z

Also, do you still recommend using torchdata in the meantime, or will compatibility with torch break at some point, so that we should avoid using it?

rbavery · 2023-07-21T23:07:15Z

Our ML team has been an avid user of torchdata. We have used it to build datapipes that fetch large raster datasets from cloud providers to support training and inference. Currently we have a couple of projects using torchdata; the most robust is https://zen3geo.readthedocs.io/en/latest/walkthrough.html, a library managed by my colleague @weiji14 for fetching and batching large satellite images, with steps organized as functional datapipe ops.

While the API makes it easier to reuse custom data operations, we've been running into some consistent pain points when integrating datapipes with Dataloader V1 or Dataloader V2. We've had to switch back to Dataloader V1 and Datasets.

There doesn't seem to be a clear set of documented rules for prefetching, shuffling, buffer sizes, and memory pinning that results in good performance, or even in any performance gain over single-process dataloading, when using torchdata with either Dataloader. All the configurations we have tried result in hangups, out-of-memory errors, or slower performance than a single process. It's also unclear how these parameters interact with different reading services.

I would love to see better docs and functionality for setting prefetching, shuffling, etc. with different kinds of reading services. Being able to profile datapipes and inspect the RAM and CPU consumption of each operation would also be invaluable.
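To make the profiling request above concrete, here is a minimal sketch of per-stage instrumentation using only the Python standard library. `profiled` and `stats` are hypothetical names for illustration, not torchdata APIs, and the pipeline stages are plain generators standing in for datapipe operations. The recorded time is cumulative: each stage's figure includes the time spent waiting on the stages upstream of it.

```python
import time
import tracemalloc
from collections import defaultdict

# Hypothetical bookkeeping, not part of torchdata: per-stage item counts and
# cumulative seconds spent pulling items (upstream time included).
stats = defaultdict(lambda: {"items": 0, "seconds": 0.0})

def profiled(name, iterable):
    """Wrap any iterable stage and time each pull from it."""
    it = iter(iterable)
    while True:
        start = time.perf_counter()
        try:
            item = next(it)
        except StopIteration:
            return
        finally:
            stats[name]["seconds"] += time.perf_counter() - start
        stats[name]["items"] += 1
        yield item

def load(n):
    # Stand-in for a stage that reads records.
    for i in range(n):
        yield i

def decode(items):
    # Stand-in for a stage that transforms records.
    for x in items:
        yield x * 2

tracemalloc.start()
source = profiled("load", load(1000))
decoded = profiled("decode", decode(source))
result = list(decoded)
current_bytes, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

for name, s in stats.items():
    print(f"{name}: {s['items']} items, {s['seconds']:.4f}s cumulative")
print(f"peak traced memory: {peak_bytes} bytes")
```

This only covers a single process; measuring worker processes would need the counters to be collected from each worker, which the sketch does not attempt.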

biphasic · 2023-07-24T14:05:43Z

I started to prototype a new version of my event data library on top of torchdata. The API is very clean and easy to understand, which is a strong plus point, even if it has a minor performance impact (I didn't verify that). I remember struggling with DataLoader2 and making multithreading work.

sehoffmann · 2023-07-26T12:47:37Z

Hey, thanks for the update. Does that mean that torchdata will become obsolete in the future?

As I already indicated in older issues, what I see as the biggest weakness right now is the lack of control and flexibility with regard to:

- Shuffling
- Sharding
- Multiprocessing (dispatching to other processes etc.)

These things are right now tightly integrated into the torchdata core, and not easily accessible from user code. Giving user code the same power and flexibility is paramount, in my opinion, to facilitate more complex pipelines than the vanilla CV pipeline. Or, conversely, these functionalities should be implemented in user land without privileged handling from torchdata.

You can have a look at my repository (sehoffmann/atmodata) for some monkey patches that were necessary to facilitate my pipeline.
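As an illustration of the "user land" option mentioned above, the following sketch implements sharding and shuffling as ordinary iterator functions with no privileged framework support. The function names and the round-robin / buffered-shuffle strategies are assumptions chosen for the example; they are not how torchdata implements these features, and `buffer_size` is assumed to be at least 1.

```python
import random
from itertools import islice

def shard(iterable, num_shards, shard_index):
    """Round-robin sharding: keep every num_shards-th item, starting at shard_index."""
    return islice(iterable, shard_index, None, num_shards)

def buffered_shuffle(iterable, buffer_size, seed):
    """Approximate shuffle for streams: fill a buffer, then emit random picks from it."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item  # replace the emitted slot with the new item
    rng.shuffle(buffer)
    yield from buffer  # drain whatever is left at the end of the stream

# Two hypothetical workers, each seeing a disjoint half of the data.
data = range(20)
worker0 = list(buffered_shuffle(shard(data, 2, 0), buffer_size=4, seed=0))
worker1 = list(buffered_shuffle(shard(data, 2, 1), buffer_size=4, seed=0))
print(worker0)
print(worker1)
```

Because both steps are plain functions, user code decides their order (shard before shuffle here) and their parameters, which is the kind of control the comment asks for.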

npuichigo · 2023-07-26T13:53:36Z

As for me, the ideal data pipeline should be ergonomic, flexible and efficient.

Chainable iterators already show their power in iterator algorithm libraries like itertools and more-itertools. torchdata chose to enhance that with a functional programming API, which is good. Rust's Iterator algorithms actually provide a good list of commonly used pipeline operations: besides Filter and Map, there are also FilterMap, FlatMap and so on.
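For readers unfamiliar with the Rust-style operations named above, here is a small sketch of a chainable wrapper built on itertools. `Pipe` is a hypothetical class written for this example, not a torchdata or itertools API; `filter_map` uses `None` in the role of Rust's `Option::None`.

```python
from itertools import chain

class Pipe:
    """Hypothetical chainable, re-iterable wrapper around an iterable."""

    def __init__(self, make_iter):
        # make_iter: zero-argument callable returning a fresh iterator.
        self._make_iter = make_iter

    @classmethod
    def of(cls, iterable):
        return cls(lambda: iter(iterable))

    def __iter__(self):
        return self._make_iter()

    def map(self, fn):
        return Pipe(lambda: map(fn, self))

    def filter(self, pred):
        return Pipe(lambda: filter(pred, self))

    def filter_map(self, fn):
        # Apply fn and drop results that are None.
        return Pipe(lambda: (y for y in map(fn, self) if y is not None))

    def flat_map(self, fn):
        # Apply fn (which returns an iterable) and flatten one level.
        return Pipe(lambda: chain.from_iterable(map(fn, self)))

def parse_int(text):
    try:
        return int(text)
    except ValueError:
        return None

pipe = (
    Pipe.of(["1 2", "x 3", "4"])
    .flat_map(str.split)            # "1", "2", "x", "3", "4"
    .filter_map(parse_int)          # 1, 2, 3, 4
    .filter(lambda n: n % 2 == 0)   # 2, 4
    .map(lambda n: n * 10)          # 20, 40
)
result = list(pipe)
print(result)  # [20, 40]
```

Each stage builds its iterator lazily on demand, so the same `pipe` object can be iterated more than once as long as the source itself is re-iterable.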

But flexibility still has room for improvement. At the very least, torchdata should be comparable to pypeln, which has good flexibility to switch between or mix thread/process/coroutine based tasks.

For performance, can torchdata be comparable with Hugging Face datasets? Can we easily leverage Arrow or something else to build a high-performance data pipeline? It would be better to show more benchmarks on production data.

BarclayII · 2023-07-27T02:12:34Z

The DGL team is currently studying what the UX should be for scaling deep learning on graphs, namely the sampling strategies. More specifically, we want to support customization of:

- Different graph storages (in-memory, disk, graph databases, etc.)
- Different node/edge feature storages (in-memory, disk, etc.)
- Sampling algorithms (online neighbor/subgraph sampling, offline sampling, etc.)
- Downstream tasks (node classification, link prediction with negative sampling, graph classification, etc.)
- Orchestration (whether to put sampling and feature fetching in multiprocessing/multithreading, how to schedule different stages, etc.).

So our current design depends on the composability of torchdata's DataPipes to allow for maximum extensibility, expressing the graph storages/feature storages/sampling algorithms/etc. as a composition of iterables and their transforms.

That being said, we are currently not pursuing active usage of DataLoader2, due to concerns about compatibility with existing packages that depend on the PyTorch DataLoader (e.g. PyTorch Lightning). We did, however, borrow some ideas from ReadingService (namely the in-place editing of DataPipes).

We already have a demo in https://github.com/dmlc/dgl/tree/master/tests/python/pytorch/graphbolt.

Happy to discuss further.

nairbv · 2023-07-27T13:48:27Z

@npuichigo pypeln looks interesting. Based on there being multiple single-threaded queues between stages I assume this is designed for a single-node setup? PyTorch users would need multi-node support. To insert a queue between stages of a pipeline with multi-node stages, presumably we'd want to use some kind of purpose-built stand alone message queue. I'm not sure if that kind of setup is desirable -- once the training data reaches GPU hosts, I'd think we usually don't want to send it back elsewhere, so that architecture might make more sense for pre-processing.

@BarclayII

> our current design depends on the composability of torchdata's DataPipes
> not pursuing active usages on DataLoader2

I'd like to use DataPipes for some NLP problems for similar reasons, and have some prototypes. I'd like to get confirmation, but from what I can tell it seems like it may only be TorchData that has paused development, whereas the DataPipes API is already part of PyTorch core.

For my use-cases I don't need DataLoader2 or readingservice/adapter. I think there are other ways to solve the problems addressed by those additional APIs -- I think there are ways to do it with just DataPipes that would address the concerns @sehoffmann raises around shuffling/sharding being "tightly integrated into the torchdata core, and not easily accessible from user code."

I also wonder whether we actually need a separate API focused on composability of datasets. The original dataset API could be used with composition too, and I'm not sure exactly what challenges we'd face in doing so. I know we wouldn't have the functional helper functions but that seems minor, and not sure what else we'd be missing.
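A rough sketch of what composition over the original map-style dataset protocol (`__getitem__` plus `__len__`) might look like. `Mapped` and `Filtered` are hypothetical wrappers written for this example; in a PyTorch project they would presumably subclass `torch.utils.data.Dataset`, which is omitted here so the snippet runs without torch installed.

```python
class Mapped:
    """Wrap a map-style dataset and apply fn lazily on each access."""

    def __init__(self, dataset, fn):
        self.dataset = dataset
        self.fn = fn

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        return self.fn(self.dataset[index])

class Filtered:
    """Keep only items satisfying pred; scans the wrapped dataset once up front."""

    def __init__(self, dataset, pred):
        self.dataset = dataset
        self.indices = [i for i in range(len(dataset)) if pred(dataset[i])]

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, index):
        return self.dataset[self.indices[index]]

# Any sequence already satisfies the protocol, so a list serves as the base.
base = ["a", "bb", "ccc", "dddd"]
lengths = Mapped(base, len)
even_lengths = Filtered(lengths, lambda n: n % 2 == 0)

items = [even_lengths[i] for i in range(len(even_lengths))]
print(items)  # [2, 4]
```

The main cost visible here is that filtering needs a full pass to build its index, which is one place where iterable-style pipelines and map-style composition behave differently.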

bryant1410 · 2023-07-28T20:17:20Z

> however we've seen some limitations in a few use-cases which indicate we may need to tweak them a bit

@laurencer can you elaborate on which are those use cases?

BarclayII · 2023-08-03T02:25:42Z

@npuichigo I checked pypeln as well. It seems that the user needs to specify how to organize the queues at a low level (e.g. multiprocessing, multithreading, asyncio, etc.). Normally our UX shouldn't involve such a low-level specification unless the developers want to implement their own pipeline scheduling.

> The original dataset API could be used with composition too, and I'm not sure exactly what challenges we'd face in doing so. I know we wouldn't have the functional helper functions but that seems minor, and not sure what else we'd be missing.

@nairbv Other than the functional helper functions, I find the in-place editing of DataPipes (namely the torchdata.dataloader2.graph namespace) useful. For instance, given a single-process DataPipe, I can have a DataLoader that changes the DataPipe for multi-processing, and the process will be transparent to users. We also intend to apply the same idea to coordinating graph sampling, feature prefetching, and CPU-GPU transfer. https://github.com/dmlc/dgl/blob/master/python/dgl/graphbolt/dataloader.py#L58 shows an example.
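The idea of a loader rewriting a user's pipeline in place can be sketched as follows. Everything here (`Stage`, `Prefetch`, `insert_after`, `names`) is hypothetical and written for illustration; it does not reproduce the torchdata.dataloader2.graph API or the DGL code linked above, and the `Prefetch` stage merely reads ahead into a list rather than doing real background work.

```python
class Stage:
    """Hypothetical pipeline node: pulls from `source` and applies `fn`."""

    def __init__(self, source, fn, name):
        self.source = source
        self.fn = fn
        self.name = name

    def __iter__(self):
        for item in self.source:
            yield self.fn(item)

class Prefetch:
    """Hypothetical stage a loader might splice in; here it only reads ahead."""

    def __init__(self, source, name="prefetch"):
        self.source = source
        self.name = name

    def __iter__(self):
        yield from list(self.source)

def insert_after(root, target_name, make_stage):
    """Splice a new stage directly downstream of the stage named target_name.

    Walks from the last stage back toward the source and returns the root.
    """
    if getattr(root, "name", None) == target_name:
        return make_stage(root)
    node = root
    while hasattr(node, "source"):
        if getattr(node.source, "name", None) == target_name:
            node.source = make_stage(node.source)  # the in-place edit
            return root
        node = node.source
    raise ValueError(f"no stage named {target_name!r}")

def names(root):
    """List stage names from the last stage back to the first."""
    out = []
    node = root
    while hasattr(node, "name"):
        out.append(node.name)
        node = getattr(node, "source", None)
    return out

# The user builds a plain pipeline...
load = Stage(range(5), lambda x: x, "load")
augment = Stage(load, lambda x: x + 1, "augment")
batch = Stage(augment, lambda x: [x], "batch")
before = list(batch)

# ...and a loader rewrites it without the user changing any code.
insert_after(batch, "load", Prefetch)
after = list(batch)
print(names(batch))  # ['batch', 'augment', 'prefetch', 'load']
```

The user-facing object (`batch`) is unchanged and yields the same items, which is the transparency property described in the comment.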

Happy to discuss further.

hhoeflin · 2023-08-14T07:29:35Z

As for a data pipelining solution, it would be nice if this could be developed without a dependency on a deep learning framework (torch, tensorflow, etc.).

vincenttran-msft · 2023-09-05T21:24:31Z

Thanks for the update @laurencer Laurence.

This is unfortunate news to hear, as we over here at Microsoft's Developer Experience team have seen a lot of interest from our customers in cloud computing and seamless integration between Azure Storage and PyTorch. Our summer intern's project was building out custom FileLoader and FileLister DataPipes that make it easy to interact with datasets stored on Azure Storage, so the news of a halt in development and an uncertain future for the torchdata repo makes it difficult to plan our continued development of integration with PyTorch.

With that being said, I am hopeful that this is a necessary step back in order to re-strategize and refine the future roadmap to ultimately end up with a better user experience for all. As the field of AI/ML continues to develop in the near future, it is really a matter of when (and not if) we will revisit building direct support for Azure Storage with PyTorch workflows, and so we will likely reach out sometime when the future of torchdata and dataloading is clearer. However, there are still some questions (and many were raised by previous posters above) that I would like to echo which would greatly help us developers in the interim of no new releases:

- What are the expectations of compatibility / should we even consider continuing to build out torchdata support, or is that something that is becoming obsolete in the near future?
- Any ETAs for when we will be updated on the status of the future of torchdata? While I understand this may be difficult, it would be great if any information could be shared in regard to timing to keep any developers from continuing to build out their infrastructure with torchdata, or if they should begin the migration process etc.

Thanks in advance, and please feel free to reach out if necessary!

talmo · 2023-12-19T19:22:59Z

> What are the expectations of compatibility / should we even consider continuing to build out torchdata support, or is that something that is becoming obsolete in the near future?
>
> Any ETAs for when we will be updated on the status of the future of torchdata? While I understand this may be difficult, it would be great if any information could be shared in regard to timing to keep any developers from continuing to build out their infrastructure with torchdata, or if they should begin the migration process etc.

+1, any updates on the roadmap since September?

npuichigo · 2023-12-21T06:07:06Z

I found an alternative: using Ray Data for data loading. It's framework-agnostic and performant, and it also provides a chainable API but with more fine-grained parallelism control. I also wrote a tutorial on using it together with Hugging Face Datasets here: https://github.com/npuichigo/blazing-fast-io-tutorial.

coufon · 2024-01-16T16:46:06Z

I'd like to share our work https://github.com/google/space, which supports materialized views in ML data pipelines. It ingests more metadata (versioning, column stats, logical plan) into ML datasets and pipelines, to provide a database/lakehouse-like experience. Materialized views have the benefits of incremental processing (with the ability to go back to old versions) and data lineage. Hope it will be useful.

jaanli · 2024-02-13T16:39:03Z

Hi @laurencer - any update on this?

bhack · 2024-02-15T01:25:53Z

I am also interested in the torchdata and dataloading vision/roadmap as it relates to audio/video API fragmentation: pytorch/pytorch#81102

seanmcc-msft · 2024-02-21T22:03:54Z

Any update here? My team is interested in creating an Azure Storage extension for PyTorch, similar to S3, but we cannot proceed with planning and implementation until we know what the future of PyTorch extensions will look like.

bhack · 2024-02-22T13:50:56Z

I don't know if @laurencer is still active on this or he is working for pytorch/Meta more in general. /cc @ejguan as he is the codeowner of pytorch dataloader.

nairbv · 2024-02-22T14:18:23Z

@bhack I know @ejguan is no longer on torchdata (I think he's in ads now)

bhack · 2024-02-22T14:31:14Z

@nairbv I think that the codeowners need to be updated https://github.com/pytorch/pytorch/blob/main/CODEOWNERS#L113-L114

nairbv · 2024-02-22T14:44:25Z

Agreed. I don't believe dataloader currently has a specific owner, but I'm no longer at Meta.

bhack · 2024-03-06T19:40:56Z

I hope it will be one of the topics at https://events.linuxfoundation.org/pytorch-conference/

andrewkho · 2024-06-11T16:20:29Z

Hi everyone, we’d like to share an update about how we plan to use the pytorch/data repo going forward. We will be focusing our efforts on a more iterative approach to Dataloader v1 (torch.utils.data), and we plan to use torchdata as a place to iterate on new features, such as the StatefulDataLoader. To avoid confusion in our support and offering, we will be deleting DataLoader2 and DataPipes from this repo. We plan to restart the torchdata release cycle, beginning with 0.8.0 in July 2024 (following PyTorch 2.4.0). DataLoader2 and DataPipes will be marked deprecated. In the October 0.9.0 release, they will no longer be part of the torchdata package, and existing users will need to either migrate away or pin to an older release such as 0.8.0. We welcome feedback to these plans, and will be responding in this Issue to any questions and concerns.

zhitaoli · 2024-06-11T16:25:33Z

Is it possible to publish a whitepaper / vision showing where the current steering committee's thinking is, and to consider an RFC process?

talmo · 2024-06-11T16:27:38Z

Thanks for the update @andrewkho!

Are there any places we can learn more about the roadmap? A design doc for StatefulDataLoader or anything we can use to help inform our decisions on how to engineer future-oriented performant data pipelines? We've been in a bit of a holding pattern waiting for clarity before needing to sink tons of eng hours on reimplementing complex data pipelines...
What does the removal of DataPipes from this repo mean for the corresponding implementation in core torch (torch.utils.data.datapipes)?

josiahls · 2024-06-11T16:33:50Z

@andrewkho Glad to hear work will restart on this project. Glaring concerns:

1. This plan talks about removing a bunch of stuff, especially datapipes.
2. The plan fails to explain what they are being replaced with.
- StatefulDataLoader looks like a very small part of a much bigger picture.
- It would be helpful to give a plan / vision of how this is going to proceed.
- It would be helpful to know what the original problem with torchdata/datapipes was, so others can learn/avoid whatever antipatterns/scalability issues/API issues caused this project to get paused and completely redone.

bhack · 2024-06-11T16:51:20Z

It would be nice in this roadmap also to consider how to coordinate/consolidate with some other projects in the ecosystem for their specific data domain. See pytorch/pytorch#81102

andrewkho · 2024-06-11T17:31:24Z

Thanks everyone for the questions and comments. I'd like to acknowledge @josiahls 's comment that we are removing stuff without saying what we're replacing it with, which is completely fair criticism. We want to get feedback and see if there is significant pushback on something that we are sure about (deleting datapipes/dataloader2) as soon as possible.

Regarding roadmaps and plans, we're currently iterating on some ideas on how best to serve the community, but it's not quite in a state that is ready to discuss yet. I'll be able to share more in the next couple of months, please bear with us as we get our act together :)

Re: datapipes in torch.utils.data, we're going to be taking a hard look at that as well.

talmo · 2024-06-11T17:39:36Z

Thanks @andrewkho! We'll be looking forward to hearing more.

Just to +1 what has been pointed out before in this thread: tf.data.Dataset is a majorly performant system to consider, allowing you to build IterDataPipe-style chains of transformations, each with AutoGraph-compiled functions (tf.function) and an auto-tunable level of parallelism. It's awesome to be able to control concurrency at each part of the data pipeline, and it makes for super composable data pipeline blocks.
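To illustrate per-stage concurrency control in plain Python, here is a minimal sketch using the standard library's thread pools. `parallel_map` is a hypothetical helper, not a tf.data or torchdata API, and the worker counts are fixed by hand rather than auto-tuned. Note that `Executor.map` submits all inputs up front, so this form is only suitable for finite inputs.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_map(fn, iterable, workers):
    """Apply fn with a dedicated thread pool; results keep the input order."""
    def generator():
        with ThreadPoolExecutor(max_workers=workers) as pool:
            yield from pool.map(fn, iterable)
    return generator()

def fetch(i):
    # Stand-in for an I/O-bound step such as a download.
    return f"record-{i}"

def decode(record):
    # Stand-in for a lighter transformation step.
    return record.upper()

# Each stage gets its own level of parallelism.
fetched = parallel_map(fetch, range(8), workers=4)
decoded = parallel_map(decode, fetched, workers=2)
result = list(decoded)
print(result)
```

Threads suit I/O-bound stages; CPU-bound Python stages would need processes instead, which is the thread/process/coroutine choice discussed earlier in the thread.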

nairbv · 2024-06-11T17:56:40Z

There are a couple of different approaches to stateful / savable datasets in IBM repos that may be of interest. These have been used to stop and restart long-running training jobs when training LLMs.

SavableDataset: https://github.com/foundation-model-stack/foundation-model-stack/blob/main/fms/datasets/util.py#L65

StatefulDataset: https://github.com/foundation-model-stack/fms-fsdp/blob/main/fms_fsdp/utils/dataset_utils.py#L68
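The general stop-and-restart idea can be sketched in a few lines. `ResumableStream` is a hypothetical class written for this example and is not taken from either linked repository; the `state_dict` / `load_state_dict` method names are an assumption that follows the convention PyTorch uses for modules and optimizers.

```python
class ResumableStream:
    """Hypothetical iterable that can save and restore its read position."""

    def __init__(self, records):
        self.records = records
        self.position = 0

    def __iter__(self):
        while self.position < len(self.records):
            item = self.records[self.position]
            self.position += 1  # advance before yielding so state reflects consumed items
            yield item

    def state_dict(self):
        return {"position": self.position}

    def load_state_dict(self, state):
        self.position = state["position"]

# First run: consume four records, then checkpoint.
stream = ResumableStream(list(range(10)))
it = iter(stream)
first = [next(it) for _ in range(4)]
checkpoint = stream.state_dict()

# Restarted job: rebuild the stream and resume from the checkpoint.
restored = ResumableStream(list(range(10)))
restored.load_state_dict(checkpoint)
rest = list(restored)
print(first, rest)  # [0, 1, 2, 3] [4, 5, 6, 7, 8, 9]
```

A real implementation would also have to capture shuffle state, worker assignment, and any buffered items, none of which this sketch handles.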

andrew-bydlon · 2024-06-11T19:22:39Z

It's understandable to remove datapipes from this repo, as it is just a pass through to the torch version. I do like the functional datapipes approach, allowing for dp.someiterable nomenclature. I hope it can be ported.

For data loader 2, is there a concrete idea for how to do a data sampler approach in dl1? I typically don't have a length in iterable datapipes, or they are infinite cycles. I'm unaware if a standard was created. The analogous reading services for distributed and multiprocessing are very nice here.

Glad to hear it is coming back in some form. To me, it is a useful repo for datapipes and accessing cloud providers (e.g. S3). I wonder what the big vision is after removing dp and dl2 🤣

jaanli · 2024-06-12T00:06:01Z

We have been using dbt: @dbt-labs helps us transform our pre-train data and upload it automatically to S3 for streaming - I think a similar pipeline should be feasible for DataLoader2-type tasks? Not sure...

Here's an example for health care data: https://github.com/onefact/healthcare-data/

rgtjf · 2024-07-04T07:34:15Z

StatefulDataLoader may be a prerequisite for stateful data pipes.

- laurencer pinned this issue Jul 17, 2023
- talmo mentioned this issue Jul 22, 2023: Refactor datapipes talmolab/sleap-nn#9 (Merged)
- laurencer mentioned this issue Jul 24, 2023: Is torchdata still being actively developed? #1192 (Closed)
- NicolasHug mentioned this issue Jul 24, 2023: Migrating download utils to torchdata pytorch/vision#7549 (Closed)
- talmo mentioned this issue Jul 27, 2023: Top-down centered-instance pipeline talmolab/sleap-nn#3 (Closed)
- NickleDave mentioned this issue Jul 28, 2023: ENH: add BaseVocalDataset that uses vocles vocalpy/vak#539 (Closed)
- weiji14 mentioned this issue Aug 1, 2023: ✨ MapperAsyncIterDataPipe for applying custom async functions weiji14/bambooflow#9 (Merged, 3 tasks)
- pmeier mentioned this issue Aug 1, 2023: Simpler file chunking #1188 (Closed)
- adamjstewart mentioned this issue Aug 1, 2023: TorchData microsoft/torchgeo#576 (Closed)
- atolopko-czi mentioned this issue Aug 3, 2023: Assess continued use of TorchData chanzuckerberg/cellxgene-census#685 (Open)
- biphasic mentioned this issue Aug 10, 2023: Porting datasets to new TorchData API neuromorphs/tonic#229 (Closed, 17 tasks)
- albertz mentioned this issue Aug 22, 2023: Reconsider TorchData, DataLoader2, etc rwth-i6/returnn#1382 (Closed)
- awaelchli mentioned this issue Sep 10, 2023: Support setup for torch DataPipes Lightning-AI/pytorch-lightning#16603 (Closed)
- adamjstewart mentioned this issue Sep 27, 2023: RasterDataset.from_files microsoft/torchgeo#1427 (Closed)
- jacobbieker mentioned this issue Oct 10, 2023: Torchdata Future is up in the air openclimatefix/ocf_datapipes#227 (Closed)
- jacobbieker mentioned this issue Nov 21, 2023: PVNet Torchdata now conflicts with ocf_datapipes DataPipes openclimatefix/PVNet#94 (Closed)
- lhoestq mentioned this issue Apr 30, 2024: Save and resume the state of a DataLoader huggingface/datasets#5454 (Open)
- andrewkho mentioned this issue May 29, 2024: Update README.md #1263 (Merged)
- This was referenced Jun 12, 2024: Enable Append Mode in SaverIterDataPipe #1270 (Open); add deprecation warning for deprecated modules #1277 (Merged)
- gitttt-1234 mentioned this issue Jun 23, 2024: DataLoader Performance improvement talmolab/sleap-nn#58 (Open)
