
Shared Dataset Functionality #24915

Open
vincentqb opened this issue Aug 20, 2019 · 8 comments
Labels
better-engineering Relatively self-contained tasks for better engineering contributors · module: dataloader Related to torch.utils.data.DataLoader and Sampler · triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@vincentqb
Contributor

vincentqb commented Aug 20, 2019

🚀 Feature

We want to build a unified data pipeline interface that offers building blocks for others to build on, with the following objectives (a sketch follows the list):

  • Standardize datasets across domains.
  • Offer flexible building blocks that can be combined to obtain other datasets.
  • Enable datasets that do not fit in memory.
  • Share code among domains.
  • Facilitate parallel loading and processing of data.
  • Decouple data loading from preprocessing/transformation.
  • Offer static typing for datasets.
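
To make the intended shape concrete, here is a minimal sketch of such composable, lazy building blocks. The ReadLines and MapDataset names are illustrative assumptions, not an existing API:

```python
# A hypothetical sketch only: ReadLines and MapDataset are illustrative names,
# not an existing PyTorch API. Both are lazy, so the data never has to fit in memory.
from typing import Callable, Iterator

from torch.utils.data import IterableDataset

class ReadLines(IterableDataset):
    """Lazily yields lines from a text file; loading stays decoupled from preprocessing."""
    def __init__(self, path: str) -> None:
        self.path = path

    def __iter__(self) -> Iterator[str]:
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.rstrip("\n")

class MapDataset(IterableDataset):
    """Applies fn to every element of an upstream dataset."""
    def __init__(self, source: IterableDataset, fn: Callable) -> None:
        self.source = source
        self.fn = fn

    def __iter__(self) -> Iterator:
        for item in self.source:
            yield self.fn(item)

# Building blocks combine to form new datasets:
dataset = MapDataset(ReadLines("train.txt"), str.lower)  # "train.txt" is a placeholder
```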

Motivation

  • Each domain library currently has its own non-standard dataset structure that may also download the data. This duplicates effort and adds complexity for the user.
  • A common bottleneck when generating datasets is reading the data. We want to offer an interface that enables reading the data and running initial preprocessing while maximizing utilization of the available computing resources (see the example after this list).
  • We may want to leverage specialized libraries such as NVIDIA DALI.
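
For reference, part of the parallel-loading motivation is already served by the existing DataLoader: worker processes read and preprocess in the background while the training loop consumes batches. A minimal example with synthetic data:

```python
# The existing DataLoader already parallelizes reading and preprocessing:
# num_workers processes read ahead while the main process consumes batches.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 16), torch.randint(0, 2, (1000,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4, pin_memory=True)

for inputs, targets in loader:
    pass  # workers keep reading the next batches while this one is consumed
```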

Additional Information

Datasets:

Dataloader:

Features:

cc @ssnl @fmassa @zhangguanheng66 @vincentqb @mrshenli

@vadimkantorov
Contributor

vadimkantorov commented Aug 20, 2019

An additional wish: unify the transforms interface. One idea: make them regular autograd functions / modules. This enables sharing more of their implementation across domains (imagine various forms of cutout / warps, synthetic noise), moving them from CPU to GPU, applying them at other points in the model, etc. A sketch follows below.
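
A minimal sketch of this idea, assuming a hypothetical AddGaussianNoise transform (not an existing API): written as a regular nn.Module, it can be moved between devices and composed inside a model like any other layer.

```python
# Sketch: a transform as a regular nn.Module. AddGaussianNoise is an
# illustrative name, not an existing API. As a module it can be moved to GPU
# with .to(...) and composed with nn.Sequential anywhere in the model.
import torch
import torch.nn as nn

class AddGaussianNoise(nn.Module):
    """Synthetic-noise augmentation usable on any tensor domain (image, audio, ...)."""
    def __init__(self, std: float = 0.1) -> None:
        super().__init__()
        self.std = std

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:  # perturb only in training mode, like Dropout
            return x + self.std * torch.randn_like(x)
        return x

augment = AddGaussianNoise(0.05)
x = torch.randn(8, 3, 32, 32)
y = augment(x)                      # runs on CPU tensors...
# y = augment.to("cuda")(x.cuda())  # ...or on GPU, by moving it like any module
```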

@vincentqb added the better-engineering, module: dataloader, triaged, and triage review labels Aug 20, 2019
@soumith
Member

soumith commented Aug 23, 2019

all datasets are not moving to core pytorch. If you want, create a torchdata for it and make domain APIs depend on torchdata.

@zhangguanheng66
Contributor

> all datasets are not moving to core pytorch. If you want, create a torchdata for it and make domain APIs depend on torchdata.

One of the plans is to keep the raw data links and utils in pytorch/data and have the preprocessing func and associated ops in each domain.

@vincentqb changed the title from Dataset Functionality to torchdata Aug 23, 2019
@ezyang added the needs research label (We need to decide whether or not this merits inclusion, based on research) and removed the triage review and needs research labels Aug 26, 2019
@cpuhrsch
Contributor

cc @ssnl - what is your take on this?

@ssnl
Collaborator

ssnl commented Aug 29, 2019

I'm wondering what functionalities are planned for torchdata. Most of the proposed features already sound doable today. I think an additional explanation of what the exact needs are, and why the current code can't do the job, would be helpful for the discussion.

For example,

> a core download feature, within PyTorch

sounds similar to torch.hub's download utility function, and

> a core raw data loading (e.g. ByteTensor iterator), within PyTorch

sounds doable just with the current data loading infra.
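
Both points can be sketched with pieces that exist today; the URL, file names, and the ByteChunks class below are placeholders, not a proposed API:

```python
# Sketch of both points with pieces that exist today; the URL, file names,
# and the ByteChunks class are placeholders, not a proposed API.
import torch
from torch.utils.data import IterableDataset

# (1) Downloading: torch.hub already ships a download helper.
torch.hub.download_url_to_file("https://example.com/data.bin", "data.bin")

# (2) Raw byte loading: an IterableDataset can lazily yield uint8 tensor chunks.
class ByteChunks(IterableDataset):
    def __init__(self, path: str, chunk_size: int = 4096) -> None:
        self.path = path
        self.chunk_size = chunk_size

    def __iter__(self):
        with open(self.path, "rb") as f:
            chunk = f.read(self.chunk_size)
            while chunk:
                yield torch.tensor(list(chunk), dtype=torch.uint8)
                chunk = f.read(self.chunk_size)
```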

@vadimkantorov
Contributor

@soumith Any thoughts on my suggestion about transforms interface? (i.e. making them normal pytorch functions / modules to support better reuse)

@fmassa
Member

fmassa commented Oct 7, 2019

@vadimkantorov

> Any thoughts on my suggestion about transforms interface? (i.e. making them normal pytorch functions / modules to support better reuse)

Yes, I agree, and torchvision is going to be doing this, see pytorch/vision#1375

@vincentqb
Contributor Author

vincentqb commented Oct 9, 2019

Thoughts from an offline chat with @taylorgordon20

How can we help with debugging in a natural Python way?

  • Could use maps (e.g. Map(read, get_line)) instead of generators; a sketch follows this list.
  • Do we want static type analysis?
  • Enforce structure between generators?
  • tensorflow_datasets/image/coco.py has "features" for typing.
  • Facebook's Pyre may support templating and typed-dictionary checks in functions?
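
A rough sketch of that map-based style; the exact semantics of Map, read, and get_line are assumptions drawn from the note above, not an agreed design:

```python
# Hypothetical sketch of the Map(...) style from the chat notes; the exact
# semantics of Map/read/get_line are assumptions, not an agreed design.
from typing import Callable, Iterable, Iterator

class Map:
    """A named pipeline stage: easier to inspect in a debugger than an
    anonymous generator frame."""
    def __init__(self, source: Iterable, fn: Callable) -> None:
        self.source = source
        self.fn = fn

    def __iter__(self) -> Iterator:
        for item in self.source:
            yield self.fn(item)

def read(path: str) -> Iterator[str]:
    with open(path, encoding="utf-8") as f:
        yield from f

def get_line(raw: str) -> str:
    return raw.rstrip("\n")

pipeline = Map(read("train.txt"), get_line)  # "train.txt" is a placeholder
```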

Wishes?

  • Batched.
  • Lazy preferred over eager for data loading.
  • Async preferred over sync (e.g. use the CPU while the GPU is busy, multiple hosts).
  • Prefer imperative over declarative.
  • Serializable (for preprocessing in particular); can be done in a few ways: TorchScript, etc.
  • Shuffle, sample.
  • Save, cache transformed data, memoization (available for free for flows?); a sketch follows this list.
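
One way the caching/memoization wish could be served with existing pieces (a sketch; the CachedDataset name and per-sample .pt layout are assumptions):

```python
# Sketch of the "cache transformed data" wish: memoize each transformed
# sample to disk so expensive preprocessing runs only once. CachedDataset
# and the per-sample .pt layout are assumptions, not an existing API.
import os
import torch
from torch.utils.data import Dataset

class CachedDataset(Dataset):
    """Wraps a map-style dataset and caches its samples as .pt files."""
    def __init__(self, source: Dataset, cache_dir: str) -> None:
        self.source = source
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def __len__(self) -> int:
        return len(self.source)

    def __getitem__(self, idx: int):
        path = os.path.join(self.cache_dir, f"{idx}.pt")
        if os.path.exists(path):
            return torch.load(path)  # hit: reuse the memoized sample
        sample = self.source[idx]    # miss: compute once...
        torch.save(sample, path)     # ...then persist for later epochs
        return sample
```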
