
Shared Dataset Functionality #24915

Open
vincentqb opened this issue Aug 20, 2019 · 8 comments
Labels
better-engineering Relatively self-contained tasks for better engineering contributors · module: dataloader Related to torch.utils.data.DataLoader and Sampler · triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@vincentqb
Contributor

vincentqb commented Aug 20, 2019

🚀 Feature

We want to build a unified data pipeline interface that offers building blocks for others to build on, with the following objectives (a sketch follows the list):

  • Standardize datasets across domains.
  • Offer flexible building blocks that can be combined to obtain other datasets.
  • Enable datasets that do not fit in memory.
  • Share code among domains.
  • Facilitate parallel loading and processing of data.
  • Decouple data loading from preprocessing/transformation.
  • Offer static typing for datasets.
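
To make the intended shape concrete, here is a minimal sketch of such composable, lazy building blocks. The ReadLines and MapDataset names are illustrative assumptions, not an existing API:

```python
# A hypothetical sketch only: ReadLines and MapDataset are illustrative names,
# not an existing PyTorch API. Both are lazy, so the data never has to fit in memory.
from typing import Callable, Iterator

from torch.utils.data import IterableDataset

class ReadLines(IterableDataset):
    """Lazily yields lines from a text file; loading stays decoupled from preprocessing."""
    def __init__(self, path: str) -> None:
        self.path = path

    def __iter__(self) -> Iterator[str]:
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.rstrip("\n")

class MapDataset(IterableDataset):
    """Applies fn to every element of an upstream dataset."""
    def __init__(self, source: IterableDataset, fn: Callable) -> None:
        self.source = source
        self.fn = fn

    def __iter__(self) -> Iterator:
        for item in self.source:
            yield self.fn(item)

# Building blocks combine to form new datasets:
dataset = MapDataset(ReadLines("train.txt"), str.lower)  # "train.txt" is a placeholder
```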

Motivation

  • Each domain library currently has its own non-standard dataset structure that may also download the data. This duplicates effort and adds complexity for the user.
  • A common bottleneck when generating datasets is reading the data. We want to offer an interface that enables reading the data and running initial preprocessing while maximizing utilization of the available computing resources (see the example after this list).
  • We may want to leverage specialized libraries such as NVIDIA DALI.
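
For reference, part of the parallel-loading motivation is already served by the existing DataLoader: worker processes read and preprocess in the background while the training loop consumes batches. A minimal example with synthetic data:

```python
# The existing DataLoader already parallelizes reading and preprocessing:
# num_workers processes read ahead while the main process consumes batches.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 16), torch.randint(0, 2, (1000,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4, pin_memory=True)

for inputs, targets in loader:
    pass  # workers keep reading the next batches while this one is consumed
```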

Additional Information

Datasets:

Dataloader:

Features:

cc @ssnl @fmassa @zhangguanheng66 @vincentqb @mrshenli

@vadimkantorov
Contributor

vadimkantorov commented Aug 20, 2019

An additional wish: unify the transforms interface. One idea: make them regular autograd functions / modules. This enables sharing more of their implementation across domains (imagine various forms of cutout / warps, synthetic noise), moving them from CPU to GPU, applying them at other points in the model, etc. A sketch follows below.
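
A minimal sketch of this idea, assuming a hypothetical AddGaussianNoise transform (not an existing API): written as a regular nn.Module, it can be moved between devices and composed inside a model like any other layer.

```python
# Sketch: a transform as a regular nn.Module. AddGaussianNoise is an
# illustrative name, not an existing API. As a module it can be moved to GPU
# with .to(...) and composed with nn.Sequential anywhere in the model.
import torch
import torch.nn as nn

class AddGaussianNoise(nn.Module):
    """Synthetic-noise augmentation usable on any tensor domain (image, audio, ...)."""
    def __init__(self, std: float = 0.1) -> None:
        super().__init__()
        self.std = std

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:  # perturb only in training mode, like Dropout
            return x + self.std * torch.randn_like(x)
        return x

augment = AddGaussianNoise(0.05)
x = torch.randn(8, 3, 32, 32)
y = augment(x)                      # runs on CPU tensors...
# y = augment.to("cuda")(x.cuda())  # ...or on GPU, by moving it like any module
```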

@vincentqb added the better-engineering, module: dataloader, triaged, and triage review labels Aug 20, 2019
@soumith
Member

soumith commented Aug 23, 2019

all datasets are not moving to core pytorch. If you want, create a torchdata for it and make domain APIs depend on torchdata.

@zhangguanheng66
Contributor

> all datasets are not moving to core pytorch. If you want, create a torchdata for it and make domain APIs depend on torchdata.

One of the plans is to keep the raw data links and utils in pytorch/data and have the preprocessing func and associated ops in each domain.

@vincentqb changed the title from Dataset Functionality to torchdata Aug 23, 2019
@ezyang added the needs research label (We need to decide whether or not this merits inclusion, based on research) and removed the triage review and needs research labels Aug 26, 2019
@cpuhrsch
Contributor

cc @ssnl - what is your take on this?

@ssnl
Collaborator

ssnl commented Aug 29, 2019

I'm wondering what functionalities are planned for torchdata. Most of the proposed features already sound doable today. I think an additional explanation of what the exact needs are, and why the current code can't do the job, would be helpful for the discussion.

For example,

> a core download feature, within PyTorch

sounds similar to torch.hub's download utility function, and

> a core raw data loading (e.g. ByteTensor iterator), within PyTorch

sounds doable just with the current data loading infra.
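
Both points can be sketched with pieces that exist today; the URL, file names, and the ByteChunks class below are placeholders, not a proposed API:

```python
# Sketch of both points with pieces that exist today; the URL, file names,
# and the ByteChunks class are placeholders, not a proposed API.
import torch
from torch.utils.data import IterableDataset

# (1) Downloading: torch.hub already ships a download helper.
torch.hub.download_url_to_file("https://example.com/data.bin", "data.bin")

# (2) Raw byte loading: an IterableDataset can lazily yield uint8 tensor chunks.
class ByteChunks(IterableDataset):
    def __init__(self, path: str, chunk_size: int = 4096) -> None:
        self.path = path
        self.chunk_size = chunk_size

    def __iter__(self):
        with open(self.path, "rb") as f:
            chunk = f.read(self.chunk_size)
            while chunk:
                yield torch.tensor(list(chunk), dtype=torch.uint8)
                chunk = f.read(self.chunk_size)
```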

@vadimkantorov
Contributor

@soumith Any thoughts on my suggestion about transforms interface? (i.e. making them normal pytorch functions / modules to support better reuse)

@fmassa
Member

fmassa commented Oct 7, 2019

@vadimkantorov

> Any thoughts on my suggestion about transforms interface? (i.e. making them normal pytorch functions / modules to support better reuse)

Yes, I agree, and torchvision is going to be doing this, see pytorch/vision#1375

@vincentqb
Contributor Author

vincentqb commented Oct 9, 2019

Thoughts from an offline chat with @taylorgordon20

How can we help with debugging in a natural Python way?

  • Could use maps (e.g. Map(read, get_line)) instead of generators; a sketch follows this list.
  • Do we want static type analysis?
  • Enforce structure between generators?
  • tensorflow_datasets/image/coco.py has "features" for typing.
  • Facebook's Pyre may support templating and typed-dictionary checks in functions?
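
A rough sketch of that map-based style; the exact semantics of Map, read, and get_line are assumptions drawn from the note above, not an agreed design:

```python
# Hypothetical sketch of the Map(...) style from the chat notes; the exact
# semantics of Map/read/get_line are assumptions, not an agreed design.
from typing import Callable, Iterable, Iterator

class Map:
    """A named pipeline stage: easier to inspect in a debugger than an
    anonymous generator frame."""
    def __init__(self, source: Iterable, fn: Callable) -> None:
        self.source = source
        self.fn = fn

    def __iter__(self) -> Iterator:
        for item in self.source:
            yield self.fn(item)

def read(path: str) -> Iterator[str]:
    with open(path, encoding="utf-8") as f:
        yield from f

def get_line(raw: str) -> str:
    return raw.rstrip("\n")

pipeline = Map(read("train.txt"), get_line)  # "train.txt" is a placeholder
```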

Wishes?

  • Batched.
  • Lazy preferred over eager for data loading.
  • Async preferred over sync (e.g. use the CPU while the GPU is busy, multiple hosts).
  • Prefer imperative over declarative.
  • Serializable (for preprocessing in particular); can be done in a few ways: TorchScript, etc.
  • Shuffle, sample.
  • Save, cache transformed data, memoization (available for free for flows?); a sketch follows this list.
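
One way the caching/memoization wish could be served with existing pieces (a sketch; the CachedDataset name and per-sample .pt layout are assumptions):

```python
# Sketch of the "cache transformed data" wish: memoize each transformed
# sample to disk so expensive preprocessing runs only once. CachedDataset
# and the per-sample .pt layout are assumptions, not an existing API.
import os
import torch
from torch.utils.data import Dataset

class CachedDataset(Dataset):
    """Wraps a map-style dataset and caches its samples as .pt files."""
    def __init__(self, source: Dataset, cache_dir: str) -> None:
        self.source = source
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def __len__(self) -> int:
        return len(self.source)

    def __getitem__(self, idx: int):
        path = os.path.join(self.cache_dir, f"{idx}.pt")
        if os.path.exists(path):
            return torch.load(path)  # hit: reuse the memoized sample
        sample = self.source[idx]    # miss: compute once...
        torch.save(sample, path)     # ...then persist for later epochs
        return sample
```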
