
TorchData 0.5.0 Release Notes

  • Highlights
  • Backwards Incompatible Change
  • Deprecations
  • New Features
  • Improvements
  • Bug Fixes
  • Performance
  • Documentation
  • Future Plans
  • Beta Usage Note

Highlights

We are excited to announce the release of TorchData 0.5.0. This release is composed of about 236 commits since 0.4.1, including ones from PyTorch Core since 1.12.1, made by more than 35 contributors. We want to sincerely thank our community for continuously improving TorchData.

TorchData 0.5.0 updates are focused on consolidating the DataLoader2 and ReadingService APIs and benchmarking. Highlights include:

  • Added support to load data from more cloud storage providers, now covering AWS, Google Cloud Storage, and Azure. A detailed tutorial can be found here
  • Consolidated API for DataLoader2 and provided a few ReadingServices, with detailed documentation now available here
  • Provided more comprehensive DataPipe operations, e.g., random_split, repeat, set_length, and prefetch (see the sketch after this list).
  • Provided pre-compiled torchdata binaries for arm64 Apple Silicon
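
A minimal sketch of the new operations, assuming an in-memory IterableWrapper; functional names follow the 0.5.0 API listing below, but double-check signatures against your installed version:

>>> from torchdata.datapipes.iter import IterableWrapper
>>> dp = IterableWrapper(range(10))
>>> # Split 80/20 into train/valid pipes, reproducibly via a fixed seed
>>> train, valid = dp.random_split(weights={"train": 8, "valid": 2}, seed=0, total_length=10)
>>> # Repeat each element twice, declare the resulting length, and prefetch ahead of the consumer
>>> dp = IterableWrapper(range(10)).repeat(2).set_length(20).prefetch(buffer_size=4)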

Backwards Incompatible Change

DataPipe

Changed the returned value of MapDataPipe.shuffle to an IterDataPipe (pytorch/pytorch#83202)

An IterDataPipe is now returned in order to preserve data order.

MapDataPipe.shuffle
0.4.1:
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
True
>>> isinstance(dp, IterDataPipe)
False

0.5.0:
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
False
>>> isinstance(dp, IterDataPipe)
True

on_disk_cache no longer accepts generator functions for the filepath_fn argument (#810)

on_disk_cache
0.4.1:
>>> url_dp = IterableWrapper(["https://path/to/filename", ])
>>> def filepath_gen_fn(url):
...     yield from [url + f"/{i}" for i in range(3)]
>>> cache_dp = url_dp.on_disk_cache(filepath_fn=filepath_gen_fn)

0.5.0:
>>> url_dp = IterableWrapper(["https://path/to/filename", ])
>>> def filepath_gen_fn(url):
...     yield from [url + f"/{i}" for i in range(3)]
>>> cache_dp = url_dp.on_disk_cache(filepath_fn=filepath_gen_fn)
# AssertionError
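
To migrate, return the file paths eagerly (e.g., as a list) instead of yielding them; a minimal sketch with a placeholder URL:

>>> from torchdata.datapipes.iter import IterableWrapper
>>> url_dp = IterableWrapper(["https://path/to/filename", ])
>>> def filepath_fn(url):
...     # Return a list of paths rather than a generator
...     return [url + f"/{i}" for i in range(3)]
>>> cache_dp = url_dp.on_disk_cache(filepath_fn=filepath_fn)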

DataLoader2

Imposed single iterator constraint on DataLoader2 (#700)

DataLoader2 with a single iterator
0.4.1:
>>> dl = DataLoader2(IterableWrapper(range(10)))
>>> it1 = iter(dl)
>>> print(next(it1))
0
>>> it2 = iter(dl)  # No reset here
>>> print(next(it2))
1
>>> print(next(it1))
2

0.5.0:
>>> dl = DataLoader2(IterableWrapper(range(10)))
>>> it1 = iter(dl)
>>> print(next(it1))
0
>>> it2 = iter(dl)  # DataLoader2 resets with the creation of a new iterator
>>> print(next(it2))
0
>>> print(next(it1))
# Raises exception, since it1 is no longer valid

Deep copy DataPipe during DataLoader2 initialization or restoration (#786, #833)

Previously, if a DataPipe was passed to multiple DataLoaders, its state could be altered by any of those DataLoaders. In some cases, that could raise an exception due to the single iterator constraint; in other cases, behavior could change due to the adapters (e.g. shuffling) of another DataLoader.

Deep copy of DataPipe in the DataLoader2 constructor
0.4.1:
>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> dl1 = DataLoader2(dp)
>>> dl2 = DataLoader2(dp)
>>> for x, y in zip(dl1, dl2):
...     print(x, y)
# RuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe...

0.5.0:
>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> dl1 = DataLoader2(dp)
>>> dl2 = DataLoader2(dp)
>>> for x, y in zip(dl1, dl2):
...     print(x, y)
0 0
1 1
2 2
3 3
4 4

Deprecations

DataLoader2

Deprecated traverse function and only_datapipe argument (pytorch/pytorch#85667)

Please use traverse_dps, whose behavior is the same as traverse with only_datapipe=True (#793).

DataPipe traverse function
0.4.1:
>>> dp_graph = torch.utils.data.graph.traverse(datapipe, only_datapipe=False)

0.5.0:
>>> dp_graph = torch.utils.data.graph.traverse(datapipe, only_datapipe=False)
FutureWarning: `traverse` function and only_datapipe argument will be removed after 1.13.
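
A minimal sketch of the replacement call, on an arbitrary datapipe:

>>> from torch.utils.data.graph import traverse_dps
>>> from torchdata.datapipes.iter import IterableWrapper
>>> datapipe = IterableWrapper(range(10)).shuffle()
>>> dp_graph = traverse_dps(datapipe)  # same behavior as traverse(..., only_datapipe=True)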

New Features

DataPipe

  • Added AIStore DataPipe (#545, #667)
  • Added support for IterDataPipe to trace DataFrames operations (pytorch/pytorch#71931)
  • Added support for DataFrameMakerIterDataPipe to accept dtype_generator to handle otherwise unserializable dtypes (#537)
  • Added graph snapshotting by counting number of successful yields for IterDataPipe (pytorch/pytorch#79479, pytorch/pytorch#79657)
  • Implemented drop operation for IterDataPipe to drop column(s) (#725) (demonstrated in the sketch after this list)
  • Implemented FullSyncIterDataPipe to synchronize distributed shards (#713)
  • Implemented slice and flatten operations for IterDataPipe (#730)
  • Implemented repeat operation for IterDataPipe (#748)
  • Added LengthSetterIterDataPipe (#747)
  • Added RandomSplitter (without buffer) (#724)
  • Added a padded-token option to max_token_bucketize to bucketize samples based on total padded token length (#789)
  • Implemented thread based PrefetcherIterDataPipe (#770, #818, #826, #842)
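
A brief sketch of the row-level operations, assuming tuple rows; output container types follow the input, and exact semantics are per the API docs:

>>> from torchdata.datapipes.iter import IterableWrapper
>>> dp = IterableWrapper([(0, 10, 100), (1, 11, 101)])
>>> kept = dp.slice([0, 2])   # keep columns 0 and 2 of each row
>>> trimmed = dp.drop(1)      # equivalently, drop column 1 from each row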

DataLoader2

  • Added CacheTimeout Adapter to redefine cache timeout of the DataPipe graph (#571)
  • Added DistributedReadingService to support uneven data sharding (#727)
  • Added PrototypeMultiProcessingReadingService (see the sketch after this list)
    • Added prefetching (#826)
    • Fixed process termination (#837)
    • Enabled deterministic training in distributed/non-distributed environment (#827)
    • Handled empty queue exception properly (#785)
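
A hedged sketch of wiring one of the new ReadingServices into DataLoader2; the worker count is illustrative:

>>> from torchdata.dataloader2 import DataLoader2, PrototypeMultiProcessingReadingService
>>> from torchdata.datapipes.iter import IterableWrapper
>>> dp = IterableWrapper(range(1000)).shuffle().sharding_filter()
>>> rs = PrototypeMultiProcessingReadingService(num_workers=2)
>>> dl = DataLoader2(dp, reading_service=rs)
>>> for x in dl:
...     pass  # training loop goes here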

Releng

  • Provided pre-compiled torchdata binaries for arm64 Apple Silicon (#692)

Improvements

DataPipe

  • Fixed error message coming from single iterator constraint (pytorch/pytorch#79547)
  • Enabled profiler record context in __next__ for IterDataPipe (pytorch/pytorch#79757)
  • Raised warning for unpicklable local function (pytorch/pytorch#80232, #547)
  • Cleaned up opened streams on a best-effort basis (#560, pytorch/pytorch#78952)
  • Used streaming reading mode for unseekable streams in TarArchiveLoader (#653)
  • Improved GDrive 'content-disposition' error message (#654)
  • Added as_tuple argument to CSVParserIterDataPipe to convert output from list to tuple (#646) (see the sketch after this list)
  • Raised an error when HTTPReader gets a 404 response (#160, #569)
  • Added default no-op behavior for flatmap (#749)
  • Added support to validate input_col with the provided map function for DataPipe (pytorch/pytorch#80267, #755, pytorch/pytorch#84279)
  • Made ShufflerIterDataPipe support snapshotting (pytorch/pytorch#83535)
  • Unified the implementations of in_batch_shuffle and shuffle for IterDataPipe (#745)
  • Made IterDataPipe.to_map_datapipe load data lazily (#765)
  • Added kwargs to open files for FSSpecFileLister and FSSpecSaver (#804)
  • Added missing functional name for FileLister (pytorch/pytorch#86497)
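
A minimal sketch of the as_tuple option, assuming a local "data.csv" placeholder file:

>>> from torchdata.datapipes.iter import IterableWrapper, FileOpener
>>> dp = FileOpener(IterableWrapper(["data.csv"]), mode="rt")
>>> rows = dp.parse_csv(as_tuple=True)  # each row is yielded as a tuple instead of a list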

Releng

  • Enabled conda release to support GLIBC_2.27 (#859)

Bug Fixes

Performance

DataLoader2

  • Added benchmarking for DataLoader2
    • Added AWS cloud configurations (#680)
    • Added benchmark from torchvision training references (#714)

Documentation

DataPipe

  • Added examples for data loading with DataPipe
    • Read Criteo TSV and Parquet files and apply TorchArrow operations (#561)
    • Read Caltech256 and COCO with AIStoreDataPipe (#582)
    • Read from the TigerGraph database (#783)
  • Improved docstring for DataPipe
  • Added tutorial to load from Cloud Storage Provider including AWS S3, Google Cloud Platform and Azure Blob Storage (#812, #836)
  • Improved tutorial
    • Fixed tutorial for newline on Windows in generate_csv (#675)
    • Improved note on shuffling behavior (#688)
    • Fixed tutorial about shuffling before sharding (#715)
    • Added random_split example (#843)
  • Simplified long type names for online doc (#838)

DataLoader2

  • Improved docstring for DataLoader2 (#581, #817)
  • Added training examples using DataLoader2, ReadingService and DataPipe (#563, #664, #670, #787)

Releng

  • Added contribution guide for third-party library (#663)

Future Plans

We will continue benchmarking over datasets on local disk and cloud storage using TorchData. We will also keep making DataLoader2 and the related ReadingServices more stable, and provide more features such as snapshotting the data pipeline and restoring it from a serialized state. Stay tuned, and we welcome any feedback.

Beta Usage Note

This library is currently in the Beta stage and does not yet have a stable release. The API may change based on user feedback or performance. We are committed to bringing this library to a stable release, but future changes may not be completely backward compatible. If you install from source or use the nightly version of this library, use it along with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like covered, please open a GitHub issue. We'd love to hear thoughts and feedback.