TorchData 0.4.0 Beta Release

TorchData 0.4.0 Release Notes

  • Highlights
  • Backwards Incompatible Change
  • Deprecations
  • New Features
  • Improvements
  • Performance
  • Documentation
  • Future Plans
  • Beta Usage Note

Highlights

We are excited to announce the release of TorchData 0.4.0. This release is composed of about 120 commits since 0.3.0, made by 23 contributors. We want to sincerely thank our community for continuously improving TorchData.

TorchData 0.4.0 updates are focused on consolidating the DataPipe APIs and supporting more remote file systems. Highlights include:

  • The DataPipe graph is now backward compatible with DataLoader regarding dynamic sharding and shuffle determinism in single-process, multiprocessing, and distributed environments. Please check the tutorial for details; a minimal sketch follows this list.
  • AWSSDK is integrated to support listing/loading files from AWS S3.
  • Added support for reading from TFRecord files and the Hugging Face Hub.
  • DataLoader2 became available in prototype mode. For more details, please check our future plans.
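
A minimal sketch of the behavior described in the first item above (the DataPipe graph and DataLoader arguments are illustrative; the tutorial covers the details):

>>> from torch.utils.data import DataLoader
>>> from torchdata.datapipes.iter import IterableWrapper
>>> def times_two(x):  # named function so it pickles across workers
...     return x * 2
>>> dp = IterableWrapper(range(16)).shuffle().sharding_filter().map(times_two)
>>> # DataLoader now drives shuffle determinism and per-worker sharding
>>> dl = DataLoader(dp, batch_size=4, num_workers=2, shuffle=True)
>>> for batch in dl:
...     pass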

Backwards Incompatible Change

DataPipe

Updated Multiplexer (functional API mux) to stop merging multiple DataPipes whenever the shortest one is exhausted (pytorch/pytorch#77145)

Please use MultiplexerLongest (functional API mux_longest) to achieve the previous functionality; see the example after the comparison below.

0.3.0:

>>> dp1 = IterableWrapper(range(3))
>>> dp2 = IterableWrapper(range(10, 15))
>>> dp3 = IterableWrapper(range(20, 25))
>>> output_dp = dp1.mux(dp2, dp3)
>>> list(output_dp)
[0, 10, 20, 1, 11, 21, 2, 12, 22, 13, 23, 14, 24]
>>> len(output_dp)
13

0.4.0:

>>> dp1 = IterableWrapper(range(3))
>>> dp2 = IterableWrapper(range(10, 15))
>>> dp3 = IterableWrapper(range(20, 25))
>>> output_dp = dp1.mux(dp2, dp3)
>>> list(output_dp)
[0, 10, 20, 1, 11, 21, 2, 12, 22]
>>> len(output_dp)
9
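
Under 0.4.0, MultiplexerLongest interleaves until the longest input is exhausted, reproducing the previous behavior. A minimal sketch reusing the DataPipes above:

>>> output_dp = dp1.mux_longest(dp2, dp3)
>>> list(output_dp)
[0, 10, 20, 1, 11, 21, 2, 12, 22, 13, 23, 14, 24]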

Enforced a single valid iterator per IterDataPipe, with and without multiple outputs (pytorch/pytorch#70479, pytorch/pytorch#75995)

If you need to reference the same IterDataPipe multiple times, please apply .fork() on the IterDataPipe instance; see the example after the comparisons below.

IterDataPipe with a single output

0.3.0:

>>> source_dp = IterableWrapper(range(10))
>>> it1 = iter(source_dp)
>>> list(it1)
[0, 1, ..., 9]
>>> it1 = iter(source_dp)
>>> next(it1)
0
>>> it2 = iter(source_dp)
>>> next(it2)
0
>>> next(it1)
1
# Multiple references of DataPipe
>>> source_dp = IterableWrapper(range(10))
>>> zip_dp = source_dp.zip(source_dp)
>>> list(zip_dp)
[(0, 0), ..., (9, 9)]

0.4.0:

>>> source_dp = IterableWrapper(range(10))
>>> it1 = iter(source_dp)
>>> list(it1)
[0, 1, ..., 9]
>>> it1 = iter(source_dp)  # This doesn't raise any warning or error
>>> next(it1)
0
>>> it2 = iter(source_dp)
>>> next(it2)  # Invalidates `it1`
0
>>> next(it1)
RuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe: IterableWrapperIterDataPipe(deepcopy=True, iterable=range(0, 10))
This may be caused multiple references to the same IterDataPipe. We recommend using `.fork()` if that is necessary.
For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
# Multiple references of DataPipe
>>> source_dp = IterableWrapper(range(10))
>>> zip_dp = source_dp.zip(source_dp)
>>> list(zip_dp)
RuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe: IterableWrapperIterDataPipe(deepcopy=True, iterable=range(0, 10))
This may be caused multiple references to the same IterDataPipe. We recommend using `.fork()` if that is necessary.
For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
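
As recommended above, the multiple-reference example can be rewritten with .fork(). A minimal sketch:

>>> source_dp = IterableWrapper(range(10))
>>> dp1, dp2 = source_dp.fork(num_instances=2)
>>> zip_dp = dp1.zip(dp2)
>>> list(zip_dp)
[(0, 0), ..., (9, 9)]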

IterDataPipe with multiple outputs

0.3.0:

>>> source_dp = IterableWrapper(range(10))
>>> cdp1, cdp2 = source_dp.fork(num_instances=2)
>>> it1, it2 = iter(cdp1), iter(cdp2)
>>> list(it1)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(it2)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> it1, it2 = iter(cdp1), iter(cdp2)
>>> it3 = iter(cdp1)
# Basically share the same reference as `it1`
# doesn't reset because `cdp1` hasn't been read since reset
>>> next(it1)
0
>>> next(it2)
0
>>> next(it3)
1
# The next line resets all ChildDataPipe
# because `cdp2` has started reading
>>> it4 = iter(cdp2)
>>> next(it3)
0
>>> list(it4)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

0.4.0:

>>> source_dp = IterableWrapper(range(10))
>>> cdp1, cdp2 = source_dp.fork(num_instances=2)
>>> it1, it2 = iter(cdp1), iter(cdp2)
>>> list(it1)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(it2)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> it1, it2 = iter(cdp1), iter(cdp2)
>>> it3 = iter(cdp1)  # This invalidates `it1` and `it2`
>>> next(it1)
RuntimeError: This iterator has been invalidated, because a new iterator has been created from one of the ChildDataPipes of _ForkerIterDataPipe(buffer_size=1000, num_instances=2).
For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
>>> next(it2)
RuntimeError: This iterator has been invalidated, because a new iterator has been created from one of the ChildDataPipes of _ForkerIterDataPipe(buffer_size=1000, num_instances=2).
For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
>>> next(it3)
0
# The next line should not invalidate anything, as there was no new iterator created
# for `cdp2` after `it2` was invalidated
>>> it4 = iter(cdp2)
>>> next(it3)
1
>>> list(it4)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Deprecations

DataPipe

Deprecated functional APIs of open_file_by_fsspec and open_file_by_iopath for IterDataPipe (pytorch/pytorch#78970, pytorch/pytorch#79302)

Please use open_files_by_fsspec and open_files_by_iopath instead.

0.3.0:

>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_file_by_fsspec()  # No Warning
>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_file_by_iopath()  # No Warning

0.4.0:

>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_file_by_fsspec()
FutureWarning: `FSSpecFileOpener()`'s functional API `.open_file_by_fsspec()` is deprecated since 0.4.0 and will be removed in 0.6.0.
See https://github.com/pytorch/data/issues/163 for details.
Please use `.open_files_by_fsspec()` instead.
>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_file_by_iopath()
FutureWarning: `IoPathFileOpener()`'s functional API `.open_file_by_iopath()` is deprecated since 0.4.0 and will be removed in 0.6.0.
See https://github.com/pytorch/data/issues/163 for details.
Please use `.open_files_by_iopath()` instead.
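
The renamed APIs are drop-in replacements. A minimal sketch:

>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_files_by_fsspec()  # No warning
>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_files_by_iopath()  # No warning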

Argument drop_empty_batches of Filter (functional API filter) is deprecated and will be removed in a future release (pytorch/pytorch#76060)

0.3.0:

>>> dp = IterableWrapper([(1, 1), (2, 2), (3, 3)])
>>> dp = dp.filter(lambda x: x[0] > 1, drop_empty_batches=True)

0.4.0:

>>> dp = IterableWrapper([(1, 1), (2, 2), (3, 3)])
>>> dp = dp.filter(lambda x: x[0] > 1, drop_empty_batches=True)
FutureWarning: The argument `drop_empty_batches` of `FilterIterDataPipe()` is deprecated since 1.12 and will be removed in 1.14.
See https://github.com/pytorch/data/issues/163 for details.
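
Dropping the deprecated argument keeps the default behavior and silences the warning. A minimal sketch:

>>> dp = IterableWrapper([(1, 1), (2, 2), (3, 3)])
>>> dp = dp.filter(lambda x: x[0] > 1)
>>> list(dp)
[(2, 2), (3, 3)]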

New Features

DataPipe

  • Added utility to visualize DataPipe graphs (#330)

IterDataPipe

  • Added Bz2FileLoader with functional API of load_from_bz2 (#312)
  • Added BatchMapper (functional API: map_batches) and FlatMapper (functional API: flatmap) (#359)
  • Added support for WebDataset-style archives (#367)
  • Added MultiplexerLongest with functional API of mux_longest (#372)
  • Added ZipperLongest with functional API of zip_longest (#373)
  • Added MaxTokenBucketizer with functional API of max_token_bucketize (#283)
  • Added S3FileLister (functional API: list_files_by_s3) and S3FileLoader (functional API: load_files_by_s3) integrated with the native AWSSDK (#165)
  • Added HuggingFaceHubReader (#490)
  • Added TFRecordLoader with functional API of load_from_tfrecord (#308); a sketch follows this list
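
For example, TFRecord files can be read by chaining the new loader onto binary file streams. A minimal sketch (the local directory and *.tfrecord mask are illustrative):

>>> from torchdata.datapipes.iter import FileLister, FileOpener
>>> dp = FileLister(".", masks="*.tfrecord")
>>> dp = FileOpener(dp, mode="b")
>>> dp = dp.load_from_tfrecord()
>>> for example in dp:
...     pass  # each element is a dict of parsed features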

MapDataPipe

  • Added UnZipper with functional API of unzip (#325)
  • Added MapToIterConverter with functional API of to_iter_datapipe (#327)
  • Added InMemoryCacheHolder with functional API of in_memory_cache (#328); a combined sketch follows this list
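
A minimal sketch combining the new MapDataPipe utilities (the sample data is illustrative):

>>> from torchdata.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper([(i, i * 10) for i in range(3)]).in_memory_cache()
>>> keys, values = dp.unzip(sequence_length=2)
>>> list(keys.to_iter_datapipe())
[0, 1, 2]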

Releng

  • Added nightly releases for TorchData. Users should be able to install nightly TorchData via
    • pip install --pre torchdata -f https://download.pytorch.org/whl/nightly/cpu
    • conda install -c pytorch-nightly torchdata
  • Added support for AWSSDK-enabled DataPipes. See the README.
    • AWSSDK was pre-compiled and assembled in TorchData for both nightly and 0.4.0 releases

Improvements

DataPipe

  • Added optional encoding argument to FileOpener (pytorch/pytorch#72715)
  • Renamed BucketBatcher argument to avoid name collision (#304)
  • Removed default parameter of ShufflerIterDataPipe (pytorch/pytorch#74370)
  • Made the profiler wrapper delegate function calls to the DataPipe iterator (pytorch/pytorch#75275)
  • Added input_col argument to flatmap for applying fn to the specific column(s) (#363)
  • Improved debug message when exceptions are raised within IterDataPipe (pytorch/pytorch#75618)
  • Improved debug message when argument is a tuple/list of DataPipes (pytorch/pytorch#76134)
  • Added functional APIs to FileOpener (functional API: open_files) and StreamReader (functional API: read_from_stream) (pytorch/pytorch#76233)
  • Enabled graph traversal for MapDataPipe (pytorch/pytorch#74851)
  • Added input_col argument to filter for applying filter_fn to the specific column(s) (pytorch/pytorch#76060)
  • Added functional APIs for OnlineReaders (#369)
    • HTTPReaderIterDataPipe: read_from_http
    • GDriveReaderDataPipe: read_from_gdrive
    • OnlineReaderIterDataPipe: read_from_remote
  • Cleared buffer for DataPipe during __del__ (pytorch/pytorch#76345)
  • Overrode the incorrect Python HTTPS proxy on Windows (#371)
  • Exposed functional API of 'to_map_datapipe' from IterDataPipe's pyi interface (#326)
  • Moved buffer for IterDataPipe from iterator to instance (self) (#388)
  • Improved DataPipe serialization:
    • Moved IterDataPipe buffers from __iter__ to instance (self) (pytorch/pytorch#76999)
    • Refactored buffer of Multiplexer from __iter__ to instance (self) (pytorch/pytorch#77775)
  • Made GDriveReader handle the Virus Scan Warning (#442)
  • Added **kwargs arguments to HttpReader to specify extra parameters for HTTP requests (#392); see the sketch after this list
  • Updated FSSpecFileLister and IoPathFileLister to support multiple root paths, and updated FSSpecFileLister to support S3 URLs (#383)
  • Fixed race condition when writing files in multiprocessing
    • Added filelock to IoPathSaver to prevent the race condition (#413)
    • Added a lock mechanism to prevent on_disk_cache downloading twice (#409)
    • Added instructions about ImportError for portalocker (#506)
  • Added an 's' to the functional names of open/list DataPipes (#479)
  • Added list_file functional API to FSSpecFileLister and IoPathFileLister (#463)
  • Added list_files functional API to FileLister (pytorch/pytorch#78419)
  • Improved FSSpec DataPipes to accept extra keyword arguments (#495)
  • Passed through kwargs to the json.loads call in JsonParser (#518)
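
For instance, the new read_from_http functional API accepts keyword arguments that are forwarded to the underlying HTTP request. A minimal sketch (the URL and timeout value are illustrative):

>>> from torchdata.datapipes.iter import IterableWrapper
>>> url_dp = IterableWrapper(["https://example.com/data.csv"])
>>> dp = url_dp.read_from_http(timeout=30)
>>> for url, stream in dp:
...     pass  # yields (url, stream) tuples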

Releng

  • Made requirements.txt as the single source of truth for TorchData version (#414)
  • Prohibited Release GHA workflows from running on forked branches (#361)

Performance

DataPipe

  • Lazily generated exception message for performance (pytorch/pytorch#78673)
    • This fixes a regression introduced by the PRs implementing the single iterator constraint.
  • Disabled profiler for IterDataPipe by default (pytorch/pytorch#78674)
    • By skipping the record function when the profiler is not enabled, the speedup is up to 5-6x for DataPipes whose internal operations are very simple (e.g. IterableWrapper).

Documentation

DataPipe

  • Fixed typo in TorchVision example (#311)
  • Updated DataPipe naming guidelines (#428)
  • Updated references from DataSet to PyTorch Dataset in the documentation (#292)
  • Added examples for graphs, meshes and point clouds using DataPipe (#337)
  • Added examples for semantic segmentation and time series using DataPipe (#340)
  • Expanded the contribution guide, especially including instructions to add a new DataPipe (#354)
  • Updated tutorial about placing sharding_filter (#487)
  • Improved graph visualization documentation (#504)
  • Added instructions about ImportError for portalocker (#506)
  • Updated examples to avoid lambdas (#524)
  • Updated documentation for S3 DataPipes (#534)
  • Updated links for tutorial (#543)

IterDataPipe

  • Fixed documentation for IterToMapConverter, S3FileLister and S3FileLoader (#381)
  • Updated documentation for S3 DataPipes (#534)

MapDataPipe

  • Updated contributing guide and added guidance for MapDataPipe (#379)
    • Rather than re-implementing the same functionality twice for both IterDataPipe and MapDataPipe, we encourage users to build with the built-in functionality of IterDataPipe and convert to MapDataPipe as needed; a sketch follows.
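
A minimal sketch of this pattern, building with IterDataPipe operations and converting at the end (the (key, value) pairs expected by the converter are illustrative data):

>>> from torchdata.datapipes.iter import IterableWrapper
>>> iter_dp = IterableWrapper(range(3)).map(lambda x: (x, x ** 2))
>>> map_dp = iter_dp.to_map_datapipe()  # IterToMapConverter expects (key, value) pairs
>>> map_dp[2]
4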

DataLoader/DataLoader2

  • Fixed tutorial about DataPipe working with DataLoader (#458)
  • Updated examples and tutorial after automatic sharding has landed (#505)
  • Added README for DataLoader2 (#526, #541)

Future Plans

For DataLoader2, we are introducing new ways for DataPipes, the DataLoading API, and backends (aka ReadingServices) to interact. The feature is stable in terms of API but not yet functionally complete. We welcome early adopters and feedback, as well as potential contributors.

Beta Usage Note

This library is currently in the Beta stage and does not yet have a stable release. The API may change based on user feedback or performance. We are committed to bringing this library to a stable release, but future changes may not be completely backward compatible. If you install from source or use the nightly version of this library, use it along with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like covered, please open a GitHub issue. We'd love to hear your thoughts and feedback.