[data] Fix silent data corruption in read_webdataset for unordered tars (#44068)#63374
[data] Fix silent data corruption in read_webdataset for unordered tars (#44068)#63374abhid-007 wants to merge 2 commits into
Conversation
When a tar file's entries are not contiguous by WebDataset key prefix, _group_by_keys silently emits broken samples (one partial sample per prefix change) instead of failing. Track emitted prefixes and raise a clear ValueError when a prefix is re-encountered, naming the offending tar URL, filename, and prefix. Adds a parametrized test covering both the interleaved (raises) and spec-compliant (succeeds) cases. Fixes ray-project#44068 Signed-off-by: Abhisek Das <abhid.cs@outlook.com>
There was a problem hiding this comment.
Code Review
This pull request adds validation to ensure that WebDataset tar files are ordered by key prefix, raising a ValueError if entries are interleaved. This change prevents the silent emission of partial samples. The review feedback suggests using the match parameter in pytest.raises for more idiomatic exception testing in the new test suite.
| with pytest.raises(Exception) as exc_info: | ||
| ds.take_all() | ||
| msg = str(exc_info.value) | ||
| assert "not ordered by WebDataset key" in msg |
There was a problem hiding this comment.
Using pytest.raises with the match parameter is more concise and idiomatic than manually checking the exception message. This simplifies the test code by removing the need for an explicit exc_info variable and a separate assertion. While Exception is used here to account for potential wrapping by Ray's execution engine (e.g., RayTaskError), using match ensures we are still validating the specific failure mode.
| with pytest.raises(Exception) as exc_info: | |
| ds.take_all() | |
| msg = str(exc_info.value) | |
| assert "not ordered by WebDataset key" in msg | |
| with pytest.raises(Exception, match="not ordered by WebDataset key"): | |
| ds.take_all() |
There was a problem hiding this comment.
Good catch, applied in 6cd7322. Matches the existing match= pattern in python/ray/data/tests/datasource/test_kafka.py and test_parquet.py.
Addresses Gemini review on PR ray-project#63374. Matches the convention used elsewhere in python/ray/data/tests/ (e.g. test_kafka.py, test_parquet.py). Signed-off-by: Abhisek Das <abhid.cs@outlook.com>
Description
read_webdatasetsilently corrupts data when the input tar file is not ordered by WebDataset key prefix. No warning, no error, just more "samples" than you wrote, each one missing half its fields.I confirmed this by running the current
_group_by_keyslogic frompython/ray/data/_internal/datasource/webdataset_datasource.pyon a 4-entry tar with two prefixes interleaved.Input tar (4 files, 2 logical samples):
Current behavior on main, 4 broken samples emitted with no error:
After this patch:
Root cause
_group_by_keysis a streaming group-by. It only emits a sample when the prefix of the current entry differs from the prefix of the previous one. When the tar is interleaved, every prefix change emits a partial sample, and re-encountering a prefix later starts a new partial sample for it instead of merging. The existing "duplicate file name in tar file"ValueErroronly fires if you happen to re-encounter the sameprefix + suffixpair, which is the wrong signal for the wrong failure mode.The WebDataset spec requires adjacency, but Ray cannot enforce that at write time for tars produced by other tools, so the read path needs to fail loudly when adjacency is violated.
What this PR does
Tracks emitted prefixes in
_group_by_keys. When a new entry's prefix has already been emitted as a sample, raises a clearValueErrornaming the offending tar URL, the filename, and the prefix, and telling the user to re-sort the tar.This is the minimal correctness fix. It does not add in-reader sorting or buffering, because WebDataset shards are commonly multi-GB and buffering would break Ray Data's streaming model. Happy to do an opt-in
presort=Truefollow-up if maintainers want it.Tests
Added two tests to
python/ray/data/tests/datasource/test_webdataset.py:test_webdataset_unordered_keys_raises: builds an interleaved tar, asserts the newValueErrorfires with the expected message.test_webdataset_ordered_keys_still_works: spec-compliant tar with the same data still produces 2 correctly-collated samples.All existing webdataset tests continue to pass.
Related issues
Fixes #44068
Additional information
Saw @Hutaph expressed interest in this issue on 2026-05-13. Proceeded with a PR since no implementation followed and the bug is P1. Happy to coordinate if they are still working on it.