[data] Fix silent data corruption in read_webdataset for unordered tars (#44068) by abhid-007 · Pull Request #63374 · ray-project/ray

abhid-007 · 2026-05-15T18:47:54Z

Description

read_webdataset silently corrupts data when the input tar file is not ordered by WebDataset key prefix. No warning, no error, just more "samples" than you wrote, each one missing half its fields.

I confirmed this by running the current _group_by_keys logic from python/ray/data/_internal/datasource/webdataset_datasource.py on a 4-entry tar with two prefixes interleaved.

Input tar (4 files, 2 logical samples):

0.a -> b"zero-a"
1.a -> b"one-a"
0.b -> b"zero-b"
1.b -> b"one-b"
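
For anyone who wants to reproduce the input, here is a minimal sketch that builds the same 4-entry tar in memory with the standard-library tarfile module. The helper name build_tar is illustrative only and is not part of Ray or of this patch.

```python
import io
import tarfile


def build_tar(entries):
    """Return the bytes of an uncompressed tar holding (name, payload) pairs,
    written in exactly the order given."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in entries:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)  # size must be set before addfile()
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


# Two logical samples ("0" and "1") whose files are interleaved.
INTERLEAVED = [
    ("0.a", b"zero-a"),
    ("1.a", b"one-a"),
    ("0.b", b"zero-b"),
    ("1.b", b"one-b"),
]

tar_bytes = build_tar(INTERLEAVED)

# Read the archive back to confirm the on-disk order is the interleaved one.
with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
    names = tar.getnames()
```

Writing tar_bytes to a file such as interleaved.tar gives an input that should exercise the behavior described here.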

Current behavior on master (4 broken samples emitted, no error):

{'__key__': '0', 'a': b'zero-a', '__url__': 'interleaved.tar'}
{'__key__': '1', 'a': b'one-a', '__url__': 'interleaved.tar'}
{'__key__': '0', 'b': b'zero-b', '__url__': 'interleaved.tar'}
{'__key__': '1', 'b': b'one-b', '__url__': 'interleaved.tar'}

After this patch:

ValueError: Tar file interleaved.tar is not ordered by WebDataset key:
entry '0.b' re-uses prefix '0' after that prefix was already emitted as
a sample. The WebDataset format requires that all files sharing a key
prefix be stored contiguously in the tar archive. Re-create the tar
with entries sorted by name before reading.
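
The error message asks for the tar to be re-created with sorted entries. A rough sketch of one way to do that offline with the standard-library tarfile module follows. It is not part of this patch. The names webdataset_sort_key and resort_tar are hypothetical, and the key split (first dot of the basename) is my assumption about how the WebDataset key is derived.

```python
import io
import tarfile


def webdataset_sort_key(name):
    """Assumed key split: everything before the first dot of the basename."""
    head, sep, base = name.rpartition("/")
    prefix, _, suffix = base.partition(".")
    return (head + sep + prefix, suffix)


def resort_tar(src, dst):
    """Copy the regular files of tar `src` into tar `dst`, grouped by key.

    Both arguments are binary file objects and `src` must be seekable.
    Only the headers are held in memory; payloads are copied one at a time.
    Directories and other non-file members are dropped.
    """
    with tarfile.open(fileobj=src, mode="r") as tin:
        members = [m for m in tin.getmembers() if m.isfile()]
        members.sort(key=lambda m: webdataset_sort_key(m.name))
        with tarfile.open(fileobj=dst, mode="w") as tout:
            for member in members:
                tout.addfile(member, tin.extractfile(member))


# Demo: build the interleaved archive in memory, then re-sort it.
src = io.BytesIO()
with tarfile.open(fileobj=src, mode="w") as tar:
    for name, payload in [("0.a", b"zero-a"), ("1.a", b"one-a"),
                          ("0.b", b"zero-b"), ("1.b", b"one-b")]:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
src.seek(0)

dst = io.BytesIO()
resort_tar(src, dst)
dst.seek(0)

with tarfile.open(fileobj=dst, mode="r") as tar:
    sorted_names = tar.getnames()
    first_payload = tar.extractfile("0.a").read()
```

For real shards the same function can be called with file objects from open(path, "rb") and open(path, "wb").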

Root cause

_group_by_keys is a streaming group-by. It only emits a sample when the prefix of the current entry differs from the prefix of the previous one. When the tar is interleaved, every prefix change emits a partial sample, and re-encountering a prefix later starts a new partial sample for it instead of merging. The existing "duplicate file name in tar file" ValueError only fires if you happen to re-encounter the same prefix + suffix pair, which is the wrong signal for the wrong failure mode.
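
To make the failure mode concrete, here is a simplified sketch of a streaming group-by with the same shape. It is not the actual Ray implementation: split_key and group_by_keys_naive are hypothetical names, and the input is a plain list of (name, payload) pairs rather than a tar stream.

```python
def split_key(name):
    """Split 'dir/0.a' into ('dir/0', 'a') at the first dot of the basename."""
    head, sep, base = name.rpartition("/")
    prefix, _, suffix = base.partition(".")
    return head + sep + prefix, suffix


def group_by_keys_naive(entries, url):
    """Emit one sample per run of consecutive entries sharing a prefix."""
    current = None
    for name, payload in entries:
        prefix, suffix = split_key(name)
        if current is None or prefix != current["__key__"]:
            # Prefix changed: flush whatever was collected so far.
            if current is not None:
                yield current
            current = {"__key__": prefix, "__url__": url}
        if suffix in current:
            # Only catches a repeated prefix + suffix within one run.
            raise ValueError(f"{name}: duplicate file name in tar file")
        current[suffix] = payload
    if current is not None:
        yield current


INTERLEAVED = [
    ("0.a", b"zero-a"),
    ("1.a", b"one-a"),
    ("0.b", b"zero-b"),
    ("1.b", b"one-b"),
]

samples = list(group_by_keys_naive(INTERLEAVED, "interleaved.tar"))
# Every entry changes the prefix, so each one becomes its own partial sample.
```

Because the duplicate check only looks inside the current run, it never sees that prefix 0 was already flushed.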

The WebDataset spec requires adjacency, but Ray cannot enforce that at write time for tars produced by other tools, so the read path needs to fail loudly when adjacency is violated.

What this PR does

Tracks emitted prefixes in _group_by_keys. When a new entry's prefix has already been emitted as a sample, raises a clear ValueError naming the offending tar URL, the filename, and the prefix, and telling the user to re-sort the tar.
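
A simplified sketch of the idea, under the same assumptions as a toy list-based reader (this is not the patched Ray code, and split_key and group_by_keys_checked are hypothetical names). The error text is the one quoted in the description.

```python
def split_key(name):
    """Split 'dir/0.a' into ('dir/0', 'a') at the first dot of the basename."""
    head, sep, base = name.rpartition("/")
    prefix, _, suffix = base.partition(".")
    return head + sep + prefix, suffix


def group_by_keys_checked(entries, url):
    """Streaming group-by that rejects a prefix seen in an earlier run."""
    seen = set()  # prefixes that already started a sample
    current = None
    for name, payload in entries:
        prefix, suffix = split_key(name)
        if current is None or prefix != current["__key__"]:
            if prefix in seen:
                raise ValueError(
                    f"Tar file {url} is not ordered by WebDataset key: "
                    f"entry '{name}' re-uses prefix '{prefix}' after that "
                    "prefix was already emitted as a sample."
                )
            seen.add(prefix)
            if current is not None:
                yield current
            current = {"__key__": prefix, "__url__": url}
        current[suffix] = payload
    if current is not None:
        yield current


ORDERED = [("0.a", b"zero-a"), ("0.b", b"zero-b"),
           ("1.a", b"one-a"), ("1.b", b"one-b")]
INTERLEAVED = [ORDERED[0], ORDERED[2], ORDERED[1], ORDERED[3]]

good = list(group_by_keys_checked(ORDERED, "ordered.tar"))

try:
    list(group_by_keys_checked(INTERLEAVED, "interleaved.tar"))
    error = None
except ValueError as exc:
    error = str(exc)
```

The set holds only prefix strings, so in this sketch the extra memory grows with the number of samples in a shard, not with their payload size.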

This is the minimal correctness fix. It does not add in-reader sorting or buffering, because WebDataset shards are commonly multi-GB and buffering would break Ray Data's streaming model. Happy to do an opt-in presort=True follow-up if maintainers want it.

Tests

Added two tests to python/ray/data/tests/datasource/test_webdataset.py:

test_webdataset_unordered_keys_raises: builds an interleaved tar, asserts the new ValueError fires with the expected message.
test_webdataset_ordered_keys_still_works: spec-compliant tar with the same data still produces 2 correctly-collated samples.

All existing webdataset tests continue to pass.

Related issues

Fixes #44068

Additional information

Saw @Hutaph expressed interest in this issue on 2026-05-13. Proceeded with a PR since no implementation followed and the bug is P1. Happy to coordinate if they are still working on it.

When a tar file's entries are not contiguous by WebDataset key prefix, _group_by_keys silently emits broken samples (one partial sample per prefix change) instead of failing. Track emitted prefixes and raise a clear ValueError when a prefix is re-encountered, naming the offending tar URL, filename, and prefix.

Adds a parametrized test covering both the interleaved (raises) and spec-compliant (succeeds) cases.

Fixes ray-project#44068

Signed-off-by: Abhisek Das <abhid.cs@outlook.com>

gemini-code-assist

Code Review

This pull request adds validation to ensure that WebDataset tar files are ordered by key prefix, raising a ValueError if entries are interleaved. This change prevents the silent emission of partial samples. The review feedback suggests using the match parameter in pytest.raises for more idiomatic exception testing in the new test suite.

gemini-code-assist · 2026-05-15T18:49:50Z

+        with pytest.raises(Exception) as exc_info:
+            ds.take_all()
+        msg = str(exc_info.value)
+        assert "not ordered by WebDataset key" in msg


Using pytest.raises with the match parameter is more concise and idiomatic than manually checking the exception message. This simplifies the test code by removing the need for an explicit exc_info variable and a separate assertion. While Exception is used here to account for potential wrapping by Ray's execution engine (e.g., RayTaskError), using match ensures we are still validating the specific failure mode.

Suggested change

-        with pytest.raises(Exception) as exc_info:
-            ds.take_all()
-        msg = str(exc_info.value)
-        assert "not ordered by WebDataset key" in msg
+        with pytest.raises(Exception, match="not ordered by WebDataset key"):
+            ds.take_all()

Good catch, applied in 6cd7322. Matches the existing match= pattern in python/ray/data/tests/datasource/test_kafka.py and test_parquet.py.

Addresses Gemini review on PR ray-project#63374. Matches the convention used elsewhere in python/ray/data/tests/ (e.g. test_kafka.py, test_parquet.py). Signed-off-by: Abhisek Das <abhid.cs@outlook.com>

abhid-007 requested a review from a team as a code owner May 15, 2026 18:47

gemini-code-assist Bot reviewed May 15, 2026

[data] Use pytest.raises match= idiom in unordered-tar test

6cd7322


ray-gardener Bot added the data (Ray Data-related issues) and community-contribution (Contributed by the community) labels May 15, 2026

abhid-007 wants to merge 2 commits into ray-project:master from abhid-007:fix/webdataset-unordered-tar-44068
