[data] Fix silent data corruption in read_webdataset for unordered tars (#44068) #63374

Open
abhid-007 wants to merge 2 commits into ray-project:master from abhid-007:fix/webdataset-unordered-tar-44068

@abhid-007

Description

read_webdataset silently corrupts data when the input tar file is not ordered by WebDataset key prefix. No warning, no error, just more "samples" than you wrote, each one missing half its fields.

I confirmed this by running the current _group_by_keys logic from python/ray/data/_internal/datasource/webdataset_datasource.py on a 4-entry tar with two prefixes interleaved.

Input tar (4 files, 2 logical samples):

0.a -> b"zero-a"
1.a -> b"one-a"
0.b -> b"zero-b"
1.b -> b"one-b"
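A minimal sketch of how such an interleaved tar could be built for a reproduction, using only the standard-library tarfile module. The helper name build_tar is hypothetical and is not part of Ray or of this PR; the entry names and payloads are the ones listed above.

```python
import io
import tarfile


def build_tar(entries):
    """Return the bytes of a tar holding (name, payload) pairs in the given order."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in entries:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)  # size must be set before addfile()
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


# Two logical samples ("0" and "1") whose files are interleaved.
interleaved_bytes = build_tar(
    [
        ("0.a", b"zero-a"),
        ("1.a", b"one-a"),
        ("0.b", b"zero-b"),
        ("1.b", b"one-b"),
    ]
)

# Read the archive back to confirm the on-disk order is the interleaved one.
with tarfile.open(fileobj=io.BytesIO(interleaved_bytes), mode="r") as tar:
    stored_names = tar.getnames()
```

Writing interleaved_bytes to a file named interleaved.tar should give an input equivalent to the one used in this description.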

Current behavior on main (4 broken samples emitted, no error):

{'__key__': '0', 'a': b'zero-a', '__url__': 'interleaved.tar'}
{'__key__': '1', 'a': b'one-a', '__url__': 'interleaved.tar'}
{'__key__': '0', 'b': b'zero-b', '__url__': 'interleaved.tar'}
{'__key__': '1', 'b': b'one-b', '__url__': 'interleaved.tar'}

After this patch:

ValueError: Tar file interleaved.tar is not ordered by WebDataset key:
entry '0.b' re-uses prefix '0' after that prefix was already emitted as
a sample. The WebDataset format requires that all files sharing a key
prefix be stored contiguously in the tar archive. Re-create the tar
with entries sorted by name before reading.
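For anyone hitting this error, re-creating the tar with entries sorted by name could look roughly like the sketch below. It uses only the standard-library tarfile module; resort_tar and build_tar are hypothetical helpers, not Ray APIs. The sketch works on in-memory bytes for brevity, so a multi-GB shard would need the same loop over file paths instead. It also drops non-file members such as directory entries.

```python
import io
import tarfile


def build_tar(entries):
    """Hypothetical helper: tar bytes holding (name, payload) pairs in order."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in entries:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def resort_tar(src_bytes):
    """Rewrite a tar so that its regular-file members are sorted by name."""
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(src_bytes), mode="r") as src:
        with tarfile.open(fileobj=out, mode="w") as dst:
            members = sorted(
                (m for m in src.getmembers() if m.isfile()),
                key=lambda m: m.name,
            )
            for member in members:
                # extractfile() streams the payload of one member at a time.
                dst.addfile(member, src.extractfile(member))
    return out.getvalue()


interleaved_bytes = build_tar(
    [("0.a", b"zero-a"), ("1.a", b"one-a"), ("0.b", b"zero-b"), ("1.b", b"one-b")]
)
sorted_bytes = resort_tar(interleaved_bytes)

with tarfile.open(fileobj=io.BytesIO(sorted_bytes), mode="r") as tar:
    sorted_names = tar.getnames()
    zero_b = tar.extractfile("0.b").read()
```

Sorting by full name places every entry that shares a "prefix." start next to its siblings, which is the adjacency the error message asks for.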

Root cause

_group_by_keys is a streaming group-by. It only emits a sample when the prefix of the current entry differs from the prefix of the previous one. When the tar is interleaved, every prefix change emits a partial sample, and re-encountering a prefix later starts a new partial sample for it instead of merging. The existing "duplicate file name in tar file" ValueError only fires if you happen to re-encounter the same prefix + suffix pair, which is the wrong signal for the wrong failure mode.
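The failure mode can be illustrated with a simplified stand-in for a streaming group-by. This is not the code in webdataset_datasource.py: naive_group_by_keys is a hypothetical function, it splits names at the first dot (assuming no dots in directory components), and it omits the existing duplicate-suffix check.

```python
def naive_group_by_keys(entries, url):
    """Emit one sample per run of consecutive entries sharing a key prefix."""
    current = None
    for name, payload in entries:
        prefix, _, suffix = name.partition(".")
        if current is None or current["__key__"] != prefix:
            # Prefix changed: flush whatever was collected so far.
            if current is not None:
                yield current
            current = {"__key__": prefix, "__url__": url}
        current[suffix] = payload
    if current is not None:
        yield current


interleaved = [
    ("0.a", b"zero-a"),
    ("1.a", b"one-a"),
    ("0.b", b"zero-b"),
    ("1.b", b"one-b"),
]

broken = list(naive_group_by_keys(interleaved, "interleaved.tar"))
collated = list(naive_group_by_keys(sorted(interleaved), "ordered.tar"))
```

With the interleaved input every entry triggers a prefix change, so broken holds four partial samples, while the sorted input yields two complete ones in collated.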

The WebDataset spec requires adjacency, but Ray cannot enforce that at write time for tars produced by other tools, so the read path needs to fail loudly when adjacency is violated.

What this PR does

Tracks emitted prefixes in _group_by_keys. When a new entry's prefix has already been emitted as a sample, raises a clear ValueError naming the offending tar URL, the filename, and the prefix, and telling the user to re-sort the tar.

This is the minimal correctness fix. It does not add in-reader sorting or buffering, because WebDataset shards are commonly multi-GB and buffering would break Ray Data's streaming model. Happy to do an opt-in presort=True follow-up if maintainers want it.
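The prefix tracking described above can be sketched as follows. This is an illustration under the same simplifying assumptions as a toy streaming group-by (names split at the first dot, no duplicate-suffix check), not the actual diff: checked_group_by_keys is a hypothetical name and the error text is abbreviated from the message quoted in the description.

```python
def checked_group_by_keys(entries, url):
    """Streaming group-by that raises if an already-emitted prefix reappears."""
    emitted = set()  # prefixes only, never payloads
    current = None
    for name, payload in entries:
        prefix, _, suffix = name.partition(".")
        if current is None or current["__key__"] != prefix:
            if prefix in emitted:
                raise ValueError(
                    f"Tar file {url} is not ordered by WebDataset key: "
                    f"entry {name!r} re-uses prefix {prefix!r} after that "
                    "prefix was already emitted as a sample."
                )
            if current is not None:
                emitted.add(current["__key__"])
                yield current
            current = {"__key__": prefix, "__url__": url}
        current[suffix] = payload
    if current is not None:
        yield current


interleaved = [
    ("0.a", b"zero-a"),
    ("1.a", b"one-a"),
    ("0.b", b"zero-b"),
    ("1.b", b"one-b"),
]

error_message = None
try:
    list(checked_group_by_keys(interleaved, "interleaved.tar"))
except ValueError as exc:
    error_message = str(exc)

ordered_samples = list(checked_group_by_keys(sorted(interleaved), "ordered.tar"))
```

In this sketch the extra state is one set of key strings per tar, so samples are still emitted as soon as their prefix ends and no payloads are buffered.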

Tests

Added two tests to python/ray/data/tests/datasource/test_webdataset.py:

  • test_webdataset_unordered_keys_raises: builds an interleaved tar, asserts the new ValueError fires with the expected message.
  • test_webdataset_ordered_keys_still_works: spec-compliant tar with the same data still produces 2 correctly-collated samples.

All existing webdataset tests continue to pass.

Related issues

Fixes #44068

Additional information

Saw @Hutaph expressed interest in this issue on 2026-05-13. Proceeded with a PR since no implementation followed and the bug is P1. Happy to coordinate if they are still working on it.

When a tar file's entries are not contiguous by WebDataset key prefix,
_group_by_keys silently emits broken samples (one partial sample per
prefix change) instead of failing. Track emitted prefixes and raise a
clear ValueError when a prefix is re-encountered, naming the offending
tar URL, filename, and prefix.

Adds a parametrized test covering both the interleaved (raises) and
spec-compliant (succeeds) cases.

Fixes ray-project#44068

Signed-off-by: Abhisek Das <abhid.cs@outlook.com>
@abhid-007 abhid-007 requested a review from a team as a code owner May 15, 2026 18:47
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request adds validation to ensure that WebDataset tar files are ordered by key prefix, raising a ValueError if entries are interleaved. This change prevents the silent emission of partial samples. The review feedback suggests using the match parameter in pytest.raises for more idiomatic exception testing in the new test suite.

Comment on lines +320 to +323
with pytest.raises(Exception) as exc_info:
    ds.take_all()
msg = str(exc_info.value)
assert "not ordered by WebDataset key" in msg
Contributor

medium

Using pytest.raises with the match parameter is more concise and idiomatic than manually checking the exception message. This simplifies the test code by removing the need for an explicit exc_info variable and a separate assertion. While Exception is used here to account for potential wrapping by Ray's execution engine (e.g., RayTaskError), using match ensures we are still validating the specific failure mode.

Suggested change
- with pytest.raises(Exception) as exc_info:
-     ds.take_all()
- msg = str(exc_info.value)
- assert "not ordered by WebDataset key" in msg
+ with pytest.raises(Exception, match="not ordered by WebDataset key"):
+     ds.take_all()

Author

Good catch, applied in 6cd7322. Matches the existing match= pattern in python/ray/data/tests/datasource/test_kafka.py and test_parquet.py.

Addresses Gemini review on PR ray-project#63374. Matches the convention used
elsewhere in python/ray/data/tests/ (e.g. test_kafka.py, test_parquet.py).

Signed-off-by: Abhisek Das <abhid.cs@outlook.com>
@ray-gardener ray-gardener Bot added the data (Ray Data-related issues) and community-contribution (Contributed by the community) labels on May 15, 2026
Development

Successfully merging this pull request may close these issues.

[Data] Ray Data read_webdataset assumes files are ordered by key when reading a tarfile