Skip to content

[data] Detect non-contiguous keys in read_webdataset#63407

Open
lonexreb wants to merge 2 commits into
ray-project:masterfrom
lonexreb:fix/44068-webdataset-key-ordering
Open

[data] Detect non-contiguous keys in read_webdataset#63407
lonexreb wants to merge 2 commits into
ray-project:masterfrom
lonexreb:fix/44068-webdataset-key-ordering

Conversation

@lonexreb
Copy link
Copy Markdown
Contributor

Description

ray.data.read_webdataset silently produces fragmented samples when a tar archive's entries are not sorted by sample key. Per the WebDataset specification, entries sharing a sample key (the base name before the first .) must be contiguous in the tar — Ray's _group_by_keys reader relies on that invariant: it flushes the in-progress sample whenever the key changes.

If a key reappears in a later run (e.g. 001.jpg, 002.jpg, 001.txt), the current code:

  1. Flushes {001: jpg} when it sees 002.jpg
  2. Flushes {002: jpg} when it sees 001.txt
  3. Emits {001: txt} at EOF

The user sees three partial samples instead of two complete ones, with no warning and no error. There's a duplicate-suffix check, but it only fires within a single contiguous run, so it doesn't catch this case.

Fix

In _group_by_keys, track the set of sample keys that have already been flushed. When the parser sees a known key start a fresh run, raise a ValueError that:

  • Names the offending key and the source tar's __url__.
  • Cites the WebDataset specification requirement.
  • Gives concrete fix instructions: tar --sort=name ... or find ... | sort | tar -T - ....

The existing same-run duplicate-suffix check is unchanged. No detection logic relies on buffering or sorting the full tar, so streaming behavior is preserved.

Related issues

Closes #44068

Additional information

Test plan

Three tests in test_webdataset.py:

  • test_webdataset_non_contiguous_keys_raises — feeds _group_by_keys an iterator where key 001 reappears non-contiguously; asserts ValueError with "not contiguous".
  • test_webdataset_contiguous_keys_no_false_positive — feeds a well-formed sequence; asserts the parser still emits the two expected sample keys, so the new detection logic does not regress the happy path.
  • test_webdataset_duplicate_suffix_within_run_still_raises — locked-in via the existing focused-test runner; the pre-existing duplicate-suffix-within-a-run error path is preserved.

All three pass locally against a clean Ray install:

test_non_contiguous_keys_raises PASSED
test_contiguous_keys_no_false_positive PASSED
test_duplicate_suffix_within_run_still_raises PASSED

ruff check is clean on both modified files (the only outstanding ruff format diff in the repo is on pre-existing code I didn't touch).

`_group_by_keys` flushes the in-progress sample whenever the tar entry's
sample key changes. That works only if entries sharing a sample key are
adjacent in the tar — the invariant the WebDataset spec requires. If a
key reappears after its run has ended (e.g. `001.jpg, 002.jpg, 001.txt`),
the function silently produces fragmented partial samples instead of the
intended grouped sample, and the user has no signal that their data was
silently misread.

This change tracks the keys that have already been flushed; when the
parser sees a known key start a fresh run, it raises a `ValueError`
naming the offending key and the tar URL, with concrete fix instructions
(`tar --sort=name` / pre-sort with `find | sort | tar -T -`). The
existing same-run duplicate-suffix check is unchanged.

Closes ray-project#44068

Signed-off-by: lonexreb <reach2shubhankar@gmail.com>
@lonexreb lonexreb requested a review from a team as a code owner May 17, 2026 09:34
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the WebDataset datasource to enforce the requirement that entries sharing a sample key must be contiguous within the tar archive. It introduces a seen_keys set to track processed keys and raises a ValueError with actionable advice if a non-contiguous key is encountered, preventing silent data loss. Corresponding regression tests have been added to verify this behavior. The review feedback identifies a potential robustness issue where a consumer could modify the yielded dictionary, leading to a KeyError, and suggests capturing the key in a local variable to prevent this.

Comment on lines +166 to +170
if current_sample is not None:
if _valid_sample(current_sample):
current_sample.update(meta)
yield current_sample
seen_keys.add(current_sample["__key__"])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To ensure robustness, it is recommended to extract the key into a local variable before yielding the sample. Since current_sample is a dictionary yielded to the consumer (which may include user-provided decoders), the consumer could potentially modify or clear the dictionary. If the __key__ field is removed, the subsequent call to seen_keys.add(current_sample["__key__"]) would raise a KeyError when the generator resumes.

Suggested change
if current_sample is not None:
if _valid_sample(current_sample):
current_sample.update(meta)
yield current_sample
seen_keys.add(current_sample["__key__"])
if current_sample is not None:
last_key = current_sample["__key__"]
if _valid_sample(current_sample):
current_sample.update(meta)
yield current_sample
seen_keys.add(last_key)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — applied in f3fc228. Captured last_key = current_sample["__key__"] before the yield exactly as suggested, and added test_webdataset_consumer_mutates_yielded_sample which clears the dict between the first and second next(gen) call to lock the contract in.

Address gemini-code-assist review on ray-project#63407: `_group_by_keys` yields
`current_sample` to the consumer (which may include user-supplied
decoders) and then on resume reads `current_sample["__key__"]` to
update `seen_keys`. If the consumer mutates or clears the dict between
the yield and the resume, the bookkeeping `KeyError`s. Capture the key
in a local variable before the yield. Add a focused regression test
that wipes the yielded dict and verifies the generator continues.

Signed-off-by: lonexreb <reach2shubhankar@gmail.com>
@ray-gardener ray-gardener Bot added data Ray Data-related issues community-contribution Contributed by the community labels May 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community data Ray Data-related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Data] Ray Data read_webdataset assumes files are ordered by key when reading a tarfile

1 participant