Skip to content

Conversation

@lgeiger
Copy link
Collaborator

@lgeiger lgeiger commented Jun 16, 2025

Currently NoShuffleBeamWriter would compute the wrong number of shards for cases where splits have overlapping names.

E.g. given a dataset with two splits foo and foo_bar the generated tfrecord files would look something like this:

builder-foo.tfrecord-00000-of-00003
builder-foo.tfrecord-00001-of-00003
builder-foo.tfrecord-00002-of-00003
builder-foo_bar.tfrecord-00000-of-00002
builder-foo_bar.tfrecord-00001-of-00002

The current regex for the foo split would be builder-foo* which also matches the files of the foo_bar split. This results in the number of shards for foo to also include shards from the other split. This fixes it by changing the regex to builder-foo.*.

@lgeiger
Copy link
Collaborator Author

lgeiger commented Jun 17, 2025

@tomvdw Do you mind having a look at this?

@fineguy fineguy self-requested a review June 25, 2025 08:39
@fineguy
Copy link
Collaborator

fineguy commented Jun 25, 2025

Hi @lgeiger, thanks for the PR!

Could you please add tests? Then I'll help you merge it.

@fineguy fineguy self-assigned this Jun 25, 2025
@lgeiger
Copy link
Collaborator Author

lgeiger commented Jun 25, 2025

@fineguy I extended the existing test in 6f96eaa which fails on main but is fixed with this PT. Best to reviewed with "hide whitespace" diff.

@fineguy fineguy added the copybara-import Internal label for PR management label Jun 25, 2025
@lgeiger lgeiger force-pushed the fix-no-shuffle-beam-writer branch from 1a18417 to 5de1d20 Compare June 25, 2025 13:59
@copybara-service copybara-service bot merged commit bdf87d0 into tensorflow:master Jul 1, 2025
18 of 19 checks passed
@lgeiger lgeiger deleted the fix-no-shuffle-beam-writer branch July 1, 2025 11:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

copybara-import Internal label for PR management

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants