-
Notifications
You must be signed in to change notification settings - Fork 151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DataPipe] shard expander #405
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please support other types of expansions:
imagenet-{000000..001199}.tar
-> as designed
imagenet-{0..1199}.tar
-> expands to imagenet-0.tar, imagenet-1.tar, ..., imagenet-10.tar, imagenet-11.tar, ..., imagenet-1199.tar
imagenet-{182..1199}.tar
-> expands to imagenet-182.tar, imagenet-183.tar, ..., imagenet-210.tar, imagenet-211.tar, ..., imagenet-1199.tar
imagenet-{0182..1199}.tar
-> works
imagenet-{182..01199}.tar
-> throws error as it is unclear when to start adding leading 0
imagenet-{082..1199}.tar
-> throws error
test/test_iterdatapipe.py
Outdated
|
||
# Functional Test: ensure expansion generates the right number of shards | ||
stage1 = IterableWrapper(["ds-{000000..000009}.tar"]) | ||
print(list(iter(stage1))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please drop print
test/test_iterdatapipe.py
Outdated
print(list(iter(stage1))) | ||
stage2 = ShardExpander(stage1) | ||
output = list(iter(stage2)) | ||
assert len(output) == 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use self.assertEqual instead
test/test_iterdatapipe.py
Outdated
print(list(iter(stage1))) | ||
stage2 = ShardExpander(stage1) | ||
output = list(iter(stage2)) | ||
assert len(output) == 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test will pass even if ShardExpander will return garbage for each line.
@tmbdev Do you still plan to work on this (and related) PRs? |
Yes, I'm planning on working on the handful of PRs we have been discussing and addressing the issues raised. |
I've update the PR and I believe I have addressed all the comments. |
@VitalyFedyunin Mind taking a look at this PR? |
We'll have a look tomorrow or Wednesday. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, it looks good to me! Just small changes (and rebase); after that I think we can merge.
Thank you so much for your contribution!
cc: @VitalyFedyunin
LGTM for me with @NivekT suggestions applied |
Co-authored-by: Kevin Tse <NivekT@users.noreply.github.com>
@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR adds a ShardExpander filter, a filter that will take shard specs of the form "prefix-{000..999}.tar" and expand them into the 1000 corresponding file names, like the shell brace expansion. This only implements numerical range expansion (rather than full shell-style brace expansion).
Specifying collections of files in terms of numerical shard specs (instead of using classes like FileLister) has a number of advantages: (1) it does not require the ability to list files on the target (not possible in general for, for example, HTTP), (2) it documents and makes specifying a collection of files reproducible and independent of the state of the storage system, (3) it makes it easy to choose different dataset sizes by specifying different numerical subranges.
ShardExpander is frequently used with WebDataset to specify collections of training shards.