Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Data] Distributed reads for from_huggingface #42599
[Data] Distributed reads for from_huggingface #42599
Changes from all commits
1c5f7d0
ab001e5
dbad83d
027b88f
6220aff
dddf9d4
edd4c16
572202b
a5cb130
8d47704
1e72361
aed5369
06a00fd
76f69c4
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add
parallelism
in the docstring? let's make sure we state that the user should not need to setparallelism
in most cases since it is auto-configured--similar to the docstring forparallelism
underread_parquet()
and other read methods.in addition, we should also clarify that this is only used when the distributed parquet read is possible (i.e. when it is a public dataset with no transformations), otherwise it's a single node read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add a comment briefly explaining why we check for
ReadParquet
in this test for Hugging Face?or even better would be to have two test paths, one which uses the new distributed parquet path, and one which uses the backup direct read from HF dataset. I know that might be a bit tricky though, since you'll likely need to find a private dataset (that we can access) or a public dataset that doesn't have parquet files converted for some reason. So no worries if you cannot find a way to easily test both paths
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ended up adding a second test case which tests the same on a modified dataset (just did a train test split and had similar tests as the unmodified), so we should be testing both paths now!