[Data] Distributed reads for from_huggingface #42599
Conversation
Force-pushed 01cadf8 to 53c53c6
@@ -2372,7 +2372,7 @@ def from_spark(

 @PublicAPI
 def from_huggingface(
-    dataset: Union["datasets.Dataset", "datasets.IterableDataset"],
+    dataset: Union["datasets.Dataset", "datasets.IterableDataset"], parallelism=-1
can we add `parallelism` in the docstring? let's make sure we state that the user should not need to set `parallelism` in most cases since it is auto-configured, similar to the docstring for `parallelism` under `read_parquet()` and other read methods.
in addition, we should also clarify that this is only used when the distributed parquet read is possible (i.e. when it is a public dataset with no transformations); otherwise it's a single-node read.
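A hedged sketch of what the requested docstring might look like (the stub signature and wording are illustrative, not the PR's actual docstring):

```python
def from_huggingface(dataset, parallelism: int = -1):
    """Create a Ray Dataset from a Hugging Face dataset.

    Args:
        dataset: The Hugging Face dataset to convert.
        parallelism: The requested parallelism of the read. You should not
            need to set this in most cases; it is auto-configured by
            default (-1), similar to ``read_parquet()`` and other read
            methods. It is only used when the distributed parquet read is
            possible (i.e. a public dataset with no transformations
            applied); otherwise the read runs on a single node.
    """
```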
python/ray/data/read_api.py
Outdated
# If file urls are returned, the parquet files are available via API
import fsspec.implementations.http

http = fsspec.implementations.http.HTTPFileSystem()
return read_parquet(file_urls, parallelism=parallelism, filesystem=http)
can we create a new GH issue for adding support for reading from http via FileBasedDatasource, and leave a TODO comment linking to the issue here?
@@ -8,7 +8,9 @@
 def test_from_huggingface(ray_start_regular_shared):
-    data = datasets.load_dataset("tweet_eval", "emotion")
+    data = datasets.load_dataset(
+        "tweet_eval", "emotion", download_mode="force_redownload"
does the test work without `download_mode="force_redownload"` now? if so, can we drop it?
    "validation": ray.data.from_huggingface(data["validation"]),
    "test": ray.data.from_huggingface(data["test"]),
}
for num_par in [1, 4]:
can we parametrize these with pytest parameters? https://docs.pytest.org/en/7.1.x/example/parametrize.html
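The suggested parametrization could look roughly like this sketch (the test name and body are illustrative, not the PR's actual test):

```python
import pytest


# Each value in the list becomes its own test case, replacing the
# `for num_par in [1, 4]:` loop inside the test body.
@pytest.mark.parametrize("num_par", [1, 4])
def test_from_huggingface_parallelism(num_par):
    # In the real test, num_par would be passed as
    # ray.data.from_huggingface(data["train"], parallelism=num_par).
    assert num_par in (1, 4)
```

Parametrizing this way also makes failures easier to attribute, since each parallelism value is reported as a separate test.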
assert ray_dataset._plan._logical_plan.dag.name == "FromArrow"
assert ray.get(ray_dataset.to_arrow_refs())[0].equals(data["train"].data.table)
_check_usage_record(["FromArrow"])
# assert "FromArrow" in ds.stats()
-# assert "FromArrow" in ds.stats()
+assert "ReadParquet" in ds.stats()
assert ray.get(ray_dataset.to_arrow_refs())[0].equals(data["train"].data.table)
_check_usage_record(["FromArrow"])
# assert "FromArrow" in ds.stats()
assert ds._plan._logical_plan.dag.name == "ReadParquet"
can we add a comment briefly explaining why we check for `ReadParquet` in this test for Hugging Face?
or even better would be to have two test paths: one which uses the new distributed parquet path, and one which uses the backup direct read from the HF dataset. I know that might be a bit tricky, though, since you'll likely need to find a private dataset (that we can access) or a public dataset that doesn't have parquet files converted for some reason. So no worries if you cannot find a way to easily test both paths.
Ended up adding a second test case which runs the same checks on a modified dataset (just did a train/test split and used similar assertions as for the unmodified dataset), so we should be testing both paths now!
Signed-off-by: Matthew Owen <mowen@anyscale.com>
Force-pushed 253e085 to edd4c16
Force-pushed 65023d6 to 572202b
expected_table = data[ds_key].data.table.sort_by("text")
output_full_table = pyarrow.concat_tables(
    [ray.get(tbl) for tbl in ds.to_arrow_refs()]
).sort_by("text")
expected_table.equals(output_full_table)
nit: let's consolidate the logic for comparing tables (column sort + concat + equals), since it is used several times across the code. we can put it in test_huggingface.py or ray/data/tests/util.py.
there is also an error with pyarrow 6 not supporting the `sort_by()` method:
`AttributeError: 'pyarrow.lib.Table' object has no attribute 'sort_by'`
so let's find an alternate implementation that also works with pyarrow 6.
Co-authored-by: Scott Lee <scottjlee@users.noreply.github.com> Signed-off-by: Matthew Owen <omatthew98@berkeley.edu>
Force-pushed c4e1980 to 76f69c4
LG
Why are these changes needed?
Currently, reads performed in `from_huggingface` must be performed on a single node. This change allows reads to be parallelized when the Hugging Face dataset matches a public dataset on the Hugging Face Hub. These reads are performed using the "list parquet files" dataset-server API method provided by Hugging Face.

Testing Data

To test the performance improvements, some tests were conducted before and after the code change. Each test creates a Hugging Face dataset, creates a Ray dataset from the Hugging Face dataset, then counts the number of rows using `iter_batches` to consume the dataset. The dataset being used is the `ca` subset of the `mc4` dataset (14.5M rows). The code executed was roughly as follows (missing timing of the three stages, and parallelism for the post-change testing).
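A rough sketch of the benchmark loop described above (not the actual script; the function name and row-counting details are assumptions based on the description, and timing instrumentation is omitted):

```python
def benchmark_from_huggingface(parallelism: int = -1) -> int:
    """Sketch: load the HF dataset, build a Ray Dataset, and consume it."""
    # Imports are local so the sketch can be defined without ray/datasets
    # installed; the real benchmark would also time each of the three stages.
    import datasets
    import ray

    # ca subset of mc4, ~14.5M rows
    hf_ds = datasets.load_dataset("mc4", "ca", split="train")
    ray_ds = ray.data.from_huggingface(hf_ds, parallelism=parallelism)
    # Count rows batch by batch (assuming a table-like batch format).
    return sum(len(batch) for batch in ray_ds.iter_batches())
```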
Pre Testing
Before the change there was only auto-parallelism selection, so 3 trials were conducted with this setting.
Post Testing

After the change, parallelism can be set directly through the `from_huggingface` method, so various parallelism values were tested (3 trials each), as well as auto-parallelism. Increasing parallelism results in a speedup of `iter_batches`, although with only 90 parquet files hosted, this speedup quickly saturates.

Read Parquet Testing
In addition to testing `from_huggingface`, `read_parquet` was directly tested with the list of parquet files hosted by Hugging Face. The `read_parquet` results match the `from_huggingface` results quite closely, which shows there is not much additional overhead in the `from_huggingface` method.

Related issue number
This PR addresses #37591, although it does not add parallelism for all calls to `from_huggingface`: reads from private datasets, local datasets, or datasets that have been transformed in any way from the publicly hosted files will not be parallelized by this change.

Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.