
[Data] Improve stability of Parquet metadata prefetch task #42044

Merged
merged 3 commits into ray-project:master from c21:fix-prefetch on Dec 21, 2023

Conversation


@c21 c21 commented Dec 20, 2023

Why are these changes needed?

This PR fixes the Parquet metadata prefetching task when reading a large number of Parquet files from S3 (>50k). Before this PR, the Parquet metadata prefetch task runs on the head node (with the `DEFAULT` scheduling strategy) and does not retry on transient S3 exceptions. As a result, it can fail very quickly: it launches too many requests from the same node and gets throttled by S3.

This PR does three things (see the illustrative sketch after this list):

  • Change the scheduling strategy to `SPREAD`, the same as the read tasks, to spread metadata prefetch tasks across the cluster. This avoids hitting S3 with too many requests from the same node.
  • Auto-retry on `OSError`, which S3 throws for transient errors such as `Access Denied` and `Read Timeout`.
  • Extract the `num_cpus` default value into a variable, so we can tune it to control the concurrency of the metadata prefetch tasks for a particular workload. Sometimes `num_cpus=0.5` does not work well.
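
To make the combined change concrete, here is an illustrative sketch (not the actual PR diff) of launching metadata prefetch tasks with these three settings through Ray task options; the helper `_fetch_metadata`, its body, and the batching are hypothetical.

import ray

# Values mirrored from this PR; everything around them is a hypothetical sketch.
NUM_CPUS_FOR_META_FETCH_TASK = 0.5
RETRY_EXCEPTIONS_FOR_META_FETCH_TASK = [OSError]


@ray.remote
def _fetch_metadata(fragments):
    # Hypothetical worker: fetch the Parquet footer metadata for each fragment.
    return [fragment.metadata for fragment in fragments]


def prefetch_file_metadata(fragment_batches):
    # Launch one lightweight task per batch, spread across the cluster,
    # and let Ray retry transient S3 failures surfaced as OSError.
    refs = [
        _fetch_metadata.options(
            num_cpus=NUM_CPUS_FOR_META_FETCH_TASK,
            scheduling_strategy="SPREAD",
            retry_exceptions=RETRY_EXCEPTIONS_FOR_META_FETCH_TASK,
        ).remote(batch)
        for batch in fragment_batches
    ]
    return ray.get(refs)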

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Cheng Su <scnju13@gmail.com>

c21 commented Dec 20, 2023

It's hard to write a unit test for this, given it depends on a failure during metadata prefetching.

I can run the Parquet metadata prefetching release test, to make sure it still works - https://github.com/ray-project/ray/blob/master/release/release_tests.yaml#L5485 .

Signed-off-by: Cheng Su <scnju13@gmail.com>
# The application-level exceptions to retry for metadata prefetching task.
# Default to retry on `OSError` because AWS S3 would throw this transient
# error when load is too high.
RETRY_EXCEPTIONS_FOR_META_FETCH_TASK = [OSError]
Contributor

should we add this in general for all FileBasedDatasource reads?

OPEN_FILE_RETRY_ON_ERRORS = ["AWS Error SLOW_DOWN"]

Contributor Author

Only Parquet has this metadata prefetching, so we can add it here first. For all file-based datasources, it looks like the exponential backoff retry is working.
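
For reference, a minimal sketch of the exponential-backoff retry pattern referred to above, under the assumption that retryable errors are matched by substring; the helper name, the attempt count, and the matching logic are illustrative, not Ray Data's actual implementation.

import random
import time

OPEN_FILE_RETRY_ON_ERRORS = ["AWS Error SLOW_DOWN"]
OPEN_FILE_MAX_ATTEMPTS = 5  # assumed constant for this sketch


def call_with_retry(fn):
    # Retry `fn` on matching transient errors with exponential backoff and jitter.
    for attempt in range(OPEN_FILE_MAX_ATTEMPTS):
        try:
            return fn()
        except OSError as e:
            retryable = any(msg in str(e) for msg in OPEN_FILE_RETRY_ON_ERRORS)
            if not retryable or attempt == OPEN_FILE_MAX_ATTEMPTS - 1:
                raise
            # Back off exponentially, with jitter, before the next attempt.
            time.sleep((2 ** attempt) * random.uniform(0.5, 1.5))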

@@ -47,6 +47,15 @@
 FRAGMENTS_PER_META_FETCH = 6
 PARALLELIZE_META_FETCH_THRESHOLD = 24
 
+# The `num_cpus` for each metadata prefetching task.
+# Default to 0.5 instead of 1 because it is cheaper than normal read task.
+NUM_CPUS_FOR_META_FETCH_TASK = 0.5
Contributor

So currently there is no configurable way (e.g., through DataContext) to modify this, right?

Contributor Author

Yes, I am hesitant to add it to DataContext (which is exposed to users). If we see this is a value needed by users, we can expose it later. For now, let's start with a variable, so at least we (internal developers) can change it.
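
As a usage note, because this is a module-level variable rather than a DataContext field, it can be tuned by patching the module attribute before reading; the module path below is an assumption based on the diff context, and the bucket path is a placeholder.

import ray
import ray.data.datasource.parquet_datasource as parquet_datasource

# Hypothetical override: give each metadata prefetch task a full CPU for a
# workload where num_cpus=0.5 does not work well.
parquet_datasource.NUM_CPUS_FOR_META_FETCH_TASK = 1.0

ds = ray.data.read_parquet("s3://example-bucket/dataset/")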

@@ -206,7 +206,10 @@ def test_parquet_read_meta_provider(ray_start_regular_shared, fs, data_path):
     pq.write_table(table, path2, filesystem=fs)
 
     class TestMetadataProvider(DefaultParquetMetadataProvider):
-        def prefetch_file_metadata(self, fragments):
+        def prefetch_file_metadata(self, fragments, **ray_remote_args):
+            assert ray_remote_args["num_cpus"] == 0.5
Contributor

Suggested change
assert ray_remote_args["num_cpus"] == 0.5
assert ray_remote_args["num_cpus"] == NUM_CPUS_FOR_META_FETCH_TASK

Contributor Author

updated.
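
For context, a hedged sketch of a custom metadata provider compatible with the updated prefetch_file_metadata signature exercised in this test, forwarding the Ray remote args it receives; the import path and the subclass are assumptions for illustration.

from ray.data.datasource import DefaultParquetMetadataProvider


class LoggingMetadataProvider(DefaultParquetMetadataProvider):
    # Hypothetical provider that inspects the forwarded remote args
    # (e.g. num_cpus, scheduling_strategy, retry_exceptions) before
    # delegating to the default prefetch logic.
    def prefetch_file_metadata(self, fragments, **ray_remote_args):
        print(f"Prefetching metadata with remote args: {ray_remote_args}")
        return super().prefetch_file_metadata(fragments, **ray_remote_args)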

Signed-off-by: Cheng Su <scnju13@gmail.com>

c21 commented Dec 21, 2023

Reran the Parquet metadata release test, no regression - https://buildkite.com/ray-project/release/builds/4714#018c89f0-5bde-4011-835d-90d244d2c3f0 .

@c21 c21 merged commit 6fdc9e3 into ray-project:master Dec 21, 2023
10 checks passed
@c21 c21 deleted the fix-prefetch branch December 21, 2023 01:41
vickytsang pushed a commit to ROCm/ray that referenced this pull request Jan 12, 2024