[Data] Estimate object store memory from in-flight tasks #42504

Merged
merged 13 commits into ray-project:master from the conservative-estimate branch on Jan 25, 2024

Conversation

bveeramani (Member) commented Jan 19, 2024

Why are these changes needed?

Ray Data's streaming executor launches as many as 50 tasks in a single scheduling step. If the executor doesn't account for the potential output of in-flight tasks, it launches too many tasks (since tasks don't immediately output data) and causes spilling.

This PR fixes the issue by including data buffered at the Ray Core level in the computation of topology resource usage.
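To make the effect concrete, here is a minimal sketch of the launch decision (the helper and the names materialized_bytes, num_tasks_in_flight, avg_bytes_per_task_output, and memory_budget are illustrative, not the actual executor API). Counting the estimated output of in-flight tasks keeps the usage estimate honest while tasks are still running:

    def under_memory_budget(
        materialized_bytes: int,           # outputs already in the object store
        num_tasks_in_flight: int,          # submitted tasks with no outputs yet
        avg_bytes_per_task_output: float,  # running estimate from finished tasks
        memory_budget: int,                # object store memory limit for the operator
    ) -> bool:
        # Without the pending term, usage looks artificially low while tasks are
        # still running, so the executor keeps launching work and later spills.
        pending_bytes = num_tasks_in_flight * avg_bytes_per_task_output
        return materialized_bytes + pending_bytes < memory_budget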

Related issue number

Fixes #42374

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
bveeramani marked this pull request as ready for review January 24, 2024 00:10
@@ -311,10 +311,13 @@ def base_resource_usage(self) -> ExecutionResources:
    def current_resource_usage(self) -> ExecutionResources:
        # Both pending and running actors count towards our current resource usage.
        num_active_workers = self._actor_pool.num_total_actors()
        object_store_memory = self.metrics.obj_store_mem_cur
        if self.metrics.obj_store_mem_pending is not None:
            object_store_memory += self.metrics.obj_store_mem_pending
Contributor

(not blocking this PR. just a note.)
I plan to move this method to OpRuntimeMetrics. Currently it's weird that we have 2 different places reporting resource metrics.
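
As an aside on where a value like obj_store_mem_pending can come from: one simple approach (sketched below with illustrative names; this is not the exact Ray Data implementation) is to extrapolate the average output size of finished tasks to the tasks still in flight:

    from dataclasses import dataclass


    @dataclass
    class _TaskStats:
        bytes_outputs_of_finished_tasks: int  # total output bytes from completed tasks
        num_tasks_finished: int               # tasks whose outputs are fully materialized
        num_tasks_in_flight: int              # submitted tasks with no outputs yet


    def estimate_pending_object_store_mem(stats: _TaskStats) -> float:
        # With no completed tasks there is no history to extrapolate from,
        # so the caller would fall back to a conservative default.
        if stats.num_tasks_finished == 0:
            return 0.0
        avg_output_bytes = (
            stats.bytes_outputs_of_finished_tasks / stats.num_tasks_finished
        )
        return stats.num_tasks_in_flight * avg_output_bytes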

bveeramani and others added 5 commits January 24, 2024 20:16
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
bveeramani merged commit 0c0ed96 into ray-project:master Jan 25, 2024
8 of 9 checks passed
bveeramani deleted the conservative-estimate branch January 25, 2024 20:01
Successfully merging this pull request may close these issues.

[data] Ray Data is not respecting object store memory limit
3 participants