
Time backpressure #43110

Merged
merged 15 commits into master on Feb 28, 2024
Conversation

@omatthew98 (Contributor) commented Feb 12, 2024

Why are these changes needed?

Currently we do not measure how much time is spent in backpressure. This change records per-operator backpressure time, which can be used to understand how operators are being executed and how a given set of backpressure policies affects the execution.

These stats are not yet propagated to DatasetStats; for now they are only stored in OpRuntimeMetrics. Ideally we would pass them along at a per-operator level, but we could also pass along only the total time spent in backpressure. A later PR will include this and other StreamingExecutor stats in the DatasetStatsSummary.
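At a high level, the accounting stamps the wall-clock time when an operator enters backpressure and accumulates the elapsed time when it leaves. A minimal sketch of that idea (class and method names here are illustrative, not the actual OpRuntimeMetrics API):

import time
from dataclasses import dataclass, field

@dataclass
class BackpressureTimer:
    # Total seconds the operator has spent paused due to backpressure.
    backpressure_time: float = 0.0
    # Start of the current pause, or -1 if not currently in backpressure.
    _backpressure_start_time: float = field(default=-1, repr=False)

    def on_toggle(self, in_backpressure: bool) -> None:
        now = time.perf_counter()
        if in_backpressure and self._backpressure_start_time < 0:
            # Entering backpressure: remember when the pause started.
            self._backpressure_start_time = now
        elif not in_backpressure and self._backpressure_start_time >= 0:
            # Leaving backpressure: accumulate the pause duration.
            self.backpressure_time += now - self._backpressure_start_time
            self._backpressure_start_time = -1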

Related issue number

Closes #42799

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Matthew Owen <mowen@anyscale.com>
@@ -105,6 +106,17 @@ class OpRuntimeMetrics:
default=0, metadata={"map_only": True, "export_metric": True}
)

# Time operator spent in backpressure
# TODO: Do we need both of these metadata here
Contributor:

shouldn't need map_only, since we expect this to apply to all operators. (The map_only field indicates that the field only applies to map-style operators.)
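A sketch of the suggested shape, reusing the field name from the test further below and keeping export_metric as in the original diff (whether that flag should be set yet is discussed in the next comment):

# Time operator spent in backpressure; applies to all operators, so no "map_only".
backpressure_time: float = field(default=0, metadata={"export_metric": True})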

Contributor:

the export_metric flag gates whether the field is included in OpRuntimeMetrics.as_dict(), which is what gets output as a string here: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/execution/streaming_executor.py#L315

so we should enable this later, once we are ready to include it in the stats output string.

Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <mowen@anyscale.com>
@@ -118,17 +115,16 @@ def extra_metrics(self) -> Dict[str, Any]:
"""Return a dict of extra metrics."""
return self._extra_metrics

def as_dict(self, metrics_only: bool = False):
def as_dict(self, metrics_only: bool = True):
Contributor Author:

I have modified the as_dict function so that metrics_only defaults to True, and a metric is now exported unless it explicitly has export_metric: False. The same metrics are exported as before, but this lets us properly hide the backpressure metrics for now.
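In other words, roughly this filtering rule (a sketch of the described behavior, not the actual implementation):

from dataclasses import fields
from typing import Any, Dict

def as_dict(self, metrics_only: bool = True) -> Dict[str, Any]:
    result = {}
    for f in fields(self):
        # Export by default; skip only metrics explicitly marked
        # export_metric=False (e.g. the backpressure timing, for now).
        if metrics_only and f.metadata.get("export_metric", True) is False:
            continue
        result[f.name] = getattr(self, f.name)
    return result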

Comment on lines 147 to 153
from ray.data._internal.execution.interfaces.op_runtime_metrics import (
OpRuntimeMetrics,
)

OpRuntimeMetrics.__dataclass_fields__["backpressure_time"].metadata = {
"export_metric": True
}
Contributor:

is it feasible to mock OpRuntimeMetrics instead of modifying the attribute in place?
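For example, patching the field's metadata only for the duration of the test might look like this (a sketch; whether this interacts cleanly with the rest of the suite is an assumption):

from types import MappingProxyType
from unittest import mock

from ray.data._internal.execution.interfaces.op_runtime_metrics import (
    OpRuntimeMetrics,
)

# Patch the metadata for the test only; mock restores the original on exit,
# so the class-level change does not leak into other tests.
backpressure_field = OpRuntimeMetrics.__dataclass_fields__["backpressure_time"]
with mock.patch.object(
    backpressure_field, "metadata", MappingProxyType({"export_metric": True})
):
    ...  # run the pipeline and assert on the exported metrics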

python/ray/data/tests/test_backpressure_policies.py (outdated)
ds = ds.map_batches(map_func1, batch_size=None, num_cpus=1, concurrency=1)
ds = ds.map_batches(map_func2, batch_size=None, num_cpus=1.1, concurrency=1)
ds.take_all()
assert 0 < ds._plan.stats().extra_metrics["backpressure_time"] < 1
Contributor:

maybe a short comment explaining how the upper limit 1 was calculated/estimated?

Contributor Author:

Actually, I'm going to remove this upper limit; we're really just testing that this time is greater than 0. The value 1 was chosen arbitrarily.
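So the check presumably ends up as just a lower bound (the exact metric key may change with the rename discussed below):

# Only assert that some backpressure time was recorded; no arbitrary upper bound.
assert ds._plan.stats().extra_metrics["backpressure_time"] > 0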

Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <mowen@anyscale.com>
block_generation_time: float = field(
default=0, metadata={"map_only": True, "export_metric": True}
)
block_generation_time: float = field(default=0, metadata={"map_only": True})
Contributor:

should we keep "export_metric": True for this?

Contributor Author:

Oh yeah, lemme change back!

Signed-off-by: Matthew Owen <mowen@anyscale.com>
@@ -523,14 +523,19 @@ def select_operator_to_run(
ops = []
for op, state in topology.items():
under_resource_limits = _execution_allowed(op, resource_manager)
# TODO: is this the only component of backpressure or should these
# other conditions be considered for timing?
in_backpressure = any(not p.can_add_input(op) for p in backpressure_policies)
Contributor:

when under_resource_limits or any(...), the operator is in backpressure for "task submission".
For this one, we can probably rename the metric to something like in_task_submission_backpressure.

another place regarding backpressure is here. This is to backpressure the output speed of the running tasks.
for the latter, it doesn't seem to make sense to record the "backpressure time". we can think of a better metric later.

Contributor Author:

Updated the metric names to make it clearer that this deals with task submission backpressure, and included under_resource_limits in the in_backpressure boolean.

Did you mean not under_resource_limits or any(...)? In other words is this the correct logic:

under_resource_limits = _execution_allowed(op, resource_manager)
in_backpressure = (
    any(not p.can_add_input(op) for p in backpressure_policies)
    or not under_resource_limits
)

Contributor Author:

And I agree it makes sense to add something for the second backpressure case, but I will do that in a later PR once we have a better understanding of what that metric should be.

@@ -523,14 +523,19 @@ def select_operator_to_run(
ops = []
for op, state in topology.items():
under_resource_limits = _execution_allowed(op, resource_manager)
in_backpressure = (
any(not p.can_add_input(op) for p in backpressure_policies)
or not under_resource_limits
Contributor:

nit, swap the order of these 2 conditions. so when not under_resource_limits, we don't need to go over the policies.
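A minimal sketch of the reordering (same logic, just letting or short-circuit before consulting the policies):

in_backpressure = not under_resource_limits or any(
    not p.can_add_input(op) for p in backpressure_policies
)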

# Start time of current pause due to task submission backpressure
_task_submission_backpressure_start_time: float = field(
default=-1, metadata={"export": False}
)
Contributor:

nit, if this is an internal-only field, we can move it to __init__.
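For example, something along these lines (a sketch; the op parameter and the existing initialization are assumed, not copied from the actual class):

def __init__(self, op):
    ...  # existing initialization
    # Internal-only bookkeeping; not a dataclass field, so it stays out of
    # as_dict() and other field-based introspection.
    self._task_submission_backpressure_start_time = -1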

@@ -138,6 +138,36 @@ def test_e2e_normal(self):
start2, end2 = ray.get(actor.get_start_and_end_time_for_op.remote(2))
assert start1 < start2 < end1 < end2, (start1, start2, end1, end2)

def test_e2e_time_backpressure(self):
Contributor:

maybe move this outside of TestConcurrencyCapBackpressurePolicy? The purpose of this test is to test measuring backpressure time, not the ConcurrencyCapBackpressurePolicy.

Contributor Author:

Going to leave as is to prevent having to duplicate some of the class methods outside the class. Once the metric is being exported, I will remove this duplicated test and include the assert from this test in each of the backpressure policy tests.

Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <mowen@anyscale.com>
@raulchen merged commit 076d3ba into ray-project:master on Feb 28, 2024
9 checks passed
Successfully merging this pull request may close these issues:

[Data] Measure time from scheduler paused due to backpressure in DatasetStats