[Data] Additional Ray Data Dashboard Metrics #43628

scottjlee · 2024-03-01T20:16:46Z

Why are these changes needed?

- Adds the remaining metrics from the OpRuntimeMetrics class as new time series charts on the Grafana and Ray Data dashboards.
- Cleans up the OpRuntimeMetrics and StatsActor code, grouping related metrics by area and consolidating descriptions and comments.
- Visually groups each section of Ray Data metrics. See below for screenshots of each section.
- ~~Programmatically generate Grafana panels from OpRuntimeMetrics fields.~~ This is currently not possible, since we would need to add Ray Data as a dependency for Ray dashboards / Serve.
Screenshots (images not preserved): Overview, Inputs, Outputs, Tasks, Object Store Memory, Iteration.

Related issue number

Closes #42437

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee · 2024-03-04T18:13:45Z

premerge failures look unrelated to this PR.

raulchen

by the way, do we already have OpState.outqueue.memory_usage()?

python/ray/data/_internal/stats.py

dashboard/modules/metrics/dashboards/data_dashboard_panels.py

python/ray/data/_internal/execution/interfaces/op_runtime_metrics.py

omatthew98

Few micro nits, but lgtm!

python/ray/data/_internal/stats.py

omatthew98 · 2024-03-04T18:58:08Z

dashboard/client/src/pages/metrics/Metrics.tsx

@@ -208,10 +208,111 @@ const DATA_METRICS_CONFIG: MetricsSectionConfig[] = [
        title: "Rows Outputted",
        pathParams: "orgId=1&theme=light&panelId=11",
      },
+      // Inputs-related metrics
+      {
+        title: "Input Blocks Received by Operator",


Nit: I assume the display order is based on panelId? Should these be ordered by panelId for organization, or is the order here important?

panel id is actually unrelated to the order; the display order directly follows the order of elements in this .tsx file. the only restriction is that each panel id needs to be unique and match the id of the corresponding grafana panel

omatthew98 · 2024-03-04T18:59:16Z

dashboard/modules/metrics/dashboards/data_dashboard_panels.py

@@ -119,6 +119,330 @@
        fill=0,
        stack=False,
    ),
+    # Inputs-related metrics
+    Panel(
+        id=17,


Nit: Similar to above, should we order these by panel id?

panel ids don't need to be in increasing order. i found that whenever we need to insert new metrics/panels, it breaks the ordering of all the ids anyway, so either we always keep incrementing ids, or we just make sure they are all unique (this is already enforced by the dashboard code)
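
The uniqueness constraint discussed in this thread can be sketched as a small check. This is only an illustration, not the dashboard's actual validation code: the `Panel` dataclass, the `find_duplicate_panel_ids` helper, and the panel titles are hypothetical stand-ins.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Panel:
    """Hypothetical stand-in for a Grafana panel definition."""

    id: int
    title: str


def find_duplicate_panel_ids(panels: List[Panel]) -> List[int]:
    """Return the panel ids that appear more than once, sorted ascending."""
    counts = Counter(panel.id for panel in panels)
    return sorted(pid for pid, n in counts.items() if n > 1)


# Ids do not need to be increasing; only uniqueness matters.
# Titles below are illustrative placeholders.
panels = [
    Panel(id=11, title="Panel A"),
    Panel(id=17, title="Panel B"),
    Panel(id=3, title="Panel C"),
]
duplicates = find_duplicate_panel_ids(panels)
```

With a check like this, new panels can be inserted anywhere in the list as long as they take an unused id.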

python/ray/data/tests/test_stats.py

Signed-off-by: Scott Lee <sjl@anyscale.com>

raulchen · 2024-03-05T18:48:07Z

python/ray/data/_internal/execution/interfaces/op_runtime_metrics.py

+    num_task_inputs_processed: int = field(
+        default=0,
+        metadata={
+            "description": "Number of input blocks processed by tasks.",


maybe mention "finished processing" to make it more clear.

Suggested change

-            "description": "Number of input blocks processed by tasks.",
+            "description": "Number of input blocks that operator tasks have finished processing.",

raulchen · 2024-03-05T18:48:19Z

python/ray/data/_internal/execution/interfaces/op_runtime_metrics.py

+    bytes_task_inputs_processed: int = field(
+        default=0,
+        metadata={
+            "description": "Byte size of blocks processed by tasks.",


raulchen · 2024-03-05T18:52:56Z

python/ray/data/_internal/execution/interfaces/op_runtime_metrics.py

+        metadata={
+            "description": (
+                "Number of rows in generated output blocks "
+                "that are from finished tasks."


the original comment is wrong. the rows are not only from finished tasks. it should just be "Number of rows generated by tasks."

raulchen · 2024-03-05T18:55:25Z

python/ray/data/_internal/execution/interfaces/op_runtime_metrics.py

+            "metrics_group": "outputs",
+        },
+    )
+    num_outputs_of_finished_tasks: int = field(


ok to not expose this metric and the next. they are used to compute another property.

removed from the grafana and ray data dashboards.

raulchen · 2024-03-05T18:57:22Z

python/ray/data/_internal/execution/interfaces/op_runtime_metrics.py

+    num_tasks_have_outputs: int = field(
+        default=0,
+        metadata={
+            "description": "Number of tasks with at least one output block.",


Suggested change

-            "description": "Number of tasks with at least one output block.",
+            "description": "Number of tasks that already have output.",

raulchen · 2024-03-05T19:00:41Z

python/ray/data/_internal/execution/interfaces/op_runtime_metrics.py

@@ -133,14 +282,12 @@ def extra_metrics(self) -> Dict[str, Any]:
        """Return a dict of extra metrics."""
        return self._extra_metrics

-    def as_dict(self, metrics_only: bool = False):
+    def as_dict(self):
        """Return a dict representation of the metrics."""
        result = []
        for f in fields(self):
            if f.metadata.get("export", True):


this "export" flag no longer seems to be used
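
For context, the `export` check in the hunk above reads per-field metadata attached through `dataclasses.field`. Below is a minimal sketch of that pattern, assuming a trimmed-down dataclass: `ExampleMetrics`, its choice of fields, and the `"inputs"` / `"tasks"` group names are illustrative and not the real `OpRuntimeMetrics` definition.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict


@dataclass
class ExampleMetrics:
    """Hypothetical, trimmed-down stand-in for a metrics dataclass."""

    num_task_inputs_processed: int = field(
        default=0,
        metadata={
            "description": (
                "Number of input blocks that operator tasks "
                "have finished processing."
            ),
            "metrics_group": "inputs",  # assumed group name
        },
    )
    num_tasks_have_outputs: int = field(
        default=0,
        metadata={
            "description": "Number of tasks that already have output.",
            "metrics_group": "tasks",  # assumed group name
        },
    )
    # Intermediate value used to compute other properties; not exported.
    num_outputs_of_finished_tasks: int = field(
        default=0,
        metadata={"export": False},
    )

    def as_dict(self) -> Dict[str, Any]:
        """Return a name -> value dict of the exported fields."""
        result = []
        for f in fields(self):
            # A field is exported unless its metadata sets "export" to False.
            if f.metadata.get("export", True):
                result.append((f.name, getattr(self, f.name)))
        return dict(result)


metrics = ExampleMetrics(num_task_inputs_processed=5)
exported = metrics.as_dict()
```

If no field sets `"export"` to `False`, the check is always true and can be dropped, which is the situation the comment above points at.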

raulchen · 2024-03-05T19:21:03Z

python/ray/data/_internal/execution/interfaces/op_runtime_metrics.py

@@ -253,25 +398,42 @@ def average_bytes_change_per_task(self) -> Optional[float]:

        return self.average_bytes_outputs_per_task - self.average_bytes_inputs_per_task

+    @property
+    def estimated_object_store_usage(self) -> Optional[float]:


the object store usage of an op is actually calculated as obj_store_mem_pending_task_outputs + obj_store_mem_internal_outqueue + OpState.outqueue.memory_usage + sum(next_op.obj_store_mem_pending_task_inputs + next_op.obj_store_mem_internal_inqueue).
it's currently calculated in ResourceManager, because OpState isn't accessible here.
it'd also be useful to report this to the dashboard.
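
The formula in this comment can be written out as a small function. This is a sketch under stated assumptions, not the ResourceManager implementation: `OpMemory` is a hypothetical container, and `external_outqueue_memory_usage` stands in for `OpState.outqueue.memory_usage`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OpMemory:
    """Hypothetical per-operator memory figures, in bytes."""

    obj_store_mem_pending_task_outputs: int = 0
    obj_store_mem_internal_outqueue: int = 0
    obj_store_mem_pending_task_inputs: int = 0
    obj_store_mem_internal_inqueue: int = 0
    # Stand-in for OpState.outqueue.memory_usage (assumed name).
    external_outqueue_memory_usage: int = 0


def object_store_usage(op: OpMemory, next_ops: List[OpMemory]) -> int:
    """Sum the op's own output memory plus its downstream ops' input memory."""
    own = (
        op.obj_store_mem_pending_task_outputs
        + op.obj_store_mem_internal_outqueue
        + op.external_outqueue_memory_usage
    )
    downstream = sum(
        n.obj_store_mem_pending_task_inputs + n.obj_store_mem_internal_inqueue
        for n in next_ops
    )
    return own + downstream


op = OpMemory(
    obj_store_mem_pending_task_outputs=100,
    obj_store_mem_internal_outqueue=50,
    external_outqueue_memory_usage=25,
)
next_ops = [
    OpMemory(obj_store_mem_pending_task_inputs=10, obj_store_mem_internal_inqueue=5),
    OpMemory(obj_store_mem_pending_task_inputs=1, obj_store_mem_internal_inqueue=2),
]
usage = object_store_usage(op, next_ops)  # 175 own + 18 downstream
```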

Signed-off-by: Scott Lee <sjl@anyscale.com>

raulchen · 2024-03-06T19:07:10Z

python/ray/data/_internal/execution/streaming_executor.py

+        if op:
+            execution_resources = self._resource_manager._op_usages[op]
+            op_object_store_memory = execution_resources.object_store_memory
+            op._metrics.obj_store_mem_used = op_object_store_memory


actually, let's update this in ResourceManager.update_resources,
so we won't forget to call this method.
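
The design point here is to write the metric in the same step that recomputes usage, so no separate call can be missed. A minimal sketch of that idea, with hypothetical names throughout (`ResourceTracker`, `Op`, `OpMetrics`, and the `measured` input are illustrative, not Ray's classes):

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class OpMetrics:
    """Hypothetical metrics holder for one operator."""

    obj_store_mem_used: int = 0


@dataclass
class Op:
    """Hypothetical operator with attached metrics."""

    name: str
    metrics: OpMetrics = field(default_factory=OpMetrics)


class ResourceTracker:
    """Hypothetical stand-in for a resource manager."""

    def __init__(self) -> None:
        self._op_usages: Dict[str, int] = {}

    def update_usages(self, ops: List[Op], measured: Dict[str, int]) -> None:
        """Recompute per-op usage and sync the exported metric in one place."""
        for op in ops:
            usage = measured.get(op.name, 0)
            self._op_usages[op.name] = usage
            # Updating the metric here means callers never need a separate
            # "publish metrics" call that could be forgotten.
            op.metrics.obj_store_mem_used = usage


ops = [Op("read"), Op("map")]
tracker = ResourceTracker()
tracker.update_usages(ops, {"read": 100})
```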

Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee added 10 commits March 1, 2024 12:16

add input/output queue metrics

c3615d4

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into 0229-dash-metrics

c15bf3b

Signed-off-by: Scott Lee <sjl@anyscale.com>

add input related metrics

7f3d7e2

Signed-off-by: Scott Lee <sjl@anyscale.com>

add task related metrics

ad5f80a

Signed-off-by: Scott Lee <sjl@anyscale.com>

add remaining metrics

e93616b

Signed-off-by: Scott Lee <sjl@anyscale.com>

tests

ad04cbb

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into 0229-dash-metrics

edf90cb

Signed-off-by: Scott Lee <sjl@anyscale.com>

update docs

1f6773b

Signed-off-by: Scott Lee <sjl@anyscale.com>

vale spelling

708ffe5

Signed-off-by: Scott Lee <sjl@anyscale.com>

additional spelling

a6d4ba9

Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee marked this pull request as ready for review March 4, 2024 18:06

scottjlee requested review from ericl, scv119, c21, amogkam, bveeramani, raulchen, stephanie-wang and omatthew98 as code owners March 4, 2024 18:06

scottjlee assigned raulchen and c21 Mar 4, 2024

scottjlee changed the title ~~[Data] Additional Ray Dashboard Metrics~~ [Data] Additional Ray Data Dashboard Metrics Mar 4, 2024

raulchen reviewed Mar 4, 2024

scottjlee added the release-blocker (P0: issue that blocks the release) label Mar 4, 2024

omatthew98 approved these changes Mar 4, 2024

scottjlee added 4 commits March 4, 2024 20:26

group metrics into dict

1eacbc8

Signed-off-by: Scott Lee <sjl@anyscale.com>

generate panel from metric

6b94ec1

Signed-off-by: Scott Lee <sjl@anyscale.com>

group into sections

85cfb9b

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into 0229-dash-metrics

2f777ca

Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee requested a review from raulchen March 5, 2024 18:05

scottjlee added 2 commits March 5, 2024 10:28

move data import inside method

26f09e0

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into 0229-dash-metrics

f1a450f

Signed-off-by: Scott Lee <sjl@anyscale.com>

raulchen reviewed Mar 5, 2024

scottjlee added 5 commits March 5, 2024 14:21

revert ray data import from dashboard, address comments

361fd93

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into 0229-dash-metrics

4f497c3

Signed-off-by: Scott Lee <sjl@anyscale.com>

clean up

f12885e

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into 0229-dash-metrics

a61bff9

Signed-off-by: Scott Lee <sjl@anyscale.com>

deflake tests

115062b

Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee requested a review from raulchen March 6, 2024 04:31

raulchen reviewed Mar 6, 2024

raulchen approved these changes Mar 6, 2024

update op object store usage in update_usages

d7cb1ff

Signed-off-by: Scott Lee <sjl@anyscale.com>

raulchen approved these changes Mar 6, 2024

raulchen merged commit 52455e5 into ray-project:master Mar 6, 2024
9 checks passed
