[data] store ray dashboard metrics in _StatsActor #40118

Zandew · 2023-10-04T20:03:32Z

Why are these changes needed?

Stores dataset metrics in _StatsActor. These stats will be emitted to Prometheus and used to build the Ray Data dashboard. Each StreamingExecutor collects the stats after every _scheduling_loop_step.

The metrics will be displayed under "Ray Data Metrics" in the Metrics tab of the Ray dashboard.
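A rough sketch of the flow described above, for readers skimming the PR. It does not use Ray: `FakeGauge`, `StatsActorSketch`, `collect_stats`, and the metric names are hypothetical stand-ins, and the real code would use a Ray metrics gauge inside the actual `_StatsActor`. It only illustrates the idea of summing per-operator metrics once per scheduling step and pushing them to a single actor that sets gauges tagged by dataset.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class FakeGauge:
    """Stand-in for a Prometheus-style gauge: keeps the last value per tag set."""

    def __init__(self, name: str, description: str, tag_keys: Tuple[str, ...]):
        self.name = name
        self.description = description
        self.tag_keys = tag_keys
        self.values: Dict[Tuple[str, ...], float] = {}

    def set(self, value: float, tags: Dict[str, str]) -> None:
        key = tuple(tags[k] for k in self.tag_keys)
        self.values[key] = value


class StatsActorSketch:
    """Hypothetical stand-in for _StatsActor: one gauge per metric, tagged by dataset."""

    def __init__(self, metric_names: List[str]):
        self.gauges = {
            name: FakeGauge(name, f"{name} per dataset", ("dataset",))
            for name in metric_names
        }

    def update_metrics(self, stats: Dict[str, float], tags: Dict[str, str]) -> None:
        for name, value in stats.items():
            self.gauges[name].set(value, tags)


def collect_stats(op_metrics: List[Dict[str, float]]) -> Dict[str, float]:
    """Sum each metric across operators, as one scheduling step might do."""
    totals: Dict[str, float] = defaultdict(float)
    for metrics in op_metrics:
        for name, value in metrics.items():
            totals[name] += value
    return dict(totals)


# One simulated scheduling step with two operators.
actor = StatsActorSketch(["bytes_spilled", "bytes_allocated"])
ops = [
    {"bytes_spilled": 10, "bytes_allocated": 100},
    {"bytes_spilled": 5, "bytes_allocated": 50},
]
actor.update_metrics(collect_stats(ops), tags={"dataset": "dataset_1"})
```

Gauges fit here because each step reports the current total rather than an increment, so the latest value simply overwrites the previous one.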

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Zandew · 2023-10-05T21:04:55Z

@c21 @raulchen @ericl are we okay with creating a cluster-wide actor to collect these stats (exactly like _StatsActor) for the ray data dashboard? I don't think the concerns here would hold for this case since we're directly emitting the metrics to prometheus.

ericl · 2023-10-05T22:38:19Z

Why not re-use the StatsActor?

Zandew · 2023-10-05T22:51:31Z

> Why not re-use the StatsActor?

If we intend to deprecate/remove StatsActor it might be easier if we keep it separate? I am cool with reusing it as well.

ericl · 2023-10-05T22:59:43Z

> If we intend to deprecate/remove StatsActor it might be easier if we keep it separate? I am cool with reusing it as well.

Yeah please don't create a new actor in this case then. You can add new methods to the actor and deprecate old methods, without having two similar/duplicate actors show up to users, which will be confusing.
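To make that suggestion concrete, here is a minimal sketch of one actor-like class that gains a new method while the old one stays as a deprecated wrapper. The class and method names (`StatsActorSketch`, `update_metrics`, `record_stat`, `get_metrics`) are hypothetical and are not the real `_StatsActor` API; this only shows the general pattern, assuming plain Python `warnings` is acceptable for the deprecation notice.

```python
import warnings
from typing import Dict


class StatsActorSketch:
    """Hypothetical single stats holder that serves both old and new callers."""

    def __init__(self) -> None:
        self._metrics: Dict[str, Dict[str, float]] = {}

    def update_metrics(self, dataset: str, stats: Dict[str, float]) -> None:
        """New entry point: record a batch of metrics for one dataset."""
        self._metrics.setdefault(dataset, {}).update(stats)

    def record_stat(self, dataset: str, name: str, value: float) -> None:
        """Old entry point (hypothetical), kept as a thin deprecated wrapper."""
        warnings.warn(
            "record_stat is deprecated; use update_metrics instead",
            DeprecationWarning,
            stacklevel=2,
        )
        self.update_metrics(dataset, {name: value})

    def get_metrics(self, dataset: str) -> Dict[str, float]:
        """Return a copy of the metrics recorded for one dataset."""
        return dict(self._metrics.get(dataset, {}))
```

Users then see a single actor, and old call sites keep working until the deprecated method is removed.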

raulchen

FYI, I'm also working on a PR that will standardize metrics recording. #40173

raulchen · 2023-10-12T22:20:14Z

python/ray/data/_internal/execution/streaming_executor.py

+            metrics = op.metrics
+            resource_usage = op.current_resource_usage()
+
+            stats[DataMetric.BYTES_SPILLED] += metrics.obj_store_mem_spilled


Can we just reuse the same keys as in OpRuntimeMetrics? Also, can we just report the entire metrics dict, i.e. metrics.as_dict()?
I'm concerned about the maintenance overhead and inconsistency.

Zandew · 2023-10-12

There will be metrics that are not held in OpRuntimeMetrics, right? What key would we use for those? And are you saying we should remove the DataMetric enum as well?

raulchen · 2023-10-13

I think eventually we should move everything to OpRuntimeMetrics, and then DataMetric doesn't seem necessary anymore.

raulchen · 2023-10-13

Actually, we can easily move op.current_resource_usage() to OpRuntimeMetrics. Just return incremental_resource_usage() * num_running_tasks.
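As a sketch of that suggestion: current usage is the per-task incremental usage scaled by the number of running tasks. `Resources` and `OpMetricsSketch` below are hypothetical stand-ins written for illustration, not Ray's own resource or metrics classes, and the field names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Hypothetical resource bundle; field names are illustrative only."""

    cpu: float = 0.0
    gpu: float = 0.0
    object_store_memory: float = 0.0

    def scale(self, factor: int) -> "Resources":
        """Return a new bundle with every field multiplied by factor."""
        return Resources(
            cpu=self.cpu * factor,
            gpu=self.gpu * factor,
            object_store_memory=self.object_store_memory * factor,
        )


@dataclass
class OpMetricsSketch:
    """Hypothetical per-operator metrics holding what the formula needs."""

    incremental_resource_usage: Resources
    num_running_tasks: int = 0

    def current_resource_usage(self) -> Resources:
        # Usage of one task times the number of tasks currently running.
        return self.incremental_resource_usage.scale(self.num_running_tasks)


metrics = OpMetricsSketch(
    incremental_resource_usage=Resources(cpu=1.0, object_store_memory=100.0),
    num_running_tasks=3,
)
usage = metrics.current_resource_usage()
```

This only holds if every running task of the operator uses the same incremental amount, which is the assumption the comment above makes.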

Zandew · 2023-10-13

Would we still want to store these keys as variables somewhere? Otherwise we would have to hardcode these keys in tests and in stats.py.
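One possible answer to the question above, sketched under assumptions: if the metrics live in a dataclass, the keys can be derived from its fields, so tests and stats code never hardcode strings. `RuntimeMetricsSketch` and `example_counter` are hypothetical; only `obj_store_mem_spilled` and `as_dict()` come from the discussion in this thread.

```python
from dataclasses import asdict, dataclass, fields
from typing import Dict, Tuple


@dataclass
class RuntimeMetricsSketch:
    """Hypothetical stand-in for a per-operator metrics dataclass."""

    obj_store_mem_spilled: int = 0
    example_counter: int = 0  # hypothetical field, for illustration only

    def as_dict(self) -> Dict[str, int]:
        """Report every metric under its field name."""
        return asdict(self)


# Derived once from the dataclass, so the keys cannot drift from the fields.
METRIC_KEYS: Tuple[str, ...] = tuple(f.name for f in fields(RuntimeMetricsSketch))

reported = RuntimeMetricsSketch(obj_store_mem_spilled=42).as_dict()
```

A test could then iterate over `METRIC_KEYS` instead of listing names by hand, and a separate enum of metric names would no longer be needed.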

stephanie-wang

Just reviewed the stats descriptions. Thanks!

stephanie-wang · 2023-10-14T17:26:46Z

python/ray/data/_internal/stats.py

+        tags_keys = ("dataset",)
+        self.bytes_spilled = Gauge(
+            DataMetric.BYTES_SPILLED.value,
+            description="Bytes spilled by dataset operators",


This description isn't really accurate for the object store-based metrics, is it? Isn't it for the whole cluster?

Zandew · 2023-10-16

This is the one that uses get_object_locations, so it's for the dataset.

raulchen

LGTM. Only some final small comments.

Signed-off-by: Andrew Xue <andrewxue@anyscale.com>

Config changes for #40118 Creates a ray data dashboard. --------- Signed-off-by: Andrew Xue <andrewxue@anyscale.com> Signed-off-by: Alan Guo <aguo@anyscale.com> Co-authored-by: Alan Guo <aguo@anyscale.com>

Zandew commented Oct 4, 2023

Zandew assigned scottjlee Oct 4, 2023

scottjlee reviewed Oct 5, 2023

scottjlee reviewed Oct 5, 2023

Zandew changed the title ~~[data] create dataset metrics actor~~ [data] store ray dashboard metrics in _StatsActor Oct 5, 2023

raulchen reviewed Oct 6, 2023

Zandew mentioned this pull request Oct 6, 2023: [data] ray data dashboard config #40195 (Merged)

raulchen self-assigned this Oct 10, 2023

Zandew force-pushed the dataset-metrics branch from d4c5fd1 to 11e4d9d Compare October 11, 2023 17:19

Zandew marked this pull request as ready for review October 11, 2023 17:56

Zandew requested review from ericl, scv119, c21, amogkam, bveeramani and stephanie-wang as code owners October 11, 2023 17:56

c21 assigned stephanie-wang Oct 12, 2023

scottjlee reviewed Oct 12, 2023

scottjlee approved these changes Oct 12, 2023

raulchen reviewed Oct 12, 2023

alanwguo reviewed Oct 12, 2023

stephanie-wang reviewed Oct 14, 2023

Zandew force-pushed the dataset-metrics branch from 33dbff8 to ab0d26c Compare October 16, 2023 16:11

raulchen approved these changes Oct 16, 2023

Andrew Xue added 2 commits October 16, 2023 11:37, each with Signed-off-by: Andrew Xue <andrewxue@anyscale.com>

- efdcf1d create dataset metrics actor
- 8753cd4 test metrics actor call args

Andrew Xue added 20 commits October 16, 2023 11:37, each with Signed-off-by: Andrew Xue <andrewxue@anyscale.com>

- 85c9e15 re-use statsactor
- 6a77dd7 add rest of metrics
- 7178d01 add tests and refactor statsactor methods
- 468245f change metric types to gauge
- 1b15666 fix tests
- 6b0c6e3 fix
- 3c9059f use cluster_id to verify statsactor
- 7f3c508 use OpRuntimeMetrics
- ee4a3c0 fix cur bytes for input op
- b0346ca update test and comments
- 1dba7d2 refactor and comments
- 12ce352 rename _reset_metrics to _clear_metrics
- 4439c83 format comments
- fcd55e1 update fn names and fix inputdatabuffer metrics
- 124ce8c rename
- b8ffe52 update descriptions
- 7ae4583 refactor
- c4ed604 update descriptions
- 2896b38 comments
- eed14ca add metrics_only flag

Zandew force-pushed the dataset-metrics branch from 0e018c6 to eed14ca Compare October 16, 2023 18:37

Andrew Xue added 2 commits October 16, 2023 11:50, each with Signed-off-by: Andrew Xue <andrewxue@anyscale.com>

- fe10511 lint
- 6ab9dc4 add resource usage to tests

raulchen merged commit 0c06bb9 into ray-project:master Oct 16, 2023
33 of 40 checks passed

rickyyx mentioned this pull request Oct 26, 2023: Release test long_running_many_actor_tasks failed #40568 (Closed)
