Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[data] store ray dashboard metrics in _StatsActor #40118

Merged
merged 24 commits into from
Oct 16, 2023

Conversation

Zandew
Copy link
Contributor

@Zandew Zandew commented Oct 4, 2023

Why are these changes needed?

Stores dataset metrics in _StatsActor. These stats will be emitted to prometheus and used to create the Ray Data Dashboard. Stats will be collected by StreamingExecutors after each _scheduling_loop_step.

To be displayed under the "Ray Data Metrics" under metrics tab in ray dashboard

Screenshot 2023-10-11 at 10 44 23 AM

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@Zandew
Copy link
Contributor Author

Zandew commented Oct 5, 2023

@c21 @raulchen @ericl are we okay with creating a cluster-wide actor to collect these stats (exactly like _StatsActor) for the ray data dashboard? I don't think the concerns here would hold for this case since we're directly emitting the metrics to prometheus.

@ericl
Copy link
Contributor

ericl commented Oct 5, 2023

Why not re-use the StatsActor?

@Zandew
Copy link
Contributor Author

Zandew commented Oct 5, 2023

Why not re-use the StatsActor?

If we intend to deprecate/remove StatsActor it might be easier if we keep it separate? I am cool with reusing it as well.

@ericl
Copy link
Contributor

ericl commented Oct 5, 2023

If we intend to deprecate/remove StatsActor it might be easier if we keep it separate? I am cool with reusing it as well.

Yeah please don't create a new actor in this case then. You can add new methods to the actor and deprecate old methods, without having two similar/duplicate actors show up to users, which will be confusing.

@Zandew Zandew changed the title [data] create dataset metrics actor [data] store ray dashboard metrics in _StatsActor Oct 5, 2023
Copy link
Contributor

@raulchen raulchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, I'm also working on a PR that will standardize metrics recording. #40173

python/ray/data/_internal/execution/streaming_executor.py Outdated Show resolved Hide resolved
python/ray/data/_internal/execution/streaming_executor.py Outdated Show resolved Hide resolved
python/ray/data/_internal/execution/streaming_executor.py Outdated Show resolved Hide resolved
python/ray/data/_internal/execution/streaming_executor.py Outdated Show resolved Hide resolved
python/ray/data/_internal/stats.py Outdated Show resolved Hide resolved
python/ray/data/_internal/stats.py Outdated Show resolved Hide resolved
metrics = op.metrics
resource_usage = op.current_resource_usage()

stats[DataMetric.BYTES_SPILLED] += metrics.obj_store_mem_spilled
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just reuse the same key as in OpRuntimeMetrics? Also, can we just report the entire metrics metrics.as_dict()?
I'm concerned about the maintenance overheads and inconsistency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will be metrics that are not held in OpRuntimeMetrics right? what key would we use for those? And are you saying we should remove the DataMetric enum as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think eventually we should move everything to OpRuntimeMetrics. and then DataMetric doesn't seem necessary anymore.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, we can easily move op.current_resource_usage() to OpRuntimeMetrics. Just return incremental_resource_usage() * num_running_tasks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we still want to store these keys as variables somewhere? Otherwise we would have to hardcode these keys in tests and in stats.py

python/ray/data/_internal/execution/streaming_executor.py Outdated Show resolved Hide resolved
python/ray/data/tests/test_stats.py Outdated Show resolved Hide resolved
python/ray/data/_internal/stats.py Outdated Show resolved Hide resolved
Copy link
Contributor

@stephanie-wang stephanie-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just reviewed the stats descriptions. Thanks!

tags_keys = ("dataset",)
self.bytes_spilled = Gauge(
DataMetric.BYTES_SPILLED.value,
description="Bytes spilled by dataset operators",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This description isn't really accurate for the object store-based metrics, is it? Isn't it for the whole cluster?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the one that uses get_object_locations, so it's for the dataset

python/ray/data/_internal/stats.py Outdated Show resolved Hide resolved
python/ray/data/_internal/stats.py Outdated Show resolved Hide resolved
Copy link
Contributor

@raulchen raulchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. only some final small comments.

Andrew Xue added 2 commits October 16, 2023 11:37
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
Andrew Xue added 20 commits October 16, 2023 11:37
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
fix
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
Andrew Xue added 2 commits October 16, 2023 11:50
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
@raulchen raulchen merged commit 0c06bb9 into ray-project:master Oct 16, 2023
33 of 40 checks passed
stephanie-wang pushed a commit that referenced this pull request Oct 18, 2023
Config changes for #40118

Creates a ray data dashboard.

---------

Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
Signed-off-by: Alan Guo <aguo@anyscale.com>
Co-authored-by: Alan Guo <aguo@anyscale.com>
jonathan-anyscale pushed a commit to jonathan-anyscale/ray that referenced this pull request Oct 26, 2023
Config changes for ray-project#40118

Creates a ray data dashboard.

---------

Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
Signed-off-by: Alan Guo <aguo@anyscale.com>
Co-authored-by: Alan Guo <aguo@anyscale.com>
jonathan-anyscale pushed a commit to jonathan-anyscale/ray that referenced this pull request Oct 26, 2023
Config changes for ray-project#40118

Creates a ray data dashboard.

---------

Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
Signed-off-by: Alan Guo <aguo@anyscale.com>
Co-authored-by: Alan Guo <aguo@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants