[data] store ray dashboard metrics in _StatsActor #40118
Why not re-use the StatsActor?
If we intend to deprecate/remove StatsActor it might be easier if we keep it separate? I am cool with reusing it as well.
Yeah, please don't create a new actor in this case then. You can add new methods to the actor and deprecate old methods, without having two similar/duplicate actors show up to users, which will be confusing.
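A minimal sketch of that suggestion, in plain Python rather than a Ray actor: a new method is added to the existing class while the old one stays callable but emits a deprecation warning. `StatsActorSketch`, `record_legacy`, `update_metrics`, and `get_metrics` are hypothetical names for illustration, not the methods of the real `_StatsActor`.

```python
import warnings


class StatsActorSketch:
    """Plain-Python stand-in for a stats actor (not Ray's _StatsActor)."""

    def __init__(self):
        self._legacy_records = {}
        self._dashboard_metrics = {}

    def record_legacy(self, dataset_id, value):
        # Hypothetical pre-existing method, kept so old callers still work.
        warnings.warn(
            "record_legacy is deprecated; use update_metrics instead",
            DeprecationWarning,
            stacklevel=2,
        )
        self._legacy_records[dataset_id] = value

    def update_metrics(self, dataset_id, metrics):
        # Hypothetical new method added to the same class.
        self._dashboard_metrics.setdefault(dataset_id, {}).update(metrics)

    def get_metrics(self, dataset_id):
        # Return a copy so callers cannot mutate the stored state.
        return dict(self._dashboard_metrics.get(dataset_id, {}))
```

Users then see a single actor, and old call sites can be migrated gradually.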
FYI, I'm also working on a PR that will standardize metrics recording. #40173
d4c5fd1 to 11e4d9d
metrics = op.metrics
resource_usage = op.current_resource_usage()

stats[DataMetric.BYTES_SPILLED] += metrics.obj_store_mem_spilled
Can we just reuse the same key as in OpRuntimeMetrics? Also, can we just report the entire metrics via metrics.as_dict()? I'm concerned about the maintenance overhead and inconsistency.
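To illustrate the idea of reporting the whole metrics object under its own keys, here is a rough sketch. `RuntimeMetricsSketch` is a stand-in dataclass, not the real `OpRuntimeMetrics`; only the field name `obj_store_mem_spilled` comes from the diff above, and `num_tasks_finished` and `aggregate` are made up for the example.

```python
from collections import defaultdict
from dataclasses import asdict, dataclass


@dataclass
class RuntimeMetricsSketch:
    """Stand-in for per-operator metrics (assumed shape)."""

    obj_store_mem_spilled: int = 0
    num_tasks_finished: int = 0  # hypothetical second field

    def as_dict(self):
        return asdict(self)


def aggregate(per_op_metrics):
    """Sum every field across operators, keyed by the field's own name."""
    totals = defaultdict(int)
    for op_metrics in per_op_metrics:
        for key, value in op_metrics.as_dict().items():
            totals[key] += value
    return dict(totals)
```

With this shape, adding a field to the metrics class automatically adds a key to the aggregated stats, so there is no second list of names to keep in sync.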
There will be metrics that are not held in OpRuntimeMetrics, right? What key would we use for those? And are you saying we should remove the DataMetric enum as well?
I think eventually we should move everything to OpRuntimeMetrics, and then DataMetric doesn't seem necessary anymore.
Actually, we can easily move op.current_resource_usage() to OpRuntimeMetrics. Just return incremental_resource_usage() * num_running_tasks.
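The arithmetic proposed here can be sketched as follows. `ResourcesSketch` and its fields are assumptions for illustration and do not mirror Ray's actual resource type; the point is only that current usage is the per-task increment scaled by the number of running tasks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourcesSketch:
    """Hypothetical resource bundle; the field names are assumptions."""

    cpu: float = 0.0
    gpu: float = 0.0
    object_store_memory: float = 0.0

    def scale(self, factor):
        # Multiply every resource dimension by the same factor.
        return ResourcesSketch(
            cpu=self.cpu * factor,
            gpu=self.gpu * factor,
            object_store_memory=self.object_store_memory * factor,
        )


def current_resource_usage(incremental, num_running_tasks):
    """Per-task increment times the number of tasks currently running."""
    return incremental.scale(num_running_tasks)
```

This assumes every running task uses the same incremental amount, which is what the comment implies.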
Would we still want to store these keys as variables somewhere? Otherwise we would have to hardcode these keys in tests and in stats.py
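One way to keep the keys as named constants, so tests and stats.py do not hardcode strings, is an enum like the `DataMetric` discussed above. The sketch below is a guess at the shape: `DataMetricSketch`, the string value `"bytes_spilled"`, and `empty_stats` are assumptions, not the PR's actual definitions.

```python
from enum import Enum


class DataMetricSketch(Enum):
    """Named metric keys; the string value is an assumed metric name."""

    BYTES_SPILLED = "bytes_spilled"


def empty_stats():
    # One zeroed counter per declared metric key.
    return {metric: 0 for metric in DataMetricSketch}
```

Call sites then write `stats[DataMetricSketch.BYTES_SPILLED] += n`, and a renamed metric only changes in one place.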
Just reviewed the stats descriptions. Thanks!
python/ray/data/_internal/stats.py (outdated)
tags_keys = ("dataset",)
self.bytes_spilled = Gauge(
    DataMetric.BYTES_SPILLED.value,
    description="Bytes spilled by dataset operators",
This description isn't really accurate for the object store-based metrics, is it? Isn't it for the whole cluster?
This is the one that uses get_object_locations, so it's for the dataset.
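To make the per-dataset scoping concrete, here is a small stand-in for a tagged gauge. `GaugeSketch` is not `ray.util.metrics.Gauge`; it only imitates the idea that a gauge declared with a `dataset` tag key stores one value per dataset rather than one value for the whole cluster.

```python
class GaugeSketch:
    """Minimal tagged gauge (illustrative stand-in, not Ray's Gauge)."""

    def __init__(self, name, description="", tag_keys=()):
        self.name = name
        self.description = description
        self.tag_keys = tuple(tag_keys)
        self._values = {}

    def set(self, value, tags):
        # Every declared tag key must be supplied with each reading.
        missing = set(self.tag_keys) - set(tags)
        if missing:
            raise ValueError(f"missing tags: {sorted(missing)}")
        key = tuple(tags[k] for k in self.tag_keys)
        self._values[key] = value

    def get(self, tags):
        return self._values[tuple(tags[k] for k in self.tag_keys)]
```

Setting the gauge with `{"dataset": "a"}` and `{"dataset": "b"}` keeps two independent series, which matches the description "Bytes spilled by dataset operators".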
33dbff8 to ab0d26c
LGTM. Only some final small comments.
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
0e018c6 to eed14ca
Config changes for #40118 Creates a ray data dashboard. --------- Signed-off-by: Andrew Xue <andrewxue@anyscale.com> Signed-off-by: Alan Guo <aguo@anyscale.com> Co-authored-by: Alan Guo <aguo@anyscale.com>
Why are these changes needed?
Stores dataset metrics in _StatsActor. These stats will be emitted to Prometheus and used to create the Ray Data Dashboard. Stats will be collected by StreamingExecutors after each _scheduling_loop_step. They will be displayed in the "Ray Data Metrics" section under the Metrics tab in the Ray dashboard.
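The collection flow described above can be sketched roughly as below. This is a plain-Python illustration under stated assumptions, not the PR's implementation: `StatsStoreSketch` stands in for the stats actor, `run_executor` stands in for the executor's scheduling loop, and the per-step spill amounts are made-up input.

```python
class StatsStoreSketch:
    """Stand-in for the stats actor: keeps the latest stats per dataset."""

    def __init__(self):
        self.latest = {}
        self.num_updates = 0

    def update(self, dataset_tag, stats):
        self.latest[dataset_tag] = dict(stats)
        self.num_updates += 1


def run_executor(dataset_tag, spilled_per_step, store):
    """Push cumulative stats to the store after every scheduling step."""
    total_spilled = 0
    for spilled in spilled_per_step:
        # Stand-in for the work done in one scheduling loop step.
        total_spilled += spilled
        store.update(dataset_tag, {"bytes_spilled": total_spilled})
    return total_spilled
```

Pushing after every step keeps the stored values current while the dataset is still executing, which is what a live dashboard needs.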