Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data] Additional Ray Data Dashboard Metrics #43628

Merged
merged 22 commits into from Mar 6, 2024

Conversation

scottjlee
Copy link
Contributor

@scottjlee scottjlee commented Mar 1, 2024

Why are these changes needed?

  • Adds remaining metrics from OpRuntimeMetrics class in new time series charts on the Grafana and Ray Data dashboards.

  • Clean up the OpRuntimeMetrics and StatsActor code, grouping related metrics by area and consolidating descriptions and comments.

  • Visually group each section of Ray Data metrics. See below for screenshots of each section.
    - Programmatically generate Grafana panels from OpRuntimeMetrics fields. this is currently not possible, since we would need to add ray data as a dependency for ray dashboards / serve.

  • Overview:
    overview

  • Inputs:
    inputs

  • Outputs:
    outputs

  • Tasks:
    tasks

  • Object Store Memory:
    object_store_memory

  • Iteration:
    iteration

Related issue number

Closes #42437

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
@scottjlee scottjlee marked this pull request as ready for review March 4, 2024 18:06
@scottjlee scottjlee changed the title [Data] Additional Ray Dashboard Metrics [Data] Additional Ray Data Dashboard Metrics Mar 4, 2024
@scottjlee
Copy link
Contributor Author

premerge failures look unrelated to this PR.

Copy link
Contributor

@raulchen raulchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by the way, do we already have OpState.outqueue.memory_usage()?

python/ray/data/_internal/stats.py Outdated Show resolved Hide resolved
@scottjlee scottjlee added the release-blocker P0 Issue that blocks the release label Mar 4, 2024
Copy link
Contributor

@omatthew98 omatthew98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few micro nits, but lgtm!

python/ray/data/_internal/stats.py Outdated Show resolved Hide resolved
@@ -208,10 +208,111 @@ const DATA_METRICS_CONFIG: MetricsSectionConfig[] = [
title: "Rows Outputted",
pathParams: "orgId=1&theme=light&panelId=11",
},
// Inputs-related metrics
{
title: "Input Blocks Received by Operator",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I assume the ordering of display is based on panelId? Should these be ordered by that for ~organization or is the order here important?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

panel id is actually unrelated to the order, the order directly follows the order of elements in this .tsx file. the only restriction is that panel id needs to be unique and matches the id from grafana panel

@@ -119,6 +119,330 @@
fill=0,
stack=False,
),
# Inputs-related metrics
Panel(
id=17,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Similar to above, should we order these by panel id?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

panel id doesn't need to be in increasing order. i found that eventually when we need to insert new metrics/panels, it breaks all of the id's anyways, so either we will need to always continue incrementing id's, or we can just make sure they are all unique id's (this is enforced by the dashboard code already)

python/ray/data/tests/test_stats.py Outdated Show resolved Hide resolved
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
@scottjlee scottjlee requested a review from raulchen March 5, 2024 18:05
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
num_task_inputs_processed: int = field(
default=0,
metadata={
"description": "Number of input blocks processed by tasks.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe mention "finished processing" to make it more clear.

Suggested change
"description": "Number of input blocks processed by tasks.",
"description": "Number of input blocks that operator tasks has finished processing.

bytes_task_inputs_processed: int = field(
default=0,
metadata={
"description": "Byte size of blocks processed by tasks.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

metadata={
"description": (
"Number of rows in generated output blocks "
"that are from finished tasks."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the original comment is wrong. it's not only from finished tasks. should just be Number of rows generated by tasks.

"metrics_group": "outputs",
},
)
num_outputs_of_finished_tasks: int = field(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok to not expose this metric and the next. they are used to compute another property.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed from the grafana and ray data dashboards.

num_tasks_have_outputs: int = field(
default=0,
metadata={
"description": "Number of tasks with at least one output block.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"description": "Number of tasks with at least one output block.",
"description": "Number of tasks that already have output.",

@@ -133,14 +282,12 @@ def extra_metrics(self) -> Dict[str, Any]:
"""Return a dict of extra metrics."""
return self._extra_metrics

def as_dict(self, metrics_only: bool = False):
def as_dict(self):
"""Return a dict representation of the metrics."""
result = []
for f in fields(self):
if f.metadata.get("export", True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this "export" seems not being used any more

@@ -253,25 +398,42 @@ def average_bytes_change_per_task(self) -> Optional[float]:

return self.average_bytes_outputs_per_task - self.average_bytes_inputs_per_task

@property
def estimated_object_store_usage(self) -> Optional[float]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

object store usage of an op is actually calculated as obj_store_mem_pending_task_outputs + obj_store_mem_internal_outqueue + OpState.outqueue.memory_usage + sum(next_op.obj_store_mem_pending_task_inputs + next_op.obj_store_mem_internal_inqueue).
Currently it's calculated in ResourceManager here. because OpState isn't accessible here.
It'd be also useful to report this to the dashboard.

Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
@scottjlee scottjlee requested a review from raulchen March 6, 2024 04:31
if op:
execution_resources = self._resource_manager._op_usages[op]
op_object_store_memory = execution_resources.object_store_memory
op._metrics.obj_store_mem_used = op_object_store_memory
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, let's update this in ResourceManager.update_resources.
so we won't forgot calling this method.

Signed-off-by: Scott Lee <sjl@anyscale.com>
@raulchen raulchen merged commit 52455e5 into ray-project:master Mar 6, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-blocker P0 Issue that blocks the release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[data] adding more info to data dashboard
4 participants