[Data] Add Runtime Metrics String #43790

Merged
merged 2 commits into ray-project:master on Mar 8, 2024

Conversation

@omatthew98 (Contributor) commented on Mar 7, 2024

Why are these changes needed?

This adds an additional set of runtime metrics to the printed dataset stats to help identify bottlenecks in Ray Data code. For example, given the following code:

import datasets  # Hugging Face datasets library
import ray

def f(row):
    row["length"] = len(row["text"])
    return row

def g(row):
    return row["length"] > 100

hf_ds = datasets.load_dataset("tweet_eval", "emotion")
ds = ray.data.from_huggingface(hf_ds["train"])
ds = ds.map(f).sort("length").filter(g).materialize()

print(ds.stats())

the following runtime metrics would be printed:

Runtime Metrics:
* ReadParquet->SplitBlocks(24): 1.42s (80.860%)
* Map(f): 290.89ms (16.539%)
* Sort: 0us (0.000%)
* Filter(g): 39.68ms (2.256%)
* Scheduling: 166.04ms (9.440%)
* Total: 1.76s (100%)
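
For intuition, here is a minimal sketch (illustrative only, not Ray Data's actual implementation) of how such a breakdown can be assembled from per-operator wall times. Because operators can run concurrently, per-operator shares of the end-to-end time need not sum to 100%.

# Illustrative sketch; timings are taken from the example output above.
operator_times_s = {
    "ReadParquet->SplitBlocks(24)": 1.42,
    "Map(f)": 0.29089,
    "Sort": 0.0,
    "Filter(g)": 0.03968,
    "Scheduling": 0.16604,
}
total_s = 1.76  # end-to-end wall time of the whole execution

print("Runtime Metrics:")
for name, t in operator_times_s.items():
    # Each operator's share is measured against the end-to-end time, so
    # overlapping operators can make the shares sum to more than 100%.
    print(f"* {name}: {t:.2f}s ({t / total_s:.3%})")
print(f"* Total: {total_s:.2f}s (100%)")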

Unrelated to the main changes, I noticed that I was using the wrong computation for the total_wall_time used in the dataset throughput, so I fixed that. Fixing it unearthed a bug in DatasetStatsSummary.get_total_wall_time, which I also fixed.
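
As a toy illustration of why the wall-time definition matters for a throughput number (an assumed scenario, not the exact bug fixed here): summing per-operator times double counts time when operators overlap, whereas the end-to-end elapsed time does not.

# Hypothetical numbers; num_rows is made up for illustration.
num_rows = 10_000
per_operator_time_s = [1.42, 0.29089, 0.0, 0.03968]  # per-operator wall times from above
summed_time_s = sum(per_operator_time_s)              # overstates elapsed time under overlap
end_to_end_time_s = 1.76                              # elapsed time of the whole run

print(f"rows/s vs. summed operator time: {num_rows / summed_time_s:.1f}")
print(f"rows/s vs. end-to-end wall time: {num_rows / end_to_end_time_s:.1f}")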

Related issue number

Closes #42804

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <mowen@anyscale.com>
@omatthew98 added the release-blocker (P0 Issue that blocks the release) label on Mar 7, 2024
@omatthew98 marked this pull request as ready for review on March 7, 2024 at 22:58
@c21 (Contributor) left a comment:

LGTM

Re the example

Runtime Metrics:
* ReadParquet->SplitBlocks(24): 1.42s (80.860%)
* Map(f): 290.89ms (16.539%)
* Sort: 0us (0.000%)
* Filter(g): 39.68ms (2.256%)
* Scheduling: 166.04ms (9.440%)
* Total: 1.76s (100%)

Any reason why Sort took 0us in metrics here?

@omatthew98 (Contributor, Author) replied:

> Any reason why Sort took 0us in metrics here?

Hmm, I think it might be a quirk of Sort having two sub-operators. From a little pdb investigation, it seems that the SortMap and SortReduce OperatorStatsSummary objects have their time_total_s set to 0 after execution. I think I have a potential fix for this and will put it in a separate PR.
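
For illustration, a minimal sketch (hypothetical stand-in classes, not Ray Data's actual code) of the general shape of such a fix: compute time_total_s once at the top level of a from_block_metadata-style constructor, so operators composed of sub-operators don't end up reporting 0.

from dataclasses import dataclass
from typing import List

@dataclass
class BlockMetadata:
    # Hypothetical stand-in for per-block execution stats.
    wall_time_s: float

@dataclass
class OperatorStatsSummary:
    name: str
    time_total_s: float

    @classmethod
    def from_block_metadata(cls, name: str, blocks: List[BlockMetadata]) -> "OperatorStatsSummary":
        # Aggregate once here, at the top level, rather than inside
        # per-sub-operator branches where it can be skipped or overwritten.
        return cls(name=name, time_total_s=sum(b.wall_time_s for b in blocks))

# Usage: an operator composed of sub-operator blocks still reports a
# nonzero total.
summary = OperatorStatsSummary.from_block_metadata(
    "Sort", [BlockMetadata(0.5), BlockMetadata(0.25)]
)
print(summary.time_total_s)  # 0.75 (illustrative)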

@c21 merged commit 76a3464 into ray-project:master on Mar 8, 2024
9 checks passed
raulchen pushed a commit that referenced this pull request on Mar 8, 2024
#43790 unearthed a bug in the calculation of time_total_s for sub-operators; this fixes that bug by moving the time calculation to the top level of the from_block_metadata function.

---------

Signed-off-by: Matthew Owen <mowen@anyscale.com>
Labels
release-blocker P0 Issue that blocks the release

Linked issues
[Data] Add time breakdowns summary for DatasetStats