[data] Test core performance metrics #40757
Conversation
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Nice improvement. I left a few small comments.
python/ray/data/tests/conftest.py
# Wait for a task to finish to prevent a race condition where not all of
# the task metrics have been collected yet.
if expected_metrics.get_task_count() is not None:
    ref = barrier.remote()
I guess this doesn't strictly guarantee that all previous tasks' metrics have been collected, since task metrics are reported from multiple nodes. Is this an issue?
Instead of inserting the barrier, maybe it's more robust to use wait_for_condition to wait for the assert conditions to become true.
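A minimal sketch of the polling approach being suggested here (a generic `wait_for_condition` helper in the spirit of Ray's test utilities; the names and signature below are illustrative, not the actual Ray API):

```python
import time

def wait_for_condition(condition, timeout=10.0, retry_interval=0.1):
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    last_error = None
    while time.monotonic() < deadline:
        try:
            if condition():
                return
        except AssertionError as e:  # allow assert-style predicates
            last_error = e
        time.sleep(retry_interval)
    raise TimeoutError(f"Condition not met within {timeout}s: {last_error}")

# Example: wait until a (simulated) metrics counter reaches the expected value.
metrics = {"tasks_finished": 0}
for _ in range(3):
    metrics["tasks_finished"] += 1

wait_for_condition(lambda: metrics["tasks_finished"] >= 3)
```

Polling like this tolerates metrics arriving late from remote nodes, at the cost of only being able to express positive ("eventually true") conditions.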
Sometimes wait_for_condition is not enough, because we also need to check negative conditions (e.g., that no tasks executed).
But yes, right now there is a possible race condition here. Need to check whether this will be an issue in CI or not.
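One way to make a negative check (e.g., "no tasks executed") less racy is to wait until the metric reading stabilizes across several consecutive polls before asserting on it. This is a hypothetical sketch with a stand-in metrics source, not code from this PR:

```python
import time

def read_stable_value(read_metric, polls=3, interval=0.05):
    """Return the metric value once `polls` consecutive reads agree.

    Repeated identical reads suggest the collector has caught up, which
    makes a subsequent negative assertion (e.g. count == 0) less likely
    to race with late-arriving reports.
    """
    last = read_metric()
    agree = 1
    while agree < polls:
        time.sleep(interval)
        cur = read_metric()
        if cur == last:
            agree += 1
        else:
            last, agree = cur, 1
    return last

# Stand-in metrics source: the task count is always 0 here.
task_count = read_stable_value(lambda: 0)
assert task_count == 0  # negative condition: no tasks executed
```

This still isn't a strict guarantee (a report could arrive after the reads stabilize), but it trades the fixed barrier for an adaptive quiescence check.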
class CoreExecutionMetrics:
    def __init__(self, task_count=None, object_store_stats=None, actor_count=None):
        self.task_count = task_count
Nit: can we make these variables default to empty dicts, so we don't need those None checks?
This is a bad idea in Python: https://stackoverflow.com/questions/26320899/why-is-the-empty-dictionary-a-dangerous-default-value-in-python
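The linked pitfall in a nutshell: a default `{}` is created once, at function definition time, and shared across every call. A minimal illustration, alongside the `None`-sentinel idiom the current code uses (function names here are made up for the example):

```python
def bad_append(item, bucket={}):  # default dict created once, shared by all calls
    bucket[item] = True
    return bucket

def good_append(item, bucket=None):  # sentinel: a fresh dict per call
    if bucket is None:
        bucket = {}
    bucket[item] = True
    return bucket

a = bad_append("x")
b = bad_append("y")
assert a is b and a == {"x": True, "y": True}  # state leaked between calls

c = good_append("x")
d = good_append("y")
assert c == {"x": True} and d == {"y": True}  # independent dicts
```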
)
total_bytes_expected = num_blocks_expected * block_size_expected

print(f"Expecting {total_bytes_expected} bytes, {num_blocks_expected} blocks")
logger.debug? Seems like it could be verbose.
I think it's fine since this is only used in testing and it's useful to see the output immediately if the test fails.
Adds performance testing utilities to assert on Ray core metrics for tasks submitted and objects created. Also adds tests for these metrics on some key operations:
- map with dynamic block splitting
- .limit/.take
- .schema

This acts as a regression test for (at least) the following issues:
- [data] Dataset.schema() may get recomputed each time ray-project#37077
- [data] .limit() does not truncate execution as expected ray-project#37858
- [data] read_images().take(1) is very slow on S3 / pushdown limit() into individual tasks ray-project#38023

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Stacked on #40757. Computes the block size for each operation before applying other optimizer rules that depend on it (SplitReadOutputBlocksRule). This also simplifies block sizing: we always propagate an op's target block size to all upstream ops, until we find an op that has a different block size set. Closes #41018.

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
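A rough sketch of the propagation rule described above, using a hypothetical op-node structure (this is illustrative, not the actual Ray Data optimizer code): walk upstream from each op, carrying its target block size until an op that sets its own is found, at which point that op's size propagates further.

```python
class Op:
    """Toy operator node: target_block_size=None means 'inherit from downstream'."""
    def __init__(self, name, target_block_size=None, upstream=None):
        self.name = name
        self.target_block_size = target_block_size
        self.upstream = upstream or []

def propagate_block_size(op):
    """Push op's target block size to all upstream ops, stopping at (and
    switching to) any op that has its own block size set."""
    for up in op.upstream:
        if up.target_block_size is None:
            up.target_block_size = op.target_block_size
        # Recurse either way: an op with its own size propagates that size.
        propagate_block_size(up)

# Chain read <- map <- limit: limit's size flows all the way upstream.
read = Op("read")
map_op = Op("map", upstream=[read])
limit = Op("limit", target_block_size=128, upstream=[map_op])
propagate_block_size(limit)
assert map_op.target_block_size == 128 and read.target_block_size == 128

# An op with its own size blocks inheritance and propagates its own value.
read2 = Op("read")
shuffle = Op("shuffle", target_block_size=64, upstream=[read2])
limit2 = Op("limit", target_block_size=128, upstream=[shuffle])
propagate_block_size(limit2)
assert shuffle.target_block_size == 64  # keeps its own setting
assert read2.target_block_size == 64    # inherits from shuffle, not limit
```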
TODO: Also test #38400. I had started this but ran into #41018.
Checks
- I've signed off every commit (using `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.