[data] Test core performance metrics #40757

Merged
merged 26 commits into ray-project:master from data-metrics-testing
Nov 21, 2023

Contributor

stephanie-wang commented Oct 27, 2023

Why are these changes needed?

Adds performance testing utilities that assert on Ray core metrics for tasks submitted and objects created. Also adds tests for these metrics on some key operations:

  • map with dynamic block splitting
  • .limit/.take
  • .schema
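As a rough illustration of the kind of check described above, the following sketch compares observed per-task counts against expectations. It does not reproduce the utilities added in this PR: `assert_task_counts`, the task names, and the convention that an expected value may be an exact integer or a predicate are all assumptions made for the example.

```python
from collections import Counter


def assert_task_counts(actual, expected):
    """Compare observed task counts against expectations.

    `expected` maps a task name to an exact int or to a predicate that
    receives the observed count. Task names missing from `expected`
    must not have run at all. (Hypothetical helper, not the PR's API.)
    """
    for name, want in expected.items():
        got = actual.get(name, 0)
        ok = want(got) if callable(want) else got == want
        assert ok, f"{name}: observed {got} tasks, expectation not met"
    # Any task that ran without being listed is treated as a regression.
    unexpected = {n: c for n, c in actual.items() if n not in expected and c > 0}
    assert not unexpected, f"unexpected tasks ran: {unexpected}"


# Hypothetical observations: one read task and four map tasks.
observed = Counter({"ReadRange": 1, "Map": 4})
assert_task_counts(observed, {"ReadRange": 1, "Map": lambda n: n <= 5})
```

Allowing a predicate next to exact values is one way to express bounds such as "at most 5 tasks", which suits operations whose task count depends on dynamic block splitting.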

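This acts as a regression test for (at least) the following issues:

  • [data] Dataset.schema() may get recomputed each time #37077
  • [data] .limit() does not truncate execution as expected #37858
  • [data] read_images().take(1) is very slow on S3 / pushdown limit() into individual tasks #38023

TODO: Also test #38400. I had started this but ran into #41018.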

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
stephanie-wang changed the title from "[WIP][data] Test core performance metrics" to "[data] Test core performance metrics" on Nov 8, 2023
Contributor

raulchen left a comment

Nice improvement. I left a few small comments.

python/ray/data/tests/conftest.py (outdated)
# Wait for a task to finish to prevent a race condition where not all of
# the task metrics have been collected yet.
if expected_metrics.get_task_count() is not None:
    ref = barrier.remote()
Contributor

I guess this doesn't strictly guarantee that all previous tasks' metrics are collected, since task metrics are reported from multiple nodes. Is this an issue?

Contributor

Instead of inserting the barrier, maybe it's more robust to use wait_for_condition to wait for the assert conditions to become true.

Contributor Author

Sometimes wait_for_condition is not enough because we also need to check negative conditions (like no tasks executed).

Contributor Author

But yes, right now there is a possible race condition here. I need to check whether this will be an issue in CI.
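The limitation raised in this thread can be illustrated with a small self-contained sketch. The `wait_for_condition` below is a simplified stand-in written for this example, not Ray's helper of the same name, and `DelayedMetrics` is a hypothetical backend whose reports arrive late. Polling works for a positive condition because it eventually becomes true, but a negative condition such as "no tasks executed" is already true before any report arrives, so polling returns immediately and proves nothing.

```python
import time


def wait_for_condition(predicate, timeout_s=2.0, retry_interval_s=0.01):
    """Poll `predicate` until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(retry_interval_s)
    return False


class DelayedMetrics:
    """Hypothetical metrics source: counts become visible after a delay."""

    def __init__(self, final_count, delay_s):
        self._final_count = final_count
        self._visible_at = time.monotonic() + delay_s

    def task_count(self):
        if time.monotonic() >= self._visible_at:
            return self._final_count
        return 0


metrics = DelayedMetrics(final_count=3, delay_s=0.2)

# Negative condition: satisfied on the first poll even though three
# tasks really ran, because their metrics have not been reported yet.
premature = wait_for_condition(lambda: metrics.task_count() == 0)

# Positive condition: polling converges once the reports arrive.
converged = wait_for_condition(lambda: metrics.task_count() == 3)
```

In this sketch `premature` is a false pass, which is why a negative check needs some other signal that collection has finished, such as the barrier task discussed above.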


class CoreExecutionMetrics:
    def __init__(self, task_count=None, object_store_stats=None, actor_count=None):
        self.task_count = task_count
Contributor

Nit: can we make these variables default to empty dicts, so we don't need those None checks?
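One way to read this suggestion is sketched below. It is an illustration only and not necessarily what was merged: the attribute names other than `task_count` are inferred from the constructor signature quoted above, and `get_task_count` is assumed to simply return the stored value. The sketch keeps `None` in the signature and normalizes it inside `__init__`, because a literal `{}` default argument would be shared across instances.

```python
class CoreExecutionMetrics:
    def __init__(self, task_count=None, object_store_stats=None, actor_count=None):
        # Normalize None to a fresh dict per instance so callers can
        # iterate or index without a None check.
        self.task_count = task_count if task_count is not None else {}
        self.object_store_stats = (
            object_store_stats if object_store_stats is not None else {}
        )
        self.actor_count = actor_count if actor_count is not None else {}

    def get_task_count(self):
        return self.task_count


a = CoreExecutionMetrics()
b = CoreExecutionMetrics()
a.task_count["Map"] = 1  # does not leak into b
```

A trade-off worth noting: if callers rely on `None` to mean "do not check this metric", an empty dict can no longer be told apart from "expect nothing", so the two cases would need another way to be distinguished.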

Contributor Author

python/ray/data/tests/conftest.py (outdated)
python/ray/data/tests/test_block_sizing.py (outdated)
python/ray/data/tests/test_block_sizing.py (outdated)
)
total_bytes_expected = num_blocks_expected * block_size_expected

print(f"Expecting {total_bytes_expected} bytes, {num_blocks_expected} blocks")
Contributor

logger.debug? This seems like it could be verbose.

Contributor Author

I think it's fine since this is only used in testing and it's useful to see the output immediately if the test fails.

stephanie-wang merged commit 29e4a7d into ray-project:master Nov 21, 2023
18 checks passed
stephanie-wang deleted the data-metrics-testing branch November 21, 2023 17:11
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Nov 29, 2023
Adds performance testing utilities that assert on Ray core metrics for tasks submitted and objects created. Also adds tests for these metrics on some key operations:

    map with dynamic block splitting
    .limit/.take
    .schema

This acts as a regression test for (at least) the following issues:

[data] Dataset.schema() may get recomputed each time ray-project#37077
[data] .limit() does not truncate execution as expected ray-project#37858
[data] read_images().take(1) is very slow on S3 / pushdown limit() into individual tasks ray-project#38023

---------

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
stephanie-wang added a commit that referenced this pull request Nov 29, 2023
Stacked on #40757

Compute the block size for each operation before applying other optimizer rules that depend on it (SplitReadOutputBlocksRule). This also simplifies the block sizing, so we always propagate an op's target block size to all upstream ops, until we find an op that has a different block size set.
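The propagation rule in that description can be sketched on a linear chain of operators. This is a minimal illustration under stated assumptions, not the optimizer code from the follow-up PR: `Op`, `propagate_block_sizes`, the operator names, and the sizes are all hypothetical, and `None` is assumed to mean "no block size set explicitly".

```python
from dataclasses import dataclass
from typing import List, Optional

MiB = 1024 * 1024


@dataclass
class Op:
    name: str
    target_block_size: Optional[int] = None  # None: not set explicitly


def propagate_block_sizes(chain: List[Op]) -> List[Op]:
    """Fill in block sizes for a chain ordered upstream -> downstream.

    Walking from the last op backwards, each op without its own size
    inherits the nearest downstream size. An op that has its own size
    keeps it and becomes the value inherited by ops further upstream.
    """
    inherited = None
    for op in reversed(chain):
        if op.target_block_size is not None:
            inherited = op.target_block_size
        else:
            op.target_block_size = inherited
    return chain


chain = [
    Op("Read"),
    Op("Map"),
    Op("Shuffle", 1024 * MiB),
    Op("Map2"),
    Op("Write", 128 * MiB),
]
propagate_block_sizes(chain)
sizes_mib = [op.target_block_size // MiB for op in chain]
# sizes_mib == [1024, 1024, 1024, 128, 128]
```

Computing these sizes first means any later rule that depends on block size sees a fully populated chain.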
Related issue number

Closes #41018.

---------

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
4 participants