[data] Test core performance metrics #40757
Conversation
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Nice improvement. I left a few small comments.
python/ray/data/tests/conftest.py
# Wait for a task to finish to prevent a race condition where not all of
# the task metrics have been collected yet.
if expected_metrics.get_task_count() is not None:
    ref = barrier.remote()
I guess this doesn't strictly guarantee that all previous tasks' metrics have been collected, since task metrics are reported from multiple nodes. Is this an issue?
Instead of inserting the barrier, maybe it's more robust to use wait_for_condition to wait for the assert conditions to become true.
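A minimal sketch of the polling approach being suggested here (a generic `wait_for_condition` helper in the spirit of Ray's test utilities; the names and signature below are illustrative, not the actual Ray API):

```python
import time

def wait_for_condition(condition, timeout=10.0, retry_interval=0.1):
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    last_error = None
    while time.monotonic() < deadline:
        try:
            if condition():
                return
        except AssertionError as e:  # allow assert-style predicates
            last_error = e
        time.sleep(retry_interval)
    raise TimeoutError(f"Condition not met within {timeout}s: {last_error}")

# Example: wait until a (simulated) metrics counter reaches the expected value.
metrics = {"tasks_finished": 0}
for _ in range(3):
    metrics["tasks_finished"] += 1

wait_for_condition(lambda: metrics["tasks_finished"] >= 3)
```

Polling like this tolerates metrics arriving late from remote nodes, at the cost of only being able to express positive ("eventually true") conditions.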
Sometimes wait_for_condition is not enough, because we also need to check negative conditions (e.g., that no tasks executed).
But yes, right now there is a possible race condition here. Need to check whether this will be an issue in CI or not.
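One way to make a negative check (e.g., "no tasks executed") less racy is to wait until the metric reading stabilizes across several consecutive polls before asserting on it. This is a hypothetical sketch with a stand-in metrics source, not code from this PR:

```python
import time

def read_stable_value(read_metric, polls=3, interval=0.05):
    """Return the metric value once `polls` consecutive reads agree.

    Repeated identical reads suggest the collector has caught up, which
    makes a subsequent negative assertion (e.g. count == 0) less likely
    to race with late-arriving reports.
    """
    last = read_metric()
    agree = 1
    while agree < polls:
        time.sleep(interval)
        cur = read_metric()
        if cur == last:
            agree += 1
        else:
            last, agree = cur, 1
    return last

# Stand-in metrics source: the task count is always 0 here.
task_count = read_stable_value(lambda: 0)
assert task_count == 0  # negative condition: no tasks executed
```

This still isn't a strict guarantee (a report could arrive after the reads stabilize), but it trades the fixed barrier for an adaptive quiescence check.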
class CoreExecutionMetrics:
    def __init__(self, task_count=None, object_store_stats=None, actor_count=None):
        self.task_count = task_count
Nit: can we make these variables default to empty dicts, so we don't need those None checks?
This is a bad idea in Python: https://stackoverflow.com/questions/26320899/why-is-the-empty-dictionary-a-dangerous-default-value-in-python
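The linked pitfall in a nutshell: a default `{}` is created once, at function definition time, and shared across every call. A minimal illustration, alongside the `None`-sentinel idiom the current code uses (function names here are made up for the example):

```python
def bad_append(item, bucket={}):  # default dict created once, shared by all calls
    bucket[item] = True
    return bucket

def good_append(item, bucket=None):  # sentinel: a fresh dict per call
    if bucket is None:
        bucket = {}
    bucket[item] = True
    return bucket

a = bad_append("x")
b = bad_append("y")
assert a is b and a == {"x": True, "y": True}  # state leaked between calls

c = good_append("x")
d = good_append("y")
assert c == {"x": True} and d == {"y": True}  # independent dicts
```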
)
total_bytes_expected = num_blocks_expected * block_size_expected

print(f"Expecting {total_bytes_expected} bytes, {num_blocks_expected} blocks")
logger.debug? Seems like it could be verbose.
I think it's fine since this is only used in testing and it's useful to see the output immediately if the test fails.
Adds performance testing utilities to assert on Ray core metrics for tasks submitted and objects created. Also adds tests for these metrics on some key operations:
- map with dynamic block splitting
- .limit/.take
- .schema

This acts as a regression test for (at least) the following issues:
- [data] Dataset.schema() may get recomputed each time ray-project#37077
- [data] .limit() does not truncate execution as expected ray-project#37858
- [data] read_images().take(1) is very slow on S3 / pushdown limit() into individual tasks ray-project#38023

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Stacked on #40757. Computes the block size for each operation before applying other optimizer rules that depend on it (SplitReadOutputBlocksRule). This also simplifies block sizing: we always propagate an op's target block size to all upstream ops, until we find an op that has a different block size set. Closes #41018.

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
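A rough sketch of the propagation rule described above, using a hypothetical op-node structure (this is illustrative, not the actual Ray Data optimizer code): walk upstream from each op, carrying its target block size until an op that sets its own is found, at which point that op's size propagates further.

```python
class Op:
    """Toy operator node: target_block_size=None means 'inherit from downstream'."""
    def __init__(self, name, target_block_size=None, upstream=None):
        self.name = name
        self.target_block_size = target_block_size
        self.upstream = upstream or []

def propagate_block_size(op):
    """Push op's target block size to all upstream ops, stopping at (and
    switching to) any op that has its own block size set."""
    for up in op.upstream:
        if up.target_block_size is None:
            up.target_block_size = op.target_block_size
        # Recurse either way: an op with its own size propagates that size.
        propagate_block_size(up)

# Chain read <- map <- limit: limit's size flows all the way upstream.
read = Op("read")
map_op = Op("map", upstream=[read])
limit = Op("limit", target_block_size=128, upstream=[map_op])
propagate_block_size(limit)
assert map_op.target_block_size == 128 and read.target_block_size == 128

# An op with its own size blocks inheritance and propagates its own value.
read2 = Op("read")
shuffle = Op("shuffle", target_block_size=64, upstream=[read2])
limit2 = Op("limit", target_block_size=128, upstream=[shuffle])
propagate_block_size(limit2)
assert shuffle.target_block_size == 64  # keeps its own setting
assert read2.target_block_size == 64    # inherits from shuffle, not limit
```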
TODO: Also test #38400. I had started this but ran into #41018.
Checks
- I've signed off every commit (using `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.