[Data] Report first block size per Operator
#39656
Conversation
Signed-off-by: Scott Lee <sjl@anyscale.com>
```diff
     _calculate_ref_hits,
     blocks_to_batches,
     collate,
     finalize_batches,
     format_batches,
     resolve_block_refs,
 )
-from ray.data._internal.util import make_async_gen
+from ray.data._internal.util import Queue, make_async_gen
```
fixes breaking test import (unrelated to changes in this PR)
```python
logger.get_logger().info(
    f"{self.op.name} in-memory block size: "
    f"{(self.first_block_size_bytes / 2**20):.2f} MB"
)
```
I'm still unclear how useful this would be. 1) In which case would the block size be larger than `target_max_block_size`? It shouldn't happen unless there is a bug. 2) Printing this as an info may be too verbose, and may cause confusion for users.
@ericl @scottjlee could one of you elaborate on what exact issue this is trying to solve?
@ericl what was the original scenario where this issue appeared? The block splitting should be ensuring we don't get block sizes larger than `target_max_block_size`; are there operators where this is not being handled?
+1 to @raulchen.
This information is useful for when the Ray team is debugging, but if a user sees this message I'm not sure what they are expected to do.
I think this information is more useful to save as part of the persisted dataset stats (and we can even go one step further and track min output block size, average output block size, max output block size). This way users can easily send logs for debugging.
But I don't think this is helpful to log on stdout.
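A minimal sketch of the kind of per-operator block-size accumulator described above — tracking min/avg/max so the numbers could be persisted with dataset stats rather than printed to stdout. The class name and API here are hypothetical, not the actual Ray Data stats interface:

```python
class BlockSizeStats:
    """Hypothetical accumulator for one operator's output block sizes."""

    def __init__(self):
        self.count = 0
        self.total_bytes = 0
        self.min_bytes = None
        self.max_bytes = None

    def record(self, size_bytes: int) -> None:
        # Update running min/max/total for each produced block.
        self.count += 1
        self.total_bytes += size_bytes
        if self.min_bytes is None or size_bytes < self.min_bytes:
            self.min_bytes = size_bytes
        if self.max_bytes is None or size_bytes > self.max_bytes:
            self.max_bytes = size_bytes

    @property
    def avg_bytes(self) -> float:
        return self.total_bytes / self.count if self.count else 0.0


stats = BlockSizeStats()
for size in [10 * 2**20, 30 * 2**20, 20 * 2**20]:  # 10/30/20 MiB blocks
    stats.record(size)
```

These three numbers could then be attached to the persisted dataset stats for each operator, so users can send logs for debugging without extra stdout noise.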
```python
f"{self.op.name} in-memory block size of "
f"{(self.first_block_size_bytes / 2**20):.2f} MB is significantly "
f"larger than the maximum target block size of "
f"{(target_max_block_size / 2**20):.2f} MB."
```
if a user sees this warning message, what are they expected to do?
@amogkam @raulchen - discussed with @ericl; ideally we would want the user to look at the warning and realize they should try adjusting parallelism in order to reduce block size. We can add a paragraph on the Performance Tips page describing this -- I think we discuss it in the "Tuning read parallelism" section, but there's no clear course of action. Does that make sense from the user's perspective?
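The back-of-the-envelope arithmetic behind the "increase parallelism" advice: for a fixed dataset size, per-block size scales roughly inversely with the number of output blocks. This is a simplification (real block sizes also depend on row boundaries and splitting), and the 128 MiB target below is an assumed value, not a quoted default:

```python
def approx_block_size_mb(total_bytes: int, parallelism: int) -> float:
    """Rough per-block size if total_bytes is split evenly across blocks."""
    return total_bytes / parallelism / 2**20

total = 4 * 2**30  # a hypothetical 4 GiB dataset

# With parallelism 8, each block is ~512 MiB -- far above a 128 MiB target.
# Raising parallelism to 32 brings blocks down to ~128 MiB.
low = approx_block_size_mb(total, 8)    # 512.0
high = approx_block_size_mb(total, 32)  # 128.0
```

This is the mental model a Performance Tips paragraph could spell out: if the warning fires, quadrupling parallelism cuts block size to a quarter.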
…out and only ray-data.log Signed-off-by: Scott Lee <sjl@anyscale.com>
I'm not sure. One reason to add these logs is to surface cases where block
splitting didn't work for whatever reason.
In python/ray/data/_internal/execution/streaming_executor_state.py:

```diff
+ target_max_block_size = DataContext.get_current().target_max_block_size
+
+ if self.first_block_size_bytes > (
+     BLOCK_SIZE_TO_MAX_TARGET_RATIO * target_max_block_size
+ ):
+     logger.get_logger().warning(
+         f"{self.op.name} in-memory block size of "
+         f"{(self.first_block_size_bytes / 2**20):.2f} MB is significantly "
+         f"larger than the maximum target block size of "
+         f"{(target_max_block_size / 2**20):.2f} MB."
+     )
+ else:
+     logger.get_logger().info(
+         f"{self.op.name} in-memory block size: "
+         f"{(self.first_block_size_bytes / 2**20):.2f} MB"
+     )
```
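Stripped of logging, the check above is a single ratio comparison. A standalone sketch, assuming a ratio of 2.0 (matching the "2x" mentioned in the PR description; the actual constant's value is not shown in this excerpt) and a 128 MiB target:

```python
BLOCK_SIZE_TO_MAX_TARGET_RATIO = 2.0  # assumed value, mirroring the diff

def classify_first_block(first_block_size_bytes: int,
                         target_max_block_size: int) -> str:
    """Return the log level the diff above would use for this block size."""
    if first_block_size_bytes > (
        BLOCK_SIZE_TO_MAX_TARGET_RATIO * target_max_block_size
    ):
        return "warning"
    return "info"

target = 128 * 2**20  # assumed 128 MiB target block size
classify_first_block(100 * 2**20, target)  # under 2x target -> "info"
classify_first_block(300 * 2**20, target)  # over 2x target -> "warning"
```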
It sounds like from this discussion, we want to add this as an "insurance policy": users really shouldn't be running into this issue, but in case they do, we log a warning with a suggestion to increase parallelism. Since we only log the warning when the block size exceeds the configured target size (which should rarely happen), and we always log the info to the data-specific log file, users wouldn't be getting spammed, so there's little downside to including this. Later, we can add additional statistics like min/avg/max block size to the dashboard, like Amog suggested. @amogkam @raulchen
Yes, and thinking about this more, another scenario is if the user accidentally returns a single humongous row or something like this, which isn't an uncommon error when working with tensor data. Even just playing around with a couple of examples, I found a bug where
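To make the "humongous row" scenario concrete: block splitting works on row boundaries, so a single row that is itself larger than the target can never be split below it. A rough size calculation (the 128 MiB target and the tensor shape are illustrative assumptions):

```python
def tensor_row_bytes(shape, itemsize=4):
    """In-memory size of one dense float32 tensor row (ignores ndarray overhead)."""
    n = 1
    for dim in shape:
        n *= dim
    return n * itemsize

target_max_block_size = 128 * 2**20  # assumed 128 MiB target

# One 2048 x 2048 x 32 float32 tensor returned as a single row:
row = tensor_row_bytes((2048, 2048, 32))  # 512 MiB -- 4x the target
oversized = row > target_max_block_size   # True: warning would fire
```

Since the block holding this row can't be made any smaller, warning is the only available response here; increasing parallelism wouldn't help.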
Since this is for devs debugging perf, only logging to the data logs makes more sense.
@raulchen Do you suggest we move both the log and the warning to the data log only? I think showing the warning in stdout still makes sense, since this is a pretty large issue that users should be aware of without having to look at data-specific logs.
Keeping the warning in stdout is fine, as long as we make the message clear. I don't think we should suggest users increase the parallelism, because this issue can only happen when (1) a single row is bigger than the target block size, or (2) there is a bug in Ray Data.
I think this has to be a high severity level if we think it indicates a bug. We might even raise an exception in the future if this happens. However, I don't think we can raise an exception today, since UDFs can return large blocks in map_batches and these aren't split. The best we can do is warn about this.
Signed-off-by: Scott Lee <sjl@anyscale.com>
```diff
@@ -684,6 +687,56 @@ def test_execution_allowed_nothrottle():
     )


+def test__check_first_block_size():
```
is it a typo to have 2 underscores?
The function name is `_check_first_block_size`, so I was following the usual format of `test_<fn_name>`. Let me know if you want me to keep it as 1 underscore.
Will wait for #40173 to be merged, then we can also emit better warnings based on the block size metrics available per Operator.
Original issue closed, no longer needed.
Why are these changes needed?
For each `Operator`, logs an info message with the in-memory size of the first block that is produced. If the size of this block greatly exceeds the target max block size (currently configured at 2x), logs a warning message.
Related issue number
Closes #39647
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a new method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.