
[Data] Report first block size per Operator #39656

Closed
wants to merge 13 commits

Conversation

scottjlee (Contributor)

Why are these changes needed?

For each Operator, log an info message with the in-memory size of the first block it produces. If this size greatly exceeds the target max block size (currently configured at 2x), log a warning message.
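The check this PR adds can be sketched in plain Python. This is a hypothetical sketch: the function name `check_first_block_size`, the 128 MiB target, and the standalone `logging` setup are illustrative assumptions, not the actual Ray Data internals (which live on the operator and use `DataContext`).

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ray.data")

# Assumption: a 128 MiB target max block size, matching Ray Data's
# commonly cited default; the real value comes from DataContext.
TARGET_MAX_BLOCK_SIZE = 128 * 1024 * 1024
WARN_FACTOR = 2  # warn when the first block is > 2x the target


def check_first_block_size(op_name: str, first_block_size_bytes: int) -> bool:
    """Log the first block's in-memory size; return True if a warning fired."""
    logger.info(
        f"{op_name} in-memory block size: "
        f"{first_block_size_bytes / 2**20:.2f} MB"
    )
    if first_block_size_bytes > WARN_FACTOR * TARGET_MAX_BLOCK_SIZE:
        logger.warning(
            f"{op_name} in-memory block size of "
            f"{first_block_size_bytes / 2**20:.2f} MB is significantly "
            f"larger than the maximum target block size of "
            f"{TARGET_MAX_BLOCK_SIZE / 2**20:.2f} MB."
        )
        return True
    return False
```

For example, a 64 MiB first block only produces the info line, while a 512 MiB block (above the 256 MiB threshold) also triggers the warning.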

Related issue number

Closes #39647

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Scott Lee <sjl@anyscale.com>
```diff
     _calculate_ref_hits,
     blocks_to_batches,
     collate,
     finalize_batches,
     format_batches,
     resolve_block_refs,
 )
-from ray.data._internal.util import make_async_gen
+from ray.data._internal.util import Queue, make_async_gen
```
scottjlee (Contributor Author) commented Sep 14, 2023:

fixes breaking test import (unrelated to changes in this PR)

@scottjlee scottjlee marked this pull request as ready for review September 14, 2023 16:50
@scottjlee scottjlee added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Sep 14, 2023
```python
logger.get_logger().info(
    f"{self.op.name} in-memory block size: "
    f"{(self.first_block_size_bytes / 2**20):.2f} MB"
)
```
raulchen (Contributor):

I'm still unclear how useful this would be. (1) In which case would the block size be larger than target_max_block_size? It shouldn't happen unless there is a bug. (2) Printing this at info level may be too verbose and may confuse users.
@ericl @scottjlee, could one of you elaborate on what exact issue this is trying to solve?

scottjlee (Contributor Author):

@ericl what was the original scenario where this issue appeared? Block splitting should ensure we don't get block sizes larger than `target_max_block_size`; are there operators where this is not handled?

amogkam (Contributor) left a comment:

+1 to @raulchen.

This information is useful when the Ray team is debugging, but if a user sees this message, I'm not sure what they are expected to do.

I think this information is more useful to save as part of the persisted dataset stats (and we can even go one step further and track min output block size, average output block size, max output block size). This way users can easily send logs for debugging.

But I don't think this is helpful to log on stdout.

```python
    f"{self.op.name} in-memory block size of "
    f"{(self.first_block_size_bytes / 2**20):.2f} MB is significantly "
    f"larger than the maximum target block size of "
    f"{(target_max_block_size / 2**20):.2f} MB."
```
Contributor:

if a user sees this warning message, what are they expected to do?

scottjlee (Contributor Author) commented Sep 26, 2023:

@amogkam @raulchen - Discussed with @ericl: ideally, the user would look at the warning and realize they should adjust parallelism to reduce block size. We can add a paragraph to the Performance Tips page describing this; the "Tuning read parallelism" section touches on it, but gives no clear course of action.

Does that make sense from the user's perspective?
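As a rough illustration of why increasing parallelism reduces block size (a back-of-envelope sketch, not Ray Data's actual sizing logic; the 128 MB target and the function names here are illustrative assumptions):

```python
import math

# Assumption: a 128 MB target max block size, for illustration only.
TARGET_MAX_BLOCK_SIZE_MB = 128


def approx_block_size_mb(dataset_size_mb: float, parallelism: int) -> float:
    """Rough per-block size if the dataset splits evenly across blocks."""
    return dataset_size_mb / parallelism


def min_parallelism(
    dataset_size_mb: float, target_mb: float = TARGET_MAX_BLOCK_SIZE_MB
) -> int:
    """Smallest parallelism keeping the rough block size under target_mb."""
    return math.ceil(dataset_size_mb / target_mb)
```

For example, a 10 GB (10240 MB) dataset read with parallelism 20 yields roughly 512 MB blocks, well above a 128 MB target; a parallelism of at least 80 brings the rough block size back under the target.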

…out and only ray-data.log

Signed-off-by: Scott Lee <sjl@anyscale.com>
ericl (Contributor) commented Sep 29, 2023 via email

scottjlee (Contributor Author):

It sounds like from this discussion we want to add this as an "insurance policy": users really shouldn't be running into this issue, but in case they do, we log a warning suggesting they increase parallelism. Since the current changes emit a warning only when the block size exceeds the configured target size (very rare), there's little downside to including this — no excessive spam, since it's printed only when truly needed. So I think this is a beneficial addition. @amogkam @raulchen

Since we only log the warning when the block size exceeds the configured target size (which should rarely happen), and we always log the info line to the Data-specific log file, users shouldn't get spammed. Later, we can add statistics like min/avg/max block size to the dashboard, as Amog suggested.

ericl (Contributor) commented Sep 30, 2023:

Yes, and thinking about this more, another scenario is the user accidentally returning a single humongous row or something like that, which isn't an uncommon error when working with tensor data.

Even just playing around with a couple of examples, I found a bug where map_batches doesn't seem to split blocks into the right size, and without this kind of log it would be pretty hard to identify these sorts of issues.

raulchen (Contributor) commented Oct 2, 2023:

Since this is for developers debugging performance, logging only to the data logs makes more sense.

scottjlee (Contributor Author):

> Since this is for developers debugging performance, logging only to the data logs makes more sense.

@raulchen Do you suggest we move both the info log and the warning to the data log only? I think showing the warning in stdout still makes sense, since this is a significant issue that users should be aware of without having to look at the Data-specific logs.

raulchen (Contributor) commented Oct 2, 2023:

Keeping the warning in stdout is fine, as long as we make the message clear. I don't think we should suggest users increase the parallelism, because this issue can only happen when (1) a single row is bigger than the target block size, or (2) there is a bug in Ray Data.

ericl (Contributor) commented Oct 2, 2023:

I think this has to be a high severity level, if we think it indicates a bug. We might even raise an exception in the future if this happens.

However, I don't think today we can raise an exception, since UDFs can return large blocks in map batches and these aren't split. The best we can do is warn about this.

Signed-off-by: Scott Lee <sjl@anyscale.com>
scottjlee (Contributor Author):

Got it. @raulchen @ericl, I updated the PR to match the discussion above; ready for another look.

```diff
@@ -684,6 +687,56 @@ def test_execution_allowed_nothrottle():
 )


+def test__check_first_block_size():
```
Contributor:

is it a typo to have 2 underscores?

scottjlee (Contributor Author):

The function name is `_check_first_block_size`, so I was following the usual format of `test_<fn_name>`. Let me know if you'd prefer a single underscore.

@scottjlee scottjlee marked this pull request as draft October 10, 2023 18:05
@scottjlee
Copy link
Contributor Author

Will wait for #40173 to be merged, then we can also emit better warnings based on the block size metrics available per Operator.

@scottjlee
Copy link
Contributor Author

Original issue closed, no longer needed.

@scottjlee scottjlee closed this Nov 8, 2023
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.

Successfully merging this pull request may close these issues.

[data] Better reporting for block sizes
5 participants