
[Data] Report first block size per Operator #39656

Closed
wants to merge 13 commits

Conversation

scottjlee (Contributor)

Why are these changes needed?

For each Operator, log an info message with the in-memory size of the first block it produces. If this size greatly exceeds the target max block size (currently configured at 2x), log a warning message.
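The check this PR adds can be sketched in plain Python. This is a hypothetical sketch: the function name `check_first_block_size`, the 128 MiB target, and the standalone `logging` setup are illustrative assumptions, not the actual Ray Data internals (which live on the operator and use `DataContext`).

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ray.data")

# Assumption: a 128 MiB target max block size, matching Ray Data's
# commonly cited default; the real value comes from DataContext.
TARGET_MAX_BLOCK_SIZE = 128 * 1024 * 1024
WARN_FACTOR = 2  # warn when the first block is > 2x the target


def check_first_block_size(op_name: str, first_block_size_bytes: int) -> bool:
    """Log the first block's in-memory size; return True if a warning fired."""
    logger.info(
        f"{op_name} in-memory block size: "
        f"{first_block_size_bytes / 2**20:.2f} MB"
    )
    if first_block_size_bytes > WARN_FACTOR * TARGET_MAX_BLOCK_SIZE:
        logger.warning(
            f"{op_name} in-memory block size of "
            f"{first_block_size_bytes / 2**20:.2f} MB is significantly "
            f"larger than the maximum target block size of "
            f"{TARGET_MAX_BLOCK_SIZE / 2**20:.2f} MB."
        )
        return True
    return False
```

For example, a 64 MiB first block only produces the info line, while a 512 MiB block (above the 256 MiB threshold) also triggers the warning.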

Related issue number

Closes #39647

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Scott Lee <sjl@anyscale.com>
```diff
     _calculate_ref_hits,
     blocks_to_batches,
     collate,
     finalize_batches,
     format_batches,
     resolve_block_refs,
 )
-from ray.data._internal.util import make_async_gen
+from ray.data._internal.util import Queue, make_async_gen
```
scottjlee (Contributor Author) commented Sep 14, 2023:

fixes breaking test import (unrelated to changes in this PR)

@scottjlee scottjlee marked this pull request as ready for review September 14, 2023 16:50
@scottjlee scottjlee added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Sep 14, 2023
```python
logger.get_logger().info(
    f"{self.op.name} in-memory block size: "
    f"{(self.first_block_size_bytes / 2**20):.2f} MB"
)
```
raulchen (Contributor):

I'm still unclear how useful this would be. (1) In which case would the block size be larger than target_max_block_size? It shouldn't happen unless there is a bug. (2) Printing this at info level may be too verbose and may confuse users.
@ericl @scottjlee, could one of you elaborate on what exact issue this is trying to solve?

scottjlee (Contributor Author):

@ericl what was the original scenario where this issue appeared? Block splitting should ensure we don't get block sizes larger than `target_max_block_size`; are there operators where this is not handled?

amogkam (Contributor) left a comment:

+1 to @raulchen.

This information is useful when the Ray team is debugging, but if a user sees this message, I'm not sure what they are expected to do.

I think this information is more useful to save as part of the persisted dataset stats (and we can even go one step further and track min output block size, average output block size, max output block size). This way users can easily send logs for debugging.

But I don't think this is helpful to log on stdout.

```python
    f"{self.op.name} in-memory block size of "
    f"{(self.first_block_size_bytes / 2**20):.2f} MB is significantly "
    f"larger than the maximum target block size of "
    f"{(target_max_block_size / 2**20):.2f} MB."
```
Contributor:

if a user sees this warning message, what are they expected to do?

scottjlee (Contributor Author) commented Sep 26, 2023:

@amogkam @raulchen - Discussed with @ericl: ideally, the user would look at the warning and realize they should adjust parallelism to reduce block size. We can add a paragraph to the Performance Tips page describing this; the "Tuning read parallelism" section touches on it, but gives no clear course of action.

Does that make sense from the user's perspective?
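As a rough illustration of why increasing parallelism reduces block size (a back-of-envelope sketch, not Ray Data's actual sizing logic; the 128 MB target and the function names here are illustrative assumptions):

```python
import math

# Assumption: a 128 MB target max block size, for illustration only.
TARGET_MAX_BLOCK_SIZE_MB = 128


def approx_block_size_mb(dataset_size_mb: float, parallelism: int) -> float:
    """Rough per-block size if the dataset splits evenly across blocks."""
    return dataset_size_mb / parallelism


def min_parallelism(
    dataset_size_mb: float, target_mb: float = TARGET_MAX_BLOCK_SIZE_MB
) -> int:
    """Smallest parallelism keeping the rough block size under target_mb."""
    return math.ceil(dataset_size_mb / target_mb)
```

For example, a 10 GB (10240 MB) dataset read with parallelism 20 yields roughly 512 MB blocks, well above a 128 MB target; a parallelism of at least 80 brings the rough block size back under the target.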

…out and only ray-data.log

Signed-off-by: Scott Lee <sjl@anyscale.com>
ericl (Contributor) commented Sep 29, 2023 via email

scottjlee (Contributor Author):

It sounds like from this discussion we want to add this as an "insurance policy": users really shouldn't be running into this issue, but in case they do, we log a warning suggesting they increase parallelism. Since the current changes emit a warning only when the block size exceeds the configured target size (very rare), there's little downside to including this — no excessive spam, since it's printed only when truly needed. So I think this is a beneficial addition. @amogkam @raulchen

Since we only log the warning when the block size exceeds the configured target size (which should rarely happen), and we always log the info line to the Data-specific log file, users shouldn't get spammed. Later, we can add statistics like min/avg/max block size to the dashboard, as Amog suggested.

ericl (Contributor) commented Sep 30, 2023:

Yes, and thinking about this more, another scenario is the user accidentally returning a single humongous row or something like that, which isn't an uncommon error when working with tensor data.

Even just playing around with a couple of examples, I found a bug where map_batches doesn't seem to split blocks into the right size, and without this kind of log it would be pretty hard to identify these sorts of issues.

raulchen (Contributor) commented Oct 2, 2023:

Since this is for developers debugging performance, logging only to the data logs makes more sense.

scottjlee (Contributor Author):

> Since this is for developers debugging performance, logging only to the data logs makes more sense.

@raulchen Do you suggest we move both the info log and the warning to the data log only? I think showing the warning in stdout still makes sense, since this is a significant issue that users should be aware of without having to look at the Data-specific logs.

raulchen (Contributor) commented Oct 2, 2023:

Keeping the warning in stdout is fine, as long as we make the message clear. I don't think we should suggest users increase the parallelism, because this issue can only happen when (1) a single row is bigger than the target block size, or (2) there is a bug in Ray Data.

ericl (Contributor) commented Oct 2, 2023:

I think this has to be a high severity level, if we think it indicates a bug. We might even raise an exception in the future if this happens.

However, I don't think today we can raise an exception, since UDFs can return large blocks in map batches and these aren't split. The best we can do is warn about this.

Signed-off-by: Scott Lee <sjl@anyscale.com>
scottjlee (Contributor Author):

Got it. @raulchen @ericl, I updated the PR to match the discussion above; ready for another look.

```diff
@@ -684,6 +687,56 @@ def test_execution_allowed_nothrottle():
 )


+def test__check_first_block_size():
```
Contributor:

is it a typo to have 2 underscores?

scottjlee (Contributor Author):

The function name is `_check_first_block_size`, so I was following the usual format of `test_<fn_name>`. Let me know if you'd prefer a single underscore.

@scottjlee scottjlee marked this pull request as draft October 10, 2023 18:05
@scottjlee
Copy link
Contributor Author

Will wait for #40173 to be merged, then we can also emit better warnings based on the block size metrics available per Operator.

@scottjlee
Copy link
Contributor Author

Original issue closed, no longer needed.

@scottjlee scottjlee closed this Nov 8, 2023
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.

Successfully merging this pull request may close these issues.

[data] Better reporting for block sizes
5 participants