Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[data] Warn on excessive driver memory usage during shuffle ops #42574

Merged
merged 5 commits into from Jan 29, 2024

Conversation

stephanie-wang
Copy link
Contributor

Why are these changes needed?

Warns the user if too much driver memory is used for:

  • ObjectRef C++ metadata (estimated)
  • Small objects that get stored in driver heap memory

Related issue number

Closes #40861.

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Copy link
Contributor

@c21 c21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @stephanie-wang. Change on data directory looks good to me, except one comment.

# If driver memory exceeds this threshold, warn the user. For now, this
# only applies to shuffle ops because most other ops are unlikely to use as
# much driver memory.
warn_on_driver_memory_usage_bytes: Optional[int] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we pass this argument through DataContext to ExchangeTaskScheduler.execute(warn_on_driver_memory_usage_bytes=...) directly?

TaskContext ideally should only contain the unique information per each task. Feeling it does not need to go through TaskContext in this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, moved to ExchangeTaskScheduler.

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
@stephanie-wang stephanie-wang added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jan 25, 2024
Copy link
Contributor

@c21 c21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @stephanie-wang! The code change on Data part LGTM.

@stephanie-wang stephanie-wang merged commit bb65bdb into ray-project:master Jan 29, 2024
9 checks passed
@stephanie-wang stephanie-wang deleted the shuffle-warnings branch January 29, 2024 23:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[data] Warn on high driver memory usage during shuffle ops
4 participants