
[Data] Skip recording memory spilled stats when get_memory_info_reply is failed #42824

Merged
merged 2 commits into from
Jan 30, 2024

Conversation

c21
Contributor

@c21 c21 commented Jan 30, 2024

Why are these changes needed?

A user reported that the call to get_memory_info_reply throws a gRPC error when memory load on the cluster is heavy. The error stack trace is:

  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/plan.py", line 675, in execute
    reply = get_memory_info_reply(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/internal_api.py", line 82, in get_memory_info_reply
    reply = stub.FormatGlobalMemoryInfo(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/grpc/_channel.py", line 1160, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/grpc/_channel.py", line 1003, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "UNKNOWN:Error received from peer  {created_time:"...", grpc_status:4, grpc_message:"Deadline Exceeded"}"

This PR skips recording the memory-spilled stats if the call does not succeed. This stats reporting is not on the critical path of the job, so it can safely be skipped.
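The fix can be sketched as wrapping the best-effort stats RPC in a try/except so a failed reply no longer fails the job. This is a minimal illustration, not Ray's actual internal code: `fetch_memory_info` and `record_stats` are hypothetical stand-ins for the real call sites around `get_memory_info_reply`.

```python
# Hypothetical sketch of the fix. fetch_memory_info and record_stats are
# illustrative stand-ins, not Ray's actual internal names.
import logging

logger = logging.getLogger(__name__)


def record_memory_spilled_stats(fetch_memory_info, record_stats):
    try:
        reply = fetch_memory_info()
    except Exception as e:
        # Stats reporting is best-effort and off the job's critical path:
        # log the failure (e.g. a gRPC DEADLINE_EXCEEDED) and move on.
        logger.warning("Skipping memory spilled stats: %s", e)
        return None
    return record_stats(reply)
```

With this shape, an RPC deadline error only produces a warning and a skipped stats entry, while a successful reply is recorded as before.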

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Cheng Su <scnju13@gmail.com>
@c21
Contributor Author

c21 commented Jan 30, 2024

Tested manually locally by throwing an arbitrary error inside the try block and verifying that the job still succeeded.

@c21
Contributor Author

c21 commented Jan 30, 2024

Will merge in the morning tomorrow if no more comments.

@c21 c21 merged commit 43631f9 into ray-project:master Jan 30, 2024
9 checks passed
@c21 c21 deleted the core-memory branch January 30, 2024 15:36
c21 added a commit to c21/ray that referenced this pull request Jan 30, 2024
… is failed (ray-project#42824)

A user reported that the call to `get_memory_info_reply` throws a gRPC error when memory load on the cluster is heavy.

Signed-off-by: Cheng Su <scnju13@gmail.com>
architkulkarni pushed a commit that referenced this pull request Jan 30, 2024
… is failed (#42824) (#42834)

A user reported that the call to get_memory_info_reply throws a gRPC error when memory load on the cluster is heavy.

Why are these changes needed?
Cherry pick of #42824 to 2.9.2 release branch.
Signed-off-by: Cheng Su <scnju13@gmail.com>
5 participants