[CherryPick][Core] Bumping up task failure logs to warnings to make sure failures could be traced in Ray Core logs #43147

Merged


alexeykudinkin
Contributor

…troubleshooted

Why are these changes needed?

NOTE: This is a cherry-pick of #43111 for 2.9.3

Currently, we observe a lot of failures like the following in our production deployment:

  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/handle.py", line 781, in __anext__
    return await next_obj_ref
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

However, we can't find any logs in Ray Core corresponding to this failure. Looking around, I realized that all of the relevant log statements are at DEBUG level, so tracing a failure would require switching to DEBUG mode, which would drown our logging infrastructure.

Hence, this PR bumps the failure logs to at least WARNING so that any failure is traceable in Ray Core logs.
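The effect of the level bump can be illustrated with a minimal sketch. It assumes the logger drops anything below a configured threshold and that the production threshold is INFO; `SketchLogger` and `Severity` are hypothetical names for illustration only and are not Ray's logging API.

```cpp
#include <sstream>
#include <string>

// Hypothetical severity levels, ordered from least to most severe.
enum class Severity { kDebug = 0, kInfo = 1, kWarning = 2, kError = 3 };

// Hypothetical minimal logger: keeps only messages at or above a threshold.
// This is an illustration, not Ray's RAY_LOG implementation.
class SketchLogger {
 public:
  explicit SketchLogger(Severity threshold) : threshold_(threshold) {}

  // Returns true if the message was recorded, false if it was filtered out.
  bool Log(Severity severity, const std::string &message) {
    if (static_cast<int>(severity) < static_cast<int>(threshold_)) {
      return false;  // Below the threshold: silently dropped.
    }
    out_ << message << "\n";
    return true;
  }

  // Everything that was actually recorded so far.
  std::string Output() const { return out_.str(); }

 private:
  Severity threshold_;
  std::ostringstream out_;
};
```

With a threshold of `Severity::kInfo`, a failure message logged at `kDebug` never reaches the output, while the same message logged at `kWarning` does. That is the behavior change this PR relies on.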

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rynewang
Contributor

@zhe-thoughts Could you review this cherry pick and approve it? Thanks

@rynewang
Contributor

Does not build. I noticed there's a difference between this and 149e400 in direct_task_transport.cc. Please fix this, @alexeykudinkin.

src/ray/core_worker/transport/direct_task_transport.cc: In member function 'void ray::core::CoreWorkerDirectTaskSubmitter::HandleGetTaskFailureCause(const ray::Status&, bool, const ray::TaskID&, const ray::rpc::WorkerAddress&, const ray::Status&, const ray::rpc::GetTaskFailureCauseReply&)':
2024-02-13 14:45:04 PST | src/ray/core_worker/transport/direct_task_transport.cc:715:56: error: no match for call to '(const ray::NodeID) ()'
2024-02-13 14:45:04 PST | 715 |                      << " node id: " << addr.raylet_id() << " ip: " << addr.ip_address();
2024-02-13 14:45:04 PST |     |                                                        ^
2024-02-13 14:45:04 PST | src/ray/core_worker/transport/direct_task_transport.cc:715:88: error: no match for call to '(const string {aka const std::basic_string<char>}) ()'
2024-02-13 14:45:04 PST | 715 |                      << " node id: " << addr.raylet_id() << " ip: " << addr.ip_address();
2024-02-13 14:45:04 PST |     |
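The "no match for call" errors above suggest that, on this release branch, `raylet_id` and `ip_address` are plain data members rather than accessor methods, so the call parentheses do not compile. A minimal sketch of the likely shape of the fix follows. `SketchWorkerAddress` and `DescribeFailure` are hypothetical stand-ins, and the node ID is simplified to a string; the actual change in the PR is not shown in this conversation.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for an address type whose fields are data members.
// Per the compiler error the real raylet_id is a ray::NodeID; a string is
// used here only to keep the sketch self-contained.
struct SketchWorkerAddress {
  std::string ip_address;
  std::string raylet_id;
};

// Builds the tail of a failure log message from the address fields.
std::string DescribeFailure(const SketchWorkerAddress &addr) {
  std::ostringstream os;
  // Writing addr.raylet_id() or addr.ip_address() here would fail with
  // "no match for call", because neither member is callable.
  os << " node id: " << addr.raylet_id << " ip: " << addr.ip_address;
  return os.str();
}
```

Accessing the members directly, without parentheses, is enough to satisfy the compiler in this sketch.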

…troubleshooted

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
@alexeykudinkin alexeykudinkin changed the base branch from releases/2.9.2 to releases/2.9.3 February 13, 2024 23:37
@rynewang
Contributor

LGTM, will pull after premerge is done.

@alexeykudinkin alexeykudinkin changed the title [Cherry-pick][Core] Bumping up task failure logs to warnings to make sure failures could be traced in Ray Core logs [CherryPick][Core] Bumping up task failure logs to warnings to make sure failures could be traced in Ray Core logs Feb 14, 2024
@aslonnie aslonnie merged commit 2613d7d into ray-project:releases/2.9.3 Feb 14, 2024
13 of 17 checks passed

4 participants