[Core] Fix a bug where SIGTERM is ignored to worker processes #40210
Note: our shutdown path is a bit complex, and it'd be nice to clean this up. I added this to the tech debt item, along with a doc describing how the shutdown path works: https://docs.google.com/document/d/1mr4j5p9a0Ce5oYaUDoaiHUdOsbYfBOmAO4CPQTSjwR4/edit#heading=h.fwxjsbhjhpi5
LGTM - a few high level questions.
+10000 for cleaning up the shutdown code
@@ -689,6 +697,33 @@ cdef int prepare_actor_concurrency_groups(
concurrency_groups.push_back(cg)
return 1


def raise_sys_exit_with_custom_error_message(
cc @rickyyx
nit: can we move this out of _raylet.pyx to a better place?
python/ray/_raylet.pyx
Outdated
@@ -2057,7 +2086,7 @@ cdef CRayStatus task_execution_handler(
 except SystemExit as e:
     # Tell the core worker to exit as soon as the result objects
     # are processed.
-    if hasattr(e, "is_ray_terminate"):
+    if hasattr(e, "is_ray_terminate") and e.is_ray_terminate:
Now the exception can carry is_ray_terminate even when it is False, so that we can attach an additional message in the is_ray_terminate = False case.
So before:
- if is_ray_terminate is present => we treat it as intentional system exit
Now:
- if is_ray_terminate is present AND true => we raise intentional system exit
- else: UnexpectedSystemExit?
Let's remove is_ray_terminate
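The change in semantics discussed in this thread can be sketched in plain Python. This is only an illustration, not the code in _raylet.pyx: classify_old, classify_new, and make_exit are hypothetical helpers, and mapping the else branch to an "unexpected" exit is an assumption taken from the question above.

```python
def classify_old(exc: SystemExit) -> str:
    # Old check: the mere presence of the attribute marks the exit as intended,
    # even if the attribute is set to False.
    return "intentional" if hasattr(exc, "is_ray_terminate") else "unexpected"


def classify_new(exc: SystemExit) -> str:
    # New check: the attribute must be present AND truthy.
    if hasattr(exc, "is_ray_terminate") and exc.is_ray_terminate:
        return "intentional"
    return "unexpected"


def make_exit(code, is_ray_terminate=None) -> SystemExit:
    # Hypothetical helper: build a SystemExit and optionally tag it.
    exc = SystemExit(code)
    if is_ray_terminate is not None:
        exc.is_ray_terminate = is_ray_terminate
    return exc


if __name__ == "__main__":
    tagged_false = make_exit(1, is_ray_terminate=False)
    print(classify_old(tagged_false), classify_new(tagged_false))
```

The two checks only disagree when the attribute exists but is False, which is the case the new condition is meant to separate out.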
/// The overhead of this is only a single digit microsecond.
auto status = options_.check_signals();
if (status.IsIntentionalSystemExit()) {
  Exit(rpc::WorkerExitType::INTENDED_USER_EXIT,
This is a new change (we handle Exit inside cpp instead of python)
why so?
@@ -181,7 +183,7 @@ void CoreWorkerDirectTaskSubmitter::ReturnWorker(const rpc::WorkerAddress addr,
 }

 auto status = lease_entry.lease_client->ReturnWorker(
-    addr.port, addr.worker_id, was_error, worker_exiting);
+    addr.port, addr.worker_id, was_error, error_detail, worker_exiting);
This file change is to add more detail when a worker is returned with was_error = true. I found it currently doesn't give any detail
@@ -150,12 +150,15 @@ class CoreWorkerDirectTaskSubmitter {
/// \param[in] addr The address of the worker.
/// \param[in] task_queue_key The scheduling class of the worker.
/// \param[in] was_error Whether the task failed to be submitted.
/// \param[in] error_detail The reason why it was errored.
This file change is to add more detail when a worker is returned with was_error = true. I found it currently doesn't give any detail
@@ -131,6 +131,8 @@ message ReturnWorkerRequest {
bool disconnect_worker = 3;
// Whether the worker is exiting and cannot be reused.
bool worker_exiting = 4;
// The error message for disconnect_worker.
This file change is to add more detail when a worker is returned with was_error = true. I found it currently doesn't give any detail
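As a rough illustration of the idea (not the actual C++ or protobuf code), the request could be modeled like this in Python. The class name, the field name disconnect_worker_error_detail, and the assumption that was_error maps onto disconnect_worker are all hypothetical; the diff above only shows that an error message field for disconnect_worker was added.

```python
from dataclasses import dataclass


@dataclass
class ReturnWorkerRequestSketch:
    """Hypothetical Python mirror of the ReturnWorkerRequest message."""

    worker_port: int
    worker_id: bytes
    disconnect_worker: bool = False
    worker_exiting: bool = False
    # Field name is an assumption; the real proto field is not shown above.
    disconnect_worker_error_detail: str = ""


def build_return_worker_request(port, worker_id, was_error, error_detail, worker_exiting):
    # Assumption: was_error is what asks the raylet to disconnect the worker,
    # so the detail is only meaningful when was_error is True.
    return ReturnWorkerRequestSketch(
        worker_port=port,
        worker_id=worker_id,
        disconnect_worker=was_error,
        worker_exiting=worker_exiting,
        disconnect_worker_error_detail=error_detail if was_error else "",
    )


if __name__ == "__main__":
    print(build_return_worker_request(1234, b"w1", True, "task submission failed", False))
```

Dropping the detail when was_error is False is a design choice of this sketch, so that a stale message never accompanies a healthy worker return.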
src/ray/raylet/node_manager.cc
Outdated
<< " died.";
const auto &err_msg = stream.str();
RAY_LOG(INFO) << err_msg;
DestroyWorker(worker, rpc::WorkerExitType::SYSTEM_ERROR, err_msg);
This file change is to add more detail when a worker is returned with was_error = true. I found it currently doesn't give any detail
@jjyao I think we can do this all together with the exit code work for 2.9. The scope I am thinking
I am still trying to understand the changes (as well as the original issues).
I had a couple of questions there, not so much requests for code changes. The change looks pretty clean to me.
Another question: how does this relate to the original issue that Antoni ran into? AFAIK, the original issue, where the actor leaked when the pg was removed, didn't even get to the point of SIGTERM handling?
elif status.IsUnexpectedSystemExit():
    with gil:
        raise_sys_exit_with_custom_error_message(
            message, exit_code=1)
else:
    raise RaySystemError(message)
How would RaySystemError differ from the above 2 cases now?
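To make the three branches concrete, here is a minimal pure-Python sketch of the mapping. StatusKind, SketchSystemError (a stand-in for RaySystemError), raise_sys_exit_sketch, and the detail attribute are all hypothetical names; only the exit codes (0 for the intentional case per the PR description, 1 for the unexpected case per the hunk above) come from this PR.

```python
import enum


class StatusKind(enum.Enum):
    OK = enum.auto()
    INTENTIONAL_SYSTEM_EXIT = enum.auto()
    UNEXPECTED_SYSTEM_EXIT = enum.auto()
    OTHER = enum.auto()


class SketchSystemError(Exception):
    """Stand-in for RaySystemError in this sketch."""


def raise_sys_exit_sketch(message, exit_code):
    # Hypothetical helper: a SystemExit that carries an exit code plus a message.
    exc = SystemExit(exit_code)
    exc.detail = message  # attribute name is an assumption
    raise exc


def raise_for_status(kind, message):
    if kind is StatusKind.OK:
        return
    if kind is StatusKind.INTENTIONAL_SYSTEM_EXIT:
        raise_sys_exit_sketch(message, exit_code=0)
    elif kind is StatusKind.UNEXPECTED_SYSTEM_EXIT:
        raise_sys_exit_sketch(message, exit_code=1)
    else:
        # Anything else surfaces as an ordinary exception, not a process exit.
        raise SketchSystemError(message)


if __name__ == "__main__":
    try:
        raise_for_status(StatusKind.UNEXPECTED_SYSTEM_EXIT, "got SIGTERM")
    except SystemExit as exc:
        print(exc.code, exc.detail)
```

In this sketch the two SystemExit branches differ only in exit code, while the fallback is a regular exception that a caller could catch without the process exiting.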
@@ -2093,10 +2123,30 @@ cdef c_bool kill_main_task(const CTaskID &task_id) nogil:

cdef CRayStatus check_signals() nogil:
    with gil:
        # The Python exceptions are not handled if it is raised from cdef,
what does this mean?
reported to GCS, and any worker failure error will contain them.

The behavior of this API while a task is running is undefined.
Avoid using the API when a task is still running.
so this should only be called as part of the shutdown routine? But not an API for exiting workers in general?
@@ -794,8 +794,13 @@ void CoreWorker::Exit(
 exiting_detail_ = std::optional<std::string>{detail};
 }
 // Release the resources early in case draining takes a long time.
-RAY_CHECK_OK(
-    local_raylet_client_->NotifyDirectCallTaskBlocked(/*release_resources*/ true));
+auto status =
is this intentional?
@@ -365,6 +365,10 @@ class NodeManager : public rpc::NodeManagerServiceHandler,

/// Kill a worker.
///
/// This shouldn't be directly used to kill a worker. If you use this API
what's then the usecase for this?
I will remove this comment.
Synced offline.
    return CRayStatus.IntentionalSystemExit(error_msg.encode("utf-8"))
else:
    return CRayStatus.UnexpectedSystemExit(error_msg.encode("utf-8"))
# By default, if signals raise an exception, Python just prints them.
Test signals raising an exception
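A small, self-contained way to exercise "a signal handler raises an exception" in plain Python might look like the following. It is only a sketch of the pattern (catch the exception and turn it into a status value instead of letting it propagate or be printed); it does not reproduce the cdef behavior mentioned in the code comment. check_signals_sketch, sigterm_handler, and the tuple-based status are hypothetical, and the code assumes Python 3.8+ running in the main thread.

```python
import signal
import time


def sigterm_handler(signum, frame):
    # A Python-level handler may raise; the exception then surfaces in the
    # main thread wherever the interpreter runs pending signal handlers.
    raise SystemExit(1)


def check_signals_sketch():
    """Deliver SIGTERM to this process and report what the handler did."""
    previous = signal.signal(signal.SIGTERM, sigterm_handler)
    try:
        signal.raise_signal(signal.SIGTERM)
        # Give older interpreters a chance to run the pending handler.
        time.sleep(0.05)
        return ("ok", None)
    except SystemExit as exc:
        # Convert the exception into a status instead of propagating it.
        return ("unexpected_system_exit", exc.code)
    finally:
        # Restore whatever handler was installed before the sketch ran.
        signal.signal(
            signal.SIGTERM, previous if previous is not None else signal.SIG_DFL
        )


if __name__ == "__main__":
    print(check_signals_sketch())
```

Returning a status rather than re-raising mirrors the shape of the hunk above, where the result is returned as IntentionalSystemExit or UnexpectedSystemExit.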
Thanks for fixing this. This helps make the error handling more predictable.
…oject#40210) --------- Co-authored-by: SangBin Cho <sangcho@sangcho-LT93GQWG9C.local> Co-authored-by: sangcho <sangcho@anyscale.com>
#40690) --------- Co-authored-by: SangBin Cho <sangcho@sangcho-LT93GQWG9C.local> Co-authored-by: sangcho <sangcho@anyscale.com>
Why are these changes needed?
When a worker process exits, it should follow the process specified below.
shutdown() API from Python.
Currently, there are 3 issues.
This PR fixes all 3 issues. Let's see the new behavior proposal upon SIGTERM.
New behavior for SIGTERM to worker
How are the issues fixed?
check_signals, which makes ray.get return IntentionalSystemExit (exit code = 0) or UnexpectedSystemExit (exit code != 0). And from check_status we raise a SystemExit, which will trigger the code path for 2 (when a worker is running a task).
Related issue number
Closes #40182
Checks
git commit -s) in this PR.
scripts/format.sh to lint the changes in this PR.
method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.