[Core] Fix placement group leaks #42942
Conversation
Can we have a unit test, e.g. start a pg and kill it right away, and see if there's any leak?
@rynewang I am figuring that out. This requires very subtle timing (indeed, I could only repro it when I created many pgs), so writing a unit test is not trivial. Writing a C++ unit test is even harder because of how our code is structured.
// Retry 10 * worker registration timeout to avoid a race condition.
// See https://github.com/ray-project/ray/pull/42942 for more details.
/*max_retry*/ RayConfig::instance().worker_register_timeout_seconds() * 10,
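To make the intent of that constant concrete, here is a minimal, self-contained sketch of the fixed-interval bounded-retry pattern it configures. The stubbed RPC, the attempt at which it succeeds, and the main() harness are illustrative assumptions, not Ray's actual code; only the retry budget (10 * the 60-second worker registration timeout, retried once per second) mirrors the PR.

#include <chrono>
#include <iostream>
#include <thread>

// Stand-in for CancelResourceReserve: pretend cancellation only succeeds once
// the worker has registered (here, on the third attempt).
bool TryCancelResourceReserve(int attempt) { return attempt >= 2; }

int main() {
  const int worker_register_timeout_seconds = 60;              // config value
  const int max_retry = worker_register_timeout_seconds * 10;  // 600 attempts

  for (int attempt = 0; attempt < max_retry; ++attempt) {
    if (TryCancelResourceReserve(attempt)) {
      std::cout << "bundle resources returned after " << attempt + 1
                << " attempt(s)\n";
      return 0;
    }
    // Fixed 1-second interval, so the total retry budget is ~10 minutes,
    // far longer than the worker registration window.
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
  std::cout << "retries exhausted; the bundle's resources would leak\n";
  return 1;
}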
Q: should we do exponential backoff? Right now, we retry every 1 second.
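For reference, a capped exponential backoff could look roughly like the sketch below. This is only an illustration of the alternative being asked about; the base and cap values are made up and nothing here was agreed on in the thread.

#include <algorithm>
#include <cstdint>
#include <iostream>

// Delay doubles on every attempt and is capped, instead of a fixed 1-second
// interval between retries.
int64_t BackoffMs(int attempt, int64_t base_ms = 1000, int64_t cap_ms = 60000) {
  return std::min(cap_ms, base_ms << std::min(attempt, 16));
}

int main() {
  for (int attempt = 0; attempt < 8; ++attempt) {
    std::cout << "attempt " << attempt << ": wait " << BackoffMs(attempt) << " ms\n";
  }
}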
@jjyao it is ready to be reviewed
Q: Right now, we retry for 10 minutes with a 1-second interval. Should we do exponential backoff?
When can it happen?
When a worker has not started yet!
const rpc::CancelResourceReserveReply &reply) {
RAY_LOG(DEBUG) << "Finished cancelling the resource reserved for bundle: "
               << bundle_spec->DebugString() << " at node " << node_id;
[this, bundle_spec, node_id, node, max_retry, current_retry_cnt](
Where do you use the max retry?
Oops, let me fix this. I will also add a unit test.
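The follow-up fix being promised here is essentially the check below: consult max_retry before re-issuing the cancellation. This is a hedged, self-contained sketch; the simulated RPC and the recursive scheduling are stand-ins, and only the max_retry / current_retry_cnt bookkeeping reflects what the review comment is about.

#include <functional>
#include <iostream>

int main() {
  const int max_retry = 5;
  int rpc_calls = 0;
  // Simulated CancelResourceReserve: succeeds once the worker "registers"
  // (here, on the third call).
  auto try_cancel = [&]() { return ++rpc_calls >= 3; };

  std::function<void(int)> cancel_with_retry = [&](int current_retry_cnt) {
    if (try_cancel()) {
      std::cout << "cancelled after " << rpc_calls << " RPC call(s)\n";
      return;
    }
    if (current_retry_cnt >= max_retry) {
      std::cout << "retries exhausted; the bundle's resources would leak\n";
      return;
    }
    // Re-issue the request with an incremented retry count; threading this
    // counter through the callback is where max_retry actually gets used.
    cancel_with_retry(current_retry_cnt + 1);
  };

  cancel_with_retry(/*current_retry_cnt=*/0);
}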
@@ -59,8 +59,6 @@ void ClusterResourceManager::AddOrUpdateNode(
void ClusterResourceManager::AddOrUpdateNode(scheduling::NodeID node_id,
                                             const NodeResources &node_resources) {
  RAY_LOG(DEBUG) << "Update node info, node_id: " << node_id.ToInt()
Why remove?
This seems useless and too verbose. I can bring it back if you think it is necessary.
@jjyao it'd be great if you could approve the PR. I will add the changes and merge it.
Why are these changes needed?
Root cause: when we schedule a task, we allocate resources and then start a worker. If a cancel-bundle request is received before the worker has started, the resources leak, because cancelling a bundle only kills workers and returns resources from workers that are already started.
This PR fixes the issue by retrying the cancellation. This also means that if a worker starts late (it has a 60-second registration timeout), the retry can still fail, since there is a maximum number of retries. We retry for a very long time (10 * registration timeout), so this is unlikely to happen.
Alternatively, to improve consistency, we could register removed placement groups and keep deleting their resources (with a reconciler) until they are fully gone, or register leased workers before the workers are started. Either could be a better solution, but the implications of such a change are unknown.
I chose the current solution as it is needed for a network issue as well.
Related issue number
Closes #26761
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.