
[Core] Fix placement group leaks #42942

Merged: rkooo567 merged 5 commits into ray-project:master on Feb 8, 2024

Conversation

@rkooo567 (Contributor) commented on Feb 2, 2024

Why are these changes needed?

Root cause: When we schedule a task, we allocate resources and start a worker. If a cancel bundle request is received before the worker has started, there's a resource leak, because bundle cancellation kills the worker and returns resources only from "workers that are already started".

This PR fixes the issue by retrying the cancellation. This also means:

  • If a worker starts late (it has a 60-second timeout), the retry can still fail because there is a maximum number of retries. We retry for a very long time (10 * the registration timeout), so this is unlikely to happen.

Alternatively, to improve consistency, we could also:

  • Register the removed placement group and keep deleting its resources (with a reconciler) until it is fully gone.
  • Register leased workers before the workers are started. This could be a better solution, but the implications of this change are unknown.

I chose the current solution because it is also needed to handle network issues.
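
For reference, here is a rough standalone sketch of the retry idea (not the actual Ray code; TryCancelBundle and CancelBundleWithRetry are hypothetical names used only for illustration):

#include <chrono>
#include <iostream>
#include <thread>

// Hypothetical stand-in for the cancel-resource-reserve RPC to the raylet. In the
// race described above, the cancel arrives before the worker has started, so the
// early attempts cannot return the worker's resources and must be retried.
bool TryCancelBundle(int attempt) {
  return attempt >= 3;  // pretend the worker has registered by the 4th attempt
}

// Keep re-sending the cancel request until the bundle's resources are actually
// returned, or until a (large) retry budget is exhausted.
void CancelBundleWithRetry(int max_retry) {
  for (int attempt = 0; attempt < max_retry; ++attempt) {
    if (TryCancelBundle(attempt)) {
      std::cout << "bundle resources returned after " << attempt + 1 << " attempt(s)\n";
      return;
    }
    std::this_thread::sleep_for(std::chrono::seconds(1));  // fixed 1-second interval
  }
  std::cout << "gave up after " << max_retry << " attempts; the resources would leak\n";
}

int main() {
  // In this PR, max_retry is 10 * RayConfig::instance().worker_register_timeout_seconds().
  CancelBundleWithRetry(/*max_retry=*/600);
}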

Related issue number

Closes #26761

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rkooo567 rkooo567 requested a review from a team as a code owner February 2, 2024 12:04
@rynewang (Contributor) commented on Feb 2, 2024

can we have a unit test, e.g. start a pg and kill it right away, and see if there's any leak?

@rkooo567 (Contributor, Author) commented on Feb 2, 2024

@rynewang I am figuring that out. This requires very subtle timing (indeed, I could only repro it when I created a very large number of pgs), so writing a unit test is not trivial. Writing a C++ unit test is even harder because of how our code is structured.

// Retry 10 * worker registration timeout to avoid race condition.
// See https://github.com/ray-project/ray/pull/42942
// for more details.
/*max_retry*/ RayConfig::instance().worker_register_timeout_seconds() * 10,
@rkooo567 (Contributor, Author):
Q: should we do exponential backoff? Right now, we retry every 1 second.

@rkooo567 changed the title from "[WIP] Fix placement group leaks" to "[Core] Fix placement group leaks" on Feb 6, 2024
@rkooo567 (Contributor, Author) commented on Feb 6, 2024

@jjyao it is ready to be reviewed

@rkooo567 (Contributor, Author) commented on Feb 6, 2024

Q: Right now, we retry for 10 minutes with a 1-second interval. Should we

  1. retry with exponential backoff (see the sketch below)?
  2. retry for a longer time, like an hour or indefinitely?
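
For option 1, a capped exponential backoff could look roughly like the sketch below (a hypothetical helper, not part of this PR; the constants are arbitrary):

#include <algorithm>
#include <chrono>
#include <cstdint>

// Delay before the given retry attempt: double from 1 second up to a cap, so a
// late worker registration is still covered without sending a cancel RPC every
// second for the whole retry window.
std::chrono::milliseconds BackoffDelay(uint32_t attempt) {
  constexpr int64_t kBaseMs = 1000;  // current fixed interval: 1 second
  constexpr int64_t kCapMs = 30000;  // arbitrary cap: 30 seconds
  const int64_t delay_ms = kBaseMs << std::min<uint32_t>(attempt, 5);  // 1s, 2s, ..., 32s
  return std::chrono::milliseconds(std::min(delay_ms, kCapMs));
}

Whether that extra complexity is worth it over the fixed 1-second interval is exactly the open question here.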

@jjyao (Contributor) commented on Feb 6, 2024

> There could be transient state where negative resources can exist.

When can it happen?

@rkooo567 (Contributor, Author) commented on Feb 6, 2024

> There could be transient state where negative resources can exist.

When a worker is not started yet!

const rpc::CancelResourceReserveReply &reply) {
RAY_LOG(DEBUG) << "Finished cancelling the resource reserved for bundle: "
<< bundle_spec->DebugString() << " at node " << node_id;
[this, bundle_spec, node_id, node, max_retry, current_retry_cnt](
Reviewer (Contributor):
Where do you use the max retry?

@rkooo567 (Contributor, Author):
Oops, let me fix this. I will also add a unit test.

@@ -59,8 +59,6 @@ void ClusterResourceManager::AddOrUpdateNode(

void ClusterResourceManager::AddOrUpdateNode(scheduling::NodeID node_id,
const NodeResources &node_resources) {
RAY_LOG(DEBUG) << "Update node info, node_id: " << node_id.ToInt()
Reviewer (Contributor):
Why remove?

@rkooo567 (Contributor, Author):

This seems useless and too verbose. I can bring it back if you think it is necessary

@rkooo567 (Contributor, Author) commented on Feb 7, 2024

@jjyao it'd be great if you could approve the PR. I will add the changes and then merge it.

@rkooo567 merged commit 4b38bfc into ray-project:master on Feb 8, 2024
8 of 9 checks passed
ratnopamc pushed a commit to ratnopamc/ray that referenced this pull request on Feb 11, 2024
rkooo567 added a commit to rkooo567/ray that referenced this pull request on Feb 12, 2024
aslonnie pushed a commit that referenced this pull request on Feb 13, 2024
tterrysun pushed a commit to tterrysun/ray that referenced this pull request on Feb 14, 2024
Development

Successfully merging this pull request may close these issues.

[Core] negative resources
3 participants