
[Core] Fix placement group leaks #42942

Merged: rkooo567 merged 5 commits into ray-project:master on Feb 8, 2024

Conversation

@rkooo567 (Contributor) commented on Feb 2, 2024

Why are these changes needed?

Root cause: When we schedule a task, we allocate resources and start a worker. If a cancel bundle request is received before the worker has started, there's a resource leak, because bundle cancellation kills the worker and returns resources only from "workers that are already started".

This PR fixes the issue by retrying the cancellation. This also means:

  • If a worker starts late (it has a 60-second timeout), the retry can still fail because there is a maximum number of retries. We retry for a very long time (10 * the registration timeout), so this is unlikely to happen.

Alternatively, to improve consistency, we could also:

  • Register the removed placement group and keep deleting its resources (with a reconciler) until it is fully gone.
  • Register leased workers before the workers are started. This could be a better solution, but the implications of this change are unknown.

I chose the current solution because it is also needed to handle network issues.
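
For reference, here is a rough standalone sketch of the retry idea (not the actual Ray code; TryCancelBundle and CancelBundleWithRetry are hypothetical names used only for illustration):

#include <chrono>
#include <iostream>
#include <thread>

// Hypothetical stand-in for the cancel-resource-reserve RPC to the raylet. In the
// race described above, the cancel arrives before the worker has started, so the
// early attempts cannot return the worker's resources and must be retried.
bool TryCancelBundle(int attempt) {
  return attempt >= 3;  // pretend the worker has registered by the 4th attempt
}

// Keep re-sending the cancel request until the bundle's resources are actually
// returned, or until a (large) retry budget is exhausted.
void CancelBundleWithRetry(int max_retry) {
  for (int attempt = 0; attempt < max_retry; ++attempt) {
    if (TryCancelBundle(attempt)) {
      std::cout << "bundle resources returned after " << attempt + 1 << " attempt(s)\n";
      return;
    }
    std::this_thread::sleep_for(std::chrono::seconds(1));  // fixed 1-second interval
  }
  std::cout << "gave up after " << max_retry << " attempts; the resources would leak\n";
}

int main() {
  // In this PR, max_retry is 10 * RayConfig::instance().worker_register_timeout_seconds().
  CancelBundleWithRetry(/*max_retry=*/600);
}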

Related issue number

Closes #26761

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rkooo567 rkooo567 requested a review from a team as a code owner February 2, 2024 12:04
@rynewang (Contributor) commented on Feb 2, 2024

can we have a unit test, e.g. start a pg and kill it right away, and see if there's any leak?

@rkooo567 (Contributor, Author) commented on Feb 2, 2024

@rynewang I am figuring that out. This requires very subtle timing (indeed, I could only repro it when I created a very large number of pgs), so writing a unit test is not trivial. Writing a C++ unit test is even harder because of how our code is structured.

// Retry 10 * worker registration timeout to avoid race condition.
// See https://github.com/ray-project/ray/pull/42942
// for more details.
/*max_retry*/ RayConfig::instance().worker_register_timeout_seconds() * 10,
@rkooo567 (Contributor, Author):
Q: should we do exponential backoff? Right now, we retry every 1 second.

@rkooo567 changed the title from "[WIP] Fix placement group leaks" to "[Core] Fix placement group leaks" on Feb 6, 2024
@rkooo567 (Contributor, Author) commented on Feb 6, 2024

@jjyao it is ready to be reviewed

@rkooo567 (Contributor, Author) commented on Feb 6, 2024

Q: Right now, we retry for 10 minutes with a 1-second interval. Should we

  1. retry with exponential backoff (see the sketch below)?
  2. retry for a longer time, like an hour or indefinitely?
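
For option 1, a capped exponential backoff could look roughly like the sketch below (a hypothetical helper, not part of this PR; the constants are arbitrary):

#include <algorithm>
#include <chrono>
#include <cstdint>

// Delay before the given retry attempt: double from 1 second up to a cap, so a
// late worker registration is still covered without sending a cancel RPC every
// second for the whole retry window.
std::chrono::milliseconds BackoffDelay(uint32_t attempt) {
  constexpr int64_t kBaseMs = 1000;  // current fixed interval: 1 second
  constexpr int64_t kCapMs = 30000;  // arbitrary cap: 30 seconds
  const int64_t delay_ms = kBaseMs << std::min<uint32_t>(attempt, 5);  // 1s, 2s, ..., 32s
  return std::chrono::milliseconds(std::min(delay_ms, kCapMs));
}

Whether that extra complexity is worth it over the fixed 1-second interval is exactly the open question here.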

@jjyao (Contributor) commented on Feb 6, 2024

> There could be transient state where negative resources can exist.

When can it happen?

@rkooo567 (Contributor, Author) commented on Feb 6, 2024

> There could be transient state where negative resources can exist.

When a worker is not started yet!

const rpc::CancelResourceReserveReply &reply) {
RAY_LOG(DEBUG) << "Finished cancelling the resource reserved for bundle: "
<< bundle_spec->DebugString() << " at node " << node_id;
[this, bundle_spec, node_id, node, max_retry, current_retry_cnt](
Reviewer (Contributor):
Where do you use the max retry?

@rkooo567 (Contributor, Author):
Oops, let me fix this. I will also add a unit test.

@@ -59,8 +59,6 @@ void ClusterResourceManager::AddOrUpdateNode(

void ClusterResourceManager::AddOrUpdateNode(scheduling::NodeID node_id,
const NodeResources &node_resources) {
RAY_LOG(DEBUG) << "Update node info, node_id: " << node_id.ToInt()
Reviewer (Contributor):
Why remove?

@rkooo567 (Contributor, Author):

This seems useless and too verbose. I can bring it back if you think it is necessary

@rkooo567 (Contributor, Author) commented on Feb 7, 2024

@jjyao it'd be great if you could approve the PR. I will add the changes and then merge it.

@rkooo567 merged commit 4b38bfc into ray-project:master on Feb 8, 2024
8 of 9 checks passed
ratnopamc pushed a commit to ratnopamc/ray that referenced this pull request on Feb 11, 2024
rkooo567 added a commit to rkooo567/ray that referenced this pull request on Feb 12, 2024
aslonnie pushed a commit that referenced this pull request on Feb 13, 2024
tterrysun pushed a commit to tterrysun/ray that referenced this pull request on Feb 14, 2024
Development

Successfully merging this pull request may close these issues.

[Core] negative resources
3 participants