If PG 2pc commit fails, don't call CommitBundleResources. #43405

Merged: 2 commits, Feb 26, 2024

@rynewang rynewang commented Feb 23, 2024

The placement group manager crashes when events occur in this order:

  1. The manager sends PREPARE to all nodes.
  2. It receives PREPARE replies from all nodes.
  3. One node dies.
  4. The manager sends COMMIT to all nodes.
  5. For the dead node, the manager knows the node is dead but still calls CommitBundleResources.
  6. CommitBundleResources asserts that the node always exists, and the check fails.

This PR removes the call to CommitBundleResources in step 5.

It also fixes some unit tests that never worked (they always passed) and adds a test case specifically for this code path.
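
For reviewers who want the sequence in one place, here is a minimal toy model of steps 3 to 6 and of the fix. It is a sketch only, not the GCS code: `ClusterResources`, `commit_phase`, `send_commit`, and `run_example` are hypothetical names invented for illustration, and what the manager does after a failed commit (for example rescheduling) is deliberately left out.

```python
class ClusterResources:
    """Toy stand-in for the cluster resource view (hypothetical, not Ray's class)."""

    def __init__(self, alive_nodes):
        self.alive = set(alive_nodes)
        self.committed = {}  # node_id -> list of committed bundle ids

    def remove_node(self, node_id):
        self.alive.discard(node_id)

    def commit_bundle_resources(self, node_id, bundle_id):
        # Mirrors the check in step 6: the node is assumed to exist.
        assert node_id in self.alive, f"node {node_id} does not exist"
        self.committed.setdefault(node_id, []).append(bundle_id)


def commit_phase(resources, bundles, send_commit):
    """Send COMMIT for every (node_id, bundle_id) pair.

    send_commit(node_id, bundle_id) -> bool is a hypothetical RPC stub that
    returns False when the commit fails, e.g. because the node is dead.
    Returns True only if every commit succeeded.
    """
    all_ok = True
    for node_id, bundle_id in bundles:
        if send_commit(node_id, bundle_id):
            resources.commit_bundle_resources(node_id, bundle_id)
        else:
            # The fix: on a failed commit, skip the resource commit entirely
            # instead of tripping the existence check.
            all_ok = False
    return all_ok


def run_example():
    resources = ClusterResources(["n1", "n2"])
    bundles = [("n1", "b0"), ("n2", "b1")]
    resources.remove_node("n2")  # n2 dies after replying to PREPARE
    ok = commit_phase(
        resources, bundles, lambda node_id, _bundle_id: node_id in resources.alive
    )
    return ok, resources.committed
```

In this model, calling `commit_bundle_resources` unconditionally for `n2` would raise `AssertionError`, which corresponds to the check failure in step 6. With the guard in place, the commit phase reports failure and only the bundle on the live node is committed.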

Fixes #43371.

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang rynewang requested a review from a team as a code owner February 23, 2024 22:30
@jjyao jjyao left a comment

LG. Let's add tests.

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang

Tests added.

@jjyao jjyao merged commit d9e4e8a into ray-project:master Feb 26, 2024
8 of 9 checks passed
@rynewang rynewang deleted the pg-commit-fail branch February 26, 2024 20:18

Successfully merging this pull request may close these issues.

[Core] GCS crashes when PG commit phase failed due to node failure
2 participants