Fix flakyness of replace node e2e #1126

Merged 1 commit on Jan 3, 2023

@zimnx zimnx commented Dec 22, 2022

The full quorum check is sensitive to nodes that are not part of the cluster: as soon as it spots such a node, it returns an error.
In the replace node e2e, the cluster could become ready before the replaced node was acknowledged by the others.
In that case, the full quorum check returned an error because it saw an IP not bound to any cluster Service.

To fix it, the test now waits until the cluster is rolled out and the replaced node is seen by the other one before the full quorum check is validated.

@zimnx zimnx added the kind/flake Categorizes issue or PR as related to a flaky test. label Dec 22, 2022
@zimnx zimnx added this to the v1.8 milestone Dec 22, 2022
@zimnx zimnx requested a review from tnozicka December 22, 2022 13:49

@tnozicka tnozicka left a comment

would the readiness check @rzetelskik is building help here?

zimnx commented Dec 22, 2022

would the readiness check @rzetelskik is building help here?

AFAIR it allows for one DN, we have 2 nodes in this e2e, so it wouldn't help.

rzetelskik commented

AFAIR it allows for one DN, we have 2 nodes in this e2e, so it wouldn't help.

Yes, it wouldn't help in this case. After it's been merged you could just increase the number of nodes in this test to 3 though.

@tnozicka tnozicka added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jan 2, 2023
@zimnx zimnx modified the milestones: v1.8, v1.9 Jan 2, 2023
@zimnx zimnx added kind/flake Categorizes issue or PR as related to a flaky test. and removed kind/flake Categorizes issue or PR as related to a flaky test. labels Jan 2, 2023
@zimnx zimnx enabled auto-merge January 3, 2023 08:51

@tnozicka tnozicka left a comment

/lgtm
/approve

@tnozicka tnozicka merged commit c7a682d into scylladb:master Jan 3, 2023
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.