Bug Report: [vtgate] RemoveTablet during TabletExternallyReparented event can lead to healthcheck corruption #16373

Open
arthurschreiber opened this issue Jul 12, 2024 · 0 comments · May be fixed by #16371


arthurschreiber commented Jul 12, 2024

Overview of the Issue

During external reparent events, a shard can be in a state where two vttablet processes are running as PRIMARY tablets. vtgates consult a so-called PrimaryTermStartTimestamp to break ties in this situation, and will exclusively prefer the PRIMARY tablet with the highest PrimaryTermStartTimestamp to send queries to.
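
To make the tie-break described above concrete, here is a minimal sketch of the selection rule. It is not the actual vtgate healthcheck code: the `tablet` struct, the `pickPrimary` helper, and the tablet aliases are hypothetical and only illustrate "prefer the PRIMARY with the highest PrimaryTermStartTimestamp".

```go
package main

import "fmt"

// tablet is a hypothetical, simplified stand-in for what a vtgate
// knows about a tablet. It is not the real Vitess type.
type tablet struct {
	Alias                     string
	IsPrimary                 bool
	PrimaryTermStartTimestamp int64 // Unix seconds; 0 for non-PRIMARY tablets
}

// pickPrimary returns the PRIMARY tablet with the highest
// PrimaryTermStartTimestamp. The boolean is false if no tablet
// currently reports itself as PRIMARY.
func pickPrimary(tablets []tablet) (tablet, bool) {
	var best tablet
	found := false
	for _, t := range tablets {
		if !t.IsPrimary {
			continue
		}
		if !found || t.PrimaryTermStartTimestamp > best.PrimaryTermStartTimestamp {
			best = t
			found = true
		}
	}
	return best, found
}

func main() {
	// Two tablets report PRIMARY during an external reparent;
	// the one with the newer term start should win.
	tablets := []tablet{
		{Alias: "demoted", IsPrimary: true, PrimaryTermStartTimestamp: 1720000000},
		{Alias: "promoted", IsPrimary: true, PrimaryTermStartTimestamp: 1720000600},
		{Alias: "replica"},
	}
	if p, ok := pickPrimary(tablets); ok {
		fmt.Println("route @primary queries to", p.Alias)
	}
}
```

In a healthy vtgate, only the tablet selected this way should be a target for @primary queries, even while the demoted primary still reports itself as PRIMARY.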

How long multiple vttablet processes can be seen in the PRIMARY role is influenced by the --shutdown_grace_period flag, which until version 20.0 was 0, meaning potentially "indefinite" (although in practice it is usually limited by the transaction timeout).

If, during the time that a vtgate is seeing two PRIMARY tablets running, a tablet deletion is recognized (this can be any other tablet in the shard), the vtgate healthcheck would go into an invalid state where both the demoted and promoted primary tablets are seen as valid targets for @primary queries. This could lead to queries being silently sent to the wrong tablet (and silently being retried, so most of the time this error state was completely invisible).

But we've also seen cases where vtgate processes ended up trying to send queries to the demoted primary exclusively, causing all DML queries processed by these vtgates to fail.

Once in this invalid state, I don't think the vtgate can leave it until the affected tablet is deleted from the topology. Restarting the vttablet process on the demoted primary would cause it to start back up as a REPLICA, but it would still be seen as a valid candidate for @primary queries. Restarting / replacing an affected vtgate process would be another option to get the affected vtgate processes into a healthy state again.

Reproduction Steps

N/A

Binary Version

v17 and later

Operating System and Environment details

N/A

Log Fragments

N/A
@arthurschreiber arthurschreiber added the Type: Bug and Needs Triage labels Jul 12, 2024
@arthurschreiber arthurschreiber added the Component: VTGate label and removed the Needs Triage label Jul 12, 2024
@arthurschreiber arthurschreiber self-assigned this Jul 12, 2024