[Platform] Additional upgrade checks for cluster health before moving on to next node #8889

Open
iSignal opened this issue Jun 14, 2021 · 0 comments
Labels
area/platform Yugabyte Platform

iSignal commented Jun 14, 2021

Currently, we check the following before completing the upgrade operation on one node and moving on to the next node:

  • Check that the restarted tserver responds to an RPC call
  • Check that the restarted tserver heartbeats to the leader master
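As a rough illustration of how a per-node gate like this could be expressed, here is a minimal sketch that polls a list of checks until they all pass or a timeout expires. The helper name `wait_for_checks`, the check labels, and the timeout defaults are assumptions made for the example; they are not taken from the Platform code.

```python
import time
from typing import Callable, Sequence, Tuple

# A check is a (label, probe) pair; the probe returns True once the check passes.
# The probes themselves (RPC call, heartbeat lookup) are hypothetical and would be
# supplied by the caller.
Check = Tuple[str, Callable[[], bool]]


def wait_for_checks(
    checks: Sequence[Check],
    timeout_s: float = 300.0,  # assumed default, not from the issue
    poll_s: float = 5.0,       # assumed default, not from the issue
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Block until every check passes, or raise TimeoutError naming the failures."""
    deadline = clock() + timeout_s
    while True:
        failing = [label for label, probe in checks if not probe()]
        if not failing:
            return
        if clock() >= deadline:
            raise TimeoutError("checks still failing: " + ", ".join(failing))
        sleep(poll_s)
```

The upgrade task would call this with one probe per bullet above and only move on to the next node when it returns without raising.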

With smart AMI upgrades and smart instance type changes, nodes may stay down for longer periods of time. This calls for additional checks on the tserver before moving on:

  • Check that the restarted tserver reclaims its tablets if it was down for > 15 mins and its tablets were unassigned as a result.
  • Check that the restarted tserver catches up via WAL with its corresponding tablet leaders if it was down for < 15 mins but has not yet caught up.
  • Potentially another check for masters, in case they were kicked out of the cluster.
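To make the branching concrete, here is a small sketch of how the additional checks could be selected from a node's state. The `NodeState` fields and the check labels are hypothetical names chosen for illustration; only the 15-minute threshold and the three conditions come from the list above.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List

# The 15-minute figure quoted in the issue text.
UNASSIGN_THRESHOLD = timedelta(minutes=15)


@dataclass
class NodeState:
    """Hypothetical snapshot of a restarted node; field names are illustrative."""
    downtime: timedelta
    tablets_unassigned: bool
    wal_caught_up: bool
    is_master: bool = False


def additional_checks(state: NodeState) -> List[str]:
    """Return labels of the extra checks to run before moving to the next node."""
    checks: List[str] = []
    if state.downtime > UNASSIGN_THRESHOLD and state.tablets_unassigned:
        # Down long enough that tablets were unassigned: wait for them to come back.
        checks.append("tablets_reclaimed")
    elif state.downtime < UNASSIGN_THRESHOLD and not state.wal_caught_up:
        # Short outage: the node keeps its tablets but must catch up via WAL.
        checks.append("wal_caught_up")
    # Cases the issue does not describe (e.g. exactly 15 minutes) add no tserver check.
    if state.is_master:
        # The issue only says "potentially"; treated here as an optional extra check.
        checks.append("master_still_member")
    return checks
```

Each returned label would map to a concrete probe, which is left out here because the issue does not say how these conditions would be detected.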

#8882

Projects
Platform
  
Backlog