[Platform] Additional upgrade checks for cluster health before moving on to next node #8889

iSignal · 2021-06-14T20:45:47Z

Currently, we check the following before completing the upgrade operation on one node and moving on to the next node:

Check that the restarted tserver responds to an RPC call
Check that the restarted tserver heartbeats to the leader master
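
A minimal sketch of how a per-node gate like this could be structured, assuming each check is exposed as a callable that returns a boolean. The names `wait_for_checks`, `rpc`, and `heartbeat` are hypothetical and are not taken from the Platform code; how each check queries the tserver or master is left out.

```python
import time
from typing import Callable, Dict, List


def wait_for_checks(
    checks: Dict[str, Callable[[], bool]],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Poll every check until all pass or the timeout expires.

    Returns the names of the checks that are still failing. An empty
    list means the upgrade may move on to the next node.
    """
    deadline = clock() + timeout_s
    while True:
        # Re-evaluate every check on each pass, since a node can regress.
        failing = [name for name, check in checks.items() if not check()]
        if not failing or clock() >= deadline:
            return failing
        sleep(interval_s)
```

A caller would pass something like `{"rpc": tserver_responds, "heartbeat": tserver_heartbeats_to_master}` (both hypothetical callables) and stop the rolling upgrade if the returned list is not empty.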

In the case of smart AMI upgrades and smart instance type changes, we will potentially be keeping nodes down for longer periods of time. This calls for additional checks on the tserver before moving on:

Check that the restarted tserver reclaims its tablets if it was down for more than 15 minutes and its tablets were unassigned as a result.
Check that the restarted tserver has caught up via WAL with its corresponding tablet leaders if it was down for less than 15 minutes but is still behind.
Potentially another check for masters in case they are kicked out of the cluster.
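
The two tserver checks above can be sketched as a decision on how long the node was down. This is an illustration only: `TserverStatus`, its fields, and `pending_checks` are hypothetical names, and how the status values would be obtained from the cluster is not covered. The masters check is omitted because the issue leaves it open.

```python
from dataclasses import dataclass
from typing import List

# Threshold from the issue text: 15 minutes of downtime.
DOWNTIME_THRESHOLD_S = 15 * 60


@dataclass
class TserverStatus:
    """Hypothetical snapshot of a restarted tserver (assumed inputs)."""
    downtime_s: float         # how long the node was down
    tablets_unassigned: bool  # tablets were moved away during downtime
    tablets_reclaimed: bool   # tablets have been assigned back
    wal_caught_up: bool       # replicas have caught up with their leaders


def pending_checks(status: TserverStatus) -> List[str]:
    """Return the additional checks that have not passed yet."""
    pending = []
    if status.downtime_s > DOWNTIME_THRESHOLD_S:
        # Long downtime: tablets may have been unassigned from this node.
        if status.tablets_unassigned and not status.tablets_reclaimed:
            pending.append("reclaim_tablets")
    elif not status.wal_caught_up:
        # Short downtime: tablets stay put, but replicas may lag via WAL.
        # Exactly 15 minutes is treated as "short" here; the issue does
        # not specify that boundary.
        pending.append("wal_catch_up")
    return pending
```

An empty list from `pending_checks` would mean neither additional check is blocking the move to the next node.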

#8882

iSignal added the area/platform Yugabyte Platform label Jun 14, 2021

iSignal assigned shahrooz1997 Jun 14, 2021

iSignal mentioned this issue Jun 14, 2021

[Platform] Smart instance type changes master task #8882

Open

9 tasks

hsu880 added this to Backlog in Platform Jun 15, 2021
