Bug Report: Tablets in RESTORE and BACKUP cause SwitchTraffic to fail #15630
Overview of the Issue
When using MoveTables, a call is made to RefreshTabletsByShard, which attempts to update the topology server for all tablets in the shard. Tablets in RESTORE and BACKUP take out a lock for the duration of their respective restores and backups, which causes TabletManager.RefreshState to fail because it cannot obtain the lock.

From my understanding, the RefreshTabletsByShard functionality used by SwitchTraffic should filter out BACKUP and RESTORE pods. Once one pod fails tablet refresh, all other pods not yet processed are also reported as failing tablet refresh.

In our environments we have some large keyspaces with 32 shards, and the backup time for each shard (run as a cronjob) is randomly selected to prevent performance issues. For highly sharded, large keyspaces where backups can take hours, this leaves only very small windows in which no shard is backing up and a team can therefore call SwitchTraffic successfully.
Reproduction Steps
RESTORE or BACKUP
RefreshTabletsByShard fails on BACKUP and RESTORE pods, preventing traffic from switching
Binary Version
Operating System and Environment details
Log Fragments