EmergencyReparentShard still trying to get replication status from broken primary VTTablet #7559

Closed
artemvovk opened this issue Feb 26, 2021 · 6 comments

Comments

@artemvovk

Overview of the Issue

During network failure/corruption on VTTablet pods, it is possible for a keyspace to enter an unrecoverable state. This happens when the primary VTTablet has lost its mysql container and the EmergencyReparentShard command still attempts to get replication status from the failed primary.
Potentially related: #7523

Reproduction Steps

Steps to reproduce this issue:

  1. Deploy a simple keyspace with a primary and two replicas:
uscentral1c0-1455014542 recurly - replica 10.171.181.32:15000 10.171.181.32:3306 [] <null>
uscentral1f0-1088525795 recurly - master 10.171.182.190:15000 10.171.182.190:3306 [] 2021-02-24T16:50:07Z
uscentral1f1-0375369489 recurly - replica 10.171.133.5:15000 10.171.133.5:3306 [] <null>
  2. Simulate network corruption using something like https://docs.litmuschaos.io/docs/pod-network-corruption/ and target the VTTablet pods.
  3. Check the VTGate logs as it starts marking VTTablets unhealthy due to timeouts and exhausted connections:
vttablet: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.171.155.196:15999: i/o timeout"
  4. The VTTablet logs will report something like:
"Mysqld.Start(1614185326) stderr: 2021-02-26T01:31:12.291919Z 68364 [Note] Got timeout reading communication packets"
...
"Failed to open topo server (etcd2,vitess-cluster-etcd-9819e775-client.qa4.svc:2379,/vitess/vitess-cluster/global): context deadline exceeded"
  5. At this point the primary VTTablet pod shows only 2/3 containers running:
    (screenshot: primary VTTablet pod with 2/3 containers ready)
  6. Probably a good time to try to recover the keyspace, because we're hard down:
vtctlclient -server host.docker.internal:15999 EmergencyReparentShard -keyspace_shard=recurly/- -new_master=uscentral1c0-1455014542
W0226 01:35:59.916871       1 main.go:64] W0226 01:35:59.885795 reparent.go:1157] failed to get replication status from uscentral1f0-1088525795: rpc error: code = Canceled desc = context canceled
E0226 01:36:00.067890       1 main.go:67] remote error: rpc error: code = Unknown desc = no valid candidates for emergency reparent
[18:36:00]artemavovk@artemavovk-C02TP681HTD6:rekube (DBA-222_moreqa)/ docker run --rm --network=host vitess/vtctlclient:latest vtctlclient -server host.docker.internal:15999 EmergencyReparentShard -keyspace_shard=recurly/- -new_master=uscentral1f1-0375369489
W0226 01:36:25.560732       1 main.go:64] W0226 01:36:25.268453 reparent.go:1157] failed to get replication status from uscentral1f0-1088525795: rpc error: code = Canceled desc = context canceled
E0226 01:36:25.567484       1 main.go:67] remote error: rpc error: code = Unknown desc = no valid candidates for emergency reparent
  7. As you can see, EmergencyReparentShard fails, and retrying with the broken primary listed in -ignore_replicas (see the command sketch after this list) still fails:
failed to get replication status from uscentral1c0-1455014542: rpc error: code = Canceled desc = context canceled
  8. This keeps failing even after the network corruption event has passed. Thirty minutes after the network is healthy, the keyspace is still broken and hard down.
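
The exact -ignore_replicas retry from step 7 is not shown verbatim above; assuming the standard EmergencyReparentShard flag syntax, it presumably looked roughly like this (tablet aliases taken from the listing in step 1):
vtctlclient -server host.docker.internal:15999 EmergencyReparentShard -keyspace_shard=recurly/- -new_master=uscentral1f1-0375369489 -ignore_replicas=uscentral1f0-1088525795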

Binary version

Built from the v9.0 release SHA:

Version: daa608598 (Git branch 'HEAD') built on Wed Jan 27 22:05:48 UTC 2021 by vitess@d00a879dec03 using go1.15.6 linux/amd64

Operating system and Environment details


  • Operating system (output of cat /etc/os-release):
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel version (output of uname -sr):
Linux 4.19.150+
  • Architecture (output of uname -m):
x86_64

Log Fragments

(see above)

@deepthi deepthi self-assigned this Feb 26, 2021
@deepthi deepthi added this to To do in Cluster Management Feb 27, 2021
@deepthi deepthi removed their assignment Mar 1, 2021
@deepthi deepthi moved this from To do to In progress in Cluster Management Mar 13, 2021
@vkozjak

vkozjak commented Apr 27, 2021

Adding my work notes: I'm getting the same issue, "failed to get replication status from" the failed master. The control panel shows two masters after the error.
(attachments: output.txt and a screenshot showing two masters)

vttablet.out:

W0427 18:32:10.407932   17179 rpc_server.go:80] TabletManager.PopulateReparentJournal(time_created_ns:1619548300406047238 action_name:"EmergencyReparentShard" master_alias:<cell:"zone1" uid:501 > replication_position:"MySQL56/2dd7759c-a784-11eb-987d-06f2fb527910:1-19" )(on zone1-0000000501 from ) error: ExecuteFetch(CREATE DATABASE IF NOT EXISTS _vt) failed: context canceled

@vkozjak

vkozjak commented Apr 28, 2021

One more note from vttablet.out

W0428 17:53:39.331063    1495 rpc_server.go:80] TabletManager.StopReplicationAndGetStatus(stop_replication_mode:IOTHREADONLY )(on zone1-0000000500 from ) error: before status failed: no replication status
W0428 17:53:45.411195    1495 reparent.go:88] WaitForReparentJournal failed to see row before timeout.
W0428 17:53:45.411247    1495 rpc_server.go:80] TabletManager.SetMaster(parent:<cell:"zone1" uid:501 > time_created_ns:1619632419380027938 )(on zone1-0000000500 from ) error: context canceled

@deepthi deepthi moved this from In progress to To do in Cluster Management Jun 28, 2021
@deepthi
Member

deepthi commented Jun 29, 2021

@vkozjak which version of vitess did you run this with?

@deepthi deepthi removed their assignment Jul 14, 2021
@beneditomarques

Same issue here, and I'm using vitess-v11.0.0-aa798b8.tar.gz.

@GuptaManan100
Member

Since ERS is a manual command, it is meant to fail in case of any errors. If you know that a tablet is down and its replication status will be broken, you can run ERS and specify the broken tablets as a comma-separated list in the -ignore_replicas command-line flag; see the sketch below.
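
A minimal sketch of that suggestion, assuming the v9+ vtctlclient flag syntax (the server address and tablet aliases are placeholders):
vtctlclient -server <vtctld-host:port> EmergencyReparentShard -keyspace_shard=<keyspace/shard> -new_master=<healthy-tablet-alias> -ignore_replicas=<broken-tablet-alias-1>,<broken-tablet-alias-2>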

@GuptaManan100
Member

Closing the issue for now. Please reopen if further information or discussion is required.

Cluster Management automation moved this from To do to Done Sep 23, 2021