EmergencyReparentShard still trying to get replication status from broken primary VTTablet #7559

Closed
artemvovk opened this issue Feb 26, 2021 · 6 comments

Comments

@artemvovk

Overview of the Issue

During network failure/corruption on VTTablet pods, it is possible for a keyspace to enter an unrecoverable state. This happens when the primary VTTablet has lost its mysql container and the EmergencyReparentShard command still attempts to get replication status from the failed primary.
Potentially related: #7523

Reproduction Steps

Steps to reproduce this issue:

  1. Deploy a simple keyspace with a primary and two replicas:
uscentral1c0-1455014542 recurly - replica 10.171.181.32:15000 10.171.181.32:3306 [] <null>
uscentral1f0-1088525795 recurly - master 10.171.182.190:15000 10.171.182.190:3306 [] 2021-02-24T16:50:07Z
uscentral1f1-0375369489 recurly - replica 10.171.133.5:15000 10.171.133.5:3306 [] <null>
  2. Simulate network corruption using something like https://docs.litmuschaos.io/docs/pod-network-corruption/ and target the VTTablet pods.
  3. Check the VTGate logs as it starts marking VTTablets unhealthy due to timeouts and exhausted connections:
vttablet: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.171.155.196:15999: i/o timeout"
  4. The VTTablet logs will report something like:
"Mysqld.Start(1614185326) stderr: 2021-02-26T01:31:12.291919Z 68364 [Note] Got timeout reading communication packets"
...
"Failed to open topo server (etcd2,vitess-cluster-etcd-9819e775-client.qa4.svc:2379,/vitess/vitess-cluster/global): context deadline exceeded"
  5. At this point the primary VTTablet pod shows only 2/3 containers running:
    (screenshot: primary VTTablet pod with 2/3 containers ready)
  6. Probably a good time to try to recover the keyspace, because we're hard down:
vtctlclient -server host.docker.internal:15999 EmergencyReparentShard -keyspace_shard=recurly/- -new_master=uscentral1c0-1455014542
W0226 01:35:59.916871       1 main.go:64] W0226 01:35:59.885795 reparent.go:1157] failed to get replication status from uscentral1f0-1088525795: rpc error: code = Canceled desc = context canceled
E0226 01:36:00.067890       1 main.go:67] remote error: rpc error: code = Unknown desc = no valid candidates for emergency reparent
[18:36:00]artemavovk@artemavovk-C02TP681HTD6:rekube (DBA-222_moreqa)/ docker run --rm --network=host vitess/vtctlclient:latest vtctlclient -server host.docker.internal:15999 EmergencyReparentShard -keyspace_shard=recurly/- -new_master=uscentral1f1-0375369489
W0226 01:36:25.560732       1 main.go:64] W0226 01:36:25.268453 reparent.go:1157] failed to get replication status from uscentral1f0-1088525795: rpc error: code = Canceled desc = context canceled
E0226 01:36:25.567484       1 main.go:67] remote error: rpc error: code = Unknown desc = no valid candidates for emergency reparent
  7. As you can see, EmergencyReparentShard fails, and retrying with the broken primary listed in -ignore_replicas (see the command sketch after this list) still fails:
failed to get replication status from uscentral1c0-1455014542: rpc error: code = Canceled desc = context canceled
  8. This keeps failing even after the network corruption event has passed. Thirty minutes after the network is healthy, the keyspace is still broken and hard down.
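
The exact -ignore_replicas retry from step 7 is not shown verbatim above; assuming the standard EmergencyReparentShard flag syntax, it presumably looked roughly like this (tablet aliases taken from the listing in step 1):
vtctlclient -server host.docker.internal:15999 EmergencyReparentShard -keyspace_shard=recurly/- -new_master=uscentral1f1-0375369489 -ignore_replicas=uscentral1f0-1088525795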

Binary version

Built from the v9.0 release SHA:

Version: daa608598 (Git branch 'HEAD') built on Wed Jan 27 22:05:48 UTC 2021 by vitess@d00a879dec03 using go1.15.6 linux/amd64

Operating system and Environment details


  • Operating system (output of cat /etc/os-release):
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel version (output of uname -sr):
Linux 4.19.150+
  • Architecture (output of uname -m):
x86_64

Log Fragments

(see above)

@deepthi deepthi self-assigned this Feb 26, 2021
@deepthi deepthi added this to To do in Cluster Management Feb 27, 2021
@deepthi deepthi removed their assignment Mar 1, 2021
@deepthi deepthi moved this from To do to In progress in Cluster Management Mar 13, 2021
@vkozjak

vkozjak commented Apr 27, 2021

Adding my work notes: I'm getting the same issue, "failed to get replication status from" the failed master. The control panel shows two masters after the error.
(attachments: output.txt and a screenshot showing two masters)

vttablet.out:

W0427 18:32:10.407932   17179 rpc_server.go:80] TabletManager.PopulateReparentJournal(time_created_ns:1619548300406047238 action_name:"EmergencyReparentShard" master_alias:<cell:"zone1" uid:501 > replication_position:"MySQL56/2dd7759c-a784-11eb-987d-06f2fb527910:1-19" )(on zone1-0000000501 from ) error: ExecuteFetch(CREATE DATABASE IF NOT EXISTS _vt) failed: context canceled

@vkozjak

vkozjak commented Apr 28, 2021

One more note from vttablet.out

W0428 17:53:39.331063    1495 rpc_server.go:80] TabletManager.StopReplicationAndGetStatus(stop_replication_mode:IOTHREADONLY )(on zone1-0000000500 from ) error: before status failed: no replication status
W0428 17:53:45.411195    1495 reparent.go:88] WaitForReparentJournal failed to see row before timeout.
W0428 17:53:45.411247    1495 rpc_server.go:80] TabletManager.SetMaster(parent:<cell:"zone1" uid:501 > time_created_ns:1619632419380027938 )(on zone1-0000000500 from ) error: context canceled

@deepthi deepthi moved this from In progress to To do in Cluster Management Jun 28, 2021
@deepthi
Member

deepthi commented Jun 29, 2021

@vkozjak which version of vitess did you run this with?

@deepthi deepthi removed their assignment Jul 14, 2021
@beneditomarques

Same issue here, and I'm using vitess-v11.0.0-aa798b8.tar.gz.

@GuptaManan100
Member

Since ERS is a manual command, it is meant to fail in case of any errors. If you know that a tablet is down and its replication status will be broken, you can run ERS and specify the broken tablets as a comma-separated list in the -ignore_replicas command-line flag; see the sketch below.
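
A minimal sketch of that suggestion, assuming the v9+ vtctlclient flag syntax (the server address and tablet aliases are placeholders):
vtctlclient -server <vtctld-host:port> EmergencyReparentShard -keyspace_shard=<keyspace/shard> -new_master=<healthy-tablet-alias> -ignore_replicas=<broken-tablet-alias-1>,<broken-tablet-alias-2>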

@GuptaManan100
Member

Closing the issue for now. Please reopen if further information or discussion is required.

Cluster Management automation moved this from To do to Done Sep 23, 2021