We've encountered a problem with TabletExternallyReparented: our vtgates continued to connect to the previous PRIMARY tablet even after reparenting.
The vtgate does not recognize the newly restarted tablet pod (which has a new IP address) until the TopologyWatcher calls the loadTablets function to fetch the latest tablets.
As a result, a second TabletExternallyReparented call within tablet_refresh_interval (default: 1 minute) of the SPARE pod restarting can cause a service interruption.
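The race can be sketched in a few lines of Go. This is a minimal, self-contained simulation (not actual Vitess code; the `gateway`/`tablet` types and method names are hypothetical): a polling-based cache, refreshed only once per tablet_refresh_interval, keeps handing out the demoted primary's address until the next loadTablets-style poll.

```go
// Sketch of a stale topology cache, assuming a simplified tablet record
// and a gateway that only re-reads the topology when refresh() is called
// (mimicking TopologyWatcher polling every tablet_refresh_interval).
package main

import "fmt"

// tablet is a hypothetical, simplified stand-in for a topology record.
type tablet struct {
	alias string
	addr  string
	typ   string // "PRIMARY" or "SPARE"
}

// gateway caches the tablet list between refresh ticks.
type gateway struct {
	cache []tablet
}

// refresh replaces the cached tablet list with a snapshot of the topology.
func (g *gateway) refresh(topo []tablet) {
	g.cache = append([]tablet(nil), topo...)
}

// primaryAddr returns the address the gateway would dial for the primary.
func (g *gateway) primaryAddr() (string, bool) {
	for _, t := range g.cache {
		if t.typ == "PRIMARY" {
			return t.addr, true
		}
	}
	return "", false
}

func main() {
	// Topology at the last refresh tick: old primary plus a spare.
	g := &gateway{}
	g.refresh([]tablet{
		{"zone1-4073072872", "10.244.0.37", "PRIMARY"},
		{"zone1-1951951717", "10.244.0.36", "SPARE"},
	})

	// Between ticks: the spare pod restarts with a new IP and is promoted
	// (TabletExternallyReparented); the old primary is demoted.
	topo := []tablet{
		{"zone1-4073072872", "10.244.0.37", "SPARE"},
		{"zone1-1951951717", "10.244.0.38", "PRIMARY"}, // new pod IP
	}

	addr, _ := g.primaryAddr()
	fmt.Println("gateway dials:", addr) // still the demoted 10.244.0.37

	g.refresh(topo) // next tablet_refresh_interval tick
	addr, _ = g.primaryAddr()
	fmt.Println("gateway dials:", addr) // now the real primary, 10.244.0.38
}
```

Until the refresh fires, every query targeted at main.-.primary dials the demoted (or already-gone) address, which matches the connection-refused errors in the logs below.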
# Make sure to set the CNI plugin to kindnet so that the pod's IP address changes every time it restarts, similar to the behavior in the GKE environment.
# This is because a pod in the default single-node minikube cluster does not change its IP address during vttablet rolling updates.
$ minikube start --cni=kindnet --kubernetes-version=v1.19.16 --cpus=4 --memory=8000 --disk-size=40g -p parallel-test
# Make sure that the vitess-operator's image is planetscale/vitess-operator:v2.8.1
$ minikube -p parallel-test kubectl -- apply -f operator.yaml
Deploy the following yaml:
Note: The default value of tablet_refresh_interval makes it difficult to reproduce this issue in just a few attempts. Increasing tablet_refresh_interval makes the issue easier to reproduce.
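For example, un-commenting the gateway extraFlags in the manifest below sets a longer refresh interval; the relevant fragment looks like this:

```yaml
# vtgate cell spec fragment: lengthen the topology refresh window
# (5m instead of the 1m default) to make the stale-primary race easier to hit.
gateway:
  extraFlags:
    tablet_refresh_interval: "5m"
```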
apiVersion: planetscale.com/v2
kind: VitessCluster
metadata:
  name: example
spec:
  images:
    vtctld: vitess/lite:v15.0.1
    vtadmin: vitess/vtadmin:latest
    vtgate: vitess/lite:v15.0.1
    vttablet: vitess/lite:v15.0.1
    vtbackup: vitess/lite:v15.0.1
    mysqld:
      mysql56Compatible: vitess/lite:v15.0.1
    mysqldExporter: prom/mysqld-exporter:v0.11.0
  cells:
    - name: zone1
      gateway:
        authentication:
          static:
            secret:
              name: example-cluster-config
              key: users.json
        # extraFlags:
        #   tablet_refresh_interval: "5m"
        replicas: 1
        resources:
          requests:
            cpu: 100m
            memory: 256Mi
          limits:
            memory: 256Mi
  vitessDashboard:
    cells:
      - zone1
    extraFlags:
      security_policy: read-only
    replicas: 1
    resources:
      limits:
        memory: 128Mi
      requests:
        cpu: 100m
        memory: 128Mi
  vtadmin:
    rbac:
      name: example-cluster-config
      key: rbac.yaml
    cells:
      - zone1
    apiAddresses:
      - http://localhost:14001
    replicas: 1
    readOnly: false
    apiResources:
      limits:
        memory: 128Mi
      requests:
        cpu: 100m
        memory: 128Mi
    webResources:
      limits:
        memory: 128Mi
      requests:
        cpu: 100m
        memory: 128Mi
  keyspaces:
    - name: main
      durabilityPolicy: none
      turndownPolicy: Immediate
      partitionings:
        - equal:
            parts: 1
            shardTemplate:
              databaseInitScriptSecret:
                name: example-cluster-config
                key: init_db.sql
              replication:
                enforceSemiSync: false
              tabletPools:
                - cell: zone1
                  type: externalmaster
                  replicas: 2
                  vttablet:
                    extraFlags:
                      db_charset: utf8mb4_unicode_ci
                      log_queries_to_file: vt/vtdataroot/queries.log
                      queryserver-config-pool-size: "10"
                    resources:
                      limits:
                        memory: 256Mi
                      requests:
                        cpu: 100m
                        memory: 256Mi
                  externalDatastore:
                    user: root
                    host: mysql
                    port: 3306
                    database: main
                    credentialsSecret:
                      name: example-cluster-config
                      key: ext_db_credentials_secret.json
  updateStrategy:
    type: Immediate
---
apiVersion: v1
kind: Secret
metadata:
  name: example-cluster-config
type: Opaque
stringData:
  users.json: |
    {
      "user": [{
        "UserData": "user",
        "Password": ""
      }]
    }
  init_db.sql: |
    # This file is executed immediately after mysql_install_db,
    # to initialize a fresh data directory.

    ###############################################################################
    # Equivalent of mysql_secure_installation
    ###############################################################################
    # Changes during the init db should not make it to the binlog.
    # They could potentially create errant transactions on replicas.
    SET sql_log_bin = 0;
    # Remove anonymous users.
    DELETE FROM mysql.user WHERE User = '';
    # Disable remote root access (only allow UNIX socket).
    DELETE FROM mysql.user WHERE User = 'root' AND Host != 'localhost';
    # Remove test database.
    DROP DATABASE IF EXISTS test;

    ###############################################################################
    # Vitess defaults
    ###############################################################################
    # Vitess-internal database.
    CREATE DATABASE IF NOT EXISTS _vt;
    # Note that definitions of local_metadata and shard_metadata should be the same
    # as in production which is defined in go/vt/mysqlctl/metadata_tables.go.
    CREATE TABLE IF NOT EXISTS _vt.local_metadata (
      name VARCHAR(255) NOT NULL,
      value VARCHAR(255) NOT NULL,
      db_name VARBINARY(255) NOT NULL,
      PRIMARY KEY (db_name, name)
    ) ENGINE=InnoDB;
    CREATE TABLE IF NOT EXISTS _vt.shard_metadata (
      name VARCHAR(255) NOT NULL,
      value MEDIUMBLOB NOT NULL,
      db_name VARBINARY(255) NOT NULL,
      PRIMARY KEY (db_name, name)
    ) ENGINE=InnoDB;

    # Admin user with all privileges.
    CREATE USER 'vt_dba'@'localhost';
    GRANT ALL ON *.* TO 'vt_dba'@'localhost';
    GRANT GRANT OPTION ON *.* TO 'vt_dba'@'localhost';

    # User for app traffic, with global read-write access.
    CREATE USER 'vt_app'@'localhost';
    GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, RELOAD, PROCESS, FILE,
      REFERENCES, INDEX, ALTER, SHOW DATABASES, CREATE TEMPORARY TABLES,
      LOCK TABLES, EXECUTE, REPLICATION CLIENT, CREATE VIEW,
      SHOW VIEW, CREATE ROUTINE, ALTER ROUTINE, CREATE USER, EVENT, TRIGGER
      ON *.* TO 'vt_app'@'localhost';

    # User for app debug traffic, with global read access.
    CREATE USER 'vt_appdebug'@'localhost';
    GRANT SELECT, SHOW DATABASES, PROCESS ON *.* TO 'vt_appdebug'@'localhost';

    # User for administrative operations that need to be executed as non-SUPER.
    # Same permissions as vt_app here.
    CREATE USER 'vt_allprivs'@'localhost';
    GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, RELOAD, PROCESS, FILE,
      REFERENCES, INDEX, ALTER, SHOW DATABASES, CREATE TEMPORARY TABLES,
      LOCK TABLES, EXECUTE, REPLICATION SLAVE, REPLICATION CLIENT, CREATE VIEW,
      SHOW VIEW, CREATE ROUTINE, ALTER ROUTINE, CREATE USER, EVENT, TRIGGER
      ON *.* TO 'vt_allprivs'@'localhost';

    # User for slave replication connections.
    # TODO: Should we set a password on this since it allows remote connections?
    CREATE USER 'vt_repl'@'%';
    GRANT REPLICATION SLAVE ON *.* TO 'vt_repl'@'%';

    # User for Vitess filtered replication (binlog player).
    # Same permissions as vt_app.
    CREATE USER 'vt_filtered'@'localhost';
    GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, RELOAD, PROCESS, FILE,
      REFERENCES, INDEX, ALTER, SHOW DATABASES, CREATE TEMPORARY TABLES,
      LOCK TABLES, EXECUTE, REPLICATION SLAVE, REPLICATION CLIENT, CREATE VIEW,
      SHOW VIEW, CREATE ROUTINE, ALTER ROUTINE, CREATE USER, EVENT, TRIGGER
      ON *.* TO 'vt_filtered'@'localhost';

    # User for Orchestrator (https://github.com/openark/orchestrator).
    # TODO: Reenable when the password is randomly generated.
    CREATE USER 'orc_client_user'@'%' IDENTIFIED BY 'orc_client_user_password';
    GRANT SUPER, PROCESS, REPLICATION SLAVE, RELOAD
      ON *.* TO 'orc_client_user'@'%';
    GRANT SELECT ON _vt.* TO 'orc_client_user'@'%';

    FLUSH PRIVILEGES;

    RESET SLAVE ALL;
    RESET MASTER;
  rbac.yaml: |
    rules:
      - resource: "*"
        actions:
          - "get"
          - "create"
          - "put"
          - "ping"
        subjects: ["*"]
        clusters: ["*"]
      - resource: "Shard"
        actions:
          - "emergency_reparent_shard"
          - "planned_reparent_shard"
        subjects: ["*"]
        clusters:
          - "local"
  orc_config.json: |
    {
      "Debug": true,
      "MySQLTopologyUser": "orc_client_user",
      "MySQLTopologyPassword": "orc_client_user_password",
      "MySQLReplicaUser": "vt_repl",
      "MySQLReplicaPassword": "",
      "RecoveryPeriodBlockSeconds": 5
    }
  ext_db_credentials_secret.json: |
    {
      "root": ["password"]
    }
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-pv-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
---
apiVersion: v1
kind: Service
metadata:
  name: mysql
  labels:
    app: mysql
spec:
  ports:
    - port: 3306
  selector:
    app: mysql
  clusterIP: None
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysql
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
        - name: mysql
          image: mysql:5.7
          env:
            - name: MYSQL_ROOT_PASSWORD
              value: password
          ports:
            - containerPort: 3306
          volumeMounts:
            - name: mysql-persistent-storage
              mountPath: /var/lib/mysql
      volumes:
        - name: mysql-persistent-storage
          persistentVolumeClaim:
            claimName: mysql-pv-claim
# Issue the query in whichever way you prefer
# ex. mysql -h 127.0.0.1 -P 15306 -u user --table --execute="select * from customer;"
2023-03-13 15:59:07.25799 +0900
#<Mysql2::Error::TimeoutError: [mysql_127.0.0.1:15306] Timeout waiting for a response from the last query. (waited 3 seconds)>
2023-03-13 15:59:11.282234 +0900
#<Mysql2::Error::TimeoutError: [mysql_127.0.0.1:15306] Timeout waiting for a response from the last query. (waited 3 seconds)>
2023-03-13 15:59:15.304278 +0900
#<Mysql2::Error: target: main.-.primary: vttablet: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.244.0.37:15999: connect: connection refused">
2023-03-13 15:59:18.287796 +0900
#<Mysql2::Error: target: main.-.primary: vttablet: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.244.0.37:15999: connect: connection refused">
2023-03-13 15:59:19.316975 +0900
#<Mysql2::Error: target: main.-.primary: vttablet: Connection Closed>
...
2023-03-13 15:59:35.7701 +0900
#<Mysql2::Error: target: main.-.primary: vttablet: Connection Closed>
$ watch -n 1 'mysql -h 127.0.0.1 -P 15306 -u user --table --execute="show vitess_tablets" | tee -a /tmp/vitess_tablets.txt; echo `date` | tee -a /tmp/vitess_tablets.txt'
...
+-------+----------+-------+------------+-------------+------------------+-------------+----------------------+
| Cell | Keyspace | Shard | TabletType | State | Alias | Hostname | PrimaryTermStartTime |
+-------+----------+-------+------------+-------------+------------------+-------------+----------------------+
| zone1 | main     | -     | PRIMARY    | SERVING     | zone1-4073072872 | 10.244.0.37 | 2023-03-13T06:58:09Z |
| zone1 | main     | -     | SPARE      | NOT_SERVING | zone1-1951951717 | 10.244.0.36 |                      |
+-------+----------+-------+------------+-------------+------------------+-------------+----------------------+
Mon Mar 13 15:59:07 JST 2023
# Errors start here
+-------+----------+-------+------------+-------------+------------------+-------------+----------------------+
| Cell | Keyspace | Shard | TabletType | State | Alias | Hostname | PrimaryTermStartTime |
+-------+----------+-------+------------+-------------+------------------+-------------+----------------------+
| zone1 | main     | -     | SPARE      | NOT_SERVING | zone1-1951951717 | 10.244.0.36 |                      |
| zone1 | main     | -     | SPARE      | NOT_SERVING | zone1-4073072872 | 10.244.0.37 |                      |
+-------+----------+-------+------------+-------------+------------------+-------------+----------------------+
Mon Mar 13 15:59:17 JST 2023
...
+-------+----------+-------+------------+-------------+------------------+-------------+----------------------+
| Cell | Keyspace | Shard | TabletType | State | Alias | Hostname | PrimaryTermStartTime |
+-------+----------+-------+------------+-------------+------------------+-------------+----------------------+
| zone1 | main     | -     | SPARE      | NOT_SERVING | zone1-1951951717 | 10.244.0.36 |                      |
| zone1 | main     | -     | SPARE      | NOT_SERVING | zone1-4073072872 | 10.244.0.37 |                      |
+-------+----------+-------+------------+-------------+------------------+-------------+----------------------+
Mon Mar 13 15:59:35 JST 2023
# Errors end here
+-------+----------+-------+------------+-------------+------------------+-------------+----------------------+
| Cell | Keyspace | Shard | TabletType | State | Alias | Hostname | PrimaryTermStartTime |
+-------+----------+-------+------------+-------------+------------------+-------------+----------------------+
| zone1 | main     | -     | PRIMARY    | SERVING     | zone1-1951951717 | 10.244.0.38 | 2023-03-13T06:59:07Z |
| zone1 | main     | -     | SPARE      | NOT_SERVING | zone1-1951951717 | 10.244.0.36 |                      |
| zone1 | main     | -     | SPARE      | NOT_SERVING | zone1-4073072872 | 10.244.0.39 |                      |
+-------+----------+-------+------------+-------------+------------------+-------------+----------------------+
Mon Mar 13 15:59:36 JST 2023
Binary Version
This issue can be reproduced in either version v15.0.1 or v16.0.0.
$ mysql -h 127.0.0.1 -P 15306 -u user --table
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 5.7.9-vitess-15.0.1 Version: 15.0.1 (Git revision 13ee9c817638d59bebd6bc598f9d673a893c41cd branch 'heads/v15.0.1') built on Tue Nov 29 21:08:39 UTC 2022 by vitess@buildkitsandbox using go1.18.7 linux/amd64
OR
$ mysql -h 127.0.0.1 -P 15306 -u user --table
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 244
Server version: 8.0.30-Vitess Version: 16.0.0 (Git revision bb768df0008fc09f7e6868a4fa571c32cc1cb526 branch 'heads/v16.0.0') built on Tue Feb 28 15:38:00 UTC 2023 by vitess@buildkitsandbox using go1.20.1 linux/amd64
Operating System and Environment details
This issue can be reproduced in either version v15.0.1 (with planetscale/vitess-operator:v2.8.1) or v16.0.0 (with planetscale/vitess-operator:v2.9.0).
Discussion: https://vitess.slack.com/archives/C0PQY0PTK/p1678243354071319
Log Fragments
Here is a diagram that illustrates the sequence of events and their flow, as described in the above log.