Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug Report: Pointing vttablet to different (unmaged) MySQL makes it get stuck in loop #15212

Closed
wiebeytec opened this issue Feb 12, 2024 · 6 comments

Comments

@wiebeytec
Copy link

wiebeytec commented Feb 12, 2024

Overview of the Issue

I have an unmanaged MySQL at Amazon RDS with a vttablet in front of it. I replaced the backing MySQL with one restored from snapshot. Amazon makes you restore that into a new instance, so the hostname changes. But, all users and permissions and stuff are the same.

I changed the --db_host argument to vttablet, after which vttablet got stuck NOT_SERVING:

mysql> show vitess_tablets;
+---------------------------+-----------+-------+------------+-------------+--------------------------------------+---------------+----------------------+
| Cell                      | Keyspace  | Shard | TabletType | State       | Alias                                | Hostname      | PrimaryTermStartTime |
+---------------------------+-----------+-------+------------+-------------+--------------------------------------+---------------+----------------------+
| foo-eu-central-1-fakelive | meta2023  | 0     | PRIMARY    | SERVING     | foo-eu-central-1-fakelive-0000000700 | 172.31.25.159 | 2024-01-02T15:57:00Z |
| foo-eu-central-1-fakelive | sites2023 | -80   | PRIMARY    | SERVING     | foo-eu-central-1-fakelive-0000000300 | 172.31.27.190 | 2024-02-12T13:40:54Z |
| foo-eu-central-1-fakelive | sites2023 | 80-   | PRIMARY    | SERVING     | foo-eu-central-1-fakelive-0000000301 | 172.31.17.118 | 2024-02-12T13:41:06Z |
| foo-eu-central-1-fakelive | bar2023   | 0     | PRIMARY    | NOT_SERVING | foo-eu-central-1-fakelive-0000000600 | 172.31.25.59  |                      |
+---------------------------+-----------+-------+------------+-------------+--------------------------------------+---------------+----------------------+

Every second it prints log lines about discovering its state. I pasted it in the box below.

Pointing --db_host back to the original made it work again.

Looking in etcd with etcdctl get --prefix / I see references to the hostname. Apparently there is some split brain situation going on. The --db_host does do something, because MySQL reports a connection, but it's not matched to the original DB perhaps?

Best expected behavior is that it just works. Giving a proper error message is also at least something. It kind of depends what falls in the Vitess 'way'. I know that caring about one MySQL instance is not normal, but augmenting an RDS instance with Vitess seems valid to me.

Reproduction Steps

Point vttablet to an exact copy of the database with by changing --db_host.

Binary Version

Version `vttablet version Version: 18.0.2 (Git revision d3012c188ea0cfc6837917fc6642ea23be9bb1ff branch 'HEAD') built on Wed Dec 20 14:27:31 UTC 2023 by runner@fv-az975-901 using go1.21.5 linux/amd64`

Operating System and Environment details

# cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.3 LTS"
# uname -a
Linux vitess-vtctld 6.2.0-1014-aws #14~22.04.1-Ubuntu SMP Thu Oct  5 22:43:45 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux


### Log Fragments

```sh
This every second:


feb 12 15:59:09 tablet start_vttablet[625845]: I0212 15:59:09.974908  625845 engine.go:149] In syncSidecarDB
feb 12 15:59:09 tablet start_vttablet[625845]: I0212 15:59:09.974945  625845 sidecardb.go:191] _vt schema init called from vitess.io/vitess/go/vt/vttablet/tabletserver/schema.(*Engine).syncSidecarDB:163
feb 12 15:59:09 tablet start_vttablet[625845]: I0212 15:59:09.974967  625845 sidecardb.go:232] Starting sidecardb.Init()
feb 12 15:59:09 tablet start_vttablet[625845]: I0212 15:59:09.976136  625845 sidecardb.go:316] doesSidecarDBExist: found _vt
feb 12 15:59:09 tablet start_vttablet[625845]: I0212 15:59:09.979674  625845 sidecardb.go:383] No changes needed for table heartbeat
feb 12 15:59:09 tablet start_vttablet[625845]: I0212 15:59:09.979695  625845 sidecardb.go:436] Table schema was already up to date for the heartbeat table in the _vt sidecar database
...more like this...
feb 12 15:59:09 tablet start_vttablet[625845]: I0212 15:59:09.995905  625845 sidecardb.go:436] Table schema was already up to date for the vreplication table in the _vt sidecar database
feb 12 15:59:09 tablet start_vttablet[625845]: I0212 15:59:09.996832  625845 sidecardb.go:383] No changes needed for table vreplication_log
feb 12 15:59:09 tablet start_vttablet[625845]: I0212 15:59:09.996849  625845 sidecardb.go:436] Table schema was already up to date for the vreplication_log table in the _vt sidecar database
feb 12 15:59:09 tablet start_vttablet[625845]: I0212 15:59:09.997670  625845 engine.go:172] syncSidecarDB done
feb 12 15:59:09 tablet start_vttablet[625845]: I0212 15:59:09.997688  625845 engine.go:151] syncSidecarDB took 22 ms
feb 12 15:59:09 tablet start_vttablet[625845]: I0212 15:59:09.997712  625845 engine.go:160] VStreamer: opening
feb 12 15:59:09 tablet start_vttablet[625845]: I0212 15:59:09.997719  625845 query_engine.go:312] Query Engine: opening
feb 12 15:59:10 tablet start_vttablet[625845]: I0212 15:59:10.003454  625845 pool.go:152] connpool - started execution of Close
feb 12 15:59:10 tablet start_vttablet[625845]: I0212 15:59:10.003468  625845 pool.go:154] connpool - found the pool
feb 12 15:59:10 tablet start_vttablet[625845]: I0212 15:59:10.003473  625845 pool.go:161] connpool - calling close on the pool
feb 12 15:59:10 tablet start_vttablet[625845]: I0212 15:59:10.003505  625845 pool.go:163] connpool - acquiring lock
feb 12 15:59:10 tablet start_vttablet[625845]: I0212 15:59:10.003514  625845 pool.go:165] connpool - acquired lock
feb 12 15:59:10 tablet start_vttablet[625845]: I0212 15:59:10.003522  625845 pool.go:169] connpool - closing dbaPool
feb 12 15:59:10 tablet start_vttablet[625845]: I0212 15:59:10.003534  625845 pool.go:171] connpool - finished execution of Close

This is how vttablet is started:

vttablet 
--topo_implementation etcd2 
--topo_global_server_address 172.31.25.159:2379 
--topo_global_root /vitess/global 
--queryserver-config-transaction-cap 500 
--logtostderr 
--tablet-path foo-eu-central-1-fakelive-0000000600 
--init_keyspace bar2023 
--init_shard 0 
--init_tablet_type replica 
--tablet_hostname 172.31.25.59 
--port 18600 
--grpc_port 15600 
--service_map grpc-queryservice,grpc-tabletmanager,grpc-updatestream 
--pid_file /var/lib/vitess/vt_0000000600/vttablet.pid 
--db_host dbrestorevitess.removed.eu-central-1.rds.amazonaws.com 
--db_port 3306 
--db_app_user vitess_writer 
--db_app_password vitess_writer_u_pwd_removed 
--db_dba_user vitess_writer 
--db_dba_password vitess_writer_u_pwd_removed 
--db_repl_user vitess_writer 
--db_repl_password vitess_writer_u_pwd_removed 
--db_filtered_user vitess_writer 
--db_filtered_password vitess_writer_u_pwd_removed 
--db_allprivs_user vitess_writer 
--db_allprivs_password vitess_writer_u_pwd_removed 
--sql-max-length-errors 20 
--sanitize_log_messages 
--queryserver-config-terse-errors 
--redact-debug-ui-queries 
--init_db_name_override mydb_2012
@wiebeytec wiebeytec added Needs Triage This issue needs to be correctly labelled and triaged Type: Bug labels Feb 12, 2024
@mattlord
Copy link
Contributor

mattlord commented Feb 12, 2024

@wiebeytec To clarify, you stopped the vttablet process that was using the old --db_host value and started a new vttablet process with the new --db_host value?

My guess would be that it's trying to perform an operation that it cannot. Can you please share the full list of flags that you're using, with sensitive info redacted as necessary? I'm guessing it's related to this: #14871

@wiebeytec
Copy link
Author

First: I figured out what was wrong, and it was my mistake. So, the issue is one of logging errors. First to answer your questions:

@wiebeytec To clarify, you stopped the vttablet process that was using the old --db_host value and started a new vttablet process with the new --db_host value?

Indeed. I just stopped it and started it again, with a new --db-host.

My guess would be that it's trying to perform an operation that it cannot. Can you please share the full list of flags that you're using, with sensitive info redacted as necessary? I'm guessing it's related to this: #14871

I edited the top comment with the information.

It also seemed to me it was trying to do something it cannot, but it didn't log that.

The problem was that I forgot to pick the correct 'parameter group' for the settings. So it's one of these settings that made it fail:

rds_parameter_groups_diff

@mattlord mattlord self-assigned this Feb 13, 2024
@mattlord
Copy link
Contributor

Thanks, @wiebeytec . So not being able to start makes sense, but not having any logs for it does not. We should be able to repeat it easily with the local examples by editing one of the tablet's my.cnf files to add gtid_mode=OFF_PERMISSIVE, then restarting it. We could then turn this into a bug report about the lack of any logging for that and address that specific part.

@mattlord
Copy link
Contributor

@wiebeytec Here's a test case which should demonstrate the lack of a clear error message:

git checkout main && make build

pushd examples/local

./101_initial_cluster.sh

../common/scripts/vtorc-down.sh

vtctldclient SetKeyspaceDurabilityPolicy --durability-policy=none commerce

sleep 10

replicauid=$(vtctldclient GetTablets --keyspace commerce --tablet-type replica | awk '{print $1}' | cut -d- -f2 | bc)
primaryuid=$(vtctldclient GetTablets --keyspace commerce --tablet-type primary | awk '{print $1}' | cut -d- -f2 | bc)

CELL=zone1 KEYSPACE=commerce TABLET_UID="${replicauid}" ../common/scripts/vttablet-down.sh

# 102 is always the rdonly tablet
CELL=zone1 KEYSPACE=commerce TABLET_UID=102 ../common/scripts/vttablet-down.sh

echo "gtid_mode=OFF_PERMISSIVE" >> "${VTDATAROOT}/vt_$(vtctldclient GetTablets --keyspace commerce --tablet-type primary | awk '{print $1}' | cut -d '-' -f2)"/my.cnf

CELL=zone1 KEYSPACE=commerce TABLET_UID="${primaryuid}" ../common/scripts/vttablet-down.sh

CELL=zone1 KEYSPACE=commerce TABLET_UID="${primaryuid}" ../common/scripts/vttablet-up.sh

vtpid=$(pidof vttablet | awk '{print $1}')

tail -100f ${VTDATAROOT}/tmp/vttablet*.INFO.*.${vtpid}

But clearly there's something missing here.

What MySQL config option(s) in this parameter group did you think were the cause?

Are you using the Vitess Operator here?

@mattlord mattlord added Can't Repeat and removed Needs Triage This issue needs to be correctly labelled and triaged labels Feb 14, 2024
@wiebeytec
Copy link
Author

(We're not using Vitess Operator.)

I confirmed it was SQL_MODE value STRICT_TRANS_TABLES.

@mattlord
Copy link
Contributor

OK, I'm unable to repeat what you saw (I modified the test case above to set sql_mode instead of gtid_mode and I did not see any problems). I'm not sure what part I may be missing. I'm going to close this for now but if you have more info we can re-open it at any time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants