-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Description
Overview of the Issue
When the atomic copy failed because of #17862, the result was that it got confused. It though it was done, so it ended, but then the calling context still saw a table list to do, and restarted itself. It then quickly errorred because of duplicate key reasons.
Rohit said:
The root cause seems to be that the source grpc
serr := vc.vr.sourceVStreamer.VStreamTables(ctx, func(resp *binlogdatapb.VStreamTablesResponse) errorreturned without an error or the context expiring which should only happen if all tables had finished streaming. It then found the additional uncopied tables in the _vt.copy_state and assumed it was just starting the atomic copy.
(in vcopier_atomic.go)
What's also interesting to note: is that as you can see below from my start parameters, the copy phase duration is 5 minutes. This translates into the source tablet (by GetVReplicationMaxExecutionTimeQueryHint() I guess) logging this:
Streaming query: select /*+ MAX_EXECUTION_TIME(300000) */ * from EventLog
So, even tough it tried to set 300000, that didn't work. It was at the 10 minute mark that it failed. See the log output later.
To summarize, I see these issues that need to be fixed:
- The
MAX_EXECUTION_TIME(300000)query hint doesn't seem to work. - The
max_execution_timeglobal setting may need to be verified before starting. - The amount of rows copied may need to be confirmed, at least in atomic mode. It restarting all the way from the beginning was a bug, but I'm glad it did, because it did prevent data loss. The only reason I noticed the error, was because it restarted all the way from the start. Had the table list been progressively popped so it continued with the next table, I would not have noticed.
Reproduction Steps
Perform an atomic copy of a large enough table with a value for max_execution_time on the source short enough for the source to abort the query.
We used MySQL 8.0.35 at AWS RDS as source (to move away from).
Tablet start params:
vttablet \
--topo_implementation etcd2 \
--topo_global_server_address 172.31.35.179:2379 \
--topo_global_root /vitess/global \
--vreplication_copy_phase_duration 0h5m0s \
--queryserver-config-transaction-cap 500 \
--logtostderr \
--tablet-path vic-eu-central-1-live-0000000600 \
--init_keyspace vrm2023 \
--init_shard 0 \
--init_tablet_type replica \
--tablet_hostname 172.31.39.97 \
--port 18600 \
--grpc_port 15600 \
--service_map grpc-queryservice,grpc-tabletmanager,grpc-updatestream \
--table-acl-config /opt/vitess/vttablet_acl.json \
--queryserver-config-strict-table-acl \
--enforce-tableacl-config \
--pid_file /var/lib/vitess/vt_0000000600/vttablet.pid \
--db_host rds-foobar.eu-central-1.rds.amazonaws.com \
--db_port 3306 \
--db_app_user vitess_dml \
--db_app_password password_removed \
--db_dba_user vitess_dml \
--db_dba_password password_removed \
--db_repl_user vitess_dml \
--db_repl_password password_removed \
--db_filtered_user vitess_dml \
--db_filtered_password password_removed \
--db_allprivs_user vitess_dml \
--db_allprivs_password password_removed \
--sql-max-length-errors 20 \
--sanitize_log_messages \
--queryserver-config-terse-errors \
--redact-debug-ui-queries \
--init_db_name_override legacy_mysql_db_name
Binary Version
vttablet version Version: 21.0.1 (Git revision 3d4f41db2fbc32611c7d2ea2af3dc68b9d962415 branch 'HEAD') built on Tue Dec 3 05:39:35 UTC 2024 by runner@fv-az2029-313 using go1.23.3 linux/amd64Operating System and Environment details
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=24.04
DISTRIB_CODENAME=noble
DISTRIB_DESCRIPTION="Ubuntu 24.04.2 LTS"Log Fragments
-- First mention of (copious) log entry of moving 'eventLog' (you can see it's the first because #fields > 0):
feb 20 11:00:12 vrm-vitess-mysql-vrm2024-b1 start_vttablet[1029525]: I0220 11:00:12.866786 1029525 vcopier_atomic.go:114] VStreamTablesResponse: received table eventLog, #fields 10, #rows 0, gtid MySQL56/1a654e8e-a2c3-11e7-ab96-06a079b41438:1-4845581054, lastpk <nil>
-- It thought it completed here 10 minutes later because of https://github.com/vitessio/vitess/issues/17862
feb 20 11:10:15 vrm-vitess-mysql-vrm2024-b1 start_vttablet[1029525]: I0220 11:10:15.090405 1029525 vcopier_atomic.go:311] Deleting copy state for table eventLog
feb 20 11:10:15 vrm-vitess-mysql-vrm2024-b1 start_vttablet[1029525]: I0220 11:10:15.131445 1029525 vcopier_atomic.go:304] Completed copy of all tables
feb 20 11:10:15 vrm-vitess-mysql-vrm2024-b1 start_vttablet[1029525]: I0220 11:10:15.131467 1029525 vcopier_atomic.go:306] Returning from copyAll for vrm_all
feb 20 11:10:15 vrm-vitess-mysql-vrm2024-b1 start_vttablet[1029525]: I0220 11:10:15.132133 1029525 vcopier_atomic.go:71] Starting copyAll for vrm_all
feb 20 11:10:36 vrm-vitess-mysql-vrm2024-b1 start_vttablet[1029525]: I0220 11:10:36.573519 1029525 vcopier_atomic.go:148] starting copy phase with table SshIpToVncRelay
feb 20 11:10:36 vrm-vitess-mysql-vrm2024-b1 start_vttablet[1029525]: I0220 11:10:36.575926 1029525 vcopier_atomic.go:285] VStreamTables failed: Duplicate entry....
feb 20 11:10:36 vrm-vitess-mysql-vrm2024-b1 start_vttablet[1029525]: failed inserting rows
feb 20 11:10:36 vrm-vitess-mysql-vrm2024-b1 start_vttablet[1029525]: task error
feb 20 11:10:36 vrm-vitess-mysql-vrm2024-b1 start_vttablet[1029525]: I0220 11:10:36.575997 1029525 vcopier_atomic.go:286] Returning from copyAll for vrm_allSlack discussion
https://vitess.slack.com/archives/C0PQY0PTK/p1740042281348869