Commit 333d3ea

Bug#35392640: Group Replication primary with replica blocked by view change
In a GR setup, if a source of transactions exists besides the applier channel, the following can happen:

- Several transactions are being applied locally. They are already certified and so associated with a ticket, let's say ticket 2, but they have not yet committed. These can be local transactions or come from an async channel, for example.
- A view change happens that has ticket 3 and has to wait on the transactions from ticket 2.
- The view change (VC1) enters the GR applier channel and gets stuck there waiting for the ticket to change to 3.
- Another group change happens, producing another view change (VC2), while the transactions from ticket 2 finish executing.

Issue: There is a window where the last transaction from ticket 2 has already marked itself as executed but has not yet popped the ticket. In that window VC2 pops the ticket instead, but never notifies any of the participants. VC1 stays waiting for the ticket to change forever and the worker can't be killed.

Solution: Make the condition wait time out every 1 second so the loop re-checks its condition periodically. We also register a stage so the loop is more responsive to kill signals.

Change-Id: I86eb6d1e470d9728c540f2fbcfb4ba9357eba103
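The fix described in the commit message can be illustrated with a minimal, self-contained sketch. This is not the server code: `TicketGate`, `pop_silently`, and `wait_for_ticket` are hypothetical names invented for this example, and the poll interval is a constructor parameter (defaulting to the 1 second mentioned above) only so the sketch can be exercised quickly.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the server's ticket bookkeeping (illustration only).
class TicketGate {
 public:
  explicit TicketGate(std::chrono::milliseconds poll = std::chrono::seconds(1))
      : poll_(poll) {}

  // Advance the front ticket WITHOUT notifying waiters: the lost-wakeup case.
  void pop_silently() {
    std::lock_guard<std::mutex> guard(mutex_);
    ++front_ticket_;
  }

  // Wait until the front ticket reaches `ticket`. Returns false if `killed()`
  // reports true first. The timed wait bounds how long a missed notification
  // or a kill request can go unnoticed.
  template <typename KilledFn>
  bool wait_for_ticket(std::uint64_t ticket, KilledFn killed) {
    std::unique_lock<std::mutex> lock(mutex_);
    while (front_ticket_ < ticket) {
      if (killed()) return false;
      cond_.wait_for(lock, poll_);  // wakes on notify, timeout, or spuriously
    }
    return true;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cond_;
  std::uint64_t front_ticket_ = 0;
  std::chrono::milliseconds poll_;
};
```

With an untimed `wait`, a waiter in this sketch would sleep forever after `pop_silently()`. With the timed wait it observes the new ticket, or the kill flag, within one poll interval.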
1 parent afb7aad commit 333d3ea

8 files changed, +344 -10 lines changed
@@ -0,0 +1,80 @@
+include/group_replication.inc [rpl_server_count=4]
+Warnings:
+Note #### Sending passwords in plain text without SSL/TLS is extremely insecure.
+Note #### Storing MySQL user name or password information in the connection metadata repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START REPLICA; see the 'START REPLICA Syntax' in the MySQL Manual for more information.
+[connection server1]
+#
+# 1. Start GR on server 1. Create an asynchronous connection to server 3
+# Add some data to server 3 that will be replicated to server 1
+[connection server1]
+include/save_sysvars.inc [ "GLOBAL.group_replication_view_change_uuid" ]
+SET GLOBAL group_replication_view_change_uuid = "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa";
+include/start_and_bootstrap_group_replication.inc
+CHANGE REPLICATION SOURCE TO SOURCE_HOST='127.0.0.1', SOURCE_USER='root', SOURCE_AUTO_POSITION=1, SOURCE_PORT=SERVER_3_PORT FOR CHANNEL 'ch1';
+Warnings:
+Note 1759 Sending passwords in plain text without SSL/TLS is extremely insecure.
+Note 1760 Storing MySQL user name or password information in the connection metadata repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START REPLICA; see the 'START REPLICA Syntax' in the MySQL Manual for more information.
+include/start_slave.inc [FOR CHANNEL 'ch1']
+# Add some data on 3 and sync
+[connection server3]
+CREATE TABLE t1 (c1 INT NOT NULL PRIMARY KEY) ENGINE=InnoDB;
+INSERT INTO t1 VALUES (1);
+include/sync_slave_sql_with_master.inc
+#
+# 2. Insert one last transaction on server 3 that will block on commit on server 1
+# Use a point that blocks the transaction after certification but before commit
+# Wait for the transaction to block
+[connection server1]
+# Adding debug point 'ordered_commit_blocked' to @@GLOBAL.debug
+[connection server3]
+INSERT INTO t1 VALUES (2);
+[connection server1]
+SET DEBUG_SYNC= "now WAIT_FOR signal.ordered_commit_waiting";
+# Removing debug point 'ordered_commit_blocked' from @@GLOBAL.debug
+#
+# 3. Join member 2 to the group
+# Wait for the VCLE to reach application where it will be stuck waiting for its ticket
+[connection server2]
+include/save_sysvars.inc [ "GLOBAL.group_replication_view_change_uuid" ]
+SET GLOBAL group_replication_view_change_uuid = "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa";
+SET GLOBAL group_replication_group_name= "GROUP_REPLICATION_GROUP_NAME";
+START GROUP_REPLICATION;
+[connection server1]
+include/assert.inc ['There is a worker whose stage reports it is waiting on a ticket']
+#
+# 4. Unblock the transaction from the async channel, but stop it again before it pops the ticket
+# Adding debug point 'rpl_end_of_ticket_blocked' to @@GLOBAL.debug
+SET DEBUG_SYNC= "now SIGNAL signal.ordered_commit_continue";
+SET DEBUG_SYNC= "now WAIT_FOR signal.end_of_ticket_waiting";
+# Removing debug point 'rpl_end_of_ticket_blocked' from @@GLOBAL.debug
+#
+# 5. Join member 4. The new VCLE will pop the ticket with no broadcast
+# Wait for this new VCLE to be queued
+[connection server4]
+include/save_sysvars.inc [ "GLOBAL.group_replication_view_change_uuid" ]
+SET GLOBAL group_replication_view_change_uuid = "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa";
+SET GLOBAL group_replication_group_name= "GROUP_REPLICATION_GROUP_NAME";
+START GROUP_REPLICATION;
+[connection server1]
+#
+# 6. Unblock the stuck ticket
+# All members should now be online
+SET DEBUG_SYNC= "now SIGNAL signal.end_of_ticket_continue";
+#
+# 7. Cleaning up
+[connection server3]
+DROP TABLE t1;
+include/sync_slave_sql_with_master.inc
+[connection server1]
+SET DEBUG_SYNC= 'RESET';
+include/stop_slave.inc
+CHANGE REPLICATION SOURCE TO SOURCE_AUTO_POSITION=0 FOR CHANNEL "ch1";
+include/stop_group_replication.inc
+include/restore_sysvars.inc
+[connection server2]
+include/stop_group_replication.inc
+include/restore_sysvars.inc
+[connection server4]
+include/stop_group_replication.inc
+include/restore_sysvars.inc
+include/group_replication_end.inc
@@ -0,0 +1,19 @@
+!include ../my.cnf
+
+[mysqld.1]
+skip-replica-start= FALSE
+
+[mysqld.2]
+
+[mysqld.3]
+
+
+[mysqld.4]
+
+[ENV]
+
+SERVER_MYPORT_3= @mysqld.3.port
+SERVER_MYSOCK_3= @mysqld.3.socket
+
+SERVER_MYPORT_4= @mysqld.4.port
+SERVER_MYSOCK_4= @mysqld.4.socket
@@ -0,0 +1,205 @@
+################################################################################
+# === Purpose ===
+#
+# This test checks that the waiting block for commit tickets is responsive in case
+# of missed signals and stop instructions.
+#
+# ==== Requirements ====
+#
+# When multiple views are logged in a member, the system should never be stuck waiting for
+# a signal in order to apply one of these views.
+#
+# === Implementation ====
+#
+# 0. There are 3 members that will form a group (server 1,2 and 4).
+# There is an asynchronous replication connection from server 3 to server 1
+# 1. Start GR on server 1. Create an asynchronous connection to server 3
+# Add some data to server 3 that will be replicated to server 1
+# 2. Insert one last transaction on server 3 that will block on commit on server 1
+# Use a point that blocks the transaction after certification but before commit
+# Wait for the transaction to block
+# 3. Join member 2 to the group
+# Wait for the VCLE to reach application where it will be stuck waiting for its ticket
+# 4. Unblock the transaction from the async channel, but stop it again before it pops the ticket
+# 5. Join member 4. The new VCLE will pop the ticket with no broadcast
+# Wait for this new VCLE to be queued
+# 6. Unblock the stuck ticket
+# All members should now be online
+# 7. Cleaning up
+#
+# === References ===
+#
+# Bug#35392640: Group Replication primary with replica blocked by view change
+#
+
+--source include/have_debug_sync.inc
+--source include/have_group_replication_plugin.inc
+--let $rpl_skip_group_replication_start= 1
+--let $rpl_server_count= 4
+--source include/group_replication.inc
+
+--echo #
+--echo # 1. Start GR on server 1. Create an asynchronous connection to server 3
+--echo # Add some data to server 3 that will be replicated to server 1
+
+--let $rpl_connection_name= server1
+--source include/rpl_connection.inc
+
+--let $sysvars_to_save = [ "GLOBAL.group_replication_view_change_uuid" ]
+--source include/save_sysvars.inc
+
+SET GLOBAL group_replication_view_change_uuid = "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa";
+--source include/start_and_bootstrap_group_replication.inc
+
+# Async connection to channel 3
+--replace_result $SERVER_MYPORT_3 SERVER_3_PORT
+--eval CHANGE REPLICATION SOURCE TO SOURCE_HOST='127.0.0.1', SOURCE_USER='root', SOURCE_AUTO_POSITION=1, SOURCE_PORT=$SERVER_MYPORT_3 FOR CHANNEL 'ch1'
+
+--let $rpl_channel_name='ch1'
+--source include/start_slave.inc
+--let $rpl_channel_name=
+
+--echo # Add some data on 3 and sync
+
+--let $rpl_connection_name= server3
+--source include/rpl_connection.inc
+
+CREATE TABLE t1 (c1 INT NOT NULL PRIMARY KEY) ENGINE=InnoDB;
+INSERT INTO t1 VALUES (1);
+
+--let $sync_slave_connection=server1
+--source include/sync_slave_sql_with_master.inc
+
+--echo #
+--echo # 2. Insert one last transaction on server 3 that will block on commit on server 1
+--echo # Use a point that blocks the transaction after certification but before commit
+--echo # Wait for the transaction to block
+
+--let $rpl_connection_name= server1
+--source include/rpl_connection.inc
+
+# Block the last transaction from completing
+# Block it when it is already registered/certified but not committed.
+--let $debug_point = ordered_commit_blocked
+--source include/add_debug_point.inc
+
+--let $rpl_connection_name= server3
+--source include/rpl_connection.inc
+
+INSERT INTO t1 VALUES (2);
+
+--let $rpl_connection_name= server1
+--source include/rpl_connection.inc
+
+# Wait for the debug sync to be reached.
+SET DEBUG_SYNC= "now WAIT_FOR signal.ordered_commit_waiting";
+--source include/remove_debug_point.inc
+
+--echo #
+--echo # 3. Join member 2 to the group
+--echo # Wait for the VCLE to reach application where it will be stuck waiting for its ticket
+
+--let $rpl_connection_name= server2
+--source include/rpl_connection.inc
+
+--let $sysvars_to_save = [ "GLOBAL.group_replication_view_change_uuid" ]
+--source include/save_sysvars.inc
+SET GLOBAL group_replication_view_change_uuid = "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa";
+
+# Start GR gets stuck on RECOVERY state
+--replace_result $group_replication_group_name GROUP_REPLICATION_GROUP_NAME
+--eval SET GLOBAL group_replication_group_name= "$group_replication_group_name"
+--source include/start_group_replication_command.inc
+
+--let $rpl_connection_name= server1
+--source include/rpl_connection.inc
+
+# Wait for the VCLE to reach application
+--let $wait_condition= SELECT COUNT(*) = 1 FROM performance_schema.replication_applier_status_by_worker WHERE channel_name = "group_replication_applier" AND applying_transaction= "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:2"
+--source include/wait_condition.inc
+
+--let $assert_text= 'There is a worker whose stage reports it is waiting on a ticket'
+--let $assert_cond= [SELECT COUNT(*) AS count FROM performance_schema.threads WHERE name="thread/sql/replica_worker" AND processlist_state="Waiting for Binlog Group Commit ticket", count, 1] = 1
+--source include/assert.inc
+
+--echo #
+--echo # 4. Unblock the transaction from the async channel, but stop it again before it pops the ticket
+
+--let $debug_point = rpl_end_of_ticket_blocked
+--source include/add_debug_point.inc
+
+SET DEBUG_SYNC= "now SIGNAL signal.ordered_commit_continue";
+
+# Wait it to block after already acknowledging the transaction was processed, but before popping the ticket
+SET DEBUG_SYNC= "now WAIT_FOR signal.end_of_ticket_waiting";
+
+--source include/remove_debug_point.inc
+
+--echo #
+--echo # 5. Join member 4. The new VCLE will pop the ticket with no broadcast
+--echo # Wait for this new VCLE to be queued
+
+--let $rpl_connection_name= server4
+--source include/rpl_connection.inc
+
+--let $sysvars_to_save = [ "GLOBAL.group_replication_view_change_uuid" ]
+--source include/save_sysvars.inc
+SET GLOBAL group_replication_view_change_uuid = "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa";
+
+# The new View Change will cause a pop with no signal
+--replace_result $group_replication_group_name GROUP_REPLICATION_GROUP_NAME
+--eval SET GLOBAL group_replication_group_name= "$group_replication_group_name"
+--source include/start_group_replication_command.inc
+
+--let $rpl_connection_name= server1
+--source include/rpl_connection.inc
+
+# Wait for the VCLE for the member 4 join to be queued
+# Not stuck waiting for the ticket, the VCLE is still stuck waiting for the flush stage lock
+--let $wait_condition= SELECT COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE = 1 from performance_schema.replication_group_member_stats where member_id in (SELECT @@server_uuid)
+--source include/wait_condition.inc
+
+--echo #
+--echo # 6. Unblock the stuck ticket
+--echo # All members should now be online
+
+SET DEBUG_SYNC= "now SIGNAL signal.end_of_ticket_continue";
+
+--let $wait_condition=SELECT COUNT(*)=3 FROM performance_schema.replication_group_members where MEMBER_STATE="ONLINE"
+--source include/wait_condition.inc
+
+--echo #
+--echo # 7. Cleaning up
+
+--let $rpl_connection_name= server3
+--source include/rpl_connection.inc
+
+DROP TABLE t1;
+
+--let $sync_slave_connection=server1
+--source include/sync_slave_sql_with_master.inc
+
+--let $rpl_connection_name= server1
+--source include/rpl_connection.inc
+
+SET DEBUG_SYNC= 'RESET';
+
+--source include/stop_slave.inc
+CHANGE REPLICATION SOURCE TO SOURCE_AUTO_POSITION=0 FOR CHANNEL "ch1";
+
+--source include/stop_group_replication.inc
+--source include/restore_sysvars.inc
+
+--let $rpl_connection_name= server2
+--source include/rpl_connection.inc
+
+--source include/stop_group_replication.inc
+--source include/restore_sysvars.inc
+
+--let $rpl_connection_name= server4
+--source include/rpl_connection.inc
+
+--source include/stop_group_replication.inc
+--source include/restore_sysvars.inc
+
+--source include/group_replication_end.inc

sql/binlog.cc

+6
@@ -8923,6 +8923,12 @@ int MYSQL_BIN_LOG::ordered_commit(THD *thd, bool all, bool skip_commit) {
     thd->thread_id()));
 
   DEBUG_SYNC(thd, "bgc_before_flush_stage");
+  DBUG_EXECUTE_IF("ordered_commit_blocked", {
+    const char act[] =
+        "now signal signal.ordered_commit_waiting wait_for "
+        "signal.ordered_commit_continue";
+    assert(!debug_sync_set_action(current_thd, STRING_WITH_LEN(act)));
+  });
 
   /*
     Stage #0: ensure slave threads commit order as they appear in the slave's
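
The debug point added in this hunk follows a two-step handshake: the blocked thread announces that it has arrived (`signal.ordered_commit_waiting`) and then parks until the test releases it (`signal.ordered_commit_continue`). The sketch below imitates that handshake with standard C++ primitives. It is a simplification, not the server's debug sync facility: `SignalBoard` and `blocked_point` are hypothetical names, and only the two signal names are taken from the diff.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <set>
#include <string>
#include <thread>

// Hypothetical, simplified board of named signals (illustration only).
class SignalBoard {
 public:
  // Raise a named signal and wake every waiter.
  void signal(const std::string &name) {
    {
      std::lock_guard<std::mutex> guard(mutex_);
      raised_.insert(name);
    }
    cond_.notify_all();
  }

  // Block until the named signal has been raised.
  void wait_for(const std::string &name) {
    std::unique_lock<std::mutex> lock(mutex_);
    cond_.wait(lock, [&] { return raised_.count(name) > 0; });
  }

 private:
  std::mutex mutex_;
  std::condition_variable cond_;
  std::set<std::string> raised_;
};

// Worker side: announce arrival at the blocked point, then wait for release.
inline void blocked_point(SignalBoard &board, bool &committed) {
  board.signal("signal.ordered_commit_waiting");
  board.wait_for("signal.ordered_commit_continue");
  committed = true;  // stands in for the commit proceeding
}
```

The controlling side mirrors the test script: wait for the `waiting` signal, do whatever must happen while the worker is parked, then raise the `continue` signal.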

sql/log_event.cc

+3-2
@@ -13300,8 +13300,9 @@ int Gtid_log_event::do_apply_event(Relay_log_info const *rli) {
     if (thd->rpl_thd_ctx.binlog_group_commit_ctx()
             .get_session_ticket()
             .is_set()) {
-      assert(thd->rpl_thd_ctx.binlog_group_commit_ctx().get_session_ticket() ==
-             bgc_group_ticket);
+      assert(
+          !(bgc_group_ticket >
+            thd->rpl_thd_ctx.binlog_group_commit_ctx().get_session_ticket()));
     }
 #endif
   /*
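
This hunk relaxes the debug assertion from strict equality to "the group ticket is not greater than the session ticket". The difference can be seen in a small sketch that treats tickets as plain integers. That is an assumption made for illustration, and `old_check` and `new_check` are hypothetical helper names, not functions in the server.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers; tickets are modelled as plain integers here.
constexpr bool old_check(std::uint64_t session_ticket,
                         std::uint64_t group_ticket) {
  return session_ticket == group_ticket;  // previous assertion
}

constexpr bool new_check(std::uint64_t session_ticket,
                         std::uint64_t group_ticket) {
  return !(group_ticket > session_ticket);  // assertion after this commit
}
```

Under this model, a session ticket that is ahead of the group ticket passes the new check but would have failed the old one, while a session ticket that is behind the group ticket still fails both.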

sql/mysqld.cc

+3-1
@@ -12163,6 +12163,7 @@ PSI_stage_info stage_rpl_failover_fetching_source_member_details= { 0, "Fetching
 PSI_stage_info stage_rpl_failover_updating_source_member_details= { 0, "Updating fetched source member details on receiver", 0, PSI_DOCUMENT_ME};
 PSI_stage_info stage_rpl_failover_wait_before_next_fetch= { 0, "Wait before trying to fetch next membership changes from source", 0, PSI_DOCUMENT_ME};
 PSI_stage_info stage_communication_delegation= { 0, "Connection delegated to Group Replication", 0, PSI_DOCUMENT_ME};
+PSI_stage_info stage_wait_on_commit_ticket= { 0, "Waiting for Binlog Group Commit ticket", 0, PSI_DOCUMENT_ME};
 /* clang-format on */
 
 extern PSI_stage_info stage_waiting_for_disk_space;
@@ -12266,7 +12267,8 @@ PSI_stage_info *all_server_stages[] = {
     &stage_rpl_failover_fetching_source_member_details,
     &stage_rpl_failover_updating_source_member_details,
     &stage_rpl_failover_wait_before_next_fetch,
-    &stage_communication_delegation};
+    &stage_communication_delegation,
+    &stage_wait_on_commit_ticket};
 
 PSI_socket_key key_socket_tcpip;
 PSI_socket_key key_socket_unix;

sql/mysqld.h

+1
@@ -650,6 +650,7 @@ extern PSI_stage_info stage_rpl_failover_fetching_source_member_details;
 extern PSI_stage_info stage_rpl_failover_updating_source_member_details;
 extern PSI_stage_info stage_rpl_failover_wait_before_next_fetch;
 extern PSI_stage_info stage_communication_delegation;
+extern PSI_stage_info stage_wait_on_commit_ticket;
 #ifdef HAVE_PSI_STATEMENT_INTERFACE
 /**
   Statement instrumentation keys (sql).
