Commit aaab65d

BUG#34398622 gr_suspect_member_resumes_after_crash_mysql bug failed
Problem
========================
gr_suspect_member_resumes_after_crash_mysql is failing sporadically on our
continuous integration platform on SSL runs.

Analysis
========================
MySQL connections used in XCom are not fully controlled by whoever creates
them, but by:
- The Network Connection Manager that lies in XCom;
- The server, when it goes into a stop. In this case, the GR plugin might be
  stopping while the server is still accepting connections that are to be
  delegated to GCS.

In a group with 2 nodes (A, B), a Node B XCom client that creates a connection
to join a group is not aware that the new connection is not making any
progress on the Node A server side, so it blocks on a con_read that will never
return with success.

This lack of return derives from the fact that:
- The plugin was stopping and would never return anything on that connection;
- The server-side plugin would get stuck trying to acquire a read lock on a
  resource that is not available while the server is either stopping or
  starting.

In addition, connections that are received while the plugin is stopping might
end up in limbo, because they were accepted but never registered on the
Network Connection Manager from XCom.

Fix
========================
The fix has several fronts, all of them on the server side:
- Changed the sql_parse.cc code to return client errors during the connection
  delegation itself. This allows a client to detect any of the errors
  described above, fail fast and retry the connection.
- Changed the lock acquisition to a "try lock", returning an error if we
  cannot obtain a lock to delegate the connection when the plugin is executing
  an exclusive operation on gcs_operations.cc.
- Changed the wait loop for a delegated server connection to check if the
  plugin is still enabled. If the plugin is torn down, those connections are
  terminated.

Change-Id: I0528e1df9adcd985181a233e226b6352af4eae7c
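
The third front of the fix (the wait loop that checks whether the plugin is still enabled) is not visible in the hunks shown on this page. As a rough illustration only, the idea could be sketched as below. The function name, the two atomic flags and the polling interval are assumptions made for this example and are not taken from sql_parse.cc.

```cpp
#include <atomic>
#include <cassert>  // used by the usage checks that follow the block
#include <chrono>
#include <thread>

// Hypothetical outcome of waiting on a delegated connection.
enum class WaitResult { kConnectionClosed, kPluginStopped };

// Waits until the delegated connection is finished, but gives up as soon as
// the plugin is no longer enabled, so the connection can be terminated
// instead of waiting forever.
WaitResult wait_for_delegated_connection(
    const std::atomic<bool> &connection_done,
    const std::atomic<bool> &plugin_enabled) {
  while (!connection_done.load()) {
    if (!plugin_enabled.load()) return WaitResult::kPluginStopped;
    // Hypothetical polling interval; the real loop may wait differently.
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  return WaitResult::kConnectionClosed;
}
```

With `plugin_enabled` set to false the function returns `kPluginStopped` on the first iteration; with `connection_done` already true it returns `kConnectionClosed` without polling.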
1 parent 32bcec6 commit aaab65d

14 files changed: +284 -43 lines changed

mysql-test/suite/group_replication/include/gr_suspect_member_resumes_after_crash.inc

+1-1
@@ -63,7 +63,7 @@ call mtr.add_suppression("Error connecting to all peers. Member join failed. Loc
 call mtr.add_suppression("Unable to start MySQL Network Provider*.*");
 call mtr.add_suppression("Timeout while waiting for the group communication engine to be ready*.*");
 call mtr.add_suppression("The group communication engine is not ready for the member to join*.*");
-
+call mtr.add_suppression(".*Failed to accept a MySQL connection for Group Replication. Group Replication plugin has an ongoing exclusive operation, like START, STOP or FORCE MEMBERS.*");
 set session sql_log_bin=1;
 
 

@@ -0,0 +1,65 @@
+include/group_replication.inc
+Warnings:
+Note #### Sending passwords in plain text without SSL/TLS is extremely insecure.
+Note #### Storing MySQL user name or password information in the connection metadata repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START REPLICA; see the 'START REPLICA Syntax' in the MySQL Manual for more information.
+[connection server1]
+#
+# 1. Bootstrap a group with M1.
+#
+#######
+[connection server1]
+include/start_and_bootstrap_group_replication.inc
+include/disable_binlog.inc
+call mtr.add_suppression(".*Failed to accept a MySQL connection for Group Replication. Group Replication plugin has an ongoing exclusive operation, like START, STOP or FORCE MEMBERS.*");
+include/restore_binlog.inc
+[connection server2]
+include/disable_binlog.inc
+call mtr.add_suppression("Timeout on wait for view after joining group");
+call mtr.add_suppression("Timeout while waiting for the group communication engine to be ready!");
+call mtr.add_suppression("The group communication engine is not ready for the member to join. Local port: *.*");
+call mtr.add_suppression("read failed");
+call mtr.add_suppression("The member was unable to join the group. Local port: *.*");
+call mtr.add_suppression("Error connecting to all peers. Member join failed. Local port: *.*");
+include/restore_binlog.inc
+#
+# 2. Enable fail_incoming_connection_ongoing_operation and try to
+# join M2.
+#
+#######
+[connection server1]
+# Adding debug point 'fail_incoming_connection_ongoing_operation' to @@GLOBAL.debug
+[connection server2]
+#
+# 3. Join M2 will fail. Assert that the error message exists
+# in the log of M1.
+#
+#######
+SET GLOBAL group_replication_group_name= "GROUP_REPLICATION_GROUP_NAME";
+START GROUP_REPLICATION;
+ERROR HY000: The server is not configured properly to be an active member of the group. Please see more details on error log.
+[connection server1]
+include/assert.inc ['Failed to accept a MySQL connection for Group Replication. Group Replication plugin has an ongoing exclusive operation, like START, STOP or FORCE MEMBERS.']
+#
+# 4. Start M2 with the send command
+#
+#######
+[connection server2]
+START GROUP_REPLICATION;;
+[connection server_1_1]
+#
+# 5. Sleep for 10 seconds and then clear
+# fail_incoming_connection_ongoing_operation
+#
+#######
+# Removing debug point 'fail_incoming_connection_ongoing_operation' from @@GLOBAL.debug
+#
+# 6. reap the start command and M2 must be able to join the group.
+#
+#######
+[connection server2]
+#
+# 7. Clean-up
+#
+######
+[connection server1]
+include/group_replication_end.inc

mysql-test/suite/group_replication/r/gr_suspect_member_resumes_after_crash.result

+1
@@ -24,6 +24,7 @@ call mtr.add_suppression("Error connecting to all peers. Member join failed. Loc
 call mtr.add_suppression("Unable to start MySQL Network Provider*.*");
 call mtr.add_suppression("Timeout while waiting for the group communication engine to be ready*.*");
 call mtr.add_suppression("The group communication engine is not ready for the member to join*.*");
+call mtr.add_suppression(".*Failed to accept a MySQL connection for Group Replication. Group Replication plugin has an ongoing exclusive operation, like START, STOP or FORCE MEMBERS.*");
 set session sql_log_bin=1;
 
 ############################################################

mysql-test/suite/group_replication/r/gr_suspect_member_resumes_after_crash_mysql.result

+1
@@ -24,6 +24,7 @@ call mtr.add_suppression("Error connecting to all peers. Member join failed. Loc
 call mtr.add_suppression("Unable to start MySQL Network Provider*.*");
 call mtr.add_suppression("Timeout while waiting for the group communication engine to be ready*.*");
 call mtr.add_suppression("The group communication engine is not ready for the member to join*.*");
+call mtr.add_suppression(".*Failed to accept a MySQL connection for Group Replication. Group Replication plugin has an ongoing exclusive operation, like START, STOP or FORCE MEMBERS.*");
 set session sql_log_bin=1;
 
 ############################################################
@@ -0,0 +1 @@
+--loose-group_replication_communication_stack=MySQL
@@ -0,0 +1 @@
+--loose-group_replication_communication_stack=MySQL
@@ -0,0 +1,121 @@
+################################################################################
+# This test will emulate server-side errors on MySQL connections, such as
+# the plugin has an ongoing operation that does not allow connections to
+# be accepted.
+#
+# Test:
+# 0. The test requires two servers: M1 and M2.
+# 1. Bootstrap a group with M1.
+# 2. Enable fail_incoming_connection_ongoing_operation and try to join M2.
+# 3. Join M2 will fail. Assert that the error message exists in the log of M1.
+# 4. Start M2 with the send command
+# 5. Sleep for 10 seconds and then clear
+# fail_incoming_connection_ongoing_operation
+# 6. reap the start command and M2 must be able to join the group.
+# 7. Clean-up
+################################################################################
+--source include/big_test.inc
+--source include/have_group_replication_mysql_communication_stack.inc
+--source include/have_group_replication_plugin.inc
+--let $rpl_skip_group_replication_start= 1
+--source include/group_replication.inc
+
+--echo #
+--echo # 1. Bootstrap a group with M1.
+--echo #
+--echo #######
+
+--let $rpl_connection_name= server1
+--source include/rpl_connection.inc
+--source include/start_and_bootstrap_group_replication.inc
+--source include/disable_binlog.inc
+call mtr.add_suppression(".*Failed to accept a MySQL connection for Group Replication. Group Replication plugin has an ongoing exclusive operation, like START, STOP or FORCE MEMBERS.*");
+--source include/restore_binlog.inc
+
+--let $rpl_connection_name= server2
+--source include/rpl_connection.inc
+--source include/disable_binlog.inc
+call mtr.add_suppression("Timeout on wait for view after joining group");
+call mtr.add_suppression("Timeout while waiting for the group communication engine to be ready!");
+call mtr.add_suppression("The group communication engine is not ready for the member to join. Local port: *.*");
+call mtr.add_suppression("read failed");
+call mtr.add_suppression("The member was unable to join the group. Local port: *.*");
+call mtr.add_suppression("Error connecting to all peers. Member join failed. Local port: *.*");
+--source include/restore_binlog.inc
+
+--echo #
+--echo # 2. Enable fail_incoming_connection_ongoing_operation and try to
+--echo # join M2.
+--echo #
+--echo #######
+
+--let $rpl_connection_name= server1
+--source include/rpl_connection.inc
+
+--let $debug_point = fail_incoming_connection_ongoing_operation
+--source include/add_debug_point.inc
+
+--let $rpl_connection_name= server2
+--source include/rpl_connection.inc
+
+--echo #
+--echo # 3. Join M2 will fail. Assert that the error message exists
+--echo # in the log of M1.
+--echo #
+--echo #######
+
+--replace_result $group_replication_group_name GROUP_REPLICATION_GROUP_NAME
+--eval SET GLOBAL group_replication_group_name= "$group_replication_group_name"
+
+--error ER_GROUP_REPLICATION_CONFIGURATION
+START GROUP_REPLICATION;
+
+--let $rpl_connection_name= server1
+--source include/rpl_connection.inc
+
+--let $assert_text= 'Failed to accept a MySQL connection for Group Replication. Group Replication plugin has an ongoing exclusive operation, like START, STOP or FORCE MEMBERS.'
+--let $assert_cond= "[SELECT COUNT(*) as count FROM performance_schema.error_log WHERE error_code=\'MY-014081\' AND data LIKE \"%Failed to accept a MySQL connection for Group Replication%\", count, 1]" >= "1"
+--source include/assert.inc
+
+--echo #
+--echo # 4. Start M2 with the send command
+--echo #
+--echo #######
+
+--let $rpl_connection_name= server2
+--source include/rpl_connection.inc
+
+--send START GROUP_REPLICATION;
+
+--let $rpl_connection_name= server_1_1
+--source include/rpl_connection.inc
+
+--echo #
+--echo # 5. Sleep for 10 seconds and then clear
+--echo # fail_incoming_connection_ongoing_operation
+--echo #
+--echo #######
+--sleep 10
+
+--let $debug_point = fail_incoming_connection_ongoing_operation
+--source include/remove_debug_point.inc
+
+--echo #
+--echo # 6. reap the start command and M2 must be able to join the group.
+--echo #
+--echo #######
+--let $rpl_connection_name= server2
+--source include/rpl_connection.inc
+
+--reap
+
+--echo #
+--echo # 7. Clean-up
+--echo #
+--echo ######
+
+--let $rpl_connection_name= server1
+--source include/rpl_connection.inc
+
+--source include/group_replication_end.inc
+

plugin/group_replication/src/gcs_mysql_network_provider.cc

+2-2
@@ -34,8 +34,8 @@
 #include "sql/rpl_group_replication.h"
 
 // Forward declaration of Group Replication callback...
-void handle_group_replication_incoming_connection(THD *thd, int fd,
-                                                  SSL *ssl_ctx);
+int handle_group_replication_incoming_connection(THD *thd, int fd,
+                                                 SSL *ssl_ctx);
 
 bool Gcs_mysql_network_provider_auth_interface_impl::get_credentials(
     std::string &username, std::string &password) {

plugin/group_replication/src/gcs_operations.cc

+21-5
@@ -814,12 +814,28 @@ Gcs_mysql_network_provider *Gcs_operations::get_mysql_network_provider() {
   DBUG_TRACE;
   Gcs_mysql_network_provider *result = nullptr;
 
-  gcs_operations_lock->rdlock();
-  if (gcs_interface != nullptr && gcs_mysql_net_provider != nullptr &&
-      gcs_interface->is_initialized()) {
-    result = gcs_mysql_net_provider.get();
+  auto fail_incoming_connection_ongoing_operation_log = []() {
+    LogPluginErr(ERROR_LEVEL,
+                 ER_GRP_RPL_MYSQL_NETWORK_PROVIDER_SERVER_ERROR_COMMAND_ERR,
+                 "Group Replication plugin has an ongoing exclusive operation, "
+                 "like START, STOP or FORCE MEMBERS");
+  };
+
+  DBUG_EXECUTE_IF("fail_incoming_connection_ongoing_operation", {
+    fail_incoming_connection_ongoing_operation_log();
+    return result;
+  });
+
+  Checkable_rwlock::Guard g(*gcs_operations_lock,
+                            Checkable_rwlock::TRY_READ_LOCK);
+  if (g.is_rdlocked()) {
+    if (gcs_interface != nullptr && gcs_mysql_net_provider != nullptr &&
+        gcs_interface->is_initialized()) {
+      result = gcs_mysql_net_provider.get();
+    }
+  } else {
+    fail_incoming_connection_ongoing_operation_log();
   }
-  gcs_operations_lock->unlock();
 
   return result;
 }
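
The hunk above swaps a blocking read lock for a non-blocking "try" read lock. The same pattern can be sketched with the standard library alone. This is a minimal illustration that assumes `std::shared_mutex` as a stand-in for `Checkable_rwlock`. The `Provider` and `Operations` types and the helper `rejected_during_exclusive` are hypothetical names, not part of the plugin.

```cpp
#include <cassert>  // used by the usage checks that follow the block
#include <mutex>
#include <shared_mutex>
#include <thread>

// Hypothetical stand-in for Gcs_mysql_network_provider.
struct Provider {};

class Operations {
 public:
  // Returns the provider only if a shared (read) lock can be taken without
  // blocking; returns nullptr while an exclusive operation holds the lock,
  // so the caller can fail fast and report an error.
  Provider *try_get_provider() {
    std::shared_lock<std::shared_mutex> guard(lock_, std::try_to_lock);
    if (!guard.owns_lock()) return nullptr;
    return &provider_;
  }

  // Simulates an exclusive operation such as START or STOP.
  template <typename Fn>
  void run_exclusive(Fn fn) {
    std::unique_lock<std::shared_mutex> guard(lock_);
    fn();
  }

 private:
  std::shared_mutex lock_;
  Provider provider_;
};

// Probes from a second thread while an exclusive operation is in progress.
// A separate thread is required: trying to lock a std::shared_mutex from the
// thread that already owns it is undefined behavior.
bool rejected_during_exclusive(Operations &ops) {
  bool rejected = false;
  ops.run_exclusive([&] {
    std::thread prober([&] { rejected = (ops.try_get_provider() == nullptr); });
    prober.join();
  });
  return rejected;
}
```

As in the real change, the lock is released by the guard's destructor, which is why the explicit `unlock()` call disappears from the function.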

plugin/group_replication/src/plugin.cc

+12-6
@@ -366,18 +366,23 @@ bool get_allow_single_leader() {
  * @param thd THD object of the connection
  * @param fd File descriptor of the connections
  * @param ssl_ctx SSL data of the connection
+ *
+ * @return int Returns 1 in case of any error. 0 otherwise.
  */
-void handle_group_replication_incoming_connection(THD *thd, int fd,
-                                                  SSL *ssl_ctx) {
+
+int handle_group_replication_incoming_connection(THD *thd, int fd,
+                                                 SSL *ssl_ctx) {
   auto *new_connection = new Network_connection(fd, ssl_ctx);
   new_connection->has_error = false;
+  int error_return = 1;
 
-  Gcs_mysql_network_provider *mysql_provider =
-      gcs_module->get_mysql_network_provider();
-
-  if (mysql_provider) {
+  if (auto mysql_provider = gcs_module->get_mysql_network_provider();
+      mysql_provider) {
     mysql_provider->set_new_connection(thd, new_connection);
+    error_return = 0;
   }
+
+  return error_return;
 }
 
 /**
@@ -1789,6 +1794,7 @@ bool attempt_rejoin() {
    */
   DBUG_EXECUTE_IF("group_replication_fail_rejoin", goto end;);
   view_change_notifier->start_view_modification();
+
   join_state =
       gcs_module->join(*events_handler, *events_handler, view_change_notifier);
   if (join_state == GCS_OK) {

share/messages_to_error_log.txt

+3
@@ -12334,6 +12334,9 @@ ER_THREAD_STILL_ALIVE
 ER_NUM_THREADS_STILL_ALIVE
   eng "Waiting for forceful disconnection of %ld thread(s) to end."
 
+ER_GRP_RPL_MYSQL_NETWORK_PROVIDER_SERVER_ERROR_COMMAND_ERR
+  eng "Failed to accept a MySQL connection for Group Replication. %s. Please retry."
+
 # DO NOT add server-to-client messages here;
 # they go in messages_to_clients.txt
 # in the same directory as this file.

sql/rpl_group_replication.h

+1-1
@@ -87,7 +87,7 @@ bool get_group_replication_view_change_uuid(std::string &uuid);
 bool is_group_replication_member_secondary();
 
 // Callback definition for socket donation
-typedef void (*gr_incoming_connection_cb)(THD *thd, int fd, SSL *ssl_ctx);
+typedef int (*gr_incoming_connection_cb)(THD *thd, int fd, SSL *ssl_ctx);
 void set_gr_incoming_connection(gr_incoming_connection_cb x);
 
 #endif /* RPL_GROUP_REPLICATION_INCLUDED */
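
Changing the callback's return type from `void` to `int` is what lets the server notice a failed delegation. The sketch below shows how a caller might use that return value. It is a simplified illustration: the callback takes only a file descriptor instead of `THD *`, `int` and `SSL *`, and `delegate_connection`, `Delegation`, `accept_all` and `reject_all` are hypothetical names that do not appear in the commit.

```cpp
#include <cassert>  // used by the usage checks that follow the block

// Simplified, hypothetical callback type; the real one is
// int (*)(THD *thd, int fd, SSL *ssl_ctx).
using incoming_connection_cb = int (*)(int fd);

static incoming_connection_cb g_callback = nullptr;

// Mirrors the role of set_gr_incoming_connection: registers the handler.
void set_incoming_connection(incoming_connection_cb cb) { g_callback = cb; }

enum class Delegation { kAccepted, kRejected };

// Server-side caller. Because the callback returns a status (0 on success,
// 1 on error), a failed hand-off can be reported back to the client
// instead of leaving the accepted connection in limbo.
Delegation delegate_connection(int fd) {
  if (g_callback == nullptr || g_callback(fd) != 0) {
    return Delegation::kRejected;
  }
  return Delegation::kAccepted;
}

// Hypothetical handlers standing in for the plugin side.
int accept_all(int /*fd*/) { return 0; }
int reject_all(int /*fd*/) { return 1; }
```

With no handler registered, or with a handler that returns non-zero, `delegate_connection` yields `kRejected`. A handler that returns 0 yields `kAccepted`.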
