Skip to content

Commit b383a77

Browse files
committed
Bug#35671897 Concurrent Binlog rotate and DDL commit hit safe_mutex
assertion Problem: Handling an incident while concurrent transactions are committed to the binlog cause eternal MySQL Server lockup. Analysis: An incident occurs in one of two ways: 1. When a transaction which can't be rolled back fails to write to the binlog cache. 2. When a session want to record a fatal error from another subsystem. The incident will write an Incident_log_event which contains message and error code that describes the problem. The Incident_log_event should always be preceded by a Gtid_log_event, this makes it possible for binlog consumer to skip over the incident in same way as other statements/transactions. After having written the Incident_log_event, the binlog should be rotated and checked for purge. The reason for always rotating and checking the log for purge is that it will potentially free up disk space and thus solve the cause of the incident. The main reason for the lockup is that the session rotates and purges the binary log from outside the binlog group commit (BGC), while other concurrent sessions may potentially be in the middle of an ongoing binlog flush. The normal procedure is that the binary log is written in the flush stage of BGC, and rotations/purges at the end of the commit stage of BGC. This is known to work and well tested; the concurrent use of another execution flow was the reason for the bug. Therefore we fix the bug by following this normal procedure also for incidents. Solution: Defer handling of incident until inside binlog group commit (BGC). The flush stage will check whether any binlog cache is marked as having an incident, in such case it will replace the content of the binlog cache(s) with an Incident_log_event which will then be written to the binlog. As part of handling the incident it will also request that binlog rotation and old file purge logic are performed when that stage is reached. In addition the Incident_log_event will cause dependency tracker to mark all subsequent transactions to depend on it. The patch has following changes: 1. The existing interface to set an incident on the binlog cache is extended to take a message describing the incident, this is used by session threads that experience failure writing to binlog cache. 2. The interface for recording an fatal error in another subsystem is changed with a new function that marks the binlog cache as having an incident and then commits it. 3. The patch slightly change behaviour so that an incident will always rotate the binlog and execute the purge logic, this was not the case earlier when failing to write to the binlog cache during commit - in all other cases it would rotate and purge. This is done by extending the Binlog_group_commit_ctx with an indicator of whether binlog max size was seen to be exceeded during flush and an indicator whether a session want to force rotate. These members are then aggregated over all committed sessions before the binlog rotate+purge commence at end of BGC. This new mechanism is using a somewhat branchless strategy. Testing: New tests are added which provoked the lockup before fix. Existing tests are modified to test that binlog are always rotated and adapt to slight changes in error message which is now more generically phrased. The test for checking functionality handling the case when binlog cache does not exist are removed since it was proven that such case did not exist - neither in old or new implementation. The upgrade compatibility test which intentionally writes an Incident_log_event without preceding Gtid is modified to use a more generic DBUG point. Change-Id: I731727242c40227ffe816fc4f86af59b5520936f
1 parent 692c764 commit b383a77

32 files changed

+668
-619
lines changed

mysql-test/common/rpl/rpl_binlog_max_cache_size.test

Lines changed: 1 addition & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -30,8 +30,7 @@
3030
call mtr.add_suppression("Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT");
3131
call mtr.add_suppression("Option binlog_cache_size ");
3232
call mtr.add_suppression("Replica SQL for channel '': The incident LOST_EVENTS occurred on the source.");
33-
call mtr.add_suppression("The content of the statement cache is corrupted while writing a rollback record of the transaction to the binary log. An incident event has been written to the binary log which will stop the replicas.");
34-
call mtr.add_suppression("Error happend while resetting the transaction cache for a rolled back transaction or a single statement not inside a transaction. An incident event has been written to the binary log which will stop the replicas.");
33+
call mtr.add_suppression("Non-transactional changes were not written to the binlog. An incident event has been written to the binary log which will stop the replicas.");
3534

3635
let $old_max_binlog_cache_size= query_get_value(SHOW VARIABLES LIKE "max_binlog_cache_size", Value, 1);
3736
let $old_binlog_cache_size= query_get_value(SHOW VARIABLES LIKE "binlog_cache_size", Value, 1);

mysql-test/include/rpl/assert_binlog_events.inc

Lines changed: 18 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,7 @@
77
# --let $event_sequence= SEQUENCE_OF_EVENTS
88
# [--let $event_separator= CHAR]
99
# [--let $invert= 1]
10+
# [--let $somewhere= 0]
1011
# [--let $limit= [OFFSET,] ROW_COUNT]
1112
# [--let $binlog_file= FILE]
1213
# [--let $binlog_position= POSITION]
@@ -92,6 +93,12 @@
9293
# the specification. If $invert is set, this scripts instead asserts
9394
# that SHOW BINLOG EVENTS does *not* match the specification.
9495
#
96+
# $somewhere
97+
# By default, this script asserts that SHOW BINLOG EVENTS matches
98+
# the specification from start to end. If $somewhere is set, this scripts
99+
# instead asserts that SHOW BINLOG EVENTS *somewhere* contains
100+
# the specification.
101+
#
95102
# $limit
96103
# Pass this as the LIMIT clause of SHOW BINLOG EVENTS.
97104
#
@@ -197,6 +204,7 @@ while ($_abe_verdict != 'ok')
197204
--let _ABE_FILE= $output_file
198205
--let _ABE_EVENT_SEQUENCE= $event_sequence
199206
--let _ABE_INVERT= $invert
207+
--let _ABE_SOMEWHERE= $somewhere
200208
--let _ABE_DEBUG= $rpl_debug
201209
--let _ABE_EVENT_SEPARATOR= $event_separator
202210
--let _ABE_INCLUDE_HEADER_EVENTS= $include_header_events
@@ -207,6 +215,7 @@ while ($_abe_verdict != 'ok')
207215
my $event_sequence = $ENV{'_ABE_EVENT_SEQUENCE'};
208216
my $file = $ENV{'_ABE_FILE'};
209217
my $invert = $ENV{'_ABE_INVERT'};
218+
my $somewhere = $ENV{'_ABE_SOMEWHERE'};
210219
my $debug = $ENV{'_ABE_DEBUG'};
211220
my $include_header_events = $ENV{'_ABE_INCLUDE_HEADER_EVENTS'};
212221
my $event_separator= $ENV{'_ABE_EVENT_SEPARATOR'};
@@ -243,12 +252,15 @@ while ($_abe_verdict != 'ok')
243252
# (e.g., 'Query' instead of 'Query/xyz').
244253
$event_sequence =~ s{\n}{(?:/[^\n]*)?\n}g;
245254

246-
# Allow 'Type' without 'Info' at the end. Require match until the
247-
# end. Allow missing \n at the end.
248-
$event_sequence .= "(?:/[^\n]*)?\n?" . '$';
255+
# Allow 'Type' without 'Info' at the end. Allow missing \n at the end.
256+
$event_sequence .= "(?:/[^\n]*)?\n?";
249257

250-
# Require match from the beginning.
251-
$event_sequence = '^' . $event_sequence;
258+
if (!$somewhere)
259+
{
260+
# Require match from the beginning or starting position. Require match until
261+
# the end.
262+
$event_sequence = '^' . $event_sequence . '$';
263+
}
252264

253265
if ($debug)
254266
{
@@ -330,6 +342,7 @@ while ($_abe_verdict != 'ok')
330342
--echo binlog_file=<$binlog_file>
331343
--echo statement=$statement
332344
--echo invert=$invert
345+
--echo somewhere=$somewhere
333346
--echo include_header_events=$include_header_events
334347
--echo include_trx_payload_events=$include_trx_payload_events
335348
--echo event_separator=$_ABE_EVENT_SEPARATOR

mysql-test/suite/binlog/r/binlog_incident.result

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
RESET BINARY LOGS AND GTIDS;
22
call mtr.add_suppression("Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT");
33
call mtr.add_suppression("Replica SQL for channel '': The incident LOST_EVENTS occurred on the source.");
4-
call mtr.add_suppression("The content of the statement cache is corrupted while writing a rollback record of the transaction to the binary log. An incident event has been written to the binary log which will stop the replicas.");
4+
call mtr.add_suppression("Non-transactional changes were not written to the binlog. An incident event has been written to the binary log which will stop the replicas.");
55
SET GLOBAL max_binlog_stmt_cache_size = 4096;
66
Warnings:
77
Warning 1745 Option binlog_stmt_cache_size (32768) is greater than max_binlog_stmt_cache_size (4096); setting binlog_stmt_cache_size equal to max_binlog_stmt_cache_size.
Lines changed: 184 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,184 @@
1+
call mtr.add_suppression("Non-transactional changes were not written to the binlog. An incident event has been written to the binary log which will stop the replicas.");
2+
SET GLOBAL max_binlog_stmt_cache_size = 4096;
3+
Warnings:
4+
Warning 1745 Option binlog_stmt_cache_size (32768) is greater than max_binlog_stmt_cache_size (4096); setting binlog_stmt_cache_size equal to max_binlog_stmt_cache_size.
5+
SET GLOBAL binlog_stmt_cache_size = 4096;
6+
CREATE TABLE t1(c1 INT PRIMARY KEY, data TEXT(30000)) ENGINE=MyIsam;
7+
CREATE TABLE t2 (a INT PRIMARY KEY, b VARCHAR(400)) engine=InnoDB;
8+
CREATE PROCEDURE dmlload (seconds int)
9+
begin
10+
set @x=time_to_sec(current_time()) + seconds;
11+
repeat
12+
start transaction;
13+
insert into t2 values (2, repeat('I', 400));
14+
commit;
15+
start transaction;
16+
update t2 set b=repeat('Z', 400) where a=2;
17+
commit;
18+
start transaction;
19+
delete from t2 where a=2;
20+
commit;
21+
until @x <= time_to_sec(current_time())
22+
end repeat;
23+
end%
24+
RESET BINARY LOGS AND GTIDS;
25+
# Start running background load for 5 seconds
26+
call dmlload(5);
27+
# i: 1
28+
Got one of the listed errors
29+
# i: 2
30+
Got one of the listed errors
31+
# i: 3
32+
Got one of the listed errors
33+
# i: 4
34+
Got one of the listed errors
35+
# i: 5
36+
Got one of the listed errors
37+
# i: 6
38+
Got one of the listed errors
39+
# i: 7
40+
Got one of the listed errors
41+
# i: 8
42+
Got one of the listed errors
43+
# i: 9
44+
Got one of the listed errors
45+
# i: 10
46+
Got one of the listed errors
47+
# i: 11
48+
Got one of the listed errors
49+
# i: 12
50+
Got one of the listed errors
51+
# i: 13
52+
Got one of the listed errors
53+
# i: 14
54+
Got one of the listed errors
55+
# i: 15
56+
Got one of the listed errors
57+
# i: 16
58+
Got one of the listed errors
59+
# i: 17
60+
Got one of the listed errors
61+
# i: 18
62+
Got one of the listed errors
63+
# i: 19
64+
Got one of the listed errors
65+
# i: 20
66+
Got one of the listed errors
67+
# i: 21
68+
Got one of the listed errors
69+
# i: 22
70+
Got one of the listed errors
71+
# i: 23
72+
Got one of the listed errors
73+
# i: 24
74+
Got one of the listed errors
75+
# i: 25
76+
Got one of the listed errors
77+
# i: 26
78+
Got one of the listed errors
79+
# i: 27
80+
Got one of the listed errors
81+
# i: 28
82+
Got one of the listed errors
83+
# i: 29
84+
Got one of the listed errors
85+
# i: 30
86+
Got one of the listed errors
87+
# i: 31
88+
Got one of the listed errors
89+
# i: 32
90+
Got one of the listed errors
91+
# i: 33
92+
Got one of the listed errors
93+
# i: 34
94+
Got one of the listed errors
95+
# i: 35
96+
Got one of the listed errors
97+
# i: 36
98+
Got one of the listed errors
99+
# i: 37
100+
Got one of the listed errors
101+
# i: 38
102+
Got one of the listed errors
103+
# i: 39
104+
Got one of the listed errors
105+
# i: 40
106+
Got one of the listed errors
107+
# i: 41
108+
Got one of the listed errors
109+
# i: 42
110+
Got one of the listed errors
111+
# i: 43
112+
Got one of the listed errors
113+
# i: 44
114+
Got one of the listed errors
115+
# i: 45
116+
Got one of the listed errors
117+
# i: 46
118+
Got one of the listed errors
119+
# i: 47
120+
Got one of the listed errors
121+
# i: 48
122+
Got one of the listed errors
123+
# i: 49
124+
Got one of the listed errors
125+
# i: 50
126+
Got one of the listed errors
127+
# There should now be 50 incidents in the binlog
128+
# Collect background load result
129+
# Check that binlogs contain Gtid + Incident
130+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
131+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
132+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
133+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
134+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
135+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
136+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
137+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
138+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
139+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
140+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
141+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
142+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
143+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
144+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
145+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
146+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
147+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
148+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
149+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
150+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
151+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
152+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
153+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
154+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
155+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
156+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
157+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
158+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
159+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
160+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
161+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
162+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
163+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
164+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
165+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
166+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
167+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
168+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
169+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
170+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
171+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
172+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
173+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
174+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
175+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
176+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
177+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
178+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
179+
include/rpl/assert_binlog_events.inc [!Gtid_or_anon # Incident]
180+
DROP TABLE t1;
181+
DROP TABLE t2;
182+
DROP PROCEDURE dmlload;
183+
SET GLOBAL max_binlog_stmt_cache_size= ORIGINAL_VALUE;
184+
SET GLOBAL binlog_stmt_cache_size= ORIGINAL_VALUE;

mysql-test/suite/binlog/r/binlog_write_error.result

Lines changed: 0 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1,8 +1,3 @@
1-
CALL mtr.add_suppression("The content of the statement cache is corrupted "
2-
"while writing a rollback record of the transaction "
3-
"to the binary log. An incident event has been "
4-
"written to the binary log which will stop the "
5-
"slaves.");
61
#
72
# Initialization
83
#

mysql-test/suite/binlog/r/compression_incident_event.result

Lines changed: 6 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -3,16 +3,18 @@ SET SESSION binlog_transaction_compression = ON;
33
CREATE TABLE t1(c1 INT PRIMARY KEY, data TEXT(30000));
44
RESET BINARY LOGS AND GTIDS;
55
TRUNCATE performance_schema.binary_log_transaction_compression_stats;
6+
include/rpl/log_file_max_number.inc
67
# Adding debug point 'binlog_compression_inject_incident' to @@GLOBAL.debug
78
INSERT INTO t1 VALUES (1, REPEAT ('a', 1000));
9+
Check that Incident has been written
10+
Incident
811
# Removing debug point 'binlog_compression_inject_incident' from @@GLOBAL.debug
9-
include/assert.inc [Expected event type Query]
10-
include/assert.inc [Expected event type Table_map]
11-
include/assert.inc [Expected event type Write_rows]
12-
include/assert.inc [Expected event type Xid]
12+
include/assert.inc [Expected event type Anonymous_Gtid]
1313
include/assert.inc [Expected event type Incident]
1414
include/assert.inc [Number of rows in performance_schema.binary_log_transaction_compression_stats = 0 for compression_type != NONE (because of incident event)]
1515
include/assert.inc [Number of rows in performance_schema.binary_log_transaction_compression_stats = 1 for compression_type = NONE]
16+
include/rpl/log_file_max_number.inc
17+
include/assert.inc [incident rotated binlog]
1618
DROP TABLE t1;
1719
SET SESSION binlog_transaction_compression = SAVED;
1820
RESET BINARY LOGS AND GTIDS;

mysql-test/suite/binlog/t/binlog_incident.test

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,7 @@ source include/have_log_bin.inc;
1414
RESET BINARY LOGS AND GTIDS;
1515
call mtr.add_suppression("Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT");
1616
call mtr.add_suppression("Replica SQL for channel '': The incident LOST_EVENTS occurred on the source.");
17-
call mtr.add_suppression("The content of the statement cache is corrupted while writing a rollback record of the transaction to the binary log. An incident event has been written to the binary log which will stop the replicas.");
17+
call mtr.add_suppression("Non-transactional changes were not written to the binlog. An incident event has been written to the binary log which will stop the replicas.");
1818

1919
--let $old_max_binlog_stmt_cache_size= query_get_value(SHOW VARIABLES LIKE "max_binlog_stmt_cache_size", Value, 1)
2020
--let $old_binlog_stmt_cache_size= query_get_value(SHOW VARIABLES LIKE "binlog_stmt_cache_size", Value, 1)

0 commit comments

Comments
 (0)