Description
Hi!
The RDS documentation recommends disabling aurora_enable_repl_bin_log_filtering when using gh-ost in Aurora MySQL. Looking through the history, I can see a few references to this potentially causing issues with cutovers, so I was looking to see if there is a reproducible case that can demonstrate this?
aurora_enable_repl_bin_log_filtering Background
In Aurora, the writer instance in a cluster sends redo log records to the Aurora storage volume and to reader nodes in the cluster. Since Aurora MySQL stores binary logs in the cluster volume, redo log records are also generated for binary logs under the hood. This is all transparent to users, and the binary logs are presented as they would be in community MySQL. When aurora_enable_repl_bin_log_filtering is enabled, redo logs for binary log records are still sent to Aurora storage, but they are not sent to Aurora readers within the same cluster. Without filtering, these redo logs are sent to reader instances and discarded, since binary logs are not accessible from readers. This can lead to unnecessary amplification of network traffic between the writer and readers, so the recommendation is to leave aurora_enable_repl_bin_log_filtering enabled. Binlog filtering is always enabled in Aurora MySQL version 3, so the parameter was removed in that major version. This setting should not be confused with MySQL binlog replication filtering.
Since binary logs are not accessible on readers, and aurora_enable_repl_bin_log_filtering only affects the internal transport of redo, I highly suspect this is not the real cause of the cutover issues seen.
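As a quick sanity check for anyone reproducing this, the filtering parameter should not change whether binary logs exist on the writer at all. A minimal check, sketched with placeholder host and credentials, might be:

# Sketch: confirm binary logging is active on the writer and read the filtering
# parameter (Aurora MySQL 2.x only; it was removed in Aurora MySQL 3).
mysql -h "$WRITER_ENDPOINT" -u user -p -e "SHOW MASTER STATUS; SELECT @@aurora_enable_repl_bin_log_filtering;"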
Theory
What I suspect is the real cause is the aurora_binlog_replication_max_yield_seconds parameter.
aurora_binlog_replication_max_yield_seconds Background
aurora_binlog_replication_max_yield_seconds was introduced along with aurora_binlog_read_buffer_size to improve read performance for binary log consumer threads. The idea was to increase the I/O read size (aurora_binlog_read_buffer_size) of each binlog read request made by a consumer thread, improving the I/O efficiency and throughput of binlog consumer threads. Instead of reading binary logs from Aurora storage in 8K chunks, you could configure the read buffer size of each binlog read request. The drawback was that reading in larger chunks could lead to more contention with foreground transactions. To allow users to configure this tradeoff, aurora_binlog_replication_max_yield_seconds was introduced: instead of constantly contending with foreground transactions on the active binary log, you could configure binary log consumers to "back off" or "yield" for a number of seconds defined by aurora_binlog_replication_max_yield_seconds.
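For reference, both of these are Aurora MySQL 2.x parameters, so a quick way to see where a cluster currently stands (host and credentials are placeholders) is something like:

# Sketch: read the current binlog consumer tuning parameters on the writer.
mysql -h "$WRITER_ENDPOINT" -u user -p -e "SELECT @@aurora_binlog_replication_max_yield_seconds, @@aurora_binlog_read_buffer_size;"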
Note: In Aurora MySQL version 2.10 the binlog I/O cache was introduced (more information here), which removed the need for the yield-seconds configuration above. The binary log I/O cache aims to minimize read I/O from the Aurora storage layer by keeping the most recent binary log change events in a circular cache on your writer DB instance. This enhancement is enabled by default, so it should not require any additional configuration.
From the docs:
You don't need to adjust any configuration parameters to turn on this optimization. In particular, if you adjust the configuration parameter aurora_binlog_replication_max_yield_seconds to a nonzero value in earlier Aurora MySQL versions, set it back to zero for Aurora MySQL 2.10 and higher.
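Since this is a parameter-group setting rather than a session variable, resetting the yield would be done through the parameter group (the cluster parameter group, if I'm not mistaken). A sketch with the AWS CLI, where the parameter group name is a placeholder and the parameter is dynamic so ApplyMethod=immediate should take effect without a reboot:

# Sketch: reset aurora_binlog_replication_max_yield_seconds to 0 in the
# (placeholder-named) parameter group attached to the Aurora cluster.
aws rds modify-db-cluster-parameter-group \
  --db-cluster-parameter-group-name my-aurora-cluster-params \
  --parameters "ParameterName=aurora_binlog_replication_max_yield_seconds,ParameterValue=0,ApplyMethod=immediate"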
What I suspect here is that if aurora_binlog_replication_max_yield_seconds is configured to a non-zero value, the heartbeat lag in gh-ost cutovers never drops low enough for long enough to allow the cutover to succeed, because the binlog consumer will read an event, sleep ("yield"), and repeat.
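For context on why this would block the cutover: as I understand it, gh-ost periodically injects a heartbeat row into its changelog table (the _*_ghc table) and computes HeartbeatLag from the timestamp of the last heartbeat it has seen come back through the binlog stream, so a consumer that yields for tens of seconds keeps that lag pinned near the yield value and well above --max-lag-millis. A rough way to watch the injected side (table and column names are my assumption of the changelog table layout, based on the sbtest1 migration below):

# Sketch: watch the heartbeat rows gh-ost writes into its changelog table.
# Table/column names are assumptions; adjust to your own migration.
mysql -h "$WRITER_ENDPOINT" -u user -p -e "SELECT hint, value, last_update FROM sysbench._sbtest1_ghc WHERE hint = 'heartbeat';"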
Test
I did some basic testing on my side to validate this and it seems to be the case, but I would love to hear from others. With yield seconds set to 60, my HeartbeatLag does not decrease:
mysql> select @@aurora_binlog_replication_max_yield_seconds,@@aurora_enable_repl_bin_log_filtering,@@aurora_version;
+-----------------------------------------------+----------------------------------------+------------------+
| @@aurora_binlog_replication_max_yield_seconds | @@aurora_enable_repl_bin_log_filtering | @@aurora_version |
+-----------------------------------------------+----------------------------------------+------------------+
| 60 | 1 | 2.11.2 |
+-----------------------------------------------+----------------------------------------+------------------+
1 row in set (0.00 sec)
$ gh-ost --max-load=Threads_running=25 --critical-load=Threads_running=1000 --chunk-size=1000 --max-lag-millis=1500 --user="user" --password="password" --host=<HOST> --allow-on-master --database="sysbench" --table="sbtest1" --verbose --alter="engine=innodb" --switch-to-rbr --allow-master-master --cut-over=default --exact-rowcount --concurrent-rowcount --default-retries=120 --panic-flag-file=/tmp/ghost.panic.flag --postpone-cut-over-flag-file=/tmp/ghost.postpone.flag --execute
[2024/06/18 22:04:39] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.000070, 4)
Copy: 247740/247740 100.0%; Applied: 1596183; Backlog: 1000/1000; Time: 43m0s(total), 9s(copy); streamer: mysql-bin-changelog.000070:4755374; Lag: 0.00s, HeartbeatLag: 62.91s, State: postponing cut-over; ETA: due
Copy: 247740/247740 100.0%; Applied: 1638082; Backlog: 0/1000; Time: 43m30s(total), 9s(copy); streamer: mysql-bin-changelog.000070:134208037; Lag: 0.00s, HeartbeatLag: 30.61s, State: postponing cut-over; ETA: due
Copy: 247740/247740 100.0%; Applied: 1638082; Backlog: 0/1000; Time: 44m0s(total), 9s(copy); streamer: mysql-bin-changelog.000070:134208037; Lag: 0.00s, HeartbeatLag: 60.61s, State: postponing cut-over; ETA: due
[2024/06/18 22:05:45] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.000071, 4)
[2024/06/18 22:05:45] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.000071, 4)
2024-06-18 22:05:45 INFO rotate to next log from mysql-bin-changelog.000071:134221467 to mysql-bin-changelog.000071
2024-06-18 22:05:45 INFO rotate to next log from mysql-bin-changelog.000071:0 to mysql-bin-changelog.000071
Copy: 247740/247740 100.0%; Applied: 1676342; Backlog: 1000/1000; Time: 44m30s(total), 9s(copy); streamer: mysql-bin-changelog.000071:124450137; Lag: 0.00s, HeartbeatLag: 31.31s, State: postponing cut-over; ETA: due
Copy: 247740/247740 100.0%; Applied: 1680895; Backlog: 0/1000; Time: 45m0s(total), 9s(copy); streamer: mysql-bin-changelog.000071:134204651; Lag: 0.00s, HeartbeatLag: 56.01s, State: postponing cut-over; ETA: due
[2024/06/18 22:06:48] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.000072, 4)
2024-06-18 22:06:48 INFO rotate to next log from mysql-bin-changelog.000072:134218538 to mysql-bin-changelog.000072
2024-06-18 22:06:48 INFO rotate to next log from mysql-bin-changelog.000072:0 to mysql-bin-changelog.000072
[2024/06/18 22:06:48] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.000072, 4)
Copy: 247740/247740 100.0%; Applied: 1713305; Backlog: 1000/1000; Time: 45m30s(total), 9s(copy); streamer: mysql-bin-changelog.000072:109020755; Lag: 0.00s, HeartbeatLag: 35.41s, State: postponing cut-over; ETA: due
Copy: 247740/247740 100.0%; Applied: 1723592; Backlog: 0/1000; Time: 46m0s(total), 9s(copy); streamer: mysql-bin-changelog.000072:134205251; Lag: 0.00s, HeartbeatLag: 51.61s, State: postponing cut-over; ETA: due
[2024/06/18 22:07:53] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.000073, 4)
[2024/06/18 22:07:53] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.000073, 4)
2024-06-18 22:07:53 INFO rotate to next log from mysql-bin-changelog.000073:134220372 to mysql-bin-changelog.000073
2024-06-18 22:07:53 INFO rotate to next log from mysql-bin-changelog.000073:0 to mysql-bin-changelog.000073
Copy: 247740/247740 100.0%; Applied: 1748532; Backlog: 1000/1000; Time: 46m30s(total), 9s(copy); streamer: mysql-bin-changelog.000073:89642975; Lag: 0.00s, HeartbeatLag: 41.61s, State: postponing cut-over; ETA: due
Shortly after I set aurora_binlog_replication_max_yield_seconds back to zero in my parameter group, the cutover succeeded:
mysql> select @@aurora_binlog_replication_max_yield_seconds,@@aurora_enable_repl_bin_log_filtering,@@aurora_version;
+-----------------------------------------------+----------------------------------------+------------------+
| @@aurora_binlog_replication_max_yield_seconds | @@aurora_enable_repl_bin_log_filtering | @@aurora_version |
+-----------------------------------------------+----------------------------------------+------------------+
| 0 | 1 | 2.11.2 |
+-----------------------------------------------+----------------------------------------+------------------+
1 row in set (0.00 sec)
2024-06-18 22:11:57 INFO Done waiting for events up to lock; duration=683.348399ms
# Migrating `sysbench`.`sbtest1`; Ghost table is `sysbench`.`_sbtest1_gho`
# Migrating ip-172-17-5-175:3306; inspecting ip-172-17-5-175:3306; executing on i-09fce0c3dab1a3eb2.ec2.internal
# Migration started at Tue Jun 18 21:21:39 +0000 2024
# chunk-size: 1000; max-lag-millis: 1500ms; dml-batch-size: 10; max-load: Threads_running=25; critical-load: Threads_running=1000; nice-ratio: 0.000000
# throttle-additional-flag-file: /tmp/gh-ost.throttle
# postpone-cut-over-flag-file: /tmp/ghost.postpone.flag
# panic-flag-file: /tmp/ghost.panic.flag
# Serving on unix socket: /tmp/gh-ost.sysbench.sbtest1.sock
Copy: 247740/247740 100.0%; Applied: 1924559; Backlog: 0/1000; Time: 50m17s(total), 9s(copy); streamer: mysql-bin-changelog.000077:117908084; Lag: 0.00s, HeartbeatLag: 0.10s, State: migrating; ETA: due
2024-06-18 22:11:57 INFO Setting RENAME timeout as 3 seconds
2024-06-18 22:11:57 INFO Session renaming tables is 301
2024-06-18 22:11:57 INFO Issuing and expecting this to block: rename /* gh-ost */ table `sysbench`.`sbtest1` to `sysbench`.`_sbtest1_del`, `sysbench`.`_sbtest1_gho` to `sysbench`.`sbtest1`
2024-06-18 22:11:58 INFO Found atomic RENAME to be blocking, as expected. Double checking the lock is still in place (though I don't strictly have to)
2024-06-18 22:11:58 INFO Checking session lock: gh-ost.312.lock
2024-06-18 22:11:58 INFO Connection holding lock on original table still exists
2024-06-18 22:11:58 INFO Will now proceed to drop magic table and unlock tables
2024-06-18 22:11:58 INFO Dropping magic cut-over table
2024-06-18 22:11:58 INFO Releasing lock from `sysbench`.`sbtest1`, `sysbench`.`_sbtest1_del`
2024-06-18 22:11:58 INFO Tables unlocked
2024-06-18 22:11:58 INFO Tables renamed
2024-06-18 22:11:58 INFO Lock & rename duration: 1.721119655s. During this time, queries on `sbtest1` were blocked
[2024/06/18 22:11:58] [info] binlogsyncer.go:164 syncer is closing...
[2024/06/18 22:11:58] [error] binlogstreamer.go:77 close sync with err: sync is been closing...
2024-06-18 22:11:58 INFO Closed streamer connection. err=<nil>
[2024/06/18 22:11:58] [info] binlogsyncer.go:179 syncer is closed
2024-06-18 22:11:58 INFO Dropping table `sysbench`.`_sbtest1_ghc`
2024-06-18 22:11:58 INFO Table dropped
2024-06-18 22:11:58 INFO Am not dropping old table because I want this operation to be as live as possible. If you insist I should do it, please add `--ok-to-drop-table` next time. But I prefer you do not. To drop the old table, issue:
2024-06-18 22:11:58 INFO -- drop table `sysbench`.`_sbtest1_del`
2024-06-18 22:11:58 INFO Done migrating `sysbench`.`sbtest1`
2024-06-18 22:11:58 INFO Removing socket file: /tmp/gh-ost.sysbench.sbtest1.sock
2024-06-18 22:11:58 INFO Tearing down inspector
2024-06-18 22:11:58 INFO Tearing down applier
2024-06-18 22:11:58 INFO Tearing down streamer
2024-06-18 22:11:58 INFO Tearing down throttler
# Done
Would love to hear what you think.
Thanks,
Marc