Description
This comes out of an issue we kept running into that caused us significant pain and outages.
We would run with:

    gh-ost --allow-on-master -assume-rbr \
      -critical-load Threads_running=700 \
      -critical-load-hibernate-seconds 60 \
      -max-load Threads_running=100 \
      -nice-ratio 0.2 \
      -chunk-size 5000 \
      -ask-pass \
      -database=sbtest \
      -table=sbtest1 \
      -user sbtest \
      -host=XXX \
      -postpone-cut-over-flag-file /tmp/gh-ost-postpone-cut-over-flag-sbtest1.log \
      -skip-foreign-key-checks -throttle-additional-flag-file /tmp/gh-ost-throttle_flag-sbtest1.log \
      --alter "add column c1 int(11) NOT NULL DEFAULT '1';" \
      -verbose -initially-drop-old-table -execute 2>&1 | tee /tmp/gh-ost-sbtest1.log
Then, once the row copy was complete, we would sit in the migrating stage for a long time applying binlogs. At this point the heartbeat lag would be 10-30 seconds. We thought that increasing `max-lag-millis` from 1500 to 10000 would give us less throttling and speed up binlog reading and applying (silly us!).
The heartbeat lag would drop below 10 seconds, we would remove the postpone-cut-over flag file, and then we would run into:

    ERROR Timeout while waiting for events up to lock

This made sense, since a lag that is allowed to reach 10 seconds exceeds the cut-over lock timeout of 6 seconds.
Our ask is that the documentation be updated to point out this important side effect of a seemingly innocent parameter, as evident here: https://github.com/github/gh-ost/blob/master/go/logic/migrator.go#L504

    if heartbeatLag > maxLagMillisecondsThrottle || heartbeatLag > cutOverLockTimeout {
        this.migrationContext.Log.Debugf("current HeartbeatLag (%.2fs) is too high, it needs to be less than both --max-lag-millis (%.2fs) and --cut-over-lock-timeout-seconds (%.2fs) to continue", heartbeatLag.Seconds(), maxLagMillisecondsThrottle.Seconds(), cutOverLockTimeout.Seconds())
        return true, nil
    }
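
To make the interplay concrete, here is a minimal, self-contained sketch (our own code, not gh-ost's; the 8-second lag is an illustrative value between the two thresholds) that plugs our settings into the quoted condition: with `max-lag-millis` raised to 10000 the lag check alone passes, yet the lag still exceeds the cut-over lock timeout, so the cut-over remains blocked.

```go
package main

import (
	"fmt"
	"time"
)

// heartbeatLagBlocksCutOver mirrors the quoted condition: the heartbeat lag must be
// below BOTH --max-lag-millis and --cut-over-lock-timeout-seconds for cut-over to continue.
func heartbeatLagBlocksCutOver(heartbeatLag, maxLagMillisThrottle, cutOverLockTimeout time.Duration) bool {
	return heartbeatLag > maxLagMillisThrottle || heartbeatLag > cutOverLockTimeout
}

func main() {
	heartbeatLag := 8 * time.Second       // illustrative: under the raised max-lag-millis, above the lock timeout
	maxLag := 10000 * time.Millisecond    // --max-lag-millis=10000
	cutOverLockTimeout := 6 * time.Second // the 6-second cut-over lock timeout from our run

	// 8s < 10s, so the max-lag check alone would not throttle, but 8s > 6s,
	// so the combined condition still blocks the cut-over.
	fmt.Println(heartbeatLagBlocksCutOver(heartbeatLag, maxLag, cutOverLockTimeout)) // true
}
```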
I also have two questions:

- After this error, gh-ost went back to the migrating stage, supposedly to do rollbacks, which caused an availability issue and we had to kill the process. What is it rolling back? Since no cut-over was done, should it not simply abandon things and release the locks?
- We obviously didn't wait for the rollback to complete, but shouldn't the metadata locks be released before the rollback, right after this error occurs? In our case both the new and old tables remained locked until we killed the process.
For context, we are on AWS Aurora, and the high heartbeat lag is a side effect of `aurora_binlog_replication_max_yield_seconds` being set to a non-zero value.
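
In case it helps anyone reproducing this, below is a small sketch (our own, with a placeholder DSN, and assuming the Aurora parameter is exposed through SHOW GLOBAL VARIABLES on the instance) of one way to check the value on the host gh-ost connects to:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Placeholder DSN; point it at the Aurora writer endpoint gh-ost connects to.
	db, err := sql.Open("mysql", "sbtest:<password>@tcp(XXX:3306)/sbtest")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var name, value string
	// A non-zero value makes Aurora favor write throughput over binlog read latency,
	// which is what inflates the heartbeat lag gh-ost observes.
	query := "SHOW GLOBAL VARIABLES LIKE 'aurora_binlog_replication_max_yield_seconds'"
	if err := db.QueryRow(query).Scan(&name, &value); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s = %s\n", name, value)
}
```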