Description
This comes out of an issue we kept running into that caused us significant pain and outages.
We would run with:

    gh-ost --allow-on-master -assume-rbr \
      -critical-load Threads_running=700 \
      -critical-load-hibernate-seconds 60 \
      -max-load Threads_running=100 \
      -nice-ratio 0.2 \
      -chunk-size 5000 \
      -ask-pass \
      -database=sbtest \
      -table=sbtest1 \
      -user sbtest \
      -host=XXX \
      -postpone-cut-over-flag-file /tmp/gh-ost-postpone-cut-over-flag-sbtest1.log \
      -skip-foreign-key-checks -throttle-additional-flag-file /tmp/gh-ost-throttle_flag-sbtest1.log \
      --alter "add column c1 int(11) NOT NULL DEFAULT '1';" \
      -verbose -initially-drop-old-table -execute 2>&1 | tee /tmp/gh-ost-sbtest1.log
Then, once the row copy was complete, we would sit in the migrating stage for a long time applying binlogs. At this point the heartbeat lag would be 10-30 seconds. We thought that increasing `max-lag-millis` from 1500 to 10000 would give us less throttling and speed up binlog reading and applying (silly us!).
The heartbeat lag would drop below 10 seconds, we would remove the postpone-cut-over flag file, and then we would run into:

    ERROR Timeout while waiting for events up to lock

This made sense, since a lag that is allowed to reach 10 seconds exceeds the cut-over lock timeout of 6 seconds.
Our ask is that the documentation be updated to point out this important side effect of a seemingly innocent parameter, as evident here: https://github.com/github/gh-ost/blob/master/go/logic/migrator.go#L504

    if heartbeatLag > maxLagMillisecondsThrottle || heartbeatLag > cutOverLockTimeout {
        this.migrationContext.Log.Debugf("current HeartbeatLag (%.2fs) is too high, it needs to be less than both --max-lag-millis (%.2fs) and --cut-over-lock-timeout-seconds (%.2fs) to continue", heartbeatLag.Seconds(), maxLagMillisecondsThrottle.Seconds(), cutOverLockTimeout.Seconds())
        return true, nil
    }
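
To make the interplay concrete, here is a minimal, self-contained sketch (our own code, not gh-ost's; the 8-second lag is an illustrative value between the two thresholds) that plugs our settings into the quoted condition: with `max-lag-millis` raised to 10000 the lag check alone passes, yet the lag still exceeds the cut-over lock timeout, so the cut-over remains blocked.

```go
package main

import (
	"fmt"
	"time"
)

// heartbeatLagBlocksCutOver mirrors the quoted condition: the heartbeat lag must be
// below BOTH --max-lag-millis and --cut-over-lock-timeout-seconds for cut-over to continue.
func heartbeatLagBlocksCutOver(heartbeatLag, maxLagMillisThrottle, cutOverLockTimeout time.Duration) bool {
	return heartbeatLag > maxLagMillisThrottle || heartbeatLag > cutOverLockTimeout
}

func main() {
	heartbeatLag := 8 * time.Second       // illustrative: under the raised max-lag-millis, above the lock timeout
	maxLag := 10000 * time.Millisecond    // --max-lag-millis=10000
	cutOverLockTimeout := 6 * time.Second // the 6-second cut-over lock timeout from our run

	// 8s < 10s, so the max-lag check alone would not throttle, but 8s > 6s,
	// so the combined condition still blocks the cut-over.
	fmt.Println(heartbeatLagBlocksCutOver(heartbeatLag, maxLag, cutOverLockTimeout)) // true
}
```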
I also have two questions:

- After this error, gh-ost went back to the migrating stage, supposedly to do rollbacks, which caused an availability issue and we had to kill the process. What is it rolling back? Since no cut-over was done, should it not simply abandon things and release the locks?
- We obviously didn't wait for the rollback to complete, but shouldn't the metadata locks be released before the rollback, right after this error occurs? In our case both the new and old tables remained locked until we killed the process.
For context, we are on AWS Aurora, and the high heartbeat lag is a side effect of `aurora_binlog_replication_max_yield_seconds` being set to a non-zero value.
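
In case it helps anyone reproducing this, below is a small sketch (our own, with a placeholder DSN, and assuming the Aurora parameter is exposed through SHOW GLOBAL VARIABLES on the instance) of one way to check the value on the host gh-ost connects to:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Placeholder DSN; point it at the Aurora writer endpoint gh-ost connects to.
	db, err := sql.Open("mysql", "sbtest:<password>@tcp(XXX:3306)/sbtest")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var name, value string
	// A non-zero value makes Aurora favor write throughput over binlog read latency,
	// which is what inflates the heartbeat lag gh-ost observes.
	query := "SHOW GLOBAL VARIABLES LIKE 'aurora_binlog_replication_max_yield_seconds'"
	if err := db.QueryRow(query).Scan(&name, &value); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s = %s\n", name, value)
}
```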