Do not change type or health map for health reasons. #1685
This only fixes the go unit tests, not the integration tests yet.
And workers don't change used tablets to spare, but back to rdonly.
Pass in keyspace and shard to re-init tablets when re-starting them.
We now use the tablet type from the tablet record.
It is used if init_tablet_type is not set. It doesn't drive the health check; that is on by default now.
Added a new error type to vt/health to mean 'replication is stopped and I have no idea what the value could be'. The MySQL health module now tries to extrapolate the value if possible. Added unit tests for the MySQL health modules. If vttablet can't know what the replication lag is, it returns a high value, but keeps the server healthy.
Reviewed 56 of 58 files at r1, 7 of 7 files at r2.

go/vt/mysqlctl/health.go, line 34 [r2] (raw file):
We should note in the comment that this is a worst case extrapolation. It's possible that it caught up in between polls, but we'll assume the worst.

go/vt/mysqlctl/health.go, line 36 [r2] (raw file):
This would be cleaner if you make

go/vt/tabletmanager/action_agent.go, line 187 [r1] (raw file):
This will break some internal monitoring (vtcoproc). That probably needs to be fixed prior to importing this change, since the coproc is rolled out separately.

go/vt/tabletmanager/action_agent.go, line 199 [r1] (raw file):
I think removing this will break internal dashboards and possibly some monitoring (based on my code search for "target-tablet-type"). Are you planning to fix that along with the import? Or should we remove this later after we've migrated the monitoring to some other variable?

go/vt/tabletmanager/healthcheck.go, line 176 [r1] (raw file):
Shouldn't this be: if IsRunningQueryService() && (BinlogPlayerMap == nil || !BinlogPlayerMap.isRunningFilteredReplication()) { ?

go/vt/tabletmanager/healthcheck.go, line 362 [r2] (raw file):
Without this call, we'll stop going into lameduck when transitioning healthy -> not healthy (for replica or batch). That was added to help with vtgate EPM. I think in this new spareless world, the right place to add that lameduck back is before
This call also used to detect when the MySQL port changes (due to mysqld restart) when transitioning between healthy and not healthy. Not sure if we rely on that?

go/vt/tabletmanager/rpc_backup.go, line 61 [r2] (raw file):
Backup() is under RPCWrapLockAction, so it will already call refreshTablet() after this returns.

go/vt/tabletmanager/rpc_replication.go, line 287 [r2] (raw file):
Comment is missing RPCWrap type specification.

go/vt/tabletmanager/rpc_replication.go, line 330 [r2] (raw file):
Comment is out of date.

go/vt/vtctl/vtctl.go, line 172 [r2] (raw file):
Don't forget to regenerate the doc.

proto/topodata.proto, line 94 [r1] (raw file):
Since this proto is stored in topo, and we have beta users, I think we should do this field removal the "safe" way according to the protobuf guide:

proto/topodata.proto, line 223 [r1] (raw file):
Same here.
The aggregator was not preserving the special error case, now it does, and has unit tests to prove it. The integration tests were not fully resetting replication (using 'reset slave' and not 'reset slave all'). The difference is 'reset slave' clears the values from 'show slave status', whereas 'reset slave all' makes the status show nothing (as is the startup case).
Review status: 59 of 64 files reviewed at latest revision, 12 unresolved discussions, some commit checks failed.

go/vt/mysqlctl/health.go, line 34 [r2] (raw file):
Lameduck is clearer now. Re-adding a lameduck case: when going unhealthy. All lameduck mode is now conditioned on serving_state_grace_period > 0. Adding reserved obsolete comments to proto for the fields I removed.
The logic was wrong in the replication delay plugin: when mysqld is not running, or we can't get the result of 'show slave status', we can't extrapolate replication delay, as it probably means mysql is down. Also fixing a few tests to not use target_tablet_type any more.
Now always use enable_replication_lag_check. Fix the tests in this commit to all work properly.
Reviewed 22 of 22 files at r3, 4 of 10 files at r4.

go/vt/health/health.go, line 107 [r3] (raw file):
Perhaps a comment for posterity:

go/vt/tabletmanager/healthcheck.go, line 371 [r4] (raw file):
Previously, the fact that we don't call
https://github.com/youtube/vitess/blob/master/go/vt/servenv/run.go#L42
After servenv lameduck, the queryservice is stopped from a
https://github.com/youtube/vitess/blob/master/go/cmd/vttablet/vttablet.go#L90

go/vt/tabletmanager/rpc_backup.go, line 61 [r2] (raw file):
We used to store the TabletControl in the agent, and use it in healthcheck to decide whether we should be running the query service. Now we just remember the decision from changeCallback, which is simpler. Also changed the tablet display a bit to show the new information.
Review status: 55 of 69 files reviewed at latest revision, 6 unresolved discussions.

go/vt/health/health.go, line 107 [r3] (raw file):
In our tests, we want default non-master tablets to be NOT_SERVING, as their replication is most likely not set up.
after you regenerate the vtctl doc.
Independently of grace period, we need to enter lameduck (and possibly exit right away), to communicate to vtgate. There is then the servenv lameduck that will make sure the state change gets sent to vtgate.
Signed-off-by: Vitess Cherry-Pick Bot <vitess-cherrypick-bot@planetscale.com> Co-authored-by: Vitess Cherry-Pick Bot <vitess-cherrypick-bot@planetscale.com>
WIP, not ready for review yet.