Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not change type or health map for health reasons. #1685

Merged
merged 32 commits into from May 11, 2016

Conversation

alainjobart
Copy link
Contributor

@alainjobart alainjobart commented May 6, 2016

WIP, not ready for review yet.


This change is Reviewable

This only fixes the go unit tests, not the integration tests yet.
And workers don't change used tablets to spare, but back to rdonly.
Pass in keyspace and shard to re-init tablets when re-starting them.
We now use the tablet type from the tablet record.
It is used if init_tablet_type is not set.
It doesn't drive health check, that is on by default now.
Added a new error type to vt/health to mean 'replication is stopped and
I have no idea what the value could be'.
Mysql health module now tries to extrapolate value if possible.
Added unit tests for Mysql health modules.
If vttablet can't know what the replication lag is, it returns a high
value, but keeps the server healthy.
@enisoc
Copy link
Member

enisoc commented May 9, 2016

Reviewed 56 of 58 files at r1, 7 of 7 files at r2.
Review status: all files reviewed at latest revision, 12 unresolved discussions, some commit checks failed.


go/vt/mysqlctl/health.go, line 34 [r2] (raw file):

  if err != nil {
      if !mrl.lastKnownTime.IsZero() {
          // we can extrapolate

We should note in the comment that this is a worst case extrapolation. It's possible that it caught up in between polls, but we'll assume the worst.


go/vt/mysqlctl/health.go, line 36 [r2] (raw file):

          // we can extrapolate
          elapsed := mrl.now().Sub(mrl.lastKnownTime)
          return elapsed + time.Duration(mrl.lastKnownValue)*time.Second, nil

This would be cleaner if you make lastKnownValue be of type time.Duration. Then you do the conversion in only one place (when setting lastKnownValue) and at the end you return mrl.lastKnownValue.


go/vt/tabletmanager/action_agent.go, line 187 [r1] (raw file):

      SchemaOverrides:     schemaOverrides,
      History:             history.New(historyLength),
      lastHealthMapCount:  stats.NewInt("LastHealthMapCount"),

This will break some internal monitoring (vtcoproc). That probably needs to be fixed prior to importing this change, since the coproc is rolled out separately.


go/vt/tabletmanager/action_agent.go, line 199 [r1] (raw file):

  // Publish and set the TargetTabletType. Not a global var
  // since it should never be changed.
  statsTabletType := stats.NewString("TargetTabletType")

I think removing this will break internal dashboards and possibly some monitoring (based on my code search for "target-tablet-type"). Are you planning to fix that along with the import? Or should we remove this later after we've migrated the monitoring to some other variable?


go/vt/tabletmanager/healthcheck.go, line 176 [r1] (raw file):

  shouldBeServing := false
  runUpdateStream := true
  if topo.IsRunningQueryService(tablet.Type) && agent.BinlogPlayerMap != nil && !agent.BinlogPlayerMap.isRunningFilteredReplication() {

Shouldn't this be:

if IsRunningQueryService() && (BinlogPlayerMap == nil || !BinlogPlayerMap.isRunningFilteredReplication()) {

?


go/vt/tabletmanager/healthcheck.go, line 362 [r2] (raw file):

  // Note that this is where we might block for *gracePeriod, depending on the
  // type of state change. See changeCallback() for details.
  if err := agent.refreshTablet(agent.batchCtx, "healthcheck"); err != nil {

Without this call, we'll stop going into lameduck when transitioning healthy -> not healthy (for replica or batch). That was added to help with vtgate EPM. I think in this new spareless world, the right place to add that lameduck back is before disallowQueries() above? I'm not totally sure yet.

This call also used to detect when the MySQL port changes (due to mysqld restart) when transitioning between healthy and not healthy. Not sure if we rely on that?


go/vt/tabletmanager/rpc_backup.go, line 61 [r2] (raw file):

  // let's update our internal state (start query service and other things)
  if err := agent.refreshTablet(ctx, "after backup"); err != nil {

Backup() is under RPCWrapLockAction, so it will already call refreshTablet() after this returns.


go/vt/tabletmanager/rpc_replication.go, line 287 [r2] (raw file):

// SetMaster sets replication master, and waits for the
// reparent_journal table entry up to context timeout

Comment is missing RPCWrap type specification.


go/vt/tabletmanager/rpc_replication.go, line 330 [r2] (raw file):

  }

  // change our type to spare if we used to be the master

Comment is out of date.


go/vt/vtctl/vtctl.go, line 172 [r2] (raw file):

              "Reloads the tablet record on the specified tablet."},
          {"RunHealthCheck", commandRunHealthCheck,
              "<tablet alias> <target tablet type>",

Don't forget to regenerate the doc.


proto/topodata.proto, line 94 [r1] (raw file):

  // tablet health information
  map<string, string> health_map = 11;

Since this proto is stored in topo, and we have beta users, I think we should do this field removal the "safe" way according to the protobuf guide:

// OBSOLETE: tablet health information
// map<string, string> health_map = 11;
reserved 11;

proto/topodata.proto, line 223 [r1] (raw file):

  // The health entries.
  map<string, string> health_map = 4;

Same here.


Comments from Reviewable

The aggregator was not preserving the special error case, now it does,
and has unit tests to prove it.

The integration tests were not fully resetting replication (using 'reset
slave' and not 'reset slave all'). The difference is 'reset slave'
clears the values from 'show slave status', whereas 'reset slave all'
makes the status show nothing (as is the startup case).
@alainjobart
Copy link
Contributor Author

Review status: 59 of 64 files reviewed at latest revision, 12 unresolved discussions, some commit checks failed.


go/vt/mysqlctl/health.go, line 34 [r2] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

We should note in the comment that this is a worst case extrapolation. It's possible that it caught up in between polls, but we'll assume the worst.


Done.


go/vt/mysqlctl/health.go, line 36 [r2] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

This would be cleaner if you make lastKnownValue be of type time.Duration. Then you do the conversion in only one place (when setting lastKnownValue) and at the end you return mrl.lastKnownValue.


Done.


go/vt/tabletmanager/action_agent.go, line 199 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

I think removing this will break internal dashboards and possibly some monitoring (based on my code search for "target-tablet-type"). Are you planning to fix that along with the import? Or should we remove this later after we've migrated the monitoring to some other variable?


Let's talk about this one, I can leave it as is with the current type easily.


go/vt/tabletmanager/healthcheck.go, line 176 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

Shouldn't this be:

if IsRunningQueryService() && (BinlogPlayerMap == nil || !BinlogPlayerMap.isRunningFilteredReplication()) {

?


yes it should be!


go/vt/tabletmanager/healthcheck.go, line 362 [r2] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

Without this call, we'll stop going into lameduck when transitioning healthy -> not healthy (for replica or batch). That was added to help with vtgate EPM. I think in this new spareless world, the right place to add that lameduck back is before disallowQueries() above? I'm not totally sure yet.

This call also used to detect when the MySQL port changes (due to mysqld restart) when transitioning between healthy and not healthy. Not sure if we rely on that?


I am not sure here. If we're unhealthy, we can't serve anyway, so is there any value in going into lameduck? Ah yes, if droid tells us it's being drained. I will add it then.

Also adding the mysql port change when going from non-serving to serving, good catch.


go/vt/tabletmanager/rpc_backup.go, line 61 [r2] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

Backup() is under RPCWrapLockAction, so it will already call refreshTablet() after this returns.


we need the health check to run after the tablet is refreshed though, otherwise it won't get the right type.

I am thinking of re-running health check after a refresh state, so it broadcasts the right state. Then we can leave the restarting of the query service to only be in health check. What do you think?


go/vt/tabletmanager/rpc_replication.go, line 287 [r2] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

Comment is missing RPCWrap type specification.


Done.


go/vt/tabletmanager/rpc_replication.go, line 330 [r2] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

Comment is out of date.


Done.


proto/topodata.proto, line 94 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

Since this proto is stored in topo, and we have beta users, I think we should do this field removal the "safe" way according to the protobuf guide:

// OBSOLETE: tablet health information
// map<string, string> health_map = 11;
reserved 11;
```</details>
Done.

proto/topodata.proto, line 223 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

Same here.


Done.


Comments from Reviewable

Lameduck is clearer now.
Re-adding a lameduck case: when going unhealthy.
All lameduck mode is now conditioned on serving_state_grace_period > 0.
Adding reserved obsolete comments to proto for the fields I removed.
The logic was wrong in the replication delay plugin: when mysqld is not
running, or we can't get the result of 'show slave status', we can't
extrapolate replication delay, as it probably means mysql is down.

Also fixing a few tests to not use target_tablet_type any more.
Now always use enable_replication_lag_check. Fix the tests in this
commit to all work properly.
@enisoc
Copy link
Member

enisoc commented May 10, 2016

Reviewed 22 of 22 files at r3, 4 of 10 files at r4.
Review status: 63 of 69 files reviewed at latest revision, 6 unresolved discussions, some commit checks broke.


go/vt/health/health.go, line 107 [r3] (raw file):

      switch s.err {
      case ErrSlaveNotRunning:
          err = ErrSlaveNotRunning

Perhaps a comment for posterity:

// Return the ErrSlaveNotRunning sentinel value, only if there are no other errors.

go/vt/tabletmanager/healthcheck.go, line 371 [r4] (raw file):

  }

  // in any case, we're done now

Previously, the fact that we don't call disallowQueries() in terminateHealthChecks() for a master was intentional. In case of a master doing a graceful restart, we want it to continue serving during servenv lameduck, because there's nowhere else for master queries to go:

https://github.com/youtube/vitess/blob/master/go/vt/servenv/run.go#L42

After servenv lameduck, the queryservice is stopped from a servenv.OnClose() hook:

https://github.com/youtube/vitess/blob/master/go/cmd/vttablet/vttablet.go#L90


go/vt/tabletmanager/rpc_backup.go, line 61 [r2] (raw file):

Previously, alainjobart (Alain Jobart) wrote…

we need the health check to run after the tablet is refreshed though, otherwise it won't get the right type.

I am thinking of re-running health check after a refresh state, so it broadcasts the right state. Then we can leave the restarting of the query service to only be in health check. What do you think?


SGTM


Comments from Reviewable

We used to store the TabletControl in agent, and use it to make the
decision if we should be running query service in healthcheck.
Now remember the decision from changeCallback, simpler.

Also change the tablet display a bit to show the new information.
@alainjobart
Copy link
Contributor Author

Review status: 55 of 69 files reviewed at latest revision, 6 unresolved discussions.


go/vt/health/health.go, line 107 [r3] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

Perhaps a comment for posterity:

// Return the ErrSlaveNotRunning sentinel value, only if there are no other errors.
```</details>
Done.

go/vt/tabletmanager/action_agent.go, line 187 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

This will break some internal monitoring (vtcoproc). That probably needs to be fixed prior to importing this change, since the coproc is rolled out separately.


So we realized coproc is now obsolete, and will go away.


go/vt/tabletmanager/healthcheck.go, line 371 [r4] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

Previously, the fact that we don't call disallowQueries() in terminateHealthChecks() for a master was intentional. In case of a master doing a graceful restart, we want it to continue serving during servenv lameduck, because there's nowhere else for master queries to go:

https://github.com/youtube/vitess/blob/master/go/vt/servenv/run.go#L42

After servenv lameduck, the queryservice is stopped from a servenv.OnClose() hook:

https://github.com/youtube/vitess/blob/master/go/cmd/vttablet/vttablet.go#L90


I reverted back, and added a comment.


go/vt/tabletmanager/rpc_backup.go, line 61 [r2] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

SGTM


I ended up just simplifying the logic and remembering it. This is a bigger change that I'm not willing to make just yet.


Comments from Reviewable

In our tests, we want default non-master tablets to be NOT_SERVING, as
their replication is most likely not setup.
@enisoc
Copy link
Member

enisoc commented May 10, 2016

:lgtm: after you regenerate the vtctl doc.

Previously, alainjobart (Alain Jobart) wrote…

Do not change type or health map for health reasons.

WIP, not ready for review yet.


Reviewed 5 of 10 files at r4, 9 of 9 files at r5.
Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks broke.


Comments from Reviewable

Approved with PullApprove

Independently of grace period, we need to enter lameduck (and possibly
exit right away), to communicate to vtgate. There is then the servenv
lameduck that will make sure the state change gets sent to vtgate.
@alainjobart alainjobart merged commit 4c55019 into vitessio:master May 11, 2016
@alainjobart alainjobart deleted the sparenomore branch July 7, 2016 08:21
frouioui pushed a commit to planetscale/vitess that referenced this pull request Nov 21, 2023
Signed-off-by: Vitess Cherry-Pick Bot <vitess-cherrypick-bot@planetscale.com>
Co-authored-by: Vitess Cherry-Pick Bot <vitess-cherrypick-bot@planetscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants