Do not change type or health map for health reasons. #1685

alainjobart · 2016-05-06T17:38:04Z

WIP, not ready for review yet.

This change is

This only fixes the go unit tests, not the integration tests yet.

And workers don't change used tablets to spare, but back to rdonly.

Pass in keyspace and shard to re-init tablets when re-starting them.

We now use the tablet type from the tablet record.

It is used if init_tablet_type is not set. It doesn't drive health check, that is on by default now.

Added a new error type to vt/health to mean 'replication is stopped and I have no idea what the value could be'. Mysql health module now tries to extrapolate value if possible. Added unit tests for Mysql health modules. If vttablet can't know what the replication lag is, it returns a high value, but keeps the server healthy.

enisoc · 2016-05-09T23:10:59Z

Reviewed 56 of 58 files at r1, 7 of 7 files at r2.
Review status: all files reviewed at latest revision, 12 unresolved discussions, some commit checks failed.

go/vt/mysqlctl/health.go, line 34 [r2] (raw file):

  if err != nil {
      if !mrl.lastKnownTime.IsZero() {
          // we can extrapolate

We should note in the comment that this is a worst case extrapolation. It's possible that it caught up in between polls, but we'll assume the worst.

go/vt/mysqlctl/health.go, line 36 [r2] (raw file):

          // we can extrapolate
          elapsed := mrl.now().Sub(mrl.lastKnownTime)
          return elapsed + time.Duration(mrl.lastKnownValue)*time.Second, nil

This would be cleaner if you make lastKnownValue be of type time.Duration. Then you do the conversion in only one place (when setting lastKnownValue) and at the end you return mrl.lastKnownValue.

go/vt/tabletmanager/action_agent.go, line 187 [r1] (raw file):

      SchemaOverrides:     schemaOverrides,
      History:             history.New(historyLength),
      lastHealthMapCount:  stats.NewInt("LastHealthMapCount"),

This will break some internal monitoring (vtcoproc). That probably needs to be fixed prior to importing this change, since the coproc is rolled out separately.

go/vt/tabletmanager/action_agent.go, line 199 [r1] (raw file):

  // Publish and set the TargetTabletType. Not a global var
  // since it should never be changed.
  statsTabletType := stats.NewString("TargetTabletType")

I think removing this will break internal dashboards and possibly some monitoring (based on my code search for "target-tablet-type"). Are you planning to fix that along with the import? Or should we remove this later after we've migrated the monitoring to some other variable?

go/vt/tabletmanager/healthcheck.go, line 176 [r1] (raw file):

  shouldBeServing := false
  runUpdateStream := true
  if topo.IsRunningQueryService(tablet.Type) && agent.BinlogPlayerMap != nil && !agent.BinlogPlayerMap.isRunningFilteredReplication() {

Shouldn't this be:

if IsRunningQueryService() && (BinlogPlayerMap == nil || !BinlogPlayerMap.isRunningFilteredReplication()) {

?

go/vt/tabletmanager/healthcheck.go, line 362 [r2] (raw file):

  // Note that this is where we might block for *gracePeriod, depending on the
  // type of state change. See changeCallback() for details.
  if err := agent.refreshTablet(agent.batchCtx, "healthcheck"); err != nil {

Without this call, we'll stop going into lameduck when transitioning healthy -> not healthy (for replica or batch). That was added to help with vtgate EPM. I think in this new spareless world, the right place to add that lameduck back is before disallowQueries() above? I'm not totally sure yet.

This call also used to detect when the MySQL port changes (due to mysqld restart) when transitioning between healthy and not healthy. Not sure if we rely on that?

go/vt/tabletmanager/rpc_backup.go, line 61 [r2] (raw file):

  // let's update our internal state (start query service and other things)
  if err := agent.refreshTablet(ctx, "after backup"); err != nil {

Backup() is under RPCWrapLockAction, so it will already call refreshTablet() after this returns.

go/vt/tabletmanager/rpc_replication.go, line 287 [r2] (raw file):

// SetMaster sets replication master, and waits for the
// reparent_journal table entry up to context timeout

Comment is missing RPCWrap type specification.

go/vt/tabletmanager/rpc_replication.go, line 330 [r2] (raw file):

  }

  // change our type to spare if we used to be the master

Comment is out of date.

go/vt/vtctl/vtctl.go, line 172 [r2] (raw file):

              "Reloads the tablet record on the specified tablet."},
          {"RunHealthCheck", commandRunHealthCheck,
              "<tablet alias> <target tablet type>",

Don't forget to regenerate the doc.

proto/topodata.proto, line 94 [r1] (raw file):

  // tablet health information
  map<string, string> health_map = 11;

Since this proto is stored in topo, and we have beta users, I think we should do this field removal the "safe" way according to the protobuf guide:

// OBSOLETE: tablet health information
// map<string, string> health_map = 11;
reserved 11;

proto/topodata.proto, line 223 [r1] (raw file):

  // The health entries.
  map<string, string> health_map = 4;

Same here.

Comments from Reviewable

The aggregator was not preserving the special error case, now it does, and has unit tests to prove it. The integration tests were not fully resetting replication (using 'reset slave' and not 'reset slave all'). The difference is 'reset slave' clears the values from 'show slave status', whereas 'reset slave all' makes the status show nothing (as is the startup case).

alainjobart · 2016-05-10T16:03:26Z

Review status: 59 of 64 files reviewed at latest revision, 12 unresolved discussions, some commit checks failed.

go/vt/mysqlctl/health.go, line 34 [r2] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

We should note in the comment that this is a worst case extrapolation. It's possible that it caught up in between polls, but we'll assume the worst.

Done.

go/vt/mysqlctl/health.go, line 36 [r2] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

This would be cleaner if you make lastKnownValue be of type time.Duration. Then you do the conversion in only one place (when setting lastKnownValue) and at the end you return mrl.lastKnownValue.

Done.

go/vt/tabletmanager/action_agent.go, line 199 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

I think removing this will break internal dashboards and possibly some monitoring (based on my code search for "target-tablet-type"). Are you planning to fix that along with the import? Or should we remove this later after we've migrated the monitoring to some other variable?

Let's talk about this one, I can leave it as is with the current type easily.

go/vt/tabletmanager/healthcheck.go, line 176 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

Shouldn't this be:

if IsRunningQueryService() && (BinlogPlayerMap == nil || !BinlogPlayerMap.isRunningFilteredReplication()) {

?

yes it should be!

go/vt/tabletmanager/healthcheck.go, line 362 [r2] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

Without this call, we'll stop going into lameduck when transitioning healthy -> not healthy (for replica or batch). That was added to help with vtgate EPM. I think in this new spareless world, the right place to add that lameduck back is before disallowQueries() above? I'm not totally sure yet.

This call also used to detect when the MySQL port changes (due to mysqld restart) when transitioning between healthy and not healthy. Not sure if we rely on that?

I am not sure here. If we're unhealthy, we can't serve anyway, so is there any value in going into lameduck? Ah yes, if droid tells us it's being drained. I will add it then.

Also adding the mysql port change when going from non-serving to serving, good catch.

go/vt/tabletmanager/rpc_backup.go, line 61 [r2] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

Backup() is under RPCWrapLockAction, so it will already call refreshTablet() after this returns.

we need the health check to run after the tablet is refreshed though, otherwise it won't get the right type.

I am thinking of re-running health check after a refresh state, so it broadcasts the right state. Then we can leave the restarting of the query service to only be in health check. What do you think?

go/vt/tabletmanager/rpc_replication.go, line 287 [r2] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

Comment is missing RPCWrap type specification.

Done.

go/vt/tabletmanager/rpc_replication.go, line 330 [r2] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

Comment is out of date.

Done.

proto/topodata.proto, line 94 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

Since this proto is stored in topo, and we have beta users, I think we should do this field removal the "safe" way according to the protobuf guide:
// OBSOLETE: tablet health information
// map<string, string> health_map = 11;
reserved 11;
```</details>
Done.

proto/topodata.proto, line 223 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

Same here.

Done.

Comments from Reviewable

Lameduck is clearer now. Re-adding a lameduck case: when going unhealthy. All lameduck mode is now conditioned on serving_state_grace_period > 0. Adding reserved obsolete comments to proto for the fields I removed.

The logic was wrong in the replication delay plugin: when mysqld is not running, or we can't get the result of 'show slave status', we can't extrapolate replication delay, as it probably means mysql is down. Also fixing a few tests to not use target_tablet_type any more.

Now always use enable_replication_lag_check. Fix the tests in this commit to all work properly.

enisoc · 2016-05-10T18:34:40Z

Reviewed 22 of 22 files at r3, 4 of 10 files at r4.
Review status: 63 of 69 files reviewed at latest revision, 6 unresolved discussions, some commit checks broke.

go/vt/health/health.go, line 107 [r3] (raw file):

      switch s.err {
      case ErrSlaveNotRunning:
          err = ErrSlaveNotRunning

Perhaps a comment for posterity:

// Return the ErrSlaveNotRunning sentinel value, only if there are no other errors.

go/vt/tabletmanager/healthcheck.go, line 371 [r4] (raw file):

  }

  // in any case, we're done now

Previously, the fact that we don't call disallowQueries() in terminateHealthChecks() for a master was intentional. In case of a master doing a graceful restart, we want it to continue serving during servenv lameduck, because there's nowhere else for master queries to go:

https://github.com/youtube/vitess/blob/master/go/vt/servenv/run.go#L42

After servenv lameduck, the queryservice is stopped from a servenv.OnClose() hook:

https://github.com/youtube/vitess/blob/master/go/cmd/vttablet/vttablet.go#L90

go/vt/tabletmanager/rpc_backup.go, line 61 [r2] (raw file):

Previously, alainjobart (Alain Jobart) wrote…

we need the health check to run after the tablet is refreshed though, otherwise it won't get the right type.

I am thinking of re-running health check after a refresh state, so it broadcasts the right state. Then we can leave the restarting of the query service to only be in health check. What do you think?

SGTM

Comments from Reviewable

We used to store the TabletControl in agent, and use it to make the decision if we should be running query service in healthcheck. Now remember the decision from changeCallback, simpler. Also change the tablet display a bit to show the new information.

alainjobart · 2016-05-10T21:26:13Z

Review status: 55 of 69 files reviewed at latest revision, 6 unresolved discussions.

go/vt/health/health.go, line 107 [r3] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

Perhaps a comment for posterity:

// Return the ErrSlaveNotRunning sentinel value, only if there are no other errors.
```</details>
Done.

go/vt/tabletmanager/action_agent.go, line 187 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

This will break some internal monitoring (vtcoproc). That probably needs to be fixed prior to importing this change, since the coproc is rolled out separately.

So we realized coproc is now obsolete, and will go away.

go/vt/tabletmanager/healthcheck.go, line 371 [r4] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

Previously, the fact that we don't call disallowQueries() in terminateHealthChecks() for a master was intentional. In case of a master doing a graceful restart, we want it to continue serving during servenv lameduck, because there's nowhere else for master queries to go:

https://github.com/youtube/vitess/blob/master/go/vt/servenv/run.go#L42

After servenv lameduck, the queryservice is stopped from a servenv.OnClose() hook:

https://github.com/youtube/vitess/blob/master/go/cmd/vttablet/vttablet.go#L90

I reverted back, and added a comment.

go/vt/tabletmanager/rpc_backup.go, line 61 [r2] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

SGTM

I ended up just simplifying the logic and remembering it. This is a bigger change that I'm not willing to make just yet.

Comments from Reviewable

In our tests, we want default non-master tablets to be NOT_SERVING, as their replication is most likely not setup.

enisoc · 2016-05-10T23:51:59Z

after you regenerate the vtctl doc.

Previously, alainjobart (Alain Jobart) wrote…

Do not change type or health map for health reasons.

WIP, not ready for review yet.

Reviewed 5 of 10 files at r4, 9 of 9 files at r5.
Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks broke.

Comments from Reviewable

Independently of grace period, we need to enter lameduck (and possibly exit right away), to communicate to vtgate. There is then the servenv lameduck that will make sure the state change gets sent to vtgate.

Signed-off-by: Vitess Cherry-Pick Bot <vitess-cherrypick-bot@planetscale.com> Co-authored-by: Vitess Cherry-Pick Bot <vitess-cherrypick-bot@planetscale.com>

First pass at not going spare if unhealthy.

f33951b

This only fixes the go unit tests, not the integration tests yet.

googlebot added the cla: yes label May 6, 2016

alainjobart added 13 commits May 6, 2016 11:21

Fixing tabletmanager.py with new healthcheck.

e97391d

Fixing this file too.

0a6d9de

Now ChangeType also runs a health check.

bc6af81

And workers don't change used tablets to spare, but back to rdonly.

Re-init tablets to fix vtgatev2_test.py.

fc89dab

Pass in keyspace and shard to re-init tablets when re-starting them.

Removing tablet type parameter to RunHealthCheck.

859c558

We now use the tablet type from the tablet record.

Removing targetTabletType from health check proto.

5c7b385

Now making target_tablet_type obsolete.

8ecfa8a

It is used if init_tablet_type is not set. It doesn't drive health check, that is on by default now.

No more healthmap in tablet.

6d815ad

Better doc for tablet type, better unit tests.

25bc282

Now run a health check after master demotion.

1071e4e

TER now goes to REPLICA as well. Updating tests.

1161251

Experimental server type now runs query service.

e1118ee

alainjobart added 3 commits May 10, 2016 09:08

Lots of fixes after Anthony's comments.

1a5f6b1

Lameduck is clearer now. Re-adding a lameduck case: when going unhealthy. All lameduck mode is now conditioned on serving_state_grace_period > 0. Adding reserved obsolete comments to proto for the fields I removed.

Removing target_tablet_type from tests.

9f93aaa

Now always use enable_replication_lag_check. Fix the tests in this commit to all work properly.

alainjobart added 4 commits May 10, 2016 11:35

Fixing this test after changes.

67a9196

Addressing couple more review comments.

7d2b3bb

Fixing tests after last change.

2702ec2

Changing expectations on default state.

84939d6

In our tests, we want default non-master tablets to be NOT_SERVING, as their replication is most likely not setup.

Fixing this integration test.

ec97892

alainjobart added 8 commits May 11, 2016 07:34

Now always entering lameduck on shutdown.

c4d287e

Independently of grace period, we need to enter lameduck (and possibly exit right away), to communicate to vtgate. There is then the servenv lameduck that will make sure the state change gets sent to vtgate.

Fixing this test after state change.

6e55faf

Also fixing this test.

1132202

Re-generating the doc.

552ba12

Merge branch 'master' into sparenomore

c10a84f

Adding result of merge (I think)

8902d03

Nox fixing these tests.

839b600

Fixing the last 3 failing tests.

9656240

alainjobart merged commit 4c55019 into vitessio:master May 11, 2016

alainjobart deleted the sparenomore branch July 7, 2016 08:21

frouioui pushed a commit to planetscale/vitess that referenced this pull request Nov 21, 2023

backport of 1632 (vitessio#1685)

853f49a

Signed-off-by: Vitess Cherry-Pick Bot <vitess-cherrypick-bot@planetscale.com> Co-authored-by: Vitess Cherry-Pick Bot <vitess-cherrypick-bot@planetscale.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do not change type or health map for health reasons. #1685

Do not change type or health map for health reasons. #1685

alainjobart commented May 6, 2016 •

edited by enisoc

enisoc commented May 9, 2016

alainjobart commented May 10, 2016

enisoc commented May 10, 2016

alainjobart commented May 10, 2016

enisoc commented May 10, 2016 •

edited

Do not change type or health map for health reasons.

Do not change type or health map for health reasons. #1685

Do not change type or health map for health reasons. #1685

Conversation

alainjobart commented May 6, 2016 • edited by enisoc

enisoc commented May 9, 2016

alainjobart commented May 10, 2016

enisoc commented May 10, 2016

alainjobart commented May 10, 2016

enisoc commented May 10, 2016 • edited

Do not change type or health map for health reasons.

alainjobart commented May 6, 2016 •

edited by enisoc

enisoc commented May 10, 2016 •

edited