healthcheck: update healthy tablets correctly when a stream returns an error or times out #7654

Merged
merged 3 commits into from
Mar 13, 2021

Member

@deepthi deepthi commented Mar 10, 2021

Description

Some error conditions on a tablet's health check stream, such as a returned error or a timeout, require the healthy tablet list to be updated.

Related Issue(s)

Fixes #7472
Fixes #7177

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Impacted Areas in Vitess

Components that this PR will affect:

  • Query Serving
  • VReplication
  • Cluster Management
  • Build/CI
  • VTAdmin

Testing

Unit tests were added for each case. For #7472 I also followed the provided testing procedure to reproduce the problem and confirmed that the fix works.

mysql> show vitess_tablets;
+-------+----------+-------+------------+---------+------------------+------------------+----------------------+
| Cell  | Keyspace | Shard | TabletType | State   | Alias            | Hostname         | MasterTermStartTime  |
+-------+----------+-------+------------+---------+------------------+------------------+----------------------+
| zone1 | commerce | 0     | MASTER     | SERVING | zone1-0000000100 | localhost        | 2021-03-10T01:24:33Z |
| zone1 | commerce | 0     | REPLICA    | SERVING | zone1-0000000101 | localhost        |                      |
| zone1 | commerce | 0     | RDONLY     | SERVING | zone1-0000000102 | localhost        |                      |
+-------+----------+-------+------------+---------+------------------+------------------+----------------------+
3 rows in set (0.00 sec)
----
kill rdonly vttablet
----
mysql> show vitess_tablets;
+-------+----------+-------+------------+-------------+------------------+------------------+----------------------+
| Cell  | Keyspace | Shard | TabletType | State       | Alias            | Hostname         | MasterTermStartTime  |
+-------+----------+-------+------------+-------------+------------------+------------------+----------------------+
| zone1 | commerce | 0     | MASTER     | SERVING     | zone1-0000000100 | localhost        | 2021-03-10T01:24:33Z |
| zone1 | commerce | 0     | REPLICA    | SERVING     | zone1-0000000101 | localhost        |                      |
| zone1 | commerce | 0     | RDONLY     | NOT_SERVING | zone1-0000000102 | localhost        |                      |
+-------+----------+-------+------------+-------------+------------------+------------------+----------------------+
3 rows in set (0.00 sec)

mysql> use @rdonly;
Database changed
mysql> select * from product;
ERROR 1105 (HY000): target: commerce.0.rdonly: no healthy tablet available for 'keyspace:"commerce" shard:"0" tablet_type:RDONLY '
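
The behavior shown above can be sketched roughly as follows. This is a simplified illustration, not the actual vtgate code: the `tablet` type and the `healthyTablets` and `pickTablet` helpers are hypothetical names, and the error text only imitates the message in the output. The point is that once the RDONLY tablet is no longer in the healthy list, a query targeting `@rdonly` should fail immediately instead of being routed to a dead tablet.

```go
package main

import "fmt"

// tablet is a hypothetical, simplified stand-in for a tablet health record.
type tablet struct {
	Alias   string
	Type    string
	Serving bool
}

// healthyTablets returns only the serving tablets of the requested type.
func healthyTablets(all []tablet, tabletType string) []tablet {
	var out []tablet
	for _, t := range all {
		if t.Type == tabletType && t.Serving {
			out = append(out, t)
		}
	}
	return out
}

// pickTablet returns the first healthy tablet for the target, or an error
// when the healthy list for that target is empty.
func pickTablet(all []tablet, keyspace, shard, tabletType string) (tablet, error) {
	healthy := healthyTablets(all, tabletType)
	if len(healthy) == 0 {
		return tablet{}, fmt.Errorf("no healthy tablet available for %s.%s.%s", keyspace, shard, tabletType)
	}
	return healthy[0], nil
}

func main() {
	tablets := []tablet{
		{Alias: "zone1-0000000100", Type: "MASTER", Serving: true},
		{Alias: "zone1-0000000101", Type: "REPLICA", Serving: true},
		// The rdonly vttablet was killed, so it is no longer serving.
		{Alias: "zone1-0000000102", Type: "RDONLY", Serving: false},
	}
	_, err := pickTablet(tablets, "commerce", "0", "RDONLY")
	fmt.Println(err)
}
```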

…n error or times out

Signed-off-by: deepthi <deepthi@planetscale.com>
@deepthi deepthi requested review from sougou and systay March 10, 2021 01:48
…r timeout from healthcheck stream

Signed-off-by: deepthi <deepthi@planetscale.com>
@@ -404,17 +404,17 @@ func (hc *HealthCheckImpl) deleteTablet(tablet *topodata.Tablet) {
}
}

-func (hc *HealthCheckImpl) updateHealth(th *TabletHealth, shr *query.StreamHealthResponse, currentTarget *query.Target, trivialNonMasterUpdate bool, isMasterUpdate bool, isMasterChange bool) {
+func (hc *HealthCheckImpl) updateHealth(th *TabletHealth, currentTarget *query.Target, trivialUpdate bool, isPrimaryUp bool) {
Contributor

Rename this to isPrimaryUpdate? Otherwise, I was reading it as "is primary up".

Member Author

It is supposed to be "is primary up". The old parameters (isMasterChange and isMasterUpdate) were actually unnecessary because all the information they pass in is available within the scope of this func. However, we do need to know whether the primary should be marked unhealthy (if there is an error/timeout on the healthcheck connection), so I introduced this parameter.

topoproto.TabletAliasString(hc.healthy[targetKey][0].Tablet.Alias),
shr.TabletExternallyReparentedTimestamp,
hc.healthy[targetKey][0].MasterTermStartTime)
if th.Target.TabletType == topodata.TabletType_MASTER {
Contributor

I would presume that isPrimaryUp would be true only if the tablet type was MASTER. Is there a situation where that does not hold?

Member Author

@deepthi deepthi Mar 11, 2021

That is a good point. I suppose this is more of a consistency check - to make sure that we still behave correctly if a caller passes in isPrimaryUp true for a non-master tablet type.

}
} else {
Contributor

If this section was the reason you added the MASTER check, it may read better if you explicitly checked for it here. Something like th.Target.TabletType == topodata.TabletType_MASTER && !isPrimaryUp. But I'm not sure if that means the same thing.

Member Author

The MASTER check was already there but in the form of a boolean (isMasterUpdate).
I can break this into two separate if blocks instead of an if .. if .. else, if that is easier to read and understand. What you are suggesting is equivalent to how it is written today.

Contributor

The concern was the deep nesting which made it non-obvious. A cascading if (or switch) may read better. Try it. If it doesn't improve it, we can keep this as is.

Member Author

Ok, I think the switch made it better.
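
For readers following the thread, the cascading `switch` shape discussed here can be sketched as below. This is a simplified illustration with stand-in types, not the merged code: `tabletHealth`, `healthCheck`, `updateHealthy`, and `replaceOrRemove` are hypothetical, and the real function also compares primary term start times, which this sketch leaves out.

```go
package main

import "fmt"

type tabletType int

const (
	typeReplica tabletType = iota
	typePrimary
)

// tabletHealth is a hypothetical, simplified stand-in for TabletHealth.
type tabletHealth struct {
	Alias   string
	Type    tabletType
	Serving bool
}

// healthCheck keeps the list of healthy tablets per target key.
type healthCheck struct {
	healthy map[string][]*tabletHealth
}

// replaceOrRemove drops any existing entry for th and re-adds it only if serving.
func replaceOrRemove(list []*tabletHealth, th *tabletHealth) []*tabletHealth {
	out := make([]*tabletHealth, 0, len(list)+1)
	for _, cur := range list {
		if cur.Alias != th.Alias {
			out = append(out, cur)
		}
	}
	if th.Serving {
		out = append(out, th)
	}
	return out
}

// updateHealthy uses a flat switch instead of nested if/else blocks.
func (hc *healthCheck) updateHealthy(targetKey string, th *tabletHealth, isPrimaryUp bool) {
	// Derive isPrimary from the tablet itself, so a caller passing
	// isPrimaryUp=true for a non-primary tablet falls through to default.
	isPrimary := th.Type == typePrimary
	switch {
	case isPrimary && isPrimaryUp:
		// At most one healthy primary per target: replace the list.
		hc.healthy[targetKey] = []*tabletHealth{th}
	case isPrimary && !isPrimaryUp:
		// No healthy primary tablet.
		hc.healthy[targetKey] = []*tabletHealth{}
	default:
		hc.healthy[targetKey] = replaceOrRemove(hc.healthy[targetKey], th)
	}
}

func main() {
	hc := &healthCheck{healthy: map[string][]*tabletHealth{}}
	key := "commerce.0.primary"
	p := &tabletHealth{Alias: "zone1-0000000100", Type: typePrimary, Serving: true}
	hc.updateHealthy(key, p, true)
	fmt.Println(len(hc.healthy[key]))
	hc.updateHealthy(key, p, false)
	fmt.Println(len(hc.healthy[key]))
}
```

Each case states its full condition, so the "primary is down" branch no longer hides inside an `else`.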

Contributor

@sougou sougou left a comment

Approving, in case no further improvements are possible.

Signed-off-by: deepthi <deepthi@planetscale.com>
@deepthi deepthi added this to In progress in Cluster Management via automation Mar 13, 2021
@deepthi deepthi moved this from In progress to Reviewer approved in Cluster Management Mar 13, 2021
@deepthi deepthi merged commit 58fc9a1 into vitessio:master Mar 13, 2021
Cluster Management automation moved this from Reviewer approved to Done Mar 13, 2021
@deepthi deepthi deleted the ds-fix-7472-7177 branch March 13, 2021 01:09
@askdba askdba added this to the v10.0 milestone Mar 18, 2021
hc.healthy[targetKey][0].MasterTermStartTime)
} else {
// Just replace it.
hc.healthy[targetKey][0] = th
}
}
case isPrimary && !isPrimaryUp:
// No healthy master tablet
hc.healthy[targetKey] = []*TabletHealth{}
Member

@deepthi / @sougou I think this might not behave correctly in case of an unplanned failover with an external reparent. Please let me know if my thinking is correct here: #7906

Thanks!

Member Author

That makes sense. This should be fixed.

4 participants