Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

healthcheck: attempt to update primary only if the current tablet is serving #8121

Merged
merged 1 commit into from
May 14, 2021

Conversation

deepthi
Copy link
Member

@deepthi deepthi commented May 13, 2021

Description

Fixes #8120

Related Issue(s)

Checklist

  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

…serving

Signed-off-by: deepthi <deepthi@planetscale.com>
@deepthi deepthi requested a review from 5antelope May 13, 2021 22:58
@deepthi
Copy link
Member Author

deepthi commented May 13, 2021

Before logs:

I0513 16:12:29.116383   29038 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: zone1-102 serving false => true for commerce/0 (RDONLY) reason: healthCheck update
I0513 16:12:51.023130   29038 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: zone1-100 serving true => false for commerce/0 (MASTER) reason: healthCheck update
E0513 16:12:51.465923   29038 healthcheck.go:478] Adding 1 to MasterPromoted counter for target: keyspace:"commerce" shard:"0" tablet_type:REPLICA , tablet: zone1-0000000101, tabletType: MASTER
W0513 16:12:56.023628   29038 healthcheck.go:441] not marking healthy master zone1-0000000100 as Up for commerce/0 because its MasterTermStartTime is smaller than the highest known timestamp from previous MASTERs zone1-0000000101: 1620947548 < 1620947571 
I0513 16:13:01.480794   29038 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: zone1-100 serving false => true for commerce/0 (REPLICA) reason: healthCheck update

After logs:

I0513 15:59:34.332740   25307 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: zone1-102 serving false => true for commerce/0 (RDONLY) reason: healthCheck update
I0513 16:07:51.464763   25307 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: zone1-100 serving true => false for commerce/0 (MASTER) reason: healthCheck update
E0513 16:07:52.192940   25307 healthcheck.go:478] Adding 1 to MasterPromoted counter for target: keyspace:"commerce" shard:"0" tablet_type:REPLICA , tablet: zone1-0000000101, tabletType: MASTER
I0513 16:08:02.202323   25307 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: zone1-100 serving false => true for commerce/0 (REPLICA) reason: healthCheck update

Copy link
Member

@5antelope 5antelope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@@ -200,7 +200,7 @@ func (thc *tabletHealthCheck) processResponse(hc *HealthCheckImpl, shr *query.St
thc.setServingState(serving, reason)

// notify downstream for master change
hc.updateHealth(thc.SimpleCopy(), prevTarget, trivialUpdate, true)
hc.updateHealth(thc.SimpleCopy(), prevTarget, trivialUpdate, thc.Serving)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: thc.Serving should not be false in practice (?) might want to add a warning log here when we change master to a tablet that is still down.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be true or false, and this gets called for all tablet types. It is just that we only do anything with that value if the tablet_type is MASTER.
The healthy list update code inside the call to updateHealth is pretty solid, I think. It handles the true/false cases correctly. It will never replace with a down master.
Does this address your concern?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah got it. so this code actually "incorrectly" always claim isPrimaryUp=true in health update
this looks good

@systay systay added Component: Query Serving Type: Enhancement Logical improvement (somewhere between a bug and feature) Type: Bug and removed Type: Enhancement Logical improvement (somewhere between a bug and feature) labels May 14, 2021
@deepthi deepthi merged commit 3539be7 into vitessio:master May 14, 2021
@deepthi deepthi deleted the ds-hc-fix-8120 branch May 14, 2021 22:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Spurious warning from vtgate healthcheck
3 participants