
Tablet throttle: support "/throttle/check-self" available on all tablets #7319

Merged

Conversation

shlomi-noach
Contributor

@shlomi-noach shlomi-noach commented Jan 19, 2021

Description

We add a new throttler endpoint, /throttle/check-self, which is available on all tablets, and which tests the throttle state of the specific tablet (or of its backend MySQL server).

Originally, we only considered throttling on Primary tablets: the idea of throttling was that we would write a lot of data onto the shard's primary, and throttle based on replication lag on the shard's replicas.

@rohit-nayak-ps suggested a new use case: VReplication can create a substantial load on the source by reading too intensively. E.g., if VReplication is configured to read from a Replica, it can cause so much IO overhead that the replica would lag as a result.

Hence, we now introduce a lag-based throttle on all tablet types. VReplication will, e.g., call /throttle/check-self on the replica it reads from, to validate that it is not applying too much read load.
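As a rough illustration, a caller of the new endpoint might look like the following Go sketch. The status-code interpretation (200 means "go ahead", anything else means back off) follows the freno-style conventions this throttler is adapted from; the address and helper names here are hypothetical, not part of this PR.

```go
package main

import (
	"fmt"
	"net/http"
)

// okToProceed interprets a throttler check response. Following
// freno-style conventions, HTTP 200 means "go ahead"; any other
// status (e.g. 429 when the lag threshold is exceeded) means the
// caller should back off and retry later.
func okToProceed(statusCode int) bool {
	return statusCode == http.StatusOK
}

// checkSelf issues the check against a tablet's HTTP port.
// The tabletAddr host:port is a placeholder for a real tablet address.
func checkSelf(tabletAddr string) (bool, error) {
	resp, err := http.Get(fmt.Sprintf("http://%s/throttle/check-self", tabletAddr))
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return okToProceed(resp.StatusCode), nil
}

func main() {
	fmt.Println(okToProceed(200)) // true: proceed
	fmt.Println(okToProceed(429)) // false: back off
}
```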

I'm applying the same replication lag evaluation even on a Primary tablet. This is an interesting scenario: how do you measure the load you create on a Primary tablet? I see three possibilities:

  • measure Threads_running. I've known this to be a good indicator of load on a primary: when the server is loaded, transactions take longer to commit, and this causes a spike in concurrently running queries.
  • measure History List Length (part of InnoDB's SHOW ENGINE INNODB STATUS). A transaction that is open for a long time causes MVCC to keep more and more row versions outside the main index tree, and that in turn causes more IO. The history list length indicates how many such "stalled" versions have accumulated; I'm used to measuring it in the millions.
  • evaluate replication lag, yes, on the Primary. The thing is that we inject a heartbeat on the primary itself. That means we are able to compare the heartbeat timestamp on the primary with the current time, same as on replicas. The interesting part is that if the primary stalls on writes, then we will see an imaginary "lag" on the primary itself.

So, for now, I'm using the last of these, replication lag, as the indicator for self-throttling on primaries, same as on replicas.

Added end-to-end tests to confirm the behavior; documentation will have to be updated as well.

I will next look into how to use this functionality in VReplication.

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required TODO

Deployment Notes

Impacted Areas in Vitess

Components that this PR will affect:

  • Query Serving
  • VReplication
  • Cluster Management
  • Build
  • VTAdmin

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…ets that only checks the replication lag on the tablet's backend MySQL server

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
… keyword

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…-throttle-lag

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…-throttle-lag

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Contributor Author

Ready for review! This is a dependency for #7324

@@ -308,6 +321,37 @@ func (throttler *Throttler) createThrottlerUser(ctx context.Context) (password s
return password, nil
}

// readSelfMySQLThrottleMetric reads the mysql metric from thi very tablet's backend mysql.
Contributor


Suggested change
// readSelfMySQLThrottleMetric reads the mysql metric from thi very tablet's backend mysql.
// readSelfMySQLThrottleMetric reads the mysql metric from this very tablet's backend mysql.

@@ -71,7 +72,7 @@ var (
sqlGrantThrottlerUser = []string{
`GRANT SELECT ON _vt.heartbeat TO %s`,
}
replicationLagQuery = `select unix_timestamp(now(6))-max(ts/1000000000) from _vt.heartbeat`
replicationLagQuery = `select unix_timestamp(now(6))-max(ts/1000000000) as replication_lag from _vt.heartbeat`
Contributor


Is this query semantically guaranteed to work, or could there possibly be implementation-specific race conditions in a sluggish scenario?

At least for MySQL (https://dev.mysql.com/doc/refman/5.6/en/date-and-time-functions.html#function_now):

NOW() returns a constant time that indicates the time at which the statement began to execute.

What I don't know is whether a SELECT, outside of a transaction, with autocommit=1, will use a consistent snapshot for its read; and if so, whether that snapshot will be at the same time, or after, the value of NOW().

Suppose there is no consistent snapshot, or there is one but it comes from a timepoint later than NOW(). Then couldn't the query compute a timestamp, then stall for a while due to sluggishness, then read max(ts), see a value close enough to NOW(), and return 200 OK?

Whereas in the same scenario, if a consistent snapshot were taken at NOW() and there was lag, you would definitely see it, and not return 200 OK.

I'm just wondering whether this is guaranteed to detect sluggishness as-is, or whether you need a different query in which the timestamp is guaranteed to be computed after max(ts), e.g. something like

select max(ts/1000000000) as ts_max_heartbeat from _vt.heartbeat;
select unix_timestamp(SYSDATE(6)) as ts_now;

Here there's the risk of false negatives, since the timestamp is computed after the heartbeat read. But if things aren't sluggish and there isn't replica lag, the delay should hopefully be so minimal as to be inconsequential.

I'm far from an expert here, so I don't know if this line of reasoning is valid or not.

Contributor Author


Nice thinking! Experience with this type of query over the years (specifically at my previous employer, as this is adapted from https://github.com/github/freno/) shows that this is not an issue.

One nice thing about this mechanism is that the 1sec threshold is not an absolute restriction. It is OK if we pass the threshold (indeed, the logic pretty much ensures we will -- e.g. we will be able to write to the server when we're at 0.99sec lag, which means our own write will quite possibly push lag beyond 1sec).

So, back to the scenario you depicted: let's say a replica is lagging by 2sec and has sluggishness of 1sec, which leads us to believe the lag is 0.99sec. This means we push forward, and in the near future the lag climbs above 2sec, at which point 1sec of sluggishness is no longer enough to make us think the lag is < 1sec. So if the system remains at 1sec sluggishness, the worst case is that we lag at 2sec.

Just to illustrate how crazy time evaluations are: you could use SYSDATE() and still evaluate the wrong thing if the time is evaluated first, then sluggishness begins, then the table row is evaluated.

Having said all that, I don't have a strong opinion either way; it's just my experience that this logic hasn't caused any harm in the past.
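A toy model of the scenario above, assuming sluggishness simply makes the heartbeat reading look fresher by a fixed delay (all names hypothetical, a sketch of the reasoning rather than the actual throttler code):

```go
package main

import "fmt"

// perceivedLag models the scenario discussed above: a sluggish read
// makes the heartbeat look fresher than it is, so the observed lag is
// roughly the actual lag minus the sluggishness delay.
func perceivedLag(actualLag, sluggishness float64) float64 {
	lag := actualLag - sluggishness
	if lag < 0 {
		return 0
	}
	return lag
}

// shouldThrottle applies the lag threshold to the observed lag.
func shouldThrottle(observedLag, threshold float64) bool {
	return observedLag >= threshold
}

func main() {
	const threshold = 1.0    // default throttle threshold, in seconds
	const sluggishness = 1.0 // fixed read delay, in seconds
	// With 1s sluggishness, throttling kicks in once actual lag reaches
	// 2s: the worst-case lag is bounded by threshold + sluggishness.
	for _, actual := range []float64{1.5, 2.0, 2.5} {
		obs := perceivedLag(actual, sluggishness)
		fmt.Printf("actual=%.1fs observed=%.1fs throttle=%v\n",
			actual, obs, shouldThrottle(obs, threshold))
	}
}
```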

Contributor


Thank you, I appreciate the explanation.

The case I'm still wondering about is the primary.

In your scenario you discussed a replica, where replication lag and sluggishness are both at play. Sluggishness can cause false positives, but if replication lag gets high enough, it won't matter.

On the primary, there is no replication lag to consider, only sluggishness. If the sluggishness of the select from _vt.heartbeat equals the sluggishness of the insert into _vt.heartbeat, wouldn't they cancel out, so that there never appears to be any sluggishness? I guess the question is whether that is possible, or whether, in a high-sluggishness scenario, the sluggishness of an insert would almost always be greater than that of a select.

Anyway, I'm not concerned if you're not concerned.

Contributor Author


The case I'm still wondering about is the primary.

I see what you mean, and your point is valid. I still don't know whether sluggishness on the primary would affect the timestamp in the same way (that is, CURDATE() is computed when MySQL is ready to insert, and then the INSERT stalls).

That is to say, I don't know, I'm speculating here; and because I haven't known it to be an issue (and I've worked on sluggish systems) I assume this is not a big risk.

But point taken; let me look into it. I don't think there's any risk in converting the insert to use curdate(), in which case -- why not.

Contributor


But point taken, let me look into it, I don't think there's any risk in converting the insert to use curdate(), in which case -- why not.

Do you mean now() or systime()?

Why would you switch the insert code (which, from what I can tell, currently gets the timestamp from Go)? Wouldn't that potentially make the heartbeat less suitable for detecting sluggishness, since there's more of a chance that the timestamp would be computed close to when the insert commits, even in the face of sluggishness?

I'd guess that keeping what you have now - which has been used a lot in production already without known problems, and where in the worst case scenario it's equivalent to today's behavior which is to have no check-self at all - is probably better than introducing something we're less sure of that might introduce new false negatives.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Contributor

@rohit-nayak-ps rohit-nayak-ps left a comment


lgtm
some nits

@@ -156,19 +175,44 @@ func TestLag(t *testing.T) {
time.Sleep(2 * time.Second)
Contributor


Are these durations used in the tests (2 seconds, 10 seconds, 5 seconds) based on values configured elsewhere? Just wondering if these can become flaky if a configuration is changed elsewhere.

Contributor Author


Excellent question. There are some hard-coded intervals here:

leaderCheckInterval = 5 * time.Second
mysqlCollectInterval = 100 * time.Millisecond
mysqlDormantCollectInterval = 5 * time.Second
mysqlRefreshInterval = 10 * time.Second
mysqlAggregateInterval = 100 * time.Millisecond

Basically:

  • after a cluster starts, it takes ~5sec for the throttler to know about all replicas, collect data from those replicas, and aggregate that data -- meaning an API /throttler/check can provide a reliable answer. Hence waiting for 10sec (we could reduce that to e.g. 7s, but I feel 10sec is much safer against flakiness).
  • user apps will cache throttler results for some 250ms (e.g. see the vreplication PRs), so a 1sec sleep ensures the cache is cleared.
  • The default lag threshold is 1s, hence the 2 * time.Second sleep after StopReplication(), to make sure that when we next check for lag, there is a throttle-grade lag. Having said that, you are right that this is overridable -- so I just added
			"-throttle_threshold", "1s",

to this test's VtTabletExtraArgs to ensure the threshold is 1s.

  • The sleep for 5 * time.Second after StartReplication is just a heuristic to allow replication to catch up, and does not depend on the throttler configuration. Again, catch-up is likely to happen in less than 1s, but I feel 5s is a good guard against flakiness.
  • The sleep for 10 * time.Second after ChangeTabletType is because it takes that amount of time for the throttler to identify the new roster. It's hard-coded in mysqlRefreshInterval = 10 * time.Second, and I've now actually upped the test to sleep for 12 * time.Second to avoid flakiness.

I've moreover now made these numbers constants in the test. The waits are now named, and it's clearer what each wait means.

Contributor Author


new code:

const (
	throttlerInitWait        = 10 * time.Second
	accumulateLagWait        = 2 * time.Second
	mysqlRefreshIntervalWait = 12 * time.Second
	replicationCatchUpWait   = 5 * time.Second
)

...

		time.Sleep(mysqlRefreshIntervalWait)

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach shlomi-noach merged commit 1514987 into vitessio:master Jan 26, 2021
@shlomi-noach shlomi-noach deleted the throttler-replica-throttle-lag branch January 26, 2021 07:34
@askdba askdba added this to the v9.0 milestone Jan 26, 2021