-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Throttler: Expose Tablet's Config & Leverage to Deflake Tests #12737
Conversation
For example, we stop replication, wait a few seconds, then expect there to be lag. But vtorc could repair replication during that wait and then the lag is gone. Signed-off-by: Matt Lord <mattalord@gmail.com>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
If a new flag is being introduced:
If a workflow is added or modified:
Bug fixes
Non-trivial changes
New/Existing features
Backward compatibility
|
Signed-off-by: Matt Lord <mattalord@gmail.com>
becac96
to
4ebc2fe
Compare
Signed-off-by: Matt Lord <mattalord@gmail.com>
4ebc2fe
to
d303444
Compare
go/test/endtoend/tabletmanager/throttler_topo/throttler_test.go
Outdated
Show resolved
Hide resolved
e20fc72
to
73420fa
Compare
Signed-off-by: Matt Lord <mattalord@gmail.com>
73420fa
to
c623dc9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thank you @mattlord !
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
de6ddce
to
f7303ba
Compare
Which seemed to revolve around NOT sleeping long enough after starting all the sleep queries. Signed-off-by: Matt Lord <mattalord@gmail.com>
f7303ba
to
3a793f8
Compare
Signed-off-by: Matt Lord <mattalord@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's been some changes since my earlier review, so I dismissed it and am submitting this new review. A couple of those changes are a bit unclear to me and I'm grateful if you could answer my inline questions.
t.Run("enabling throttler with low threshold", func(t *testing.T) { | ||
_, err := onlineddl.UpdateThrottlerTopoConfig(clusterInstance, true, false, unreasonablyLowThreshold.Seconds(), "", false) | ||
t.Run("enabling throttler with very low threshold", func(t *testing.T) { | ||
_, err := throttler.UpdateThrottlerTopoConfig(clusterInstance, true, false, unreasonablyLowThreshold.Seconds(), useDefaultQuery, false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
// Wait for the throttler to be enabled everywhere with the new config. | ||
for _, tablet := range clusterInstance.Keyspaces[0].Shards[0].Vttablets { | ||
throttler.WaitForThrottlerStatusEnabled(t, tablet, true, &throttler.Config{Query: throttler.DefaultQuery, Threshold: unreasonablyLowThreshold.Seconds()}, throttlerEnabledTimeout) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
// Wait for the throttler to be disabled everywhere. | ||
for _, tablet := range clusterInstance.Keyspaces[0].Shards[0].Vttablets { | ||
throttler.WaitForThrottlerStatusEnabled(t, tablet, false, nil, throttlerEnabledTimeout) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
go/test/endtoend/tabletmanager/throttler_topo/throttler_test.go
Outdated
Show resolved
Hide resolved
Signed-off-by: Matt Lord <mattalord@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And adjust timing Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
@@ -84,7 +84,7 @@ func registerThrottlerFlags(fs *pflag.FlagSet) { | |||
|
|||
fs.DurationVar(&throttleThreshold, "throttle_threshold", throttleThreshold, "Replication lag threshold for default lag throttling") | |||
fs.StringVar(&throttleMetricQuery, "throttle_metrics_query", throttleMetricQuery, "Override default heartbeat/lag metric. Use either `SELECT` (must return single row, single value) or `SHOW GLOBAL ... LIKE ...` queries. Set -throttle_metrics_threshold respectively.") | |||
fs.Float64Var(&throttleMetricThreshold, "throttle_metrics_threshold", throttleMetricThreshold, "Override default throttle threshold, respective to -throttle_metrics_query") | |||
fs.Float64Var(&throttleMetricThreshold, "throttle_metrics_threshold", throttleMetricThreshold, "Override default throttle threshold, respective to --throttle_metrics_query") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not actually relevant to the changes in this PR - what does respective to --throttle_metrics_query
mean?
maybe a question more for @shlomi-noach
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It comes hand-in-hand with --throttle_metrics_query
. If you define a --throttle_metrics_query
, then you should also say what's the --throttle_metrics_threshold
at which the throttle will engage. Not sure how to phrase it otherwise.
This is because enabling heartbeats with --heartbeat_enable also results in the replication reporter being enabled: https://github.com/vitessio/vitess/blob/3d9ef871e42bd20a60ec95997c97ecf0694c1e78/go/vt/vttablet/tabletserver/tabletenv/config.go#L235-L237 Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
I was unable to backport this Pull Request to the following branches: |
…io#12737) * Flakes: effectively disable vtorc for deterministic behavior For example, we stop replication, wait a few seconds, then expect there to be lag. But vtorc could repair replication during that wait and then the lag is gone. Signed-off-by: Matt Lord <mattalord@gmail.com> * Wait for the throttler to be up and running everywhere Signed-off-by: Matt Lord <mattalord@gmail.com> * Expose tablet's throttler config and leverage to deflake tests Signed-off-by: Matt Lord <mattalord@gmail.com> * Apply various corrections Signed-off-by: Matt Lord <mattalord@gmail.com> * Be more explicit about VTOrc behavior changes Signed-off-by: Matt Lord <mattalord@gmail.com> * Note received throttler response when it is unexpected Signed-off-by: Matt Lord <mattalord@gmail.com> * Fixes from local testing Signed-off-by: Matt Lord <mattalord@gmail.com> * Nits from self review Signed-off-by: Matt Lord <mattalord@gmail.com> * Use assert.Equalf on failed assertions Signed-off-by: Matt Lord <mattalord@gmail.com> * Ummm, duh. Signed-off-by: Matt Lord <mattalord@gmail.com> * Try to get rid of last bit of flakiness Which seemed to revolve around NOT sleeping long enough after starting all the sleep queries. Signed-off-by: Matt Lord <mattalord@gmail.com> * Nits from self review Signed-off-by: Matt Lord <mattalord@gmail.com> * Address review comments Signed-off-by: Matt Lord <mattalord@gmail.com> * Adjust test for behavior and comment it And adjust timing Signed-off-by: Matt Lord <mattalord@gmail.com> * Align both stale hearbeat checks Signed-off-by: Matt Lord <mattalord@gmail.com> * Remove no longer needed flag This is because enabling heartbeats with --heartbeat_enable also results in the replication reporter being enabled: https://github.com/vitessio/vitess/blob/3d9ef871e42bd20a60ec95997c97ecf0694c1e78/go/vt/vttablet/tabletserver/tabletenv/config.go#L235-L237 Signed-off-by: Matt Lord <mattalord@gmail.com> * Correct comment Signed-off-by: Matt Lord <mattalord@gmail.com> * Correct comment part II: electric boogaloo Signed-off-by: Matt Lord <mattalord@gmail.com> * Revert one other minor unnecessary change. Signed-off-by: Matt Lord <mattalord@gmail.com> --------- Signed-off-by: Matt Lord <mattalord@gmail.com>
…e Tests (#12791) * Throttler: Expose Tablet's Config & Leverage to Deflake Tests (#12737) * Flakes: effectively disable vtorc for deterministic behavior For example, we stop replication, wait a few seconds, then expect there to be lag. But vtorc could repair replication during that wait and then the lag is gone. Signed-off-by: Matt Lord <mattalord@gmail.com> * Wait for the throttler to be up and running everywhere Signed-off-by: Matt Lord <mattalord@gmail.com> * Expose tablet's throttler config and leverage to deflake tests Signed-off-by: Matt Lord <mattalord@gmail.com> * Apply various corrections Signed-off-by: Matt Lord <mattalord@gmail.com> * Be more explicit about VTOrc behavior changes Signed-off-by: Matt Lord <mattalord@gmail.com> * Note received throttler response when it is unexpected Signed-off-by: Matt Lord <mattalord@gmail.com> * Fixes from local testing Signed-off-by: Matt Lord <mattalord@gmail.com> * Nits from self review Signed-off-by: Matt Lord <mattalord@gmail.com> * Use assert.Equalf on failed assertions Signed-off-by: Matt Lord <mattalord@gmail.com> * Ummm, duh. Signed-off-by: Matt Lord <mattalord@gmail.com> * Try to get rid of last bit of flakiness Which seemed to revolve around NOT sleeping long enough after starting all the sleep queries. Signed-off-by: Matt Lord <mattalord@gmail.com> * Nits from self review Signed-off-by: Matt Lord <mattalord@gmail.com> * Address review comments Signed-off-by: Matt Lord <mattalord@gmail.com> * Adjust test for behavior and comment it And adjust timing Signed-off-by: Matt Lord <mattalord@gmail.com> * Align both stale hearbeat checks Signed-off-by: Matt Lord <mattalord@gmail.com> * Remove no longer needed flag This is because enabling heartbeats with --heartbeat_enable also results in the replication reporter being enabled: https://github.com/vitessio/vitess/blob/3d9ef871e42bd20a60ec95997c97ecf0694c1e78/go/vt/vttablet/tabletserver/tabletenv/config.go#L235-L237 Signed-off-by: Matt Lord <mattalord@gmail.com> * Correct comment Signed-off-by: Matt Lord <mattalord@gmail.com> * Correct comment part II: electric boogaloo Signed-off-by: Matt Lord <mattalord@gmail.com> * Revert one other minor unnecessary change. Signed-off-by: Matt Lord <mattalord@gmail.com> --------- Signed-off-by: Matt Lord <mattalord@gmail.com> * Post cherry-pick fixup Signed-off-by: Matt Lord <mattalord@gmail.com> --------- Signed-off-by: Matt Lord <mattalord@gmail.com>
Description
The primary thing that this PR does is expose the tablet's current throttler config — query and threshold — via its
/throttler/status
http endpoint. This allows operators and test creators to confirm and wait for specific configurations to be live on all tablets.The secondary thing that this PR does is disable
vtorc
recoveries in the throttler tests when needed (may be other reasons as yet unknown or that unintentionally arise later) because we stop replication, wait a few seconds, then expect there to be lag. Butvtorc
could repair replication during that wait and then the lag is gone. We had previously explicitly disabled the legacy tablet repair by using the--disable_active_reparents
flag, now in this PR we also disable anyvtorc
repairs.Finally, a minor thing that this PR does is enable the tablet throttler configuration via the topo in the local examples. This does not alter the default behavior, but it allows users to test out this new feature without having to modify any files or restart any processes. This will help with local testing of the new feature as we move it towards GA (currently experimental).
The relevant
throttler_topo
workflow now passed 15 times in a row: https://github.com/vitessio/vitess/actions/runs/4557359892?pr=12737The failures now seem due to some odd/unexpected behavior of the throttler itself during test runs. We'll track any bugs down over time -- which will now be much easier with the updated tests.
Related Issue(s)
Checklist