Throttler: Expose Tablet's Config & Leverage to Deflake Tests #12737

Merged
mattlord merged 19 commits into vitessio:main from throttler_vtorc_flakes on Mar 30, 2023

@mattlord mattlord commented Mar 27, 2023

Description

The primary change in this PR is to expose the tablet's current throttler config (query and threshold) via its /throttler/status HTTP endpoint. This allows operators and test authors to confirm, and wait for, specific configurations to be live on all tablets.

The secondary change is to disable vtorc recoveries in the throttler tests where needed. For example, we stop replication, wait a few seconds, then expect there to be lag, but vtorc could repair replication during that wait and then the lag is gone. (There may be other reasons, as yet unknown or that unintentionally arise later.) We had previously explicitly disabled the legacy tablet repair by using the --disable_active_reparents flag; this PR also disables any vtorc repairs.

Finally, a minor thing that this PR does is enable the tablet throttler configuration via the topo in the local examples. This does not alter the default behavior, but it allows users to test out this new feature without having to modify any files or restart any processes. This will help with local testing of the new feature as we move it towards GA (currently experimental).

The relevant throttler_topo workflow has now passed 15 times in a row: https://github.com/vitessio/vitess/actions/runs/4557359892?pr=12737

The failures now seem to be due to some odd or unexpected behavior of the throttler itself during test runs. We'll track any bugs down over time, which will now be much easier with the updated tests.

Related Issue(s)

Checklist

For example, we stop replication, wait a few seconds, then expect
there to be lag. But vtorc could repair replication during that
wait and then the lag is gone.

Signed-off-by: Matt Lord <mattalord@gmail.com>
@vitess-bot added the NeedsDescriptionUpdate and NeedsWebsiteDocsUpdate labels Mar 27, 2023
vitess-bot bot commented Mar 27, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a test is added or modified, there should be documentation on top of the test explaining the expected behavior and what the test does.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

@mattlord added the Flakes, Component: Build/CI, and Type: Testing labels and removed the NeedsDescriptionUpdate and NeedsWebsiteDocsUpdate labels Mar 27, 2023
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord changed the title from "Flakes: effectively disable vtorc for deterministic behavior in throttler e2e tests" to "Throttler: Expose Tablet's Config & Leverage to Deflake Tests" Mar 28, 2023
@mattlord force-pushed the throttler_vtorc_flakes branch 7 times, most recently from becac96 to 4ebc2fe March 28, 2023 04:54
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
@shlomi-noach shlomi-noach left a comment

Looks good, thank you @mattlord !

@shlomi-noach shlomi-noach marked this pull request as ready for review March 28, 2023 06:33
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord marked this pull request as ready for review March 28, 2023 19:28
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Which seemed to revolve around NOT sleeping long enough
after starting all the sleep queries.

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
@shlomi-noach shlomi-noach self-requested a review March 29, 2023 10:17
@shlomi-noach shlomi-noach left a comment

There have been some changes since my earlier review, so I dismissed it and am submitting this new review. A couple of those changes are a bit unclear to me, and I'd be grateful if you could answer my inline questions.

-	t.Run("enabling throttler with low threshold", func(t *testing.T) {
-		_, err := onlineddl.UpdateThrottlerTopoConfig(clusterInstance, true, false, unreasonablyLowThreshold.Seconds(), "", false)
+	t.Run("enabling throttler with very low threshold", func(t *testing.T) {
+		_, err := throttler.UpdateThrottlerTopoConfig(clusterInstance, true, false, unreasonablyLowThreshold.Seconds(), useDefaultQuery, false)

👍

// Wait for the throttler to be enabled everywhere with the new config.
for _, tablet := range clusterInstance.Keyspaces[0].Shards[0].Vttablets {
	throttler.WaitForThrottlerStatusEnabled(t, tablet, true, &throttler.Config{Query: throttler.DefaultQuery, Threshold: unreasonablyLowThreshold.Seconds()}, throttlerEnabledTimeout)
}

👍

// Wait for the throttler to be disabled everywhere.
for _, tablet := range clusterInstance.Keyspaces[0].Shards[0].Vttablets {
	throttler.WaitForThrottlerStatusEnabled(t, tablet, false, nil, throttlerEnabledTimeout)
}

👍
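
The two wait loops above pass either a concrete config or nil. One way to read that, sketched below in Go, is that a nil config means "only the enabled flag matters". The types and function names here (Config, Status, statusMatches, firstMismatch) are hypothetical stand-ins written for this example and are not the helper's actual implementation.

```go
package main

import "fmt"

// Config mirrors the two values the PR exposes: query and threshold.
type Config struct {
	Query     string
	Threshold float64
}

// Status is a hypothetical view of one tablet's reported throttler state.
type Status struct {
	Alias   string
	Enabled bool
	Config  Config
}

// statusMatches reports whether got satisfies the expectation.
// A nil want means the config is not checked, only the enabled flag.
func statusMatches(got Status, enabled bool, want *Config) bool {
	if got.Enabled != enabled {
		return false
	}
	return want == nil || got.Config == *want
}

// firstMismatch returns the alias of the first tablet that does not satisfy
// the expectation, and false if every tablet matches.
func firstMismatch(statuses []Status, enabled bool, want *Config) (string, bool) {
	for _, s := range statuses {
		if !statusMatches(s, enabled, want) {
			return s.Alias, true
		}
	}
	return "", false
}

func main() {
	want := &Config{Query: "select 1", Threshold: 0.001}
	statuses := []Status{
		{Alias: "zone1-100", Enabled: true, Config: *want},
		{Alias: "zone1-101", Enabled: true, Config: Config{Query: "select 1", Threshold: 1}},
	}
	if alias, found := firstMismatch(statuses, true, want); found {
		fmt.Println("still waiting on", alias)
	}
}
```

A test would call a check like this repeatedly until no mismatch remains or a timeout expires.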

Signed-off-by: Matt Lord <mattalord@gmail.com>
@shlomi-noach shlomi-noach left a comment

:shipit:

And adjust timing

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
examples/common/scripts/vttablet-up.sh (conversation resolved)
@@ -84,7 +84,7 @@ func registerThrottlerFlags(fs *pflag.FlagSet) {

 	fs.DurationVar(&throttleThreshold, "throttle_threshold", throttleThreshold, "Replication lag threshold for default lag throttling")
 	fs.StringVar(&throttleMetricQuery, "throttle_metrics_query", throttleMetricQuery, "Override default heartbeat/lag metric. Use either `SELECT` (must return single row, single value) or `SHOW GLOBAL ... LIKE ...` queries. Set -throttle_metrics_threshold respectively.")
-	fs.Float64Var(&throttleMetricThreshold, "throttle_metrics_threshold", throttleMetricThreshold, "Override default throttle threshold, respective to -throttle_metrics_query")
+	fs.Float64Var(&throttleMetricThreshold, "throttle_metrics_threshold", throttleMetricThreshold, "Override default throttle threshold, respective to --throttle_metrics_query")

not actually relevant to the changes in this PR - what does respective to --throttle_metrics_query mean?
maybe a question more for @shlomi-noach

It comes hand-in-hand with --throttle_metrics_query. If you define a --throttle_metrics_query, then you should also say what's the --throttle_metrics_threshold at which the throttle will engage. Not sure how to phrase it otherwise.
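
To make the "hand-in-hand" relationship concrete, here is a minimal Go sketch that treats the two flags as a pair: if a custom metrics query is given, a threshold must be given too. It uses the standard library flag package rather than pflag so that it is self-contained, and the validation rule itself is an assumption for illustration; the PR does not say that vttablet enforces it.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// throttleFlags holds the two related settings discussed above.
type throttleFlags struct {
	metricsQuery     string
	metricsThreshold float64
}

// parseThrottleFlags parses args and rejects a custom query that comes
// without an explicit threshold. This pairing check is hypothetical.
func parseThrottleFlags(args []string) (throttleFlags, error) {
	var cfg throttleFlags
	fs := flag.NewFlagSet("throttle-sketch", flag.ContinueOnError)
	fs.StringVar(&cfg.metricsQuery, "throttle_metrics_query", "", "custom metric query")
	fs.Float64Var(&cfg.metricsThreshold, "throttle_metrics_threshold", 0, "threshold for the custom metric")
	if err := fs.Parse(args); err != nil {
		return cfg, err
	}

	// Visit only walks flags that were explicitly set on the command line.
	thresholdSet := false
	fs.Visit(func(f *flag.Flag) {
		if f.Name == "throttle_metrics_threshold" {
			thresholdSet = true
		}
	})
	if cfg.metricsQuery != "" && !thresholdSet {
		return cfg, errors.New("--throttle_metrics_query is set, so --throttle_metrics_threshold must be set too")
	}
	return cfg, nil
}

func main() {
	_, err := parseThrottleFlags([]string{"--throttle_metrics_query=select 1"})
	fmt.Println("query without threshold:", err)

	cfg, err := parseThrottleFlags([]string{
		"--throttle_metrics_query=select 1",
		"--throttle_metrics_threshold=5",
	})
	fmt.Println("query with threshold:", cfg.metricsThreshold, err)
}
```

Checking whether the flag was explicitly set, rather than comparing against zero, keeps a deliberate threshold of 0 valid.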

This is because enabling heartbeats with --heartbeat_enable
also results in the replication reporter being enabled:
https://github.com/vitessio/vitess/blob/3d9ef871e42bd20a60ec95997c97ecf0694c1e78/go/vt/vttablet/tabletserver/tabletenv/config.go#L235-L237

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord merged commit 1a0c3fe into vitessio:main Mar 30, 2023
112 checks passed
@mattlord mattlord deleted the throttler_vtorc_flakes branch March 30, 2023 17:53
vitess-bot bot commented Mar 30, 2023

I was unable to backport this Pull Request to the following branches: release-16.0.

mattlord added a commit to planetscale/vitess that referenced this pull request Mar 30, 2023
…io#12737)

* Flakes: effectively disable vtorc for deterministic behavior

For example, we stop replication, wait a few seconds, then expect
there to be lag. But vtorc could repair replication during that
wait and then the lag is gone.

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Wait for the throttler to be up and running everywhere

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Expose tablet's throttler config and leverage to deflake tests

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Apply various corrections

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Be more explicit about VTOrc behavior changes

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Note received throttler response when it is unexpected

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Fixes from local testing

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Nits from self review

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Use assert.Equalf on failed assertions

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Ummm, duh.

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Try to get rid of last bit of flakiness

Which seemed to revolve around NOT sleeping long enough
after starting all the sleep queries.

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Nits from self review

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Address review comments

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Adjust test for behavior and comment it

And adjust timing

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Align both stale heartbeat checks

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Remove no longer needed flag

This is because enabling heartbeats with --heartbeat_enable
also results in the replication reporter being enabled:
https://github.com/vitessio/vitess/blob/3d9ef871e42bd20a60ec95997c97ecf0694c1e78/go/vt/vttablet/tabletserver/tabletenv/config.go#L235-L237

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Correct comment

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Correct comment part II: electric boogaloo

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Revert one other minor unnecessary change.

Signed-off-by: Matt Lord <mattalord@gmail.com>

---------

Signed-off-by: Matt Lord <mattalord@gmail.com>
rohit-nayak-ps pushed a commit that referenced this pull request Mar 31, 2023
…e Tests (#12791)

* Throttler: Expose Tablet's Config & Leverage to Deflake Tests (#12737)

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Post cherry-pick fixup

Signed-off-by: Matt Lord <mattalord@gmail.com>

---------

Signed-off-by: Matt Lord <mattalord@gmail.com>
@maxenglander maxenglander mentioned this pull request Jan 18, 2024
Labels
Component: Build/CI, Flakes, Type: Enhancement, Type: Testing

3 participants