Feature Request: allow VTOrc to start with recoveries disabled #18007

Open
timvaillancourt opened this issue Mar 22, 2025 · 2 comments · May be fixed by #18005
Labels
Component: VTorc Vitess Orchestrator integration Type: Feature

Comments

@timvaillancourt
Contributor

Feature Description

As a user of VTOrc, I would like to be able to start VTOrc with all recoveries disabled. It's currently possible to do this using the HTTP API, but there is a short period during which recoveries are enabled, between the time VTOrc starts up and the time this API is called (and the call needs to be made per instance)

This issue proposes adding a new flag, --allow-recovery, so that VTOrc can start with recoveries already disabled
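As a rough illustration of the idea (not the implementation in the linked PR), the sketch below parses a boolean `--allow-recovery` flag with Go's standard `flag` package and sets a gate before any recovery logic could run. The `recoveryGate` type, the `newGateFromArgs` and `attempt` functions, the default of `true`, and the message wording are all hypothetical; `DeadPrimary` is used only as a sample problem name.

```go
package main

import (
	"flag"
	"fmt"
	"sync/atomic"
)

// recoveryGate is a hypothetical holder for the global recovery switch.
// It is not VTOrc's actual type; it only illustrates the idea.
type recoveryGate struct {
	allowed atomic.Bool
}

// newGateFromArgs parses a hypothetical --allow-recovery flag (assumed to
// default to true, i.e. today's behaviour) and returns a gate whose state
// is set before any recovery code can observe it.
func newGateFromArgs(args []string) (*recoveryGate, error) {
	fs := flag.NewFlagSet("vtorc-sketch", flag.ContinueOnError)
	allow := fs.Bool("allow-recovery", true, "allow recoveries at startup")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	g := &recoveryGate{}
	g.allowed.Store(*allow)
	return g, nil
}

// attempt returns a description of what the gate decided for a problem.
func (g *recoveryGate) attempt(problem string) string {
	if !g.allowed.Load() {
		return "skipped (recoveries disabled): would have recovered " + problem
	}
	return "recovering " + problem
}

func main() {
	g, err := newGateFromArgs([]string{"--allow-recovery=false"})
	if err != nil {
		panic(err)
	}
	fmt.Println(g.attempt("DeadPrimary"))
}
```

Because the gate is populated during flag parsing, there is no window between process start and the first API call in which recoveries could run.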

Use Case(s)

A user that would like no VTOrc recoveries to happen

We run a patch that adds this functionality, and it was instrumental in the rollout of VTOrc in Slack's production. This feature allowed us to validate (and later optimize) the topo and discovery performance in advance of switching over to VTOrc. This "dry-run-like" mode also allowed us to gain confidence in what VTOrc would do when enabled for the first time on a keyspace

@timvaillancourt timvaillancourt added the Needs Triage This issue needs to be correctly labelled and triaged label Mar 22, 2025
@timvaillancourt timvaillancourt linked a pull request Mar 22, 2025 that will close this issue
5 tasks
@deepthi
Member

deepthi commented Mar 23, 2025

I'm curious to hear whether other people see the need for this. How exactly did you use the dry-run-like mode?

@timvaillancourt
Contributor Author

timvaillancourt commented Mar 24, 2025

@deepthi I think this is mostly useful for initial migration to VTOrc, for performance tuning an existing install or for setting up a new keyspace/cluster

In our case, the "dry-run" (or perhaps "discover-only") mode was used to:

  1. Understand what problems VTOrc would fix (if it were made active) while our old solution remained in charge. Using the logs, this helped us gain confidence in what existing problems would be solved when VTOrc took over, what volume of reparents might happen all at once when it starts (potentially risking scatters and/or the topo), etc
    • VTOrc logs explain what VTOrc "would have done" when recoveries are disabled
  2. Verify the topo and discovery performance before VTOrc takes over, since it could otherwise struggle with the workload. When we first cut a large keyspace over to VTOrc, before the v22 optimizations (and this functionality), the instance struggled significantly and would have taken a long time to respond to unplanned events; we also intentionally avoided having two systems active at once
    • During the "dry-run" style tuning, I used metrics emitted by VTOrc to measure performance wins; the most important were DiscoveriesInstancePollSecondsExceeded, queue sizes and CPU capacity
    • Another use case: I plan to run a dedicated VTOrc for further performance tuning, pointed at the same --clusters_to_watch as our busiest VTOrc pool/group, but with recoveries disabled. This will allow us to compare the impact of future fixes
    • I don't expect many VTOrc users to submit performance fixes, but they may want to tune the existing flags and/or # of tablets in advance of cutover

@frouioui frouioui added Type: Feature Component: VTorc Vitess Orchestrator integration and removed Needs Triage This issue needs to be correctly labelled and triaged labels Mar 24, 2025