Bug Report: Exploding Go-routines in VTOrc #15511

Closed
GuptaManan100 opened this issue Mar 18, 2024 · 0 comments · Fixed by #15580
Labels
Component: VTorc Vitess Orchestrator integration Type: Bug


Overview of the Issue

If a tablet is unreachable, VTOrc keeps running into the UnreachablePrimary analysis, and its FullStatus calls keep timing out because the tablet cannot be reached.

For UnreachablePrimary failure types, we run runEmergentOperations. We do this without acquiring a topo lock because all this function does is reload the tablet information in a fast path. It spawns a goroutine to reload that tablet information: go emergentlyReadTopologyInstance(analysisEntry.AnalyzedInstanceAlias, analysisEntry.Analysis)

This is problematic because we start a new goroutine to reload the tablet information every second. Each of these goroutines tries to run the FullStatus RPC, so the number of goroutines explodes, which can cause VTOrc to OOM.

The goroutine count does reach a steady state eventually: even though we create a new goroutine every second, the goroutines spawned 15 seconds earlier finish, so the count levels off. In my testing it looked something like this:

55 @ 0x104223058 0x104259ccc 0x104d25b0c 0x10425d4f4
#	0x104259ccb	time.Sleep+0xfb								runtime/time.go:195
#	0x104d25b0b	vitess.io/vitess/go/vt/vtorc/logic.emergentlyReadTopologyInstance+0x9b	vitess.io/vitess/go/vt/vtorc/logic/topology_recovery.go:322
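
The steady-state argument can be checked with a small deterministic simulation. This is only a sketch under the figures quoted in this issue (one spawn per second, each goroutine finishing after 15 seconds); `peakInFlight` is a hypothetical helper, not VTOrc code, and the count seen in a real run depends on how long each goroutine actually lives.

```go
package main

import "fmt"

// peakInFlight simulates spawning one goroutine every `interval` seconds,
// each of which finishes `lifetime` seconds after it starts, and returns the
// largest number alive at any whole second in [0, duration).
// All arguments are in seconds; interval must be greater than zero.
func peakInFlight(duration, interval, lifetime int) int {
	peak := 0
	for now := 0; now < duration; now++ {
		alive := 0
		// Count every goroutine started at or before `now` that has not finished yet.
		for start := 0; start <= now; start += interval {
			if now < start+lifetime {
				alive++
			}
		}
		if alive > peak {
			peak = alive
		}
	}
	return peak
}

func main() {
	// One spawn per second, each living 15 seconds, observed for a minute.
	fmt.Println(peakInFlight(60, 1, 15)) // prints 15
}
```

The peak is roughly the lifetime divided by the spawn interval, so a slower FullStatus or a longer timeout raises the plateau proportionally.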

This is still not the desired behaviour: we end up with many goroutines all trying to call FullStatus, which also increases network traffic.
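
One possible way to bound this, shown purely as an illustration and not necessarily what the linked fix does, is to skip a reload when one is already in flight for the same tablet alias. The `inFlight` type, its methods, and the alias string below are all hypothetical names introduced for this sketch; the blocking closure stands in for a FullStatus call that has not yet timed out.

```go
package main

import (
	"fmt"
	"sync"
)

// inFlight tracks tablet aliases that already have a reload running.
// Hypothetical helper for illustration; not part of VTOrc.
type inFlight struct {
	mu      sync.Mutex
	running map[string]struct{}
}

func newInFlight() *inFlight {
	return &inFlight{running: make(map[string]struct{})}
}

// tryStart marks alias as running and reports true, or reports false if a
// reload for that alias is already in progress.
func (f *inFlight) tryStart(alias string) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	if _, ok := f.running[alias]; ok {
		return false
	}
	f.running[alias] = struct{}{}
	return true
}

// done clears the running mark for alias.
func (f *inFlight) done(alias string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	delete(f.running, alias)
}

// reloadOnce starts reload in a new goroutine only when no reload for the
// same alias is in flight. It reports whether a goroutine was started.
func (f *inFlight) reloadOnce(alias string, reload func(string)) bool {
	if !f.tryStart(alias) {
		return false
	}
	go func() {
		defer f.done(alias)
		reload(alias)
	}()
	return true
}

func main() {
	guard := newInFlight()
	release := make(chan struct{})
	started := 0
	// Five analysis passes see the same unreachable tablet while the first
	// reload is still blocked, standing in for a slow FullStatus call.
	for i := 0; i < 5; i++ {
		if guard.reloadOnce("zone1-0000000100", func(string) { <-release }) {
			started++
		}
	}
	close(release)
	fmt.Println("reloads started:", started) // prints "reloads started: 1"
}
```

With a guard like this, the number of goroutines per unreachable tablet stays at one regardless of how long each FullStatus call takes, at the cost of not retrying until the previous attempt returns.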

Reproduction Steps

In the testing framework, I was able to reproduce this by making FullStatus slow with a time.Sleep and then making the primary tablet unreachable.

Binary Version

main

Operating System and Environment details

-

Log Fragments

No response
