Skip to content

fix: recover debug-gather goroutine panics and free run state on stop#93

Merged
joshiste merged 1 commit into
mainfrom
fix/debug-goroutine-recover-and-leak
Jun 26, 2026
Merged

fix: recover debug-gather goroutine panics and free run state on stop#93
joshiste merged 1 commit into
mainfrom
fix/debug-goroutine-recover-and-leak

Conversation

@joshiste

Copy link
Copy Markdown
Member

Addresses the extension-debug audit MAJOR findings.

1. Unrecovered goroutine panic crashes the whole extension

Start spawns a detached goroutine running RunSteadybitDebug(...) with no panic recovery. A panic there is fatal to the entire extension process, taking down every other in-flight action it is serving.

Fix: recover in the goroutine, log the failure, and store DebugRun{Finished: true} so Status completes (instead of polling forever) — Status already handles a missing result zip gracefully (empty artifacts).

2. debugRuns map-entry leak

Entries are stored in Prepare/Start but never deleted, so the sync.Map grows by one entry per execution for the process lifetime.

Fix: debugRuns.Delete(executionId) in Stop.

Testing

  • go build ./..., go vet, gofmt clean.

/simplify

3-agent pass (efficiency N/A for a recover+delete): all clean — no safe-go/recover helper exists to reuse (exthttp.PanicRecovery is HTTP-handler-only), the two Store paths are mutually exclusive, and both fixes are at the right altitude.

Follow-up (deferred, single root cause)

RunSteadybitDebug takes no context, so two residuals remain and are best fixed together by making it cancellable + awaited in Stop before os.RemoveAll:

  • if the gather goroutine finishes after Stop, its debugRuns.Store re-inserts an entry that won't be deleted (re-leak for the aborted/slow case);
  • Stop's os.RemoveAll(WorkingDir) can race the still-running goroutine writing into that dir.

Neither is a regression (the goroutine already wrote to WorkingDir); they need a cancellation redesign of RunSteadybitDebug. Happy to do that as a follow-up.

The debug action spawns a detached goroutine that runs RunSteadybitDebug
with no panic recovery; a panic there crashed the whole extension
process, taking down every other in-flight action. Recover in the
goroutine, log the failure, and mark the run finished so Status
completes instead of polling forever.

Also delete the per-execution entry from the debugRuns map on stop;
entries were stored in Prepare/Start but never removed, leaking one
entry per execution for the lifetime of the process.
@claude

claude Bot commented Jun 26, 2026

Copy link
Copy Markdown

Claude finished @joshiste's task in 1m 4s —— View job


Review

  • Gather context
  • Check for unresolved prior review threads (none found)
  • Read changed files
  • Review PR changes

extdebug/action.go — panic recovery

The defer recover() is correctly placed as the first statement inside the goroutine body, so it covers the entire call to RunSteadybitDebug. On a panic, it logs and stores DebugRun{Finished: true} with an empty ResultZip. Status already guards the artifact path with os.Stat, so the empty-zip path resolves cleanly — good.

One subtle correctness point worth confirming: the if r != nil guard means the defer is a no-op on a clean return, so the normal-path Store at lines 127–130 still runs exclusively. No double-store on the happy path.

extdebug/action.go — map deletion in Stop

debugRuns.Delete before os.RemoveAll is the right order: the entry is gone before the working directory is torn down, so a concurrent Status call won't try to read from a half-deleted directory.

Acknowledged residuals (not regressions)

The PR description correctly identifies two residuals that require a cancellable RunSteadybitDebug:

  1. If the goroutine outlives Stop, its final Store re-inserts the entry → one-entry re-leak per early-stopped run.
  2. os.RemoveAll in Stop races writes from a still-running goroutine.

Neither is introduced by this PR; both need context propagation to fix properly. Deferring is the right call.

CHANGELOG.md

Entry is clear and accurate.


Overall: Both fixes are minimal, correct, and at the right altitude. The residuals are honestly documented and do not affect correctness for the normal (run-to-completion) path. LGTM.

@sonarqubecloud

Copy link
Copy Markdown

@joshiste joshiste merged commit 7297e6a into main Jun 26, 2026
14 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 26, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant