Skip to content

fix: don't remove the debug working dir while a gather is in flight#94

Merged
joshiste merged 2 commits into
mainfrom
fix/stop-gather-workingdir-race
Jul 1, 2026
Merged

fix: don't remove the debug working dir while a gather is in flight#94
joshiste merged 2 commits into
mainfrom
fix/stop-gather-workingdir-race

Conversation

@joshiste

Copy link
Copy Markdown
Member

Problem

The debug action gathers debug information in a background goroutine (RunSteadybitDebug → the in-process steadybit-debug library). Stop unconditionally did os.RemoveAll(state.WorkingDir).

steadybit-debug's ZipOutputDirectory runs tar over the working directory and calls os.Exit(1) if the tar fails (and AddOutputDirectory does the same on mkdir failure). So when an experiment is stopped while the gather is still archiving, Stop removing the working directory out from under the in-flight tar makes the library os.Exit(1)crashing the entire extension process and taking down every other in-flight action. This is not caught by the existing panic recovery (#93), because os.Exit is not a panic.

GatherInformation takes no context, so the gather genuinely cannot be interrupted mid-run — the fix has to coordinate cleanup rather than cancel the work.

Fix

  • The result archive lives inside WorkingDir (WorkingDir/steadybit-debug-<ts>.tar.gz), so the gather goroutine must not remove WorkingDir before Status uploads the archive.
  • Stop removes WorkingDir only when no gather goroutine could be tarring it — either none started (e.g. prepared then rolled back) or it has already finished. While a gather is in flight, Stop signals a per-run context.CancelFunc and leaves WorkingDir alone; the goroutine removes WorkingDir itself once its (uncancellable) work returns and it sees the cancel.

This removes the os.RemoveAll-during-tar crash window while keeping the working directory cleaned up in every path (normal completion, stop-mid-gather, and never-started).

Added unit tests for the three Stop paths.

Deeper follow-up (out of scope)

The root cause is that steadybit-debug's AddOutputDirectory/ZipOutputDirectory call os.Exit(1) on mkdir/tar failure, and helper.go calls those. The principled fix is to have helper.go do the mkdir + tar itself and return an error instead of os.Exit, which would remove the crash hazard at the source and let cleanup ownership be simpler. That reimplements library orchestration (and adds a tar-shelling surface), so it's left as a separate follow-up.

Verification

go build ./..., go vet ./..., go test -race ./extdebug/ all pass.

Stop did os.RemoveAll(WorkingDir) unconditionally. When an experiment is
stopped while the background gather goroutine is still archiving, that
removed the directory out from under steadybit-debug's tar, which calls
os.Exit(1) on tar failure — crashing the whole extension (and not caught
by the existing panic recovery, since os.Exit is not a panic).

Coordinate cleanup instead: Stop removes WorkingDir only when no gather
goroutine could be tarring it (none started, or it has finished); while a
gather is in flight it signals a per-run cancel and leaves WorkingDir for
the goroutine, which removes it once its (uncancellable) work returns. The
result archive lives inside WorkingDir, so the goroutine must not remove
WorkingDir before Status uploads it.
@claude

claude Bot commented Jun 30, 2026

Copy link
Copy Markdown

Claude finished @joshiste's task in 2m 19s —— View job


Claude's Review

  • Check previously unresolved review threads for fixes
  • Gather context (diff, repo, CLAUDE.md)
  • Review new changes (race conditions, bugs, security, performance)
  • Post findings / resolve addressed threads

Previously flagged issues: Both review threads I opened on the earlier revision (bc22f81) — the duplicate-Stop crash window and the Stop/goroutine TOCTOU leak — are already resolved. Commit 84309b2 replaced the split debugRuns/debugCancels sync.Maps + context.CancelFunc state with a single *debugRun struct guarded by one sync.Mutex per execution, and Stop now consumes it via debugRuns.LoadAndDelete. That makes the "is a gather still tarring WorkingDir?" decision atomic and makes a retried Stop a true no-op, which closes both windows. The author already replied to and resolved both threads — nothing further to do there.

Review of the current HEAD (extdebug/action.go, extdebug/action_test.go):

I traced the full lifecycle (PrepareStartStatus/Stop, plus the panic-recovery path) against the new debugRun mutex model and didn't find any new race or correctness issue:

  • removeWorkingDir() is only ever called with mu held (panic handler, success path, Stop), and is idempotent via the cleaned flag, so no double-remove/double-close risk.
  • Start sets started = true under the lock before launching the goroutine, so a Stop landing in the gap between that and the goroutine actually calling RunSteadybitDebug still correctly defers cleanup to the goroutine instead of racing it.
  • Stop's LoadAndDelete on debugRuns is the single source of truth now — a duplicate/retried Stop finds nothing on the second call and returns immediately, so it can no longer remove WorkingDir out from under an in-flight tar.
  • Status still uses non-destructive Load, so polling it repeatedly is safe and doesn't interfere with Stop's bookkeeping.
  • The added tests (TestStopRemovesWorkingDirWhenGatherNeverStarted, TestStopRemovesWorkingDirWhenGatherFinished, TestStopLeavesWorkingDirWhileGatherInFlight, TestDuplicateStopDoesNotRemoveWorkingDirMidGather) directly cover the three Stop paths and the duplicate-call scenario that was previously broken.

No new bugs, security, or performance issues found. I wasn't able to run go build/go vet/go test -race myself in this sandbox (Bash invocations of go require approval that isn't available in this run) — the PR description states all three were run and passed, and the static analysis above is consistent with that.

Comment thread extdebug/action.go Outdated
Comment thread extdebug/action.go Outdated
Address review: the cleanup decision was split across two sync.Maps and a
context, read non-atomically. A duplicate Stop (e.g. a platform retry)
could then see no in-flight marker and remove WorkingDir while the first
call's gather goroutine was still tarring it — reintroducing the os.Exit
crash — and a TOCTOU between the goroutine's stop check and Stop could
leak WorkingDir.

Replace both maps with one *debugRun per execution guarded by a mutex.
Stop consumes the entry via LoadAndDelete (so retries are a no-op) and,
under the lock, removes WorkingDir only when no gather is in flight;
otherwise the goroutine removes it once RunSteadybitDebug returns and it
observes stopped. The cleanup is idempotent.
@sonarqubecloud

Copy link
Copy Markdown

@joshiste joshiste merged commit 79f27b8 into main Jul 1, 2026
18 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Jul 1, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant