fix: don't remove the debug working dir while a gather is in flight #94
Stop did os.RemoveAll(WorkingDir) unconditionally. When an experiment is stopped while the background gather goroutine is still archiving, that removed the directory out from under steadybit-debug's tar, which calls os.Exit(1) on tar failure — crashing the whole extension (and not caught by the existing panic recovery, since os.Exit is not a panic). Coordinate cleanup instead: Stop removes WorkingDir only when no gather goroutine could be tarring it (none started, or it has finished); while a gather is in flight it signals a per-run cancel and leaves WorkingDir for the goroutine, which removes it once its (uncancellable) work returns. The result archive lives inside WorkingDir, so the goroutine must not remove WorkingDir before Status uploads it.
Claude finished @joshiste's task in 2m 19s. Claude's Review
Previously flagged issues: Both review threads I opened on the earlier revision (bc22f81) — the duplicate- Review of the current HEAD ( I traced the full lifecycle (
No new bugs, security, or performance issues found. I wasn't able to run |
Address review: the cleanup decision was split across two sync.Maps and a context, read non-atomically. A duplicate Stop (e.g. a platform retry) could then see no in-flight marker and remove WorkingDir while the first call's gather goroutine was still tarring it — reintroducing the os.Exit crash — and a TOCTOU between the goroutine's stop check and Stop could leak WorkingDir. Replace both maps with one *debugRun per execution guarded by a mutex. Stop consumes the entry via LoadAndDelete (so retries are a no-op) and, under the lock, removes WorkingDir only when no gather is in flight; otherwise the goroutine removes it once RunSteadybitDebug returns and it observes stopped. The cleanup is idempotent.
Problem
The debug action gathers debug information in a background goroutine (RunSteadybitDebug → the in-process steadybit-debug library). Stop unconditionally did os.RemoveAll(state.WorkingDir). steadybit-debug's ZipOutputDirectory runs tar over the working directory and calls os.Exit(1) if the tar fails (and AddOutputDirectory does the same on mkdir failure). So when an experiment is stopped while the gather is still archiving, Stop removing the working directory out from under the in-flight tar makes the library os.Exit(1), crashing the entire extension process and taking down every other in-flight action. This is not caught by the existing panic recovery (#93), because os.Exit is not a panic. GatherInformation takes no context, so the gather genuinely cannot be interrupted mid-run; the fix has to coordinate cleanup rather than cancel the work.

Fix
The result archive lives inside WorkingDir (WorkingDir/steadybit-debug-<ts>.tar.gz), so the gather goroutine must not remove WorkingDir before Status uploads the archive.
Stop removes WorkingDir only when no gather goroutine could be tarring it: either none started (e.g. prepared then rolled back) or it has already finished. While a gather is in flight, Stop signals a per-run context.CancelFunc and leaves WorkingDir alone; the goroutine removes WorkingDir itself once its (uncancellable) work returns and it sees the cancel.
This removes the os.RemoveAll-during-tar crash window while keeping the working directory cleaned up in every path (normal completion, stop-mid-gather, and never-started).
Added unit tests for the three Stop paths.

Deeper follow-up (out of scope)
The root cause is that steadybit-debug's AddOutputDirectory / ZipOutputDirectory call os.Exit(1) on mkdir/tar failure, and helper.go calls those. The principled fix is to have helper.go do the mkdir + tar itself and return an error instead of os.Exit, which would remove the crash hazard at the source and let cleanup ownership be simpler. That reimplements library orchestration (and adds a tar-shelling surface), so it's left as a separate follow-up.

Verification
go build ./..., go vet ./..., go test -race ./extdebug/ all pass.