Skip to content

fix: prevent data races and panics in SDK stop/heartbeat handling#52

Merged
joshiste merged 2 commits into
mainfrom
fix/sdk-stop-heartbeat-races
Jul 1, 2026
Merged

fix: prevent data races and panics in SDK stop/heartbeat handling#52
joshiste merged 2 commits into
mainfrom
fix/sdk-stop-heartbeat-races

Conversation

@joshiste

@joshiste joshiste commented Jul 1, 2026

Copy link
Copy Markdown
Member

Problem

The kit audit found the preflight SDK carries the same two concurrency bugs just fixed in action-kit (action-kit#457) — its stop/heartbeat plumbing is a near-verbatim copy — plus a related monitor-goroutine leak. All three can affect any extension using preflights; the first two crash the process (data race / panic in a non-recovered goroutine, not caught by PanicRecovery).

  1. stopEvents slice data race (MAJOR) — markAsStopped (append + stopEvents[1:]) and getStopEvent (range) run with no lock from the HTTP status/cancel handlers, the heartbeat-timeout goroutine (CancelPreflight) and the signal handler.
  2. Heartbeat channel double-close / send-on-closed panic (HIGH) — Monitor.Stop closed pulse non-idempotently and RecordHeartbeat did a blocking send; two racing stop paths → close of closed channel, a concurrent record → send on closed channel.
  3. Heartbeat monitor goroutine leak on repeated Start (MAJOR) — monitorHeartbeatWithCallback stored a new monitor under an execution id without stopping any monitor already registered there, leaking the previous monitor's two goroutines on a retried Start.

Fix

  • Guard stopEvents with a package sync.Mutex.
  • Make heartbeat.Monitor self-guarding: mu+closed flag → idempotent Stop and a non-blocking, closed-safe RecordHeartbeat.
  • stopMonitorHeartbeat uses sync.Map.LoadAndDelete.
  • monitorHeartbeatWithCallback uses sync.Map.Swap to stop-and-replace any existing monitor for the same execution id.

Adds race tests for the stop/heartbeat paths and a goroutine-leak regression test for repeated Start.

Verification

go build ./..., go vet ./..., go test -race ./... pass (module go/preflight_kit_sdk); existing tests still pass.

Fixes (1) and (2) mirror action-kit#457; (3) is the preflight-specific monitor leak. This closes all three preflight-kit findings from the kit audit.

The preflight SDK shares the same two concurrency bugs just fixed in
action-kit (#457):

- The shared stopEvents slice was appended/resliced/ranged from the HTTP
  stop and status handlers, the heartbeat-timeout goroutine and the signal
  handler with no lock — a data race. Guard it with a mutex.
- heartbeat.Monitor.Stop closed the pulse channel non-idempotently and
  RecordHeartbeat did a blocking send, so two racing stop paths could
  double-close and a concurrent RecordHeartbeat could send on a closed
  channel — both panic in a non-recovered goroutine, crashing the process.
  Make Stop idempotent and RecordHeartbeat a non-blocking, closed-safe send,
  and use LoadAndDelete in stopMonitorHeartbeat.

Adds race tests for both paths.
@joshiste joshiste force-pushed the fix/sdk-stop-heartbeat-races branch from 81131ec to 08a839a Compare July 1, 2026 12:15
@claude

claude Bot commented Jul 1, 2026

Copy link
Copy Markdown

Claude finished @joshiste's task in 2m 18s —— View job


Claude's Review

  • Check previously unresolved review threads for fixes (none exist yet on this PR)
  • Review new changes for correctness, races, security, performance
  • Post final review feedback

Summary: The fix correctly mirrors the action-kit#457 pattern and addresses both races described in the PR body. No bugs found.

heartbeat/heartbeat.go

  • Monitor.mu + closed correctly serialize RecordHeartbeat against Stop, so the check-then-send/close is atomic — no more send on closed channel or double close.
  • RecordHeartbeat's non-blocking select/default is the right call: it must never block a caller from an HTTP handler goroutine, and dropping a heartbeat when the buffer (size 10) is full is harmless since only recency matters.

preflight_sdk.go

  • stopEventsMu correctly guards both markAsStopped (write) and getStopEvent (read); getStopEvent returns a copy of the loop variable, safe to return post-unlock (Go 1.25 per-iteration loop vars, confirmed via go.mod).
  • stopMonitorHeartbeat using LoadAndDelete is correct and avoids the prior load-then-delete TOCTOU gap.
  • The added Swap-and-Stop-previous logic in monitorHeartbeatWithCallback (2nd commit) correctly stops the old monitor's internal goroutine on a repeated Start for the same execution id — verified the old monitor's Notify goroutine exits via its normal close(signal) path once pulse is closed, so no goroutine leak across restarts.

Not a new issue, just confirming scope: the broader map/goroutine leak on Cancel == nil / timeout paths mentioned in the PR's "Notes" section is real but pre-existing and explicitly deferred by the author to a follow-up PR — correctly out of scope here.

Verification: go build, go vet, and the repo's ci workflow all pass on this HEAD (checked via CI status). The new tests (TestMonitor_concurrent_stop_and_record, TestStopEvents_concurrent_access, TestMonitorHeartbeat_restart_does_not_leak_goroutines) correctly target the fixed races under -race.

No blocking issues. LGTM.

monitorHeartbeatWithCallback stored a new monitor for an execution id
without stopping any monitor already registered under that id, so a
repeated Start (same execution id) leaked the previous monitor's two
goroutines. Swap-and-Stop the previous monitor (Stop is idempotent).
Adds a goroutine-leak regression test.
@sonarqubecloud

sonarqubecloud Bot commented Jul 1, 2026

Copy link
Copy Markdown

@joshiste joshiste merged commit af0f805 into main Jul 1, 2026
6 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Jul 1, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant