This PR hardens two failure paths that show up under resource pressure:
Per-job log files could be missing even though the job appeared in the TUI.
Jobs could appear to run well past job_timeout_minutes because the worker context timed out but the agent subprocess/pipe handling did not always unwind promptly.
The primary goal here is reliability, not changing policy. The existing default 30-minute timeout remains in place; this change makes that timeout and per-job logging more dependable.
What can go wrong today
Scenario 1: log file creation fails when the job starts
Per-job logging was best-effort. We attempted to open ~/.roborev/logs/jobs/<id>.log once at job start. If that failed because of a transient filesystem or resource problem, roborev logged a warning and permanently disabled disk logging for that job.
On a resource-constrained machine, that can happen if:
the machine is under RAM pressure and filesystem operations stall or fail transiently
the data directory is temporarily unavailable or in a bad state
writes start failing mid-job and never recover into a reopened log file
The visible result is a job in the TUI with no corresponding job log file, even though the job itself may finish.
Scenario 2: the job timeout fires, but the review does not actually unwind
The worker already applies a hard timeout via context.WithTimeout(...), defaulting to 30 minutes. The bug is that some agent subprocess paths could still remain stuck in process wait / pipe handling after the context deadline.
That means a user can observe a review job apparently "running" for longer than the configured timeout, especially when subprocesses are slow to terminate or inherited pipes remain open.
On low-resource machines this is more likely because:
child processes can stall while under memory pressure
cleanup can take longer
stdout pipe / background process behavior can keep Wait() from returning promptly
What this PR changes
1. Retryable job log writer
Replace the old one-shot safeWriter behavior with a retrying jobLogWriter:
it keeps trying to open/reopen the per-job log file after transient failures
it buffers a bounded amount of output in memory while disk logging is unavailable
it flushes buffered content once logging becomes available again
if the buffer overflows, it records how many bytes were dropped instead of silently disabling logging forever
This means a transient log-file open/write failure no longer dooms the entire job to have no disk log.
2. Centralized subprocess wait configuration and timeout cleanup
Add shared subprocess helpers so all agent adapters use the same wait-delay handling.
For streaming adapters, close the stdout pipe when the job context is done. This helps break cases where the context deadline has fired but the reader / wait path is still blocked behind lingering pipe state.
After cmd.Wait() completes, return context.DeadlineExceeded when the job context has expired instead of surfacing a generic subprocess error.
3. Explicit timeout handling in the worker
When an agent returns a deadline error, the worker now records a stable job error like:
agent timeout after 30m0s
and sends it through the normal retry/failover path.
This makes timeout failures clearer in the database, TUI, and hooks, and avoids jobs looking like ambiguous generic agent failures.
Why this fixes the reported behavior
The reported symptoms were:
not every displayed job had a log file
some review jobs seemed to hang for over an hour
the machine had limited RAM
This PR directly addresses those failure modes:
transient resource issues no longer permanently disable per-job disk logging
timed-out agent jobs are much more likely to unwind promptly instead of lingering past the configured timeout
Validation
go fmt ./...
go vet ./...
go test ./...
Notes from local investigation
On the investigated machine there are more jobs in reviews.db than files in ~/.roborev/logs/jobs, but the mismatch is historical rather than current: jobs from the last 7 days all had logs. The older missing logs line up with the existing log-retention/cleanup behavior and prior best-effort logging behavior. This PR is aimed at preventing new gaps under transient resource pressure and making timeout enforcement behave consistently.
Verdict: This PR hardens subprocess timeout handling and improves job log reliability, but introduces a few medium-severity issues regarding goroutine leaks and logging resilience.
Medium
Goroutine leak risk in context pipe-closer
File: internal/agent/process.go:15
closeOnContextDone spawns a goroutine blocked on ctx.Done(). If a caller passes a context without a deadline or cancel function (e.g., context.Background()), ctx.Done() returns nil and the goroutine blocks forever. Even with a cancellable context, if it is never actually cancelled the goroutine can live indefinitely after the function returns. Suggested fix: add if ctx.Done() == nil { return } before starting the goroutine, and to handle contexts that are never cancelled, make closeOnContextDone return a stop() function that selects on both ctx.Done() and a done channel, allowing callers to defer stop().
io.MultiWriter halts on client stream failure, dropping local logs
File: internal/daemon/worker.go:448
io.MultiWriter writes sequentially and stops on the first error. Because outputWriter is passed first, if the client stream fails or disconnects, io.MultiWriter halts and the rest of the agent's output is dropped before it can reach the resilient jobLog writer. Suggested fix: swap the order to io.MultiWriter(jobLog, outputWriter). Since jobLog.Write handles its own errors and always returns len(p), nil, placing it first ensures the local log captures everything even if the client stream fails.
Verdict: One medium-severity issue identified regarding partial write handling; no other significant issues or security flaws were found.
Medium
File: internal/daemon/joblog.go:147, internal/daemon/joblog.go:199
Problem: Partial writes are not handled correctly. Both direct writes and buffered flushes ignore the returned n on error; if Write writes a prefix and then returns an error, the full payload remains in (or re-enters) the buffer and can be written again, duplicating log bytes. Suggested fix: handle n, err := w.f.Write(...) explicitly and keep only the unwritten suffix buffered (p[n:]), and in flushBufferedLocked trim the written prefix before returning on error.
Verdict: One Medium severity issue identified regarding an io.Writer contract violation in the new job log writer; otherwise, the changes are clean.
Medium
Partial-write path returns wrong byte count
Files: internal/daemon/joblog.go (Lines 153, 164)
Description: In jobLogWriter.Write, p is sliced (p = p[n:]) after a partial write, and the function then returns len(p), nil. That returns the remaining length, not the original input length.
Impact: This violates io.Writer expectations and causes io.MultiWriter to emit io.ErrShortWrite, potentially skipping downstream writers (including live output) and surfacing avoidable agent errors during disk-pressure scenarios.
Suggested fix: Capture origLen := len(p) at function entry and always return origLen, nil when intentionally swallowing log-write failures and buffering the remainder.
Verdict: The changes improve subprocess timeout/cancellation handling and job logging, but a high-severity bug in the cancellation tracking logic must be addressed.
High
Location: internal/agent/process.go (in configureSubprocess)
Problem: exec.Cmd.Cancel is nil by default in Go. The if cmd.Cancel != nil condition will always evaluate to false unless Cancel is explicitly set beforehand. As a result, tracker.canceledByContext.Store(true) is never executed, and contextProcessError will fail to correctly classify closed-pipe parse errors as context cancellations.
Suggested fix: Remove the if cmd.Cancel != nil wrapper and intercept the cancellation unconditionally. If the original cmd.Cancel is nil, fall back to calling cmd.Process.Kill() to preserve default behavior.
Verdict: The changes improve agent timeout handling and log resilience, but require a fix for a high-severity bug in subprocess cancellation tracking.
High
Location: internal/agent/process.go:24 (in configureSubprocess)
Problem: exec.CommandContext leaves cmd.Cancel as nil by default. Because configureSubprocess only wraps cmd.Cancel if it is not nil, the cancellation tracking logic is never applied for standard commands. This causes tracker.canceledByContext to always remain false (breaking the parse-error timeout mapping) and causes TestConfigureSubprocessDoesNotMarkCanceledWhenProcessAlreadyExited to fail immediately on the cmd.Cancel == nil check.
Fix: Provide a default cancel fallback before wrapping if it is nil.
Author response: "exec.CommandContext does not leave cmd.Cancel nil. In Go, exec.CommandContext sets cmd.Cancel to kill the process by default. All subprocesses in this change use exec.CommandContext, so the wrapper in configureSubprocess is active in the real code paths. I verified this locally with both go doc os/exec.CommandContext and a small runtime check showing cmd.Cancel == nil is false for exec.CommandContext(...). Because of that, the reported premise does not apply to this branch."