
Harden job logging and timeout cleanup#476

Merged
wesm merged 6 commits into roborev-dev:main from wesm:fix/low-resources-logging
Mar 8, 2026

Conversation

@wesm
Collaborator

@wesm wesm commented Mar 8, 2026

Summary

This PR hardens two failure paths that show up under resource pressure:

  1. Per-job log files could be missing even though the job appeared in the TUI.
  2. Jobs could appear to run well past job_timeout_minutes because the worker context timed out, but the agent subprocess and pipe handling did not always unwind promptly.

The primary goal here is reliability, not a policy change. The existing default 30 minute timeout remains in place; this change makes that timeout and per-job logging more dependable.

What can go wrong today

Scenario 1: log file creation fails when the job starts

Per-job logging was best-effort. We attempted to open ~/.roborev/logs/jobs/<id>.log once at job start. If that failed because of a transient filesystem or resource problem, roborev logged a warning and permanently disabled disk logging for that job.

On a resource-constrained machine, that can happen if:

  • the machine is under RAM pressure and filesystem operations stall or fail transiently
  • the data directory is temporarily unavailable or in a bad state
  • writes start failing mid-job and never recover into a reopened log file

The visible result is a job in the TUI with no corresponding job log file, even though the job itself may finish.

Scenario 2: the job timeout fires, but the review does not actually unwind

The worker already applies a hard timeout via context.WithTimeout(...), defaulting to 30 minutes. The bug is that some agent subprocess paths could still remain stuck in process wait / pipe handling after the context deadline.

That means a user can observe a review job apparently "running" for longer than the configured timeout, especially when subprocesses are slow to terminate or inherited pipes remain open.

On low-resource machines this is more likely because:

  • child processes can stall while under memory pressure
  • cleanup can take longer
  • stdout pipe / background process behavior can keep Wait() from returning promptly

What this PR changes

1. Retryable job log writer

Replace the old one-shot safeWriter behavior with a retrying jobLogWriter:

  • it keeps trying to open/reopen the per-job log file after transient failures
  • it buffers a bounded amount of output in memory while disk logging is unavailable
  • it flushes buffered content once logging becomes available again
  • if the buffer overflows, it records how many bytes were dropped instead of silently disabling logging forever

This means a transient log-file open/write failure no longer dooms the entire job to have no disk log.

2. Centralized subprocess wait configuration and timeout cleanup

Add shared subprocess helpers so all agent adapters use the same wait-delay handling.

For streaming adapters, close the stdout pipe when the job context is done. This helps break cases where the context deadline has fired but the reader / wait path is still blocked behind lingering pipe state.

After cmd.Wait() completes, return context.DeadlineExceeded when the job context has expired instead of surfacing a generic subprocess error.

3. Explicit timeout handling in the worker

When an agent returns a deadline error, the worker now records a stable job error like:

agent timeout after 30m0s

and sends it through the normal retry/failover path.

This makes timeout failures clearer in the database, TUI, and hooks, and avoids jobs looking like ambiguous generic agent failures.

Why this fixes the reported behavior

The reported symptoms were:

  • not every displayed job had a log file
  • some review jobs seemed to hang for over an hour
  • the machine had limited RAM

This PR directly addresses those failure modes:

  • transient resource issues no longer permanently disable per-job disk logging
  • timed-out agent jobs are much more likely to unwind promptly instead of lingering past the configured timeout

Validation

  • go fmt ./...
  • go vet ./...
  • go test ./...

Notes from local investigation

On the investigated machine there are more jobs in reviews.db than files in ~/.roborev/logs/jobs, but the mismatch is historical rather than current: jobs from the last 7 days all had logs. The older missing logs line up with the existing log-retention/cleanup behavior and prior best-effort logging behavior. This PR is aimed at preventing new gaps under transient resource pressure and making timeout enforcement behave consistently.

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Mar 8, 2026

roborev: Combined Review (87f5db6)

Verdict: This PR hardens subprocess timeout handling and improves job log reliability, but introduces a few medium-severity issues regarding goroutine leaks and logging resilience.

Medium

Goroutine leak risk in context pipe-closer
File: internal/agent/process.go:15

closeOnContextDone spawns a goroutine blocked on ctx.Done(). If a caller uses a context without a deadline or cancel function (e.g.,
context.Background()), ctx.Done() returns nil and the goroutine will block forever. Furthermore, if a cancellable context is provided but never actually cancelled, the goroutine can still live indefinitely after the function returns.
Suggested fix: Add if ctx.Done() == nil { return } before starting the goroutine. To handle the case where a context isn't cancelled, make closeOnContextDone return a stop() function and select on both ctx.Done() and a done channel, allowing callers to defer stop().

io.MultiWriter halts on client stream failure, dropping local logs
File: internal/daemon/worker.go:448
io.MultiWriter writes sequentially and stops on
the first error. Because outputWriter is passed first, if the client stream fails or disconnects, io.MultiWriter halts and the rest of the agent's output is dropped before it can reach the resilient jobLog writer.
Suggested fix: Swap the order to io.MultiWriter(jobLog, outputWriter). Since jobLog.Write handles its own errors and always returns len(p), nil, placing it first ensures the local log captures everything even if the client stream fails.
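The ordering effect is easy to demonstrate in isolation. In this sketch, `failingWriter` stands in for a disconnected client stream (the names are illustrative, not roborev's types):

```go
package main

import (
	"bytes"
	"errors"
	"io"
)

// failingWriter stands in for a disconnected client stream.
type failingWriter struct{}

func (failingWriter) Write(p []byte) (int, error) {
	return 0, errors.New("stream closed")
}

// writerOrderDemo shows why the resilient local log writer must come first:
// io.MultiWriter stops at the first error, so any writer placed after a
// failing one never sees the data.
func writerOrderDemo() (afterFailure, beforeFailure string) {
	var a, b bytes.Buffer
	io.MultiWriter(failingWriter{}, &a).Write([]byte("agent output\n")) // a stays empty
	io.MultiWriter(&b, failingWriter{}).Write([]byte("agent output\n")) // b captures it
	return a.String(), b.String()
}
```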


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci

roborev-ci Bot commented Mar 8, 2026

roborev: Combined Review (0602214)

Verdict: One medium-severity issue identified regarding partial write handling; no other significant issues
or security flaws were found.

Medium

  • File: internal/daemon/joblog.go:147, internal/daemon/joblog.go:199
    Problem: Partial writes are not handled correctly. Both direct writes and buffered flushes ignore the returned n on
    error; if Write writes a prefix and returns an error, the full payload remains/re-enters the buffer and can be written again, causing duplicated log bytes.
    Suggested fix: Handle n, err := w.f.Write(...) explicitly and keep only the unwritten suffix (p[n:]) buffered, and in flushBufferedLocked trim the written prefix before returning on error.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci

roborev-ci Bot commented Mar 8, 2026

roborev: Combined Review (d4434f5)

Verdict: One Medium severity issue identified regarding an io.Writer contract violation in the new job log writer; otherwise, the changes are clean.

Medium

  • Partial-write path returns wrong byte count
    • Files:
      internal/daemon/joblog.go (Lines 153, 164)

    • Description: In jobLogWriter.Write, p is sliced (p = p[n:]) after a partial write, then the function returns len(p), nil. That returns
      the remaining length, not the original input length.

    • Impact: This violates io.Writer expectations and causes io.MultiWriter to emit io.ErrShortWrite, potentially skipping downstream writers (including live output) and surfacing avoidable agent errors during disk-pressure scenarios.

    • Suggested fix: Capture origLen := len(p) at function entry and always return origLen, nil when intentionally swallowing log-write failures and buffering the remainder.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci

roborev-ci Bot commented Mar 8, 2026

roborev: Combined Review (ee037bd)

Verdict: The changes improve subprocess timeout/cancellation handling and job logging, but a high-severity bug in the cancellation tracking logic must be addressed.

High

  • Location: internal/agent/process.go (in configureSubprocess)
  • Problem: exec.Cmd.Cancel is nil by default in Go. The if cmd.Cancel != nil condition will always evaluate to false unless explicitly set before calling. As a result, tracker.canceledByContext.Store(true) is never executed, and contextProcessError will fail to correctly classify closed-pipe parse errors as context cancellations.
  • Suggested Fix: Remove the if cmd.Cancel != nil wrapper and intercept the cancellation unconditionally. If the original cmd.Cancel is nil, fall back to calling cmd.Process.Kill() to preserve default behavior.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci

roborev-ci Bot commented Mar 8, 2026

roborev: Combined Review (15f6e40)

Verdict: The changes improve agent timeout handling and log resilience, but require a fix for a high-severity bug in subprocess cancellation tracking.

High

  • Location: internal/agent/process.go:24 (in configureSubprocess)
  • Problem: exec.CommandContext leaves cmd.Cancel as nil by default. Because configureSubprocess only wraps cmd.Cancel if it is not nil, the cancellation tracking logic is never applied for standard commands. This causes tracker.canceledByContext to always remain false (breaking the parse-error timeout mapping) and causes TestConfigureSubprocessDoesNotMarkCanceledWhenProcessAlreadyExited to fail immediately on the cmd.Cancel == nil check.
  • Fix: Provide a default cancel fallback before wrapping if it is nil:
    cancel := cmd.Cancel
    if cancel == nil {
        cancel = func() error { return cmd.Process.Kill() }
    }

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@wesm
Collaborator Author

wesm commented Mar 8, 2026

exec.CommandContext does not leave cmd.Cancel nil. In Go, exec.CommandContext sets cmd.Cancel to kill the process by default. All subprocesses in this change use exec.CommandContext, so the wrapper in configureSubprocess is active in the real code paths. I verified this locally with both go doc os/exec.CommandContext and a small runtime check showing cmd.Cancel == nil is false for exec.CommandContext(...). Because of that, the reported premise does not apply to this branch.

@wesm wesm merged commit 8c707e0 into roborev-dev:main Mar 8, 2026
8 checks passed
