This PR hardens two failure paths that show up under resource pressure:
Per-job log files could be missing even though the job appeared in the TUI.
Jobs could appear to run well past job_timeout_minutes because the worker context timed out but the agent subprocess/pipe handling did not always unwind promptly.
The primary goal here is reliability, not changing policy. The existing default 30-minute timeout remains in place; this change makes that timeout and per-job logging more dependable.
What can go wrong today
Scenario 1: log file creation fails when the job starts
Per-job logging was best-effort. We attempted to open ~/.roborev/logs/jobs/<id>.log once at job start. If that failed because of a transient filesystem or resource problem, roborev logged a warning and permanently disabled disk logging for that job.
On a resource-constrained machine, that can happen if:
the machine is under RAM pressure and filesystem operations stall or fail transiently
the data directory is temporarily unavailable or in a bad state
writes start failing mid-job and never recover into a reopened log file
The visible result is a job in the TUI with no corresponding job log file, even though the job itself may finish.
Scenario 2: the job timeout fires, but the review does not actually unwind
The worker already applies a hard timeout via context.WithTimeout(...), defaulting to 30 minutes. The bug is that some agent subprocess paths could still remain stuck in process wait / pipe handling after the context deadline.
That means a user can observe a review job apparently "running" for longer than the configured timeout, especially when subprocesses are slow to terminate or inherited pipes remain open.
On low-resource machines this is more likely because:
child processes can stall while under memory pressure
cleanup can take longer
stdout pipe / background process behavior can keep Wait() from returning promptly
What this PR changes
1. Retryable job log writer
Replace the old one-shot safeWriter behavior with a retrying jobLogWriter:
it keeps trying to open/reopen the per-job log file after transient failures
it buffers a bounded amount of output in memory while disk logging is unavailable
it flushes buffered content once logging becomes available again
if the buffer overflows, it records how many bytes were dropped instead of silently disabling logging forever
This means a transient log-file open/write failure no longer dooms the entire job to have no disk log.
2. Centralized subprocess wait configuration and timeout cleanup
Add shared subprocess helpers so all agent adapters use the same wait-delay handling.
For streaming adapters, close the stdout pipe when the job context is done. This helps break cases where the context deadline has fired but the reader / wait path is still blocked behind lingering pipe state.
After cmd.Wait() completes, return context.DeadlineExceeded when the job context has expired instead of surfacing a generic subprocess error.
3. Explicit timeout handling in the worker
When an agent returns a deadline error, the worker now records a stable job error like:
agent timeout after 30m0s
and sends it through the normal retry/failover path.
This makes timeout failures clearer in the database, TUI, and hooks, and avoids jobs looking like ambiguous generic agent failures.
Why this fixes the reported behavior
The reported symptoms were:
not every displayed job had a log file
some review jobs seemed to hang for over an hour
the machine had limited RAM
This PR directly addresses those failure modes:
transient resource issues no longer permanently disable per-job disk logging
timed-out agent jobs are much more likely to unwind promptly instead of lingering past the configured timeout
Validation
go fmt ./...
go vet ./...
go test ./...
Notes from local investigation
On the investigated machine there are more jobs in reviews.db than files in ~/.roborev/logs/jobs, but the mismatch is historical rather than current: jobs from the last 7 days all had logs. The older missing logs line up with the existing log-retention/cleanup behavior and prior best-effort logging behavior. This PR is aimed at preventing new gaps under transient resource pressure and making timeout enforcement behave consistently.
Verdict: This PR hardens subprocess timeout handling and improves job log reliability, but introduces a few medium-severity issues regarding goroutine leaks and logging resilience.
Medium
Goroutine leak risk in context pipe-closer
File: internal/agent/process.go:15
closeOnContextDone spawns a goroutine blocked on ctx.Done(). If a caller passes a context without a deadline or cancel function (e.g., context.Background()), ctx.Done() returns nil and the goroutine blocks forever. Even with a cancellable context, if it is never actually cancelled the goroutine can live indefinitely after the function returns. Suggested fix: add if ctx.Done() == nil { return } before starting the goroutine, and to handle contexts that are never cancelled, make closeOnContextDone return a stop() function that selects on both ctx.Done() and a done channel, allowing callers to defer stop().
io.MultiWriter halts on client stream failure, dropping local logs
File: internal/daemon/worker.go:448
io.MultiWriter writes sequentially and stops on the first error. Because outputWriter is passed first, if the client stream fails or disconnects, io.MultiWriter halts and the rest of the agent's output is dropped before it can reach the resilient jobLog writer. Suggested fix: swap the order to io.MultiWriter(jobLog, outputWriter). Since jobLog.Write handles its own errors and always returns len(p), nil, placing it first ensures the local log captures everything even if the client stream fails.
Verdict: One medium-severity issue identified regarding partial write handling; no other significant issues or security flaws were found.
Medium
File: internal/daemon/joblog.go:147, internal/daemon/joblog.go:199
Problem: Partial writes are not handled correctly. Both direct writes and buffered flushes ignore the returned n on error; if Write writes a prefix and then returns an error, the full payload remains in (or re-enters) the buffer and can be written again, duplicating log bytes. Suggested fix: handle n, err := w.f.Write(...) explicitly and keep only the unwritten suffix buffered (p[n:]), and in flushBufferedLocked trim the written prefix before returning on error.
Verdict: One Medium severity issue identified regarding an io.Writer contract violation in the new job log writer; otherwise, the changes are clean.
Medium
Partial-write path returns wrong byte count
Files: internal/daemon/joblog.go (Lines 153, 164)
Description: In jobLogWriter.Write, p is sliced (p = p[n:]) after a partial write, and the function then returns len(p), nil. That returns the remaining length, not the original input length.
Impact: This violates io.Writer expectations and causes io.MultiWriter to emit io.ErrShortWrite, potentially skipping downstream writers (including live output) and surfacing avoidable agent errors during disk-pressure scenarios.
Suggested fix: Capture origLen := len(p) at function entry and always return origLen, nil when intentionally swallowing log-write failures and buffering the remainder.
Verdict: The changes improve subprocess timeout/cancellation handling and job logging, but a high-severity bug in the cancellation tracking logic must be addressed.
High
Location: internal/agent/process.go (in configureSubprocess)
Problem: exec.Cmd.Cancel is nil by default in Go. The if cmd.Cancel != nil condition will always evaluate to false unless Cancel is explicitly set beforehand. As a result, tracker.canceledByContext.Store(true) is never executed, and contextProcessError will fail to correctly classify closed-pipe parse errors as context cancellations.
Suggested fix: Remove the if cmd.Cancel != nil wrapper and intercept the cancellation unconditionally. If the original cmd.Cancel is nil, fall back to calling cmd.Process.Kill() to preserve default behavior.
Verdict: The changes improve agent timeout handling and log resilience, but require a fix for a high-severity bug in subprocess cancellation tracking.
High
Location: internal/agent/process.go:24 (in configureSubprocess)
Problem: exec.CommandContext leaves cmd.Cancel as nil by default. Because configureSubprocess only wraps cmd.Cancel if it is not nil, the cancellation tracking logic is never applied for standard commands. This causes tracker.canceledByContext to always remain false (breaking the parse-error timeout mapping) and causes TestConfigureSubprocessDoesNotMarkCanceledWhenProcessAlreadyExited to fail immediately on the cmd.Cancel == nil check.
Fix: Provide a default cancel fallback before wrapping if it is nil.
Author response: "exec.CommandContext does not leave cmd.Cancel nil. In Go, exec.CommandContext sets cmd.Cancel to kill the process by default. All subprocesses in this change use exec.CommandContext, so the wrapper in configureSubprocess is active in the real code paths. I verified this locally with both go doc os/exec.CommandContext and a small runtime check showing cmd.Cancel == nil is false for exec.CommandContext(...). Because of that, the reported premise does not apply to this branch."