
fix(0.14.1): per-attempt fetch timeout + retry thrown network errors #24

Merged: drewstone merged 2 commits into main from fix/backend-timeout on May 20, 2026

@drewstone
Contributor

Summary

A real eval persona burned 15 minutes on one hung request: the tcloud router accepted the connection, never responded, and the fetch sat open until the runtime gave up with `fetch failed`. Two gaps in `createOpenAICompatibleBackend` caused it:

  1. No per-attempt deadline: a hung upstream blocked the attempt indefinitely.
  2. Thrown fetch errors weren't retried: only HTTP error statuses were, so a thrown `fetch failed` killed the attempt.

Fix

  • `BackendRetryPolicy.requestTimeoutMs` (default 120s): a per-attempt `AbortController` deadline linked to the caller signal. A hung upstream now aborts after 2 min and is retried.
  • The fetch call is wrapped: a thrown error is retried with backoff, like a 5xx. Caller aborts stay terminal. Exhausted retries throw `BackendTransportError`.
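
A minimal sketch of the two fixes described above, for readers who want the control flow spelled out. It is not the code from this PR: `RetryPolicy`, `TransportError`, `FetchLike`, and `fetchWithRetry` are hypothetical names, the fetch function is injected so the sketch can be exercised without a network, and treating only 5xx statuses as retryable is an assumption.

```typescript
// Hypothetical names throughout; none of these are the library's actual exports.
interface RetryPolicy {
  maxAttempts: number;
  requestTimeoutMs: number; // per-attempt deadline
  backoffMs: number; // base delay, doubled after each failed attempt
}

class TransportError extends Error {}

type FetchLike = (
  url: string,
  init: { signal: AbortSignal },
) => Promise<{ status: number }>;

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function fetchWithRetry(
  doFetch: FetchLike,
  url: string,
  policy: RetryPolicy,
  callerSignal?: AbortSignal,
): Promise<{ status: number }> {
  let lastError = "no attempts made";
  for (let attempt = 0; attempt < policy.maxAttempts; attempt++) {
    if (callerSignal?.aborted) throw new Error("aborted by caller");

    // One controller per attempt: it fires on the deadline or when the caller aborts.
    const controller = new AbortController();
    const onCallerAbort = () => controller.abort();
    callerSignal?.addEventListener("abort", onCallerAbort, { once: true });
    const timer = setTimeout(() => controller.abort(), policy.requestTimeoutMs);

    try {
      const res = await doFetch(url, { signal: controller.signal });
      if (res.status < 500) return res; // assumption: only 5xx is retried
      lastError = `HTTP ${res.status}`;
    } catch (err) {
      // Caller-initiated aborts stay terminal; anything else is a transport failure.
      if (callerSignal?.aborted) throw err;
      lastError = err instanceof Error ? err.message : String(err);
    } finally {
      clearTimeout(timer);
      callerSignal?.removeEventListener("abort", onCallerAbort);
    }

    // Exponential backoff between attempts, skipped after the last one.
    if (attempt < policy.maxAttempts - 1) {
      await sleep(policy.backoffMs * 2 ** attempt);
    }
  }
  throw new TransportError(`retries exhausted: ${lastError}`);
}
```

With the real `fetch`, a call would look like `fetchWithRetry((u, init) => fetch(u, init), url, { maxAttempts: 3, requestTimeoutMs: 120_000, backoffMs: 500 }, signal)`. Creating a fresh controller per attempt matters: an aborted signal cannot be reused, so sharing one controller across attempts would make every retry fail immediately.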

Test plan

  • 3 new tests (thrown-error retry, per-attempt-timeout abort + retry, all attempts throw → `BackendTransportError`)
  • 216 tests pass, typecheck + biome clean

Unblocks the eval run: without this, one router hiccup kills a persona.

drewstone added 2 commits May 20, 2026 22:49
A production eval persona burned 15 minutes on a single hung request:
the tcloud router accepted the connection, never responded, and the
fetch sat open until the runtime gave up with `fetch failed`. Two gaps
in createOpenAICompatibleBackend caused it:

1. No per-attempt deadline — a hung upstream blocked the attempt
   indefinitely.
2. Thrown fetch errors (network failure, DNS, the eventual `fetch
   failed`) propagated straight out of the retry loop. Only HTTP error
   *statuses* were retried; a thrown error killed the attempt.

Fixes:
- BackendRetryPolicy.requestTimeoutMs (default 120s) — each attempt gets
  an AbortController deadline linked to the caller signal. A hung
  upstream now aborts in 2 min and retries instead of hanging.
- The fetch call is wrapped: a thrown error is treated as a retryable
  transport failure (backoff + retry) just like a 5xx. Caller-initiated
  aborts stay terminal. Exhausted retries throw BackendTransportError
  with the last error message.

3 new tests: thrown-error retry, per-attempt-timeout abort + retry,
all-attempts-throw → BackendTransportError. 216 tests pass.
drewstone merged commit 9500733 into main on May 20, 2026
1 check passed
drewstone deleted the fix/backend-timeout branch May 20, 2026 19:52