chore(webapp,run-engine): downgrade boundary log noise to warn#3462
chore(webapp,run-engine): downgrade boundary log noise to warn#3462
Conversation
|
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📜 Recent review details⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)
WalkthroughThe diff standardizes observability and error classification across the codebase: adds structured logging for object-store upload failures; introduces an OpenTelemetry meter and a Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Several boundary log sites and customer-input validation paths were
unconditionally logging at error level for failures the system already
handles gracefully (disconnect, retry-skip, return early). This batch
downgrades them to warn or counts them as metrics — visibility is
preserved in stdout / OTel metrics without surfacing them as alerts.
- apiBuilder.server.ts: logBoundaryError helper — warn for AbortError
and ServiceValidationError at loader/action boundary catches
- handleSocketIo.server.ts: warn for "Worker authentication failed"
(system disconnects on auth failure; refs TRI-8863)
- waitpointSystem.ts: skip throw and warn when run was canceled
while suspended (benign cancel-vs-resume race, no checkpoint to
resume from)
- runAttemptSystem.ts: warn for failed parse/validate of customer's
flushedMetadata (system already returns gracefully)
- batch-queue/index.ts: warn for non-retryable batch item failures
via result.skipRetries (queue size limit exceeded, etc.)
- queryPerformanceMonitor.server.ts: slow queries are observability,
not errors — warn
- timeoutDeployment.server.ts: deployment-state mismatch is a benign
timeout-vs-completion race — warn
- platform.v3.server.ts: platform_client.failures_total OTel counter
with {function, kind} labels replaces logger.error from BillingClient
call sites; helper recordPlatformFailure(fn, kind)
- waitpointCompletionPacket.server.ts: log inner uploadError before
throwing the wrapper ServiceValidationError so the underlying error
context isn't lost
644147b to
bdd5a5a
Compare
- runAttemptSystem.ts: remove unreachable `if (packetError)` block.
After `if (!packet) { return; }`, packet is truthy, which means
packetError is null per tryCatch's contract. Pre-existing dead code.
- platform.v3.server.ts: fix stale comment that referenced logger.warn —
the helper only counts a metric (no per-call logging by design).
Summary
Several boundary catches and customer-input validation paths were logging at
errorlevel for failures the system already handles gracefully — disconnect on auth failure, return undefined, skip retries, etc. This batch routes them towarn(which stays in stdout) or counts them as OTel metrics, so visibility is preserved without surfacing them as alerts.Changes
New helper / pattern:
apiBuilder.server.ts—logBoundaryError(message, error, url)inspects the inner error type at loader/action boundary catches; downgrades towarnforAbortError,ServiceValidationError, andEngineServiceValidationError.platform.v3.server.ts—platform_client.failures_totalOTel counter with{function, kind}labels; helperrecordPlatformFailure(fn, kind)replaces the previous error-level logging across allBillingClientwrappers.Log-level downgrades:
handleSocketIo.server.ts—Worker authentication failed→ warn (system disconnects on failure; refs TRI-8863)waitpointSystem.ts— whenrunStatus === "CANCELED"in the suspended-without-checkpoint branch, skip the throw and warn instead (benign cancel-vs-resume race, nothing to resume)runAttemptSystem.ts—flushedMetadataparse/validate failures → warn (customer-side data shape, system returns gracefully)batch-queue/index.ts— final-attempt failures withresult.skipRetries→ warn (callbacks already opted out of retry, e.g. queue size limit hit)queryPerformanceMonitor.server.ts— slow queries → warn (observability signal, not an application error)timeoutDeployment.server.ts— deployment-state mismatch in the timeout job → warn (timeout-vs-completion race)Inner error preservation:
waitpointCompletionPacket.server.ts—logger.error(uploadError)before throwing theServiceValidationErrorwrapper, so the underlying upload error stays visible.Why
The pattern across all of these is the same: a boundary log treated any thrown/returned error as
errorregardless of cause, even when the cause was an expected, system-handled condition (client disconnect, customer quota, race condition, schema validation of customer data). That made the logs noisy and made it harder to spot real bugs.Where the underlying signal is still useful operationally (slow queries, billing call failures), we route it to OTel metrics with low-cardinality labels so dashboards and alerts can be tuned independently of error logs.
Test plan
pnpm run typecheck --filter webapppnpm run build --filter @internal/run-enginewaitpointSystem.tsreturns{status: "skipped"}instead of throwingplatform_client.failures_totalcounter shows up in metrics with{function, kind}labels when the billing client errors