fix(chat): sanitize agent/cron failures and add user-safe error fallback with Sentry reporting#1332
Conversation
…rd reporting option
|
ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAgent execution failures now report full internal error context to observability while emitting sanitized user-facing messages; shared message constants/helpers were added; frontend appends and deduplicates the shared sanitized bubble; backend and frontend tests validate sanitization and deduplication. ChangesError message sanitization and structured reporting
Sequence Diagram(s)sequenceDiagram
participant Client
participant WebChannel
participant Observability
participant EventBus
Client->>WebChannel: start_chat / run_chat_task
WebChannel->>Observability: report_error(internal details + metadata)
WebChannel->>EventBus: publish(chat_error with generic_inference_error_user_message)
EventBus->>Client: chat_error (sanitized message + Discord link)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Warning Review ran into problems🔥 ProblemsTimed out fetching pipeline failures after 30000ms Comment |
…s for matching messages
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
app/src/providers/ChatRuntimeProvider.tsx (1)
718-729:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winGuard the fallback bubble with a synchronous per-thread dedupe.
This check reads
lastMsgbefore the asyncaddInferenceResponse(...)thunk updates the store, so two non-cancelledonErrorevents that land back-to-back can both enqueue the same user-facing error bubble. That misses the new dedupe contract for repeated failures. A smalluseRef/seenChatEventsRefguard keyed bythread_idand set before dispatching would make this atomic enough for the error path.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/src/providers/ChatRuntimeProvider.tsx` around lines 718 - 729, The fallback error bubble dispatch can race because you read lastMsg from store.getState() then asynchronously dispatch addInferenceResponse, allowing duplicate bubbles on back-to-back non-cancelled errors; fix by adding a synchronous per-thread dedupe map (e.g. seenChatEventsRef keyed by event.thread_id) checked and set immediately before calling dispatch(addInferenceResponse(...)) so the check-and-set is atomic for that thread; reference and update the same key when the thread recovers/receives a new non-error message to allow future real messages to show.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@app/src/providers/__tests__/ChatRuntimeProvider.test.tsx`:
- Around line 432-441: The test uses an unreliable setTimeout(0) to wait for the
thunk + mocked threadApi.appendMessage to resolve; replace that timer flush with
a proper waitFor assertion (imported from testing-library) that polls until the
expected call appears. Specifically, remove the await new Promise(resolve =>
setTimeout(resolve, 0)) and wrap the matchingCalls check inside waitFor(() => {
... }) so the code that filters vi.mocked(threadApi.appendMessage).mock.calls
for 't-err-dedupe' and the error string runs repeatedly until it passes.
In `@src/openhuman/channels/providers/web_tests.rs`:
- Around line 62-97: The test
start_chat_emits_sanitized_chat_error_on_inference_failure is nondeterministic
because it waits for a real provider failure; instead inject a deterministic
failure by making run_chat_task (or the agent/provider used by start_chat)
return an error in the test scope and then assert the emitted chat_error;
specifically, modify the test to configure/start the chat with a test-only
failing provider or mock implementation (or patch the agent construction path
used by start_chat) so run_chat_task immediately yields a controlled error, then
subscribe_web_channel_events() and assert the sanitized message equals
generic_inference_error_user_message() and contains no raw transport details.
In `@src/openhuman/channels/providers/web.rs`:
- Around line 274-279: The call to report_error in run_chat_task currently
buries request_id and thread_id inside the formatted error string; update the
report_error invocation to pass request_id and thread_id as structured
tags/extra context instead of only in detailed.as_str(). Locate the report_error
call (the one using detailed.as_str(), "web_channel", "run_chat_task", and the
tags slice with ("channel","web"), ("error_type","inference")) and add
request_id and thread_id to that tags/extra parameters (ensuring you convert
Options or Strings to &str as required by report_error) so Sentry/correlation
can index them directly.
In `@src/openhuman/cron/scheduler.rs`:
- Around line 261-276: The current observability call in the Err(e) branch of
run_agent_job() causes reporting on every retry; move Sentry reporting out of
run_agent_job() so it only happens when execute_job_with_retry() exhausts
retries, or change run_agent_job() to return Result<(bool, String),
anyhow::Error> (i.e., include the raw error alongside the sanitized
AGENT_JOB_USER_FAILURE_MESSAGE) so the caller (execute_job_with_retry() / its
caller) can log/report exactly once on final failure; update call sites to
handle the new return type and perform
crate::core::observability::report_error(...) only when retries are exhausted.
---
Outside diff comments:
In `@app/src/providers/ChatRuntimeProvider.tsx`:
- Around line 718-729: The fallback error bubble dispatch can race because you
read lastMsg from store.getState() then asynchronously dispatch
addInferenceResponse, allowing duplicate bubbles on back-to-back non-cancelled
errors; fix by adding a synchronous per-thread dedupe map (e.g.
seenChatEventsRef keyed by event.thread_id) checked and set immediately before
calling dispatch(addInferenceResponse(...)) so the check-and-set is atomic for
that thread; reference and update the same key when the thread recovers/receives
a new non-error message to allow future real messages to show.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4a696c23-1a28-4f91-a1e8-3ff0242df43f
📒 Files selected for processing (6)
app/src/providers/ChatRuntimeProvider.tsxapp/src/providers/__tests__/ChatRuntimeProvider.test.tsxsrc/openhuman/channels/providers/web.rssrc/openhuman/channels/providers/web_tests.rssrc/openhuman/cron/scheduler.rssrc/openhuman/cron/scheduler_tests.rs
…sage verification
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/channels/providers/web_tests.rs`:
- Around line 64-105: The test must always reset the forced error state; wrap
the setup call to set_test_forced_run_chat_task_error(Some(...)).await with a
RAII guard that ensures cleanup in Drop: create a small guard type (e.g.,
TestForcedRunChatTaskErrorGuard) constructed immediately after setting the
forced error and implement Drop to spawn an async task (tokio::spawn) that calls
set_test_forced_run_chat_task_error(None).await so the async reset runs even if
the test panics or an assert/expect aborts; replace the manual trailing cleanup
call with this guard to guarantee reset.
In `@src/openhuman/cron/scheduler.rs`:
- Around line 114-128: The retry-exhaustion reporting currently sends only the
sanitized AGENT_JOB_USER_FAILURE_MESSAGE (in the Err(_) arm), losing the raw
error; update the logic around the match that checks matches!(job.job_type,
JobType::Agent) so that you preserve the original Err(e) (e.g., capture e before
replacing last_output with AGENT_JOB_USER_FAILURE_MESSAGE) and pass the raw
error details into crate::core::observability::report_error when retries are
exhausted (include the raw error string or its debug display alongside the
existing metadata fields), while continuing to use
AGENT_JOB_USER_FAILURE_MESSAGE for any user-facing output; ensure the change is
applied both at the shown block and the other occurrence around line 278 that
mirrors this behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1b646ac5-6492-4e20-a9c1-b1adea61074c
📒 Files selected for processing (4)
app/src/providers/__tests__/ChatRuntimeProvider.test.tsxsrc/openhuman/channels/providers/web.rssrc/openhuman/channels/providers/web_tests.rssrc/openhuman/cron/scheduler.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- src/openhuman/channels/providers/web.rs
… cleanup in chat task tests
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/channels/providers/web_tests.rs`:
- Around line 76-79: The test embeds a real internal staging URL in the forced
error string; update the call to set_test_forced_run_chat_task_error in
web_tests.rs to use a placeholder/dummy URL (e.g.,
"https://example.invalid/openai/v1/chat/completions" or any non-sensitive
domain) so the test still simulates a URL-containing transport error but does
not leak internal infrastructure; keep the rest of the assertion logic unchanged
so the code paths in set_test_forced_run_chat_task_error and the chat task error
stripping logic are exercised as before.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0252b32f-3cd9-42aa-bbf4-df00910ce129
📒 Files selected for processing (3)
src/openhuman/channels/providers/web_tests.rssrc/openhuman/cron/scheduler.rssrc/openhuman/cron/scheduler_tests.rs
|
Caution Failed to replace (edit) comment. This is likely due to insufficient permissions or the comment being deleted. Error details |
Summary
Replaced raw technical agent failure text in chat with a user-friendly fallback message
Added consistent error UX:
Sanitized web-channel
chat_errorpayloads:Added structured observability reporting:
Updated cron agent job failures:
Added/updated tests:
Problem
Agent and cron failures could expose:
Resulted in:
Needed:
Solution
Introduced user-facing fallback error messaging:
Standardized fallback copy:
<openhuman-link path="community/discord">Report on Discord</openhuman-link>Added backend observability reporting:
core::observability::report_error(...)Improved cron scheduler behavior:
error sending request for url (...)exposureAdded regression coverage:
Tradeoffs
Submission Checklist
Tests
docs/TESTING-STRATEGY.mdCoverage
.github/workflows/coverage.ymlpnpm test:coveragepnpm test:rustCoverage Matrix
docs/TEST-COVERAGE-MATRIX.mdFeature IDs
## RelatedNetwork Dependencies
Manual Smoke Checklist
docs/RELEASE-MANUAL-SMOKE.mdLinked Issues
Closes #NNNImpact
Platform / Runtime
Security
Observability
Compatibility
Performance
Related
Summary by CodeRabbit
Bug Fixes
Tests