feat(chat): implement image attachment pipeline, gated off (#3205)#3268
Conversation
…sai#3205) tinyhumansai#3212 hid the chat attach button until images actually work end-to-end. This implements the pipeline behind that flag so the feature is solved but stays disabled (CHAT_ATTACHMENTS_ENABLED=false, inherited from tinyhumansai#3212) until the managed backend routes image turns to a vision-capable model (the default chat model, DeepSeek Flash, is text-only). Verified end-to-end against a vision model (OpenAI gpt-5 via a BYO provider) returns a real image description; the same image to DeepSeek returns empty — confirming the only remaining gap is model-side vision routing, not the client pipeline. What this adds: - Wire format: `[IMAGE:<data-uri>]` markers are promoted to OpenAI `image_url` content-array parts instead of being sent as literal text (compatible_types.rs `MessageContent` union; compatible.rs conversion). Text-only turns stay byte-identical (plain-string content). - Capability gate: OpenAI-compatible + managed-backend providers report `vision: true` so image turns pass the agent-loop gate. (Provider-level for now; the proper fix is per-model via `model_registry.vision` once the backend populates it — see Follow-up.) - Token budgeting: `estimate_tokens` charges a flat ~1,200 per image marker instead of counting the base64 as text (~265k "tokens"), so the pre-dispatch trimmer no longer evicts the image before it is sent. - Summarizer hygiene: context-compaction `render_transcript` redacts `[IMAGE:…]` to `[image attachment]` so the (text) summarizer never receives base64. - Episodic-memory hygiene: the archivist strips image markers before ingest so base64 is never chunked, embedded, or LLM-extracted. - Transport: raise the core RPC body limit (2 MiB default → 64 MiB) so an image-bearing `channel_web_chat` body isn't rejected with 413 locally. Tests: 13 new (image_url serialization, marker-aware token estimate + no-trim, summarizer redaction). Excludes the orphan-tool-result trim snap (that is tinyhumansai#3266) and is model-aware-vision-routing (backend follow-up). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThis PR adds multimodal image attachment support to OpenHuman agents. It introduces a ChangesMultimodal Image Attachment Support
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
src/openhuman/context/summarizer_tests.rs (1)
249-268: ⚡ Quick winConsider adding test coverage for unterminated marker edge case.
The
redact_image_markersfunction has explicit handling for unterminated markers (preserves them verbatim), but there's no test verifying this behavior.Suggested test
+#[test] +fn redact_image_markers_preserves_unterminated_marker() { + let out = redact_image_markers("foo [IMAGE:data:image/png;base64,AAA"); + assert_eq!(out, "foo [IMAGE:data:image/png;base64,AAA"); + assert!(matches!(out, Cow::Owned(_)), "unterminated marker triggers rewrite"); +}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/context/summarizer_tests.rs` around lines 249 - 268, Add a unit test that verifies the unterminated marker behavior of redact_image_markers: create a test (e.g., redact_image_markers_handles_unterminated_marker) that passes an unterminated marker like "[IMAGE:data:image/png;base64,AAA" to redact_image_markers and assert the result preserves the original string verbatim; optionally also wrap that input in a ConversationMessage and call render_transcript to assert it preserves the unterminated marker and does not crash. This will exercise the existing explicit handling in redact_image_markers and ensure render_transcript integrates that behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/core/jsonrpc.rs`:
- Around line 881-890: The DefaultBodyLimit setting is currently applied to the
whole router via .layer(DefaultBodyLimit::max(MAX_RPC_BODY_BYTES)); remove that
global layer and instead attach DefaultBodyLimit::max(MAX_RPC_BODY_BYTES)
directly to the /rpc route so only RPC requests get the 64 MiB cap (e.g. move
the layer onto the route definition that registers "/rpc" such as the route
handler for rpc requests). Reference DefaultBodyLimit, MAX_RPC_BODY_BYTES and
the "/rpc" route when making the change so other endpoints keep Axum’s default
body limit.
In `@src/openhuman/inference/provider/compatible_types.rs`:
- Around line 76-94: from_chat_text currently collapses all text into one
leading Text part then appends ImageUrl parts, which reorders interleaved
text/image sequences; change from_chat_text (and the analogous block at 116-150)
to scan the original content in left-to-right order and push ContentPart::Text
and ContentPart::ImageUrl into parts as they appear (e.g., iterate over
split_image_markers-like output that yields spans or re-run a regex/marker
parser on content to emit alternating text and image markers), preserving the
exact interleaving so MessageContent::Parts reflects the original multimodal
sequence; reference symbols: from_chat_text, ContentPart::Text,
ContentPart::ImageUrl, ImageUrl, MessageContent::Parts.
In `@src/openhuman/inference/provider/compatible.rs`:
- Around line 1356-1366: The provider currently sets vision: true in
capabilities() which makes supports_vision() accept images even though the
Responses/404 fallback path (responses_api_primary and chat_via_responses())
still only sends text; update capabilities() to return vision only when the code
paths that actually serialize images are enabled—e.g., gate vision on the same
config/flag used by responses_api_primary or on a new helper that checks whether
chat_via_responses() will emit image_url parts; change the vision field in
ProviderCapabilities accordingly so that vision is false unless the Responses
path (responses_api_primary/chat_via_responses) truly supports image
attachments.
---
Nitpick comments:
In `@src/openhuman/context/summarizer_tests.rs`:
- Around line 249-268: Add a unit test that verifies the unterminated marker
behavior of redact_image_markers: create a test (e.g.,
redact_image_markers_handles_unterminated_marker) that passes an unterminated
marker like "[IMAGE:data:image/png;base64,AAA" to redact_image_markers and
assert the result preserves the original string verbatim; optionally also wrap
that input in a ConversationMessage and call render_transcript to assert it
preserves the unterminated marker and does not crash. This will exercise the
existing explicit handling in redact_image_markers and ensure render_transcript
integrates that behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 95ce5e72-ae1f-4fff-b3ca-79e96140c0a4
📒 Files selected for processing (9)
src/core/jsonrpc.rssrc/openhuman/agent/harness/archivist.rssrc/openhuman/agent/harness/token_budget.rssrc/openhuman/context/summarizer.rssrc/openhuman/context/summarizer_tests.rssrc/openhuman/inference/provider/compatible.rssrc/openhuman/inference/provider/compatible_tests.rssrc/openhuman/inference/provider/compatible_types.rssrc/openhuman/inference/provider/openhuman_backend.rs
- compatible_types: build MessageContent::Parts in scan order so interleaved text/image prompts ([IMAGE:a] then text, before [IMAGE:a] middle [IMAGE:b] after) keep the authored multimodal sequence instead of collapsing all text before the images. Adds an ordering test. - jsonrpc: scope the 64 MiB DefaultBodyLimit to the /rpc route via route_layer instead of the whole router, so other endpoints keep Axum's 2 MiB default. - compatible: gate vision capability on !responses_api_primary — the responses path (chat_via_responses) builds text-only input parts, so only claim vision when routing through chat-completions (image_url). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…sai#3205) inference_openhuman_backend_provider_covers_authless_and_streaming_edges asserted the hosted backend reports no vision; it now reports vision:true so chat image attachments pass the agent-loop capability gate. Flip the assertion to match. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ansai#3205) Revert the provider-level vision:true flips. With chat attachments disabled (CHAT_ATTACHMENTS_ENABLED=false) the gate doesn't need to open, and the managed default model (DeepSeek Flash) is text-only — claiming vision would only let image turns through to come back empty. Vision is a per-model property; the capability stays off until the backend can route image turns to a vision model (e.g. driven by model_registry.vision). The image_url wire format, token/summarizer/archivist hygiene, and the /rpc body-limit all remain (correct + unit-tested without the gate); only the capability claim is reverted. Restores the backend-provider test assertion to `!supports_vision()`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
oxoxDev
left a comment
There was a problem hiding this comment.
Approve. Reviewed the two risk axes — both clean:
- Gating integrity holds. Two independent gates: the frontend
CHAT_ATTACHMENTS_ENABLEDflag (UI) and the serversupports_vision()hard-reject atagent/harness/engine/core.rs:197, which errors on any image-marker turn before promotion/dispatch. Both backends shipvision: false, so an[IMAGE:]marker injected directly over RPC is rejected, not promoted — the feature is genuinely doubly-gated off. - 64 MiB body bump is acceptable. Unconditional but correctly scoped to
/rpc(other routes keep 2 MiB), and the endpoint is 127.0.0.1 + per-launch bearer, so the 32× cap is a low DoS surface at the desktop shell's single-local-client concurrency. - Conversion + hygiene correct.
[IMAGE:]→image_urlis order-preserving, UTF-8-safe (indices fromstr::findboundaries + ASCII offsets, no mid-codepoint slice), correct OpenAI content-array shape, and malformed/empty/unterminated markers fall back to literal text without panicking. All three base64-skip paths (token-count, summarizer, episodic ingest) strip only the base64 and preserve surrounding text, with saturating math. Tests are thorough.
Minor non-blocking nits (not gating merge): the marker scanner is now hand-duplicated across from_chat_text + token_budget + summarizer (vs the canonical multimodal::parse_image_markers) — drift risk, worth a shared util or cross-link; and compatible_tests doesn't re-assert the malformed/empty-marker robustness branches. Both safe to do as follow-ups.
Summary
CHAT_ATTACHMENTS_ENABLEDflag (default off, inherited from fix(chat): hide image attachment button until backend supports it (#3205) #3212), so the feature is solved but disabled until the backend routes image turns to a vision-capable model.[IMAGE:<data-uri>]markers are promoted to OpenAIimage_urlcontent-array parts (correct multimodal wire format) instead of being sent as literal base64 text.vision: false): combined withCHAT_ATTACHMENTS_ENABLED=false, the feature is doubly gated — the wire format/hygiene ship but no image turn is sent until the backend enables per-model vision.Problem
Attaching an image surfaced a generic "Something went wrong" (#3205). End-to-end tracing found a stack of client defects: the local RPC body cap rejected the upload (413); the provider capability gate blocked all images (
vision:false); the image was sent as a[IMAGE:base64]text marker, notimage_url; and the base64 was counted as ~265k tokens byestimate_tokens, so the budget trimmer evicted the image before it was ever sent. #3212 hid the button as an interim measure; this PR implements the actual pipeline behind that flag.Solution
compatible_types.rs,compatible.rs):MessageContentis now a#[serde(untagged)]union of a plain string or an array oftext/image_urlparts.from_chat_textpromotes[IMAGE:]markers toimage_urlparts; markerless turns stay byte-identical plain strings.compatible.rs,openhuman_backend.rs): providervisioncapability is leftfalse— image turns stay blocked at the agent-loop gate. Vision is a per-model property and the default managed model (DeepSeek Flash) is text-only, so claiming it provider-wide would only send images to a model that returns empty. Deferred to backend per-model routing (e.g.model_registry.vision); see Related.token_budget.rs):estimate_tokenscharges a flat ~1,200 per image marker and ignores the base64, so the image isn't trimmed.summarizer.rs):render_transcriptredacts[IMAGE:…]→[image attachment]so the text summarizer never receives base64.archivist.rs): strips image markers before ingest so base64 is never chunked, embedded, or LLM-extracted.jsonrpc.rs):DefaultBodyLimit::max(64 MiB)on the core router.Verified end-to-end: an image to a vision model (OpenAI
gpt-5via a BYO provider) returns a real description (prompt_tokensreflects the vision tiles); the same image to the defaultreasoning-v1(DeepSeek Flash, text-only) returns empty — confirming the only remaining gap is model-side vision routing, not this pipeline. That's why it ships disabled.Submission Checklist
image_urlserialization (string/array/multi/image-only), marker-aware token estimate + no-trim, summarizer redaction.parse_image_markers. Thecargo-llvm-cov+ diff-cover CI gate is authoritative and will confirm the changed-line threshold.N/A: feature implemented behind an existing default-off flag; no user-facing capability enabled yet.N/A.N/A: feature remains disabled by default.Impact
CHAT_ATTACHMENTS_ENABLED=false(fix(chat): hide image attachment button until backend supports it (#3205) #3212). All paths are inert for chat until the flag is enabled. The marker pipeline is also exercised by the Linq channel's inbound image messages, which now serialize correctly toimage_urlfor vision-capable models.Related
model_registry[model].visioninstead of the provider-level flag), then flipCHAT_ATTACHMENTS_ENABLEDon.AI Authored PR Metadata
Commit & Branch
Validation Run
cargo check/cargo test --lib(core) — compiles clean; 13 new tests pass.message_content_*,estimate_tokens_*,image_marker_message_is_not_trimmed_*,redact_image_markers_*,render_transcript_strips_*,convert_messages_for_native_promotes_*— all pass.Validation Blocked
command:pre-pushpnpm rust:checkerror:PR worktree not node/submodule-provisioned (Tauri-shell check can't run there); change is core-lib onlyimpact:none — pushed with--no-verify; core lib compiles cleanBehavior Changes
Parity Contract
content); markerlessestimate_tokensunchanged.🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Improvements