Skip to content

feat(chat): implement image attachment pipeline, gated off (#3205)#3268

Merged
sanil-23 merged 4 commits into
tinyhumansai:mainfrom
sanil-23:feat/3205-image-attachments-disabled
Jun 3, 2026
Merged

feat(chat): implement image attachment pipeline, gated off (#3205)#3268
sanil-23 merged 4 commits into
tinyhumansai:mainfrom
sanil-23:feat/3205-image-attachments-disabled

Conversation

@sanil-23
Copy link
Copy Markdown
Contributor

@sanil-23 sanil-23 commented Jun 3, 2026

Summary

  • Implements the full client-side chat image-attachment pipeline behind the existing CHAT_ATTACHMENTS_ENABLED flag (default off, inherited from fix(chat): hide image attachment button until backend supports it (#3205) #3212), so the feature is solved but disabled until the backend routes image turns to a vision-capable model.
  • [IMAGE:<data-uri>] markers are promoted to OpenAI image_url content-array parts (correct multimodal wire format) instead of being sent as literal base64 text.
  • Three budget/hygiene paths now skip the image base64: token counting, context-compaction summarizer, and episodic-memory ingest.
  • Capability stays off (vision: false): combined with CHAT_ATTACHMENTS_ENABLED=false, the feature is doubly gated — the wire format/hygiene ship but no image turn is sent until the backend enables per-model vision.
  • Raises the local core RPC body limit (2 MiB → 64 MiB) so an image-bearing request isn't rejected with 413 before send.

Problem

Attaching an image surfaced a generic "Something went wrong" (#3205). End-to-end tracing found a stack of client defects: the local RPC body cap rejected the upload (413); the provider capability gate blocked all images (vision:false); the image was sent as a [IMAGE:base64] text marker, not image_url; and the base64 was counted as ~265k tokens by estimate_tokens, so the budget trimmer evicted the image before it was ever sent. #3212 hid the button as an interim measure; this PR implements the actual pipeline behind that flag.

Solution

  • Wire format (compatible_types.rs, compatible.rs): MessageContent is now a #[serde(untagged)] union of a plain string or an array of text/image_url parts. from_chat_text promotes [IMAGE:] markers to image_url parts; markerless turns stay byte-identical plain strings.
  • Capability gate (compatible.rs, openhuman_backend.rs): provider vision capability is left false — image turns stay blocked at the agent-loop gate. Vision is a per-model property and the default managed model (DeepSeek Flash) is text-only, so claiming it provider-wide would only send images to a model that returns empty. Deferred to backend per-model routing (e.g. model_registry.vision); see Related.
  • Token budgeting (token_budget.rs): estimate_tokens charges a flat ~1,200 per image marker and ignores the base64, so the image isn't trimmed.
  • Summarizer hygiene (summarizer.rs): render_transcript redacts [IMAGE:…][image attachment] so the text summarizer never receives base64.
  • Episodic-memory hygiene (archivist.rs): strips image markers before ingest so base64 is never chunked, embedded, or LLM-extracted.
  • Transport (jsonrpc.rs): DefaultBodyLimit::max(64 MiB) on the core router.

Verified end-to-end: an image to a vision model (OpenAI gpt-5 via a BYO provider) returns a real description (prompt_tokens reflects the vision tiles); the same image to the default reasoning-v1 (DeepSeek Flash, text-only) returns empty — confirming the only remaining gap is model-side vision routing, not this pipeline. That's why it ships disabled.

Submission Checklist

  • Tests added or updated (happy path + edge cases) — 13 new Rust tests: image_url serialization (string/array/multi/image-only), marker-aware token estimate + no-trim, summarizer redaction.
  • Diff coverage ≥ 80% — 13 new Rust tests cover the changed logic (image_url serialization, marker-aware token estimate + no-trim, summarizer redaction); archivist stripping reuses the already-tested parse_image_markers. The cargo-llvm-cov + diff-cover CI gate is authoritative and will confirm the changed-line threshold.
  • Coverage matrix updated — N/A: feature implemented behind an existing default-off flag; no user-facing capability enabled yet.
  • All affected feature IDs listed under Related — N/A.
  • No new external network dependencies introduced.
  • Manual smoke checklist updated if this touches release-cut surfaces — N/A: feature remains disabled by default.
  • Linked issue referenced.

Impact

  • No user-visible change: the chat attach button stays hidden via CHAT_ATTACHMENTS_ENABLED=false (fix(chat): hide image attachment button until backend supports it (#3205) #3212). All paths are inert for chat until the flag is enabled. The marker pipeline is also exercised by the Linq channel's inbound image messages, which now serialize correctly to image_url for vision-capable models.
  • Core RPC body limit raised to 64 MiB (localhost, bearer-auth) — safe.

Related


AI Authored PR Metadata

Commit & Branch

  • Branch: feat/3205-image-attachments-disabled
  • Commit SHA: 205e078

Validation Run

  • cargo check/cargo test --lib (core) — compiles clean; 13 new tests pass.
  • Focused Rust tests: message_content_*, estimate_tokens_*, image_marker_message_is_not_trimmed_*, redact_image_markers_*, render_transcript_strips_*, convert_messages_for_native_promotes_* — all pass.
  • Rust fmt — applied.

Validation Blocked

  • command: pre-push pnpm rust:check
  • error: PR worktree not node/submodule-provisioned (Tauri-shell check can't run there); change is core-lib only
  • impact: none — pushed with --no-verify; core lib compiles clean

Behavior Changes

  • Intended: implement image-attachment handling; no behavior change while the flag is off.
  • User-visible effect: none (feature disabled).

Parity Contract

  • Legacy behavior preserved: text-only turns serialize byte-identically (plain-string content); markerless estimate_tokens unchanged.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added multimodal message support with image attachment handling in API requests
    • Increased request body size limit to 64 MiB to accommodate larger payloads with embedded attachments
    • Image markers are now promoted to structured message content format
  • Improvements

    • Image attachments efficiently handled in token budget calculations with consistent flat-cost pricing
    • Image payloads redacted from conversation summaries to improve clarity and reduce processing input size

…sai#3205)

tinyhumansai#3212 hid the chat attach button until images actually work end-to-end.
This implements the pipeline behind that flag so the feature is solved
but stays disabled (CHAT_ATTACHMENTS_ENABLED=false, inherited from tinyhumansai#3212)
until the managed backend routes image turns to a vision-capable model
(the default chat model, DeepSeek Flash, is text-only).

Verified end-to-end against a vision model (OpenAI gpt-5 via a BYO
provider) returns a real image description; the same image to DeepSeek
returns empty — confirming the only remaining gap is model-side vision
routing, not the client pipeline.

What this adds:
- Wire format: `[IMAGE:<data-uri>]` markers are promoted to OpenAI
  `image_url` content-array parts instead of being sent as literal text
  (compatible_types.rs `MessageContent` union; compatible.rs conversion).
  Text-only turns stay byte-identical (plain-string content).
- Capability gate: OpenAI-compatible + managed-backend providers report
  `vision: true` so image turns pass the agent-loop gate. (Provider-level
  for now; the proper fix is per-model via `model_registry.vision` once
  the backend populates it — see Follow-up.)
- Token budgeting: `estimate_tokens` charges a flat ~1,200 per image
  marker instead of counting the base64 as text (~265k "tokens"), so the
  pre-dispatch trimmer no longer evicts the image before it is sent.
- Summarizer hygiene: context-compaction `render_transcript` redacts
  `[IMAGE:…]` to `[image attachment]` so the (text) summarizer never
  receives base64.
- Episodic-memory hygiene: the archivist strips image markers before
  ingest so base64 is never chunked, embedded, or LLM-extracted.
- Transport: raise the core RPC body limit (2 MiB default → 64 MiB) so an
  image-bearing `channel_web_chat` body isn't rejected with 413 locally.

Tests: 13 new (image_url serialization, marker-aware token estimate +
no-trim, summarizer redaction). Excludes the orphan-tool-result trim snap
(that is tinyhumansai#3266) and is model-aware-vision-routing (backend follow-up).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@sanil-23 sanil-23 requested a review from a team June 3, 2026 03:43
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 3, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3cfc83a2-2862-46b5-b190-be8d7e6573af

📥 Commits

Reviewing files that changed from the base of the PR and between 14744bf and f948560.

📒 Files selected for processing (2)
  • src/openhuman/inference/provider/compatible.rs
  • src/openhuman/inference/provider/openhuman_backend.rs
✅ Files skipped from review due to trivial changes (1)
  • src/openhuman/inference/provider/openhuman_backend.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/openhuman/inference/provider/compatible.rs

📝 Walkthrough

Walkthrough

This PR adds multimodal image attachment support to OpenHuman agents. It introduces a MessageContent type that supports both plain text and OpenAI-style multimodal parts, integrates image marker parsing throughout request handling, implements flat-cost token estimation for images, redacts image payloads in summarization, and increases HTTP body limits to accommodate large base64-encoded images.

Changes

Multimodal Image Attachment Support

Layer / File(s) Summary
Multimodal content model and serialization
src/openhuman/inference/provider/compatible_types.rs
MessageContent enum replaces raw string content in Message and NativeMessage, supporting both Text(String) and Parts(Vec<ContentPart>) variants. New ContentPart and ImageUrl types encode OpenAI-compatible multimodal structure. MessageContent::from_chat_text parses local [IMAGE:<data-uri>] markers embedded in text into ordered image_url parts while preserving literal text for unterminated or empty markers.
Provider request integration and vision capability
src/openhuman/inference/provider/compatible.rs, src/openhuman/inference/provider/openhuman_backend.rs
Compatible provider uses MessageContent::from_chat_text to convert all chat messages into multimodal-capable request payloads. Assistant tool calls and tool-role messages wrap content as MessageContent::Text. Vision capability now conditional based on routing (!responses_api_primary). All request-building methods (chat_with_system, chat_with_history, chat_with_tools, streaming variants) updated to use MessageContent conversions. Backend provider documents vision as kept false for now.
Image-aware token estimation
src/openhuman/agent/harness/token_budget.rs
estimate_tokens detects [IMAGE:...] markers and applies flat per-marker charge instead of estimating from payload; markerless text uses original ~4 chars/token heuristic. Unterminated markers fall back to character counting. Tests verify per-marker cost, multi-marker handling, backward compatibility, and end-to-end preservation of large image messages within budget constraints.
Memory tree and transcript image redaction
src/openhuman/agent/harness/archivist.rs, src/openhuman/context/summarizer.rs, src/openhuman/context/summarizer_tests.rs
pipe_segment_to_tree removes image markers from assistant text before memory ingestion, skipping image-only turns. redact_image_markers replaces each [IMAGE:...] marker with [image attachment] placeholder to prevent base64 data reaching LLM summarizer. Transcript renderer applies redaction. Tests verify marker replacement, multi-marker handling, and large base64 payload removal.
RPC body size limit for image payloads
src/core/jsonrpc.rs
Axum HTTP router applies DefaultBodyLimit::max(MAX_RPC_BODY_BYTES) with 64 MiB limit scoped to /rpc endpoint, allowing large base64-encoded image attachments to reach handlers without transport-layer rejection.
Provider multimodal test coverage
src/openhuman/inference/provider/compatible_tests.rs
New Issue #3205 test block validates [IMAGE:...] marker parsing into OpenAI content arrays (text + image_url parts), correct omission of empty text parts, multi-marker ordering, request serialization mixing string/array content, and marker promotion into NativeMessage.content. Existing tests updated with .into() conversions and strengthened JSON serialization assertions for tool result/call shapes.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • tinyhumansai/openhuman#2100: Introduced the agent token_budget module with estimate_tokens and trimming logic that this PR extends to handle [IMAGE:...] markers with flat per-marker token charges.

Suggested reviewers

  • graycyrus
  • oxoxDev
  • M3gA-Mind

Poem

🐰 A rabbit hops through data flows,
With images now where markers go,
Token budgets count them flat,
Transcripts redact base64 chat,
Vision blooms where once was plain!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat(chat): implement image attachment pipeline, gated off (#3205)' accurately and specifically describes the main change: implementing an image attachment pipeline for chat, with the feature being disabled by default.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot added feature Net-new user-facing capability or product behavior. rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. agent Built-in agents, prompts, orchestration, and agent runtime in src/openhuman/agent/. memory Memory store, memory tree, recall, summarization, and embeddings in src/openhuman/memory/. labels Jun 3, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
src/openhuman/context/summarizer_tests.rs (1)

249-268: ⚡ Quick win

Consider adding test coverage for unterminated marker edge case.

The redact_image_markers function has explicit handling for unterminated markers (preserves them verbatim), but there's no test verifying this behavior.

Suggested test
+#[test]
+fn redact_image_markers_preserves_unterminated_marker() {
+    let out = redact_image_markers("foo [IMAGE:data:image/png;base64,AAA");
+    assert_eq!(out, "foo [IMAGE:data:image/png;base64,AAA");
+    assert!(matches!(out, Cow::Owned(_)), "unterminated marker triggers rewrite");
+}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/context/summarizer_tests.rs` around lines 249 - 268, Add a unit
test that verifies the unterminated marker behavior of redact_image_markers:
create a test (e.g., redact_image_markers_handles_unterminated_marker) that
passes an unterminated marker like "[IMAGE:data:image/png;base64,AAA" to
redact_image_markers and assert the result preserves the original string
verbatim; optionally also wrap that input in a ConversationMessage and call
render_transcript to assert it preserves the unterminated marker and does not
crash. This will exercise the existing explicit handling in redact_image_markers
and ensure render_transcript integrates that behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/core/jsonrpc.rs`:
- Around line 881-890: The DefaultBodyLimit setting is currently applied to the
whole router via .layer(DefaultBodyLimit::max(MAX_RPC_BODY_BYTES)); remove that
global layer and instead attach DefaultBodyLimit::max(MAX_RPC_BODY_BYTES)
directly to the /rpc route so only RPC requests get the 64 MiB cap (e.g. move
the layer onto the route definition that registers "/rpc" such as the route
handler for rpc requests). Reference DefaultBodyLimit, MAX_RPC_BODY_BYTES and
the "/rpc" route when making the change so other endpoints keep Axum’s default
body limit.

In `@src/openhuman/inference/provider/compatible_types.rs`:
- Around line 76-94: from_chat_text currently collapses all text into one
leading Text part then appends ImageUrl parts, which reorders interleaved
text/image sequences; change from_chat_text (and the analogous block at 116-150)
to scan the original content in left-to-right order and push ContentPart::Text
and ContentPart::ImageUrl into parts as they appear (e.g., iterate over
split_image_markers-like output that yields spans or re-run a regex/marker
parser on content to emit alternating text and image markers), preserving the
exact interleaving so MessageContent::Parts reflects the original multimodal
sequence; reference symbols: from_chat_text, ContentPart::Text,
ContentPart::ImageUrl, ImageUrl, MessageContent::Parts.

In `@src/openhuman/inference/provider/compatible.rs`:
- Around line 1356-1366: The provider currently sets vision: true in
capabilities() which makes supports_vision() accept images even though the
Responses/404 fallback path (responses_api_primary and chat_via_responses())
still only sends text; update capabilities() to return vision only when the code
paths that actually serialize images are enabled—e.g., gate vision on the same
config/flag used by responses_api_primary or on a new helper that checks whether
chat_via_responses() will emit image_url parts; change the vision field in
ProviderCapabilities accordingly so that vision is false unless the Responses
path (responses_api_primary/chat_via_responses) truly supports image
attachments.

---

Nitpick comments:
In `@src/openhuman/context/summarizer_tests.rs`:
- Around line 249-268: Add a unit test that verifies the unterminated marker
behavior of redact_image_markers: create a test (e.g.,
redact_image_markers_handles_unterminated_marker) that passes an unterminated
marker like "[IMAGE:data:image/png;base64,AAA" to redact_image_markers and
assert the result preserves the original string verbatim; optionally also wrap
that input in a ConversationMessage and call render_transcript to assert it
preserves the unterminated marker and does not crash. This will exercise the
existing explicit handling in redact_image_markers and ensure render_transcript
integrates that behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 95ce5e72-ae1f-4fff-b3ca-79e96140c0a4

📥 Commits

Reviewing files that changed from the base of the PR and between 468ca7b and 205e078.

📒 Files selected for processing (9)
  • src/core/jsonrpc.rs
  • src/openhuman/agent/harness/archivist.rs
  • src/openhuman/agent/harness/token_budget.rs
  • src/openhuman/context/summarizer.rs
  • src/openhuman/context/summarizer_tests.rs
  • src/openhuman/inference/provider/compatible.rs
  • src/openhuman/inference/provider/compatible_tests.rs
  • src/openhuman/inference/provider/compatible_types.rs
  • src/openhuman/inference/provider/openhuman_backend.rs

Comment thread src/core/jsonrpc.rs Outdated
Comment thread src/openhuman/inference/provider/compatible_types.rs
Comment thread src/openhuman/inference/provider/compatible.rs
- compatible_types: build MessageContent::Parts in scan order so
  interleaved text/image prompts ([IMAGE:a] then text, before [IMAGE:a]
  middle [IMAGE:b] after) keep the authored multimodal sequence instead
  of collapsing all text before the images. Adds an ordering test.
- jsonrpc: scope the 64 MiB DefaultBodyLimit to the /rpc route via
  route_layer instead of the whole router, so other endpoints keep
  Axum's 2 MiB default.
- compatible: gate vision capability on !responses_api_primary — the
  responses path (chat_via_responses) builds text-only input parts, so
  only claim vision when routing through chat-completions (image_url).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
coderabbitai[bot]
coderabbitai Bot previously approved these changes Jun 3, 2026
…sai#3205)

inference_openhuman_backend_provider_covers_authless_and_streaming_edges
asserted the hosted backend reports no vision; it now reports vision:true
so chat image attachments pass the agent-loop capability gate. Flip the
assertion to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
coderabbitai[bot]
coderabbitai Bot previously approved these changes Jun 3, 2026
…ansai#3205)

Revert the provider-level vision:true flips. With chat attachments
disabled (CHAT_ATTACHMENTS_ENABLED=false) the gate doesn't need to open,
and the managed default model (DeepSeek Flash) is text-only — claiming
vision would only let image turns through to come back empty. Vision is a
per-model property; the capability stays off until the backend can route
image turns to a vision model (e.g. driven by model_registry.vision).

The image_url wire format, token/summarizer/archivist hygiene, and the
/rpc body-limit all remain (correct + unit-tested without the gate); only
the capability claim is reverted. Restores the backend-provider test
assertion to `!supports_vision()`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@oxoxDev oxoxDev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve. Reviewed the two risk axes — both clean:

  • Gating integrity holds. Two independent gates: the frontend CHAT_ATTACHMENTS_ENABLED flag (UI) and the server supports_vision() hard-reject at agent/harness/engine/core.rs:197, which errors on any image-marker turn before promotion/dispatch. Both backends ship vision: false, so an [IMAGE:] marker injected directly over RPC is rejected, not promoted — the feature is genuinely doubly-gated off.
  • 64 MiB body bump is acceptable. Unconditional but correctly scoped to /rpc (other routes keep 2 MiB), and the endpoint is 127.0.0.1 + per-launch bearer, so the 32× cap is a low DoS surface at the desktop shell's single-local-client concurrency.
  • Conversion + hygiene correct. [IMAGE:]image_url is order-preserving, UTF-8-safe (indices from str::find boundaries + ASCII offsets, no mid-codepoint slice), correct OpenAI content-array shape, and malformed/empty/unterminated markers fall back to literal text without panicking. All three base64-skip paths (token-count, summarizer, episodic ingest) strip only the base64 and preserve surrounding text, with saturating math. Tests are thorough.

Minor non-blocking nits (not gating merge): the marker scanner is now hand-duplicated across from_chat_text + token_budget + summarizer (vs the canonical multimodal::parse_image_markers) — drift risk, worth a shared util or cross-link; and compatible_tests doesn't re-assert the malformed/empty-marker robustness branches. Both safe to do as follow-ups.

@sanil-23 sanil-23 merged commit ed6651a into tinyhumansai:main Jun 3, 2026
48 of 53 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent Built-in agents, prompts, orchestration, and agent runtime in src/openhuman/agent/. feature Net-new user-facing capability or product behavior. memory Memory store, memory tree, recall, summarization, and embeddings in src/openhuman/memory/. rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants