Skip to content

fix(summarize): chunk large sessions to fit LLM context window#472

Merged
rohitg00 merged 2 commits into
rohitg00:mainfrom
efenex:fix/summarize-chunking
May 19, 2026
Merged

fix(summarize): chunk large sessions to fit LLM context window#472
rohitg00 merged 2 commits into
rohitg00:mainfrom
efenex:fix/summarize-chunking

Conversation

@efenex
Copy link
Copy Markdown
Contributor

@efenex efenex commented May 17, 2026

Summary

JSONL-imported sessions can have far more observations than the 500-cap MAX_OBS_PER_SESSION that constrains native sessions. mem::summarize previously built one prompt containing every observation and shipped it as a single LLM call, which exceeded the provider's context window for sessions >~7,000 observations and returned an unhelpful 400 from upstream - silently leaving large bulk-imported sessions out of the semantic tier.

This PR makes mem::summarize chunk + map-reduce when needed, with per-chunk retry, skip on persistent failure, and parallel batching.

Problem

  • Native flow caps sessions at MAX_OBS_PER_SESSION=500 via mem::observe's guard (src/functions/observe.ts:128).
  • JSONL import (mem::replay::import-jsonl in src/functions/replay.ts) writes directly to KV, bypassing the guard. Real-world imports produce sessions with 5k–100k+ observations.
  • mem::summarize then builds one prompt with every observation and ships it via provider.summarize(...). Sessions above ~7k obs exceed the provider's context window; the upstream returns a generic 400; the summary never lands; the session is invisible to consolidate-pipeline → semantic / reflect.

Approach

Map-reduce inside mem::summarize:

  • Sessions ≤ SUMMARIZE_CHUNK_SIZE (default 400) take the legacy single-call path with zero behavior change and zero overhead.
  • Larger sessions:
    1. Split into chunks of SUMMARIZE_CHUNK_SIZE
    2. Summarize each chunk with the existing per-session prompt, in parallel batches of SUMMARIZE_CHUNK_CONCURRENCY (default 6)
    3. Merge partials via a new REDUCE_SYSTEM / buildReducePrompt

Robustness

  • Per-chunk retry-once on transient parse failures or provider errors
  • Persistently-failing chunks are skipped (not propagated). A flaky chunk doesn't waste 30+ already-completed LLM calls
  • Bail with too_many_chunks_skipped only if >50% of chunks fail — guards against producing a misleadingly thin merged summary

Configuration

env var default purpose
SUMMARIZE_CHUNK_SIZE 400 obs per chunk; ≈ 50k tokens at ~110 tok/obs (fits 128k-window models)
SUMMARIZE_CHUNK_CONCURRENCY 6 parallel chunk LLM calls; high-RPM providers can allow 100+

Validation (live, against a real bulk-imported corpus)

session size chunks wallclock result
5,392 obs 14 39s quality 100, 5 decisions, 30 files
10,704 obs 27 34s quality 100, 6 decisions, 56 files
105,966 obs 265 (c=50) ~12 min persisted server-side
52-session bulk backfill ~25 min 25 new semantic facts + 6 new reflect insights in consolidate-pipeline

Known limit

The iii-engine has a hardcoded ~180s function-invocation timeout (in the Rust binary; not exposed in iii-config.yaml). Sessions large enough that chunked summarize wallclock exceeds that will see the HTTP client receive a timeout/500 response, even though the handler completes and persists server-side. Raising SUMMARIZE_CHUNK_CONCURRENCY to 50+ pushes the cliff past any realistic session size on a generous-RPM provider.

The true fix is an async-job pattern (return jobId, poll for result). Out of scope here.

Operator tool

scripts/backfill-imported-sessions.sh walks jsonl-import-tagged sessions and POSTs mem::summarize per session, with:

  • --project-pattern REGEX, --skip-agents, --tag NAME (filter)
  • --max-obs N, --limit N (scope)
  • --dry-run (preview cost, no POSTs)
  • --debug-on-error (dump failed-session metadata + obs + provider response for diagnosis)
  • Cost estimator (configurable per-1M rates) so operators see ~$ before committing

Test plan

  • npx vitest run test/summarize.test.ts - 9 cases
    • small session → single-call path preserved
    • large session → N chunk calls + 1 reduce
    • SUMMARIZE_CHUNK_SIZE env override respected
    • flaky chunk: parse-fail → retry → succeed → no skip
    • persistently-broken chunk: skipped, reduce runs on survivors
    • too many skips → too_many_chunks_skipped bailout
    • provider error after retry → skipped (not propagated)
    • chunks run in parallel batches per SUMMARIZE_CHUNK_CONCURRENCY
  • Live: multi-session bulk backfill on a real corpus

Files

  • src/prompts/summary.ts — add REDUCE_SYSTEM + buildReducePrompt
  • src/functions/summarize.ts — chunking, retry, skip, parallelism (single-call path unchanged)
  • test/summarize.test.ts — new
  • .env.example — document the two new env vars
  • .gitignore — ignore agentmemory-debug/ and data-*/ (operator artefacts)
  • scripts/backfill-imported-sessions.sh — new, companion operator tool

Summary by CodeRabbit

  • New Features
    • Chunked summarization for large sessions with parallel processing and partial-summary merging.
  • Documentation
    • Documented two optional environment variables to tune chunk size and concurrency for summarization.
  • Chores
    • Updated ignore patterns for data/debug directories.
    • Added a backfill utility for importing and summarizing past sessions (dry-run and debug modes).
  • Tests
    • Added comprehensive tests covering chunking, retries, skipping, and concurrency behavior.

Review Change Stack

JSONL-imported sessions can have far more observations than the
500-cap MAX_OBS_PER_SESSION that constrains native sessions.
mem::summarize previously built one prompt containing every
observation and shipped it as a single LLM call, which exceeded the
provider's context window for sessions >~7,000 observations and
returned an unhelpful 400 from upstream — silently leaving large
bulk-imported sessions out of the semantic tier.

Approach: map-reduce inside mem::summarize.
- Sessions ≤ SUMMARIZE_CHUNK_SIZE (default 400) take the legacy
  single-call path with no overhead
- Larger sessions are split into chunks, each summarized with the
  existing per-session prompt in parallel batches of
  SUMMARIZE_CHUNK_CONCURRENCY (default 6), and partial summaries
  merged via a new REDUCE_SYSTEM prompt
- Per-chunk retry-once on transient parse / provider errors
- Persistently-failing chunks are skipped (not propagated) so a
  flaky chunk doesn't waste 30+ already-completed LLM calls on
  the same session
- Bail with too_many_chunks_skipped only if >50% of chunks fail

Companion operator tool: scripts/backfill-imported-sessions.sh
walks jsonl-imported sessions and POSTs mem::summarize per session,
with project / agent / obs-count filters, cost estimation, and
per-failure payload dumping for debugging provider rejections.

Validated locally against a real corpus:
- 5,392-obs session (14 chunks, c=6): 39s
- 10,704-obs session (27 chunks, c=6): 34s
- 105,966-obs session (265 chunks, c=50): handler completes
  server-side and persists
- 52-session bulk backfill → 25 new semantic facts + 6 new reflect
  insights produced by consolidate-pipeline

Known limit: iii-engine has a hardcoded 180s function-invocation
timeout. Sessions large enough that chunked summarize wallclock
exceeds that will return a timeout/500 to the HTTP client even
though the handler completes and persists server-side. High-RPM
providers (Novita / DeepInfra / DeepSeek typically allow 100+
concurrent) can raise SUMMARIZE_CHUNK_CONCURRENCY to push the
cliff well past any realistic session size. True fix is an
async-job pattern; left as follow-up.

- src/prompts/summary.ts: add REDUCE_SYSTEM + buildReducePrompt
- src/functions/summarize.ts: chunking, retry, skip, parallelism
- test/summarize.test.ts: 9 cases covering single-call path,
  chunking, env-override, retry-then-success, persistent skip,
  too-many-skipped bail, provider error after retry, concurrency
- .env.example: document SUMMARIZE_CHUNK_SIZE / _CONCURRENCY
- .gitignore: agentmemory-debug/, data-*/  (operator artefacts)
- scripts/backfill-imported-sessions.sh: bulk-import backfill tool

9/9 new tests pass; existing tests untouched.
@vercel
Copy link
Copy Markdown

vercel Bot commented May 17, 2026

@efenex is attempting to deploy a commit to the rohitg00's projects Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 17, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ab92fbdc-db9a-41c9-9621-0198f735e546

📥 Commits

Reviewing files that changed from the base of the PR and between 99cec95 and 86c0622.

📒 Files selected for processing (2)
  • scripts/backfill-imported-sessions.sh
  • test/summarize.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • scripts/backfill-imported-sessions.sh
  • test/summarize.test.ts

📝 Walkthrough

Walkthrough

Adds map-reduce chunked summarization (configurable chunk size/concurrency, per-chunk retry/skip, reduce merge), integrates it into mem::summarize, adds comprehensive tests, a backfill script to process existing sessions, and documents env/gitignore updates.

Changes

Chunked summarization with map-reduce

Layer / File(s) Summary
Reduce prompt contract and builder
src/prompts/summary.ts
REDUCE_SYSTEM defines required XML output and merge rules; buildReducePrompt formats labeled partials including observation range metadata.
Map-reduce chunking implementation
src/functions/summarize.ts
Adds produceSummaryXml and per-chunk retry/skip logic, reads SUMMARIZE_CHUNK_SIZE/SUMMARIZE_CHUNK_CONCURRENCY, runs chunked summarize calls in parallel batches, enforces skip thresholds, and invokes reduce on surviving partials.
Integration into mem::summarize handler
src/functions/summarize.ts
Swaps single-call provider.summarize for produceSummaryXml invocation, destructures mode/chunks metadata, and updates logging for chunked responses.
Comprehensive chunking test coverage
test/summarize.test.ts
Vitest suite covering small-session single-call, large-session map-reduce merge, chunk size/concurrency overrides, retry/skip behaviors, skip-ratio failures, provider errors, all-chunks-fail, and concurrency limits.
Backfill utility for existing sessions
scripts/backfill-imported-sessions.sh
CLI script to find completed jsonl-import sessions, dry-run or POST /agentmemory/summarize per session, collect OK/SKIP/FAIL, optionally dump per-session debug JSON, and trigger consolidation.
Environment and ignore configuration
.env.example, .gitignore
Documents SUMMARIZE_CHUNK_SIZE and SUMMARIZE_CHUNK_CONCURRENCY and adds ignore patterns for data-*/ and agentmemory-debug/.

Sequence Diagram

sequenceDiagram
  participant Handler as mem::summarize handler
  participant Chunker as produceSummaryXml
  participant Provider as LLM provider
  participant Reduce as reduce phase
  
  Handler->>Chunker: compressed observations
  alt observations < chunk threshold
    Chunker->>Provider: single summary call
    Provider-->>Chunker: summary XML
  else observations >= chunk threshold
    Chunker->>Provider: chunk 1 summarize (with retry)
    Chunker->>Provider: chunk 2 summarize (parallel)
    Chunker->>Provider: chunk N summarize (parallel batches)
    Provider-->>Chunker: partial summaries (skip on failure)
    Chunker->>Reduce: partial summaries list
    Reduce->>Provider: reduce prompt (merge partials)
    Provider-->>Reduce: merged final summary
    Reduce-->>Chunker: final summary XML
  end
  Chunker-->>Handler: response + mode + chunks metadata
  Handler->>Handler: persist summary and metadata
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 Chunks of wisdom, batched in time,
Parallel hops in measured rhyme,
Retry once, skip the broken part,
Reduce the pieces to a single heart,
Summaries stitched — a coder's art.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 13.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'fix(summarize): chunk large sessions to fit LLM context window' accurately and concisely summarizes the main change: implementing chunking for large sessions in the summarize function to address context window overflow issues.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (1)
src/functions/summarize.ts (1)

168-179: 💤 Low value

Consider tracking original indices alongside partials to avoid O(n²) lookup.

Using partialByIdx.indexOf(p) to recover original indices is O(n) per partial, yielding O(n²) for large sessions with many chunks. For a session with ~250 chunks (100k observations at default chunk size), this becomes ~62k comparisons.

♻️ Suggested fix
-  const partialByIdx: Array<SessionSummary | null> = new Array(chunks.length).fill(null);
+  const partialByIdx: Array<{ summary: SessionSummary; idx: number } | null> = new Array(chunks.length).fill(null);
   for (let batchStart = 0; batchStart < chunks.length; batchStart += concurrency) {
     const batch = chunks.slice(batchStart, batchStart + concurrency);
     await Promise.all(
       batch.map(async (chunk, j) => {
         const idx = batchStart + j;
-        partialByIdx[idx] = await summarizeChunkWithRetry(
+        const summary = await summarizeChunkWithRetry(
           provider,
           chunk,
           sessionId,
           project,
           idx,
           chunks.length,
         );
+        partialByIdx[idx] = summary ? { summary, idx } : null;
       }),
     );
   }

-  const skipped = partialByIdx.filter((p) => p === null).length;
-  const partials = partialByIdx.filter((p): p is SessionSummary => p !== null);
+  const skipped = partialByIdx.filter((p) => p === null).length;
+  const partialsWithIdx = partialByIdx.filter((p): p is { summary: SessionSummary; idx: number } => p !== null);

   // ... skip ratio check unchanged ...

-  const reduceInput = partials.map((p) => {
-    const originalIdx = partialByIdx.indexOf(p);
+  const reduceInput = partialsWithIdx.map(({ summary: p, idx: originalIdx }) => {
     return {
       title: p.title,
       // ... rest unchanged
     };
   });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/functions/summarize.ts` around lines 168 - 179, The current reduceInput
builds obsRangeStart/End by calling partialByIdx.indexOf(p), which is O(n²);
instead, preserve each chunk's original index when you create or transform
partials and use that stored index here—e.g., ensure each partial object
includes an originalIdx (or build partials as tuples [partial, originalIdx]) and
then compute obsRangeStart as originalIdx * chunkSize + 1 and obsRangeEnd with
Math.min((originalIdx + 1) * chunkSize, compressed.length) using that stored
originalIdx; update the code paths that construct partials so they attach
originalIdx rather than relying on partialByIdx.indexOf in reduceInput.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/backfill-imported-sessions.sh`:
- Line 151: The script currently constructs local file paths and URLs using
sessionId directly (local file="$DEBUG_DIR/${id}.json" and the URL query param
build), which allows path traversal and breaks queries for special characters;
update the code that sets local file and any URL that uses sessionId to (1)
sanitize or slugify/hex-encode the sessionId before joining with DEBUG_DIR
(e.g., remove/replace path separators and dots or use a safe basename/hex
function) so the resulting filename cannot escape DEBUG_DIR, and (2) URL-encode
the sessionId when building query strings so special characters are
percent-encoded; apply these changes to the variables/expressions that define
local file and the URL/query construction that reference sessionId/id.
- Around line 71-74: The curl calls that check endpoints (e.g., the one hitting
"$URL/agentmemory/livez") must include connection and total timeouts to avoid
indefinite hangs; update each curl invocation in this script (the occurrences
around the livez check and the other spots noted) to add options like
--connect-timeout <seconds> and --max-time <seconds> (or -m) with sensible
defaults (e.g., 5s connect, 15-30s total) so the script fails fast on network
issues and prints the existing error message/exit code as before.
- Around line 188-193: The script currently runs jq unconditionally on resp (the
curl result for POST to /agentmemory/summarize), which can cause set -e to abort
if resp is not valid JSON; fix by first validating resp is JSON (e.g., use jq -e
. <<<"$resp" >/dev/null 2>&1) and if that check fails, set status="false",
err="invalid_json" (or include raw resp), and title="" before attempting any jq
extraction; otherwise proceed to run the existing jq calls that populate status,
err, and title from resp.

In `@test/summarize.test.ts`:
- Around line 28-30: The vi.mock call currently points at "./audit.js" (creating
a test/audit.js mock) but should mock the real source module; update the vi.mock
invocation in test/summarize.test.ts to reference the actual module path used
elsewhere (../src/functions/audit.js) and keep the mock factory (safeAudit:
vi.fn()) intact so imports of safeAudit from the source module are correctly
stubbed.

---

Nitpick comments:
In `@src/functions/summarize.ts`:
- Around line 168-179: The current reduceInput builds obsRangeStart/End by
calling partialByIdx.indexOf(p), which is O(n²); instead, preserve each chunk's
original index when you create or transform partials and use that stored index
here—e.g., ensure each partial object includes an originalIdx (or build partials
as tuples [partial, originalIdx]) and then compute obsRangeStart as originalIdx
* chunkSize + 1 and obsRangeEnd with Math.min((originalIdx + 1) * chunkSize,
compressed.length) using that stored originalIdx; update the code paths that
construct partials so they attach originalIdx rather than relying on
partialByIdx.indexOf in reduceInput.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 55ab9501-0221-47f4-8e5c-402a33a83668

📥 Commits

Reviewing files that changed from the base of the PR and between 9061da5 and 99cec95.

📒 Files selected for processing (6)
  • .env.example
  • .gitignore
  • scripts/backfill-imported-sessions.sh
  • src/functions/summarize.ts
  • src/prompts/summary.ts
  • test/summarize.test.ts

Comment thread scripts/backfill-imported-sessions.sh Outdated
Comment thread scripts/backfill-imported-sessions.sh Outdated
Comment thread scripts/backfill-imported-sessions.sh Outdated
Comment thread test/summarize.test.ts Outdated
Four nits flagged by the automated reviewer, all worth fixing:

- scripts/backfill: add curl --connect-timeout + --max-time profiles
  (META_CURL_OPTS vs WORK_CURL_OPTS). Metadata reads fail fast and
  retry on transient blips; LLM-backed work calls get a wide 30-min
  cap and no retry (retrying a half-finished LLM job double-spends).
- scripts/backfill: sanitize sessionId before joining with DEBUG_DIR
  in dump_failure() (otherwise a session id containing `/` or `..`
  could escape the debug dir). UUIDs in practice, but the server
  doesn't enforce that.
- scripts/backfill: switch the observations query to
  `--get --data-urlencode "sessionId=$id"` so special characters
  can't corrupt the query string.
- scripts/backfill: guard `jq` on summarize + consolidate responses
  with `jq -e . </dev/null 2>&1` first. iii's HTTP layer occasionally
  returns non-JSON (HTML 5xx, empty body on timeout). Without the
  guard, `set -e` aborts the whole backfill loop on a single bad
  response — now it logs `invalid_json_response` and moves on.
- test/summarize.test.ts: fix `vi.mock("./audit.js", ...)` path to
  `"../src/functions/audit.js"`. The old path resolved to
  `test/audit.js` (nonexistent), so the mock was a silent no-op.
  Tests passed anyway because `safeAudit` writes to a mocked KV.

9/9 tests still pass; backfill dry-run still resolves the corpus
cleanly.
@efenex
Copy link
Copy Markdown
Contributor Author

efenex commented May 17, 2026

Thanks for the review — pushed 86c0622 addressing all four points:

  • curl timeouts: added META_CURL_OPTS (connect 10s / max 30s + retry) and WORK_CURL_OPTS (connect 10s / max 1800s, no retry) profiles. Split because retrying an in-flight LLM call doubles the cost and the server-side handler keeps running anyway.
  • debug filename sanitization: tr -c 'A-Za-z0-9._-' '_' on the session id before joining with DEBUG_DIR.
  • observations URL encoding: switched to curl --get --data-urlencode "sessionId=$id".
  • jq parse guard: wrapped summarize + consolidate response parsing in jq -e . </dev/null 2>&1 first; bad responses log `invalid_json_response` instead of aborting the loop.
  • test mock path: `vi.mock("./audit.js", ...)` → `"../src/functions/audit.js"`. The old path resolved to a nonexistent module, so the mock was a silent no-op (tests passed anyway because safeAudit writes to mocked KV).

9/9 tests still pass.

@rohitg00 rohitg00 mentioned this pull request May 19, 2026
@rohitg00 rohitg00 merged commit e68d4eb into rohitg00:main May 19, 2026
1 of 2 checks passed
@rohitg00
Copy link
Copy Markdown
Owner

Merged + shipping in v0.9.21. Thanks @efenex — sessions >7k obs were silently failing at the LLM context limit, so this restores the consolidation pipeline for power users. Backfill script under scripts/ is a one-time operator helper for users hitting the pre-fix bug.

@rohitg00 rohitg00 mentioned this pull request May 19, 2026
rohitg00 added a commit that referenced this pull request May 19, 2026
Quality + integration wave. Bundles 11 PRs since v0.9.20:

Contributor feature:
- #237 OpenCode plugin with 22 auto-capture hooks (@cl0ckt0wer)

Bug fixes (9):
- #516 memory_recall endpoint + format/token_budget (@serhiizghama, closes #507/#440)
- #461 env-file AGENTMEMORY_DROP_STALE_INDEX flag honored (@honor2030, closes #456)
- #487 Windows hook path quoting (@honor2030, closes #477)
- #517 viewer IME composition guard (@jonathanzhan1975)
- #472 chunk large sessions for LLM context window (@efenex)
- #473 surface lessons in smart-search + diagnose tally (@efenex)
- #486 declare all Hermes plugin hooks (@honor2030)
- #500 rebuildIndex non-blocking on boot (@efenex)
- #504 batched embed in rebuildIndex (25h -> 3h) (@efenex)
- #491 cli skip onboarding without tty (@honor2030)

Upstream-installer revert:
- #546 drop --next workaround now that iii-hq/iii#1660 shipped

1067/1067 tests pass across 95 files.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants