Skip to content

feat(eval): pluggable benchmark harness with in-house coding-agent corpus#562

Merged
rohitg00 merged 2 commits into
mainfrom
eval/scaffold-longmemeval
May 20, 2026
Merged

feat(eval): pluggable benchmark harness with in-house coding-agent corpus#562
rohitg00 merged 2 commits into
mainfrom
eval/scaffold-longmemeval

Conversation

@rohitg00
Copy link
Copy Markdown
Owner

@rohitg00 rohitg00 commented May 20, 2026

Summary

Adds an eval/ tree with a pluggable Adapter contract, three reference adapters, two benchmarks, scoring math, and a sandbox script. Stays outside files field so the npm tarball is unchanged.

What ships

  • Adapter interface (eval/runner/types.ts):
    • init(sessions, config) -> State
    • query(q, state, k) -> RankedDoc[]
  • Three adapters:
    • grep — tokenized substring baseline, zero deps
    • vector — OpenAI text-embedding-3-small + cosine (OPENAI_API_KEY)
    • agentmemory-hybridPOST /agentmemory/{remember,smart-search} (full production stack)
  • Two benchmarks:
    • coding-agent-life-v1 — committed 15-session fictional Rust CLI (shipctl) project with 15 hand-graded queries covering single-session, multi-session causal, preference, temporal
    • LongMemEval _s — public 500-Q benchmark, dataset fetched on demand (278MB)
  • Scoring (eval/runner/score.ts): P@K, R@K, hit, top-gold-rank, NDJSON per-query rows + summary.json per-adapter and per-type aggregates
  • Sandbox script (eval/scripts/sandbox.sh): boots clean agentmemory + iii-engine on ports 3411/3412 with isolated data dir at /tmp/agentmemory-eval-sandbox/; tears down on exit. Required because running against a real ~/.agentmemory pollutes both the eval and the user's real store.

Published scorecard

docs/benchmarks/2026-05-20-coding-agent-life-v1.md — first published scorecard, sandbox-reproducible:

Adapter P@5 R@5 Top-5 hit rate p50 latency
agentmemory-hybrid 0.578 0.967 15 / 15 14 ms
grep baseline 0.267 0.967 15 / 15 0 ms

100% top-5 hit rate. 2.2x better precision than the grep baseline on identical input. No LLM in the retrieval loop.

npm scripts

npm run eval:coding-life     # in-house, no download
npm run eval:longmemeval     # public, needs OPENAI_API_KEY + dataset

Reproduce locally

source eval/scripts/sandbox.sh
npm run eval:coding-life -- --adapters grep,agentmemory

Outputs land in eval/reports/coding-life/: scores.ndjson + summary.json.

Test plan

  • npm test — 1072 / 1072 pass (96 files), including 5 new tests for the eval scaffold
  • npm run build — 12 files, 2398 KB, no regression in bundle size
  • Sandbox run on macOS Apple Silicon — agentmemory-hybrid hits 15/15 with isolated data dir
  • Polluted-vs-clean comparison confirms sandbox is required (precision recovers, recall hits ceiling)
  • eval/ excluded from npm tarball (files field unchanged)
  • eval/data/coding-agent-life-v1/ whitelisted past .gitignore's data/ rule
  • No external project names in diff (gbrain, brainbench, etc — clean grep)

Follow-up

  • Refresh LongMemEval _s numbers in main README with the new harness (needs OPENAI_API_KEY + 278MB dataset + ~$2 spend)
  • Add more adapters (BM25-only baseline, embedding-only baseline) to isolate which agentmemory layer is load-bearing
  • Expand coding-agent-life-v1 to 50 sessions with paraphrased queries + planted distractors

Summary by CodeRabbit

  • New Features

    • Added an evaluation framework to compare grep, vector, and hybrid retrieval adapters and record per-adapter metrics.
  • Documentation

    • Published benchmark reports and a reproducible template; added local reproduction and sandbox instructions.
  • Data

    • Included a new in-house coding-agent-life-v1 dataset of sessions and queries.
  • Tests

    • Added tests validating adapter behavior and scoring/aggregation logic.
  • Chores

    • Updated ignore rules to exclude transient eval artifacts.

Review Change Stack

…rpus

Adds eval/ tree (outside files field so npm tarball stays thin) with Adapter
interface, three reference adapters (grep / vector / agentmemory-hybrid),
two benchmarks (LongMemEval _s public, coding-agent-life-v1 in-house 15
sessions), scoring (P@K, R@K, hit, top-gold-rank), NDJSON output,
sandbox script.

coding-agent-life-v1 published scorecard at
docs/benchmarks/2026-05-20-coding-agent-life-v1.md:
agentmemory-hybrid R@5=0.967 P@5=0.578 (100% hit) vs grep R@5=0.967 P@5=0.267.
2.2x better precision on identical input, sandbox-reproducible.

Adapter contract: init(sessions, config) -> State; query(q, state, k) -> RankedDoc[]

npm scripts:
  npm run eval:coding-life   (no download, no API key for grep)
  npm run eval:longmemeval   (needs OPENAI key + 278MB download)

eval/scripts/sandbox.sh boots clean agentmemory + iii-engine on ports
3411/3412 with isolated data dir; tears down on exit.

README headline updated. 1072/1072 tests pass + 5 new eval tests.
@vercel
Copy link
Copy Markdown

vercel Bot commented May 20, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
agentmemory Ready Ready Preview, Comment May 20, 2026 1:06pm

Request Review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 20, 2026

📝 Walkthrough

Walkthrough

Adds an evaluation framework for retrieval: core types, three adapters (grep, vector, agentmemory), datasets and loader, two CLI runners, scoring+aggregation, sandbox script, tests, benchmark docs, and package scripts.

Changes

Agentmemory Evaluation Framework

Layer / File(s) Summary
Core evaluation type contracts
eval/runner/types.ts
Defines Session, Question, RankedDoc, generic Adapter<State>, and ScoreRow interfaces used by adapters and runners.
Baseline retrieval adapters
eval/runner/adapters/grep.ts, eval/runner/adapters/vector.ts
grep tokenizes queries and ranks by matched term counts. vector uses OpenAI embeddings, provides embed helpers, and ranks by cosine similarity.
Agentmemory API adapter
eval/runner/adapters/agentmemory.ts
Registers sessions via /agentmemory/remember and queries /agentmemory/smart-search, mapping observation IDs back to sessions and deduplicating results.
Evaluation datasets and loader
eval/data/coding-agent-life-v1/queries.json, eval/data/coding-agent-life-v1/sessions.json, eval/runner/load.ts
Adds coding-agent-life-v1 corpus (15 sessions/queries) and loadLongMemEval/stratifySample to build Question[] with flattened session haystacks.
Scoring and aggregation
eval/runner/score.ts
Implements scoreQuestion (precision@k/recall@k/hit/topGoldRank) and aggregate (per-adapter and per-type averages and latency P50).
CLI evaluation runners
eval/runner/coding-life.ts, eval/runner/longmemeval.ts
coding-life runs adapters on the in-house corpus; longmemeval runs long-context datasets with optional stratification; both write NDJSON rows and summary JSON.
Sandbox environment
eval/scripts/sandbox.sh
Bash helper to provision an isolated iii sandbox, start agentmemory, poll readiness, and export AGENTMEMORY_BASE_URL.
Evaluation tests
test/eval-adapters.test.ts
Vitest suite verifying dataset integrity, grep top-5 hit-rate, scoring unit tests, and aggregation correctness.
Documentation and configuration
.gitignore, eval/README.md, docs/benchmarks/2026-05-20-coding-agent-life-v1.md, docs/benchmarks/TEMPLATE.md, README.md, package.json
Adds eval README with quickstart and adapter guidance, benchmark template and sample report, .gitignore rules for eval artifacts, README benchmark entries, and npm scripts for evaluation runs.

Estimated code review effort:
🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs:

"I'm a rabbit with a tiny pen, I hopped through logs and back again,
I stitched adapters, tests, and docs with care 🐇,
Sessions, queries, scores in tow—benchmarks ready to glow,
Sandbox up, runners run, now metrics bloom under the sun!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly and specifically describes the main change: introducing a pluggable benchmark harness with an in-house coding-agent corpus for evaluation purposes.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch eval/scaffold-longmemeval

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 13

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@eval/data/coding-agent-life-v1/sessions.json`:
- Line 40: The text in the session content incorrectly labels the migration as a
"two-phase migration" but then enumerates three phases; update the wording in
the string (the value of "content") to either "three-phase migration" or
"multi-phase migration" so it matches the listed phases (phase 1: add nullable
new column + dual-write; phase 2: backfill + flip reads; phase 3: drop old
column), ensuring the summary and the detailed phase list are consistent.

In `@eval/README.md`:
- Around line 31-34: Remove the unsafe rm "$(which iii)" removal and instead
install the pinned iii release into a user-local bin directory without deleting
existing binaries: ensure ~/.local/bin exists, extract the downloaded tarball
into that directory (the existing curl ... | tar -xz -C ~/.local/bin command is
fine) and instruct users to prepend ~/.local/bin to PATH (or add it to their
shell profile) so the pinned v0.11.2 binary is used; do not attempt to remove or
overwrite whatever which iii currently points to.
- Line 69: The README.md contains a fenced code block starting with ``` that
lacks a language tag (triggers markdownlint MD040); update that opening fence to
include a language identifier (for example change ``` to ```text or ```bash) so
the block is explicitly tagged, keeping the rest of the block content unchanged;
locate the fenced block in README.md (the tree-like snippet under the eval/
heading) and modify its opening fence accordingly.

In `@eval/runner/adapters/agentmemory.ts`:
- Around line 81-83: The code currently computes memId then looks up sessionId
via state.observationToSession (using memId) and ignores any direct
row.sessionId; change the logic in the block around memId/sessionId (variables
memId, sessionId and usage of state.observationToSession and seen) to prefer
row.sessionId if present: first set sessionId = row.sessionId (if defined),
otherwise derive memId = row.obsId ?? row.id ?? row.observationId and then set
sessionId = memId ? state.observationToSession.get(memId) : undefined; keep the
seen check and continue behavior the same so rows with an explicit sessionId are
not discarded.

In `@eval/runner/adapters/vector.ts`:
- Around line 40-46: In embedBatch, validate the embeddings response before
building the Float32Array array: ensure data.data length equals texts.length,
each item has a numeric index within [0, texts.length-1], indexes are unique,
and each embedding is a non-empty numeric array with expected dimensionality; if
any check fails, throw a descriptive error (including the offending
index/response snippet) instead of returning undefined entries so callers never
receive incomplete vectors. Use the function name embedBatch and the local
variables data, row.index, and row.embedding to locate and implement these
checks.

In `@eval/runner/coding-life.ts`:
- Line 37: The code currently does const k = Number(opts.k); which allows NaN, 0
or negative values; update the validation after that line to ensure k is a
positive integer (e.g., if (!Number.isInteger(k) || k <= 0) throw new Error or
return with a clear message) so downstream metric calculations and adapter calls
always receive a valid k; reference the variable k and the opts.k source to
locate where to add this check.
- Around line 65-79: The code currently calls adapter.init(sessions) and then
iterates queries with adapter.query, but if any query throws the
adapter.teardown(state) call is skipped; wrap the query loop and related
per-adapter logic in a try/finally so that adapter.teardown(state) always runs:
after obtaining state from adapter.init(sessions) run the for-loop and all
processing inside a try block and invoke await adapter.teardown(state) inside
the finally (guarding existence with if (adapter.teardown)) to guarantee
resource cleanup even on exceptions.

In `@eval/runner/load.ts`:
- Around line 23-26: The mapping silently injects empty content when
haystack_session_ids and haystack_sessions lengths differ; add an explicit
structural check before building haystack (compare r.haystack_session_ids.length
to r.haystack_sessions.length) and fail fast—throw or return an error with a
clear message that includes both lengths (and any relevant request id) instead
of using the null-coalescing fallback; then only construct haystack using the
validated arrays and continue to call flattenSession for each corresponding
element.

In `@eval/runner/longmemeval.ts`:
- Around line 46-48: The variables k, limit, and perType are parsed from opts
without validation and can become NaN or non-positive; update the parsing in
longmemeval.ts to explicitly parse and validate these CLI options (e.g., use
parseInt/Number, then check Number.isFinite(...) and value > 0 for k and
perType, and value > 0 if limit must be positive) and handle invalid values by
either throwing a clear Error or falling back to a safe default; locate the
parsing site where k, limit, and perType are assigned and replace it with
validated parsing that rejects NaN/non-positive inputs and surfaces a clear
message referencing opts.k, opts.limit and opts.stratify (or sets a documented
default) so downstream code using k, limit, and perType never receives invalid
numbers.
- Around line 74-80: Wrap per-question work in a try/finally so
adapter.teardown(state) always runs: after obtaining state from
adapter.init(q.haystack) call adapter.query(...) and the subsequent
scoring/appendFileSync/rows.push inside a try block, and call
adapter.teardown(state) in the finally block (guarding that state is defined).
Update the logic around adapter.init, adapter.query, scoreQuestion,
appendFileSync(ndjsonPath, ...), rows.push(row) and adapter.teardown to ensure
teardown executes even if adapter.query or scoreQuestion throws.

In `@eval/runner/score.ts`:
- Line 13: The precision calculation uses topK.length as the denominator which
inflates P@K when fewer than k results are returned; change the computation in
precisionAtK to divide by k (the requested cutoff) instead of topK.length, while
still handling k === 0 safely (return 0 when k is 0). Update the expression that
sets precisionAtK (referencing precisionAtK, topK, k, hits) so hits / k is used
when k > 0, otherwise 0.

In `@eval/scripts/sandbox.sh`:
- Line 31: The script unguardedly runs rm -rf "$SANDBOX_ROOT" which can
catastrophically delete important paths; before that command, validate
SANDBOX_ROOT by ensuring it is non-empty, not "/" (and not any other critical
root like ""), and resolves to an expected sandbox subtree (e.g. check realpath
"$SANDBOX_ROOT" and assert it starts with a known safe prefix or contains a
specific marker), and fail fast if the check fails; update the code around the
rm -rf invocation to perform these checks on the SANDBOX_ROOT variable and only
run rm -rf when the validation passes.

In `@test/eval-adapters.test.ts`:
- Around line 8-10: The test uses __dirname to build DATA_DIR which breaks in
ESM; update the path resolution to use import.meta.url (or
fileURLToPath(import.meta.url) + path.dirname) to compute DATA_DIR and keep JSON
reads for sessions and queries working; locate the DATA_DIR, sessions and
queries declarations in the file (symbols: DATA_DIR, sessions, queries) and
replace the __dirname-based resolution with an ESM-safe resolution using
import.meta.url so the module loads under ESM.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8f441b2c-7e82-40a3-8477-55691b317757

📥 Commits

Reviewing files that changed from the base of the PR and between e9dc710 and d32abb2.

📒 Files selected for processing (18)
  • .gitignore
  • README.md
  • docs/benchmarks/2026-05-20-coding-agent-life-v1.md
  • docs/benchmarks/TEMPLATE.md
  • eval/README.md
  • eval/data/coding-agent-life-v1/queries.json
  • eval/data/coding-agent-life-v1/sessions.json
  • eval/runner/adapters/agentmemory.ts
  • eval/runner/adapters/grep.ts
  • eval/runner/adapters/vector.ts
  • eval/runner/coding-life.ts
  • eval/runner/load.ts
  • eval/runner/longmemeval.ts
  • eval/runner/score.ts
  • eval/runner/types.ts
  • eval/scripts/sandbox.sh
  • package.json
  • test/eval-adapters.test.ts

Comment thread eval/data/coding-agent-life-v1/sessions.json Outdated
Comment thread eval/README.md
Comment thread eval/README.md Outdated
Comment thread eval/runner/adapters/agentmemory.ts Outdated
Comment thread eval/runner/adapters/vector.ts
Comment thread eval/runner/longmemeval.ts Outdated
Comment thread eval/runner/longmemeval.ts Outdated
Comment thread eval/runner/score.ts Outdated
Comment thread eval/scripts/sandbox.sh
Comment on lines +8 to +10
const DATA_DIR = resolve(__dirname, "..", "eval", "data", "coding-agent-life-v1");
const sessions = JSON.parse(readFileSync(`${DATA_DIR}/sessions.json`, "utf8")) as Session[];
const queries = JSON.parse(readFileSync(`${DATA_DIR}/queries.json`, "utf8")) as Array<
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify ESM mode and locate __dirname usage in TS tests.
set -euo pipefail

echo "Checking package.json module type:"
cat package.json | jq -r '.type // "<missing>"'

echo
echo "Searching for __dirname usage in test/*.test.ts files:"
rg -nP --type=ts -C2 '\b__dirname\b' test

Repository: rohitg00/agentmemory

Length of output: 1160


Replace __dirname usage with ESM-safe path resolution.

__dirname is undefined in ESM and will fail this suite at module load time.

Proposed fix
 import { readFileSync } from "node:fs";
-import { resolve } from "node:path";
+import { dirname, resolve } from "node:path";
+import { fileURLToPath } from "node:url";
 import { grepAdapter } from "../eval/runner/adapters/grep.js";
 import { aggregate, scoreQuestion } from "../eval/runner/score.js";
 import type { Question, Session } from "../eval/runner/types.js";
 
-const DATA_DIR = resolve(__dirname, "..", "eval", "data", "coding-agent-life-v1");
+const __filename = fileURLToPath(import.meta.url);
+const __dirname = dirname(__filename);
+const DATA_DIR = resolve(__dirname, "..", "eval", "data", "coding-agent-life-v1");
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const DATA_DIR = resolve(__dirname, "..", "eval", "data", "coding-agent-life-v1");
const sessions = JSON.parse(readFileSync(`${DATA_DIR}/sessions.json`, "utf8")) as Session[];
const queries = JSON.parse(readFileSync(`${DATA_DIR}/queries.json`, "utf8")) as Array<
import { readFileSync } from "node:fs";
import { dirname, resolve } from "node:path";
import { fileURLToPath } from "node:url";
import { grepAdapter } from "../eval/runner/adapters/grep.js";
import { aggregate, scoreQuestion } from "../eval/runner/score.js";
import type { Question, Session } from "../eval/runner/types.js";
const __filename = fileURLToPath(import.meta.url);
const __dirname = dirname(__filename);
const DATA_DIR = resolve(__dirname, "..", "eval", "data", "coding-agent-life-v1");
const sessions = JSON.parse(readFileSync(`${DATA_DIR}/sessions.json`, "utf8")) as Session[];
const queries = JSON.parse(readFileSync(`${DATA_DIR}/queries.json`, "utf8")) as Array<
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/eval-adapters.test.ts` around lines 8 - 10, The test uses __dirname to
build DATA_DIR which breaks in ESM; update the path resolution to use
import.meta.url (or fileURLToPath(import.meta.url) + path.dirname) to compute
DATA_DIR and keep JSON reads for sessions and queries working; locate the
DATA_DIR, sessions and queries declarations in the file (symbols: DATA_DIR,
sessions, queries) and replace the __dirname-based resolution with an ESM-safe
resolution using import.meta.url so the module loads under ESM.

- agentmemory adapter: prefer row.sessionId before observationToSession lookup
- vector adapter: validate embedBatch response (length, indexes, non-empty rows)
- coding-life: positive-int guard on --k; wrap query loop in try/finally so teardown runs
- longmemeval: positive-int guards on --k/--limit/--stratify; per-question try/finally
- load: throw on haystack_session_ids vs haystack_sessions length mismatch
- score: P@K denominator is k (requested cutoff) not topK.length
- sandbox.sh: guard rm -rf with non-empty + /tmp/ prefix check
- README: drop unsafe rm "$(which iii)"; instruct ~/.local/bin + PATH instead; add language tag to repo-layout fenced block
- sessions.json: fix "two-phase" -> "three-phase" wording mismatch
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
eval/scripts/sandbox.sh (1)

31-35: ⚠️ Potential issue | 🔴 Critical

Canonicalize SANDBOX_ROOT before the wipe check.

The raw prefix check still accepts traversal values like /tmp/../etc, so rm -rf "$SANDBOX_ROOT" can still delete outside /tmp.

Suggested fix
-if [[ -z "${SANDBOX_ROOT:-}" || "$SANDBOX_ROOT" == "/" || "$SANDBOX_ROOT" != /tmp/* ]]; then
-  echo "refusing to wipe SANDBOX_ROOT='$SANDBOX_ROOT' — must be non-empty and under /tmp/" >&2
+SANDBOX_ROOT_REAL="$(python3 -c 'import os,sys; print(os.path.realpath(sys.argv[1]))' "$SANDBOX_ROOT")"
+if [[ -z "${SANDBOX_ROOT:-}" || "$SANDBOX_ROOT_REAL" == "/" || "$SANDBOX_ROOT_REAL" != /tmp/* ]]; then
+  echo "refusing to wipe SANDBOX_ROOT='$SANDBOX_ROOT' (resolved: '$SANDBOX_ROOT_REAL') — must resolve under /tmp/" >&2
   exit 1
 fi
-rm -rf "$SANDBOX_ROOT"
+rm -rf "$SANDBOX_ROOT_REAL"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eval/scripts/sandbox.sh` around lines 31 - 35, Canonicalize SANDBOX_ROOT
before validating and deleting: resolve SANDBOX_ROOT to an absolute canonical
path (e.g., using realpath -m or readlink -f) into a new variable (like
CANON_SANDBOX_ROOT), then perform the existing non-empty/root and prefix check
against CANON_SANDBOX_ROOT (ensure it starts with /tmp/), and finally run rm -rf
on the canonicalized variable instead of the raw SANDBOX_ROOT; reference the
SANDBOX_ROOT variable and replace its use in the conditional and rm -rf with the
canonicalized name.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@eval/scripts/sandbox.sh`:
- Around line 31-35: Canonicalize SANDBOX_ROOT before validating and deleting:
resolve SANDBOX_ROOT to an absolute canonical path (e.g., using realpath -m or
readlink -f) into a new variable (like CANON_SANDBOX_ROOT), then perform the
existing non-empty/root and prefix check against CANON_SANDBOX_ROOT (ensure it
starts with /tmp/), and finally run rm -rf on the canonicalized variable instead
of the raw SANDBOX_ROOT; reference the SANDBOX_ROOT variable and replace its use
in the conditional and rm -rf with the canonicalized name.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1be9230c-bb00-489a-923d-c49d19cd9abe

📥 Commits

Reviewing files that changed from the base of the PR and between d32abb2 and 2d60560.

📒 Files selected for processing (10)
  • eval/README.md
  • eval/data/coding-agent-life-v1/sessions.json
  • eval/runner/adapters/agentmemory.ts
  • eval/runner/adapters/vector.ts
  • eval/runner/coding-life.ts
  • eval/runner/load.ts
  • eval/runner/longmemeval.ts
  • eval/runner/score.ts
  • eval/scripts/sandbox.sh
  • test/eval-adapters.test.ts
✅ Files skipped from review due to trivial changes (2)
  • eval/data/coding-agent-life-v1/sessions.json
  • eval/README.md

@rohitg00 rohitg00 merged commit 7fb72f4 into main May 20, 2026
7 checks passed
@rohitg00 rohitg00 deleted the eval/scaffold-longmemeval branch May 20, 2026 13:11
rohitg00 added a commit that referenced this pull request May 21, 2026
…uard (#588)

Three concerns surfaced when auditing PRs merged since v0.9.21:

1) Graph parser regex from #494 was correct for self-closing tags ONLY
   when they're the only entity in the document. With a self-closing
   entity followed by an explicit-close entity, the greedy `[^>]*` ate
   the trailing `/` of the self-close, the alternation fell through to
   the explicit-close branch, and `[\s\S]*?` ran ahead to the *next*
   `</entity>` — merging two entity declarations into one match and
   silently dropping a node. Switch to lazy `[^>]*?` so the
   self-closing alternation gets first chance. Test from #494 (which
   was failing on main but I didn't notice) now passes.

2) #386 shipped `${AGENTMEMORY_URL}` / `${AGENTMEMORY_SECRET}`
   placeholders in plugin/.mcp.json. Claude Code and Cursor expand
   those at MCP launch time; some hosts pass the literal string
   through. A truthy literal `"${AGENTMEMORY_URL}"` defeats the
   existing `|| DEFAULT_URL` fallback and would have the shim POST to
   `${AGENTMEMORY_URL}/agentmemory/...` (DNS failure). Add a
   defensive guard in src/mcp/rest-proxy.ts that strips any value of
   the form `${...}` so the fallback engages. Mirror in
   src/mcp/standalone.ts's mode-announce log line.

3) CHANGELOG entries for all PRs landed since v0.9.21 (#321, #325,
   #386, #454, #494, #526, #542, #560, #561, #562, #564, #567) plus
   the regex + env-guard hardening here.

Validation:
- npm test → 98 files, 1091 tests pass
- npm pack → 142 files, fresh install clean, iii-sdk@0.11.2 pinned,
  plugin/ shipped with hooks/scripts/skills/opencode/
- New test file covers 8 placeholder cases (unset, empty,
  `${VAR}` literal, mismatched braces, real values with $,
  unclosed `${`, no-braces `$VAR`).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant