feat(tools): add Tool composition interface with multi_edit proof of concept#193
Conversation
zjshen14
left a comment
There was a problem hiding this comment.
Review: substantive change, needs design clarification before merge
The shape of this change is right — optional fields on Tool, a separate ToolRunner interface to avoid circular imports, and multi_edit as a real proof-of-concept that motivates the change. The interface evolution is backwards-compatible. Tests are focused and pass.
One real concern is the confirmation-gate bypass, and one design question is whether atomic actually means what it implies. I'd hold the merge until those are resolved.
Strengths
- Minimal interface change. All new fields on
Toolare optional; existing tools don't move.execute(params, ctx?)is the right TypeScript shape. ToolRunnercleanly avoids circular imports. Defining a 1-method interface thatToolRegistryhappens to implement is the right abstraction.multi_editmotivates the change. Real value, naturally composable, simple to reason about.- Tests are tight. 2 registry tests + 4 multi_edit tests; new behavior on existing tools is not modified.
Real concerns
1. (High) ctx.registry.execute() bypasses every confirmation gate
The PR description says:
The atomic confirmation behavior falls out naturally: the executor confirms
multi_editonce; its internal calls toctx.registry.execute("edit", ...)bypass the executor's confirmation gate entirely.
This is correct as described — but the reason it bypasses the gate is that ToolRegistry.execute() (src/tools/registry.ts) has no confirmation gate at all. The gate lives entirely in src/core/executor.ts (executor.ts:88-91). So ctx.registry.execute() doesn't just skip confirmation for multi_edit's atomic composition — it skips it for any composed tool, regardless of atomic.
This means a future tool author writing:
{
name: "fix_imports",
composedOf: ["bash"],
atomic: false,
async execute(_, ctx) {
return ctx!.registry.execute("bash", { command: "rm -rf ./node_modules && npm install" });
},
}…would silently bypass:
- The HITL confirmation prompt (
SAFE_COMMANDSallowlist gate) - The permission
denyrules from #156 - The
askrules from #164 - Plan-mode read-only enforcement
For multi_edit itself this is safe by coincidence: edit.requiresConfirmation only fires for paths outside cwd, and multi_edit runs the same path check at its own level, so any path the inner edit would have prompted on is one the outer multi_edit already prompted on. But the interface doesn't enforce this invariant — only the implementation of this one tool does.
We just shipped pattern-based deny rules in #156 specifically to give project owners control over what tools can do. Composition that bypasses those rules is a real regression in the security/UX model, even if no current tool exploits it.
Three ways to resolve this, in order of how I'd rank them:
a. Make ctx.registry.execute() route through the executor's confirmation gate (or surface a sibling method executeWithConfirm()). When atomic: true is set on the parent tool, the gate suppresses confirmation for sub-calls. When atomic is false or unset, sub-calls go through the normal gate. This makes atomic actually load-bearing.
b. Document explicitly that composed tools must NOT compose tools with meaningful requiresConfirmation and add a runtime check at registration time: reject any tool whose composedOf includes a tool that itself requires confirmation. Cheap and safe; constrains the design but matches what multi_edit actually does.
c. Keep the current shape but document the footgun loudly in base.ts next to composedOf, with a worked example of what NOT to compose. Weakest option — relies on tool authors reading docs they don't have to read.
I'd prefer (a). (b) is acceptable. (c) alone isn't.
2. (Medium) atomic is named like ACID but doesn't roll back
The test "stops on the first failing edit" is correct behavior — the implementation does fail-fast. But that means: edits 1 and 2 succeed, edit 3 fails → the file is left in a partial-edit state, and the user sees an error. "Atomic" usually means all-or-nothing.
Two paths:
- Rename to something more honest (
singleConfirmation,consolidated,bundled) and accept fail-fast semantics. - Actually implement rollback: capture pre-tool state (could leverage the existing snapshot system in
src/state/snapshot.ts) and restore on failure. Heavier but matches the name.
I'd rename. Rollback is a separate concern that's better handled by the existing snapshot/rewind machinery.
3. (Low) tmpDir?: string in ToolExecutionContext is unused
The field is declared but ToolRegistry.execute() never populates it and no tool reads it. Either drop it or add a comment explaining its intended future use (presumably tied to Context.getSessionTmpDir()?). Right now it's a draft leftover.
Smaller observations
composedOfis informational only. Worth a comment noting where you'd expect to use it later (e.g. plan mode filtering when all sub-tools are readonly, observability event annotation, docs generation).- The error message "multi_edit requires an execution context" is opaque. Consider
\"multi_edit must be invoked via ToolRegistry, not called directly.\" - CodeQL is FAILURE. Worth investigating before merge — might be unrelated, but the run URL is in the PR. Even if unrelated, the PR description's test plan claims "622 tests" but main is at 645+ now — worth re-running locally and updating.
Verdict
Hold for design clarification on the confirmation bypass. The fix is bounded (option a or b above) and doesn't require redesigning the rest of the PR — the interface shape is good, multi_edit is the right first example, and the registry plumbing is correct. Once ctx.registry.execute() either respects confirmations or has a runtime guard preventing the dangerous composition, this is a strong merge.
The atomic naming/semantics is the second thing to resolve before the field name calcifies as part of the public Tool interface. Hard to change later.
…concept Extends the Tool interface with optional `composedOf`, `atomic`, and `ctx` fields so tools can delegate to other tools as a single confirmed unit. The ToolRegistry now builds a ToolExecutionContext (registry reference + optional tmpDir) and passes it to every tool's execute() call, enabling composed tools to call sub-tools without routing back through the executor's confirmation gate. `multi_edit` is the bundled proof of concept: it applies a list of ordered edits to a single file via the registry-backed `edit` tool, confirmed once as an atomic operation instead of once per edit. Closes #65 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tmpDir Addresses the three real concerns from the #193 review. 1. Confirmation-gate bypass (the high-severity one). ToolRegistry.execute() now refuses to run any tool whose composedOf includes a sub-tool that has its own requiresConfirmation predicate, unless the composing tool sets singleConfirmation: true. Without this guard a tool author writing composedOf: ["bash"] would have silently bypassed the executor's HITL prompt, the permission deny rules from #156, the ask rules from #164, and plan-mode read-only enforcement. The guard runs at execute time (not register time) so registration order is irrelevant. 2. Renamed atomic to singleConfirmation. The original name implied ACID all-or-nothing semantics, but the implementation is fail-fast with possibly partial state (no rollback). singleConfirmation honestly describes what the flag does: one user prompt covers the composite operation. multi_edit's description now spells out explicitly that earlier successful edits are NOT rolled back if a later edit fails — use /rewind or git to undo partial state. Rollback can layer on top later if/when needed via the existing snapshot/rewind machinery. 3. Dropped the unused tmpDir field from ToolExecutionContext. Also tightened multi_edit's "called without ctx" error message from a generic "execution context" to the actionable "must be invoked via ToolRegistry, not called directly." Tests: - Renamed metadata test (atomic → singleConfirmation, same behavior) - New: rejects a composed tool that delegates to a confirmation-gated sub-tool without singleConfirmation (proves the guard fires) - New: allows a composed tool to delegate to a confirmation-gated sub-tool when singleConfirmation is true (proves the guard doesn't over-reject) All 653 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Took this over from the review — three fixes in 721052d on top of the original commitPer the review thread above, all three concerns are now addressed without changing the shape of the original PR. The interface evolution, 1. ✅ Confirmation-gate bypass — runtime guard in
|
…ile" CodeQL flagged the two new multi_edit tests for writing to fixed filenames under os.tmpdir(). The risk is the classic symlink-race class: an attacker who can pre-create /tmp/opencli-test-<Date.now()>/multi.ts as a symlink could redirect the test's writeFile() to an arbitrary path. The pattern matches the rest of this file (existing tests use the same join(tmpDir, "literal.txt") shape and aren't flagged only because CodeQL ignores pre-existing lines), but the new tests trip the rule. Switch the multi_edit describe block to a per-block mkdtemp() that creates a kernel-allocated random suffix. Scoped to just the new tests to keep the diff small; converting the whole file is a worthwhile follow-up but out of scope for #65. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(eval): make notContains substring-aware so overlap with contains doesn't false-fail The previous scorer flagged a `notContains` hit whenever the forbidden string appeared anywhere in the file, even if it was a substring of a matched `contains` pattern. That broke the obvious YAML form for strict-equality fixes — `contains: '=== ""'`, `notContains: '== ""'` — because the correctly-fixed code contains both: `==` is inside `===`. Switch to position-aware ranges: a `notContains` hit only counts when its [start, end] range is not fully covered by some `contains` range. This preserves the original fix-off-by-one behaviour (where the forbidden pattern extends past the permitted match) and unlocks the natural form for `fix-strict-equality.yaml`, which is tightened in this commit. Two regression tests added in scorer.test.ts pin both directions. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test(eval): add multi_edit scenario exercising tool composition end-to-end #193 introduced multi_edit as a `composedOf: [edit]` tool with a single confirmation gate. Until now its registration path was only covered by unit tests in file.test.ts — there was no end-to-end check that a real agent run against a provider could pick the tool up, apply two coordinated edits, and leave the file in a valid (tsc-clean) state. This scenario prompts the agent to fix both bugs in utils.ts (empty-array reduce + loose equality) using multi_edit explicitly, and asserts the resulting file contains both fixes. Confirmed 21/21 pass against gemini-3.1-flash-lite-preview. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…docs (#207) The multi_edit tool and the composedOf/singleConfirmation composition fields were added in #193 but the three doc files that describe the source structure and Tool interface were not updated. - CLAUDE.md + AGENTS.md: add multi_edit to the file/ tool list - docs/architecture.md: add multi-edit.ts to the file/ tree; add ToolExecutionContext interface and composedOf/singleConfirmation fields to the Tool interface code block; update execute() signature to show the optional ctx parameter Co-authored-by: Claude <noreply@anthropic.com>
Summary
Toolinterface (src/tools/base.ts) with optionalcomposedOf,atomic, andctxfields, plus aToolRunnerinterface andToolExecutionContexttype to avoid circular importsToolRegistry.execute()to build aToolExecutionContext({ registry: this }) and pass it to every tool'sexecute()call, enabling composed tools to call sub-tools without routing back through the executor's confirmation gatemulti_editas the bundled proof of concept: applies an ordered list of{old_string, new_string}edits to a single file via the registry-backededittool, confirmed once as an atomic operationWhy this matters: The
Toolinterface is the lowest-level contract in the system. Adding optional composition fields now (composedOf,atomic,ctx) establishes the interface without breaking any existing tools and without requiring every tool's signature to change later. The atomic confirmation behavior falls out naturally: the executor confirmsmulti_editonce; its internal calls toctx.registry.execute("edit", ...)bypass the executor's confirmation gate entirely.Related issue
Closes #65
Test plan
npm run typecheck && npm run lint && npm run format:check && npm test— all pass (622 tests, 0 failures)Generated by Claude Code