Skip to content

feat(tools): add Tool composition interface with multi_edit proof of concept#193

Merged
zjshen14 merged 3 commits into
mainfrom
feat/issue-65
Jun 1, 2026
Merged

feat(tools): add Tool composition interface with multi_edit proof of concept#193
zjshen14 merged 3 commits into
mainfrom
feat/issue-65

Conversation

@zjshen14

Copy link
Copy Markdown
Owner

Summary

  • Extends the Tool interface (src/tools/base.ts) with optional composedOf, atomic, and ctx fields, plus a ToolRunner interface and ToolExecutionContext type to avoid circular imports
  • Updates ToolRegistry.execute() to build a ToolExecutionContext ({ registry: this }) and pass it to every tool's execute() call, enabling composed tools to call sub-tools without routing back through the executor's confirmation gate
  • Adds multi_edit as the bundled proof of concept: applies an ordered list of {old_string, new_string} edits to a single file via the registry-backed edit tool, confirmed once as an atomic operation

Why this matters: The Tool interface is the lowest-level contract in the system. Adding optional composition fields now (composedOf, atomic, ctx) establishes the interface without breaking any existing tools and without requiring every tool's signature to change later. The atomic confirmation behavior falls out naturally: the executor confirms multi_edit once; its internal calls to ctx.registry.execute("edit", ...) bypass the executor's confirmation gate entirely.

Related issue

Closes #65

Test plan

  • npm run typecheck && npm run lint && npm run format:check && npm test — all pass (622 tests, 0 failures)
  • All existing tool tests unmodified and still passing — no behaviour changes to existing tools
  • 2 new registry tests: context is passed to tool execute, composed tool can call sub-tools via ctx
  • 4 new multi_edit tests: applies multiple edits in order, stops on first failure, errors without context, exposes atomic/composedOf metadata

Generated by Claude Code

Comment thread src/tools/file/file.test.ts Fixed
Comment thread src/tools/file/file.test.ts Fixed

@zjshen14 zjshen14 left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: substantive change, needs design clarification before merge

The shape of this change is right — optional fields on Tool, a separate ToolRunner interface to avoid circular imports, and multi_edit as a real proof-of-concept that motivates the change. The interface evolution is backwards-compatible. Tests are focused and pass.

One real concern is the confirmation-gate bypass, and one design question is whether atomic actually means what it implies. I'd hold the merge until those are resolved.

Strengths

  • Minimal interface change. All new fields on Tool are optional; existing tools don't move. execute(params, ctx?) is the right TypeScript shape.
  • ToolRunner cleanly avoids circular imports. Defining a 1-method interface that ToolRegistry happens to implement is the right abstraction.
  • multi_edit motivates the change. Real value, naturally composable, simple to reason about.
  • Tests are tight. 2 registry tests + 4 multi_edit tests; new behavior on existing tools is not modified.

Real concerns

1. (High) ctx.registry.execute() bypasses every confirmation gate

The PR description says:

The atomic confirmation behavior falls out naturally: the executor confirms multi_edit once; its internal calls to ctx.registry.execute("edit", ...) bypass the executor's confirmation gate entirely.

This is correct as described — but the reason it bypasses the gate is that ToolRegistry.execute() (src/tools/registry.ts) has no confirmation gate at all. The gate lives entirely in src/core/executor.ts (executor.ts:88-91). So ctx.registry.execute() doesn't just skip confirmation for multi_edit's atomic composition — it skips it for any composed tool, regardless of atomic.

This means a future tool author writing:

{
  name: "fix_imports",
  composedOf: ["bash"],
  atomic: false,
  async execute(_, ctx) {
    return ctx!.registry.execute("bash", { command: "rm -rf ./node_modules && npm install" });
  },
}

…would silently bypass:

  • The HITL confirmation prompt (SAFE_COMMANDS allowlist gate)
  • The permission deny rules from #156
  • The ask rules from #164
  • Plan-mode read-only enforcement

For multi_edit itself this is safe by coincidence: edit.requiresConfirmation only fires for paths outside cwd, and multi_edit runs the same path check at its own level, so any path the inner edit would have prompted on is one the outer multi_edit already prompted on. But the interface doesn't enforce this invariant — only the implementation of this one tool does.

We just shipped pattern-based deny rules in #156 specifically to give project owners control over what tools can do. Composition that bypasses those rules is a real regression in the security/UX model, even if no current tool exploits it.

Three ways to resolve this, in order of how I'd rank them:

a. Make ctx.registry.execute() route through the executor's confirmation gate (or surface a sibling method executeWithConfirm()). When atomic: true is set on the parent tool, the gate suppresses confirmation for sub-calls. When atomic is false or unset, sub-calls go through the normal gate. This makes atomic actually load-bearing.

b. Document explicitly that composed tools must NOT compose tools with meaningful requiresConfirmation and add a runtime check at registration time: reject any tool whose composedOf includes a tool that itself requires confirmation. Cheap and safe; constrains the design but matches what multi_edit actually does.

c. Keep the current shape but document the footgun loudly in base.ts next to composedOf, with a worked example of what NOT to compose. Weakest option — relies on tool authors reading docs they don't have to read.

I'd prefer (a). (b) is acceptable. (c) alone isn't.

2. (Medium) atomic is named like ACID but doesn't roll back

The test "stops on the first failing edit" is correct behavior — the implementation does fail-fast. But that means: edits 1 and 2 succeed, edit 3 fails → the file is left in a partial-edit state, and the user sees an error. "Atomic" usually means all-or-nothing.

Two paths:

  • Rename to something more honest (singleConfirmation, consolidated, bundled) and accept fail-fast semantics.
  • Actually implement rollback: capture pre-tool state (could leverage the existing snapshot system in src/state/snapshot.ts) and restore on failure. Heavier but matches the name.

I'd rename. Rollback is a separate concern that's better handled by the existing snapshot/rewind machinery.

3. (Low) tmpDir?: string in ToolExecutionContext is unused

The field is declared but ToolRegistry.execute() never populates it and no tool reads it. Either drop it or add a comment explaining its intended future use (presumably tied to Context.getSessionTmpDir()?). Right now it's a draft leftover.

Smaller observations

  • composedOf is informational only. Worth a comment noting where you'd expect to use it later (e.g. plan mode filtering when all sub-tools are readonly, observability event annotation, docs generation).
  • The error message "multi_edit requires an execution context" is opaque. Consider \"multi_edit must be invoked via ToolRegistry, not called directly.\"
  • CodeQL is FAILURE. Worth investigating before merge — might be unrelated, but the run URL is in the PR. Even if unrelated, the PR description's test plan claims "622 tests" but main is at 645+ now — worth re-running locally and updating.

Verdict

Hold for design clarification on the confirmation bypass. The fix is bounded (option a or b above) and doesn't require redesigning the rest of the PR — the interface shape is good, multi_edit is the right first example, and the registry plumbing is correct. Once ctx.registry.execute() either respects confirmations or has a runtime guard preventing the dangerous composition, this is a strong merge.

The atomic naming/semantics is the second thing to resolve before the field name calcifies as part of the public Tool interface. Hard to change later.

claude and others added 2 commits May 31, 2026 23:25
…concept

Extends the Tool interface with optional `composedOf`, `atomic`, and
`ctx` fields so tools can delegate to other tools as a single confirmed
unit. The ToolRegistry now builds a ToolExecutionContext (registry
reference + optional tmpDir) and passes it to every tool's execute()
call, enabling composed tools to call sub-tools without routing back
through the executor's confirmation gate.

`multi_edit` is the bundled proof of concept: it applies a list of
ordered edits to a single file via the registry-backed `edit` tool,
confirmed once as an atomic operation instead of once per edit.

Closes #65
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tmpDir

Addresses the three real concerns from the #193 review.

1. Confirmation-gate bypass (the high-severity one). ToolRegistry.execute()
   now refuses to run any tool whose composedOf includes a sub-tool that has
   its own requiresConfirmation predicate, unless the composing tool sets
   singleConfirmation: true. Without this guard a tool author writing
   composedOf: ["bash"] would have silently bypassed the executor's HITL
   prompt, the permission deny rules from #156, the ask rules from #164, and
   plan-mode read-only enforcement. The guard runs at execute time (not
   register time) so registration order is irrelevant.

2. Renamed atomic to singleConfirmation. The original name implied ACID
   all-or-nothing semantics, but the implementation is fail-fast with
   possibly partial state (no rollback). singleConfirmation honestly
   describes what the flag does: one user prompt covers the composite
   operation. multi_edit's description now spells out explicitly that
   earlier successful edits are NOT rolled back if a later edit fails —
   use /rewind or git to undo partial state. Rollback can layer on top
   later if/when needed via the existing snapshot/rewind machinery.

3. Dropped the unused tmpDir field from ToolExecutionContext.

Also tightened multi_edit's "called without ctx" error message from a
generic "execution context" to the actionable "must be invoked via
ToolRegistry, not called directly."

Tests:
- Renamed metadata test (atomic → singleConfirmation, same behavior)
- New: rejects a composed tool that delegates to a confirmation-gated
  sub-tool without singleConfirmation (proves the guard fires)
- New: allows a composed tool to delegate to a confirmation-gated
  sub-tool when singleConfirmation is true (proves the guard doesn't
  over-reject)

All 653 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@zjshen14

zjshen14 commented Jun 1, 2026

Copy link
Copy Markdown
Owner Author

Took this over from the review — three fixes in 721052d on top of the original commit

Per the review thread above, all three concerns are now addressed without changing the shape of the original PR. The interface evolution, multi_edit proof of concept, and registry plumbing all stand; what's added is a safety guard plus a rename.

1. ✅ Confirmation-gate bypass — runtime guard in ToolRegistry.execute()

When called, the registry now checks composedOf against each sub-tool's requiresConfirmation. If a sub-tool has its own confirmation predicate and the composing tool isn't marked singleConfirmation: true, the call fails with an actionable error:

Tool 'unsafe_composer' lists 'dangerous' in composedOf, but 'dangerous' has a requiresConfirmation predicate and 'unsafe_composer' is not marked singleConfirmation: true. This composition would silently bypass the user-confirmation gate on 'dangerous'. Either set singleConfirmation: true on 'unsafe_composer' (so its own confirmation covers the sub-call) or remove 'dangerous' from composedOf.

This catches the composedOf: ["bash"] footgun without restructuring the executor ↔ registry relationship. The guard runs at execute time (not register time) so registration order is irrelevant. Two new tests pin the contract:

  • Rejects composition without singleConfirmation when sub-tool is gated
  • Allows composition WITH singleConfirmation (proves the guard doesn't over-reject)

2. ✅ Renamed atomicsingleConfirmation

atomic implied ACID/rollback semantics, but the implementation is fail-fast with possibly partial state. singleConfirmation honestly describes what the flag does: one user prompt covers the composite operation.

multi_edit's description also now spells out explicitly:

Edits are applied in order; if any edit fails, the operation stops at that point (earlier edits are NOT rolled back — use /rewind or git to undo the partial state).

Rollback can layer on top later via the existing src/state/snapshot.ts machinery if/when needed — but it's a separate concern from confirmation semantics, and conflating them in the field name would be a mistake we'd be stuck with.

3. ✅ Dropped unused tmpDir? from ToolExecutionContext

Was a draft leftover; nothing populated or read it. Removed.

Smaller polish

  • multi_edit's "missing ctx" error went from a generic "multi_edit requires an execution context" to actionable "multi_edit must be invoked via ToolRegistry, not called directly."
  • Updated the registry-level JSDoc to document the composition safety invariant inline.

Status

  • 653 tests pass (+8 over main: 6 original PR tests + 2 new guard tests)
  • npm run typecheck && npm run lint && npm run format:check clean
  • Rebased onto current main; ready to merge once CI re-runs

…ile"

CodeQL flagged the two new multi_edit tests for writing to fixed filenames
under os.tmpdir(). The risk is the classic symlink-race class: an attacker
who can pre-create /tmp/opencli-test-<Date.now()>/multi.ts as a symlink
could redirect the test's writeFile() to an arbitrary path.

The pattern matches the rest of this file (existing tests use the same
join(tmpDir, "literal.txt") shape and aren't flagged only because CodeQL
ignores pre-existing lines), but the new tests trip the rule.

Switch the multi_edit describe block to a per-block mkdtemp() that creates
a kernel-allocated random suffix. Scoped to just the new tests to keep the
diff small; converting the whole file is a worthwhile follow-up but
out of scope for #65.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@zjshen14 zjshen14 merged commit 3695156 into main Jun 1, 2026
3 checks passed
@zjshen14 zjshen14 deleted the feat/issue-65 branch June 1, 2026 06:39
zjshen14 added a commit that referenced this pull request Jun 1, 2026
* fix(eval): make notContains substring-aware so overlap with contains doesn't false-fail

The previous scorer flagged a `notContains` hit whenever the forbidden
string appeared anywhere in the file, even if it was a substring of a
matched `contains` pattern. That broke the obvious YAML form for
strict-equality fixes — `contains: '=== ""'`, `notContains: '== ""'` —
because the correctly-fixed code contains both: `==` is inside `===`.

Switch to position-aware ranges: a `notContains` hit only counts when
its [start, end] range is not fully covered by some `contains` range.
This preserves the original fix-off-by-one behaviour (where the
forbidden pattern extends past the permitted match) and unlocks the
natural form for `fix-strict-equality.yaml`, which is tightened in this
commit.

Two regression tests added in scorer.test.ts pin both directions.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(eval): add multi_edit scenario exercising tool composition end-to-end

#193 introduced multi_edit as a `composedOf: [edit]` tool with a single
confirmation gate. Until now its registration path was only covered by
unit tests in file.test.ts — there was no end-to-end check that a real
agent run against a provider could pick the tool up, apply two
coordinated edits, and leave the file in a valid (tsc-clean) state.

This scenario prompts the agent to fix both bugs in utils.ts
(empty-array reduce + loose equality) using multi_edit explicitly,
and asserts the resulting file contains both fixes. Confirmed 21/21
pass against gemini-3.1-flash-lite-preview.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
zjshen14 added a commit that referenced this pull request Jun 2, 2026
…docs (#207)

The multi_edit tool and the composedOf/singleConfirmation composition
fields were added in #193 but the three doc files that describe the
source structure and Tool interface were not updated.

- CLAUDE.md + AGENTS.md: add multi_edit to the file/ tool list
- docs/architecture.md: add multi-edit.ts to the file/ tree; add
  ToolExecutionContext interface and composedOf/singleConfirmation fields
  to the Tool interface code block; update execute() signature to show
  the optional ctx parameter

Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Architecture: design Tool interface for composition — atomic multi-step operations

3 participants