feat(tools): add Tool composition interface with multi_edit proof of concept by zjshen14 · Pull Request #193 · zjshen14/opencli

zjshen14 · 2026-05-31T03:12:08Z

Summary

Extends the Tool interface (src/tools/base.ts) with optional composedOf, atomic, and ctx fields, plus a ToolRunner interface and ToolExecutionContext type to avoid circular imports
Updates ToolRegistry.execute() to build a ToolExecutionContext ({ registry: this }) and pass it to every tool's execute() call, enabling composed tools to call sub-tools without routing back through the executor's confirmation gate
Adds multi_edit as the bundled proof of concept: applies an ordered list of {old_string, new_string} edits to a single file via the registry-backed edit tool, confirmed once as an atomic operation

Why this matters: The Tool interface is the lowest-level contract in the system. Adding optional composition fields now (composedOf, atomic, ctx) establishes the interface without breaking any existing tools and without requiring every tool's signature to change later. The atomic confirmation behavior falls out naturally: the executor confirms multi_edit once; its internal calls to ctx.registry.execute("edit", ...) bypass the executor's confirmation gate entirely.

Related issue

Closes #65

Test plan

npm run typecheck && npm run lint && npm run format:check && npm test — all pass (622 tests, 0 failures)
All existing tool tests unmodified and still passing — no behaviour changes to existing tools
2 new registry tests: context is passed to tool execute, composed tool can call sub-tools via ctx
4 new multi_edit tests: applies multiple edits in order, stops on first failure, errors without context, exposes atomic/composedOf metadata

Generated by Claude Code

zjshen14

Review: substantive change, needs design clarification before merge

The shape of this change is right — optional fields on Tool, a separate ToolRunner interface to avoid circular imports, and multi_edit as a real proof-of-concept that motivates the change. The interface evolution is backwards-compatible. Tests are focused and pass.

One real concern is the confirmation-gate bypass, and one design question is whether atomic actually means what it implies. I'd hold the merge until those are resolved.

Strengths

Minimal interface change. All new fields on Tool are optional; existing tools don't move. execute(params, ctx?) is the right TypeScript shape.
ToolRunner cleanly avoids circular imports. Defining a 1-method interface that ToolRegistry happens to implement is the right abstraction.
multi_edit motivates the change. Real value, naturally composable, simple to reason about.
Tests are tight. 2 registry tests + 4 multi_edit tests; new behavior on existing tools is not modified.

Real concerns

1. (High) `ctx.registry.execute()` bypasses every confirmation gate

The PR description says:

The atomic confirmation behavior falls out naturally: the executor confirms multi_edit once; its internal calls to ctx.registry.execute("edit", ...) bypass the executor's confirmation gate entirely.

This is correct as described — but the reason it bypasses the gate is that ToolRegistry.execute() (src/tools/registry.ts) has no confirmation gate at all. The gate lives entirely in src/core/executor.ts (executor.ts:88-91). So ctx.registry.execute() doesn't just skip confirmation for multi_edit's atomic composition — it skips it for any composed tool, regardless of atomic.

This means a future tool author writing:

{
  name: "fix_imports",
  composedOf: ["bash"],
  atomic: false,
  async execute(_, ctx) {
    return ctx!.registry.execute("bash", { command: "rm -rf ./node_modules && npm install" });
  },
}

…would silently bypass:

The HITL confirmation prompt (SAFE_COMMANDS allowlist gate)
The permission deny rules from #156
The ask rules from #164
Plan-mode read-only enforcement

For multi_edit itself this is safe by coincidence: edit.requiresConfirmation only fires for paths outside cwd, and multi_edit runs the same path check at its own level, so any path the inner edit would have prompted on is one the outer multi_edit already prompted on. But the interface doesn't enforce this invariant — only the implementation of this one tool does.

We just shipped pattern-based deny rules in #156 specifically to give project owners control over what tools can do. Composition that bypasses those rules is a real regression in the security/UX model, even if no current tool exploits it.

Three ways to resolve this, in order of how I'd rank them:

a. Make ctx.registry.execute() route through the executor's confirmation gate (or surface a sibling method executeWithConfirm()). When atomic: true is set on the parent tool, the gate suppresses confirmation for sub-calls. When atomic is false or unset, sub-calls go through the normal gate. This makes atomic actually load-bearing.

b. Document explicitly that composed tools must NOT compose tools with meaningful requiresConfirmation and add a runtime check at registration time: reject any tool whose composedOf includes a tool that itself requires confirmation. Cheap and safe; constrains the design but matches what multi_edit actually does.

c. Keep the current shape but document the footgun loudly in base.ts next to composedOf, with a worked example of what NOT to compose. Weakest option — relies on tool authors reading docs they don't have to read.

I'd prefer (a). (b) is acceptable. (c) alone isn't.

2. (Medium) `atomic` is named like ACID but doesn't roll back

The test "stops on the first failing edit" is correct behavior — the implementation does fail-fast. But that means: edits 1 and 2 succeed, edit 3 fails → the file is left in a partial-edit state, and the user sees an error. "Atomic" usually means all-or-nothing.

Two paths:

Rename to something more honest (singleConfirmation, consolidated, bundled) and accept fail-fast semantics.
Actually implement rollback: capture pre-tool state (could leverage the existing snapshot system in src/state/snapshot.ts) and restore on failure. Heavier but matches the name.

I'd rename. Rollback is a separate concern that's better handled by the existing snapshot/rewind machinery.

3. (Low) `tmpDir?: string` in `ToolExecutionContext` is unused

The field is declared but ToolRegistry.execute() never populates it and no tool reads it. Either drop it or add a comment explaining its intended future use (presumably tied to Context.getSessionTmpDir()?). Right now it's a draft leftover.

Smaller observations

composedOf is informational only. Worth a comment noting where you'd expect to use it later (e.g. plan mode filtering when all sub-tools are readonly, observability event annotation, docs generation).
The error message "multi_edit requires an execution context" is opaque. Consider \"multi_edit must be invoked via ToolRegistry, not called directly.\"
CodeQL is FAILURE. Worth investigating before merge — might be unrelated, but the run URL is in the PR. Even if unrelated, the PR description's test plan claims "622 tests" but main is at 645+ now — worth re-running locally and updating.

Verdict

Hold for design clarification on the confirmation bypass. The fix is bounded (option a or b above) and doesn't require redesigning the rest of the PR — the interface shape is good, multi_edit is the right first example, and the registry plumbing is correct. Once ctx.registry.execute() either respects confirmations or has a runtime guard preventing the dangerous composition, this is a strong merge.

The atomic naming/semantics is the second thing to resolve before the field name calcifies as part of the public Tool interface. Hard to change later.

…concept Extends the Tool interface with optional `composedOf`, `atomic`, and `ctx` fields so tools can delegate to other tools as a single confirmed unit. The ToolRegistry now builds a ToolExecutionContext (registry reference + optional tmpDir) and passes it to every tool's execute() call, enabling composed tools to call sub-tools without routing back through the executor's confirmation gate. `multi_edit` is the bundled proof of concept: it applies a list of ordered edits to a single file via the registry-backed `edit` tool, confirmed once as an atomic operation instead of once per edit. Closes #65 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…tmpDir Addresses the three real concerns from the #193 review. 1. Confirmation-gate bypass (the high-severity one). ToolRegistry.execute() now refuses to run any tool whose composedOf includes a sub-tool that has its own requiresConfirmation predicate, unless the composing tool sets singleConfirmation: true. Without this guard a tool author writing composedOf: ["bash"] would have silently bypassed the executor's HITL prompt, the permission deny rules from #156, the ask rules from #164, and plan-mode read-only enforcement. The guard runs at execute time (not register time) so registration order is irrelevant. 2. Renamed atomic to singleConfirmation. The original name implied ACID all-or-nothing semantics, but the implementation is fail-fast with possibly partial state (no rollback). singleConfirmation honestly describes what the flag does: one user prompt covers the composite operation. multi_edit's description now spells out explicitly that earlier successful edits are NOT rolled back if a later edit fails — use /rewind or git to undo partial state. Rollback can layer on top later if/when needed via the existing snapshot/rewind machinery. 3. Dropped the unused tmpDir field from ToolExecutionContext. Also tightened multi_edit's "called without ctx" error message from a generic "execution context" to the actionable "must be invoked via ToolRegistry, not called directly." Tests: - Renamed metadata test (atomic → singleConfirmation, same behavior) - New: rejects a composed tool that delegates to a confirmation-gated sub-tool without singleConfirmation (proves the guard fires) - New: allows a composed tool to delegate to a confirmation-gated sub-tool when singleConfirmation is true (proves the guard doesn't over-reject) All 653 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

zjshen14 · 2026-06-01T06:28:31Z

Took this over from the review — three fixes in `721052d` on top of the original commit

Per the review thread above, all three concerns are now addressed without changing the shape of the original PR. The interface evolution, multi_edit proof of concept, and registry plumbing all stand; what's added is a safety guard plus a rename.

1. ✅ Confirmation-gate bypass — runtime guard in `ToolRegistry.execute()`

When called, the registry now checks composedOf against each sub-tool's requiresConfirmation. If a sub-tool has its own confirmation predicate and the composing tool isn't marked singleConfirmation: true, the call fails with an actionable error:

Tool 'unsafe_composer' lists 'dangerous' in composedOf, but 'dangerous' has a requiresConfirmation predicate and 'unsafe_composer' is not marked singleConfirmation: true. This composition would silently bypass the user-confirmation gate on 'dangerous'. Either set singleConfirmation: true on 'unsafe_composer' (so its own confirmation covers the sub-call) or remove 'dangerous' from composedOf.

This catches the composedOf: ["bash"] footgun without restructuring the executor ↔ registry relationship. The guard runs at execute time (not register time) so registration order is irrelevant. Two new tests pin the contract:

Rejects composition without singleConfirmation when sub-tool is gated
Allows composition WITH singleConfirmation (proves the guard doesn't over-reject)

2. ✅ Renamed `atomic` → `singleConfirmation`

atomic implied ACID/rollback semantics, but the implementation is fail-fast with possibly partial state. singleConfirmation honestly describes what the flag does: one user prompt covers the composite operation.

multi_edit's description also now spells out explicitly:

Edits are applied in order; if any edit fails, the operation stops at that point (earlier edits are NOT rolled back — use /rewind or git to undo the partial state).

Rollback can layer on top later via the existing src/state/snapshot.ts machinery if/when needed — but it's a separate concern from confirmation semantics, and conflating them in the field name would be a mistake we'd be stuck with.

3. ✅ Dropped unused `tmpDir?` from `ToolExecutionContext`

Was a draft leftover; nothing populated or read it. Removed.

Smaller polish

multi_edit's "missing ctx" error went from a generic "multi_edit requires an execution context" to actionable "multi_edit must be invoked via ToolRegistry, not called directly."
Updated the registry-level JSDoc to document the composition safety invariant inline.

Status

653 tests pass (+8 over main: 6 original PR tests + 2 new guard tests)
npm run typecheck && npm run lint && npm run format:check clean
Rebased onto current main; ready to merge once CI re-runs

…ile" CodeQL flagged the two new multi_edit tests for writing to fixed filenames under os.tmpdir(). The risk is the classic symlink-race class: an attacker who can pre-create /tmp/opencli-test-<Date.now()>/multi.ts as a symlink could redirect the test's writeFile() to an arbitrary path. The pattern matches the rest of this file (existing tests use the same join(tmpDir, "literal.txt") shape and aren't flagged only because CodeQL ignores pre-existing lines), but the new tests trip the rule. Switch the multi_edit describe block to a per-block mkdtemp() that creates a kernel-allocated random suffix. Scoped to just the new tests to keep the diff small; converting the whole file is a worthwhile follow-up but out of scope for #65. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(eval): make notContains substring-aware so overlap with contains doesn't false-fail The previous scorer flagged a `notContains` hit whenever the forbidden string appeared anywhere in the file, even if it was a substring of a matched `contains` pattern. That broke the obvious YAML form for strict-equality fixes — `contains: '=== ""'`, `notContains: '== ""'` — because the correctly-fixed code contains both: `==` is inside `===`. Switch to position-aware ranges: a `notContains` hit only counts when its [start, end] range is not fully covered by some `contains` range. This preserves the original fix-off-by-one behaviour (where the forbidden pattern extends past the permitted match) and unlocks the natural form for `fix-strict-equality.yaml`, which is tightened in this commit. Two regression tests added in scorer.test.ts pin both directions. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test(eval): add multi_edit scenario exercising tool composition end-to-end #193 introduced multi_edit as a `composedOf: [edit]` tool with a single confirmation gate. Until now its registration path was only covered by unit tests in file.test.ts — there was no end-to-end check that a real agent run against a provider could pick the tool up, apply two coordinated edits, and leave the file in a valid (tsc-clean) state. This scenario prompts the agent to fix both bugs in utils.ts (empty-array reduce + loose equality) using multi_edit explicitly, and asserts the resulting file contains both fixes. Confirmed 21/21 pass against gemini-3.1-flash-lite-preview. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…docs (#207) The multi_edit tool and the composedOf/singleConfirmation composition fields were added in #193 but the three doc files that describe the source structure and Tool interface were not updated. - CLAUDE.md + AGENTS.md: add multi_edit to the file/ tool list - docs/architecture.md: add multi-edit.ts to the file/ tree; add ToolExecutionContext interface and composedOf/singleConfirmation fields to the Tool interface code block; update execute() signature to show the optional ctx parameter Co-authored-by: Claude <noreply@anthropic.com>

zjshen14 mentioned this pull request May 31, 2026

Architecture: design Tool interface for composition — atomic multi-step operations #65

Closed

6 tasks

github-advanced-security AI found potential problems May 31, 2026

View reviewed changes

Comment thread src/tools/file/file.test.ts Fixed

Comment thread src/tools/file/file.test.ts Fixed

zjshen14 commented Jun 1, 2026

View reviewed changes

claude and others added 2 commits May 31, 2026 23:25

zjshen14 force-pushed the feat/issue-65 branch from 47b2ebb to 721052d Compare June 1, 2026 06:28

zjshen14 merged commit 3695156 into main Jun 1, 2026
3 checks passed

zjshen14 deleted the feat/issue-65 branch June 1, 2026 06:39

zjshen14 mentioned this pull request Jun 1, 2026

fix(eval): substring-aware notContains + multi_edit scenario #204

Merged

3 tasks

This was referenced Jun 2, 2026

docs: add multi_edit and tool composition fields to source structure docs #207

Merged

Research: features to learn from OpenHands, Open Interpreter, and OpenCode #43

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(tools): add Tool composition interface with multi_edit proof of concept#193

feat(tools): add Tool composition interface with multi_edit proof of concept#193
zjshen14 merged 3 commits into
mainfrom
feat/issue-65

zjshen14 commented May 31, 2026

Uh oh!

Uh oh!

Uh oh!

zjshen14 left a comment

Uh oh!

zjshen14 commented Jun 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

zjshen14 commented May 31, 2026

Summary

Related issue

Test plan

Uh oh!

Uh oh!

Uh oh!

zjshen14 left a comment

Choose a reason for hiding this comment

Review: substantive change, needs design clarification before merge

Strengths

Real concerns

1. (High) ctx.registry.execute() bypasses every confirmation gate

2. (Medium) atomic is named like ACID but doesn't roll back

3. (Low) tmpDir?: string in ToolExecutionContext is unused

Smaller observations

Verdict

Uh oh!

zjshen14 commented Jun 1, 2026

Took this over from the review — three fixes in 721052d on top of the original commit

1. ✅ Confirmation-gate bypass — runtime guard in ToolRegistry.execute()

2. ✅ Renamed atomic → singleConfirmation

3. ✅ Dropped unused tmpDir? from ToolExecutionContext

Smaller polish

Status

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

1. (High) `ctx.registry.execute()` bypasses every confirmation gate

2. (Medium) `atomic` is named like ACID but doesn't roll back

3. (Low) `tmpDir?: string` in `ToolExecutionContext` is unused

Took this over from the review — three fixes in `721052d` on top of the original commit

1. ✅ Confirmation-gate bypass — runtime guard in `ToolRegistry.execute()`

2. ✅ Renamed `atomic` → `singleConfirmation`

3. ✅ Dropped unused `tmpDir?` from `ToolExecutionContext`