Tool Use: decouple execution and formalize function calls and responses #159

@michaelwasserman

Background

The initial design integrates JS tool execution within the prompt() call itself. The browser process operates as a hidden agent, looping on (model prediction → tool execution → response observation) until a final text response is ready.

While an agentic closed loop prioritizes API simplicity, it detracts from the design goals of offering granular control and direct Language Model interaction through this lower-level API.

Motivation

We'd like to realign initial tool use integrations with Prompt API objectives:

  • Provide essential Language Model tool types, i.e. Function Declarations (FD), Function Calls (FC), and Function Responses (FR).
    • These can be used by API clients to inspect, reconstruct, and test session history.
  • Offer fine-grained client control over tool execution loops used for agentic integrations.
    • This enables clients to define the looping patterns, error handling, limits, etc.
  • Align with patterns established by major LLM APIs (OpenAI, Gemini, Claude).
    • This empowers clients to use the Prompt API more interchangeably with server-based APIs.

Proposal

We propose a shift from an agentic, closed-loop model to an API-centric, open-loop model for tool execution in the Prompt API. The prompt() method should return a structured Function Call object to the client, which then manages the execution and Function Response feedback loop, maximizing developer control and observability. Function Declarations also need not provide execute functions for now.

The API should initially provide more granular functionality, as ease of use can be more readily provided by client libraries and future API enhancements.

| Design Principle | Closed-Loop (Original) | Open-Loop (Proposed) |
| --- | --- | --- |
| Developer Control | Limited; tool execution is encapsulated; interventions are ad hoc in tool bodies. | Full; the client controls logic, guardrails, and error handling. |
| Debuggability | Poor; the trajectory of tool calls is hidden; clients define bespoke call and response representations. | Excellent; each tool call and response is a discrete step handled by the client. |
| Industry Alignment | Divergent; forces custom agent code. | Aligned; improves portability of agentic code across client-side and server-side models. |

Specific Changes

To enable the open-loop model, Function Call and Function Response must be formalized as first-class elements in the API message flow:

Riffing on @FrankLi-MSFT's prototype and @jingyun19's design, a declaration would look like:

// The declaration for a tool that a language model can invoke.
dictionary LanguageModelToolDeclaration {
  required DOMString name;
  required DOMString description;
  // JSON schema for the input parameters.
  required object inputSchema;
  // Maybe add an optional/informational responseSchema per #137?
};

dictionary LanguageModelCreateCoreOptions {
  // ... existing members ...
  // Tools that the language model can use.
  sequence<LanguageModelToolDeclaration> tools;
};
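
For illustration, a minimal sketch of registering a declaration-only tool at session creation might look like the following. The `getWeather` tool, its schema, and the exact option shapes are hypothetical, and it assumes `expectedOutputs` is extended to accept `tool-call`:

// Sketch only: a declaration-only tool (no execute function) registered at create time.
// "getWeather" and its inputSchema are hypothetical examples.
const session = await LanguageModel.create({
  tools: [{
    name: "getWeather",
    description: "Returns the current weather for a given city.",
    inputSchema: {
      type: "object",
      properties: { city: { type: "string" } },
      required: ["city"],
    },
  }],
  // Assumes expectedOutputs can include "tool-call" so the model may emit Function Calls.
  expectedOutputs: [{ type: "text" }, { type: "tool-call" }],
});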

Then, tool calls and responses are made accessible to the API client. This definitely needs some workshopping!

// Represents a tool call requested by the language model.
dictionary LanguageModelToolCall {
  // Unique identifier for this tool call within the session.
  required DOMString callID;
  required DOMString name;
  // An object fitting the JSON input schema for the tool.
  object input;
};

// Represents the response from executing a tool call.
dictionary LanguageModelToolResponse {
  // Matches the callID from the corresponding tool call.
  required DOMString callID;
  // Response from tool execution. (using an object, rather than a string, per #138)
  // Maybe this should be a sequence of LanguageModelMessageContent?
  required object response;
  // Maybe add explicit success/error signals?
};
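
As a rough illustration, a tool call surfaced to the client and the response the client echoes back might look like this (field values are hypothetical):

// Illustrative shapes only; values are hypothetical.
const toolCall = { callID: "call-1", name: "getWeather", input: { city: "Tokyo" } };
const toolResponse = { callID: "call-1", response: { temperatureC: 21, conditions: "clear" } };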

enum LanguageModelMessageType { "text", "image", "audio", "tool-call", "tool-response" };

typedef (
  ImageBitmapSource
  or AudioBuffer
  or HTMLAudioElement
  or BufferSource
  or DOMString
  or LanguageModelToolCall  // NEW
  or LanguageModelToolResponse  // NEW
) LanguageModelMessageValue;

// NEW: When `expectedOutputs` has non-text (e.g. `tool-call`), this now yields a content dictionary sequence.
Promise<DOMString or sequence<LanguageModelMessageContent>> prompt(
    LanguageModelPrompt input,
    optional LanguageModelPromptOptions options = {}
  );

// NEW: When `expectedOutputs` has non-text (e.g. `tool-call`), stream chunks are now content dictionaries.
ReadableStream promptStreaming(
  LanguageModelPrompt input,
  optional LanguageModelPromptOptions options = {}
);
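
Putting these pieces together, a client-managed loop might look roughly like the sketch below. It assumes prompt() returns a content dictionary sequence when the model requests tool calls and a plain string otherwise, and that Function Responses are fed back as `tool-response` content in a follow-up prompt; the message shape and the `executeTool` helper are illustrative, not settled API.

// Sketch of the proposed open loop; message shapes and executeTool are illustrative.
let result = await session.prompt("What's the weather in Tokyo?");

// A string result means the model finished with text; a sequence may contain tool calls.
while (Array.isArray(result)) {
  const toolResponses = [];
  for (const content of result) {
    if (content.type !== "tool-call") continue;
    const call = content.value; // LanguageModelToolCall
    const output = await executeTool(call.name, call.input); // client-defined execution
    toolResponses.push({
      type: "tool-response",
      value: { callID: call.callID, response: output },
    });
  }
  if (toolResponses.length === 0) break; // no tool calls requested; nothing to feed back
  // Send the Function Responses back; the model may request more tools or finish with text.
  result = await session.prompt([{ role: "user", content: toolResponses }]);
}

console.log(result); // final text (or remaining content) once no more tool calls are requested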

We'll need to resolve a number of finer points, e.g.

  • Behavior when supplying responseConstraint (maybe those requests won't yield tool calls?).
  • Behavior when tools are declared but expectedOutputs does not contain tool-call (should the model still yield tool calls, or not?).
  • Whether LanguageModelMessageValue should be split to represent model inputs and outputs.
  • Behavior when responses are not input immediately after tool calls (do warnings suffice, is prompt queueing a footgun?).
  • Ability to specify tool call frequency or specific occurrences amid constrained responses.
  • Probably a lot more!

In any case, the design aims to provide the necessary low-level building blocks for developers to construct robust, observable, and fully customized tool use workflows. This proposal is still flexible, and we're seeking design feedback!

Much credit for these explorations should be given to @FrankLi-MSFT and @jingyun19.
