Merge Script Tools API and WebMCP explainers [Part I] #3
ce1d0f8 to
6748764
Compare
anssiko
left a comment
Signing off for the general direction with a non-blocking suggestion to include acknowledgments.
Co-authored-by: Anssi Kostiainen <anssi.kostiainen@gmail.com>
docs/proposal.md
Outdated
| ### Recommendation
| A **hybrid** approach of both of the examples above is recommended as this would make it easy for web developers to get started adding tools to their page, while leaving open the possibility of manifest-based approaches in the future. To implement this hybrid approach, a `"toolcall"` event is dispatched on every incoming tool call _before_ executing the tool's `execute` function. The event handler can handle the tool call by calling the event's `preventDefault()` method, and then responding to the agent with `respondWith()` as shown above. If the event handler does not call `preventDefault()`, the browser's default behavior for tool calls occurs: the `execute` function for the requested tool is called. If a tool with the requested name does not exist, the browser responds to the agent with an error.
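For readers following the thread, the dispatch order in the hunk above can be sketched roughly as follows. This is only a simulation under stated assumptions: `FakeAgent`, `ToolCallEvent`, `handleToolCall`, and the shape of the error result are hypothetical stand-ins for whatever the browser would do internally, and the API is expected to change.

```javascript
// Hypothetical simulation of the hybrid dispatch; none of these classes are browser APIs.
class ToolCallEvent extends Event {
  constructor(name, input) {
    super("toolcall", { cancelable: true });
    this.name = name;
    this.input = input;
    this.response = undefined;
  }

  // Records the page's response (a value or a promise) for the agent.
  respondWith(response) {
    this.response = response;
  }
}

class FakeAgent extends EventTarget {
  #tools = new Map();

  // Replaces the registered tool set, keyed by tool name.
  provideContext({ tools }) {
    this.#tools = new Map(tools.map((tool) => [tool.name, tool]));
  }

  // Stands in for the browser receiving a tool call from the agent.
  async handleToolCall(name, input) {
    // 1. The "toolcall" event fires before any execute function runs.
    const event = new ToolCallEvent(name, input);
    this.dispatchEvent(event);
    if (event.defaultPrevented) {
      return event.response;
    }
    // 2. Default behavior: call the tool's execute function, or report an error.
    const tool = this.#tools.get(name);
    if (!tool) {
      // Assumed error shape, for illustration only.
      return { isError: true, content: [{ type: "text", text: `Unknown tool: ${name}` }] };
    }
    return tool.execute(input);
  }
}

const agent = new FakeAgent();
agent.provideContext({
  tools: [{ name: "add-todo", execute: async ({ text }) => `Added: ${text}` }],
});
agent.addEventListener("toolcall", (event) => {
  if (event.name === "ping") {
    event.preventDefault();
    event.respondWith("pong");
  }
});
```

With this setup, a call to `"ping"` is answered by the event handler, a call to `"add-todo"` falls through to `execute`, and any other name produces the error result.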
We need a deeper discussion of the API, but I'm curious as I read this: I can't tell whether the recommendation is for the JSON manifest-based declaration or the `provideContext`-based API (which includes the `execute` function).
I filed #8 as I'm not convinced we need the declarative form at all.
If we do decide we want this, an alternative approach is to include the declarative part only as a listing for informational/indexing purposes but keep only the procedural API for actually calling the functions. That is, an agent will still need to load the page to use the tool even with the declarative form, so we could force registration and calling to happen in the same way.
We could avoid having a separate, differently shaped, API between the two forms by always requiring agent.provideContext even for declaratively provided tools. If we're worried about duplication between the declarative/procedural forms we could keep a parsed version of the declarative registration available as an object, e.g.
// manifest.json:
{
  "tools": [
    {
      "name": "add-todo",
      "description": "Add a new todo item to the list",
      ...
    }
  ]
}

// js
window.agent.provideContext({
  tools: [
    window.agent.manifestTools['add-todo'],
  ]
});
Agree that we need a deeper discussion around the API. Keeping this doc as is for this PR so we can address this later. At the moment, proposal.md is just a temporary home for stuff that was moved out of explainer.md. I fully expect the API will change dramatically from what's written here, especially since the direction of prior art and the Script Tools API is leaning closer to something like a single defineTool call per tool.
bokand
left a comment
Sorry for the delay.
Still have to finish my pass but sending out a few collected nits - will finish tomorrow morning.
docs/proposal.md
Outdated
| When an agent that is connected to the page sends a tool call, the JavaScript callback is invoked, where the page can handle the tool call and respond to the agent. Simple applications can handle tool calls entirely in page script, but more complex applications may choose to delegate computationally heavy operations to workers and respond to the agent asynchronously.
| Handling tool calls in the main thread with the option of delegating to workers serves a few purposes:
Have you considered delegating to a real MCP server as well? It's one option we've looked at, since "duplicated effort between this and MCP" was a commonly heard concern.
That's not something we considered while writing our original proposal, but worth investigating further.
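As a starting point for that investigation, delegation could be sketched as a page tool whose `execute` simply forwards the call. The `tools/call` method and its `name`/`arguments` parameters follow MCP's JSON-RPC 2.0 framing; everything else here (`delegateToMcp`, the `send` transport function, the tool shape) is a hypothetical illustration, not something from either proposal.

```javascript
// Builds a JSON-RPC 2.0 "tools/call" request in the shape MCP servers accept.
function buildMcpToolCall(id, name, args) {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// Hypothetical helper: wraps a remote MCP tool as a page-registered tool.
// `send` is an assumed transport function that delivers the request to the
// server and resolves with the parsed JSON-RPC response.
function delegateToMcp(name, description, send) {
  let nextId = 1;
  return {
    name,
    description,
    async execute(input) {
      const response = await send(buildMcpToolCall(nextId++, name, input));
      if (response.error) {
        throw new Error(response.error.message);
      }
      return response.result;
    },
  };
}
```

The page would still own registration, so the agent sees one set of tools per tab, while the actual work happens on the server.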
| - Allows additional context different discovery mechanisms without rendering a page.
| **Disadvantages:**
Another disadvantage compared to the imperative form is that these cannot be context-dependent - i.e. the imperative form allows you to reset the set of available tools by calling provideContext. The declarative form is effectively "always available".
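To make the context-dependence point concrete, here is a minimal sketch assuming `provideContext` replaces the whole tool set on each call. The `agent` object is a stand-in for `window.agent`, and the view names and tool names other than `add-todo` are hypothetical.

```javascript
// Stand-in for window.agent: each provideContext call replaces the tool set.
const agent = {
  tools: [],
  provideContext({ tools }) {
    this.tools = tools;
  },
};

// Hypothetical per-view tool sets for a todo app.
const toolsForView = {
  list: [{ name: "add-todo", description: "Add a new todo item to the list" }],
  detail: [{ name: "rename-todo", description: "Rename the open todo item" }],
};

// Called by the app whenever the visible view changes.
function onViewChange(view) {
  agent.provideContext({ tools: toolsForView[view] ?? [] });
}

onViewChange("list");
```

A static manifest has no equivalent of `onViewChange`: whatever it lists is offered regardless of what the page is currently showing.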
|   content: [
|     {
|       type: "text",
|       text: `Stamp "${name}" added successfully! The collection now contains ${stamps.length} stamps.`,
We'll need to specify how this structured output looks. WDYT about including something like the recently added `outputSchema` in MCP?
No need to decide here but filed #9 to discuss.
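For discussion purposes, an `outputSchema`-style declaration might look something like the sketch below. The `outputSchema` and `structuredContent` field names are borrowed from MCP; applying them to this API is an assumption, and the `add-stamp` tool name and the `matchesSchema` helper are hypothetical. The helper is a deliberately tiny check, not a JSON Schema validator.

```javascript
const stamps = [];

// Hypothetical tool definition that declares the shape of its structured result.
const addStampTool = {
  name: "add-stamp",
  description: "Add a new stamp to the collection",
  outputSchema: {
    type: "object",
    properties: { name: { type: "string" }, total: { type: "number" } },
    required: ["name", "total"],
  },
  async execute({ name }) {
    stamps.push({ name });
    return {
      // Human-readable text, as in the hunk above.
      content: [
        {
          type: "text",
          text: `Stamp "${name}" added successfully! The collection now contains ${stamps.length} stamps.`,
        },
      ],
      // Machine-readable result that should conform to outputSchema.
      structuredContent: { name, total: stamps.length },
    };
  },
};

// Minimal check: every required key is present with the declared primitive type.
function matchesSchema(schema, value) {
  return schema.required.every((key) => typeof value[key] === schema.properties[key].type);
}
```

Returning both forms keeps text-only agents working while giving schema-aware agents something they can validate.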
| return {
|   content: [
|     {
|       type: "text",
Have y'all considered non text output? Thinking through examples this seems like it'd be very useful but I'm not sure yet how it'd look.
I was thinking each content item could be either a string or a Blob URI for images and audio. A Blob URI with a supported mimeType is returned to the browser, and that's exposed to the agent in a format that the agent expects (i.e. base64 embedded in JSON).
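The conversion step on the browser side could be sketched roughly like this. It assumes the browser has already resolved the Blob URI to a `Blob`, and the output shape (`type`, `mimeType`, `data`) is modeled on MCP's image and audio content items; `blobToContentItem` is a hypothetical name.

```javascript
// Hypothetical browser-side step: turn a Blob into a base64 content item.
async function blobToContentItem(blob) {
  const bytes = new Uint8Array(await blob.arrayBuffer());
  // Build a binary string byte by byte so btoa can encode it.
  let binary = "";
  for (const byte of bytes) {
    binary += String.fromCharCode(byte);
  }
  const type = blob.type.startsWith("audio/") ? "audio" : "image";
  return { type, mimeType: blob.type, data: btoa(binary) };
}
```

A real implementation would also need to reject unsupported MIME types and bound the payload size, which this sketch leaves out.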
| #### Use a worker
| To improve the user experience and make it possible for the stamp application to handle a large number of tool calls without tying up the document's main thread, the web developer may choose to move the tool handling into a dedicated worker script. Handling tool calls in a worker keeps the UI responsive, and makes it possible to handle potentially long-running operations. For example, if the user asks an AI agent to add a list of hundreds of stamps from an external source such as a spreadsheet, this will result in hundreds of tool calls.
Is this something that could happen at the application level? Tools are already async so an author could just postMessage the real work over to the worker already. Registering tools directly in a worker seems like maybe a small ergonomics improvement for that use case.
One interesting but maybe scary idea was tools in a service worker which could allow tool usage without a browsing context.
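The application-level version could be sketched as follows: the tool stays registered on the page, and its async `execute` forwards the work over `postMessage`. The `callWorker` helper, the `{ id, payload }` message shape, and the `add-stamp` tool name are hypothetical; `FakeWorker` is an in-process stand-in for something like `new Worker("stamp-worker.js")` so the sketch runs without a separate script.

```javascript
let nextRequestId = 0;

// Sends one request to a worker-like object and resolves with the matching reply.
function callWorker(worker, payload) {
  const id = nextRequestId++;
  return new Promise((resolve) => {
    const onMessage = (event) => {
      if (event.data.id !== id) return; // Reply to a different request.
      worker.removeEventListener("message", onMessage);
      resolve(event.data.result);
    };
    worker.addEventListener("message", onMessage);
    worker.postMessage({ id, payload });
  });
}

// In-process stand-in for a dedicated worker; replies asynchronously.
class FakeWorker extends EventTarget {
  postMessage({ id, payload }) {
    queueMicrotask(() => {
      const reply = new Event("message");
      reply.data = { id, result: `Stamp "${payload.name}" added` };
      this.dispatchEvent(reply);
    });
  }
}

const worker = new FakeWorker();

// The tool is still registered by the page; only the work moves off-thread.
const addStampTool = {
  name: "add-stamp",
  execute: (input) => callWorker(worker, input),
};
```

The request id lets hundreds of in-flight tool calls share one worker without their replies getting mixed up.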
Requiring tool registration to happen only in a top-level browsing context makes it easy to reason about which agent sees the tools (it's the agent in the sidebar next to the tab). It also avoids potentially conflicting tool registrations from other frames and workers, and lets the browser enforce that the script registering the tools comes from the same origin. That was the main reason for not considering tool registration directly in workers.
As an alternative to consider, we could define an object which holds the tools (like the AutomationDelegate from the Script Tools proposal). The object can only be created by a document in a top-level frame, but it can be transferred to workers via postMessage. So, web devs who know they want all of their tool handling to happen in a worker can just transfer the object to the worker and register the tools there. This allows the better ergonomics but still enforces the idea of one set of tools per tab.
This is a first pass at unifying Microsoft's Web Model Context and Google Script Tools explainers.
Summary of changes
Follow-ups
The proposal.md still contains just Web Model Context API stuff, more or less verbatim. Still need to converge on a proposed API design that takes Script Tools, WebMCP (MCP-B), and other prior art into account. Keeping that out of this PR to avoid it becoming too large and since the API is still being discussed.
+@khushalsagar