
fix(studio): server-side tool execution via needsApproval flow (AI-658)#45556

Draft
mattrossman wants to merge 8 commits into
master from
mattrossman/ai-658-server-side-approval

Conversation

@mattrossman
Contributor

Second attempt at fixing AI-658, superseding #45339.

Instead of patching tracing around client-side SQL execution, this refactors execute_sql and deploy_edge_function to use AI SDK's needsApproval pattern — execution moves to the server, which resolves the split-trace problem structurally.

  • execute_sql and deploy_edge_function tools now have needsApproval: true and server-side execute handlers; client no longer runs mutations directly
  • UI approval flow uses addToolApprovalResponse instead of addToolResult; DisplayBlockRenderer and EdgeFunctionRenderer are stripped of client-side mutation hooks
  • Braintrust span context is threaded across the approval boundary so Turn 1 and Turn 2 land in the same trace; Turn 1 suppresses online scoring until Turn 2 logs the combined output
  • Restores edge function replace-warning guard (regression from initial attempt)

Closes AI-658

References

- Upgrade ai package to 6.0.173
- Add needsApproval + server-side execute to execute_sql and deploy_edge_function
- Switch sendAutomaticallyWhen to lastAssistantMessageIsCompleteWithApprovalResponses
- Client uses addToolApprovalResponse instead of addToolResult
- Remove client-side SQL/deploy execution from renderers; move to server
- Skip Braintrust span output on Turn 1 (approval boundary); recover approved tool data from rawMessages on Turn 2
- Add ai to pnpm minimumReleaseAgeExclude to unblock same-day publish
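The shape of the change can be sketched as follows. This is an illustrative, self-contained stub, not the actual Studio code: the real tools are built with the AI SDK's `tool()` helper, and the type names here are invented for the sketch.

```typescript
// Illustrative stub of the needsApproval pattern described above.
// The real code uses the AI SDK's `tool()` helper; this sketch only
// models the shape so the flow is visible without external imports.

type SqlInput = { sql: string };
type SqlResult = { rows: unknown[] } | { error: string };

interface ApprovableTool<In, Out> {
  needsApproval: boolean;               // model pauses and waits for the client's approval
  execute: (input: In) => Promise<Out>; // now runs on the server, inside the same trace
}

const executeSqlTool: ApprovableTool<SqlInput, SqlResult> = {
  needsApproval: true,
  execute: async ({ sql }) => {
    // Server-side execution: the pre-approval and post-approval turns
    // now share one trace, which is what fixes the split-trace problem.
    if (!sql.trim()) return { error: 'empty statement' };
    return { rows: [] }; // placeholder; the real handler runs the query
  },
};
```

On the client, the approval response is sent with `addToolApprovalResponse` (rather than `addToolResult`), and the follow-up request fires once `lastAssistantMessageIsCompleteWithApprovalResponses` is satisfied, per the bullet list above.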
@vercel

vercel Bot commented May 4, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| studio-self-hosted | Ready | Preview, Comment | May 4, 2026 9:01pm |
| studio-staging | Ready | Preview, Comment | May 4, 2026 9:01pm |

**6 Skipped Deployments**

| Project | Deployment | Updated (UTC) |
| --- | --- | --- |
| studio | Ignored | May 4, 2026 9:01pm |
| design-system | Skipped | May 4, 2026 9:01pm |
| docs | Skipped | May 4, 2026 9:01pm |
| learn | Skipped | May 4, 2026 9:01pm |
| ui-library | Skipped | May 4, 2026 9:01pm |
| zone-www-dot-com | Skipped | May 4, 2026 9:01pm |


@supabase

supabase Bot commented May 4, 2026

This pull request has been ignored for the connected project xguihxuzqibwxjnimxev because there are no changes detected in supabase directory. You can change this behaviour in Project Integrations Settings ↗︎.


Preview Branches by Supabase.
Learn more about Supabase Branching ↗︎.

@coderabbitai
Contributor

coderabbitai Bot commented May 4, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 0178e3ca-4e86-408e-bac0-3ea7c41d12ff

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.



Comment @coderabbitai help to get the list of available commands and usage tips.

…I-658)

- Add isApprovalContinuation/getLastUserText helpers to message-utils.ts
- Merge duplicate JSX branches in MessagePartExecuteSql
- Fix existingFunction race in EdgeFunctionRenderer (restore eager fetch)
- Remove parent span threading so Turn 2 is its own root trace and
  online scorers see input/output on the root span
- Recover Turn 1 text parts from rawMessages so the full response
  (pre- and post-approval) appears in the output
- Use shared isApprovalContinuation/getLastUserText helpers
- Replace part.type.replace('tool-', '') with getToolName()
mattrossman added a commit that referenced this pull request May 12, 2026
…h needsApproval (#45654)

## Motivation

When Assistant runs a potentially destructive tool like `execute_sql`,
it stops the LLM request and prompts for client-side approval and
execution of the tool. After approval, a second request kicks off under
a separate trace. This has made scoring and
[Topics](https://www.braintrust.dev/blog/topics) classification
challenging, as the generated `output` is split across stateless
requests. The [span-level
scoring](https://www.braintrust.dev/docs/evaluate/custom-code#score-spans)
approach we've used thusfar (after the LLM call, we massage the result
into an `output` payload that's stuck onto the root span) has been
cumbersome and led to invalid scores / topics where only part of the
assistant response is considered. It's also inefficient, as we're
duplicating potentially large info (like the `search_docs` output) that
already exists within the trace.

An alternative to scoring spans is to [score
traces](https://www.braintrust.dev/docs/evaluate/custom-code#score-traces).
Braintrust [best
practices](https://www.braintrust.dev/docs/evaluate/score-online#best-practices)
advise:

> Use span scope for evaluating individual operations or outputs. Use
trace scope for evaluating multi-turn conversations, overall workflow
completion, or when your scorer needs access to the full execution
context.

We've also received [direct
guidance](https://supabase.slack.com/archives/C05QYJBLX89/p1777925770927149?thread_ts=1777905716.911979&cid=C05QYJBLX89)
from their team to use this approach.

## Changes

Migrates eval scorers from custom `AssistantEvalOutput` shape to
trace-level scoring via `trace.getThread()` / `trace.getSpans()`, with
thread parsing that scores the full latest Assistant turn and passes
prior conversation separately where relevant.
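A trace-level scorer then sees the whole trace rather than one span. A hedged sketch: `getThread()`/`getSpans()` mirror the Braintrust helpers named above, but the surrounding types and the toy heuristic are illustrative stubs, not the real scorer.

```typescript
// Illustrative trace-scorer shape; `getThread`/`getSpans` mirror the
// Braintrust helpers named in this PR, stubbed here for self-containment.
type Span = { name: string; input?: unknown; output?: unknown };
type Trace = { getThread: () => string; getSpans: () => Span[] };

function scoreCompleteness(trace: Trace): number {
  const thread = trace.getThread(); // full serialized conversation
  const spans = trace.getSpans();   // every span, including tool calls
  // Toy heuristic standing in for an LLM judge: the serialized response
  // should carry a `[called tool_name]` marker for each tool that ran.
  const toolSpans = spans.filter((s) => s.name.startsWith('tool.'));
  const covered = toolSpans.filter((s) =>
    thread.includes(`[called ${s.name.slice('tool.'.length)}]`)
  );
  return toolSpans.length === 0 ? 1 : covered.length / toolSpans.length;
}
```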

Moves `execute_sql` and `deploy_edge_function` from client-side
execution after approval to AI SDK `needsApproval` + server-side
`execute()`. SQL results returned to the model are gated by AI opt-in
level, so row data is only included with `schema_and_log_and_data`;
otherwise the tool returns the no-data-permissions sentinel.
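The gating reduces to a small check. A minimal sketch, assuming the opt-in level names other than `schema_and_log_and_data` and the sentinel text, which are placeholders here:

```typescript
// Sketch of the opt-in gating described above. Only the highest level,
// `schema_and_log_and_data` (from the PR), returns row data to the model;
// the other level names and the sentinel string are illustrative.
type AiOptInLevel =
  | 'disabled'
  | 'schema'
  | 'schema_and_log'
  | 'schema_and_log_and_data';

const NO_DATA_SENTINEL =
  '<row data withheld: AI data permissions not granted>';

function gateSqlResult(
  level: AiOptInLevel,
  rows: unknown[]
): unknown[] | string {
  // Row data only flows to the model at the highest opt-in level.
  return level === 'schema_and_log_and_data' ? rows : NO_DATA_SENTINEL;
}
```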

Adds `metadata.isFinalStep` to disambiguate multiple LLM requests within
an "assistant" turn due to tool call requests/responses. For online
evals, this means we should configure automations to only score traces
with `metadata.isFinalStep = true` to ensure we're judging the complete
generated response.
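The disambiguation itself is simple; a toy model (the real metadata is attached by the chat route, so this helper is hypothetical):

```typescript
// An assistant "turn" may span several LLM requests (tool round-trips);
// only the last request in the turn carries isFinalStep = true.
function stepMetadata(stepIndex: number, totalSteps: number) {
  return { isFinalStep: stepIndex === totalSteps - 1 };
}
```

An online-eval automation would then filter on `metadata.isFinalStep = true` so only the completed response is judged.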

Other minor kaizen changes:
- Renamed `promptProviderOptions` to `systemProviderOptions` to clarify
that this is associated with the "system" message and disambiguate from
the root `providerOptions`
- Adds `evals/trace-utils.ts` to handle Zod validation of the `unknown`
span shapes from Braintrust, to more easily access typed inputs/output
on tool spans.
- Bumps AI SDK floor version `^6.0.116` → `^6.0.174`
- Tweaked the "Conciseness" scorer to not unfairly dock points for the
new `[called tool_name]` labels in serialized assistant response

## Verification

In the studio staging build, I asked Assistant to create a todos table
with 3 sample todos. I manually approved the `execute_sql` call and saw
Assistant generate text before & after the call.

In Braintrust I verified two traces were produced (see [filtered
logs](https://www.braintrust.dev/app/supabase.io/p/Assistant/logs?v=Staging&tvt=trace&search={%22filter%22:[{%22text%22:%22metadata.environment%2520%253D%2520%27staging%27%22,%22label%22:%22metadata.environment%2520%253D%2520%27staging%27%22,%22originType%22:%22btql%22},{%22text%22:%22%2560Chat%2520ID%2560%2520%253D%2520%25221cb2ac45-e5e7-458c-9da4-3bf6863b8842%2522%22,%22label%22:%22Chat%2520ID%2520equals%25201cb2ac45-e5e7-458c-9da4-3bf6863b8842%22,%22originType%22:%22form%22}]})),
the first with `metadata.isFinalStep = false` and the second with
`metadata.isFinalStep = true`.

In the Braintrust staging scorers, I ran the preview Completeness scorer
on the second trace and verified it sees the complete Assistant response
including markers for tool calls ([link to
trace](https://www.braintrust.dev/app/supabase.io/p/Assistant%20(Staging%20Scorers)/trace?object_type=project_logs&object_id=b5214b62-ad1e-4929-9d5b-40b1daebe948&r=0ed0a4f8-8aff-4a34-bb1d-1df1d88a5070&s=ff9015f8-6bf7-4ab3-83a9-ca4e69e27e82))

<img width="1193" height="960" alt="CleanShot 2026-05-07 at 11 27 10@2x"
src="https://github.com/user-attachments/assets/509d4858-c3a1-4068-986d-3aa4d5617d1a"
/>

I also tested the `deploy_edge_function` workflow and verified it still
prompts for permission and warns on deployment of existing functions.

**References**
- https://www.braintrust.dev/docs/evaluate/custom-code#score-traces
-
https://ai-sdk.dev/docs/ai-sdk-core/tools-and-tool-calling#tool-execution-approval

Supersedes #45556 and
#45339

Closes AI-473

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  * Tool actions (SQL execution, edge-function deploy) now require explicit user Approve/Deny before proceeding.

* **Improvements**
  * Assistant pauses for approval responses before sending follow-ups, giving clearer control over risky actions.
  * Deploy/replace flows show confirmation and clearer replace warnings.
  * Evaluation/scoring updated to use richer trace data for more accurate assistant performance signals.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
