
Authentication and Authorization Design Proposal#241

Merged
volcano-sh-bot merged 13 commits into volcano-sh:main from mahil-2040:proposal/auth-design
Apr 2, 2026

Conversation

@mahil-2040
Contributor

What type of PR is this?

/kind documentation

What this PR does / why we need it:

This PR introduces a design proposal for implementing authentication and authorization in Agentcube.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

Copilot AI review requested due to automatic review settings March 11, 2026 17:15
@volcano-sh-bot volcano-sh-bot added the kind/documentation Improvements or additions to documentation label Mar 11, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive design proposal to overhaul AgentCube's authentication and authorization mechanisms. It aims to establish a robust, unified security model by implementing zero-trust mTLS for internal component communication using SPIFFE/SPIRE, and integrating Keycloak for external user authentication and role-based access control. This strategic shift addresses existing security vulnerabilities, enhances system integrity, and provides a scalable, industry-standard foundation for managing identities and permissions across the platform.

Highlights

  • Internal mTLS Communication: Established zero-trust, mutually authenticated communication between all AgentCube internal components (Router, WorkloadManager, PicoD) using X.509 mTLS via SPIFFE/SPIRE.
  • External User Authentication: Provided external client/SDK authentication at the Router level using Keycloak (OIDC/OAuth2) for identity verification.
  • Role-Based Access Control (RBAC): Implemented role-based access control for external users leveraging Keycloak's built-in authorization capabilities.
  • Deprecation of Legacy Auth: Superseded the existing PicoD-Plain-Authentication key distribution mechanism with automated certificate lifecycle management provided by SPIRE, with a legacy flag for transition.
  • Opt-in Features: Ensured all new authentication and authorization features are opt-in behind configuration flags to avoid affecting existing deployments.


Changelog
  • docs/design/auth-proposal.md
    • Detailed the motivation and goals for a unified security model.
    • Outlined a four-layered design covering internal workload authentication (SPIRE), external user authentication (Keycloak), authorization (Keycloak RBAC), and cloud provider federation (stretch goal).
    • Provided in-depth specifications for SPIRE deployment, workload registration, and mTLS integration within AgentCube components.
    • Described Keycloak deployment, JWT validation, SDK changes, and role hierarchy for authorization.
    • Included architecture diagrams and code snippets to illustrate the proposed solutions.
    • Discussed the impact on existing authentication mechanisms and future enhancements like OPA integration.

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
@gemini-code-assist
Contributor

Warning

Gemini is experiencing higher than usual traffic and was unable to create the review. Please try again in a few hours by commenting /gemini review.

@codecov-commenter

codecov-commenter commented Mar 11, 2026

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 43.35%. Comparing base (845b798) to head (8de1f54).
⚠️ Report is 133 commits behind head on main.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #241      +/-   ##
==========================================
+ Coverage   35.60%   43.35%   +7.74%     
==========================================
  Files          29       30       +1     
  Lines        2533     2611      +78     
==========================================
+ Hits          902     1132     +230     
+ Misses       1505     1358     -147     
+ Partials      126      121       -5     
Flag Coverage Δ
unittests 43.35% <ø> (+7.74%) ⬆️

Contributor

Copilot AI left a comment


Pull request overview

Adds a new design document proposing a unified authentication + authorization model for AgentCube, covering internal component identity and external user access at the Router boundary.

Changes:

  • Introduces a SPIFFE/SPIRE mTLS proposal for Router/WorkloadManager/PicoD internal traffic.
  • Proposes Keycloak-based external authentication (OIDC) and role-based authorization (RBAC) at the Router.
  • Describes migration away from the current PicoD plain-auth/JWT approach toward workload identity.


Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
Member

@hzxuzhonghu hzxuzhonghu left a comment


Posting the main design consistency issues I found in the auth proposal.

…rvices and update PicoD SPIRE selector without namespace constraint

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
Copilot AI review requested due to automatic review settings March 17, 2026 05:35
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.



Member

@hzxuzhonghu hzxuzhonghu left a comment


Posted a few design-consistency comments on the current PR head. The main issues are the missing explanation of how end-user identity/audit context survives the Router -> WorkloadManager hop once that channel becomes pure workload mTLS, and the security implications of using a label-only SPIRE selector for PicoD. I also left two smaller documentation comments around OAuth token refresh behavior and the OPA comparison wording.

```
SDK → Router: HTTPS + Keycloak JWT (see Section 2)
Router → WorkloadManager: mTLS (SPIRE SVIDs)
Router → PicoD: mTLS (SPIRE SVIDs)
```
Member


From the above section, do all the PicoD instances share the same SPIFFE ID?

Previously we did not use TLS between the Router and PicoD; one consideration is latency.
The other concern here is that we also allow users to build their own customized code interpreters.

Contributor Author


Regarding custom code interpreters, could you clarify the concern here?

Is the concern about the added complexity for users building custom interpreters? With mTLS enforced on Router→PicoD, custom interpreters would need to handle TLS certificates.

Can you suggest an approach for supporting custom interpreters alongside mTLS?

- Clarify SDK token lifecycle per grant type (client_credentials vs TokenAuth)
- Fix overstated OPA local-evaluation advantage over Keycloak RBAC
- Harden PicoD SPIRE selector with dedicated ServiceAccount
- Support file-based certs alongside SPIRE for mTLS

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
- Update SPIFFE IDs to use standard Istio-convention format (spiffe://trust-domain/ns/namespace/sa/service-account)
- Add explanation of SPIRE selectors for workload attestation
- Document SPIRE Controller Manager (ClusterSPIFFEID CRDs) as the recommended production approach for workload registration
- Replace go-spiffe dependency with standard Go crypto/tls, unifying mTLS configuration across SPIRE (via spiffe-helper) and file-based modes

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
- Update sequence diagram to show client_credentials grant details (client_id + client_secret) and Keycloak-side validation step
- Add JWKS cache fetch step showing where Router gets signing keys
- Include identity propagation headers on Router->WorkloadManager hop

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
Copilot AI review requested due to automatic review settings March 25, 2026 13:17
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

…kloagmanager

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
… the StatefulSet of SPIRE Server instead of a separate deployment

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
@mahil-2040
Contributor Author

mahil-2040 commented Mar 30, 2026

@hzxuzhonghu PTAL, I have made the changes according to your feedback.


1. **Workload Manager Auth** (`pkg/workloadmanager/auth.go`): Optional Kubernetes TokenReview-based ServiceAccount token validation, gated behind `config.EnableAuth`, plus per-sandbox ownership checks using the extracted user identity (effectively relying on Kubernetes RBAC when using the user-scoped client).
2. **Router → PicoD Auth** (`PicoD-Plain-Authentication-Design`): A custom RSA key-pair scheme where the Router signs JWTs and PicoD verifies them using a public key exposed via the `PICOD_AUTH_PUBLIC_KEY` environment variable. The key pair (`private.pem`, `public.pem`) is stored in the `picod-router-identity` Secret, and the WorkloadManager reads this Secret to inject the public key into PicoD pods. This works for the Router→PicoD channel but leaves other internal channels unauthenticated.
3. **Router → WorkloadManager**: Optional, one-sided authentication. `pkg/router/session_manager.go` can attach an `Authorization: Bearer <serviceaccount token>` header, and WorkloadManager can validate it when `--enable-auth` is enabled. This is not mutual workload identity or a zero-trust model, and when auth is disabled any pod on the cluster network can call the WorkloadManager API.
Member


We should also consider bringing the user's identity in the token.

Contributor Author


So instead of having the Router extract the user identity and pass it blindly via X-AgentCube-User-ID headers to the WorkloadManager, the Router should perform an OAuth 2.0 Token Exchange (RFC 8693) with Keycloak.

The Router would exchange the incoming user token for a new downstream token (where the subject is the user, and the actor claim is the Router) and pass that nested token to the WorkloadManager.

Is my understanding correct?

If yes, should I update the proposal to replace the custom HTTP header approach with this Token Exchange flow as the core requirement, or should we keep the header approach for now and mark Token Exchange as a stretch goal in the proposal?
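For reference, the exchange request in the flow described above could look like this in Go. The parameter names come from RFC 8693; the audience value, token endpoint URL in the comment, and the helper name are illustrative assumptions:

```go
package main

import (
	"fmt"
	"net/url"
)

// buildTokenExchangeForm assembles the RFC 8693 token-exchange request
// body the Router would POST to Keycloak's token endpoint.
func buildTokenExchangeForm(subjectToken, audience string) url.Values {
	return url.Values{
		"grant_type":         {"urn:ietf:params:oauth:grant-type:token-exchange"},
		"subject_token":      {subjectToken}, // the incoming user JWT
		"subject_token_type": {"urn:ietf:params:oauth:token-type:access_token"},
		"audience":           {audience}, // e.g. "workloadmanager" (illustrative)
	}
}

func main() {
	form := buildTokenExchangeForm("eyJ...user", "workloadmanager")
	fmt.Println(form.Get("grant_type"))
	// The Router would then POST form.Encode() with its own
	// client_id/client_secret to the realm's token endpoint, receiving
	// a token whose subject is the user and whose act claim names the
	// Router.
}
```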

Member


Right, I think both the X-AgentCube-User-ID header and token exchange make sense.

Member

@hzxuzhonghu hzxuzhonghu left a comment


This proposal does not cover Router → agent runtime authentication. How could we delegate the user's identity to the agent?

These gaps amount to three distinct problems:

- **Internal (machine-to-machine):** Components trust each other implicitly based on network reachability. A compromised or rogue pod on the same network can impersonate any component.
- **External (user-to-platform):** Anyone who can reach the Router endpoint can invoke sandboxes, with no identity verification and no audit trail.
Member


Can limit only the creator or the group can access a sandbox

Contributor Author


I’ve updated the proposal to make the tenant isolation explicit using Resource-Level Access Control (RLAC):

Creation: When a sandbox is created, the WorkloadManager tags all related resources (CRDs/Pods) with immutable labels like agentcube.io/owner: user-123, based on JWT claims (sub, groups).
Invocation: On each request, the Router checks the caller’s JWT against the sandbox’s labels. If they don’t match (not owner or allowed group), it returns 403 Forbidden.

This ensures strict tenant isolation: no user can access another user's sandbox, even within the same namespace.
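The invocation-time check above can be sketched as follows. This is a minimal illustration: the `agentcube.io/owner` label key comes from the comment, while the group label key and the function name are assumptions.

```go
package main

import "fmt"

// Claims holds the JWT fields the Router extracts (sub and groups, per
// the comment above).
type Claims struct {
	Sub    string
	Groups []string
}

// allowed implements the RLAC check: the caller must be the sandbox
// owner or belong to the allowed group recorded in the immutable
// labels. The agentcube.io/group key is a guessed convention.
func allowed(c Claims, labels map[string]string) bool {
	if labels["agentcube.io/owner"] == c.Sub {
		return true
	}
	for _, g := range c.Groups {
		if labels["agentcube.io/group"] == g {
			return true
		}
	}
	return false
}

func main() {
	labels := map[string]string{"agentcube.io/owner": "user-123"}
	fmt.Println(allowed(Claims{Sub: "user-123"}, labels)) // true (owner)
	fmt.Println(allowed(Claims{Sub: "user-456"}, labels)) // false → Router returns 403
}
```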


---

## 1. Internal Workload Authentication (X.509 mTLS)
Member


Reading through the whole of Section 1, I find it only talks about X.509 certs with SPIRE. That seems not to align with the description below.

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a new section (1.6 File-Based Provisioning with cert-manager) that shows how to achieve the same zero-trust mTLS setup with File-based provisioning. It includes a sample Certificate CRD (mapping uris to SPIFFE IDs) and how the generated Secret plugs into our --mtls-*-file args.

```yaml
metadata:
  name: agentcube-sandbox
spec:
  spiffeIDTemplate: "spiffe://agentcube.local/sa/{{ .PodSpec.ServiceAccountName }}"
```
Member


So by ignoring the namespace here, does it mean all PicoD pods across namespaces would receive an SVID?

And if yes, are they the same?

Contributor Author


Yes, to both questions.

  1. Because sandboxes are provisioned dynamically into arbitrary, user-specified namespaces, we must omit the namespaceSelector. This allows the SPIRE Agent to attest PicoD pods globally across the cluster, provided they possess the required ServiceAccount and label.
  2. Yes, every sandbox receives the exact same SPIFFE ID (spiffe://cluster.local/sa/agentcube-sandbox).

The shared SPIFFE ID purely provides machine-to-machine authentication (proving to the Router that the target is a legitimate, WorkloadManager provisioned PicoD pod).

The actual tenant isolation and authorization (ensuring User A cannot access User B's sandbox, even if they share the same machine identity) is strictly enforced at the Application Layer via the Resource-Level Access Control (RLAC) JWT claim checks detailed in Section 3.6. This prevents the SPIRE Server from being overwhelmed by creating/destroying unique cryptographic identities for every short-lived sandbox instance!


> **Note:** Manual CLI entries are stored in the SPIRE Server's datastore (SQLite by default). They survive restarts if the datastore is backed by persistent storage, but are harder to manage and audit compared to the declarative CRD approach.

#### Attestation Flow (Example: Router)
Member


I am concerned about the delay here. In AgentCube, low latency on sandbox provisioning should be highly prioritized, especially for agentic RL scenarios.

Member


This is a critical issue; maybe we also need JWT-based SVIDs.

Contributor Author


I've looked into this carefully, and the key insight is that JWT-SVIDs in SPIRE still go through the exact same Workload Attestation flow (Agent → kubelet query → selector matching), so the attestation latency is identical regardless of SVID format.

For PicoD sandboxes specifically, the spiffe-helper sidecar fetches the SVID concurrently with PicoD's own container initialization, so attestation (~200-500ms) overlaps with pod startup time rather than adding to it sequentially.

However, for high-throughput scenarios like agentic RL training loops where even this concurrent overhead is unacceptable, our architecture already has a built-in zero-latency alternative: the file-based certificate mode (Section 1.6). In this mode, certificates are pre-provisioned as Kubernetes Secrets and mounted as volumes — they are available the instant the container starts with absolutely no attestation step needed. This makes file-based mode the recommended choice for latency-critical sandbox provisioning.

I've added a dedicated "Latency Considerations for Sandbox Provisioning" note to the proposal to document this tradeoff explicitly.

Member


You are right, the signing is a similar flow to X.509, and the delay is sometimes unacceptable. For code interpreters, we usually need ~100ms bootstrap latency, and in reality the TLS handshake takes some time, maybe tens of milliseconds.

With JWT, the handshake latency can be eliminated. If we have a warm pool, I guess the latency comes entirely from the runtime, not from the auth system.

Contributor Author


Thank you for clarifying. You're absolutely right that the real bottleneck for latency-critical scenarios like Code Interpreters isn't attestation but the TLS handshake overhead itself (~20-50ms per new connection). With JWT, the Router just sends a Bearer token over plain HTTP and PicoD does a ~1ms signature check - the handshake cost is completely eliminated.

Based on your feedback, I've updated the proposal to support two configurable Router→PicoD auth modes via a --picod-auth-mode flag:

  1. mtls (default): Full mutual TLS for security-hardened production deployments.
  2. jwt: Application-layer JWT authentication for latency-critical scenarios (Code Interpreters, agentic RL). Combined with warm pools, this ensures the auth system adds near-zero overhead.

Router↔WorkloadManager always uses mTLS since both are long-lived and the handshake cost is amortized. The existing PicoD-Plain-Auth code path is retained and improved as the JWT mode implementation rather than being deprecated.
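The mode selection described above could be wired up roughly like this. The `--picod-auth-mode` flag name and its two values come from the comment; the validation helper is an illustrative sketch, not the actual implementation:

```go
package main

import (
	"flag"
	"fmt"
)

// parseAuthMode validates the proposed --picod-auth-mode values:
// "mtls" (default) for security-hardened deployments, "jwt" for
// latency-critical paths such as code interpreters and agentic RL.
func parseAuthMode(mode string) (string, error) {
	switch mode {
	case "mtls", "jwt":
		return mode, nil
	default:
		return "", fmt.Errorf("invalid --picod-auth-mode %q (want mtls or jwt)", mode)
	}
}

func main() {
	mode := flag.String("picod-auth-mode", "mtls", "Router→PicoD auth mode: mtls or jwt")
	flag.Parse()
	m, err := parseAuthMode(*mode)
	if err != nil {
		fmt.Println(err)
		return
	}
	// The Router would pick a transport here: a TLS client config for
	// "mtls", or plain HTTP plus a signed bearer token for "jwt".
	fmt.Println("auth mode:", m)
}
```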

Comment on lines +368 to +372
The cert/key/CA files can be provisioned by any mechanism:
- **SPIRE:** The [SPIFFE Helper](https://github.com/spiffe/spiffe-helper) sidecar fetches SVIDs from the Workload API and writes them to disk as PEM files. Certificates are rotated automatically.
- **cert-manager:** Automatically issues and rotates certificates, stored as Kubernetes Secrets and mounted into pods.
- **Self-signed CA:** Generated manually or via a script for development and testing environments.
- **Let's Encrypt / corporate PKI:** Externally issued certificates placed on disk or in Secrets.
Member


Would you implement it this way?

Contributor Author


We are not implementing any integration code for these four mechanisms inside the AgentCube components.

The core of this design is that the AgentCube Go binaries will only read standard certificate files from the local disk using the new --mtls-cert-file, --mtls-key-file, and --mtls-ca-file flags.

Because AgentCube only cares about reading a file, we automatically support all four of those mechanisms out-of-the-box, as they all ultimately just write PEM files to a Kubernetes volume mount. Our implementation will strictly focus on adding the file-reading logic to the Go codebase, and providing example Kubernetes YAML manifests (like the spiffe-helper sidecar) to show users how to mount those files seamlessly.
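A sketch of that file-reading logic under the stated assumptions (the `--mtls-*-file` flag names come from the comment above; the reload-per-handshake pattern is one illustrative way to pick up sidecar-rotated certificates without restarts):

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
)

// newClientTLSConfig builds a tls.Config from the files pointed at by
// --mtls-cert-file, --mtls-key-file, and --mtls-ca-file.
// GetClientCertificate re-reads the key pair on every handshake, so a
// sidecar (spiffe-helper, cert-manager, etc.) can rotate the PEM files
// on disk with zero downtime.
func newClientTLSConfig(certFile, keyFile, caFile string) (*tls.Config, error) {
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		return nil, fmt.Errorf("read CA bundle: %w", err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, fmt.Errorf("no CA certificates parsed from %s", caFile)
	}
	return &tls.Config{
		RootCAs: pool,
		GetClientCertificate: func(*tls.CertificateRequestInfo) (*tls.Certificate, error) {
			cert, err := tls.LoadX509KeyPair(certFile, keyFile) // fresh read each handshake
			return &cert, err
		},
	}, nil
}

func main() {
	// Without provisioned files this fails fast, which is the desired
	// startup behavior when mTLS is enabled but certs are missing.
	if _, err := newClientTLSConfig("tls.crt", "tls.key", "ca.crt"); err != nil {
		fmt.Println("startup error:", err)
	}
}
```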

Member


Yeah, I was asking whether you would provide such an integration guide.


### 2.3 Keycloak Deployment

Keycloak is deployed as a `Deployment` in `agentcube-system`. A dedicated realm called `agentcube` is created during installation containing:
Member


Regarding "A dedicated realm called `agentcube`": I would suggest letting users register to support multi-tenancy.

Contributor Author


The phrasing in the proposal made it sound like the architecture was hardcoded to a single realm name - that was definitely my mistake!

But the Go implementation actually accepts the Keycloak Issuer URL and Realm dynamically through CLI configuration flags (e.g., --keycloak-realm). The agentcube realm is purely just the default name created by the "getting started" Helm chart. Cluster administrators can easily alter the deployment flags to point the Router at whatever custom or multiple distinct realms they need to achieve full multi-tenant isolation.

I have updated the proposal text in Section 2.3 to explicitly state this capability so there's no confusion!

@mahil-2040
Contributor Author

> This proposal does not cover Router → agent runtime authentication. How could we delegate the user's identity to the agent?

You're exactly right that relying solely on mTLS for the Router → Agent Runtime proxy connection drops the user context for the executing agent.

To solve this, we can use the exact same OAuth 2.0 Token Exchange (RFC 8693) pattern that I mentioned for the WorkloadManager.

…support

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
Copilot AI review requested due to automatic review settings March 31, 2026 16:57
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment on lines +307 to +311

```go
// Load CA bundle for peer verification
caCert, err := os.ReadFile(caFile)
caPool := x509.NewCertPool()
caPool.AppendCertsFromPEM(caCert)
```

Copilot AI Mar 31, 2026


The mTLS code snippet overwrites err and never checks the errors from tls.LoadX509KeyPair / os.ReadFile, and it ignores the boolean return from AppendCertsFromPEM. Since this is intended to be copy/paste-able guidance, it should show proper error handling and a failure case when the CA bundle can’t be parsed.

Suggested change:

```go
if err != nil {
	log.Fatalf("failed to load mTLS certificate/key pair (%s, %s): %v", certFile, keyFile, err)
}
// Load CA bundle for peer verification
caCert, err := os.ReadFile(caFile)
if err != nil {
	log.Fatalf("failed to read mTLS CA bundle %s: %v", caFile, err)
}
caPool := x509.NewCertPool()
if ok := caPool.AppendCertsFromPEM(caCert); !ok {
	log.Fatalf("failed to parse any CA certificates from %s", caFile)
}
```

Comment on lines +354 to +360
```go
peerCert, err := x509.ParseCertificate(rawCerts[0])
if err != nil { return err }

// 1. Manually verify against our CA pool
opts := x509.VerifyOptions{ Roots: caPool }
if _, err := peerCert.Verify(opts); err != nil { return err }
```

Copilot AI Mar 31, 2026


In the client-side VerifyPeerCertificate example, peerCert.Verify(x509.VerifyOptions{Roots: caPool}) verifies only the leaf cert and doesn’t populate Intermediates from rawCerts[1:]. This will fail for real-world chains that include intermediates, and it also doesn’t constrain key usage (serverAuth). Consider building an intermediates pool from the remaining certs and setting appropriate KeyUsages in VerifyOptions.

Suggested change:

```go
if len(rawCerts) == 0 {
	return fmt.Errorf("no server certificates provided")
}
peerCert, err := x509.ParseCertificate(rawCerts[0])
if err != nil {
	return err
}
// Build intermediates pool from remaining certificates in the chain, if any.
intermediates := x509.NewCertPool()
for _, certDER := range rawCerts[1:] {
	cert, err := x509.ParseCertificate(certDER)
	if err != nil {
		return err
	}
	intermediates.AddCert(cert)
}
// 1. Manually verify against our CA pool with intermediates and serverAuth key usage.
opts := x509.VerifyOptions{
	Roots:         caPool,
	Intermediates: intermediates,
	KeyUsages:     []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
}
if _, err := peerCert.Verify(opts); err != nil {
	return err
}
```


Keycloak realm roles are organized in a simple hierarchy:

| Role | Permissions | Inherits |
Member


What does Inherits mean? Does sandbox:manage include the permissions of sandbox:invoke plus its own permissions?

Contributor Author


Yes, exactly! In Keycloak this uses "Composite Roles": assigning a user the higher-level sandbox:manage role automatically grants them the sandbox:invoke permissions as well, preventing us from having to assign both roles to administrators.
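For illustration, this is roughly how the composite relationship could be declared in a Keycloak realm-import JSON fragment; the role names come from the proposal, and the surrounding realm configuration is omitted:

```json
{
  "roles": {
    "realm": [
      { "name": "sandbox:invoke" },
      {
        "name": "sandbox:manage",
        "composite": true,
        "composites": { "realm": ["sandbox:invoke"] }
      }
    ]
  }
}
```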

| Role | Permissions | Inherits |
|---|---|---|
| `sandbox:invoke` | Invoke agent runtimes, invoke code interpreters, list sessions | - |
| `sandbox:manage` | Create/delete AgentRuntime and CodeInterpreter CRDs | `sandbox:invoke` |
Member


It seems we create AgentRuntime and CodeInterpreter by running kubectl manually, not from the SDK.

Contributor Author


You're right: currently AgentRuntime and CodeInterpreter CRDs are created via kubectl or the AgentCube CLI (agentcube publish), both of which go directly to the Kubernetes API and are governed by Kubernetes RBAC, not Keycloak.

The sandbox:manage role is defined as a forward-looking placeholder for when CRD lifecycle operations are eventually exposed through the Router. For now, only sandbox:invoke is actively enforced by the Keycloak middleware. I've added a clarifying note to the proposal to make this distinction explicit.


With the SPIRE Controller Manager deployed, registration entries are defined as Kubernetes CRDs. These are managed declaratively alongside other AgentCube manifests and are automatically synced to the SPIRE Server:

Member
Assume this is all installed beforehand, right?

Contributor Author
Not quite beforehand. With the spire-controller-manager, we can bundle these ClusterSPIFFEID manifests directly into the main AgentCube Helm chart/release bundle.
When the admin runs `helm install agentcube`, the WorkloadManager, the Router, and these SPIFFE ID registrations are all applied to the cluster at the same time.

Member
Got it. So please add an install option.
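A hedged sketch of what such an install option might look like in the chart (the value key and template path are illustrative, not part of the current chart):

```yaml
# values.yaml (key name illustrative)
spire:
  registerIdentities: true   # set false if ClusterSPIFFEIDs are managed out-of-band

# templates/cluster-spiffeid.yaml (rendered only when the option is enabled)
# {{- if .Values.spire.registerIdentities }}
#   ...ClusterSPIFFEID manifests for Router, WorkloadManager, PicoD...
# {{- end }}
```

With this flag, clusters that already run their own SPIRE registration pipeline can opt out at install time.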

- Added End-to-End Identity Workflow clarifying Agent vs Controller
- Made Keycloak realm configuration explicitly dynamic to support multi-tenancy
- Updated Go mTLS snippets to demonstrate dynamic zero-downtime certificate
- Added Latency Considerations note detailing file-based mode as the zero-delay solution for
RL scenarios

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
…rhead

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
Copilot AI review requested due to automatic review settings April 2, 2026 05:20
Contributor

Copilot AI left a comment
Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 6 comments.

```yaml
metadata:
  name: agentcube-sandbox
spec:
  spiffeIDTemplate: "spiffe://cluster.local/sa/{{ .PodSpec.ServiceAccountName }}"
```
Copilot AI Apr 2, 2026
The ClusterSPIFFEID example for PicoD only matches on app: picod, but earlier in this document you state PicoD identity should also be constrained by a dedicated ServiceAccount (and the manual CLI example includes k8s:sa:agentcube-sandbox). As written, this example would allow any pod that can set app: picod to obtain the sandbox SPIFFE ID; please add the ServiceAccount constraint (and any other required selectors) or note that this snippet is incomplete.

Suggested change:

```yaml
spiffeIDTemplate: "spiffe://cluster.local/sa/{{ .PodSpec.ServiceAccountName }}"
# NOTE: This example is intentionally minimal for illustration. In production,
# you MUST also constrain this identity by a dedicated ServiceAccount
# (for example, `agentcube-sandbox`) and any other required selectors so that
# only trusted PicoD pods can obtain the sandbox SPIFFE ID.
```

Comment on lines +376 to +382
```go
peerCert, err := x509.ParseCertificate(rawCerts[0])
if err != nil { return err }

// 1. Manually verify against our CA pool
opts := x509.VerifyOptions{ Roots: caPool }
if _, err := peerCert.Verify(opts); err != nil { return err }
```
Copilot AI Apr 2, 2026
In the client-side TLS verification example, peerCert.Verify is called with only Roots: caPool. If the server cert chain includes intermediates (common with many PKIs/cert-manager issuers), verification will fail unless the intermediates from rawCerts[1:] are added to VerifyOptions.Intermediates. Also, when using InsecureSkipVerify: true, it’s important that the custom callback fully replicates normal TLS verification semantics.

Suggested change (replacing the snippet above with full chain verification):

```go
if len(rawCerts) == 0 {
	return fmt.Errorf("no server certificates provided")
}

// Parse the leaf (peer) certificate
peerCert, err := x509.ParseCertificate(rawCerts[0])
if err != nil {
	return err
}

// Parse and add any intermediate certificates presented by the server
intermediates := x509.NewCertPool()
for _, raw := range rawCerts[1:] {
	cert, err := x509.ParseCertificate(raw)
	if err != nil {
		return err
	}
	intermediates.AddCert(cert)
}

// 1. Manually verify the certificate chain against our CA pool
opts := x509.VerifyOptions{
	Roots:         caPool,
	Intermediates: intermediates,
}
if _, err := peerCert.Verify(opts); err != nil {
	return err
}
```

Comment on lines +566 to +586
```go
type KeycloakConfig struct {
	IssuerURL    string        // e.g., "https://keycloak.agentcube-system.svc:8443"
	Realm        string        // Default: "agentcube"
	Audience     string        // expected audience claim
	JWKSCacheTTL time.Duration // how often to refresh JWKS keys
}

func NewKeycloakValidator(cfg KeycloakConfig) (*KeycloakValidator, error) {
	jwksURL := fmt.Sprintf("%s/realms/%s/protocol/openid-connect/certs",
		cfg.IssuerURL, cfg.Realm)

	// JWKS keys are cached locally - no per-request calls to Keycloak
	keySet := jwk.NewCache(context.Background())
	keySet.Register(jwksURL, jwk.WithRefreshInterval(cfg.JWKSCacheTTL))

	return &KeycloakValidator{
		issuer:   fmt.Sprintf("%s/realms/%s", cfg.IssuerURL, cfg.Realm),
		audience: cfg.Audience,
		keySet:   keySet,
	}, nil
}
```
Copilot AI Apr 2, 2026
The JWKS caching snippet appears to use a jwk.NewCache API, but this repo doesn’t currently depend on a JWKS/JWK cache library and the snippet doesn’t show how the cached keys are actually used for JWT verification (and it also ignores the error return from Register if present). Please either mark this as pseudocode and name the intended library, or update to a concrete, correct example (e.g., using an OIDC/JWKS client that returns a keyfunc for github.com/golang-jwt/jwt/v5).

Suggested change (a concrete example using `github.com/MicahParks/keyfunc` and `github.com/golang-jwt/jwt/v5`):

```go
package auth

import (
	"context"
	"fmt"
	"time"

	"github.com/MicahParks/keyfunc"
	"github.com/golang-jwt/jwt/v5"
)

// KeycloakConfig configures JWT validation against a Keycloak realm.
type KeycloakConfig struct {
	IssuerURL    string        // e.g., "https://keycloak.agentcube-system.svc:8443"
	Realm        string        // default: "agentcube"
	Audience     string        // expected audience claim
	JWKSCacheTTL time.Duration // how often to refresh JWKS keys
}

// KeycloakValidator validates JWTs using a cached JWKS from Keycloak.
type KeycloakValidator struct {
	issuer   string
	audience string
	keyFunc  jwt.Keyfunc
}

func NewKeycloakValidator(cfg KeycloakConfig) (*KeycloakValidator, error) {
	jwksURL := fmt.Sprintf("%s/realms/%s/protocol/openid-connect/certs",
		cfg.IssuerURL, cfg.Realm)

	// JWKS keys are cached locally by the keyfunc library - no per-request calls to Keycloak.
	options := keyfunc.Options{
		RefreshInterval:   cfg.JWKSCacheTTL,
		RefreshUnknownKID: true,
		// In production code, wire this to your logger instead of ignoring the error.
		RefreshErrorHandler: func(err error) {
			// log refresh error
			_ = err
		},
	}
	jwks, err := keyfunc.Get(jwksURL, options)
	if err != nil {
		return nil, fmt.Errorf("create JWKS from %q: %w", jwksURL, err)
	}

	return &KeycloakValidator{
		issuer:   fmt.Sprintf("%s/realms/%s", cfg.IssuerURL, cfg.Realm),
		audience: cfg.Audience,
		keyFunc:  jwks.Keyfunc,
	}, nil
}

// ValidateToken verifies the signature and standard claims of a Keycloak-issued JWT.
func (v *KeycloakValidator) ValidateToken(ctx context.Context, rawToken string) (*jwt.RegisteredClaims, error) {
	claims := &jwt.RegisteredClaims{}
	token, err := jwt.ParseWithClaims(rawToken, claims, v.keyFunc,
		jwt.WithIssuer(v.issuer),
		jwt.WithAudience(v.audience),
	)
	if err != nil {
		return nil, fmt.Errorf("parse token: %w", err)
	}
	if !token.Valid {
		return nil, fmt.Errorf("invalid token")
	}
	return claims, nil
}
```


alt Valid token + authorized role
Note over Router, KC: Token Exchange (RFC 8693)
Router->>KC: POST /token<br/>grant_type=urn:ietf:params:oauth:grant-type:token-exchange<br/>subject_token=<sdk_jwt><br/>audience=workloadmanager,picod
Copilot AI Apr 2, 2026
In the sequence diagram, the token exchange call is shown as POST /token, but elsewhere you use Keycloak’s token endpoint under /realms/<realm>/protocol/openid-connect/token. For Keycloak, token exchange uses the same token endpoint, so the diagram should use the full OIDC token URL to avoid misimplementation.

Suggested change (use the full OIDC token endpoint URL):

Router->>KC: POST /realms/<realm>/protocol/openid-connect/token<br/>grant_type=urn:ietf:params:oauth:grant-type:token-exchange<br/>subject_token=<sdk_jwt><br/>audience=workloadmanager,picod


To enforce this resource-level tenant isolation:

1. **Creation Tagging:** When WorkloadManager creates an `AgentRuntime` or `CodeInterpreter`, it reads the `sub` (user ID) and optionally the `groups` claim from the exchanged Keycloak JWT. It applies these as immutable labels on the generated Kubernetes CRDs and Pods (e.g., `agentcube.io/owner: user-123`, `agentcube.io/group: engineering`).
Copilot AI Apr 2, 2026
This states “When WorkloadManager creates an AgentRuntime or CodeInterpreter…”, but in the current architecture WorkloadManager creates Sandbox/SandboxClaim resources based on existing AgentRuntime/CodeInterpreter CRDs (it doesn’t create those CRDs). Suggest rewording to reflect labeling the sandbox (and/or its PodTemplate) created for a given AgentRuntime/CodeInterpreter invocation.

Suggested change:

1. **Creation Tagging:** When WorkloadManager provisions a sandbox (e.g., `Sandbox` / `SandboxClaim`) for an invocation of an existing `AgentRuntime` or `CodeInterpreter` CRD, it reads the `sub` (user ID) and optionally the `groups` claim from the exchanged Keycloak JWT. It applies these as immutable labels on the sandbox resources and their underlying Pods/PodTemplates (e.g., `agentcube.io/owner: user-123`, `agentcube.io/group: engineering`).
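The resulting sandbox object might then carry labels like the following; the API group/version and resource kind are assumptions for illustration, only the label keys come from the text above:

```yaml
apiVersion: agentcube.volcano.sh/v1alpha1  # assumed group/version
kind: Sandbox
metadata:
  name: sandbox-abc123
  labels:
    agentcube.io/owner: user-123      # from the JWT `sub` claim
    agentcube.io/group: engineering   # from the JWT `groups` claim (optional)
```

List/delete requests can then be filtered server-side with a label selector on `agentcube.io/owner`.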

| **Mechanism** | Transport-layer mutual TLS | Application-layer `Authorization: Bearer` header |
| **First-connection overhead** | ~20-50ms (TLS handshake) | ~1ms (JWT signature verification) |
| **Certificate provisioning** | SPIRE SVID or file-based (cert-manager) | Router signs JWT; PicoD validates with pre-shared public key |
| **Key rotation** | Automatic (SPIRE 1h TTL or cert-manager) | Automatic (Router key rotation) |
Copilot AI Apr 2, 2026
The table claims JWT mode has “Automatic (Router key rotation)”, but the current Router JWT implementation stores a long-lived RSA keypair in the picod-router-identity Secret and doesn’t rotate it automatically. If the proposal includes adding rotation, it would help to call that out explicitly (how/when rotation happens and how PicoD gets the new public key) rather than implying it already exists.

Suggested change:

| **Key rotation** | Automatic (SPIRE 1h TTL or cert-manager) | Manual (long-lived Router keypair in `picod-router-identity` Secret; no automatic rotation yet) |

Member
@hzxuzhonghu hzxuzhonghu left a comment


Generally LG. @YaoZengzeng to have another look.

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hzxuzhonghu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@YaoZengzeng
Member

/lgtm

@volcano-sh-bot volcano-sh-bot merged commit 4507820 into volcano-sh:main Apr 2, 2026
17 of 18 checks passed

Labels: approved, kind/documentation, lgtm, size/XL

6 participants