
Authentication and Authorization Design Proposal#241

Merged
volcano-sh-bot merged 13 commits into volcano-sh:main from mahil-2040:proposal/auth-design
Apr 2, 2026

Conversation

@mahil-2040
Contributor

What type of PR is this?

/kind documentation

What this PR does / why we need it:

This PR introduces a design proposal for implementing authentication and authorization in Agentcube.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

Copilot AI review requested due to automatic review settings March 11, 2026 17:15
@volcano-sh-bot volcano-sh-bot added the kind/documentation Improvements or additions to documentation label Mar 11, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive design proposal to overhaul AgentCube's authentication and authorization mechanisms. It aims to establish a robust, unified security model by implementing zero-trust mTLS for internal component communication using SPIFFE/SPIRE, and integrating Keycloak for external user authentication and role-based access control. This strategic shift addresses existing security vulnerabilities, enhances system integrity, and provides a scalable, industry-standard foundation for managing identities and permissions across the platform.

Highlights

  • Internal mTLS Communication: Established zero-trust, mutually authenticated communication between all AgentCube internal components (Router, WorkloadManager, PicoD) using X.509 mTLS via SPIFFE/SPIRE.
  • External User Authentication: Provided external client/SDK authentication at the Router level using Keycloak (OIDC/OAuth2) for identity verification.
  • Role-Based Access Control (RBAC): Implemented role-based access control for external users leveraging Keycloak's built-in authorization capabilities.
  • Deprecation of Legacy Auth: Superseded the existing PicoD-Plain-Authentication key distribution mechanism with automated certificate lifecycle management provided by SPIRE, with a legacy flag for transition.
  • Opt-in Features: Ensured all new authentication and authorization features are opt-in behind configuration flags to avoid affecting existing deployments.


Changelog
  • docs/design/auth-proposal.md
    • Detailed the motivation and goals for a unified security model.
    • Outlined a four-layered design covering internal workload authentication (SPIRE), external user authentication (Keycloak), authorization (Keycloak RBAC), and cloud provider federation (stretch goal).
    • Provided in-depth specifications for SPIRE deployment, workload registration, and mTLS integration within AgentCube components.
    • Described Keycloak deployment, JWT validation, SDK changes, and role hierarchy for authorization.
    • Included architecture diagrams and code snippets to illustrate the proposed solutions.
    • Discussed the impact on existing authentication mechanisms and future enhancements like OPA integration.

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
@gemini-code-assist
Contributor

Warning

Gemini is experiencing higher than usual traffic and was unable to create the review. Please try again in a few hours by commenting /gemini review.

@codecov-commenter

codecov-commenter commented Mar 11, 2026

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 43.35%. Comparing base (845b798) to head (8de1f54).
⚠️ Report is 133 commits behind head on main.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #241      +/-   ##
==========================================
+ Coverage   35.60%   43.35%   +7.74%     
==========================================
  Files          29       30       +1     
  Lines        2533     2611      +78     
==========================================
+ Hits          902     1132     +230     
+ Misses       1505     1358     -147     
+ Partials      126      121       -5     
Flag Coverage Δ
unittests 43.35% <ø> (+7.74%) ⬆️

Contributor

Copilot AI left a comment


Pull request overview

Adds a new design document proposing a unified authentication + authorization model for AgentCube, covering internal component identity and external user access at the Router boundary.

Changes:

  • Introduces a SPIFFE/SPIRE mTLS proposal for Router/WorkloadManager/PicoD internal traffic.
  • Proposes Keycloak-based external authentication (OIDC) and role-based authorization (RBAC) at the Router.
  • Describes migration away from the current PicoD plain-auth/JWT approach toward workload identity.


Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
Member

@hzxuzhonghu hzxuzhonghu left a comment


Posting the main design consistency issues I found in the auth proposal.

…rvices and update PicoD SPIRE selector without namespace constraint

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
Copilot AI review requested due to automatic review settings March 17, 2026 05:35
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.



Member

@hzxuzhonghu hzxuzhonghu left a comment


Posted a few design-consistency comments on the current PR head. The main issues are the missing explanation of how end-user identity/audit context survives the Router -> WorkloadManager hop once that channel becomes pure workload mTLS, and the security implications of using a label-only SPIRE selector for PicoD. I also left two smaller documentation comments around OAuth token refresh behavior and the OPA comparison wording.

```
SDK → Router: HTTPS + Keycloak JWT (see Section 2)
Router → WorkloadManager: mTLS (SPIRE SVIDs)
Router → PicoD: mTLS (SPIRE SVIDs)
```
Member


From the above section, do all the PicoD instances share the same SPIFFE ID?

Previously we did not use TLS between the Router and PicoD; one consideration is latency.
The other concern here is that we also allow users to build their own customized code interpreters.

Contributor Author


Regarding custom code interpreters, could you clarify the concern here?

Is the concern about the added complexity for users building custom interpreters? With mTLS enforced on Router→PicoD, custom interpreters would need to handle TLS certificates.

Can you suggest an approach for supporting custom interpreters alongside mTLS?

- Clarify SDK token lifecycle per grant type (client_credentials vs TokenAuth)
- Fix overstated OPA local-evaluation advantage over Keycloak RBAC
- Harden PicoD SPIRE selector with dedicated ServiceAccount
- Support file-based certs alongside SPIRE for mTLS

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
- Update SPIFFE IDs to use standard Istio-convention format (spiffe://trust-domain/ns/namespace/sa/service-account)
- Add explanation of SPIRE selectors for workload attestation
- Document SPIRE Controller Manager (ClusterSPIFFEID CRDs) as the recommended production approach for workload registration
- Replace go-spiffe dependency with standard Go crypto/tls, unifying mTLS configuration across SPIRE (via spiffe-helper) and file-based modes

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
- Update sequence diagram to show client_credentials grant details (client_id + client_secret) and Keycloak-side validation step
- Add JWKS cache fetch step showing where Router gets signing keys
- Include identity propagation headers on Router->WorkloadManager hop

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
Copilot AI review requested due to automatic review settings March 25, 2026 13:17
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

…kloagmanager

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
… the StatefulSet of SPIRE Server instead of a separate deployment

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
@mahil-2040
Contributor Author

mahil-2040 commented Mar 30, 2026

@hzxuzhonghu PTAL, I have made the changes according to your feedback.


1. **Workload Manager Auth** (`pkg/workloadmanager/auth.go`): Optional Kubernetes TokenReview-based ServiceAccount token validation, gated behind `config.EnableAuth`, plus per-sandbox ownership checks using the extracted user identity (effectively relying on Kubernetes RBAC when using the user-scoped client).
2. **Router → PicoD Auth** (`PicoD-Plain-Authentication-Design`): A custom RSA key-pair scheme where the Router signs JWTs and PicoD verifies them using a public key exposed via the `PICOD_AUTH_PUBLIC_KEY` environment variable. The key pair (`private.pem`, `public.pem`) is stored in the `picod-router-identity` Secret, and the WorkloadManager reads this Secret to inject the public key into PicoD pods. This works for the Router→PicoD channel but leaves other internal channels unauthenticated.
3. **Router → WorkloadManager**: Optional, one-sided authentication. `pkg/router/session_manager.go` can attach an `Authorization: Bearer <serviceaccount token>` header, and WorkloadManager can validate it when `--enable-auth` is enabled. This is not mutual workload identity or a zero-trust model, and when auth is disabled any pod on the cluster network can call the WorkloadManager API.
Member


We should also consider bringing the user's identity in the token.

Contributor Author


So instead of having the Router extract the user identity and pass it blindly via X-AgentCube-User-ID headers to the WorkloadManager, the Router should perform an OAuth 2.0 Token Exchange (RFC 8693) with Keycloak.

The Router would exchange the incoming user token for a new downstream token (where the subject is the user, and the actor claim is the Router) and pass that nested token to the WorkloadManager.

Is my understanding correct?

If yes, should I update the proposal to replace the custom HTTP header approach with this Token Exchange flow as the core requirement, or should we keep the header approach for now and mark Token Exchange as a stretch goal in the proposal?
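For reference, the exchange request in the flow described above could look like this in Go. The parameter names come from RFC 8693; the audience value, token endpoint URL in the comment, and the helper name are illustrative assumptions:

```go
package main

import (
	"fmt"
	"net/url"
)

// buildTokenExchangeForm assembles the RFC 8693 token-exchange request
// body the Router would POST to Keycloak's token endpoint.
func buildTokenExchangeForm(subjectToken, audience string) url.Values {
	return url.Values{
		"grant_type":         {"urn:ietf:params:oauth:grant-type:token-exchange"},
		"subject_token":      {subjectToken}, // the incoming user JWT
		"subject_token_type": {"urn:ietf:params:oauth:token-type:access_token"},
		"audience":           {audience}, // e.g. "workloadmanager" (illustrative)
	}
}

func main() {
	form := buildTokenExchangeForm("eyJ...user", "workloadmanager")
	fmt.Println(form.Get("grant_type"))
	// The Router would then POST form.Encode() with its own
	// client_id/client_secret to the realm's token endpoint, receiving
	// a token whose subject is the user and whose act claim names the
	// Router.
}
```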

Member


Right, I think both the X-AgentCube-User-ID header and token exchange make sense.

Member

@hzxuzhonghu hzxuzhonghu left a comment


This proposal does not cover Router → agent runtime authentication. How could we delegate the user's identity to the agent?

These gaps amount to three distinct problems:

- **Internal (machine-to-machine):** Components trust each other implicitly based on network reachability. A compromised or rogue pod on the same network can impersonate any component.
- **External (user-to-platform):** Anyone who can reach the Router endpoint can invoke sandboxes, with no identity verification and no audit trail.
Member


Can limit only the creator or the group can access a sandbox

Contributor Author


I’ve updated the proposal to make the tenant isolation explicit using Resource-Level Access Control (RLAC):

Creation: When a sandbox is created, the WorkloadManager tags all related resources (CRDs/Pods) with immutable labels like agentcube.io/owner: user-123, based on JWT claims (sub, groups).
Invocation: On each request, the Router checks the caller’s JWT against the sandbox’s labels. If they don’t match (not owner or allowed group), it returns 403 Forbidden.

This ensures strict tenant isolation: no user can access another user's sandbox, even within the same namespace.
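The invocation-time check above can be sketched as follows. This is a minimal illustration: the `agentcube.io/owner` label key comes from the comment, while the group label key and the function name are assumptions.

```go
package main

import "fmt"

// Claims holds the JWT fields the Router extracts (sub and groups, per
// the comment above).
type Claims struct {
	Sub    string
	Groups []string
}

// allowed implements the RLAC check: the caller must be the sandbox
// owner or belong to the allowed group recorded in the immutable
// labels. The agentcube.io/group key is a guessed convention.
func allowed(c Claims, labels map[string]string) bool {
	if labels["agentcube.io/owner"] == c.Sub {
		return true
	}
	for _, g := range c.Groups {
		if labels["agentcube.io/group"] == g {
			return true
		}
	}
	return false
}

func main() {
	labels := map[string]string{"agentcube.io/owner": "user-123"}
	fmt.Println(allowed(Claims{Sub: "user-123"}, labels)) // true (owner)
	fmt.Println(allowed(Claims{Sub: "user-456"}, labels)) // false → Router returns 403
}
```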


---

## 1. Internal Workload Authentication (X.509 mTLS)
Member


Reading through the whole of Section 1, I find it only talks about X.509 certs with SPIRE. That seems not to align with the description below.

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a new section (1.6 File-Based Provisioning with cert-manager) that shows how to achieve the same zero-trust mTLS setup with File-based provisioning. It includes a sample Certificate CRD (mapping uris to SPIFFE IDs) and how the generated Secret plugs into our --mtls-*-file args.

```yaml
metadata:
  name: agentcube-sandbox
spec:
  spiffeIDTemplate: "spiffe://agentcube.local/sa/{{ .PodSpec.ServiceAccountName }}"
```
Member


So by ignoring the namespace here, does it mean all PicoD pods across namespaces would receive an SVID?

And if yes, are they the same?

Contributor Author


Yes, to both questions.

  1. Because sandboxes are provisioned dynamically into arbitrary, user-specified namespaces, we must omit the namespaceSelector. This allows the SPIRE Agent to attest PicoD pods globally across the cluster, provided they possess the required ServiceAccount and label.
  2. Yes, every sandbox receives the exact same SPIFFE ID (spiffe://cluster.local/sa/agentcube-sandbox).

The shared SPIFFE ID purely provides machine-to-machine authentication (proving to the Router that the target is a legitimate, WorkloadManager provisioned PicoD pod).

The actual tenant isolation and authorization (ensuring User A cannot access User B's sandbox, even if they share the same machine identity) is strictly enforced at the Application Layer via the Resource-Level Access Control (RLAC) JWT claim checks detailed in Section 3.6. This prevents the SPIRE Server from being overwhelmed by creating/destroying unique cryptographic identities for every short-lived sandbox instance!


> **Note:** Manual CLI entries are stored in the SPIRE Server's datastore (SQLite by default). They survive restarts if the datastore is backed by persistent storage, but are harder to manage and audit compared to the declarative CRD approach.

#### Attestation Flow (Example: Router)
Member


I am concerned about the delay here. In AgentCube, low latency on sandbox provisioning should be highly prioritized, especially for agentic RL scenarios.

Member


This is a critical issue; maybe we also need JWT-based SVIDs.

Contributor Author


I've looked into this carefully, and the key insight is that JWT-SVIDs in SPIRE still go through the exact same Workload Attestation flow (Agent → kubelet query → selector matching), so the attestation latency is identical regardless of SVID format.

For PicoD sandboxes specifically, the spiffe-helper sidecar fetches the SVID concurrently with PicoD's own container initialization, so attestation (~200-500ms) overlaps with pod startup time rather than adding to it sequentially.

However, for high-throughput scenarios like agentic RL training loops where even this concurrent overhead is unacceptable, our architecture already has a built-in zero-latency alternative: the file-based certificate mode (Section 1.6). In this mode, certificates are pre-provisioned as Kubernetes Secrets and mounted as volumes — they are available the instant the container starts with absolutely no attestation step needed. This makes file-based mode the recommended choice for latency-critical sandbox provisioning.

I've added a dedicated "Latency Considerations for Sandbox Provisioning" note to the proposal to document this tradeoff explicitly.

Member


You are right, the signing is a similar flow to X.509, and the delay is sometimes unacceptable. For code interpreters, we usually need ~100ms bootstrap latency, and in reality the TLS handshake takes some time, maybe tens of milliseconds.

With JWT, the handshake latency can be eliminated. If we have a warm pool, I guess the latency comes entirely from the runtime, not from the auth system.

Contributor Author


Thank you for clarifying. You're absolutely right that the real bottleneck for latency-critical scenarios like Code Interpreters isn't attestation but the TLS handshake overhead itself (~20-50ms per new connection). With JWT, the Router just sends a Bearer token over plain HTTP and PicoD does a ~1ms signature check - the handshake cost is completely eliminated.

Based on your feedback, I've updated the proposal to support two configurable Router→PicoD auth modes via a --picod-auth-mode flag:

  1. mtls (default): Full mutual TLS for security-hardened production deployments.
  2. jwt: Application-layer JWT authentication for latency-critical scenarios (Code Interpreters, agentic RL). Combined with warm pools, this ensures the auth system adds near-zero overhead.

Router↔WorkloadManager always uses mTLS since both are long-lived and the handshake cost is amortized. The existing PicoD-Plain-Auth code path is retained and improved as the JWT mode implementation rather than being deprecated.
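The mode selection described above could be wired up roughly like this. The `--picod-auth-mode` flag name and its two values come from the comment; the validation helper is an illustrative sketch, not the actual implementation:

```go
package main

import (
	"flag"
	"fmt"
)

// parseAuthMode validates the proposed --picod-auth-mode values:
// "mtls" (default) for security-hardened deployments, "jwt" for
// latency-critical paths such as code interpreters and agentic RL.
func parseAuthMode(mode string) (string, error) {
	switch mode {
	case "mtls", "jwt":
		return mode, nil
	default:
		return "", fmt.Errorf("invalid --picod-auth-mode %q (want mtls or jwt)", mode)
	}
}

func main() {
	mode := flag.String("picod-auth-mode", "mtls", "Router→PicoD auth mode: mtls or jwt")
	flag.Parse()
	m, err := parseAuthMode(*mode)
	if err != nil {
		fmt.Println(err)
		return
	}
	// The Router would pick a transport here: a TLS client config for
	// "mtls", or plain HTTP plus a signed bearer token for "jwt".
	fmt.Println("auth mode:", m)
}
```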

Comment on lines +368 to +372
The cert/key/CA files can be provisioned by any mechanism:
- **SPIRE:** The [SPIFFE Helper](https://github.com/spiffe/spiffe-helper) sidecar fetches SVIDs from the Workload API and writes them to disk as PEM files. Certificates are rotated automatically.
- **cert-manager:** Automatically issues and rotates certificates, stored as Kubernetes Secrets and mounted into pods.
- **Self-signed CA:** Generated manually or via a script for development and testing environments.
- **Let's Encrypt / corporate PKI:** Externally issued certificates placed on disk or in Secrets.
Member


Would you implement it this way?

Contributor Author


We are not implementing any integration code for these four mechanisms inside the AgentCube components.

The core of this design is that the AgentCube Go binaries will only read standard certificate files from the local disk using the new --mtls-cert-file, --mtls-key-file, and --mtls-ca-file flags.

Because AgentCube only cares about reading a file, we automatically support all four of those mechanisms out-of-the-box, as they all ultimately just write PEM files to a Kubernetes volume mount. Our implementation will strictly focus on adding the file-reading logic to the Go codebase, and providing example Kubernetes YAML manifests (like the spiffe-helper sidecar) to show users how to mount those files seamlessly.
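A sketch of that file-reading logic under the stated assumptions (the `--mtls-*-file` flag names come from the comment above; the reload-per-handshake pattern is one illustrative way to pick up sidecar-rotated certificates without restarts):

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
)

// newClientTLSConfig builds a tls.Config from the files pointed at by
// --mtls-cert-file, --mtls-key-file, and --mtls-ca-file.
// GetClientCertificate re-reads the key pair on every handshake, so a
// sidecar (spiffe-helper, cert-manager, etc.) can rotate the PEM files
// on disk with zero downtime.
func newClientTLSConfig(certFile, keyFile, caFile string) (*tls.Config, error) {
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		return nil, fmt.Errorf("read CA bundle: %w", err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, fmt.Errorf("no CA certificates parsed from %s", caFile)
	}
	return &tls.Config{
		RootCAs: pool,
		GetClientCertificate: func(*tls.CertificateRequestInfo) (*tls.Certificate, error) {
			cert, err := tls.LoadX509KeyPair(certFile, keyFile) // fresh read each handshake
			return &cert, err
		},
	}, nil
}

func main() {
	// Without provisioned files this fails fast, which is the desired
	// startup behavior when mTLS is enabled but certs are missing.
	if _, err := newClientTLSConfig("tls.crt", "tls.key", "ca.crt"); err != nil {
		fmt.Println("startup error:", err)
	}
}
```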

Member


Yeah, I was asking whether you would provide such an integration guide.


### 2.3 Keycloak Deployment

Keycloak is deployed as a `Deployment` in `agentcube-system`. A dedicated realm called `agentcube` is created during installation containing:
Member


Regarding "A dedicated realm called `agentcube`": I would suggest letting users register to support multi-tenancy.

Contributor Author


The phrasing in the proposal made it sound like the architecture was hardcoded to a single realm name - that was definitely my mistake!

But the Go implementation actually accepts the Keycloak Issuer URL and Realm dynamically through CLI configuration flags (e.g., --keycloak-realm). The agentcube realm is purely just the default name created by the "getting started" Helm chart. Cluster administrators can easily alter the deployment flags to point the Router at whatever custom or multiple distinct realms they need to achieve full multi-tenant isolation.

I have updated the proposal text in Section 2.3 to explicitly state this capability so there's no confusion!

@mahil-2040
Contributor Author

> This proposal does not cover Router → agent runtime authentication. How could we delegate the user's identity to the agent?

You're exactly right that relying solely on mTLS for the Router → Agent Runtime proxy connection drops the user context for the executing agent.

To solve this, we can use the exact same OAuth 2.0 Token Exchange (RFC 8693) pattern that I mentioned for the WorkloadManager.

…support

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
Copilot AI review requested due to automatic review settings March 31, 2026 16:57
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment on lines +307 to +311

```go
// Load CA bundle for peer verification
caCert, err := os.ReadFile(caFile)
caPool := x509.NewCertPool()
caPool.AppendCertsFromPEM(caCert)
```

Copilot AI Mar 31, 2026


The mTLS code snippet overwrites err and never checks the errors from tls.LoadX509KeyPair / os.ReadFile, and it ignores the boolean return from AppendCertsFromPEM. Since this is intended to be copy/paste-able guidance, it should show proper error handling and a failure case when the CA bundle can’t be parsed.

Suggested change:

```go
if err != nil {
	log.Fatalf("failed to load mTLS certificate/key pair (%s, %s): %v", certFile, keyFile, err)
}
// Load CA bundle for peer verification
caCert, err := os.ReadFile(caFile)
if err != nil {
	log.Fatalf("failed to read mTLS CA bundle %s: %v", caFile, err)
}
caPool := x509.NewCertPool()
if ok := caPool.AppendCertsFromPEM(caCert); !ok {
	log.Fatalf("failed to parse any CA certificates from %s", caFile)
}
```

Comment on lines +354 to +360
```go
peerCert, err := x509.ParseCertificate(rawCerts[0])
if err != nil { return err }

// 1. Manually verify against our CA pool
opts := x509.VerifyOptions{ Roots: caPool }
if _, err := peerCert.Verify(opts); err != nil { return err }
```

Copilot AI Mar 31, 2026


In the client-side VerifyPeerCertificate example, peerCert.Verify(x509.VerifyOptions{Roots: caPool}) verifies only the leaf cert and doesn’t populate Intermediates from rawCerts[1:]. This will fail for real-world chains that include intermediates, and it also doesn’t constrain key usage (serverAuth). Consider building an intermediates pool from the remaining certs and setting appropriate KeyUsages in VerifyOptions.

Suggested change:

```go
if len(rawCerts) == 0 {
	return fmt.Errorf("no server certificates provided")
}
peerCert, err := x509.ParseCertificate(rawCerts[0])
if err != nil {
	return err
}
// Build intermediates pool from remaining certificates in the chain, if any.
intermediates := x509.NewCertPool()
for _, certDER := range rawCerts[1:] {
	cert, err := x509.ParseCertificate(certDER)
	if err != nil {
		return err
	}
	intermediates.AddCert(cert)
}
// 1. Manually verify against our CA pool with intermediates and serverAuth key usage.
opts := x509.VerifyOptions{
	Roots:         caPool,
	Intermediates: intermediates,
	KeyUsages:     []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
}
if _, err := peerCert.Verify(opts); err != nil {
	return err
}
```


Keycloak realm roles are organized in a simple hierarchy:

| Role | Permissions | Inherits |
Member


What does Inherits mean? Does sandbox:manage include the permissions of sandbox:invoke plus its own permissions?

Contributor Author


Yes, exactly! In Keycloak this uses "Composite Roles": assigning a user the higher-level sandbox:manage role automatically grants them the sandbox:invoke permissions as well, preventing us from having to assign both roles to administrators.
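For illustration, this is roughly how the composite relationship could be declared in a Keycloak realm-import JSON fragment; the role names come from the proposal, and the surrounding realm configuration is omitted:

```json
{
  "roles": {
    "realm": [
      { "name": "sandbox:invoke" },
      {
        "name": "sandbox:manage",
        "composite": true,
        "composites": { "realm": ["sandbox:invoke"] }
      }
    ]
  }
}
```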

| Role | Permissions | Inherits |
|---|---|---|
| `sandbox:invoke` | Invoke agent runtimes, invoke code interpreters, list sessions | - |
| `sandbox:manage` | Create/delete AgentRuntime and CodeInterpreter CRDs | `sandbox:invoke` |
Member


It seems we create AgentRuntime and CodeInterpreter by running kubectl manually, not from the SDK.

Contributor Author


You're right: currently AgentRuntime and CodeInterpreter CRDs are created via kubectl or the AgentCube CLI (agentcube publish), both of which go directly to the Kubernetes API and are governed by Kubernetes RBAC, not Keycloak.

The sandbox:manage role is defined as a forward-looking placeholder for when CRD lifecycle operations are eventually exposed through the Router. For now, only sandbox:invoke is actively enforced by the Keycloak middleware. I've added a clarifying note to the proposal to make this distinction explicit.


With the SPIRE Controller Manager deployed, registration entries are defined as Kubernetes CRDs. These are managed declaratively alongside other AgentCube manifests and are automatically synced to the SPIRE Server:

Member
Assume this is all installed beforehand, right?

Contributor Author
Not quite beforehand. With the spire-controller-manager, we can bundle these ClusterSPIFFEID manifests directly into the main AgentCube Helm chart/release bundle.
When the admin runs `helm install agentcube`, the WorkloadManager, the Router, and these SPIFFE ID registrations are all applied to the cluster at the same time.

Member
Got it. So please add an install option.
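A hedged sketch of what such an install option might look like in the chart (the value key and template path are illustrative, not part of the current chart):

```yaml
# values.yaml (key name illustrative)
spire:
  registerIdentities: true   # set false if ClusterSPIFFEIDs are managed out-of-band

# templates/cluster-spiffeid.yaml (rendered only when the option is enabled)
# {{- if .Values.spire.registerIdentities }}
#   ...ClusterSPIFFEID manifests for Router, WorkloadManager, PicoD...
# {{- end }}
```

With this flag, clusters that already run their own SPIRE registration pipeline can opt out at install time.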

- Added End-to-End Identity Workflow clarifying Agent vs Controller
- Made Keycloak realm configuration explicitly dynamic to support multi-tenancy
- Updated Go mTLS snippets to demonstrate dynamic zero-downtime certificate
- Added Latency Considerations note detailing file-based mode as the zero-delay solution for
RL scenarios

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
…rhead

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
Copilot AI review requested due to automatic review settings April 2, 2026 05:20
Contributor

Copilot AI left a comment
Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 6 comments.

```yaml
metadata:
  name: agentcube-sandbox
spec:
  spiffeIDTemplate: "spiffe://cluster.local/sa/{{ .PodSpec.ServiceAccountName }}"
```
Copilot AI Apr 2, 2026
The ClusterSPIFFEID example for PicoD only matches on app: picod, but earlier in this document you state PicoD identity should also be constrained by a dedicated ServiceAccount (and the manual CLI example includes k8s:sa:agentcube-sandbox). As written, this example would allow any pod that can set app: picod to obtain the sandbox SPIFFE ID; please add the ServiceAccount constraint (and any other required selectors) or note that this snippet is incomplete.

Suggested change:

```yaml
spiffeIDTemplate: "spiffe://cluster.local/sa/{{ .PodSpec.ServiceAccountName }}"
# NOTE: This example is intentionally minimal for illustration. In production,
# you MUST also constrain this identity by a dedicated ServiceAccount
# (for example, `agentcube-sandbox`) and any other required selectors so that
# only trusted PicoD pods can obtain the sandbox SPIFFE ID.
```

Comment on lines +376 to +382
```go
peerCert, err := x509.ParseCertificate(rawCerts[0])
if err != nil { return err }

// 1. Manually verify against our CA pool
opts := x509.VerifyOptions{ Roots: caPool }
if _, err := peerCert.Verify(opts); err != nil { return err }
```
Copilot AI Apr 2, 2026
In the client-side TLS verification example, peerCert.Verify is called with only Roots: caPool. If the server cert chain includes intermediates (common with many PKIs/cert-manager issuers), verification will fail unless the intermediates from rawCerts[1:] are added to VerifyOptions.Intermediates. Also, when using InsecureSkipVerify: true, it’s important that the custom callback fully replicates normal TLS verification semantics.

Suggested change (replacing the snippet above with full chain verification):

```go
if len(rawCerts) == 0 {
	return fmt.Errorf("no server certificates provided")
}

// Parse the leaf (peer) certificate
peerCert, err := x509.ParseCertificate(rawCerts[0])
if err != nil {
	return err
}

// Parse and add any intermediate certificates presented by the server
intermediates := x509.NewCertPool()
for _, raw := range rawCerts[1:] {
	cert, err := x509.ParseCertificate(raw)
	if err != nil {
		return err
	}
	intermediates.AddCert(cert)
}

// 1. Manually verify the certificate chain against our CA pool
opts := x509.VerifyOptions{
	Roots:         caPool,
	Intermediates: intermediates,
}
if _, err := peerCert.Verify(opts); err != nil {
	return err
}
```

Comment on lines +566 to +586
```go
type KeycloakConfig struct {
	IssuerURL    string        // e.g., "https://keycloak.agentcube-system.svc:8443"
	Realm        string        // Default: "agentcube"
	Audience     string        // expected audience claim
	JWKSCacheTTL time.Duration // how often to refresh JWKS keys
}

func NewKeycloakValidator(cfg KeycloakConfig) (*KeycloakValidator, error) {
	jwksURL := fmt.Sprintf("%s/realms/%s/protocol/openid-connect/certs",
		cfg.IssuerURL, cfg.Realm)

	// JWKS keys are cached locally - no per-request calls to Keycloak
	keySet := jwk.NewCache(context.Background())
	keySet.Register(jwksURL, jwk.WithRefreshInterval(cfg.JWKSCacheTTL))

	return &KeycloakValidator{
		issuer:   fmt.Sprintf("%s/realms/%s", cfg.IssuerURL, cfg.Realm),
		audience: cfg.Audience,
		keySet:   keySet,
	}, nil
}
```
Copilot AI Apr 2, 2026
The JWKS caching snippet appears to use a jwk.NewCache API, but this repo doesn’t currently depend on a JWKS/JWK cache library and the snippet doesn’t show how the cached keys are actually used for JWT verification (and it also ignores the error return from Register if present). Please either mark this as pseudocode and name the intended library, or update to a concrete, correct example (e.g., using an OIDC/JWKS client that returns a keyfunc for github.com/golang-jwt/jwt/v5).

Suggested change (a concrete example using `github.com/MicahParks/keyfunc` and `github.com/golang-jwt/jwt/v5`):

```go
package auth

import (
	"context"
	"fmt"
	"time"

	"github.com/MicahParks/keyfunc"
	"github.com/golang-jwt/jwt/v5"
)

// KeycloakConfig configures JWT validation against a Keycloak realm.
type KeycloakConfig struct {
	IssuerURL    string        // e.g., "https://keycloak.agentcube-system.svc:8443"
	Realm        string        // default: "agentcube"
	Audience     string        // expected audience claim
	JWKSCacheTTL time.Duration // how often to refresh JWKS keys
}

// KeycloakValidator validates JWTs using a cached JWKS from Keycloak.
type KeycloakValidator struct {
	issuer   string
	audience string
	keyFunc  jwt.Keyfunc
}

func NewKeycloakValidator(cfg KeycloakConfig) (*KeycloakValidator, error) {
	jwksURL := fmt.Sprintf("%s/realms/%s/protocol/openid-connect/certs",
		cfg.IssuerURL, cfg.Realm)

	// JWKS keys are cached locally by the keyfunc library - no per-request calls to Keycloak.
	options := keyfunc.Options{
		RefreshInterval:   cfg.JWKSCacheTTL,
		RefreshUnknownKID: true,
		// In production code, wire this to your logger instead of ignoring the error.
		RefreshErrorHandler: func(err error) {
			// log refresh error
			_ = err
		},
	}
	jwks, err := keyfunc.Get(jwksURL, options)
	if err != nil {
		return nil, fmt.Errorf("create JWKS from %q: %w", jwksURL, err)
	}

	return &KeycloakValidator{
		issuer:   fmt.Sprintf("%s/realms/%s", cfg.IssuerURL, cfg.Realm),
		audience: cfg.Audience,
		keyFunc:  jwks.Keyfunc,
	}, nil
}

// ValidateToken verifies the signature and standard claims of a Keycloak-issued JWT.
func (v *KeycloakValidator) ValidateToken(ctx context.Context, rawToken string) (*jwt.RegisteredClaims, error) {
	claims := &jwt.RegisteredClaims{}
	token, err := jwt.ParseWithClaims(rawToken, claims, v.keyFunc,
		jwt.WithIssuer(v.issuer),
		jwt.WithAudience(v.audience),
	)
	if err != nil {
		return nil, fmt.Errorf("parse token: %w", err)
	}
	if !token.Valid {
		return nil, fmt.Errorf("invalid token")
	}
	return claims, nil
}
```


alt Valid token + authorized role
Note over Router, KC: Token Exchange (RFC 8693)
Router->>KC: POST /token<br/>grant_type=urn:ietf:params:oauth:grant-type:token-exchange<br/>subject_token=<sdk_jwt><br/>audience=workloadmanager,picod
Copilot AI Apr 2, 2026
In the sequence diagram, the token exchange call is shown as POST /token, but elsewhere you use Keycloak’s token endpoint under /realms/<realm>/protocol/openid-connect/token. For Keycloak, token exchange uses the same token endpoint, so the diagram should use the full OIDC token URL to avoid misimplementation.

Suggested change (use the full OIDC token endpoint URL):

Router->>KC: POST /realms/<realm>/protocol/openid-connect/token<br/>grant_type=urn:ietf:params:oauth:grant-type:token-exchange<br/>subject_token=<sdk_jwt><br/>audience=workloadmanager,picod


To enforce this resource-level tenant isolation:

1. **Creation Tagging:** When WorkloadManager creates an `AgentRuntime` or `CodeInterpreter`, it reads the `sub` (user ID) and optionally the `groups` claim from the exchanged Keycloak JWT. It applies these as immutable labels on the generated Kubernetes CRDs and Pods (e.g., `agentcube.io/owner: user-123`, `agentcube.io/group: engineering`).
Copilot AI Apr 2, 2026
This states “When WorkloadManager creates an AgentRuntime or CodeInterpreter…”, but in the current architecture WorkloadManager creates Sandbox/SandboxClaim resources based on existing AgentRuntime/CodeInterpreter CRDs (it doesn’t create those CRDs). Suggest rewording to reflect labeling the sandbox (and/or its PodTemplate) created for a given AgentRuntime/CodeInterpreter invocation.

Suggested change:

1. **Creation Tagging:** When WorkloadManager provisions a sandbox (e.g., `Sandbox` / `SandboxClaim`) for an invocation of an existing `AgentRuntime` or `CodeInterpreter` CRD, it reads the `sub` (user ID) and optionally the `groups` claim from the exchanged Keycloak JWT. It applies these as immutable labels on the sandbox resources and their underlying Pods/PodTemplates (e.g., `agentcube.io/owner: user-123`, `agentcube.io/group: engineering`).
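The resulting sandbox object might then carry labels like the following; the API group/version and resource kind are assumptions for illustration, only the label keys come from the text above:

```yaml
apiVersion: agentcube.volcano.sh/v1alpha1  # assumed group/version
kind: Sandbox
metadata:
  name: sandbox-abc123
  labels:
    agentcube.io/owner: user-123      # from the JWT `sub` claim
    agentcube.io/group: engineering   # from the JWT `groups` claim (optional)
```

List/delete requests can then be filtered server-side with a label selector on `agentcube.io/owner`.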

| **Mechanism** | Transport-layer mutual TLS | Application-layer `Authorization: Bearer` header |
| **First-connection overhead** | ~20-50ms (TLS handshake) | ~1ms (JWT signature verification) |
| **Certificate provisioning** | SPIRE SVID or file-based (cert-manager) | Router signs JWT; PicoD validates with pre-shared public key |
| **Key rotation** | Automatic (SPIRE 1h TTL or cert-manager) | Automatic (Router key rotation) |
Copilot AI Apr 2, 2026
The table claims JWT mode has “Automatic (Router key rotation)”, but the current Router JWT implementation stores a long-lived RSA keypair in the picod-router-identity Secret and doesn’t rotate it automatically. If the proposal includes adding rotation, it would help to call that out explicitly (how/when rotation happens and how PicoD gets the new public key) rather than implying it already exists.

Suggested change:

| **Key rotation** | Automatic (SPIRE 1h TTL or cert-manager) | Manual (long-lived Router keypair in `picod-router-identity` Secret; no automatic rotation yet) |

Member
@hzxuzhonghu hzxuzhonghu left a comment


Generally LG. @YaoZengzeng to have another look.

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hzxuzhonghu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@YaoZengzeng
Member

/lgtm

@volcano-sh-bot volcano-sh-bot merged commit 4507820 into volcano-sh:main Apr 2, 2026
17 of 18 checks passed

Labels: approved, kind/documentation, lgtm, size/XL

6 participants