Authentication and Authorization Design Proposal #241
volcano-sh-bot merged 13 commits into volcano-sh:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request introduces a comprehensive design proposal to overhaul AgentCube's authentication and authorization mechanisms. It aims to establish a robust, unified security model by implementing zero-trust mTLS for internal component communication using SPIFFE/SPIRE, and integrating Keycloak for external user authentication and role-based access control. This strategic shift addresses existing security vulnerabilities, enhances system integrity, and provides a scalable, industry-standard foundation for managing identities and permissions across the platform.
Activity
Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
force-pushed from 671f0bd to 6bc442a
Codecov Report: ✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
## main #241 +/- ##
==========================================
+ Coverage 35.60% 43.35% +7.74%
==========================================
Files 29 30 +1
Lines 2533 2611 +78
==========================================
+ Hits 902 1132 +230
+ Misses 1505 1358 -147
+ Partials 126 121 -5
Flags with carried forward coverage won't be shown.
Pull request overview
Adds a new design document proposing a unified authentication + authorization model for AgentCube, covering internal component identity and external user access at the Router boundary.
Changes:
- Introduces a SPIFFE/SPIRE mTLS proposal for Router/WorkloadManager/PicoD internal traffic.
- Proposes Keycloak-based external authentication (OIDC) and role-based authorization (RBAC) at the Router.
- Describes migration away from the current PicoD plain-auth/JWT approach toward workload identity.
Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
hzxuzhonghu
left a comment
Posting the main design consistency issues I found in the auth proposal.
…rvices and update PicoD SPIRE selector without namespace constraint Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
hzxuzhonghu
left a comment
Posted a few design-consistency comments on the current PR head. The main issues are the missing explanation of how end-user identity/audit context survives the Router -> WorkloadManager hop once that channel becomes pure workload mTLS, and the security implications of using a label-only SPIRE selector for PicoD. I also left two smaller documentation comments around OAuth token refresh behavior and the OPA comparison wording.
docs/design/auth-proposal.md
Outdated
| ``` | ||
| SDK → Router: HTTPS + Keycloak JWT (see Section 2) | ||
| Router → WorkloadManager: mTLS (SPIRE SVIDs) | ||
| Router → PicoD: mTLS (SPIRE SVIDs) |
From the section above, do all the PicoD instances share the same SPIFFE ID?
Previously we did not use TLS between the Router and PicoD; one consideration is latency.
The other concern is that we also allow users to build their own customized code interpreters.
Regarding custom code interpreters, could you clarify the concern here?
Is the concern about the added complexity for users building custom interpreters? Because with mTLS enforced on Router→PicoD, custom interpreters would need to handle TLS certificates.
And can you suggest an approach for supporting custom interpreters alongside mTLS?
- Clarify SDK token lifecycle per grant type (client_credentials vs TokenAuth) - Fix overstated OPA local-evaluation advantage over Keycloak RBAC - Harden PicoD SPIRE selector with dedicated ServiceAccount - Support file-based certs alongside SPIRE for mTLS Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
- Update SPIFFE IDs to use standard Istio-convention format (spiffe://trust-domain/ns/namespace/sa/service-account) - Add explanation of SPIRE selectors for workload attestation - Document SPIRE Controller Manager (ClusterSPIFFEID CRDs) as the recommended production approach for workload registration - Replace go-spiffe dependency with standard Go crypto/tls, unifying mTLS configuration across SPIRE (via spiffe-helper) and file-based modes Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
- Update sequence diagram to show client_credentials grant details (client_id + client_secret) and Keycloak-side validation step - Add JWKS cache fetch step showing where Router gets signing keys - Include identity propagation headers on Router->WorkloadManager hop Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
…kloagmanager Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
… the statfulset of SPIRE server instead of separate deployment Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
@hzxuzhonghu PTAL, I have made the changes according to your feedback.
||
| 1. **Workload Manager Auth** (`pkg/workloadmanager/auth.go`): Optional Kubernetes TokenReview-based ServiceAccount token validation, gated behind `config.EnableAuth`, plus per-sandbox ownership checks using the extracted user identity (effectively relying on Kubernetes RBAC when using the user-scoped client). | ||
| 2. **Router → PicoD Auth** (`PicoD-Plain-Authentication-Design`): A custom RSA key-pair scheme where the Router signs JWTs and PicoD verifies them using a public key exposed via the `PICOD_AUTH_PUBLIC_KEY` environment variable. The key pair (`private.pem`, `public.pem`) is stored in the `picod-router-identity` Secret, and the WorkloadManager reads this Secret to inject the public key into PicoD pods. This works for the Router→PicoD channel but leaves other internal channels unauthenticated. | ||
| 3. **Router → WorkloadManager**: Optional, one-sided authentication. `pkg/router/session_manager.go` can attach an `Authorization: Bearer <serviceaccount token>` header, and WorkloadManager can validate it when `--enable-auth` is enabled. This is not mutual workload identity or a zero-trust model, and when auth is disabled any pod on the cluster network can call the WorkloadManager API. |
We should also consider bringing the user's identity into the token.
So instead of having the Router extract the user identity and pass it blindly via X-AgentCube-User-ID headers to the WorkloadManager, the Router should perform an OAuth 2.0 Token Exchange (RFC 8693) with Keycloak.
The Router would exchange the incoming user token for a new downstream token (where the subject is the user, and the actor claim is the Router) and pass that nested token to the WorkloadManager.
Is my understanding correct?
If yes, should I update the proposal to replace the custom HTTP header approach with this Token Exchange flow as the core requirement, or should we keep the header approach for now and mark Token Exchange as a stretch goal in the proposal?
Right, I think both X-AgentCube-User-ID and token exchange make sense.
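As a sketch of the exchange discussed above, the Router could build the RFC 8693 request body like this before POSTing it to Keycloak's token endpoint (`/realms/<realm>/protocol/openid-connect/token`). The client ID, secret, and parameter values below are illustrative placeholders, not final configuration:

```go
package main

import (
	"fmt"
	"net/url"
)

// buildTokenExchangeForm assembles the RFC 8693 token-exchange request body.
// The resulting form is what the Router would POST (application/x-www-form-urlencoded)
// to Keycloak's token endpoint. All argument values are illustrative.
func buildTokenExchangeForm(clientID, clientSecret, subjectToken, audience string) url.Values {
	v := url.Values{}
	v.Set("grant_type", "urn:ietf:params:oauth:grant-type:token-exchange")
	v.Set("client_id", clientID)         // the Router's own Keycloak client
	v.Set("client_secret", clientSecret) // proves the actor is the Router
	v.Set("subject_token", subjectToken) // the incoming user JWT from the SDK
	v.Set("subject_token_type", "urn:ietf:params:oauth:token-type:access_token")
	v.Set("audience", audience) // e.g. "workloadmanager"
	return v
}

func main() {
	form := buildTokenExchangeForm("router", "s3cr3t", "<sdk_jwt>", "workloadmanager")
	fmt.Println(form.Get("grant_type"))
}
```

The response token would carry the user as `sub` and the Router in the `act` (actor) claim, which is what lets the WorkloadManager audit both identities.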
hzxuzhonghu
left a comment
This proposal does not cover Router → agent runtime authentication. How could we delegate the user's identity to the agent?
| These gaps amount to three distinct problems: | ||
|
|
||
| - **Internal (machine-to-machine):** Components trust each other implicitly based on network reachability. A compromised or rogue pod on the same network can impersonate any component. | ||
| - **External (user-to-platform):** Anyone who can reach the Router endpoint can invoke sandboxes, with no identity verification and no audit trail. |
We can limit access so that only the creator or their group can access a sandbox.
I’ve updated the proposal to make the tenant isolation explicit using Resource-Level Access Control (RLAC):
Creation: When a sandbox is created, the WorkloadManager tags all related resources (CRDs/Pods) with immutable labels like agentcube.io/owner: user-123, based on JWT claims (sub, groups).
Invocation: On each request, the Router checks the caller’s JWT against the sandbox’s labels. If they don’t match (not owner or allowed group), it returns 403 Forbidden.
This ensures strict tenant isolation: no user can access another user's sandbox, even within the same namespace.
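The creation/invocation flow above boils down to a label-vs-claims comparison at the Router. A minimal sketch, assuming hypothetical claim fields and the label keys from this thread (not a final schema):

```go
package main

import "fmt"

// claims holds the JWT fields the Router would extract from the caller's token.
// Field names mirror the sub/groups claims mentioned above; this is a sketch,
// not the final claim schema.
type claims struct {
	Sub    string
	Groups []string
}

// authorized reports whether the caller may invoke the sandbox whose labels
// were stamped immutably by the WorkloadManager at creation time.
func authorized(c claims, labels map[string]string) bool {
	// Owner match: the caller created this sandbox.
	if owner := labels["agentcube.io/owner"]; owner != "" && owner == c.Sub {
		return true
	}
	// Group match: the caller belongs to the sandbox's allowed group.
	if group := labels["agentcube.io/group"]; group != "" {
		for _, g := range c.Groups {
			if g == group {
				return true
			}
		}
	}
	return false // Router would respond 403 Forbidden
}

func main() {
	labels := map[string]string{
		"agentcube.io/owner": "user-123",
		"agentcube.io/group": "engineering",
	}
	fmt.Println(authorized(claims{Sub: "user-123"}, labels))                            // owner match
	fmt.Println(authorized(claims{Sub: "user-456", Groups: []string{"sales"}}, labels)) // denied
}
```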
|
|
||
| --- | ||
|
|
||
| ## 1. Internal Workload Authentication (X.509 mTLS) |
Reading through the whole of Section 1, I find it only talks about X.509 certs with SPIRE. That seems not to align with the description below.
I added a new section (1.6 File-Based Provisioning with cert-manager) that shows how to achieve the same zero-trust mTLS setup with file-based provisioning. It includes a sample Certificate CRD (mapping URIs to SPIFFE IDs) and shows how the generated Secret plugs into our --mtls-*-file args.
docs/design/auth-proposal.md
Outdated
| metadata: | ||
| name: agentcube-sandbox | ||
| spec: | ||
| spiffeIDTemplate: "spiffe://agentcube.local/sa/{{ .PodSpec.ServiceAccountName }}" |
So by ignoring the namespace here, does it mean all PicoD pods across namespaces would receive an SVID?
And if yes, are they the same?
Yes, to both questions.
- Because sandboxes are provisioned dynamically into arbitrary, user-specified namespaces, we must omit the `namespaceSelector`. This allows the SPIRE Agent to attest PicoD pods globally across the cluster, provided they possess the required ServiceAccount and label.
- Yes, every sandbox receives the exact same SPIFFE ID (`spiffe://cluster.local/sa/agentcube-sandbox`).

The shared SPIFFE ID purely provides machine-to-machine authentication (proving to the Router that the target is a legitimate, WorkloadManager-provisioned PicoD pod). The actual tenant isolation and authorization (ensuring User A cannot access User B's sandbox, even if they share the same machine identity) is strictly enforced at the application layer via the Resource-Level Access Control (RLAC) JWT claim checks detailed in Section 3.6. This also prevents the SPIRE Server from being overwhelmed by creating and destroying unique cryptographic identities for every short-lived sandbox instance.
|
|
||
| > **Note:** Manual CLI entries are stored in the SPIRE Server's datastore (SQLite by default). They survive restarts if the datastore is backed by persistent storage, but are harder to manage and audit compared to the declarative CRD approach. | ||
|
|
||
| #### Attestation Flow (Example: Router) |
I am concerned about the delay here; in AgentCube, low latency on sandbox provisioning should be highly prioritized, especially for agentic RL scenarios.
This is a critical issue; maybe we also need JWT-based SVIDs.
I've looked into this carefully. The key insight is that JWT-SVIDs in SPIRE still go through the exact same Workload Attestation flow (Agent → kubelet query → selector matching), so the attestation latency is identical regardless of SVID format.
For PicoD sandboxes specifically, the spiffe-helper sidecar fetches the SVID concurrently with PicoD's own container initialization, so attestation (~200-500ms) overlaps with pod startup time rather than adding to it sequentially.
However, for high-throughput scenarios like agentic RL training loops where even this concurrent overhead is unacceptable, our architecture already has a built-in zero-latency alternative: the file-based certificate mode (Section 1.6). In this mode, certificates are pre-provisioned as Kubernetes Secrets and mounted as volumes — they are available the instant the container starts with absolutely no attestation step needed. This makes file-based mode the recommended choice for latency-critical sandbox provisioning.
I've added a dedicated "Latency Considerations for Sandbox Provisioning" note to the proposal to document this tradeoff explicitly.
You are right, the signing follows a similar flow to X.509, and the delay is sometimes unacceptable. For code interpreters, we usually need ~100ms bootstrap latency, and in reality the TLS handshake takes some time, maybe tens of milliseconds.
With JWT, the handshake latency can be eliminated. If we have a warm pool, I guess the latency comes entirely from the runtime, not from the auth system.
Thank you for clarifying. You're absolutely right that the real bottleneck for latency-critical scenarios like Code Interpreters isn't attestation but the TLS handshake overhead itself (~20-50ms per new connection). With JWT, the Router just sends a Bearer token over plain HTTP and PicoD does a ~1ms signature check - the handshake cost is completely eliminated.
Based on your feedback, I've updated the proposal to support two configurable Router→PicoD auth modes via a --picod-auth-mode flag:
- mtls (default): Full mutual TLS for security-hardened production deployments.
- jwt: Application-layer JWT authentication for latency-critical scenarios (Code Interpreters, agentic RL). Combined with warm pools, this ensures the auth system adds near-zero overhead.
Router↔WorkloadManager always uses mTLS since both are long-lived and the handshake cost is amortized. The existing PicoD-Plain-Auth code path is retained and improved as the JWT mode implementation rather than being deprecated.
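A minimal sketch of how the proposed --picod-auth-mode dispatch might look. The mode names come from this thread; the returned URL schemes are illustrative stand-ins for the real client constructors:

```go
package main

import "fmt"

// picodScheme selects the Router→PicoD transport based on the proposed
// --picod-auth-mode flag. This is a sketch: a real implementation would
// return a configured HTTP client, not just a scheme string.
func picodScheme(mode string) (string, error) {
	switch mode {
	case "mtls":
		// Default: mutual TLS; handshake cost (~tens of ms) per new connection.
		return "https", nil
	case "jwt":
		// Latency-critical: bearer token over plain HTTP, no TLS handshake;
		// PicoD verifies the JWT signature at the application layer.
		return "http", nil
	default:
		return "", fmt.Errorf("unknown --picod-auth-mode %q", mode)
	}
}

func main() {
	s, err := picodScheme("jwt")
	fmt.Println(s, err)
}
```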
| The cert/key/CA files can be provisioned by any mechanism: | ||
| - **SPIRE:** The [SPIFFE Helper](https://github.com/spiffe/spiffe-helper) sidecar fetches SVIDs from the Workload API and writes them to disk as PEM files. Certificates are rotated automatically. | ||
| - **cert-manager:** Automatically issues and rotates certificates, stored as Kubernetes Secrets and mounted into pods. | ||
| - **Self-signed CA:** Generated manually or via a script for development and testing environments. | ||
| - **Let's Encrypt / corporate PKI:** Externally issued certificates placed on disk or in Secrets. |
Would you implement it this way?
We are not implementing any integration code for these four mechanisms inside the AgentCube components.
The core of this design is that the AgentCube Go binaries will only read standard certificate files from the local disk using the new --mtls-cert-file, --mtls-key-file, and --mtls-ca-file flags.
Because AgentCube only cares about reading a file, we automatically support all four of those mechanisms out-of-the-box, as they all ultimately just write PEM files to a Kubernetes volume mount. Our implementation will strictly focus on adding the file-reading logic to the Go codebase, and providing example Kubernetes YAML manifests (like the spiffe-helper sidecar) to show users how to mount those files seamlessly.
Yeah, I was asking whether you would provide such an integration guide.
docs/design/auth-proposal.md
Outdated
|
|
||
| ### 2.3 Keycloak Deployment | ||
|
|
||
| Keycloak is deployed as a `Deployment` in `agentcube-system`. A dedicated realm called `agentcube` is created during installation containing: |
Regarding the dedicated realm called `agentcube`: I would suggest letting users register their own realms to support multi-tenancy.
The phrasing in the proposal made it sound like the architecture was hardcoded to a single realm name - that was definitely my mistake!
But the Go implementation actually accepts the Keycloak Issuer URL and Realm dynamically through CLI configuration flags (e.g., --keycloak-realm). The agentcube realm is purely just the default name created by the "getting started" Helm chart. Cluster administrators can easily alter the deployment flags to point the Router at whatever custom or multiple distinct realms they need to achieve full multi-tenant isolation.
I have updated the proposal text in Section 2.3 to explicitly state this capability so there's no confusion!
To solve this, we can use the exact same OAuth 2.0 Token Exchange (RFC 8693) pattern that I mentioned for the WorkloadManager.
…support Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
|
|
||
| // Load CA bundle for peer verification | ||
| caCert, err := os.ReadFile(caFile) | ||
| caPool := x509.NewCertPool() | ||
| caPool.AppendCertsFromPEM(caCert) |
The mTLS code snippet overwrites err and never checks the errors from tls.LoadX509KeyPair / os.ReadFile, and it ignores the boolean return from AppendCertsFromPEM. Since this is intended to be copy/paste-able guidance, it should show proper error handling and a failure case when the CA bundle can’t be parsed.
| // Load CA bundle for peer verification | |
| caCert, err := os.ReadFile(caFile) | |
| caPool := x509.NewCertPool() | |
| caPool.AppendCertsFromPEM(caCert) | |
| if err != nil { | |
| log.Fatalf("failed to load mTLS certificate/key pair (%s, %s): %v", certFile, keyFile, err) | |
| } | |
| // Load CA bundle for peer verification | |
| caCert, err := os.ReadFile(caFile) | |
| if err != nil { | |
| log.Fatalf("failed to read mTLS CA bundle %s: %v", caFile, err) | |
| } | |
| caPool := x509.NewCertPool() | |
| if ok := caPool.AppendCertsFromPEM(caCert); !ok { | |
| log.Fatalf("failed to parse any CA certificates from %s", caFile) | |
| } |
| peerCert, err := x509.ParseCertificate(rawCerts[0]) | ||
| if err != nil { return err } | ||
|
|
||
| // 1. Manually verify against our CA pool | ||
| opts := x509.VerifyOptions{ Roots: caPool } | ||
| if _, err := peerCert.Verify(opts); err != nil { return err } | ||
|
|
In the client-side VerifyPeerCertificate example, peerCert.Verify(x509.VerifyOptions{Roots: caPool}) verifies only the leaf cert and doesn’t populate Intermediates from rawCerts[1:]. This will fail for real-world chains that include intermediates, and it also doesn’t constrain key usage (serverAuth). Consider building an intermediates pool from the remaining certs and setting appropriate KeyUsages in VerifyOptions.
| peerCert, err := x509.ParseCertificate(rawCerts[0]) | |
| if err != nil { return err } | |
| // 1. Manually verify against our CA pool | |
| opts := x509.VerifyOptions{ Roots: caPool } | |
| if _, err := peerCert.Verify(opts); err != nil { return err } | |
| if len(rawCerts) == 0 { | |
| return fmt.Errorf("no server certificates provided") | |
| } | |
| peerCert, err := x509.ParseCertificate(rawCerts[0]) | |
| if err != nil { | |
| return err | |
| } | |
| // Build intermediates pool from remaining certificates in the chain, if any. | |
| intermediates := x509.NewCertPool() | |
| for _, certDER := range rawCerts[1:] { | |
| cert, err := x509.ParseCertificate(certDER) | |
| if err != nil { | |
| return err | |
| } | |
| intermediates.AddCert(cert) | |
| } | |
| // 1. Manually verify against our CA pool with intermediates and serverAuth key usage. | |
| opts := x509.VerifyOptions{ | |
| Roots: caPool, | |
| Intermediates: intermediates, | |
| KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth}, | |
| } | |
| if _, err := peerCert.Verify(opts); err != nil { | |
| return err | |
| } |
|
|
||
| Keycloak realm roles are organized in a simple hierarchy: | ||
|
|
||
| | Role | Permissions | Inherits | |
What does Inherits mean? Does sandbox:manage include the permissions of sandbox:invoke plus its own permissions?
Yes, exactly! In Keycloak this uses "Composite Roles". Assigning a user the higher-level sandbox:manage role automatically grants them the sandbox:invoke permissions as well, so we avoid having to assign both roles to administrators.
| | Role | Permissions | Inherits | | ||
| |---|---|---| | ||
| | `sandbox:invoke` | Invoke agent runtimes, invoke code interpreters, list sessions | - | | ||
| | `sandbox:manage` | Create/delete AgentRuntime and CodeInterpreter CRDs | `sandbox:invoke` | |
It seems we create AgentRuntime and CodeInterpreter by running kubectl manually, not from the SDK.
You're right. Currently, AgentRuntime and CodeInterpreter CRDs are created via kubectl or the AgentCube CLI (agentcube publish), both of which go directly to the Kubernetes API and are governed by Kubernetes RBAC, not Keycloak.
The sandbox:manage role is defined as a forward-looking placeholder for when CRD lifecycle operations are eventually exposed through the Router. For now, only sandbox:invoke is actively enforced by the Keycloak middleware. I've added a clarifying note to the proposal to make this distinction explicit.
|
|
||
| With the SPIRE Controller Manager deployed, registration entries are defined as Kubernetes CRDs. These are managed declaratively alongside other AgentCube manifests and are automatically synced to the SPIRE Server: | ||
|
|
||
| ```yaml |
Assume this is all installed beforehand, right?
Not quite beforehand. With the spire-controller-manager, we can bundle these ClusterSPIFFEID manifests directly into the main AgentCube Helm chart/release bundle.
When the admin runs helm install agentcube, the WorkloadManager, the Router, and these SPIFFE ID registrations are all applied to the cluster at the same time.
Got it. So please add an install option.
- Added End-to-End Identity Workflow clarifying Agent vs Controller - Made Keycloak realm configuration explicitly dynamic to support multi-tenancy - Updated Go mTLS snippets to demonstrate dynamic zero-downtime certificate - Added Latency Considerations note detailing file-based mode as the zero-delay solution for RL scenarios Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
…rhead Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
| metadata: | ||
| name: agentcube-sandbox | ||
| spec: | ||
| spiffeIDTemplate: "spiffe://cluster.local/sa/{{ .PodSpec.ServiceAccountName }}" |
The ClusterSPIFFEID example for PicoD only matches on app: picod, but earlier in this document you state PicoD identity should also be constrained by a dedicated ServiceAccount (and the manual CLI example includes k8s:sa:agentcube-sandbox). As written, this example would allow any pod that can set app: picod to obtain the sandbox SPIFFE ID; please add the ServiceAccount constraint (and any other required selectors) or note that this snippet is incomplete.
| spiffeIDTemplate: "spiffe://cluster.local/sa/{{ .PodSpec.ServiceAccountName }}" | |
| spiffeIDTemplate: "spiffe://cluster.local/sa/{{ .PodSpec.ServiceAccountName }}" | |
| # NOTE: This example is intentionally minimal for illustration. In production, | |
| # you MUST also constrain this identity by a dedicated ServiceAccount | |
| # (for example, `agentcube-sandbox`) and any other required selectors so that | |
| # only trusted PicoD pods can obtain the sandbox SPIFFE ID. |
| peerCert, err := x509.ParseCertificate(rawCerts[0]) | ||
| if err != nil { return err } | ||
|
|
||
| // 1. Manually verify against our CA pool | ||
| opts := x509.VerifyOptions{ Roots: caPool } | ||
| if _, err := peerCert.Verify(opts); err != nil { return err } | ||
|
|
In the client-side TLS verification example, peerCert.Verify is called with only Roots: caPool. If the server cert chain includes intermediates (common with many PKIs/cert-manager issuers), verification will fail unless the intermediates from rawCerts[1:] are added to VerifyOptions.Intermediates. Also, when using InsecureSkipVerify: true, it’s important that the custom callback fully replicates normal TLS verification semantics.
| peerCert, err := x509.ParseCertificate(rawCerts[0]) | |
| if err != nil { return err } | |
| // 1. Manually verify against our CA pool | |
| opts := x509.VerifyOptions{ Roots: caPool } | |
| if _, err := peerCert.Verify(opts); err != nil { return err } | |
| if len(rawCerts) == 0 { | |
| return fmt.Errorf("no server certificates provided") | |
| } | |
| // Parse the leaf (peer) certificate | |
| peerCert, err := x509.ParseCertificate(rawCerts[0]) | |
| if err != nil { | |
| return err | |
| } | |
| // Parse and add any intermediate certificates presented by the server | |
| intermediates := x509.NewCertPool() | |
| for _, raw := range rawCerts[1:] { | |
| cert, err := x509.ParseCertificate(raw) | |
| if err != nil { | |
| return err | |
| } | |
| intermediates.AddCert(cert) | |
| } | |
| // 1. Manually verify the certificate chain against our CA pool | |
| opts := x509.VerifyOptions{ | |
| Roots: caPool, | |
| Intermediates: intermediates, | |
| } | |
| if _, err := peerCert.Verify(opts); err != nil { | |
| return err | |
| } |
| type KeycloakConfig struct { | ||
| IssuerURL string // e.g., "https://keycloak.agentcube-system.svc:8443" | ||
| Realm string // Default: "agentcube" | ||
| Audience string // expected audience claim | ||
| JWKSCacheTTL time.Duration // how often to refresh JWKS keys | ||
| } | ||
|
|
||
| func NewKeycloakValidator(cfg KeycloakConfig) (*KeycloakValidator, error) { | ||
| jwksURL := fmt.Sprintf("%s/realms/%s/protocol/openid-connect/certs", | ||
| cfg.IssuerURL, cfg.Realm) | ||
|
|
||
| // JWKS keys are cached locally - no per-request calls to Keycloak | ||
| keySet := jwk.NewCache(context.Background()) | ||
| keySet.Register(jwksURL, jwk.WithRefreshInterval(cfg.JWKSCacheTTL)) | ||
|
|
||
| return &KeycloakValidator{ | ||
| issuer: fmt.Sprintf("%s/realms/%s", cfg.IssuerURL, cfg.Realm), | ||
| audience: cfg.Audience, | ||
| keySet: keySet, | ||
| }, nil | ||
| } |
The JWKS caching snippet appears to use a jwk.NewCache API, but this repo doesn’t currently depend on a JWKS/JWK cache library and the snippet doesn’t show how the cached keys are actually used for JWT verification (and it also ignores the error return from Register if present). Please either mark this as pseudocode and name the intended library, or update to a concrete, correct example (e.g., using an OIDC/JWKS client that returns a keyfunc for github.com/golang-jwt/jwt/v5).
| type KeycloakConfig struct { | |
| IssuerURL string // e.g., "https://keycloak.agentcube-system.svc:8443" | |
| Realm string // Default: "agentcube" | |
| Audience string // expected audience claim | |
| JWKSCacheTTL time.Duration // how often to refresh JWKS keys | |
| } | |
| func NewKeycloakValidator(cfg KeycloakConfig) (*KeycloakValidator, error) { | |
| jwksURL := fmt.Sprintf("%s/realms/%s/protocol/openid-connect/certs", | |
| cfg.IssuerURL, cfg.Realm) | |
| // JWKS keys are cached locally - no per-request calls to Keycloak | |
| keySet := jwk.NewCache(context.Background()) | |
| keySet.Register(jwksURL, jwk.WithRefreshInterval(cfg.JWKSCacheTTL)) | |
| return &KeycloakValidator{ | |
| issuer: fmt.Sprintf("%s/realms/%s", cfg.IssuerURL, cfg.Realm), | |
| audience: cfg.Audience, | |
| keySet: keySet, | |
| }, nil | |
| } | |
| package auth | |
| import ( | |
| "context" | |
| "fmt" | |
| "time" | |
| "github.com/MicahParks/keyfunc" | |
| "github.com/golang-jwt/jwt/v5" | |
| ) | |
| // KeycloakConfig configures JWT validation against a Keycloak realm. | |
| type KeycloakConfig struct { | |
| IssuerURL string // e.g., "https://keycloak.agentcube-system.svc:8443" | |
| Realm string // default: "agentcube" | |
| Audience string // expected audience claim | |
| JWKSCacheTTL time.Duration // how often to refresh JWKS keys | |
| } | |
| // KeycloakValidator validates JWTs using a cached JWKS from Keycloak. | |
| type KeycloakValidator struct { | |
| issuer string | |
| audience string | |
| keyFunc jwt.Keyfunc | |
| } | |
| func NewKeycloakValidator(cfg KeycloakConfig) (*KeycloakValidator, error) { | |
| jwksURL := fmt.Sprintf("%s/realms/%s/protocol/openid-connect/certs", | |
| cfg.IssuerURL, cfg.Realm) | |
| // JWKS keys are cached locally by the keyfunc library - no per-request calls to Keycloak. | |
| options := keyfunc.Options{ | |
| RefreshInterval: cfg.JWKSCacheTTL, | |
| RefreshUnknownKID: true, | |
| // In production code, wire this to your logger instead of ignoring the error. | |
| RefreshErrorHandler: func(err error) { | |
| // log refresh error | |
| _ = err | |
| }, | |
| } | |
| jwks, err := keyfunc.Get(jwksURL, options) | |
| if err != nil { | |
| return nil, fmt.Errorf("create JWKS from %q: %w", jwksURL, err) | |
| } | |
| return &KeycloakValidator{ | |
| issuer: fmt.Sprintf("%s/realms/%s", cfg.IssuerURL, cfg.Realm), | |
| audience: cfg.Audience, | |
| keyFunc: jwks.Keyfunc, | |
| }, nil | |
| } | |
| // ValidateToken verifies the signature and standard claims of a Keycloak-issued JWT. | |
| func (v *KeycloakValidator) ValidateToken(ctx context.Context, rawToken string) (*jwt.RegisteredClaims, error) { | |
| claims := &jwt.RegisteredClaims{} | |
| token, err := jwt.ParseWithClaims(rawToken, claims, v.keyFunc, | |
| jwt.WithIssuer(v.issuer), | |
| jwt.WithAudience(v.audience), | |
| ) | |
| if err != nil { | |
| return nil, fmt.Errorf("parse token: %w", err) | |
| } | |
| if !token.Valid { | |
| return nil, fmt.Errorf("invalid token") | |
| } | |
| return claims, nil | |
| } |
|
|
||
| alt Valid token + authorized role | ||
| Note over Router, KC: Token Exchange (RFC 8693) | ||
| Router->>KC: POST /token<br/>grant_type=urn:ietf:params:oauth:grant-type:token-exchange<br/>subject_token=<sdk_jwt><br/>audience=workloadmanager,picod |
In the sequence diagram, the token exchange call is shown as POST /token, but elsewhere you use Keycloak’s token endpoint under /realms/<realm>/protocol/openid-connect/token. For Keycloak, token exchange uses the same token endpoint, so the diagram should use the full OIDC token URL to avoid misimplementation.
| Router->>KC: POST /token<br/>grant_type=urn:ietf:params:oauth:grant-type:token-exchange<br/>subject_token=<sdk_jwt><br/>audience=workloadmanager,picod | |
| Router->>KC: POST /realms/<realm>/protocol/openid-connect/token<br/>grant_type=urn:ietf:params:oauth:grant-type:token-exchange<br/>subject_token=<sdk_jwt><br/>audience=workloadmanager,picod |
|
|
||
| To enforce this resource-level tenant isolation: | ||
|
|
||
| 1. **Creation Tagging:** When WorkloadManager creates an `AgentRuntime` or `CodeInterpreter`, it reads the `sub` (user ID) and optionally the `groups` claim from the exchanged Keycloak JWT. It applies these as immutable labels on the generated Kubernetes CRDs and Pods (e.g., `agentcube.io/owner: user-123`, `agentcube.io/group: engineering`). |
This states “When WorkloadManager creates an AgentRuntime or CodeInterpreter…”, but in the current architecture WorkloadManager creates Sandbox/SandboxClaim resources based on existing AgentRuntime/CodeInterpreter CRDs (it doesn’t create those CRDs). Suggest rewording to reflect labeling the sandbox (and/or its PodTemplate) created for a given AgentRuntime/CodeInterpreter invocation.
| 1. **Creation Tagging:** When WorkloadManager creates an `AgentRuntime` or `CodeInterpreter`, it reads the `sub` (user ID) and optionally the `groups` claim from the exchanged Keycloak JWT. It applies these as immutable labels on the generated Kubernetes CRDs and Pods (e.g., `agentcube.io/owner: user-123`, `agentcube.io/group: engineering`). | |
| 1. **Creation Tagging:** When WorkloadManager provisions a sandbox (e.g., `Sandbox` / `SandboxClaim`) for an invocation of an existing `AgentRuntime` or `CodeInterpreter` CRD, it reads the `sub` (user ID) and optionally the `groups` claim from the exchanged Keycloak JWT. It applies these as immutable labels on the sandbox resources and their underlying Pods/PodTemplates (e.g., `agentcube.io/owner: user-123`, `agentcube.io/group: engineering`). |
| | **Mechanism** | Transport-layer mutual TLS | Application-layer `Authorization: Bearer` header | | ||
| | **First-connection overhead** | ~20-50ms (TLS handshake) | ~1ms (JWT signature verification) | | ||
| | **Certificate provisioning** | SPIRE SVID or file-based (cert-manager) | Router signs JWT; PicoD validates with pre-shared public key | | ||
| | **Key rotation** | Automatic (SPIRE 1h TTL or cert-manager) | Automatic (Router key rotation) | |
The table claims JWT mode has “Automatic (Router key rotation)”, but the current Router JWT implementation stores a long-lived RSA keypair in the picod-router-identity Secret and doesn’t rotate it automatically. If the proposal includes adding rotation, it would help to call that out explicitly (how/when rotation happens and how PicoD gets the new public key) rather than implying it already exists.
| | **Key rotation** | Automatic (SPIRE 1h TTL or cert-manager) | Automatic (Router key rotation) | | |
| | **Key rotation** | Automatic (SPIRE 1h TTL or cert-manager) | Manual (long-lived Router keypair in `picod-router-identity` Secret; no automatic rotation yet) | |
hzxuzhonghu
left a comment
generally LG, @YaoZengzeng to have another look
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: hzxuzhonghu. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/lgtm |
What type of PR is this?
/kind documentation
What this PR does / why we need it:
This PR introduces a design proposal for implementing authentication and authorization in AgentCube.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: