Security and Network
Klanker Maker uses explicit allowlists everywhere - if it's not in the policy, it's denied. There is no "default allow."
Sandboxes are accessed exclusively through AWS SSM Session Manager:
- Zero open inbound ports - Security Groups have no SSH ingress rules. Port 22 doesn't exist.
- No SSH keys to manage - no generation, rotation, distribution, or leaked keys on GitHub.
- IAM-gated access - who can connect is controlled by IAM policy, not by who has a .pem file.
- Full session audit - every session and every command is logged to CloudTrail and CloudWatch. There is no "off the record."
- No bastion hosts - no jump boxes, no VPN. SSM connects through the agent, even in private subnets with no internet access.
Even if a sandbox IAM role is misconfigured - or an agent finds a way to escalate within the application account - the Service Control Policy (SCP) is an org-level backstop that cannot be bypassed from within the account. SCPs are enforced by AWS Organizations at the API layer, before IAM policy evaluation.
The km-sandbox-containment SCP is deployed to the management account and attached to the application account. Six deny statements:
| Statement | What It Blocks | Why It Matters |
|---|---|---|
| DenyInfraAndStorage | SG mutation, VPC/subnet/route/IGW/NAT creation, VPC peering, Transit Gateway, snapshot/image creation and export | A compromised sandbox cannot open new network paths, create escape routes, peer with other VPCs, or exfiltrate data via EBS snapshots or AMI copies |
| DenyInstanceMutation | RunInstances, ModifyInstanceAttribute, ModifyInstanceMetadataOptions | Prevents launching rogue EC2 instances or disabling IMDSv2 (which would enable SSRF credential theft via the metadata service) |
| DenyIAMEscalation | CreateRole, AttachRolePolicy, DetachRolePolicy, PassRole, AssumeRole | Blocks the classic IAM privilege escalation chain: create a new admin role → attach AdministratorAccess → assume it |
| DenySSMPivot | SendCommand, StartSession | Prevents a compromised sandbox from using SSM to pivot laterally into other sandbox instances |
| DenyOrgDiscovery | organizations:List*, organizations:Describe* | Prevents enumeration of the org structure, other accounts, and OUs - information useful for targeting lateral movement |
| DenyOutsideRegion | All regional actions outside allowed regions | Region-locks the entire account to prevent resource creation in regions where there's no monitoring or VPC infrastructure |
Each statement uses ArnNotLike conditions to carve out trusted operator roles (SSO, provisioner, lifecycle handlers). The carve-outs are minimal - for example, the budget enforcer Lambda only gets an IAM carve-out (it needs AttachRolePolicy/DetachRolePolicy to revoke Bedrock access), not a network or instance carve-out.
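As an illustration of the carve-out pattern, here is a minimal sketch of how one deny statement with an ArnNotLike condition could be assembled. The role ARN patterns are hypothetical placeholders, not the ones km bootstrap actually generates, and the `ssm:` action prefixes are an assumption based on the standard IAM action names.

```python
import json


def deny_statement(sid, actions, exempt_role_arns):
    """Build one SCP deny statement that skips the listed operator roles."""
    return {
        "Sid": sid,
        "Effect": "Deny",
        "Action": actions,
        "Resource": "*",
        "Condition": {
            # The deny applies only when the caller's ARN matches none of
            # the exempt patterns, so trusted operator roles pass through.
            "ArnNotLike": {"aws:PrincipalARN": exempt_role_arns}
        },
    }


# Hypothetical operator role patterns, for illustration only.
EXEMPT_ROLES = [
    "arn:aws:iam::*:role/example-km-provisioner",
    "arn:aws:iam::*:role/aws-reserved/sso.amazonaws.com/*",
]

statement = deny_statement(
    "DenySSMPivot",
    ["ssm:SendCommand", "ssm:StartSession"],
    EXEMPT_ROLES,
)
policy = {"Version": "2012-10-17", "Statement": [statement]}
print(json.dumps(policy, indent=2))
```

Keeping the exempt list per statement, rather than one global list, is what lets a role such as the budget enforcer receive an IAM carve-out without also receiving a network or instance carve-out.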
The SCP is deployed via km bootstrap --dry-run=false. Run km bootstrap --show-prereqs to see the exact IAM role and trust policy that must be created in the management account first.
| Layer | Control | Enforcement |
|---|---|---|
| Organization | SCP sandbox containment | Org-level deny on SG/network/IAM/instance/SSM/region - cannot be bypassed from within the account |
| Account | Three-account isolation | Sandbox blast radius limited to Application account; state and DNS unreachable |
| Network | VPC Security Groups | Primary boundary - blocks all egress except proxy paths |
| DNS | DNS proxy sidecar / eBPF resolver | Allowlisted suffixes only; non-matching → NXDOMAIN |
| HTTP | HTTP proxy sidecar / eBPF connect4 | Allowlisted hosts only; non-matching → 403 / EPERM |
| eBPF | Cgroup BPF programs (connect4, sendmsg4, sockops, egress) | Kernel-level enforcement; LPM trie allowlist; ring buffer audit; no root bypass |
| Identity | Scoped IAM sessions | Region-locked, time-limited, minimal permissions |
| Email | Ed25519 signed email | Per-sandbox key pairs; profile-controlled signing, verification, and encryption policies |
| Slack | Bridge Lambda + signing secret | Outbound Ed25519-signed; inbound HMAC-verified; bot token never leaves AWS |
| Secrets | SSM Parameter Store + KMS | Allowlisted refs only; per-sandbox encryption key with auto-rotation |
| Metadata | IMDSv2 enforced | Token-required; blocks SSRF credential theft via instance metadata |
| Source | GitHub App scoped tokens | Per-repo, per-ref, per-permission; short-lived installation tokens refreshed via Lambda |
| Filesystem | Path-level enforcement | Writable vs read-only directories at OS level |
| Audit | Command + network logging | Secret-redacted; delivered to CloudWatch/S3 |
| TLS Observability | eBPF SSL uprobes (OpenSSL, Go, BoringSSL) | Passive plaintext capture without MITM certs; independent audit trail |
| Telemetry | OTel observability | Claude Code prompts, tool calls, API requests, cost metrics → OTel Collector → S3 |
| Budget | Compute + AI spend tracking | DynamoDB real-time metering; proxy 403 + IAM revocation at ceiling |
Three modes via spec.network.enforcement:
- proxy (default) - iptables DNAT redirects traffic to userspace proxy sidecars for MITM inspection. Traditional approach, works everywhere.
- ebpf - Cilium-style cgroup BPF programs enforce DNS/HTTP/TLS-SNI allowlists directly in the kernel. No iptables, no DNAT bypass possible (closes the root-user escape).
- both - eBPF connect4 as the primary block-mode enforcer, with selective DNAT rewrite to a transparent proxy for L7-required hosts (GitHub repo filtering, Bedrock token metering). Non-L7 traffic flows direct - never touches the proxy.
spec:
  network:
    enforcement: "both"
    egress:
      allowedDNSSuffixes: [".amazonaws.com", ".github.com", …]
      allowedHosts: ["api.anthropic.com", …]

When enforcement is ebpf or both, the sandbox uses Cilium-style cgroup BPF programs instead of (or alongside) iptables DNAT. Same approach Cilium uses in Kubernetes - attach BPF programs to a cgroup to intercept all network syscalls from processes in that group. E2E verified across 14+ iterations on AL2023 kernel 6.18.
Sandbox Cgroup (/sys/fs/cgroup/km.slice/km-{id}.scope)
│
├── cgroup/connect4 - TCP connect() hook
│   ├── Dual-PID exemption (enforcer + proxy sidecar)
│   ├── LPM trie lookup: is dest IP in allowed_cidrs?
│   ├── If denied → return EPERM (connection refused)
│   ├── If allowed + proxy-marked → stash original dest, rewrite to 127.0.0.1:3128
│   └── Emit structured audit event to ring buffer
│
├── cgroup/sendmsg4 - UDP sendmsg() hook
│   ├── Intercept DNS (port 53)
│   └── Redirect to local resolver (127.0.0.1:53)
│
├── sockops - TCP state transitions
│   └── Map source_port → socket_cookie (transparent proxy recovers real dest)
│
└── cgroup_skb/egress - Packet-level backstop
    ├── Parse IPv4 header, check allowed_cidrs
    └── Drop packets to non-allowlisted IPs (L3 defense-in-depth)
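The connect4 decision path in the diagram can be modeled in userspace to make the ordering of checks concrete. This is a minimal sketch, not the BPF program itself: LpmTrie is a toy stand-in for the kernel's LPM trie map, and the function name, return values, and sample addresses are all hypothetical.

```python
import ipaddress

PROXY_ADDR = ("127.0.0.1", 3128)


class LpmTrie:
    """Toy stand-in for a BPF LPM trie: the longest matching prefix wins."""

    def __init__(self):
        self._entries = {}  # ip_network -> value

    def insert(self, cidr, value):
        self._entries[ipaddress.ip_network(cidr)] = value

    def lookup(self, ip):
        addr = ipaddress.ip_address(ip)
        matches = [net for net in self._entries if addr in net]
        if not matches:
            return None
        return self._entries[max(matches, key=lambda net: net.prefixlen)]


def connect4(allowed_cidrs, proxy_ips, dest_ip, dest_port, pid, exempt_pids):
    """Model the hook's verdict as (action, effective destination)."""
    if pid in exempt_pids:
        # Enforcer and proxy sidecar are exempt from the allowlist.
        return ("allow", (dest_ip, dest_port))
    if allowed_cidrs.lookup(dest_ip) is None:
        # Not in the allowlist: connect() fails with EPERM.
        return ("deny", None)
    if dest_ip in proxy_ips:
        # L7-required host: rewrite the destination to the local proxy.
        return ("redirect", PROXY_ADDR)
    return ("allow", (dest_ip, dest_port))


# Documentation-range addresses used purely as sample data.
cidrs = LpmTrie()
cidrs.insert("203.0.113.0/24", "wide")
cidrs.insert("203.0.113.128/25", "narrow")
print(cidrs.lookup("203.0.113.200"))
print(connect4(cidrs, {"203.0.113.5"}, "203.0.113.5", 443, 4242, {1}))
```

The real hook also stashes the original destination and emits an audit event; both are omitted here to keep the sketch focused on the allow, deny, and redirect branches.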
How the allowlist stays fresh: A userspace DNS resolver (127.0.0.1:53) checks every DNS query against the profile's allowedDNSSuffixes. Allowed queries are forwarded to VPC DNS; resolved IPs are injected into the BPF allowed_cidrs LPM trie map with TTL-based expiry. For L7-required hosts (GitHub, Bedrock), IPs are also inserted into http_proxy_ips for selective proxy redirect. The allowlist is dynamic - it grows as the agent resolves new hosts and shrinks as DNS TTLs expire.
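The resolver loop described above can be sketched as follows. This is an illustrative model only: it assumes a suffix such as ".github.com" matches both subdomains and the bare domain (the project's actual matching rule is not stated here), and resolve is a hypothetical injected callable standing in for the VPC DNS lookup.

```python
import time


def suffix_allowed(name, allowed_suffixes):
    """Return True if name equals or is a subdomain of an allowed suffix."""
    name = name.rstrip(".").lower()
    for suffix in allowed_suffixes:
        bare = suffix.lower().lstrip(".")
        if name == bare or name.endswith("." + bare):
            return True
    return False


class TtlAllowlist:
    """IP allowlist whose entries expire when their DNS TTL runs out."""

    def __init__(self, clock=time.monotonic):
        self._expiry = {}  # ip -> absolute deadline from clock()
        self._clock = clock

    def add(self, ip, ttl_seconds):
        deadline = self._clock() + ttl_seconds
        # Keep the later deadline if the IP was already present.
        self._expiry[ip] = max(deadline, self._expiry.get(ip, 0.0))

    def sweep(self):
        """Remove expired entries and return the IPs that were dropped."""
        now = self._clock()
        expired = [ip for ip, deadline in self._expiry.items() if deadline <= now]
        for ip in expired:
            del self._expiry[ip]
        return expired

    def __contains__(self, ip):
        return self._expiry.get(ip, 0.0) > self._clock()


def handle_query(name, allowed_suffixes, resolve, allowlist):
    """Resolve an allowed name and record its IPs; None means NXDOMAIN."""
    if not suffix_allowed(name, allowed_suffixes):
        return None
    answers = resolve(name)  # expected shape: list of (ip, ttl_seconds)
    for ip, ttl in answers:
        allowlist.add(ip, ttl)
    return [ip for ip, _ in answers]
```

In the real system the add and sweep steps would update the BPF map rather than a Python dict; the injected clock exists only so expiry can be exercised deterministically.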
Why cgroups? The BPF programs are scoped to the sandbox cgroup, not the whole instance. The enforcer process, SSM agent, and sidecars run outside the cgroup and are unaffected. Same isolation model that makes this approach portable to EKS pods, Docker cgroups, and other container runtimes in future substrates.
Transparent proxy (both mode): When connect4 rewrites a connection's destination to the local proxy, the sandbox app sends raw TLS (not HTTP CONNECT). A TransparentListener in the HTTP proxy peeks the first byte (0x16 = TLS ClientHello), then recovers the original destination via a three-step BPF map lookup chain: src_port_to_sock[peer_port] → sock_to_original_ip[cookie] → sock_to_original_port[cookie]. Enables L7 inspection (GitHub repo filtering, Bedrock token metering) without HTTP_PROXY env var cooperation from the client.
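The three-step lookup chain can be modeled with plain dictionaries standing in for the BPF maps. This is a sketch under that assumption: the map names come from the text above, while the function name and sample values are hypothetical.

```python
TLS_HANDSHAKE = 0x16  # TLS record content type for handshake messages


def recover_original_dest(first_byte, peer_port, src_port_to_sock,
                          sock_to_original_ip, sock_to_original_port):
    """Return the (ip, port) the client originally dialed, or None."""
    if first_byte != TLS_HANDSHAKE:
        # Not raw TLS, so this is not a connection the hook redirected.
        return None
    # Step 1: the client's source port identifies the socket cookie.
    cookie = src_port_to_sock.get(peer_port)
    if cookie is None:
        return None
    # Steps 2 and 3: the cookie keys the stashed destination IP and port.
    ip = sock_to_original_ip.get(cookie)
    port = sock_to_original_port.get(cookie)
    if ip is None or port is None:
        return None
    return (ip, port)


# Sample map contents, for illustration only.
src_port_to_sock = {51515: 0xABCDEF}
sock_to_original_ip = {0xABCDEF: "203.0.113.5"}
sock_to_original_port = {0xABCDEF: 443}

print(recover_original_dest(0x16, 51515, src_port_to_sock,
                            sock_to_original_ip, sock_to_original_port))
```

Returning None on any missing link lets the caller fail closed instead of guessing a destination.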
Editable diagram: docs/diagrams/ebpf-architecture.excalidraw
Alongside kernel-level enforcement, eBPF uprobes provide passive TLS plaintext capture for audit and observability - without MITM certificates. E2E verified on AL2023 with 8 probes attaching to OpenSSL 3.2.2:
| TLS Library | Used By | Uprobe Target | Status |
|---|---|---|---|
| OpenSSL (libssl.so.3) | curl, wget, Python, Ruby | SSL_write / SSL_read / SSL_write_ex / SSL_read_ex | E2E verified (8 probes) |
| Go crypto/tls | Goose (if Go) | writeRecordLocked / Read | Schema-ready (per-RET offsets, no uretprobe) |
| BoringSSL (Bun) | Claude Code | SSL_write | Schema-ready (byte-pattern offset discovery) |
| rustls | Future Rust agents | rustls_connection_write_tls | Schema-ready |
What uprobes add that MITM can't: Visibility into traffic that bypasses the proxy (if any), audit trail independent of proxy logs, plaintext capture without certificate trust issues. The observer logs structured JSON events with HTTP method, URL, host, and response status for every TLS connection. Git-smart-HTTP (clone/push) uses HTTP/1.1 and is captured correctly.
What uprobes can't replace: Active request blocking (uprobes are passive - they observe but cannot deny), HTTP/2 body parsing (GitHub API and Bedrock use HTTP/2 - uprobe captures HPACK-compressed binary, not parseable HTTP/1.1), and the transparent proxy's active enforcement (repo filtering, budget 403s).
Generate a minimal SandboxProfile from observed traffic:
km create profiles/learn.yaml # wide-open sandbox with learnMode + privileged
km shell --learn <sandbox-id> # observe traffic + commands, generate profile on exit
cat learned.*.yaml # annotated profile with DNS suffixes, initCommands
km validate learned.*.yaml # validate before use

Add --ami to km shell --learn to bake the running instance into a custom AMI on exit; the AMI ID is written into the generated profile's spec.runtime.ami.
Budget enforcement tracks two spend pools per sandbox, compute and AI, stored in a DynamoDB global table replicated to every region where agents run. Reads from within the sandbox hit the local regional replica with single-digit-millisecond latency.
Compute spend is tracked as spot rate × elapsed minutes, with the rate sourced from the AWS Price List API at sandbox creation. Paused/hibernated intervals are excluded. When the compute budget is exhausted, the sandbox is suspended - not destroyed:
- EC2: StopInstances preserves the EBS volume. No compute charges accrue while stopped.
- ECS Fargate: Artifacts are uploaded, then the task is stopped. Re-provision from the stored S3 profile on top-up.
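The compute metering rule (rate × elapsed minutes, excluding paused intervals) can be sketched as below. This assumes a per-minute rate, timestamps in epoch seconds, and non-overlapping pause intervals; the function names are hypothetical.

```python
def billable_minutes(start_s, end_s, paused=()):
    """Minutes between start and end, minus time spent paused."""
    billable = end_s - start_s
    for pause_start, pause_end in paused:
        # Count only the part of the pause that falls inside the window.
        overlap = min(pause_end, end_s) - max(pause_start, start_s)
        if overlap > 0:
            billable -= overlap
    return max(billable, 0.0) / 60.0


def compute_spend(rate_per_minute, start_s, end_s, paused=()):
    """Spend = rate captured at creation x billable minutes."""
    return rate_per_minute * billable_minutes(start_s, end_s, paused)


# One hour of wall-clock time with a ten-minute pause in the middle.
print(billable_minutes(0, 3600, [(600, 1200)]))
print(compute_spend(0.002, 0, 3600, [(600, 1200)]))
```

Because the rate is fixed at sandbox creation, the spend is a pure function of the timeline and can be recomputed at any point without another pricing lookup.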
The HTTP proxy sidecar intercepts every AI API response - Bedrock (invoke-with-response-stream), Anthropic direct (api.anthropic.com, for Claude Code Max/API key users), and OpenAI-compatible endpoints. A tee-reader streams data through to the client without blocking, captures the full response, then extracts token counts asynchronously:
- Bedrock streaming: base64-decodes {"bytes":"<b64>"} event-stream wrappers to find message_start / message_delta payloads
- Anthropic SSE: parses data: lines for the same event types
- Non-streaming: reads usage from the JSON response body
Tokens are priced against static model rates and atomically incremented in the DynamoDB spend counter.
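A minimal sketch of the Anthropic SSE case follows. It assumes input tokens arrive in message_start under message.usage and that message_delta carries a cumulative output_tokens count under usage. The rate table keys and prices are hypothetical, not the project's actual model rates.

```python
import json


def extract_usage(sse_text):
    """Scan SSE data: lines and return (input_tokens, output_tokens)."""
    input_tokens = 0
    output_tokens = 0
    for line in sse_text.splitlines():
        if not line.startswith("data:"):
            continue
        try:
            event = json.loads(line[len("data:"):].strip())
        except json.JSONDecodeError:
            continue  # skip keep-alives and non-JSON payloads
        if not isinstance(event, dict):
            continue
        if event.get("type") == "message_start":
            usage = event.get("message", {}).get("usage", {})
            input_tokens = usage.get("input_tokens", 0)
            output_tokens = usage.get("output_tokens", 0)
        elif event.get("type") == "message_delta":
            # Assumed cumulative, so the latest value replaces the old one.
            usage = event.get("usage", {})
            output_tokens = usage.get("output_tokens", output_tokens)
    return input_tokens, output_tokens


def cost_usd(input_tokens, output_tokens, rates):
    """Price tokens against static per-million-token rates."""
    total = (input_tokens * rates["input_per_mtok"]
             + output_tokens * rates["output_per_mtok"])
    return total / 1_000_000


SAMPLE = "\n".join([
    "event: message_start",
    'data: {"type": "message_start", "message": {"usage": {"input_tokens": 25, "output_tokens": 1}}}',
    "",
    "event: message_delta",
    'data: {"type": "message_delta", "delta": {}, "usage": {"output_tokens": 15}}',
])
HYPOTHETICAL_RATES = {"input_per_mtok": 3.0, "output_per_mtok": 15.0}

tokens_in, tokens_out = extract_usage(SAMPLE)
print(tokens_in, tokens_out, cost_usd(tokens_in, tokens_out, HYPOTHETICAL_RATES))
```

For the Bedrock streaming case, the same parser could run on each payload after base64-decoding the bytes field of the event-stream wrapper.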
Dual-layer enforcement at 100%:
- Proxy layer (immediate) - HTTP proxy returns 403 for subsequent AI calls
- IAM layer (backstop) - a Lambda revokes the sandbox IAM role's Bedrock permissions, catching calls that bypass the proxy
km status shows per-model AI spend grouped by provider; warnings fire at 80% (configurable via spec.budget.warningThreshold). km budget add unblocks the proxy, restores IAM, and restarts suspended compute in one command.
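The warning and ceiling thresholds can be expressed as one small decision function. This is a sketch only: the function name and return values are hypothetical, and 0.8 mirrors the default warning threshold mentioned above.

```python
def budget_action(spent, limit, warning_threshold=0.8):
    """Return "ok", "warn", or "block" for the current spend level."""
    if limit <= 0:
        # No usable budget configured: fail closed.
        return "block"
    ratio = spent / limit
    if ratio >= 1.0:
        # At the ceiling: proxy returns 403 and IAM access is revoked.
        return "block"
    if ratio >= warning_threshold:
        return "warn"
    return "ok"


for spent in (2.0, 8.5, 10.0):
    print(spent, budget_action(spent, limit=10.0))
```

After a top-up the limit rises, so the same function returns "ok" or "warn" again without any extra state to reset.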
Claude Code running inside sandboxes exports OpenTelemetry (prompts, tool calls, API requests, token usage, cost metrics) through an OTel Collector sidecar to S3. Five views via km otel:
km otel <sandbox> # summary: budget + S3 + metrics
km otel <sandbox> --prompts # user prompts with timestamps
km otel <sandbox> --events # full event stream (API calls, tool calls)
km otel <sandbox> --tools # tool call history with parameters and duration
km otel <sandbox> --timeline # conversation turns with per-turn cost