
Security and Network

KPH edited this page Jun 24, 2026 · 1 revision

Security & Network

Security Model

Klanker Maker uses explicit allowlists everywhere - if it's not in the policy, it's denied. There is no "default allow."

No SSH. No Bastion. No Keys.

Sandboxes are accessed exclusively through AWS SSM Session Manager:

  • Zero open inbound ports - Security Groups have no SSH ingress rules. Port 22 doesn't exist.
  • No SSH keys to manage - no generation, rotation, distribution, or leaked keys on GitHub.
  • IAM-gated access - who can connect is controlled by IAM policy, not by who has a .pem file.
  • Full session audit - every session start is recorded in CloudTrail, and session activity (every command) is logged to CloudWatch. There is no "off the record."
  • No bastion hosts - no jump boxes, no VPN. SSM connects through the agent, even in private subnets with no internet access (the agent reaches SSM over VPC interface endpoints).

SCP Sandbox Containment

Even if a sandbox IAM role is misconfigured - or an agent finds a way to escalate within the application account - the Service Control Policy (SCP) is an org-level backstop that cannot be bypassed from within the account. SCPs are enforced by AWS Organizations at the API layer: an explicit deny in an SCP overrides any allow granted by IAM policies inside the account.

The km-sandbox-containment SCP is deployed to the management account and attached to the application account. Six deny statements:

| Statement | What It Blocks | Why It Matters |
|---|---|---|
| DenyInfraAndStorage | SG mutation, VPC/subnet/route/IGW/NAT creation, VPC peering, Transit Gateway, snapshot/image creation and export | A compromised sandbox cannot open new network paths, create escape routes, peer with other VPCs, or exfiltrate data via EBS snapshots or AMI copies |
| DenyInstanceMutation | RunInstances, ModifyInstanceAttribute, ModifyInstanceMetadataOptions | Prevents launching rogue EC2 instances or disabling IMDSv2 (which would enable SSRF credential theft via the metadata service) |
| DenyIAMEscalation | CreateRole, AttachRolePolicy, DetachRolePolicy, PassRole, AssumeRole | Blocks the classic IAM privilege escalation chain: create a new admin role → attach AdministratorAccess → assume it |
| DenySSMPivot | SendCommand, StartSession | Prevents a compromised sandbox from using SSM to pivot laterally into other sandbox instances |
| DenyOrgDiscovery | organizations:List*, organizations:Describe* | Prevents enumeration of the org structure, other accounts, and OUs - information useful for targeting lateral movement |
| DenyOutsideRegion | All regional actions outside allowed regions | Region-locks the entire account to prevent resource creation in regions where there's no monitoring or VPC infrastructure |

Each statement uses ArnNotLike conditions to carve out trusted operator roles (SSO, provisioner, lifecycle handlers). The carve-outs are minimal - for example, the budget enforcer Lambda only gets an IAM carve-out (it needs AttachRolePolicy/DetachRolePolicy to revoke Bedrock access), not a network or instance carve-out.
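
To illustrate the carve-out pattern, here is a minimal sketch of what one such deny statement might look like, built as a Python dict and serialized to JSON. The role ARN patterns are hypothetical placeholders, and the use of the aws:PrincipalArn condition key is an assumption; the statements that km bootstrap actually generates may differ.

```python
import json

# Hypothetical trusted-role patterns (assumed names, not taken from km bootstrap).
TRUSTED_ROLE_PATTERNS = [
    "arn:aws:iam::*:role/aws-reserved/sso.amazonaws.com/*",  # SSO operator roles
    "arn:aws:iam::*:role/km-provisioner",                    # assumed provisioner role name
    "arn:aws:iam::*:role/km-budget-enforcer",                # assumed budget enforcer role name
]


def deny_statement(sid, actions, exempt_role_patterns):
    """Build one SCP deny statement that skips principals matching the given ARN patterns."""
    return {
        "Sid": sid,
        "Effect": "Deny",
        "Action": sorted(actions),
        "Resource": "*",
        # Assumption: the carve-out is keyed on aws:PrincipalArn.
        "Condition": {"ArnNotLike": {"aws:PrincipalArn": list(exempt_role_patterns)}},
    }


iam_escalation = deny_statement(
    "DenyIAMEscalation",
    [
        "iam:CreateRole",
        "iam:AttachRolePolicy",
        "iam:DetachRolePolicy",
        "iam:PassRole",
        "sts:AssumeRole",
    ],
    TRUSTED_ROLE_PATTERNS,
)

scp = {"Version": "2012-10-17", "Statement": [iam_escalation]}
print(json.dumps(scp, indent=2))
```

Keeping the exempt list as a per-statement argument mirrors the point above: a role such as the budget enforcer can be exempted from the IAM statement without being exempted from the network or instance statements.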

The SCP is deployed via km bootstrap --dry-run=false. Run km bootstrap --show-prereqs to see the exact IAM role and trust policy that must be created in the management account first.

Defense in Depth

| Layer | Control | Enforcement |
|---|---|---|
| Organization | SCP sandbox containment | Org-level deny on SG/network/IAM/instance/SSM/region - cannot be bypassed from within the account |
| Account | Three-account isolation | Sandbox blast radius limited to Application account; state and DNS unreachable |
| Network | VPC Security Groups | Primary boundary - blocks all egress except proxy paths |
| DNS | DNS proxy sidecar / eBPF resolver | Allowlisted suffixes only; non-matching → NXDOMAIN |
| HTTP | HTTP proxy sidecar / eBPF connect4 | Allowlisted hosts only; non-matching → 403 / EPERM |
| eBPF | Cgroup BPF programs (connect4, sendmsg4, sockops, egress) | Kernel-level enforcement; LPM trie allowlist; ring buffer audit; no root bypass |
| Identity | Scoped IAM sessions | Region-locked, time-limited, minimal permissions |
| Email | Ed25519 signed email | Per-sandbox key pairs; profile-controlled signing, verification, and encryption policies |
| Slack | Bridge Lambda + signing secret | Outbound Ed25519-signed; inbound HMAC-verified; bot token never leaves AWS |
| Secrets | SSM Parameter Store + KMS | Allowlisted refs only; per-sandbox encryption key with auto-rotation |
| Metadata | IMDSv2 enforced | Token-required; blocks SSRF credential theft via instance metadata |
| Source | GitHub App scoped tokens | Per-repo, per-ref, per-permission; short-lived installation tokens refreshed via Lambda |
| Filesystem | Path-level enforcement | Writable vs read-only directories at OS level |
| Audit | Command + network logging | Secret-redacted; delivered to CloudWatch/S3 |
| TLS Observability | eBPF SSL uprobes (OpenSSL, Go, BoringSSL) | Passive plaintext capture without MITM certs; independent audit trail |
| Telemetry | OTEL observability | Claude Code prompts, tool calls, API requests, cost metrics → OTel Collector → S3 |
| Budget | Compute + AI spend tracking | DynamoDB real-time metering; proxy 403 + IAM revocation at ceiling |

Network Enforcement

Three modes via spec.network.enforcement:

  • proxy (default) - iptables DNAT redirects traffic to userspace proxy sidecars for MITM inspection. Traditional approach, works everywhere.
  • ebpf - Cilium-style cgroup BPF programs enforce DNS/HTTP/TLS-SNI allowlists directly in the kernel. No iptables, no DNAT bypass possible (closes the root-user escape).
  • both - eBPF connect4 as the primary block-mode enforcer, with selective DNAT rewrite to a transparent proxy for L7-required hosts (GitHub repo filtering, Bedrock token metering). Non-L7 traffic flows direct - never touches the proxy.
spec:
  network:
    enforcement: "both"
    egress:
      allowedDNSSuffixes: [".amazonaws.com", ".github.com", …]
      allowedHosts: ["api.anthropic.com", …]

eBPF deep dive

When enforcement is ebpf or both, the sandbox uses Cilium-style cgroup BPF programs instead of (or alongside) iptables DNAT. Same approach Cilium uses in Kubernetes - attach BPF programs to a cgroup to intercept all network syscalls from processes in that group. E2E verified across 14+ iterations on AL2023 kernel 6.18.

Sandbox Cgroup (/sys/fs/cgroup/km.slice/km-{id}.scope)
│
├── cgroup/connect4   - TCP connect() hook
│   ├── Dual-PID exemption (enforcer + proxy sidecar)
│   ├── LPM trie lookup: is dest IP in allowed_cidrs?
│   ├── If denied → return EPERM (connect() is rejected)
│   ├── If allowed + proxy-marked → stash original dest, rewrite to 127.0.0.1:3128
│   └── Emit structured audit event to ring buffer
│
├── cgroup/sendmsg4   - UDP sendmsg() hook
│   ├── Intercept DNS (port 53)
│   └── Redirect to local resolver (127.0.0.1:53)
│
├── sockops           - TCP state transitions
│   └── Map source_port → socket_cookie (transparent proxy recovers real dest)
│
└── cgroup_skb/egress - Packet-level backstop
    ├── Parse IPv4 header, check allowed_cidrs
    └── Drop packets to non-allowlisted IPs (L3 defense-in-depth)
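
The connect4 branch of the tree can be modelled in userspace to make the decision order concrete. This is a sketch in Python, not the BPF program itself: the LpmTrie class, the state dictionary, and the audit record shape are all hypothetical stand-ins, and whether exempt PIDs emit audit events is not stated in the source (here they do not).

```python
import ipaddress

PROXY_ADDR = ("127.0.0.1", 3128)  # local transparent proxy address from the tree


class LpmTrie:
    """Toy longest-prefix-match table; stands in for a BPF LPM trie map."""

    def __init__(self):
        self._entries = {}  # ip_network -> value

    def insert(self, cidr, value=True):
        self._entries[ipaddress.ip_network(cidr)] = value

    def lookup(self, ip):
        addr = ipaddress.ip_address(ip)
        matches = [net for net in self._entries if addr in net]
        if not matches:
            return None
        # The most specific (longest) matching prefix wins.
        return self._entries[max(matches, key=lambda net: net.prefixlen)]


def connect4(pid, cookie, dest, state):
    """Return (verdict, effective_dest) for one outbound TCP connect()."""
    ip, _port = dest
    if pid in state["exempt_pids"]:
        # Enforcer and proxy sidecar bypass the allowlist (assumed: no audit event).
        return "allow", dest
    if state["allowed_cidrs"].lookup(ip) is None:
        state["audit"].append({"pid": pid, "dest": dest, "verdict": "EPERM"})
        return "EPERM", None
    effective = dest
    if ip in state["http_proxy_ips"]:
        # Stash the real destination so the proxy can recover it later.
        state["original_dest"][cookie] = dest
        effective = PROXY_ADDR
    state["audit"].append(
        {"pid": pid, "dest": dest, "verdict": "allow", "rewritten": effective != dest}
    )
    return "allow", effective


allowed = LpmTrie()
allowed.insert("198.51.100.0/24")  # documentation range, used as an example allowlist entry
state = {
    "exempt_pids": {1},
    "allowed_cidrs": allowed,
    "http_proxy_ips": {"198.51.100.7"},
    "original_dest": {},
    "audit": [],
}
print(connect4(4242, 1001, ("198.51.100.7", 443), state))  # allowed and rewritten to the proxy
print(connect4(4242, 1002, ("203.0.113.9", 443), state))   # not in the allowlist
```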

How the allowlist stays fresh: A userspace DNS resolver (127.0.0.1:53) checks every DNS query against the profile's allowedDNSSuffixes. Allowed queries are forwarded to VPC DNS; resolved IPs are injected into the BPF allowed_cidrs LPM trie map with TTL-based expiry. For L7-required hosts (GitHub, Bedrock), IPs are also inserted into http_proxy_ips for selective proxy redirect. The allowlist is dynamic - it grows as the agent resolves new hosts and shrinks as DNS TTLs expire.
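
The resolver loop described above can be sketched as follows. This is an illustrative Python model under stated assumptions: the names suffix_allowed, ExpiringAllowlist, and handle_query are hypothetical, the upstream resolver is passed in as a plain function returning (ip, ttl) pairs, and treating ".github.com" as also matching the bare name "github.com" is an assumption about suffix semantics.

```python
import time


def suffix_allowed(qname, allowed_suffixes):
    """True if qname matches an allowed suffix on a DNS label boundary."""
    name = qname.rstrip(".").lower()
    for suffix in allowed_suffixes:
        bare = suffix.lower().lstrip(".")
        # Assumption: ".github.com" covers both "github.com" and "api.github.com".
        if name == bare or name.endswith("." + bare):
            return True
    return False


class ExpiringAllowlist:
    """IP allowlist whose entries lapse when their DNS TTL runs out."""

    def __init__(self, clock=time.monotonic):
        self._expiry = {}  # ip -> expiry timestamp
        self._clock = clock

    def add(self, ip, ttl_seconds):
        expires = self._clock() + ttl_seconds
        # Re-resolving a host can only extend an entry, never shorten it.
        self._expiry[ip] = max(expires, self._expiry.get(ip, float("-inf")))

    def sweep(self):
        """Drop expired entries (a real enforcer would delete them from the BPF map)."""
        now = self._clock()
        for ip in [ip for ip, exp in self._expiry.items() if exp <= now]:
            del self._expiry[ip]

    def __contains__(self, ip):
        exp = self._expiry.get(ip)
        return exp is not None and exp > self._clock()


def handle_query(qname, allowed_suffixes, upstream, allowlist):
    """Answer one DNS query: NXDOMAIN if not allowlisted, else resolve and record IPs."""
    if not suffix_allowed(qname, allowed_suffixes):
        return "NXDOMAIN", []
    answers = upstream(qname)  # list of (ip, ttl_seconds) pairs
    for ip, ttl in answers:
        allowlist.add(ip, ttl)
    return "NOERROR", [ip for ip, _ in answers]
```

Passing the clock in as a parameter keeps the expiry logic testable without sleeping.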

Why cgroups? The BPF programs are scoped to the sandbox cgroup, not the whole instance. The enforcer process, SSM agent, and sidecars run outside the cgroup and are unaffected. The same isolation model makes this approach portable to EKS pods, Docker cgroups, and other container runtimes in future substrates.

Transparent proxy (both mode): When connect4 rewrites a connection's destination to the local proxy, the sandbox app sends raw TLS (not HTTP CONNECT). A TransparentListener in the HTTP proxy peeks the first byte (0x16 = TLS ClientHello), then recovers the original destination via a three-step BPF map lookup chain: src_port_to_sock[peer_port] → sock_to_original_ip[cookie] → sock_to_original_port[cookie]. Enables L7 inspection (GitHub repo filtering, Bedrock token metering) without HTTP_PROXY env var cooperation from the client.
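
The three-step lookup chain can be sketched with plain dictionaries standing in for the BPF maps. This is a hedged illustration only: the map names come from the text above, but the function names, the string representation of IPs, and the handling of a non-TLS first byte (treated here as plain HTTP) are assumptions.

```python
TLS_HANDSHAKE = 0x16  # TLS record content type for handshake messages such as ClientHello


def classify_first_byte(first_byte):
    """Decide how to treat a new connection from its first peeked byte."""
    # Assumption: anything that is not a TLS handshake record is handled as plain HTTP.
    return "tls" if first_byte == TLS_HANDSHAKE else "http"


def recover_original_dest(peer_port, src_port_to_sock, sock_to_original_ip, sock_to_original_port):
    """Follow src port -> socket cookie -> (original ip, original port); None if any hop is missing."""
    cookie = src_port_to_sock.get(peer_port)
    if cookie is None:
        return None
    ip = sock_to_original_ip.get(cookie)
    port = sock_to_original_port.get(cookie)
    if ip is None or port is None:
        return None
    return ip, port


# Example maps as the proxy might see them after connect4 and sockops have run.
src_port_to_sock = {51514: 0xABCDEF}
sock_to_original_ip = {0xABCDEF: "198.51.100.7"}
sock_to_original_port = {0xABCDEF: 443}

print(classify_first_byte(0x16))
print(recover_original_dest(51514, src_port_to_sock, sock_to_original_ip, sock_to_original_port))
```

Returning None on a missing hop lets the caller fail closed rather than guess a destination.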

Editable diagram: docs/diagrams/ebpf-architecture.excalidraw

eBPF SSL uprobe observability

Alongside kernel-level enforcement, eBPF uprobes provide passive TLS plaintext capture for audit and observability - without MITM certificates. E2E verified on AL2023 with 8 probes attaching to OpenSSL 3.2.2:

| TLS Library | Used By | Uprobe Target | Status |
|---|---|---|---|
| OpenSSL (libssl.so.3) | curl, wget, Python, Ruby | SSL_write / SSL_read / SSL_write_ex / SSL_read_ex | E2E verified (8 probes) |
| Go crypto/tls | Goose (if Go) | writeRecordLocked / Read | Schema-ready (per-RET offsets, no uretprobe) |
| BoringSSL (Bun) | Claude Code | SSL_write | Schema-ready (byte-pattern offset discovery) |
| rustls | Future Rust agents | rustls_connection_write_tls | Schema-ready |

What uprobes add that MITM can't: Visibility into traffic that bypasses the proxy (if any), audit trail independent of proxy logs, plaintext capture without certificate trust issues. The observer logs structured JSON events with HTTP method, URL, host, and response status for every TLS connection. Git-smart-HTTP (clone/push) uses HTTP/1.1 and is captured correctly.

What uprobes can't replace: Active request blocking (uprobes are passive - they observe but cannot deny), HTTP/2 body parsing (GitHub API and Bedrock use HTTP/2 - uprobe captures HPACK-compressed binary, not parseable HTTP/1.1), and the transparent proxy's active enforcement (repo filtering, budget 403s).
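
The HTTP/1.1 versus HTTP/2 distinction can be made concrete with a small classifier over a captured plaintext buffer. This is a sketch of the idea, not the observer's actual parser: summarize_write and its output fields are hypothetical. The HTTP/2 client connection preface is the fixed byte string defined by the HTTP/2 specification.

```python
HTTP2_PREFACE = b"PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"
HTTP1_METHODS = (b"GET", b"POST", b"PUT", b"DELETE", b"HEAD", b"OPTIONS", b"PATCH", b"CONNECT")


def summarize_write(buf):
    """Summarize the first plaintext buffer captured on a connection's write path."""
    if buf.startswith(HTTP2_PREFACE):
        # What follows is binary framing with HPACK-compressed headers.
        return {"protocol": "h2", "parseable": False}
    line, _, rest = buf.partition(b"\r\n")
    parts = line.split(b" ")
    if len(parts) == 3 and parts[0] in HTTP1_METHODS and parts[2].startswith(b"HTTP/1."):
        host = None
        for header in rest.split(b"\r\n"):
            if not header:
                break  # blank line ends the header section
            name, _, value = header.partition(b":")
            if name.strip().lower() == b"host":
                host = value.strip().decode("ascii", "replace")
        return {
            "protocol": "http/1.x",
            "parseable": True,
            "method": parts[0].decode("ascii"),
            "path": parts[1].decode("ascii", "replace"),
            "host": host,
        }
    return {"protocol": "unknown", "parseable": False}


git_fetch = b"GET /org/repo.git/info/refs?service=git-upload-pack HTTP/1.1\r\nHost: github.com\r\n\r\n"
print(summarize_write(git_fetch))
print(summarize_write(HTTP2_PREFACE + b"\x00\x00\x12\x04"))
```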

Learn mode

Generate a minimal SandboxProfile from observed traffic:

km create profiles/learn.yaml          # wide-open sandbox with learnMode + privileged
km shell --learn <sandbox-id>          # observe traffic + commands, generate profile on exit
cat learned.*.yaml                     # annotated profile with DNS suffixes, initCommands
km validate learned.*.yaml             # validate before use

Add --ami to km shell --learn to bake the running instance into a custom AMI on exit; the AMI ID is written into the generated profile's spec.runtime.ami.


Budget Enforcement

Budget enforcement tracks two spend pools per sandbox, stored in a DynamoDB global table replicated to every region where agents run. Reads from within the sandbox hit the local regional replica with single-digit-millisecond latency.

Figure: Budget Enforcement Flow - proxy metering, DynamoDB tracking, dual-layer enforcement

Compute budget

Tracked as spot rate × elapsed minutes, sourced from the AWS Price List API at sandbox creation. Paused/hibernated intervals are excluded. When the compute budget is exhausted, the sandbox is suspended - not destroyed:

  • EC2: StopInstances preserves the EBS volume. No compute charges accrue while stopped.
  • ECS Fargate: Artifacts are uploaded, then the task is stopped. Re-provision from the stored S3 profile on top-up.
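
The compute-spend arithmetic can be sketched as below. This is a worked example under assumptions: the rate is taken to be quoted per hour and converted to a per-minute rate, paused intervals are assumed not to overlap each other, and the function names are hypothetical.

```python
from datetime import datetime


def billable_minutes(start, end, paused_intervals):
    """Elapsed minutes between start and end, minus time spent paused or hibernated."""
    total = (end - start).total_seconds()
    for p_start, p_end in paused_intervals:
        # Clip each pause to the [start, end] window before subtracting it.
        overlap_start = max(start, p_start)
        overlap_end = min(end, p_end)
        if overlap_end > overlap_start:
            total -= (overlap_end - overlap_start).total_seconds()
    return max(total, 0.0) / 60.0


def compute_spend_usd(hourly_rate_usd, minutes):
    """Spend = per-minute rate x billable minutes (assumes an hourly quoted rate)."""
    return hourly_rate_usd / 60.0 * minutes


start = datetime(2026, 1, 1, 10, 0)
end = datetime(2026, 1, 1, 12, 0)
paused = [(datetime(2026, 1, 1, 10, 30), datetime(2026, 1, 1, 11, 0))]

minutes = billable_minutes(start, end, paused)  # 120 minutes elapsed, 30 paused
spend = compute_spend_usd(0.12, minutes)        # 0.12 USD/hour is a made-up example rate
print(minutes, round(spend, 4))
```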

AI budget (Bedrock, Anthropic, OpenAI)

The HTTP proxy sidecar intercepts every AI API response - Bedrock (invoke-with-response-stream), Anthropic direct (api.anthropic.com, for Claude Code Max/API key users), and OpenAI-compatible endpoints. A tee-reader streams data through to the client without blocking, captures the full response, then extracts token counts asynchronously:

  • Bedrock streaming: base64-decodes {"bytes":"<b64>"} event-stream wrappers to find message_start/message_delta payloads
  • Anthropic SSE: parses data: lines for the same event types
  • Non-streaming: reads usage from the JSON response body

Tokens are priced against static model rates and atomically incremented in the DynamoDB spend counter.
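
The token-extraction step can be sketched like this. It is a simplified model, not the proxy's code: it assumes the Bedrock binary event-stream framing has already been decoded into JSON payload strings, it relies on the Anthropic streaming convention that message_delta carries a cumulative output_tokens count, and the rate table is a made-up example rather than real pricing.

```python
import base64
import json


def events_from_sse(body):
    """Yield JSON events from the data: lines of an Anthropic-style SSE body."""
    for line in body.splitlines():
        if line.startswith("data:"):
            payload = line[len("data:"):].strip()
            if payload:
                yield json.loads(payload)


def events_from_bedrock_payloads(payloads):
    """Yield JSON events from Bedrock {"bytes": "<b64>"} wrappers (framing already removed)."""
    for raw in payloads:
        wrapper = json.loads(raw)
        yield json.loads(base64.b64decode(wrapper["bytes"]))


def usage_from_events(events):
    """Return (input_tokens, output_tokens) from message_start / message_delta events."""
    input_tokens = output_tokens = 0
    for ev in events:
        if ev.get("type") == "message_start":
            usage = ev.get("message", {}).get("usage", {})
            input_tokens = usage.get("input_tokens", 0)
            output_tokens = usage.get("output_tokens", 0)
        elif ev.get("type") == "message_delta":
            # Cumulative count: overwrite rather than add.
            output_tokens = ev.get("usage", {}).get("output_tokens", output_tokens)
    return input_tokens, output_tokens


def cost_usd(input_tokens, output_tokens, rate):
    """Price tokens against a static per-million-token rate (hypothetical rate shape)."""
    return input_tokens / 1e6 * rate["input_per_mtok"] + output_tokens / 1e6 * rate["output_per_mtok"]


sse_body = (
    "event: message_start\n"
    'data: {"type":"message_start","message":{"usage":{"input_tokens":25,"output_tokens":1}}}\n'
    "\n"
    "event: message_delta\n"
    'data: {"type":"message_delta","usage":{"output_tokens":15}}\n'
)
tokens = usage_from_events(events_from_sse(sse_body))
print(tokens, cost_usd(*tokens, {"input_per_mtok": 3.0, "output_per_mtok": 15.0}))
```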

Dual-layer enforcement at 100%:

  1. Proxy layer (immediate) - HTTP proxy returns 403 for subsequent AI calls
  2. IAM layer (backstop) - a Lambda revokes the sandbox IAM role's Bedrock permissions, catching calls that bypass the proxy

km status shows per-model AI spend grouped by provider; warnings fire at 80% (configurable via spec.budget.warningThreshold). km budget add unblocks the proxy, restores IAM, and restarts suspended compute in one command.
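
The threshold behaviour can be summarized in a few lines. This is a sketch with hypothetical names: expressing the warning threshold as a fraction (0.8) rather than a percentage is an assumption about how spec.budget.warningThreshold is encoded, and the action labels are illustrative, not identifiers from the tool.

```python
def budget_state(spent_usd, limit_usd, warning_threshold=0.8):
    """Classify spend against a budget: ok, warning (at the threshold), or blocked (at 100%)."""
    if limit_usd <= 0:
        raise ValueError("limit_usd must be positive")
    ratio = spent_usd / limit_usd
    if ratio >= 1.0:
        return "blocked"
    if ratio >= warning_threshold:
        return "warning"
    return "ok"


def enforcement_actions(state):
    """Map a budget state to the layers that act on it (illustrative labels)."""
    if state == "blocked":
        # Both layers fire: the proxy responds immediately, IAM revocation is the backstop.
        return ["proxy_403", "revoke_bedrock_iam"]
    if state == "warning":
        return ["emit_warning"]
    return []


for spent in (5.0, 8.5, 10.0):
    state = budget_state(spent, 10.0)
    print(spent, state, enforcement_actions(state))
```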

OTEL telemetry

Claude Code running inside sandboxes exports OpenTelemetry (prompts, tool calls, API requests, token usage, cost metrics) through an OTel Collector sidecar to S3. Five views via km otel:

km otel <sandbox>              # summary: budget + S3 + metrics
km otel <sandbox> --prompts    # user prompts with timestamps
km otel <sandbox> --events     # full event stream (API calls, tool calls)
km otel <sandbox> --tools      # tool call history with parameters and duration
km otel <sandbox> --timeline   # conversation turns with per-turn cost