Pluggable read-only cloud-context MCP (GCP and AWS) #44

@sourcehawk

Problem

When triaging a Kubernetes incident, the operator agent can inspect the cluster (triagent-k8s) and reach it through Teleport (triagent-teleport), but it is blind to the cloud layer the cluster sits on. Reachability, permissions, managed-cluster config, logs, and "what changed right before this broke" all live in the cloud, not the cluster, and the two clouds we run on (GCP and AWS) answer them through different APIs.

This epic adds one read-only cloud-context MCP that gives the agent that context without ever being able to mutate cloud state or escalate its own privilege. Adding coverage is a config edit, not new Go; adding a cloud is a new provider behind one interface, not a parallel MCP.

In scope

  • A pkg/mcp/cloud/ package, provider-selected via --kind=cloud --provider=<gcp|aws> and aliased triagent-cloud-<alias>. It exposes thin typed tools (list_inventory, session_status), plus a gated, read-only run_cli and a list_allowed_commands discovery tool for the long tail.
  • A bypass-resistant command harness: argv-only input, direct execve (no shell), a profile-overridable allowlist, a hardcoded deny floor (subcommands, flags, arg-prefixes) the config cannot re-enable, scope validation, and output truncation.
  • A GCP provider and an AWS provider implementing the interface over gcloud / aws.
  • Launcher integration and pre-session auth visibility: profile cloud: config, per-session aliasing + pinned-identity env injection, a preflight identity probe with visible degrade, and a read-only cloud status pill in the connections panel.
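
The harness checks listed above can be sketched as a single argv validator. This is a minimal illustration, not the actual pkg/mcp/cloud code: the Policy type, its field names, and every example entry (the gcloud subcommands, the flag, the file:// prefix) are assumptions chosen for the demo. It also assumes the subcommand tokens come first in argv. The point it shows is ordering: the deny floor is evaluated before the allowlist, so a profile override cannot re-enable a floored command.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Policy is a hypothetical shape for the harness rules. Field names and the
// example entries in examplePolicy are assumptions for illustration only.
type Policy struct {
	Allowed           [][]string // subcommand prefixes loaded from config
	DeniedSubcommands [][]string // hardcoded floor, never configurable
	DeniedFlags       []string   // hardcoded floor, matches "--flag" and "--flag=value"
	DeniedArgPrefixes []string   // hardcoded floor, matched against every token
}

// hasPrefix reports whether argv starts with the given token sequence.
func hasPrefix(argv, prefix []string) bool {
	if len(prefix) == 0 || len(argv) < len(prefix) {
		return false
	}
	for i, tok := range prefix {
		if argv[i] != tok {
			return false
		}
	}
	return true
}

// Check applies the deny floor first and the allowlist second, so an
// allowlist entry can never re-enable a floored command.
func (p Policy) Check(argv []string) error {
	if len(argv) == 0 {
		return errors.New("empty argv")
	}
	for _, d := range p.DeniedSubcommands {
		if hasPrefix(argv, d) {
			return fmt.Errorf("denied subcommand: %s", strings.Join(d, " "))
		}
	}
	for _, tok := range argv {
		for _, f := range p.DeniedFlags {
			if tok == f || strings.HasPrefix(tok, f+"=") {
				return fmt.Errorf("denied flag: %s", f)
			}
		}
		for _, ap := range p.DeniedArgPrefixes {
			if strings.HasPrefix(tok, ap) {
				return fmt.Errorf("denied argument prefix: %s", ap)
			}
		}
	}
	for _, a := range p.Allowed {
		if hasPrefix(argv, a) {
			return nil
		}
	}
	return errors.New("command not in allowlist")
}

// examplePolicy includes a deliberately misconfigured allowlist entry to show
// that the floor still wins.
func examplePolicy() Policy {
	return Policy{
		Allowed: [][]string{
			{"compute", "instances", "list"},
			{"secrets", "versions", "access"}, // override tries to re-enable a floored command
		},
		DeniedSubcommands: [][]string{{"secrets", "versions", "access"}, {"compute", "ssh"}},
		DeniedFlags:       []string{"--impersonate-service-account"},
		DeniedArgPrefixes: []string{"file://"},
	}
}

func main() {
	p := examplePolicy()
	for _, argv := range [][]string{
		{"compute", "instances", "list", "--format=json"},
		{"secrets", "versions", "access", "latest"},
		{"compute", "instances", "list", "--impersonate-service-account=other@example"},
	} {
		fmt.Printf("%v -> %v\n", argv, p.Check(argv))
	}
}
```

Because the validator only ever sees a token slice, there is no string for a shell to reinterpret; quoting tricks and command substitution have nothing to act on.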

Out of scope

  • Any write, create, update, or delete operation against either cloud. Read-only by construction and by harness.
  • Clouds beyond GCP and AWS.
  • Reading secrets, downloading bucket objects, shelling into instances, or agent-chosen identity impersonation — all on the hardcoded deny floor.
  • OAuth / SSO login flows inside triagent (deferred as a future enhancement), and a static-key realization of the connection (deferred as a fallback).
  • Billing, cost, or quota reporting.

Risks & mitigations

  • The agent bypasses the command safety net. Structural defenses, not string filtering: no shell ever (argv + direct execve); a deny floor over subcommands, flags, and argument prefixes; scope validation. The deployment's read-only IAM grant on the pinned identity is an independent backstop.
  • Advertised commands drift from enforced commands. list_allowed_commands and run_cli read one config — the single source of truth.
  • The agent picks its own identity. The pinned identity and command allowlist load server-side from the profile; the agent can read them, never mutate them. Impersonation is pinned in harness-controlled env, never agent argv.
  • Raw CLI output blows the context budget. Output truncation plus typed tools for the orientation path.
  • Soft-degrade is new preflight behavior. Cloud-source-scoped and explicit; the existing k8s block-on-failure is unchanged.

Design overview

One package (pkg/mcp/cloud/), one case "cloud" in cmd/triagent-mcp/serve.go (ADR-0001), parameterized by --provider, aliased triagent-cloud-<alias> at the mcpconfig.go wiring layer (ADR-0003) — the git-MCP pattern with a cloud provider as the bound target. Deployment config loads from the runtime profile (ADR-0008).
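
A rough sketch of the entry-point validation this implies is shown below. The --kind and --provider flags and the triagent-cloud-<alias> naming come from this epic; the function names, the error messages, and the idea of validating in one helper are assumptions, and the real serve.go and mcpconfig.go may be organised differently.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// serverName builds the per-session MCP name from a profile alias.
// Hypothetical helper; the real aliasing lives in the mcpconfig.go wiring.
func serverName(alias string) string {
	return "triagent-cloud-" + alias
}

// parseServeFlags parses the two flags named in the epic and rejects any
// provider other than gcp or aws. Hypothetical helper for illustration.
func parseServeFlags(args []string) (kind, provider string, err error) {
	fs := flag.NewFlagSet("serve", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the demo output
	fs.StringVar(&kind, "kind", "", "MCP kind")
	fs.StringVar(&provider, "provider", "", "cloud provider: gcp or aws")
	if err = fs.Parse(args); err != nil {
		return "", "", err
	}
	if kind != "cloud" {
		return "", "", fmt.Errorf("unsupported kind %q", kind)
	}
	switch provider {
	case "gcp", "aws":
		return kind, provider, nil
	}
	return "", "", fmt.Errorf("unsupported provider %q", provider)
}

func main() {
	kind, provider, err := parseServeFlags([]string{"--kind=cloud", "--provider=gcp"})
	fmt.Println(kind, provider, err, serverName("prod"))
}
```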

Provider behaviour sits behind an injectable Provider interface (the teleport pattern); gcp and aws implementations live in subpackages wired by serve.go. All cloud access invokes the provider CLI through one exec core (direct exec, never a shell); there is no cloud SDK dependency, so auth and impersonation stay uniform. The command allowlist mirrors pkg/mcp/k8s's LoadAllowlist: an embedded default, profile-overridable, with a hardcoded floor the override can never re-enable (the same way Secret is always filtered there).

The pinned read-only identity is a deployment-chosen principal the agent can neither select nor authenticate. v1 realizes it via operator-ambient base auth plus harness-pinned impersonation injected through cmd.Env (CLOUDSDK_AUTH_IMPERSONATE_SERVICE_ACCOUNT for GCP; AWS_PROFILE with an assume-role profile for AWS); Workload Identity / IRSA falls out of the same env path for server deployments. A single whoami probe validates the identity chain and feeds three surfaces that therefore cannot disagree: the read-only connections pill (pre-session visibility), preflight.Run() (the gate), and the session_status tool. A failed cloud probe degrades the cloud source visibly rather than blocking the session, so Kubernetes triage proceeds.

flowchart TD
    operator[operator agent] --> typed["typed tools<br/>list_inventory · session_status"]
    operator --> disc["list_allowed_commands"]
    operator --> cli["run_cli<br/>(argv tokens only)"]
    typed --> iface{{Provider interface}}
    cli --> harness["safe harness<br/>no shell · fixed binary · allowlist<br/>+ deny floor (subcommands & flags)<br/>+ scope check + truncate"]
    cfg[("command allowlist<br/>embedded default,<br/>profile-overridable")] --> harness
    cfg --> disc
    harness --> iface
    iface --> gcp["gcp provider<br/>gcloud + defaults"]
    iface --> aws["aws provider<br/>aws + defaults"]
    id[("pinned read-only identity<br/>impersonated via harness env")] -.outer floor.-> gcp
    id -.outer floor.-> aws

Sub-issues

Linked below: scaffold + harness, GCP provider, AWS provider, launcher integration.

Labels

epic
