From 3c085b1721e883d9ee6688f02c07ede1286bf5bc Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=86gir=20M=C3=A1ni=20Hauksson?= <54936225+sourcehawk@users.noreply.github.com>
Date: Wed, 3 Jun 2026 13:42:56 +0200
Subject: [PATCH 1/3] docs: correct stale claims and tighten the docs site

Remove things that no longer hold and fix consistency across the website
content (docs/content/*, canonical source for the GitHub Pages site and
the in-app docs mirror) plus the README entry point.

- Teleport is not the default. The launcher's cluster picker uses the
  kubeconfig auth provider by default; Teleport is opt-in via the
  profile. Corrected in the end-to-end walk and the ownership table.
- Drop the "cluster ID (required)" form field and the "namespace is
  derived per your profile" claim. The cluster is the dropdown, every
  form input is optional, and the agent narrows the namespace at
  runtime; it isn't bound at preflight.
- Remove the "Why it exists" 7-step list from investigations.md; it
  duplicated overview's "The problem it solves". Point there instead.
- Rename "What lives where" to "Separation of concerns" with a lead
  sentence so the ownership table's purpose is clear.
- Drop the "Enable auto mode (coming soon)" placeholder; document the
  real paths (start screen, watch-driven, or restart).
- overview: "Four surfaces" now lists four, with the MCP catalog
  reframed as the tool layer beneath them; add the Cloud providers
  integration; fix the "single cluster's namespace" binding claim;
  sentence-case "Alpha release"; note read-only GCP/AWS context.
- README: fix the broken docs-site links (the /docs/ / path segment
  404s; the site serves / /) and the stale "Enter the namespace" step.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 README.md                      | 12 +++----
 docs/content/investigations.md | 57 ++++++++++++----------------------
 docs/content/overview.md       | 30 ++++++++++--------
 docs/content/profiles.md       |  2 +-
 4 files changed, 43 insertions(+), 58 deletions(-)

diff --git a/README.md b/README.md
index 289a9bee..cf0ce66c 100644
--- a/README.md
+++ b/README.md
@@ -26,10 +26,10 @@ It can also watch for trouble on its own. Point it at Slack channels or GitHub i

 Four surfaces, each documented in depth on the [docs site](https://sourcehawk.github.io/triagent/):

-- **[Investigations](https://sourcehawk.github.io/triagent/docs/investigations/)**: the live triage view. Hand the assistant a symptom and some context (cluster, Slack thread, incident.io link, notes), watch it work through the diagnosis step by step, and ship the summary as markdown.
-- **[Playbooks](https://sourcehawk.github.io/triagent/docs/playbooks/)**: the step-by-step troubleshooting procedures the assistant follows, defined as YAML. Write and edit them right in the browser, with an AI assistant helping.
-- **[Wiki](https://sourcehawk.github.io/triagent/docs/wiki/)**: the team's lasting knowledge base of failure patterns and prior fixes, which the assistant can search.
-- **[Watches](https://sourcehawk.github.io/triagent/docs/watches/)**: rules that turn Slack messages, GitHub issues, or alerts into proposed investigations on their own.
+- **[Investigations](https://sourcehawk.github.io/triagent/investigations/)**: the live triage view. Hand the assistant a symptom and some context (cluster, Slack thread, incident.io link, notes), watch it work through the diagnosis step by step, and ship the summary as markdown.
+- **[Playbooks](https://sourcehawk.github.io/triagent/playbooks/)**: the step-by-step troubleshooting procedures the assistant follows, defined as YAML. Write and edit them right in the browser, with an AI assistant helping.
+- **[Wiki](https://sourcehawk.github.io/triagent/wiki/)**: the team's lasting knowledge base of failure patterns and prior fixes, which the assistant can search.
+- **[Watches](https://sourcehawk.github.io/triagent/watches/)**: rules that turn Slack messages, GitHub issues, or alerts into proposed investigations on their own.
@@ -117,9 +117,9 @@ This boots a localhost HTTP server, prints its URL with a per-launch token, and

 In the browser:

-1. **Pick a cluster**: directly from kubeconfig, or via Teleport.
+1. **Pick a cluster** from the dropdown (sourced from your kubeconfig by default; Teleport if your profile uses it).
 2. **Log in** if prompted (SSO/2FA prompts go to the launcher terminal).
-3. **Enter the namespace** and optional notes, Slack channel, or incident URL.
+3. **Add context** (all optional): a sentence on the symptom, a Slack channel, or an incident URL. The assistant narrows down the namespace itself.
 4. **Investigate**: the assistant works through the playbook, uses its tools, and writes a summary you can copy or push upstream as a PR (once you've wired an upstream repo; see below).

 ### A few useful commands
diff --git a/docs/content/investigations.md b/docs/content/investigations.md
index 952eac3f..120ccaf4 100644
--- a/docs/content/investigations.md
+++ b/docs/content/investigations.md
@@ -18,31 +18,8 @@ The result of a typical session is a tidy markdown summary the operator can past
 likely root cause, evidence, recommended next steps. The activity panel keeps every tool call visible, so operators can
 audit the chain or interrupt with a follow-up at any point.

-## Why it exists
-
-Cluster triage isn't a `kubectl` command; it's a cross-source scramble. A typical incident looks like this:
-
-1. **Alert lands.** Pager, Slack `@`-mention, customer ticket. You were probably already on something else.
-2. **Catch up on the channel.** What has the customer / oncall / support already said? What's been ruled out? Who
-   else is looking?
-3. **Read the cluster state.** Pods, events, logs, the failing pod's owner CR, the Crossplane composite, the
-   backup status, the gateway service.
-4. **Check what changed.** Recent deploys, spec bumps, controller version skews, last week's incident
-   write-up that mentioned the same component.
-5. **Pull metrics.** Prometheus for saturation, incident.io for the ongoing-incident timeline.
-6. **Recall prior art.** Have we seen this exact pattern before, and what fixed it?
-7. **Synthesise.** Hold the cross-references in your head, decide which thread to pull next, write up a conclusion
-   someone else can act on.
-
-Each step is mechanical for an experienced operator, but the tabs multiply and the synthesis is slow. Worse, the
-patterns drift as new operators rotate in, and the artefact at the end is a Slack message that decays the moment the
-channel scrolls.
-
-This tool collapses steps 2–6 into one conversation against one audit trail. The walker knows which sources to consult
-for which failure shapes; the MCP catalog turns each query into a single typed tool call; the summary in step 7 falls
-out of the walker's terminal node. Operators stay in the loop: every tool call is visible in the activity panel, the
-conclusion is editable before sharing, and you can step in mid-session whenever the walker hits something it doesn't
-recognise.
+For the broader problem this surface addresses — the cross-source scramble a typical incident turns into — see
+[Overview → The problem it solves](/docs/overview#the-problem-it-solves).

 ## How it works

@@ -86,10 +63,11 @@ the token falls out of the address bar. The launcher stays alive in the terminal

 ### One investigation, end-to-end

-1. **Pick a cluster.** The launcher queries the configured provider (Teleport by default) for the operator's
-   reachable clusters, then calls the provider's `Login` to obtain a kubeconfig context.
-2. **Preflight.** Confirms the namespace exists, RBAC permits pod listing, and writes a per-session `mcp.json`
-   describing which triagent-mcp servers to spawn.
+1. **Pick a cluster.** The launcher queries the configured provider (kubeconfig by default, Teleport when the profile
+   selects it) for the operator's reachable clusters, then calls the provider's `Login` to obtain a kubeconfig context.
+2. **Preflight.** Confirms the cluster is reachable and RBAC permits read access, then writes a per-session `mcp.json`
+   describing which triagent-mcp servers to spawn. The agent narrows down the namespace at runtime via the k8s tools;
+   it isn't fixed at preflight.
 3. **Spawn the agent.** Claude is launched with that `mcp.json` plus a system prompt that points the agent at the
    `investigation` playbook. The agent is told nothing product-specific in prose; the playbooks carry the procedural
    knowledge.
@@ -101,19 +79,22 @@ the token falls out of the address bar. The launcher stays alive in the terminal
 6. **Follow up or close.** The operator can keep chatting (clarifying questions, deeper dives); those route through
    the `followup_conversation` meta-playbook so the response shape stays coherent.

-### What lives where
+### Separation of concerns
+
+Each part of the system owns exactly one job, so any one can change without touching the others. The launcher itself
+contains no decision logic — it wires processes together and streams the result to the browser.

 | Concern                | Owner                                                          |
 | ---------------------- | -------------------------------------------------------------- |
-| Cluster picker / login | Provider plugin (Teleport by default)                          |
+| Cluster picker / login | Auth provider (kubeconfig by default, Teleport optional)       |
 | Tool execution         | triagent-mcp servers (k8s, strategies, git, wiki, ...)         |
 | Decision logic         | YAML playbooks (the strategies MCP walks them)                 |
 | Reasoning              | Claude CLI (the agent invoking tools)                          |
 | UI                     | Next.js SPA (this app), embedded in the launcher binary        |
 | Authentication         | Per-launch random token + cookie                               |

-The launcher itself contains zero decision logic. Playbooks own the procedure, triagent-mcp owns tool semantics,
-claude owns judgment. Each piece is editable independently.
+Playbooks own the procedure, triagent-mcp owns tool semantics, claude owns judgment. Each piece is editable
+independently.

 ## Using the tool

@@ -126,8 +107,8 @@ claude owns judgment. Each piece is editable independently.
 3. Click **+ new investigation** in the sidebar (or navigate to `/investigations/new`) to start a fresh one. Pick a
    cluster from the dropdown. If the provider isn't logged in, you'll be prompted to authenticate (SSO/2FA prompts
    surface in the terminal where you ran `triagent start`, not the browser).
-4. Fill in the form:
-   - **cluster ID** (required when using the cluster_id profile input). The data namespace is derived per your profile.
+4. Fill in the form. The fields below are individually optional, but the investigation needs at least one starting
+   point — the cluster you picked above, or one of these:
    - **incident URL** (optional). Pasted verbatim into the agent's prompt as context, useful for incident.io links so
      the agent can pull the corresponding incident metadata if the incident.io MCP is connected.
    - **Slack channel** (optional). When Slack is connected, the field becomes a channel picker (search by name); the
@@ -232,9 +213,9 @@ have it and is trained to yield to you in those cases.

 ### Enabling

-- **Start screen:** tick **Run in auto mode** before submitting.
-- **Mid-session:** press **Enable auto mode** on the session header
-  (coming soon; for now, restart with auto mode on).
+Tick **Run in auto mode** on the start screen before submitting. A watch can also start a session in auto mode
+directly (see [Watches](/docs/watches#two-toggles-auto-ingest-and-auto-start)). To hand an already-running manual
+session to the operator agent, restart it with auto mode on.

 ### Take over

diff --git a/docs/content/overview.md b/docs/content/overview.md
index 307752fc..1a645896 100644
--- a/docs/content/overview.md
+++ b/docs/content/overview.md
@@ -3,9 +3,9 @@
 Agentic Incident Investigation, driven from your browser.

 Triagent is a localhost web app that pairs the Claude reasoning agent with read-only Kubernetes access, an extensible
-MCP catalog (Prometheus, Slack, GitHub, incident.io, your own), a guided playbook walker, and a persistent wiki, all
-bound to a single cluster's namespace per session. You run `triagent start`, it opens a browser, you hand it the
-symptom, and it drives a focused diagnosis you can paste into a ticket when it's done.
+MCP catalog (Prometheus, Slack, GitHub, incident.io, read-only GCP/AWS context, your own), a guided playbook walker,
+and a persistent wiki, all scoped to a single cluster per session. You run `triagent start`, it opens a browser, you
+hand it the symptom, and it drives a focused diagnosis you can paste into a ticket when it's done.

 ## The problem it solves

@@ -60,7 +60,8 @@ New failure shape on Tuesday → playbook PR on Wednesday → every operator has

 ## What's in the box

-Four surfaces, each with a dedicated section in these docs.
+Four operator-facing surfaces, each with a dedicated section in these docs: **Investigations**, **Watches**,
+**Playbooks**, and **Wiki**. Underneath them sits the **MCP tool catalog** every surface is built on.

 ### [Investigations](/docs/investigations)

@@ -88,14 +89,6 @@ full investigation, so the launcher reaches you before the pager does.
 Each signal carries a back-reference to the watch and items that produced it; manual start is a click for the ones the
 agent flagged as `unclear`.

-### [MCP servers](/docs/mcp)
-
-A tool catalog the agent reads like a map, and the same map an operator reads when authoring a playbook. Exposed as
-curated tools rather than a raw shell, so the agent never gets to run arbitrary commands. The catalog grows as we wire
-in new sources (Kubernetes, Prometheus, the playbook walker, linked git repos, the wiki, Slack, incident.io, …); rather
-than enumerate it here, browse the live list at [**/mcp**](/mcp). The catalog reflects exactly what the launcher
-loaded for this build.
-
 ### [Playbooks](/docs/playbooks)

 Procedural knowledge as data. Each playbook is a YAML graph that encodes one failure shape's triage path: read step
@@ -111,7 +104,15 @@ real git repo, indexed for the agent to consult during triage. Link density comp
 canonical entity names, the better the agent's "have we seen this before?" recall gets. Procedure belongs in playbooks;
 facts belong in the wiki.

-## Alpha Release
+### [The MCP tool catalog](/docs/mcp)
+
+The layer beneath all four surfaces. A tool catalog the agent reads like a map, and the same map an operator reads when
+authoring a playbook. Exposed as curated tools rather than a raw shell, so the agent never gets to run arbitrary
+commands. The catalog grows as we wire in new sources (Kubernetes, Prometheus, the playbook walker, linked git repos,
+the wiki, Slack, incident.io, …); rather than enumerate it here, browse the live list at [**/mcp**](/mcp). The catalog
+reflects exactly what the launcher loaded for this build.
+
+## Alpha release

 This is alpha. Expect rough edges, breaking config changes between versions, and the occasional walker dead-end. Some
 things are stable enough to plan around:
@@ -135,6 +136,9 @@ on file. Each integration has its own page:

 - **[Slack and incident.io](/docs/connections)** — credentials stored in `~/.config/triagent/credentials.json` (mode
   0600), validated against the upstream before saving.
+- **[Cloud providers](/docs/cloud-providers)** — read-only GCP or AWS context (reachability, IAM, GKE/EKS config,
+  logs, audit) so a Kubernetes thread can follow down into the cloud layer. Pinned to a read-only identity in the
+  profile, never entered in the UI.
 - **[GitHub repositories](/docs/repos)** — linked over SSH for clone, `gh` CLI for the *Push as PR* flows. Defaults
   ship via the profile's `linked_repos`; personal repos persist per-machine.
 - **[Profiles](/docs/profiles)** — the deployment-specific config bundle that wires upstream repos for playbooks /
diff --git a/docs/content/profiles.md b/docs/content/profiles.md
index df415788..1f2d1612 100644
--- a/docs/content/profiles.md
+++ b/docs/content/profiles.md
@@ -304,7 +304,7 @@ investigation_inputs:
 | `text` | single-line input | `{{.value}}` |
 | `url` | single-line input, light URL validation | `{{.value}}` |
 | `textarea` | multi-line textarea | `{{.value}}` |
-| `cluster_id` | cluster picker bound to detected kube contexts | `{{.value}}` |
+| `cluster_id` | cluster picker bound to the provider's clusters | `{{.value}}` |
 | `slack_channel` | channel picker (filtered by `slack.channel_prefix`) | `{{.id}}`, `{{.name}}`, `{{.url}}` |

 Required (`optional: false`) inputs must be non-empty at preflight or the investigation refuses to start. For

From af189033ff18ba110265ee45a6b41f45a107a1fc Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=86gir=20M=C3=A1ni=20Hauksson?= <54936225+sourcehawk@users.noreply.github.com>
Date: Wed, 3 Jun 2026 14:20:47 +0200
Subject: [PATCH 2/3] docs: clarify cluster selection is optional at investigation start

Step 1 of the end-to-end walkthrough framed "Pick a cluster" as the
mandatory entry point, but ValidateInputValues only enforces inputs the
profile marks non-optional, and the canonical profile marks all four
(cluster, incident URL, Slack channel, notes) optional.
Lead with the real rule (at least one starting point) and note the agent
infers a cluster via switch_context when none is picked up front. Gate
preflight's reachability/RBAC check on a cluster actually being
selected.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 docs/content/investigations.md | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/docs/content/investigations.md b/docs/content/investigations.md
index 120ccaf4..067c5819 100644
--- a/docs/content/investigations.md
+++ b/docs/content/investigations.md
@@ -63,11 +63,14 @@ the token falls out of the address bar. The launcher stays alive in the terminal

 ### One investigation, end-to-end

-1. **Pick a cluster.** The launcher queries the configured provider (kubeconfig by default, Teleport when the profile
-   selects it) for the operator's reachable clusters, then calls the provider's `Login` to obtain a kubeconfig context.
-2. **Preflight.** Confirms the cluster is reachable and RBAC permits read access, then writes a per-session `mcp.json`
-   describing which triagent-mcp servers to spawn. The agent narrows down the namespace at runtime via the k8s tools;
-   it isn't fixed at preflight.
+1. **Provide a starting point.** An investigation needs at least one input: a cluster, an incident URL, a Slack thread,
+   or free-form notes. Picking a cluster is optional. When one is picked, the launcher queries the configured provider
+   (kubeconfig by default, Teleport when the profile selects it) for the operator's reachable clusters and calls the
+   provider's `Login` to obtain a kubeconfig context. With no cluster up front, the agent infers one from the remaining
+   inputs and calls `switch_context` at runtime.
+2. **Preflight.** When a cluster was picked, confirms it is reachable and RBAC permits read access. Either way it writes
+   a per-session `mcp.json` describing which triagent-mcp servers to spawn. The agent narrows down the namespace at
+   runtime via the k8s tools; it isn't fixed at preflight.
 3. **Spawn the agent.** Claude is launched with that `mcp.json` plus a system prompt that points the agent at the
    `investigation` playbook. The agent is told nothing product-specific in prose; the playbooks carry the procedural
    knowledge.

From d690fd0961e7f59e1f38335eb532227244b70dfb Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=86gir=20M=C3=A1ni?= <54936225+sourcehawk@users.noreply.github.com>
Date: Wed, 3 Jun 2026 14:30:49 +0200
Subject: [PATCH 3/3] Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
---
 docs/content/investigations.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/content/investigations.md b/docs/content/investigations.md
index 067c5819..3861fb54 100644
--- a/docs/content/investigations.md
+++ b/docs/content/investigations.md
@@ -96,7 +96,7 @@ contains no decision logic — it wires processes together and streams the resul
 | UI                     | Next.js SPA (this app), embedded in the launcher binary        |
 | Authentication         | Per-launch random token + cookie                               |

-Playbooks own the procedure, triagent-mcp owns tool semantics, claude owns judgment. Each piece is editable
+Playbooks own the procedure, triagent-mcp owns tool semantics, Claude owns judgment. Each piece is editable
 independently.

 ## Using the tool