Merged
12 changes: 6 additions & 6 deletions README.md
@@ -26,10 +26,10 @@ It can also watch for trouble on its own. Point it at Slack channels or GitHub i

Four surfaces, each documented in depth on the [docs site](https://sourcehawk.github.io/triagent/):

- **[Investigations](https://sourcehawk.github.io/triagent/docs/investigations/)**: the live triage view. Hand the assistant a symptom and some context (cluster, Slack thread, incident.io link, notes), watch it work through the diagnosis step by step, and ship the summary as markdown.
- **[Playbooks](https://sourcehawk.github.io/triagent/docs/playbooks/)**: the step-by-step troubleshooting procedures the assistant follows, defined as YAML. Write and edit them right in the browser, with an AI assistant helping.
- **[Wiki](https://sourcehawk.github.io/triagent/docs/wiki/)**: the team's lasting knowledge base of failure patterns and prior fixes, which the assistant can search.
- **[Watches](https://sourcehawk.github.io/triagent/docs/watches/)**: rules that turn Slack messages, GitHub issues, or alerts into proposed investigations on their own.
- **[Investigations](https://sourcehawk.github.io/triagent/investigations/)**: the live triage view. Hand the assistant a symptom and some context (cluster, Slack thread, incident.io link, notes), watch it work through the diagnosis step by step, and ship the summary as markdown.
- **[Playbooks](https://sourcehawk.github.io/triagent/playbooks/)**: the step-by-step troubleshooting procedures the assistant follows, defined as YAML. Write and edit them right in the browser, with an AI assistant helping.
- **[Wiki](https://sourcehawk.github.io/triagent/wiki/)**: the team's lasting knowledge base of failure patterns and prior fixes, which the assistant can search.
- **[Watches](https://sourcehawk.github.io/triagent/watches/)**: rules that turn Slack messages, GitHub issues, or alerts into proposed investigations on their own.

<table>
<tr>
@@ -117,9 +117,9 @@ This boots a localhost HTTP server, prints its URL with a per-launch token, and

In the browser:

1. **Pick a cluster**: directly from kubeconfig, or via Teleport.
1. **Pick a cluster** from the dropdown (sourced from your kubeconfig by default; Teleport if your profile uses it).
2. **Log in** if prompted (SSO/2FA prompts go to the launcher terminal).
3. **Enter the namespace** and optional notes, Slack channel, or incident URL.
3. **Add context** (all optional): a sentence on the symptom, a Slack channel, or an incident URL. The assistant narrows down the namespace itself.
4. **Investigate**: the assistant works through the playbook, uses its tools, and writes a summary you can copy or push upstream as a PR (once you've wired an upstream repo; see below).

### A few useful commands
60 changes: 22 additions & 38 deletions docs/content/investigations.md
@@ -18,31 +18,8 @@ The result of a typical session is a tidy markdown summary the operator can past
likely root cause, evidence, recommended next steps. The activity panel keeps every tool call visible, so operators can
audit the chain or interrupt with a follow-up at any point.

## Why it exists

Cluster triage isn't a `kubectl` command; it's a cross-source scramble. A typical incident looks like this:

1. **Alert lands.** Pager, Slack `@`-mention, customer ticket. You were probably already on something else.
2. **Catch up on the channel.** What has the customer / oncall / support already said? What's been ruled out? Who
else is looking?
3. **Read the cluster state.** Pods, events, logs, the failing pod's owner CR, the Crossplane composite, the
backup status, the gateway service.
4. **Check what changed.** Recent deploys, spec bumps, controller version skews, last week's incident
write-up that mentioned the same component.
5. **Pull metrics.** Prometheus for saturation, incident.io for the ongoing-incident timeline.
6. **Recall prior art.** Have we seen this exact pattern before, and what fixed it?
7. **Synthesise.** Hold the cross-references in your head, decide which thread to pull next, write up a conclusion
someone else can act on.

Each step is mechanical for an experienced operator, but the tabs multiply and the synthesis is slow. Worse, the
patterns drift as new operators rotate in, and the artefact at the end is a Slack message that decays the moment the
channel scrolls.

This tool collapses steps 2–6 into one conversation against one audit trail. The walker knows which sources to consult
for which failure shapes; the MCP catalog turns each query into a single typed tool call; the summary in step 7 falls
out of the walker's terminal node. Operators stay in the loop: every tool call is visible in the activity panel, the
conclusion is editable before sharing, and you can step in mid-session whenever the walker hits something it doesn't
recognise.
For the broader problem this surface addresses — the cross-source scramble a typical incident turns into — see
[Overview → The problem it solves](/docs/overview#the-problem-it-solves).

## How it works

@@ -86,10 +63,14 @@ the token falls out of the address bar. The launcher stays alive in the terminal

### One investigation, end-to-end

1. **Pick a cluster.** The launcher queries the configured provider (Teleport by default) for the operator's
reachable clusters, then calls the provider's `Login` to obtain a kubeconfig context.
2. **Preflight.** Confirms the namespace exists, RBAC permits pod listing, and writes a per-session `mcp.json`
describing which triagent-mcp servers to spawn.
1. **Provide a starting point.** An investigation needs at least one input: a cluster, an incident URL, a Slack thread,
or free-form notes. Picking a cluster is optional. When one is picked, the launcher queries the configured provider
(kubeconfig by default, Teleport when the profile selects it) for the operator's reachable clusters and calls the
provider's `Login` to obtain a kubeconfig context. With no cluster up front, the agent infers one from the remaining
inputs and calls `switch_context` at runtime.
2. **Preflight.** When a cluster was picked, confirms it is reachable and RBAC permits read access. Either way it writes
a per-session `mcp.json` describing which triagent-mcp servers to spawn. The agent narrows down the namespace at
runtime via the k8s tools; it isn't fixed at preflight.
3. **Spawn the agent.** Claude is launched with that `mcp.json` plus a system prompt that points the agent at the
`investigation` playbook. The agent is told nothing product-specific in prose; the playbooks carry the procedural
knowledge.
@@ -101,19 +82,22 @@ the token falls out of the address bar. The launcher stays alive in the terminal
6. **Follow up or close.** The operator can keep chatting (clarifying questions, deeper dives); those route through the
`followup_conversation` meta-playbook so the response shape stays coherent.
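
The revised step 1 above says an investigation needs at least one starting point (a cluster, an incident URL, a Slack thread, or free-form notes), and that the agent calls `switch_context` at runtime when no cluster was picked. A minimal Python sketch of that rule follows. The class, field, and function names are hypothetical illustrations, not triagent's actual code, and treating whitespace-only values as empty is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InvestigationInputs:
    """Hypothetical container for the four starting points named in step 1."""
    cluster: Optional[str] = None
    incident_url: Optional[str] = None
    slack_thread: Optional[str] = None
    notes: Optional[str] = None


def has_starting_point(inputs: InvestigationInputs) -> bool:
    """True when at least one field is non-empty after trimming whitespace."""
    fields = (inputs.cluster, inputs.incident_url, inputs.slack_thread, inputs.notes)
    return any(f is not None and f.strip() for f in fields)


def needs_runtime_context_switch(inputs: InvestigationInputs) -> bool:
    """True when there is a starting point but no cluster was picked up front,
    i.e. the case where the agent would infer a cluster and call switch_context."""
    cluster_picked = bool(inputs.cluster and inputs.cluster.strip())
    return has_starting_point(inputs) and not cluster_picked
```

Under these assumptions, a submission with only an incident URL is accepted and flagged for a runtime context switch, while an entirely empty form is rejected before preflight.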

### What lives where
### Separation of concerns

Each part of the system owns exactly one job, so any one can change without touching the others. The launcher itself
contains no decision logic — it wires processes together and streams the result to the browser.

| Concern | Owner |
| ---------------------- | -------------------------------------------------------------- |
| Cluster picker / login | Provider plugin (Teleport by default) |
| Cluster picker / login | Auth provider (kubeconfig by default, Teleport optional) |
| Tool execution | triagent-mcp servers (k8s, strategies, git, wiki, ...) |
| Decision logic | YAML playbooks (the strategies MCP walks them) |
| Reasoning | Claude CLI (the agent invoking tools) |
| UI | Next.js SPA (this app), embedded in the launcher binary |
| Authentication | Per-launch random token + cookie |

The launcher itself contains zero decision logic. Playbooks own the procedure, triagent-mcp owns tool semantics,
claude owns judgment. Each piece is editable independently.
Playbooks own the procedure, triagent-mcp owns tool semantics, Claude owns judgment. Each piece is editable
independently.

## Using the tool

@@ -126,8 +110,8 @@ claude owns judgment. Each piece is editable independently.
3. Click **+ new investigation** in the sidebar (or navigate to `/investigations/new`) to start a fresh one. Pick a
cluster from the dropdown. If the provider isn't logged in, you'll be prompted to authenticate (SSO/2FA prompts
surface in the terminal where you ran `triagent start`, not the browser).
4. Fill in the form:
- **cluster ID** (required when using the cluster_id profile input). The data namespace is derived per your profile.
4. Fill in the form. The fields below are individually optional, but the investigation needs at least one starting
point — the cluster you picked above, or one of these:
- **incident URL** (optional). Pasted verbatim into the agent's prompt as context, useful for incident.io links
so the agent can pull the corresponding incident metadata if the incident.io MCP is connected.
- **Slack channel** (optional). When Slack is connected, the field becomes a channel picker (search by name); the
@@ -232,9 +216,9 @@ have it and is trained to yield to you in those cases.

### Enabling

- **Start screen:** tick **Run in auto mode** before submitting.
- **Mid-session:** press **Enable auto mode** on the session header
(coming soon; for now, restart with auto mode on).
Tick **Run in auto mode** on the start screen before submitting. A watch can also start a session in auto mode
directly (see [Watches](/docs/watches#two-toggles-auto-ingest-and-auto-start)). To hand an already-running manual
session to the operator agent, restart it with auto mode on.

### Take over

30 changes: 17 additions & 13 deletions docs/content/overview.md
@@ -3,9 +3,9 @@
Agentic Incident Investigation, driven from your browser.

Triagent is a localhost web app that pairs the Claude reasoning agent with read-only Kubernetes access, an extensible
MCP catalog (Prometheus, Slack, GitHub, incident.io, your own), a guided playbook walker, and a persistent wiki, all
bound to a single cluster's namespace per session. You run `triagent start`, it opens a browser, you hand it the
symptom, and it drives a focused diagnosis you can paste into a ticket when it's done.
MCP catalog (Prometheus, Slack, GitHub, incident.io, read-only GCP/AWS context, your own), a guided playbook walker,
and a persistent wiki, all scoped to a single cluster per session. You run `triagent start`, it opens a browser, you
hand it the symptom, and it drives a focused diagnosis you can paste into a ticket when it's done.

## The problem it solves

@@ -60,7 +60,8 @@ New failure shape on Tuesday → playbook PR on Wednesday → every operator has

## What's in the box

Four surfaces, each with a dedicated section in these docs.
Four operator-facing surfaces, each with a dedicated section in these docs: **Investigations**, **Watches**,
**Playbooks**, and **Wiki**. Underneath them sits the **MCP tool catalog** every surface is built on.

### [Investigations](/docs/investigations)

@@ -88,14 +89,6 @@ full investigation, so the launcher reaches you before the pager does. Each
signal carries a back-reference to the watch and items that produced it;
manual start is a click for the ones the agent flagged as `unclear`.

### [MCP servers](/docs/mcp)

A tool catalog the agent reads like a map, and the same map an operator reads when authoring a playbook. Exposed as
curated tools rather than a raw shell, so the agent never gets to run arbitrary commands. The catalog grows as we wire
in new sources (Kubernetes, Prometheus, the playbook walker, linked git repos, the wiki, Slack, incident.io, …); rather
than enumerate it here, browse the live list at [**/mcp**](/mcp). The catalog reflects exactly what the launcher
loaded for this build.

### [Playbooks](/docs/playbooks)

Procedural knowledge as data. Each playbook is a YAML graph that encodes one failure shape's triage path: read step
@@ -111,7 +104,15 @@ real git repo, indexed for the agent to consult during triage. Link density comp
canonical entity names, the better the agent's "have we seen this before?" recall gets. Procedure belongs in playbooks;
facts belong in the wiki.

## Alpha Release
### [The MCP tool catalog](/docs/mcp)

The layer beneath all four surfaces. A tool catalog the agent reads like a map, and the same map an operator reads when
authoring a playbook. Exposed as curated tools rather than a raw shell, so the agent never gets to run arbitrary
commands. The catalog grows as we wire in new sources (Kubernetes, Prometheus, the playbook walker, linked git repos,
the wiki, Slack, incident.io, …); rather than enumerate it here, browse the live list at [**/mcp**](/mcp). The catalog
reflects exactly what the launcher loaded for this build.

## Alpha release

This is alpha. Expect rough edges, breaking config changes between versions, and the occasional walker dead-end. Some
things are stable enough to plan around:
@@ -135,6 +136,9 @@ on file. Each integration has its own page:

- **[Slack and incident.io](/docs/connections)** — credentials stored in `~/.config/triagent/credentials.json` (mode
0600), validated against the upstream before saving.
- **[Cloud providers](/docs/cloud-providers)** — read-only GCP or AWS context (reachability, IAM, GKE/EKS config,
logs, audit) so a Kubernetes thread can follow down into the cloud layer. Pinned to a read-only identity in the
profile, never entered in the UI.
- **[GitHub repositories](/docs/repos)** — linked over SSH for clone, `gh` CLI for the *Push as PR* flows. Defaults
ship via the profile's `linked_repos`; personal repos persist per-machine.
- **[Profiles](/docs/profiles)** — the deployment-specific config bundle that wires upstream repos for playbooks /
2 changes: 1 addition & 1 deletion docs/content/profiles.md
@@ -304,7 +304,7 @@ investigation_inputs:
| `text` | single-line input | `{{.value}}` |
| `url` | single-line input, light URL validation | `{{.value}}` |
| `textarea` | multi-line textarea | `{{.value}}` |
| `cluster_id` | cluster picker bound to detected kube contexts | `{{.value}}` |
| `cluster_id` | cluster picker bound to the provider's clusters | `{{.value}}` |
| `slack_channel` | channel picker (filtered by `slack.channel_prefix`) | `{{.id}}`, `{{.name}}`, `{{.url}}` |
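
The table above lists the template fields each input type exposes (`{{.value}}`, or `{{.id}}`, `{{.name}}`, `{{.url}}` for `slack_channel`), and the paragraph that follows says required inputs must be non-empty at preflight. The Python sketch below illustrates both ideas. It is only an approximation: the `{{.field}}` syntax looks like Go template syntax, and this regex substitution is a stand-in for illustration, not the renderer triagent uses. The spec dictionary keys `name` and `optional`, and both function names, are hypothetical.

```python
import re

# Matches Go-template-style field references such as {{.value}} or {{.name}}.
_FIELD = re.compile(r"\{\{\s*\.(\w+)\s*\}\}")


def render(template: str, fields: dict) -> str:
    """Substitute each {{.field}} with its value; an unknown field raises KeyError."""
    return _FIELD.sub(lambda m: fields[m.group(1)], template)


def missing_required(specs: list, values: dict) -> list:
    """Return the names of inputs marked optional: false whose value is empty.

    `specs` is a hypothetical list of {"name": ..., "optional": ...} dicts;
    `values` maps input names to the strings the operator entered.
    """
    return [
        spec["name"]
        for spec in specs
        if not spec["optional"] and not values.get(spec["name"], "").strip()
    ]
```

In this sketch a preflight check would refuse to start whenever `missing_required` returns a non-empty list, which mirrors the rule stated for `optional: false` inputs.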

Required (`optional: false`) inputs must be non-empty at preflight or the investigation refuses to start. For