trace-guard

Real-time failure detection, root-cause diagnosis, and mitigation suggestions for LLM multi-agent systems.

What it does

TraceGuard is a sidecar service that ingests OpenTelemetry-compatible agent traces, encodes each reasoning step with contrastive embeddings inspired by the EAGER paper (arXiv:2603.21522), and targets failure detection in under 50 ms per step. It exposes a REST API that agent orchestrators (LangChain, Dify, CrewAI) can query for real-time alerts, root-cause reports, and reflexive mitigation plans, so a system can self-heal without making an LLM call for every trace.

Install

go install github.com/timholm/trace-guard@latest

Or build from source:

git clone https://github.com/timholm/trace-guard.git
cd trace-guard
make build

Usage

Ingest a native span and check for anomalies:

curl -X POST http://localhost:8080/v1/ingest/spans \
  -H 'Content-Type: application/json' \
  -d '[{
    "trace_id": "abc123",
    "span_id": "span-1",
    "name": "agent_step",
    "kind": "agent_step",
    "start_time": "2026-03-24T10:00:00Z",
    "end_time": "2026-03-24T10:00:01Z",
    "status": "ok",
    "input": "What is the capital of France?",
    "output": "The capital of France is Berlin."
  }]'

Response:

{
  "ingested": 1,
  "results": [{
    "step": {"trace_id": "abc123", "span_id": "span-1", ...},
    "is_anomaly": true,
    "anomaly_score": 0.72,
    "failure_type": "hallucination",
    "confidence": 0.85,
    "latency_ms": 1.2
  }]
}
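
The same call can be made from Go. The following is a minimal sketch, not the project's client library: the `Span`, `DetectionResult`, and `IngestResponse` types only mirror the JSON fields shown in this README, `ingestSpans` and `newStubServer` are hypothetical helper names, and accepting any 2xx status is an assumption. A stub server stands in for TraceGuard so the example runs offline.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Span mirrors a subset of the native span fields documented in this README.
type Span struct {
	TraceID   string `json:"trace_id"`
	SpanID    string `json:"span_id"`
	Name      string `json:"name"`
	Kind      string `json:"kind"`
	StartTime string `json:"start_time"` // RFC 3339
	EndTime   string `json:"end_time"`   // RFC 3339
	Status    string `json:"status"`
	Input     string `json:"input,omitempty"`
	Output    string `json:"output,omitempty"`
}

// DetectionResult mirrors a subset of the documented response fields.
type DetectionResult struct {
	IsAnomaly    bool    `json:"is_anomaly"`
	AnomalyScore float64 `json:"anomaly_score"`
	FailureType  string  `json:"failure_type"`
	Confidence   float64 `json:"confidence"`
	LatencyMS    float64 `json:"latency_ms"`
}

// IngestResponse is the envelope returned by the ingest endpoints.
type IngestResponse struct {
	Ingested int               `json:"ingested"`
	Results  []DetectionResult `json:"results"`
}

// ingestSpans POSTs spans to /v1/ingest/spans and decodes the response.
func ingestSpans(c *http.Client, baseURL string, spans []Span) (*IngestResponse, error) {
	body, err := json.Marshal(spans)
	if err != nil {
		return nil, err
	}
	resp, err := c.Post(baseURL+"/v1/ingest/spans", "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 { // assumption: any 2xx status means success
		return nil, fmt.Errorf("ingest failed: %s", resp.Status)
	}
	var out IngestResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return &out, nil
}

// newStubServer stands in for TraceGuard and replays the sample result above.
func newStubServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var spans []Span
		if err := json.NewDecoder(r.Body).Decode(&spans); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(IngestResponse{
			Ingested: len(spans),
			Results: []DetectionResult{{IsAnomaly: true, AnomalyScore: 0.72,
				FailureType: "hallucination", Confidence: 0.85, LatencyMS: 1.2}},
		})
	}))
}

func main() {
	srv := newStubServer()
	defer srv.Close()

	res, err := ingestSpans(srv.Client(), srv.URL, []Span{{
		TraceID: "abc123", SpanID: "span-1", Name: "agent_step", Kind: "agent_step",
		StartTime: "2026-03-24T10:00:00Z", EndTime: "2026-03-24T10:00:01Z", Status: "ok",
		Input: "What is the capital of France?", Output: "The capital of France is Berlin.",
	}})
	if err != nil {
		fmt.Println("ingest error:", err)
		return
	}
	for _, r := range res.Results {
		if r.IsAnomaly {
			fmt.Printf("anomaly: %s (score %.2f)\n", r.FailureType, r.AnomalyScore)
		}
	}
}
```

To talk to a running instance, pass `http.DefaultClient` and `http://localhost:8080` instead of the stub's client and URL.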

Retrieve recent alerts, filtered by failure type:

curl 'http://localhost:8080/v1/alerts?failure_type=reasoning_loop'

Response:

{
  "alerts": [{
    "trace_id": "abc123",
    "span_id": "span-5",
    "failure_type": "reasoning_loop",
    "score": 0.81,
    "confidence": 0.91,
    "detected_at": "2026-03-24T10:01:00Z",
    "latency_ms": 3.4
  }],
  "count": 1
}

Get a mitigation plan for a detected failure:

curl -X POST http://localhost:8080/v1/mitigate \
  -H 'Content-Type: application/json' \
  -d '{"failure_type": "hallucination", "trace_id": "abc123", "score": 0.72}'

Response (abridged: two of the four suggestions are shown):

{
  "failure_type": "hallucination",
  "trace_id": "abc123",
  "summary": "4 mitigation action(s) suggested for hallucination failure.",
  "suggestions": [
    {"action": "enable_grounding", "description": "Enable retrieval-augmented generation (RAG) to ground LLM output in verified sources.", "priority": 1, "automated": true},
    {"action": "reduce_temperature", "description": "Reduce LLM sampling temperature to decrease output randomness.", "priority": 3, "automated": true}
  ]
}

API

`POST /v1/ingest/otel`

Ingest an OTLP/JSON trace export.

Request body: OTLP JSON (ResourceSpans array)

Response:

{"ingested": <int>, "results": [<DetectionResult>, ...]}
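
For orientation, here is a sketch of building a minimal OTLP/JSON payload in Go. The field names (`resourceSpans`, `scopeSpans`, `traceId`, `startTimeUnixNano`, and so on) follow the standard OTLP JSON encoding, where IDs are hex strings and timestamps are nanoseconds encoded as strings. Two things are assumptions: that the endpoint accepts the standard export envelope (an object with a `resourceSpans` key), and the attribute keys `input` and `output`, which are hypothetical since this README does not say which attributes TraceGuard reads. `buildExport` is an illustrative helper name.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// AnyValue carries a string attribute value in OTLP/JSON.
type AnyValue struct {
	StringValue string `json:"stringValue"`
}

// KeyValue is one OTLP attribute.
type KeyValue struct {
	Key   string   `json:"key"`
	Value AnyValue `json:"value"`
}

// OTLPSpan holds the minimal span fields used in this sketch.
type OTLPSpan struct {
	TraceID           string     `json:"traceId"` // 32 hex characters
	SpanID            string     `json:"spanId"`  // 16 hex characters
	Name              string     `json:"name"`
	StartTimeUnixNano string     `json:"startTimeUnixNano"`
	EndTimeUnixNano   string     `json:"endTimeUnixNano"`
	Attributes        []KeyValue `json:"attributes,omitempty"`
}

type ScopeSpans struct {
	Spans []OTLPSpan `json:"spans"`
}

type ResourceSpans struct {
	ScopeSpans []ScopeSpans `json:"scopeSpans"`
}

// Export is the standard OTLP/JSON trace export envelope.
type Export struct {
	ResourceSpans []ResourceSpans `json:"resourceSpans"`
}

// buildExport wraps a single span in the envelope. Timestamps are given in
// nanoseconds since the Unix epoch and serialised as decimal strings.
func buildExport(traceID, spanID, name string, startNano, endNano int64, attrs []KeyValue) Export {
	span := OTLPSpan{
		TraceID:           traceID,
		SpanID:            spanID,
		Name:              name,
		StartTimeUnixNano: strconv.FormatInt(startNano, 10),
		EndTimeUnixNano:   strconv.FormatInt(endNano, 10),
		Attributes:        attrs,
	}
	return Export{ResourceSpans: []ResourceSpans{{ScopeSpans: []ScopeSpans{{Spans: []OTLPSpan{span}}}}}}
}

func main() {
	exp := buildExport(
		"5b8efff798038103d269b633813fc60c", "eee19b7ec3c1b174", "agent_step",
		1774346400000000000, 1774346401000000000, // 2026-03-24T10:00:00Z and one second later
		[]KeyValue{
			{Key: "input", Value: AnyValue{StringValue: "What is the capital of France?"}},    // hypothetical key
			{Key: "output", Value: AnyValue{StringValue: "The capital of France is Berlin."}}, // hypothetical key
		},
	)
	out, err := json.MarshalIndent(exp, "", "  ")
	if err != nil {
		fmt.Println("marshal error:", err)
		return
	}
	fmt.Println(string(out)) // body for POST /v1/ingest/otel
}
```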

`POST /v1/ingest/spans`

Ingest an array of native Span objects.

Request body:

[{
  "trace_id": "string",
  "span_id": "string",
  "parent_id": "string",        // optional
  "name": "string",
  "kind": "llm_call|tool_call|agent_step|orchestrator|unknown",
  "start_time": "RFC3339",
  "end_time": "RFC3339",
  "status": "ok|error|timeout",
  "attributes": {"key": "value"},  // optional
  "events": [...],                 // optional
  "input": "string",               // optional
  "output": "string",              // optional
  "error_message": "string"        // optional
}]

Response:

{"ingested": <int>, "results": [<DetectionResult>, ...]}

DetectionResult fields:

step — the Step derived from the span
is_anomaly — whether the step was flagged
anomaly_score — cosine-distance-based score (0–1)
matched_pattern — pattern ID that matched, if any
failure_type — one of the failure type constants below
confidence — detector confidence (0–1)
latency_ms — detection latency in milliseconds
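
To make "cosine-distance-based score" concrete, the sketch below computes a cosine distance clamped to the 0 to 1 range and finds the nearest pattern for a step embedding. It illustrates the general technique only: the `Pattern` fields, the function names, and the clamping are assumptions, and this README does not specify how TraceGuard turns the distance into `anomaly_score` or how the threshold is applied.

```go
package main

import (
	"fmt"
	"math"
)

// Pattern is a hypothetical stand-in for a library entry: an ID plus an embedding.
type Pattern struct {
	ID  string
	Vec []float64
}

// cosineDistance returns 1 - cos(a, b), clamped to [0, 1]. Without clamping,
// vectors with negative components can yield distances up to 2.
// Mismatched or zero vectors are treated as maximally distant.
func cosineDistance(a, b []float64) float64 {
	if len(a) == 0 || len(a) != len(b) {
		return 1
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 1
	}
	d := 1 - dot/math.Sqrt(na*nb)
	return math.Min(1, math.Max(0, d))
}

// nearestPattern returns the ID and distance of the closest pattern.
// It returns "" and 1 when no pattern is closer than the maximum distance.
func nearestPattern(step []float64, patterns []Pattern) (string, float64) {
	bestID, best := "", 1.0
	for _, p := range patterns {
		if d := cosineDistance(step, p.Vec); d < best {
			bestID, best = p.ID, d
		}
	}
	return bestID, best
}

func main() {
	patterns := []Pattern{
		{ID: "reasoning_loop", Vec: []float64{1, 1, 0}},
		{ID: "tool_error", Vec: []float64{0, 0, 1}},
	}
	id, dist := nearestPattern([]float64{2, 2, 0}, patterns)
	fmt.Printf("nearest=%s distance=%.2f\n", id, dist)
}
```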

`GET /v1/alerts`

Retrieve recent alerts (up to the most recent 1,000).

Query parameters:

failure_type — optional filter; one of hallucination, deadlock, reasoning_loop, tool_error, coordination_failure, context_overflow, timeout, unknown

Response:

{"alerts": [<Alert>, ...], "count": <int>}
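
A client can validate the filter before sending the request. The sketch below builds the query URL and rejects values outside the list above; `alertsURL` is a hypothetical helper, and whether the server itself rejects unknown values is not stated in this README.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// failureTypes lists the documented values for the failure_type filter.
var failureTypes = map[string]bool{
	"hallucination": true, "deadlock": true, "reasoning_loop": true,
	"tool_error": true, "coordination_failure": true,
	"context_overflow": true, "timeout": true, "unknown": true,
}

// alertsURL returns the GET /v1/alerts URL, adding the optional
// failure_type filter. An empty failureType means "no filter".
func alertsURL(baseURL, failureType string) (string, error) {
	u := strings.TrimRight(baseURL, "/") + "/v1/alerts"
	if failureType == "" {
		return u, nil
	}
	if !failureTypes[failureType] {
		return "", fmt.Errorf("unknown failure_type %q", failureType)
	}
	q := url.Values{}
	q.Set("failure_type", failureType)
	return u + "?" + q.Encode(), nil
}

func main() {
	u, err := alertsURL("http://localhost:8080", "reasoning_loop")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(u) // prints http://localhost:8080/v1/alerts?failure_type=reasoning_loop
}
```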

`POST /v1/diagnose`

Perform root-cause analysis on an alert.

Request body:

{
  "alert": <Alert>,
  "context": [<Step>, ...]   // optional; falls back to alert.context
}

Response: RootCauseReport

{
  "trace_id": "string",
  "span_id": "string",
  "failure_type": "string",
  "summary": "string",
  "contributing_spans": ["span-id", ...],
  "evidence": ["anomaly score: 0.720 ...", ...],
  "failure_chain": ["step 1 description", ...]
}

`POST /v1/mitigate`

Get a mitigation plan for a failure.

Request body:

{"failure_type": "string", "trace_id": "string", "score": 0.72}

Response: Plan

{
  "failure_type": "string",
  "trace_id": "string",
  "suggestions": [{"action": "string", "description": "string", "priority": 1, "automated": true}],
  "summary": "string"
}
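
An orchestrator that wants to act on a plan without human review might keep only the suggestions marked `automated` and apply them in priority order. The sketch below shows one way to do that. It assumes a lower `priority` number means more urgent, which the sample response suggests but this README does not state; `automatedActions` is a hypothetical helper name.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// Suggestion and Plan mirror the documented response fields.
type Suggestion struct {
	Action      string `json:"action"`
	Description string `json:"description"`
	Priority    int    `json:"priority"`
	Automated   bool   `json:"automated"`
}

type Plan struct {
	FailureType string       `json:"failure_type"`
	TraceID     string       `json:"trace_id"`
	Suggestions []Suggestion `json:"suggestions"`
	Summary     string       `json:"summary"`
}

// automatedActions decodes a Plan and returns the actions flagged as
// automated, ordered by ascending priority (assumed: 1 is most urgent).
func automatedActions(planJSON []byte) ([]string, error) {
	var p Plan
	if err := json.Unmarshal(planJSON, &p); err != nil {
		return nil, err
	}
	var auto []Suggestion
	for _, s := range p.Suggestions {
		if s.Automated {
			auto = append(auto, s)
		}
	}
	// Stable sort keeps the server's order for equal priorities.
	sort.SliceStable(auto, func(i, j int) bool { return auto[i].Priority < auto[j].Priority })

	actions := make([]string, 0, len(auto))
	for _, s := range auto {
		actions = append(actions, s.Action)
	}
	return actions, nil
}

func main() {
	plan := []byte(`{
	  "failure_type": "hallucination",
	  "trace_id": "abc123",
	  "suggestions": [
	    {"action": "reduce_temperature", "priority": 3, "automated": true},
	    {"action": "enable_grounding", "priority": 1, "automated": true}
	  ]
	}`)
	actions, err := automatedActions(plan)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(actions) // prints [enable_grounding reduce_temperature]
}
```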

`GET /healthz`

Health check.

Response: {"status": "ok", "time": "RFC3339"}

Architecture

internal/
  api/
    handlers.go         — HTTP handlers wiring all subsystems; registers routes for ingest, alerts, diagnose, mitigate, healthz
  config/
    config.go           — Config structs and JSON load/save; defaults: HTTP :8080, gRPC :9090, anomaly threshold 0.35, window 10
  detector/
    patterns.go         — FailureType constants, Pattern struct, builtin keyword-based seed patterns, KeywordMatch helper
    anomaly.go          — AnomalyScorer: cosine similarity scoring of embeddings against pattern library
    stream.go           — StreamDetector: per-step encoding + hybrid scoring + sliding context window; targets <50ms
  diagnosis/
    failure_library.go  — Library: in-memory pattern store with hit-count-based eviction and JSON persistence
    root_cause.go       — Analyzer: evidence collection, failure chain construction, contributing span identification
  encoder/
    embedding.go        — Vector type, Vocabulary, TFIDFEmbedder (sparse TF-IDF), SimpleEmbedder
    contrastive.go      — ContrastiveEncoder: NT-Xent linear projection; EmbedStep pipeline (TF-IDF → projection)
  mitigation/
    suggester.go        — Suggester: maps each FailureType to a prioritised list of Suggestions (automated + manual)
  trace/
    model.go            — Core types: Span, Trace, Step, Event, DetectionResult, SpanKind, Status constants
    parser.go           — ParseOTel (OTLP/JSON), ParseSpans (native JSON array), SpansToSteps conversion
    collector.go        — Collector: accumulates spans by trace ID
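
The sliding context window mentioned for stream.go can be pictured as a bounded buffer of the most recent steps. The sketch below is an illustration under assumptions, not the StreamDetector implementation: the `Window` type, its methods, and the trimmed-down `Step` are hypothetical, and only the default size of 10 comes from the config defaults listed above.

```go
package main

import "fmt"

// Step is a trimmed-down stand-in for the Step type in trace/model.go.
type Step struct {
	TraceID string
	SpanID  string
}

// Window keeps at most size recent steps, oldest first.
type Window struct {
	size  int
	steps []Step
}

// NewWindow creates a window; sizes below 1 are raised to 1.
func NewWindow(size int) *Window {
	if size < 1 {
		size = 1
	}
	return &Window{size: size, steps: make([]Step, 0, size)}
}

// Push appends a step, dropping the oldest one when the window is full.
func (w *Window) Push(s Step) {
	if len(w.steps) == w.size {
		copy(w.steps, w.steps[1:])
		w.steps = w.steps[:w.size-1]
	}
	w.steps = append(w.steps, s)
}

// Snapshot returns a copy of the current context, oldest first.
func (w *Window) Snapshot() []Step {
	out := make([]Step, len(w.steps))
	copy(out, w.steps)
	return out
}

func main() {
	w := NewWindow(10) // 10 matches the documented default window
	for i := 1; i <= 12; i++ {
		w.Push(Step{TraceID: "abc123", SpanID: fmt.Sprintf("span-%d", i)})
	}
	ctx := w.Snapshot()
	fmt.Println(len(ctx), ctx[0].SpanID, ctx[len(ctx)-1].SpanID) // prints 10 span-3 span-12
}
```

Shifting the slice on every push is linear in the window size, which is negligible at size 10. A ring buffer would avoid the copy for much larger windows.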

References

Research Papers

EAGER: Embedding-Augmented GEneral Reasoning — Contrastive learning objective for reasoning-scoped embeddings; the NT-Xent contrastive loss applied to agent trace steps is the core technique used by TraceGuard's ContrastiveEncoder.
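
For reference, the NT-Xent (normalized temperature-scaled cross-entropy) loss in its standard form, for a positive pair of embeddings $(z_i, z_j)$ in a batch of $2N$ views, is given below. How positive pairs are formed from trace steps, and the temperature used, are not specified in this README, so read this as the generic definition rather than TraceGuard's exact objective.

```latex
\ell_{i,j} = -\log
  \frac{\exp\left(\operatorname{sim}(z_i, z_j)/\tau\right)}
       {\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\operatorname{sim}(z_i, z_k)/\tau\right)},
\qquad
\operatorname{sim}(u, v) = \frac{u^{\top} v}{\lVert u \rVert \, \lVert v \rVert}
```

Here $\tau > 0$ is the temperature and $\operatorname{sim}$ is cosine similarity, the same measure the detector uses when comparing step embeddings against the pattern library.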

Related Projects

ollama/ollama — Local LLM serving patterns; informed the sidecar deployment model
langgenius/dify — Multi-agent orchestration; reference for trace schema design and integration points
langchain-ai/langchain — Agent step patterns and tool-call conventions that TraceGuard's span kinds model
open-webui/open-webui — UI/UX reference for agent observability dashboards
huggingface/transformers — Embedding model reference for the TF-IDF + projection pipeline
f/prompts.chat — Prompt engineering patterns referenced in mitigation suggestions
Shubhamsaboo/awesome-llm-apps — Survey of production LLM app patterns that informed the failure taxonomy

Market Analysis

The buyer is platform engineering teams at companies running multi-agent LLM systems in production (fintech, customer support automation, coding assistants) — the same teams already paying for Datadog and PagerDuty but getting zero visibility into agent-specific failures. Revenue model is open-core: the single-binary daemon with local detection is free (drives adoption), while a SaaS tier ($500–2000/mo per org) adds a hosted failure pattern library trained across anonymized customer traces, team dashboards, and compliance-grade audit trails. The moat is the proprietary failure embedding model that improves with every customer's traces — a classic data flywheel that new entrants cannot replicate without comparable deployment scale.

License

MIT
