trace-guard

Real-time failure detection, root-cause diagnosis, and mitigation suggestions for LLM multi-agent systems.

What it does

TraceGuard is a sidecar service that ingests OpenTelemetry-compatible agent traces, encodes each reasoning step with contrastive embeddings inspired by the EAGER paper (arXiv:2603.21522), and targets failure detection in under 50 ms per step. It exposes a REST API that agent orchestrators (LangChain, Dify, CrewAI) can query for real-time alerts, root-cause reports, and reflexive mitigation plans, so a system can self-heal without per-trace LLM calls.

Install

go install github.com/timholm/trace-guard@latest

Or build from source:

git clone https://github.com/timholm/trace-guard.git
cd trace-guard
make build

Usage

Ingest a native span and check for anomalies:

curl -X POST http://localhost:8080/v1/ingest/spans \
  -H 'Content-Type: application/json' \
  -d '[{
    "trace_id": "abc123",
    "span_id": "span-1",
    "name": "agent_step",
    "kind": "agent_step",
    "start_time": "2026-03-24T10:00:00Z",
    "end_time": "2026-03-24T10:00:01Z",
    "status": "ok",
    "input": "What is the capital of France?",
    "output": "The capital of France is Berlin."
  }]'

Response:

{
  "ingested": 1,
  "results": [{
    "step": {"trace_id": "abc123", "span_id": "span-1", ...},
    "is_anomaly": true,
    "anomaly_score": 0.72,
    "failure_type": "hallucination",
    "confidence": 0.85,
    "latency_ms": 1.2
  }]
}
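For callers that prefer Go over curl, the same request can be sketched as follows. This is a minimal client under stated assumptions, not the project's own SDK: the type and function names (`Span`, `DetectionResult`, `IngestResponse`, `ingestSpans`, `anomalies`) are hypothetical and mirror only the JSON fields shown above, the service is assumed to listen on `localhost:8080`, and any 2xx status is treated as success.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// Span mirrors the native span fields used in the curl example (hypothetical type).
type Span struct {
	TraceID   string    `json:"trace_id"`
	SpanID    string    `json:"span_id"`
	Name      string    `json:"name"`
	Kind      string    `json:"kind"`
	StartTime time.Time `json:"start_time"` // encoded as RFC 3339 by encoding/json
	EndTime   time.Time `json:"end_time"`
	Status    string    `json:"status"`
	Input     string    `json:"input,omitempty"`
	Output    string    `json:"output,omitempty"`
}

// DetectionResult holds the response fields this sketch reads; other fields are ignored.
type DetectionResult struct {
	IsAnomaly    bool    `json:"is_anomaly"`
	AnomalyScore float64 `json:"anomaly_score"`
	FailureType  string  `json:"failure_type"`
	Confidence   float64 `json:"confidence"`
	LatencyMS    float64 `json:"latency_ms"`
}

// IngestResponse is the envelope returned by the ingest endpoints.
type IngestResponse struct {
	Ingested int               `json:"ingested"`
	Results  []DetectionResult `json:"results"`
}

// ingestSpans posts spans to baseURL + "/v1/ingest/spans" and decodes the reply.
func ingestSpans(client *http.Client, baseURL string, spans []Span) (*IngestResponse, error) {
	body, err := json.Marshal(spans)
	if err != nil {
		return nil, fmt.Errorf("encode spans: %w", err)
	}
	resp, err := client.Post(baseURL+"/v1/ingest/spans", "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, fmt.Errorf("post spans: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 { // assumption: any 2xx means success
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	var out IngestResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, fmt.Errorf("decode response: %w", err)
	}
	return &out, nil
}

// anomalies returns only the results the detector flagged.
func anomalies(r *IngestResponse) []DetectionResult {
	var flagged []DetectionResult
	for _, res := range r.Results {
		if res.IsAnomaly {
			flagged = append(flagged, res)
		}
	}
	return flagged
}

func main() {
	start := time.Date(2026, 3, 24, 10, 0, 0, 0, time.UTC)
	spans := []Span{{
		TraceID: "abc123", SpanID: "span-1", Name: "agent_step", Kind: "agent_step",
		StartTime: start, EndTime: start.Add(time.Second), Status: "ok",
		Input:  "What is the capital of France?",
		Output: "The capital of France is Berlin.",
	}}
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := ingestSpans(client, "http://localhost:8080", spans)
	if err != nil {
		fmt.Println("ingest failed:", err)
		return
	}
	for _, r := range anomalies(resp) {
		fmt.Printf("%s score=%.2f confidence=%.2f\n", r.FailureType, r.AnomalyScore, r.Confidence)
	}
}
```

Decoding into a narrow struct keeps the client tolerant of extra response fields such as `step`, which `encoding/json` skips by default.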

Retrieve recent alerts, filtered by failure type:

curl 'http://localhost:8080/v1/alerts?failure_type=reasoning_loop'

Response:

{
  "alerts": [{
    "trace_id": "abc123",
    "span_id": "span-5",
    "failure_type": "reasoning_loop",
    "score": 0.81,
    "confidence": 0.91,
    "detected_at": "2026-03-24T10:01:00Z",
    "latency_ms": 3.4
  }],
  "count": 1
}

Get a mitigation plan for a detected failure:

curl -X POST http://localhost:8080/v1/mitigate \
  -H 'Content-Type: application/json' \
  -d '{"failure_type": "hallucination", "trace_id": "abc123", "score": 0.72}'

Response (abridged; two of the four suggestions are shown):

{
  "failure_type": "hallucination",
  "trace_id": "abc123",
  "summary": "4 mitigation action(s) suggested for hallucination failure.",
  "suggestions": [
    {"action": "enable_grounding", "description": "Enable retrieval-augmented generation (RAG) to ground LLM output in verified sources.", "priority": 1, "automated": true},
    {"action": "reduce_temperature", "description": "Reduce LLM sampling temperature to decrease output randomness.", "priority": 3, "automated": true}
  ]
}
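An orchestrator that wants to act on a plan automatically might first pick out the suggestions marked `automated`. The sketch below shows one way to do that in Go. The `Plan` and `Suggestion` types and the `automatedActions` helper are hypothetical names that mirror the JSON above, and the ordering rule (a lower `priority` value runs first) is an assumption, since the README does not state how priorities are ranked.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// Suggestion mirrors one entry of the "suggestions" array (hypothetical type).
type Suggestion struct {
	Action      string `json:"action"`
	Description string `json:"description"`
	Priority    int    `json:"priority"`
	Automated   bool   `json:"automated"`
}

// Plan mirrors the /v1/mitigate response body (hypothetical type).
type Plan struct {
	FailureType string       `json:"failure_type"`
	TraceID     string       `json:"trace_id"`
	Summary     string       `json:"summary"`
	Suggestions []Suggestion `json:"suggestions"`
}

// automatedActions returns the action names of automated suggestions,
// ordered by ascending priority (assumption: 1 is the most urgent).
func automatedActions(p Plan) []string {
	auto := make([]Suggestion, 0, len(p.Suggestions))
	for _, s := range p.Suggestions {
		if s.Automated {
			auto = append(auto, s)
		}
	}
	sort.SliceStable(auto, func(i, j int) bool { return auto[i].Priority < auto[j].Priority })
	actions := make([]string, len(auto))
	for i, s := range auto {
		actions[i] = s.Action
	}
	return actions
}

func main() {
	// Trimmed copy of the example response, with descriptions omitted.
	raw := []byte(`{
	  "failure_type": "hallucination",
	  "trace_id": "abc123",
	  "suggestions": [
	    {"action": "enable_grounding", "priority": 1, "automated": true},
	    {"action": "reduce_temperature", "priority": 3, "automated": true}
	  ]
	}`)
	var plan Plan
	if err := json.Unmarshal(raw, &plan); err != nil {
		fmt.Println("decode plan:", err)
		return
	}
	fmt.Println(automatedActions(plan))
}
```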

API

POST /v1/ingest/otel

Ingest an OTLP/JSON trace export.

Request body: OTLP JSON (ResourceSpans array)

Response:

{"ingested": <int>, "results": [<DetectionResult>, ...]}

POST /v1/ingest/spans

Ingest an array of native Span objects.

Request body:

[{
  "trace_id": "string",
  "span_id": "string",
  "parent_id": "string",        // optional
  "name": "string",
  "kind": "llm_call|tool_call|agent_step|orchestrator|unknown",
  "start_time": "RFC3339",
  "end_time": "RFC3339",
  "status": "ok|error|timeout",
  "attributes": {"key": "value"},  // optional
  "events": [...],                 // optional
  "input": "string",               // optional
  "output": "string",              // optional
  "error_message": "string"        // optional
}]

Response:

{"ingested": <int>, "results": [<DetectionResult>, ...]}

DetectionResult fields:

  • step — the Step derived from the span
  • is_anomaly — whether the step was flagged
  • anomaly_score — cosine-distance-based score (0–1)
  • matched_pattern — pattern ID that matched, if any
  • failure_type — one of the failure type constants below
  • confidence — detector confidence (0–1)
  • latency_ms — detection latency in milliseconds
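Because `kind` and `status` accept only the values listed in the span schema above, a client may want to check spans before sending them. The following is a sketch of such a pre-flight check. The `validateSpan` helper is hypothetical, and the rules beyond the two enumerations (required IDs, `end_time` not before `start_time`) are assumptions; the server's own validation behaviour is not documented here.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Span carries the required fields of the native span schema as raw strings,
// so timestamps can be checked before they are sent (hypothetical type).
type Span struct {
	TraceID   string `json:"trace_id"`
	SpanID    string `json:"span_id"`
	ParentID  string `json:"parent_id,omitempty"`
	Name      string `json:"name"`
	Kind      string `json:"kind"`
	StartTime string `json:"start_time"`
	EndTime   string `json:"end_time"`
	Status    string `json:"status"`
}

// Allowed values, copied from the schema above.
var validKinds = map[string]bool{
	"llm_call": true, "tool_call": true, "agent_step": true, "orchestrator": true, "unknown": true,
}
var validStatuses = map[string]bool{"ok": true, "error": true, "timeout": true}

// validateSpan reports the first problem found, or nil if the span looks well formed.
func validateSpan(s Span) error {
	if s.TraceID == "" || s.SpanID == "" {
		return errors.New("trace_id and span_id are required") // assumption
	}
	if !validKinds[s.Kind] {
		return fmt.Errorf("unknown kind %q", s.Kind)
	}
	if !validStatuses[s.Status] {
		return fmt.Errorf("unknown status %q", s.Status)
	}
	start, err := time.Parse(time.RFC3339, s.StartTime)
	if err != nil {
		return fmt.Errorf("start_time: %w", err)
	}
	end, err := time.Parse(time.RFC3339, s.EndTime)
	if err != nil {
		return fmt.Errorf("end_time: %w", err)
	}
	if end.Before(start) {
		return errors.New("end_time precedes start_time") // assumption
	}
	return nil
}

func main() {
	s := Span{
		TraceID: "abc123", SpanID: "span-1", Name: "agent_step", Kind: "agent_step",
		StartTime: "2026-03-24T10:00:00Z", EndTime: "2026-03-24T10:00:01Z", Status: "ok",
	}
	fmt.Println("valid span error:", validateSpan(s))

	s.Kind = "planner" // not one of the documented kinds
	fmt.Println("invalid span error:", validateSpan(s))
}
```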

GET /v1/alerts

Retrieve recent alerts (at most the 1000 most recent).

Query parameters:

  • failure_type — optional filter; one of hallucination, deadlock, reasoning_loop, tool_error, coordination_failure, context_overflow, timeout, unknown

Response:

{"alerts": [<Alert>, ...], "count": <int>}
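The query can be built and checked in Go before it is sent. This sketch validates `failure_type` against the documented values and then fetches the alerts. The names `alertsURL`, `fetchAlerts`, `Alert`, and `AlertsResponse` are hypothetical, the `Alert` fields are taken from the example response in the Usage section, and the base URL `http://localhost:8080` is assumed.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"strings"
	"time"
)

// failureTypes lists the values documented for the failure_type filter.
var failureTypes = map[string]bool{
	"hallucination": true, "deadlock": true, "reasoning_loop": true, "tool_error": true,
	"coordination_failure": true, "context_overflow": true, "timeout": true, "unknown": true,
}

// Alert mirrors the fields shown in the example alerts response (hypothetical type).
type Alert struct {
	TraceID     string    `json:"trace_id"`
	SpanID      string    `json:"span_id"`
	FailureType string    `json:"failure_type"`
	Score       float64   `json:"score"`
	Confidence  float64   `json:"confidence"`
	DetectedAt  time.Time `json:"detected_at"`
	LatencyMS   float64   `json:"latency_ms"`
}

// AlertsResponse is the envelope returned by GET /v1/alerts.
type AlertsResponse struct {
	Alerts []Alert `json:"alerts"`
	Count  int     `json:"count"`
}

// alertsURL builds the request URL; an empty failureType means no filter.
func alertsURL(baseURL, failureType string) (string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return "", fmt.Errorf("parse base URL: %w", err)
	}
	u.Path = strings.TrimRight(u.Path, "/") + "/v1/alerts"
	if failureType != "" {
		if !failureTypes[failureType] {
			return "", fmt.Errorf("unknown failure_type %q", failureType)
		}
		q := url.Values{}
		q.Set("failure_type", failureType)
		u.RawQuery = q.Encode()
	}
	return u.String(), nil
}

// fetchAlerts performs the GET request and decodes the response body.
func fetchAlerts(client *http.Client, baseURL, failureType string) (*AlertsResponse, error) {
	u, err := alertsURL(baseURL, failureType)
	if err != nil {
		return nil, err
	}
	resp, err := client.Get(u)
	if err != nil {
		return nil, fmt.Errorf("get alerts: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 { // assumption: any 2xx means success
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	var out AlertsResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, fmt.Errorf("decode alerts: %w", err)
	}
	return &out, nil
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	alerts, err := fetchAlerts(client, "http://localhost:8080", "reasoning_loop")
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	for _, a := range alerts.Alerts {
		fmt.Printf("%s/%s %s score=%.2f\n", a.TraceID, a.SpanID, a.FailureType, a.Score)
	}
}
```

Rejecting an unknown filter value on the client side avoids a round trip whose result would depend on how the server treats unrecognised values, which this README does not specify.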

POST /v1/diagnose

Perform root-cause analysis on an alert.

Request body:

{
  "alert": <Alert>,
  "context": [<Step>, ...]   // optional; falls back to alert.context
}

Response: RootCauseReport

{
  "trace_id": "string",
  "span_id": "string",
  "failure_type": "string",
  "summary": "string",
  "contributing_spans": ["span-id", ...],
  "evidence": ["anomaly score: 0.720 ...", ...],
  "failure_chain": ["step 1 description", ...]
}
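A diagnose call can be sketched in Go along the same lines. Everything named here is illustrative: `DiagnoseRequest`, `RootCauseReport`, `diagnose`, and `renderReport` are hypothetical, the `Alert` fields come from the example alerts response, and context steps are passed through as raw JSON because the `Step` schema is not spelled out in this README.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
	"time"
)

// Alert mirrors the fields shown in the example alerts response (hypothetical type).
type Alert struct {
	TraceID     string    `json:"trace_id"`
	SpanID      string    `json:"span_id"`
	FailureType string    `json:"failure_type"`
	Score       float64   `json:"score"`
	Confidence  float64   `json:"confidence"`
	DetectedAt  time.Time `json:"detected_at"`
}

// DiagnoseRequest is the request body; Context is optional and kept as raw JSON.
type DiagnoseRequest struct {
	Alert   Alert             `json:"alert"`
	Context []json.RawMessage `json:"context,omitempty"`
}

// RootCauseReport mirrors the documented response fields.
type RootCauseReport struct {
	TraceID           string   `json:"trace_id"`
	SpanID            string   `json:"span_id"`
	FailureType       string   `json:"failure_type"`
	Summary           string   `json:"summary"`
	ContributingSpans []string `json:"contributing_spans"`
	Evidence          []string `json:"evidence"`
	FailureChain      []string `json:"failure_chain"`
}

// diagnose posts the request to baseURL + "/v1/diagnose" and decodes the report.
func diagnose(client *http.Client, baseURL string, req DiagnoseRequest) (*RootCauseReport, error) {
	body, err := json.Marshal(req)
	if err != nil {
		return nil, fmt.Errorf("encode request: %w", err)
	}
	resp, err := client.Post(baseURL+"/v1/diagnose", "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, fmt.Errorf("post diagnose: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 { // assumption: any 2xx means success
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	var report RootCauseReport
	if err := json.NewDecoder(resp.Body).Decode(&report); err != nil {
		return nil, fmt.Errorf("decode report: %w", err)
	}
	return &report, nil
}

// renderReport formats a report as a header line followed by the numbered failure chain.
func renderReport(r *RootCauseReport) string {
	var b strings.Builder
	fmt.Fprintf(&b, "%s/%s [%s]: %s\n", r.TraceID, r.SpanID, r.FailureType, r.Summary)
	for i, step := range r.FailureChain {
		fmt.Fprintf(&b, "  %d. %s\n", i+1, step)
	}
	return b.String()
}

func main() {
	req := DiagnoseRequest{Alert: Alert{
		TraceID: "abc123", SpanID: "span-5", FailureType: "reasoning_loop",
		Score: 0.81, Confidence: 0.91,
		DetectedAt: time.Date(2026, 3, 24, 10, 1, 0, 0, time.UTC),
	}}
	client := &http.Client{Timeout: 5 * time.Second}
	report, err := diagnose(client, "http://localhost:8080", req)
	if err != nil {
		fmt.Println("diagnose failed:", err)
		return
	}
	fmt.Print(renderReport(report))
}
```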

POST /v1/mitigate

Get a mitigation plan for a failure.

Request body:

{"failure_type": "string", "trace_id": "string", "score": 0.72}

Response: Plan

{
  "failure_type": "string",
  "trace_id": "string",
  "suggestions": [{"action": "string", "description": "string", "priority": 1, "automated": true}],
  "summary": "string"
}

GET /healthz

Health check.

Response: {"status": "ok", "time": "RFC3339"}

Architecture

internal/
  api/
    handlers.go         — HTTP handlers wiring all subsystems; registers routes for ingest, alerts, diagnose, mitigate, healthz
  config/
    config.go           — Config structs and JSON load/save; defaults: HTTP :8080, gRPC :9090, anomaly threshold 0.35, window 10
  detector/
    patterns.go         — FailureType constants, Pattern struct, builtin keyword-based seed patterns, KeywordMatch helper
    anomaly.go          — AnomalyScorer: cosine similarity scoring of embeddings against pattern library
    stream.go           — StreamDetector: per-step encoding + hybrid scoring + sliding context window; targets <50ms
  diagnosis/
    failure_library.go  — Library: in-memory pattern store with hit-count-based eviction and JSON persistence
    root_cause.go       — Analyzer: evidence collection, failure chain construction, contributing span identification
  encoder/
    embedding.go        — Vector type, Vocabulary, TFIDFEmbedder (sparse TF-IDF), SimpleEmbedder
    contrastive.go      — ContrastiveEncoder: NT-Xent linear projection; EmbedStep pipeline (TF-IDF → projection)
  mitigation/
    suggester.go        — Suggester: maps each FailureType to a prioritised list of Suggestions (automated + manual)
  trace/
    model.go            — Core types: Span, Trace, Step, Event, DetectionResult, SpanKind, Status constants
    parser.go           — ParseOTel (OTLP/JSON), ParseSpans (native JSON array), SpansToSteps conversion
    collector.go        — Collector: accumulates spans by trace ID
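To make the role of `anomaly.go` more concrete, the sketch below scores a step embedding against a small pattern library using cosine similarity. It is an illustration under assumptions and not the repository's code: `Pattern`, `Result`, and `scoreStep` are hypothetical names, the score is taken to be the highest similarity to any pattern clamped to the range 0 to 1, and a step is flagged when that score reaches the default threshold of 0.35 from `config.go`. The embeddings are toy vectors.

```go
package main

import (
	"fmt"
	"math"
)

// Pattern is a known failure pattern with a precomputed embedding (hypothetical type).
type Pattern struct {
	ID          string
	FailureType string
	Embedding   []float64
}

// Result is the outcome of scoring one step (hypothetical type).
type Result struct {
	PatternID   string
	FailureType string
	Score       float64
	IsAnomaly   bool
}

// cosineSimilarity returns the cosine of the angle between a and b,
// or 0 when the vectors differ in length or either has zero norm.
func cosineSimilarity(a, b []float64) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

// scoreStep finds the most similar pattern and flags the step when the
// similarity reaches threshold. Negative similarities are treated as 0.
func scoreStep(step []float64, library []Pattern, threshold float64) Result {
	var res Result
	for _, p := range library {
		sim := math.Min(cosineSimilarity(step, p.Embedding), 1) // guard against rounding above 1
		if sim > res.Score {
			res.Score = sim
			res.PatternID = p.ID
			res.FailureType = p.FailureType
		}
	}
	res.IsAnomaly = res.Score >= threshold
	return res
}

func main() {
	library := []Pattern{
		{ID: "loop-1", FailureType: "reasoning_loop", Embedding: []float64{0.9, 0.1, 0.0}},
		{ID: "halluc-1", FailureType: "hallucination", Embedding: []float64{0.0, 0.2, 0.9}},
	}
	step := []float64{0.1, 0.3, 0.8} // toy embedding of one reasoning step
	res := scoreStep(step, library, 0.35)
	fmt.Printf("pattern=%s type=%s score=%.3f anomaly=%v\n",
		res.PatternID, res.FailureType, res.Score, res.IsAnomaly)
}
```

The tree describes `stream.go` as using hybrid scoring, so the real detector presumably combines an embedding score like this with the keyword matching from `patterns.go`; how the two are weighted is not stated here.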

References

Research Papers

  • EAGER: Embedding-Augmented GEneral Reasoning — Contrastive learning objective for reasoning-scoped embeddings; the NT-Xent contrastive loss applied to agent trace steps is the core technique used by TraceGuard's ContrastiveEncoder.
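For readers unfamiliar with NT-Xent (the normalized temperature-scaled cross-entropy loss), here is a small Go sketch of the standard formulation as popularised by SimCLR. It is not taken from the EAGER paper or from `contrastive.go`: the function name `ntXentLoss`, the convention that rows 2k and 2k+1 form a positive pair, the temperature of 0.5, and the toy vectors are all assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of a and b, or 0 for zero-norm
// or mismatched vectors.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

// ntXentLoss computes the mean NT-Xent loss over a batch of 2N embeddings.
// Assumption: rows 2k and 2k+1 are two views of the same item (a positive
// pair); every other row in the batch acts as a negative.
// It returns NaN if the batch is empty or has an odd number of rows.
func ntXentLoss(z [][]float64, tau float64) float64 {
	n := len(z)
	if n < 2 || n%2 != 0 {
		return math.NaN()
	}
	var total float64
	for i := 0; i < n; i++ {
		j := i ^ 1 // index of the positive partner: 0<->1, 2<->3, ...
		var denom float64
		for k := 0; k < n; k++ {
			if k == i {
				continue // the anchor is excluded from its own denominator
			}
			denom += math.Exp(cosine(z[i], z[k]) / tau)
		}
		total += -math.Log(math.Exp(cosine(z[i], z[j])/tau) / denom)
	}
	return total / float64(n)
}

func main() {
	// Positive pairs point the same way, so the loss should be low.
	aligned := [][]float64{{1, 0}, {1, 0}, {0, 1}, {0, 1}}
	// Positive pairs are orthogonal, so the loss should be higher.
	misaligned := [][]float64{{1, 0}, {0, 1}, {1, 0}, {0, 1}}
	fmt.Printf("aligned:    %.4f\n", ntXentLoss(aligned, 0.5))
	fmt.Printf("misaligned: %.4f\n", ntXentLoss(misaligned, 0.5))
}
```

Minimising this loss pulls the two views of each pair together and pushes them away from the rest of the batch, which is what lets a downstream scorer compare steps by cosine similarity.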

Market Analysis

The buyers are platform engineering teams at companies running multi-agent LLM systems in production (fintech, customer support automation, coding assistants). These teams already pay for Datadog and PagerDuty but get no visibility into agent-specific failures. The revenue model is open-core: the single-binary daemon with local detection is free and drives adoption, while a SaaS tier ($500–2000/mo per org) adds a hosted failure pattern library trained across anonymized customer traces, team dashboards, and compliance-grade audit trails. The moat is the proprietary failure embedding model, which improves with every customer's traces: a data flywheel that new entrants cannot replicate without comparable deployment scale.

License

MIT

About

Teams running LLM multi-agent systems in production have no way to detect, diagnose, or auto-remediate agent failures in real time — they discover broken reasoning chains, coordination deadlocks, and hallucination cascades only after costly downstream damage.
