Real-time failure detection, root-cause diagnosis, and mitigation suggestions for LLM multi-agent systems.
TraceGuard is a sidecar service that ingests OpenTelemetry-compatible agent traces, encodes each reasoning step with contrastive embeddings inspired by the EAGER paper (arXiv:2603.21522), and targets failure detection in under 50 ms per step. It exposes a REST API that agent orchestrators (LangChain, Dify, CrewAI) can query for real-time alerts, root-cause reports, and reflexive mitigation plans, enabling self-healing without per-trace LLM calls.
go install github.com/timholm/trace-guard@latest
Or build from source:
git clone https://github.com/timholm/trace-guard.git
cd trace-guard
make build
Ingest a native span and check for anomalies:
curl -X POST http://localhost:8080/v1/ingest/spans \
-H 'Content-Type: application/json' \
-d '[{
"trace_id": "abc123",
"span_id": "span-1",
"name": "agent_step",
"kind": "agent_step",
"start_time": "2026-03-24T10:00:00Z",
"end_time": "2026-03-24T10:00:01Z",
"status": "ok",
"input": "What is the capital of France?",
"output": "The capital of France is Berlin."
}]'
Response:
{
"ingested": 1,
"results": [{
"step": {"trace_id": "abc123", "span_id": "span-1", ...},
"is_anomaly": true,
"anomaly_score": 0.72,
"failure_type": "hallucination",
"confidence": 0.85,
"latency_ms": 1.2
}]
}
Retrieve recent alerts, filtered by failure type:
curl 'http://localhost:8080/v1/alerts?failure_type=reasoning_loop'
Response:
{
"alerts": [{
"trace_id": "abc123",
"span_id": "span-5",
"failure_type": "reasoning_loop",
"score": 0.81,
"confidence": 0.91,
"detected_at": "2026-03-24T10:01:00Z",
"latency_ms": 3.4
}],
"count": 1
}
Get a mitigation plan for a detected failure:
curl -X POST http://localhost:8080/v1/mitigate \
-H 'Content-Type: application/json' \
-d '{"failure_type": "hallucination", "trace_id": "abc123", "score": 0.72}'
Response:
{
"failure_type": "hallucination",
"trace_id": "abc123",
"summary": "4 mitigation action(s) suggested for hallucination failure.",
"suggestions": [
{"action": "enable_grounding", "description": "Enable retrieval-augmented generation (RAG) to ground LLM output in verified sources.", "priority": 1, "automated": true},
{"action": "reduce_temperature", "description": "Reduce LLM sampling temperature to decrease output randomness.", "priority": 3, "automated": true}
]
}
Ingest an OTLP/JSON trace export.
Request body: OTLP JSON (ResourceSpans array)
Response:
{"ingested": <int>, "results": [<DetectionResult>, ...]}
Ingest an array of native Span objects.
Request body:
[{
"trace_id": "string",
"span_id": "string",
"parent_id": "string", // optional
"name": "string",
"kind": "llm_call|tool_call|agent_step|orchestrator|unknown",
"start_time": "RFC3339",
"end_time": "RFC3339",
"status": "ok|error|timeout",
"attributes": {"key": "value"}, // optional
"events": [...], // optional
"input": "string", // optional
"output": "string", // optional
"error_message": "string" // optional
}]
Response:
{"ingested": <int>, "results": [<DetectionResult>, ...]}
DetectionResult fields:
- step: the Step derived from the span
- is_anomaly: whether the step was flagged
- anomaly_score: cosine-distance-based score (0–1)
- matched_pattern: pattern ID that matched, if any
- failure_type: one of the failure type constants below
- confidence: detector confidence (0–1)
- latency_ms: detection latency in milliseconds
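A client consuming the ingest response could mirror these fields in Go as follows. This is a minimal sketch: the type names `IngestResponse`, `DetectionResult`, and `Step` and the helper functions are illustrative assumptions, not TraceGuard's actual exported types, and `Step` carries only the two fields visible in the example response.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Step mirrors only the fields shown in the example response (assumption:
// the real Step type likely carries more).
type Step struct {
	TraceID string `json:"trace_id"`
	SpanID  string `json:"span_id"`
}

// DetectionResult is a hypothetical client-side mirror of the documented fields.
type DetectionResult struct {
	Step           Step    `json:"step"`
	IsAnomaly      bool    `json:"is_anomaly"`
	AnomalyScore   float64 `json:"anomaly_score"`
	MatchedPattern string  `json:"matched_pattern,omitempty"`
	FailureType    string  `json:"failure_type"`
	Confidence     float64 `json:"confidence"`
	LatencyMS      float64 `json:"latency_ms"`
}

// IngestResponse matches the documented {"ingested": ..., "results": [...]} shape.
type IngestResponse struct {
	Ingested int               `json:"ingested"`
	Results  []DetectionResult `json:"results"`
}

// ParseIngestResponse decodes an ingest response body.
func ParseIngestResponse(body []byte) (IngestResponse, error) {
	var r IngestResponse
	err := json.Unmarshal(body, &r)
	return r, err
}

// Anomalies returns only the results that were flagged.
func Anomalies(r IngestResponse) []DetectionResult {
	var out []DetectionResult
	for _, d := range r.Results {
		if d.IsAnomaly {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	body := []byte(`{"ingested":1,"results":[{"step":{"trace_id":"abc123","span_id":"span-1"},
		"is_anomaly":true,"anomaly_score":0.72,"failure_type":"hallucination",
		"confidence":0.85,"latency_ms":1.2}]}`)
	resp, err := ParseIngestResponse(body)
	if err != nil {
		panic(err)
	}
	for _, d := range Anomalies(resp) {
		fmt.Printf("%s/%s: %s (score %.2f)\n", d.Step.TraceID, d.Step.SpanID, d.FailureType, d.AnomalyScore)
	}
}
```

Unknown JSON fields are ignored by `encoding/json`, so the sketch keeps working if the server returns more fields than are listed here.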
Retrieve recent alerts (up to last 1000).
Query parameters:
- failure_type: optional filter; one of hallucination, deadlock, reasoning_loop, tool_error, coordination_failure, context_overflow, timeout, unknown
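A caller may want to validate the filter before sending the request. The sketch below builds the alerts URL with the standard `net/url` package; the helper name `AlertsURL` and the client-side validation are assumptions for illustration, and the server may well perform its own checks.

```go
package main

import (
	"fmt"
	"net/url"
)

// knownFailureTypes lists the filter values documented for the alerts endpoint.
var knownFailureTypes = map[string]bool{
	"hallucination": true, "deadlock": true, "reasoning_loop": true,
	"tool_error": true, "coordination_failure": true,
	"context_overflow": true, "timeout": true, "unknown": true,
}

// AlertsURL (hypothetical helper) builds the alerts query URL.
// An empty failureType means "no filter".
func AlertsURL(base, failureType string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/v1/alerts"
	if failureType != "" {
		if !knownFailureTypes[failureType] {
			return "", fmt.Errorf("unknown failure_type %q", failureType)
		}
		q := url.Values{}
		q.Set("failure_type", failureType)
		u.RawQuery = q.Encode()
	}
	return u.String(), nil
}

func main() {
	u, err := AlertsURL("http://localhost:8080", "reasoning_loop")
	if err != nil {
		panic(err)
	}
	fmt.Println(u)
}
```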
Response:
{"alerts": [<Alert>, ...], "count": <int>}
Perform root-cause analysis on an alert.
Request body:
{
"alert": <Alert>,
"context": [<Step>, ...] // optional; falls back to alert.context
}
Response: RootCauseReport
{
"trace_id": "string",
"span_id": "string",
"failure_type": "string",
"summary": "string",
"contributing_spans": ["span-id", ...],
"evidence": ["anomaly score: 0.720 ...", ...],
"failure_chain": ["step 1 description", ...]
}
Get a mitigation plan for a failure.
Request body:
{"failure_type": "string", "trace_id": "string", "score": 0.72}
Response: Plan
{
"failure_type": "string",
"trace_id": "string",
"suggestions": [{"action": "string", "description": "string", "priority": 1, "automated": true}],
"summary": "string"
}
Health check.
Response: {"status": "ok", "time": "RFC3339"}
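As a client-side illustration of the mitigation endpoint documented above, the following sketch posts a request to `/v1/mitigate` and picks out the suggestions marked as automated. The Go type and function names are assumptions that mirror the documented JSON, and `stubServer` is a stand-in for a running TraceGuard instance so the example runs on its own; its manual suggestion is invented for the demo.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// MitigateRequest mirrors the documented request body.
type MitigateRequest struct {
	FailureType string  `json:"failure_type"`
	TraceID     string  `json:"trace_id"`
	Score       float64 `json:"score"`
}

// Suggestion and Plan mirror the documented Plan response.
type Suggestion struct {
	Action      string `json:"action"`
	Description string `json:"description"`
	Priority    int    `json:"priority"`
	Automated   bool   `json:"automated"`
}

type Plan struct {
	FailureType string       `json:"failure_type"`
	TraceID     string       `json:"trace_id"`
	Suggestions []Suggestion `json:"suggestions"`
	Summary     string       `json:"summary"`
}

// RequestPlan posts a mitigation request and decodes the returned Plan.
func RequestPlan(client *http.Client, baseURL string, req MitigateRequest) (Plan, error) {
	var plan Plan
	body, err := json.Marshal(req)
	if err != nil {
		return plan, err
	}
	resp, err := client.Post(baseURL+"/v1/mitigate", "application/json", bytes.NewReader(body))
	if err != nil {
		return plan, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return plan, fmt.Errorf("mitigate: unexpected status %s", resp.Status)
	}
	err = json.NewDecoder(resp.Body).Decode(&plan)
	return plan, err
}

// AutomatedActions returns the actions an orchestrator could apply unattended.
func AutomatedActions(p Plan) []string {
	var out []string
	for _, s := range p.Suggestions {
		if s.Automated {
			out = append(out, s.Action)
		}
	}
	return out
}

// stubServer stands in for TraceGuard so the sketch is runnable offline.
func stubServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var req MitigateRequest
		if r.URL.Path != "/v1/mitigate" || json.NewDecoder(r.Body).Decode(&req) != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(Plan{
			FailureType: req.FailureType,
			TraceID:     req.TraceID,
			Suggestions: []Suggestion{
				{Action: "enable_grounding", Priority: 1, Automated: true},
				// Hypothetical manual action, not taken from the README.
				{Action: "manual_review", Priority: 2, Automated: false},
			},
		})
	}))
}

func main() {
	srv := stubServer()
	defer srv.Close()
	plan, err := RequestPlan(srv.Client(), srv.URL,
		MitigateRequest{FailureType: "hallucination", TraceID: "abc123", Score: 0.72})
	if err != nil {
		panic(err)
	}
	fmt.Println(AutomatedActions(plan))
}
```

Against a real deployment, replace `srv.URL` with the sidecar's address (for example `http://localhost:8080`) and pass `http.DefaultClient` or a client with a timeout.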
internal/
  api/
    handlers.go: HTTP handlers wiring all subsystems; registers routes for ingest, alerts, diagnose, mitigate, healthz
  config/
    config.go: Config structs and JSON load/save; defaults: HTTP :8080, gRPC :9090, anomaly threshold 0.35, window 10
  detector/
    patterns.go: FailureType constants, Pattern struct, builtin keyword-based seed patterns, KeywordMatch helper
    anomaly.go: AnomalyScorer: cosine similarity scoring of embeddings against pattern library
    stream.go: StreamDetector: per-step encoding + hybrid scoring + sliding context window; targets <50ms
  diagnosis/
    failure_library.go: Library: in-memory pattern store with hit-count-based eviction and JSON persistence
    root_cause.go: Analyzer: evidence collection, failure chain construction, contributing span identification
  encoder/
    embedding.go: Vector type, Vocabulary, TFIDFEmbedder (sparse TF-IDF), SimpleEmbedder
    contrastive.go: ContrastiveEncoder: NT-Xent linear projection; EmbedStep pipeline (TF-IDF → projection)
  mitigation/
    suggester.go: Suggester: maps each FailureType to a prioritised list of Suggestions (automated + manual)
  trace/
    model.go: Core types: Span, Trace, Step, Event, DetectionResult, SpanKind, Status constants
    parser.go: ParseOTel (OTLP/JSON), ParseSpans (native JSON array), SpansToSteps conversion
    collector.go: Collector: accumulates spans by trace ID
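To make the cosine-similarity scoring described for anomaly.go concrete, here is a small sketch of scoring one step embedding against a pattern library. It is not the actual `AnomalyScorer`: the `Pattern` fields, the `Score` signature, and the rule "score is the highest similarity to any failure pattern, flagged at or above the threshold" are all assumptions, since the layout above only states that cosine similarity and a default threshold of 0.35 are used.

```go
package main

import (
	"fmt"
	"math"
)

// Pattern is a hypothetical library entry: an ID, a failure type, and an embedding.
type Pattern struct {
	ID          string
	FailureType string
	Vector      []float64
}

// Cosine returns the cosine similarity of a and b.
// It returns 0 when the lengths differ or either vector is all zeros.
func Cosine(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// Score compares a step embedding with every pattern and keeps the best match.
// Assumption: higher similarity to a known failure pattern means more anomalous.
// Negative similarities are ignored, so the score stays within [0, 1].
func Score(step []float64, lib []Pattern, threshold float64) (score float64, matched string, isAnomaly bool) {
	for _, p := range lib {
		if s := Cosine(step, p.Vector); s > score {
			score, matched = s, p.ID
		}
	}
	score = math.Min(score, 1) // guard against floating-point overshoot
	return score, matched, score >= threshold
}

func main() {
	lib := []Pattern{
		{ID: "loop-1", FailureType: "reasoning_loop", Vector: []float64{1, 0, 0}},
		{ID: "tool-1", FailureType: "tool_error", Vector: []float64{0, 1, 0}},
	}
	score, matched, flagged := Score([]float64{0.9, 0.1, 0.2}, lib, 0.35)
	fmt.Printf("score=%.3f matched=%s anomaly=%v\n", score, matched, flagged)
}
```

The stream detector is described as hybrid, so a real scorer would presumably combine a value like this with the keyword patterns and the sliding context window; that combination is not shown here.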
- EAGER: Embedding-Augmented GEneral Reasoning. Contrastive learning objective for reasoning-scoped embeddings; the NT-Xent contrastive loss applied to agent trace steps is the core technique used by TraceGuard's ContrastiveEncoder.
- ollama/ollama — Local LLM serving patterns; informed the sidecar deployment model
- langgenius/dify — Multi-agent orchestration; reference for trace schema design and integration points
- langchain-ai/langchain — Agent step patterns and tool-call conventions that TraceGuard's span kinds model
- open-webui/open-webui — UI/UX reference for agent observability dashboards
- huggingface/transformers — Embedding model reference for the TF-IDF + projection pipeline
- f/prompts.chat — Prompt engineering patterns referenced in mitigation suggestions
- Shubhamsaboo/awesome-llm-apps — Survey of production LLM app patterns that informed the failure taxonomy
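For readers unfamiliar with the NT-Xent loss named in the EAGER entry above, the sketch below computes the standard per-pair form: for an anchor i with positive j, the loss is -log(exp(sim(z_i, z_j)/τ) / Σ_{k≠i} exp(sim(z_i, z_k)/τ)), where sim is cosine similarity and τ is a temperature. This is the textbook definition, not code from TraceGuard's `ContrastiveEncoder`; the function names and the toy embeddings are assumptions.

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two equal-length, non-zero vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// NTXent returns the normalized temperature-scaled cross-entropy loss for
// anchor z[i] with positive z[j]; every other row of z acts as a negative.
func NTXent(z [][]float64, i, j int, tau float64) float64 {
	var denom float64
	for k := range z {
		if k != i { // the anchor is excluded from its own denominator
			denom += math.Exp(cosine(z[i], z[k]) / tau)
		}
	}
	return -math.Log(math.Exp(cosine(z[i], z[j])/tau) / denom)
}

func main() {
	// Toy batch: rows 0 and 1 are two views of the same step, row 2 is unrelated.
	z := [][]float64{{1, 0}, {1, 0}, {0, 1}}
	fmt.Printf("loss = %.4f\n", NTXent(z, 0, 1, 1.0))
}
```

The loss shrinks as the positive pair becomes more similar relative to the negatives, which is what pulls embeddings of related trace steps together during training.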
The buyer is platform engineering teams at companies running multi-agent LLM systems in production (fintech, customer support automation, coding assistants) — the same teams already paying for Datadog and PagerDuty but getting zero visibility into agent-specific failures. Revenue model is open-core: the single-binary daemon with local detection is free (drives adoption), while a SaaS tier ($500–2000/mo per org) adds a hosted failure pattern library trained across anonymized customer traces, team dashboards, and compliance-grade audit trails. The moat is the proprietary failure embedding model that improves with every customer's traces — a classic data flywheel that new entrants cannot replicate without comparable deployment scale.
MIT