LangGraph-powered agent that analyzes Prometheus metrics, detects anomalies, and generates structured incident summaries.
An AI agent built with LangGraph that connects to a Prometheus instance, queries metrics using PromQL, applies anomaly detection, and produces structured incident analysis reports. Designed to complement container-observability-stack.
+------------------+ +-----------------+ +------------------+
| User Query |----->| LangGraph |----->| Prometheus |
| "analyze memory | | Agent Graph | | (PromQL API) |
| last 30min" | | |<-----| |
+------------------+ | +----------+ | +------------------+
| | analyze | |
| | detect | | +------------------+
| | summarize| |----->| LLM (OpenAI) |
| +----------+ |<-----| Structured |
+-----------------+ | Output |
+------------------+
- PromQL Query Execution: Fetches metrics (memory, error rate, latency) from the Prometheus HTTP API
- Anomaly Detection: Statistical analysis with z-score and threshold-based detection
- Structured Incident Summary: LLM generates severity, root cause hypothesis, and recommended actions
- Multi-Step Agent Graph: LangGraph state machine with conditional routing based on findings
- CLI Interface: Run from terminal with natural language queries
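The PromQL fetch step boils down to a range query against Prometheus's HTTP API. A minimal sketch (the endpoint path and parameters follow the Prometheus HTTP API; the helper names and `PROM_BASE` constant are illustrative, not the actual `src/promql.py` interface):

```python
import json
import urllib.request
from urllib.parse import urlencode

PROM_BASE = "http://localhost:9090"  # assumed Prometheus address

def build_range_url(expr: str, start: float, end: float, step: str = "30s") -> str:
    """Build a /api/v1/query_range URL for a PromQL expression."""
    params = urlencode({"query": expr, "start": start, "end": end, "step": step})
    return f"{PROM_BASE}/api/v1/query_range?{params}"

def query_range(expr: str, start: float, end: float, step: str = "30s") -> list:
    """Execute the range query and return the list of matching time series."""
    with urllib.request.urlopen(build_range_url(expr, start, end, step)) as resp:
        payload = json.load(resp)
    if payload["status"] != "success":
        raise RuntimeError(payload.get("error", "Prometheus query failed"))
    return payload["data"]["result"]
```

Each entry in the returned list carries a `metric` label set and a `values` array of `[timestamp, value]` pairs, which is what the detector consumes.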
# Prerequisites: Prometheus running at localhost:9090
# (from container-observability-stack: docker compose up -d)
# Install dependencies
pip install -r requirements.txt
# Set API key
export OPENAI_API_KEY="your-key"
# Run analysis
python -m src.main "analyze memory usage for go-api in the last 30 minutes"
python -m src.main "check error rate and latency"
python -m src.main "full incident analysis"

=== Incident Analysis ===
Severity: P2 (Warning)
Metric: container_memory_usage_bytes{name="go-api"}
Findings:
- Memory increased from 32MB to 148MB over 25 minutes
- Growth rate: 4.6MB/min (linear, consistent with leak pattern)
- Z-score: 3.2 (anomalous, >2.0 threshold)
Root Cause Hypothesis:
Unbounded memory accumulation in application heap.
Pattern consistent with append-only data structure without eviction.
Recommended Actions:
1. Capture pprof heap profile: curl localhost:8080/debug/pprof/heap > heap.prof
2. Identify top allocator: go tool pprof -top heap.prof
3. Clear leak store: curl -X POST localhost:8080/reset
4. Investigate leakHandler in main.go for unbounded slice growth
observability-agent/
├── src/
│ ├── __init__.py
│ ├── main.py # CLI entry point
│ ├── graph.py # LangGraph agent definition
│ ├── nodes.py # Agent node functions (fetch, analyze, summarize)
│ ├── promql.py # Prometheus HTTP API client
│ ├── detector.py # Anomaly detection (z-score, threshold)
│ └── models.py # Pydantic models for structured output
├── tests/
│ ├── __init__.py
│ ├── test_promql.py
│ ├── test_detector.py
│ └── test_graph.py
├── data/
│ └── sample_metrics.json # Sample Prometheus response for testing
├── requirements.txt
├── .gitignore
└── README.md
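The structured output defined in `src/models.py` might be modeled along these lines (field names inferred from the sample report above; this is a sketch, not the project's exact schema):

```python
from typing import List, Literal

from pydantic import BaseModel

class IncidentSummary(BaseModel):
    """Schema the LLM must fill when reporting an analysis result."""
    severity: Literal["P1", "P2", "P3", "healthy"]
    metric: str                     # PromQL selector that was analyzed
    findings: List[str]             # bullet-point observations
    root_cause_hypothesis: str
    recommended_actions: List[str]  # ordered remediation steps
```

Constraining `severity` to a `Literal` lets Pydantic reject any free-form severity string the LLM might invent.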
- Agent Framework: LangGraph (stateful multi-step agent)
- LLM: OpenAI GPT-4 (structured output with Pydantic)
- Metrics: Prometheus HTTP API (PromQL)
- Anomaly Detection: NumPy / SciPy (z-score, rolling statistics)
- Language: Python 3.11+
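The z-score check at the heart of the detector reduces to a few lines of NumPy (the function name is illustrative; the 2.0 default matches the threshold in the sample report):

```python
import numpy as np

def zscore_anomalies(values, threshold: float = 2.0):
    """Return a boolean mask marking points whose |z-score| exceeds threshold."""
    arr = np.asarray(values, dtype=float)
    std = arr.std()
    if std == 0:  # flat series: nothing is anomalous
        return np.zeros(arr.shape, dtype=bool)
    return np.abs(arr - arr.mean()) / std > threshold
```

A point flagged here (e.g. the 3.2 z-score in the sample report) is what routes the agent graph into the deeper-analysis branch.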
- Parse — User query is interpreted to determine which metrics to analyze
- Fetch — PromQL queries are executed against the Prometheus HTTP API
- Detect — Statistical anomaly detection on the returned time series
- Summarize — LLM generates structured incident summary with severity, hypothesis, and actions
- Route — If anomalies found, agent deepens analysis; otherwise reports healthy status
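Stripped of the LangGraph wiring, the steps above amount to a conditional pipeline. A framework-free sketch of the control flow (node functions, state keys, and the stand-in detection check are illustrative, not the actual `src/nodes.py`):

```python
from statistics import mean, pstdev
from typing import Dict

State = Dict[str, object]  # shared state passed between nodes

def fetch(state: State) -> State:
    # Placeholder: the real node would run PromQL queries here.
    state.setdefault("series", [])
    return state

def detect(state: State) -> State:
    vals = state["series"]
    mu, sigma = mean(vals), pstdev(vals)
    # Stand-in z-score check mirroring the 2.0 threshold.
    state["anomalous"] = sigma > 0 and max(abs(v - mu) for v in vals) / sigma > 2.0
    return state

def deep_analysis(state: State) -> State:
    state["report"] = "incident summary (LLM call would go here)"
    return state

def healthy(state: State) -> State:
    state["report"] = "all metrics within normal range"
    return state

def run(state: State) -> State:
    """fetch -> detect -> route on findings, mirroring the agent graph."""
    state = detect(fetch(state))
    return deep_analysis(state) if state["anomalous"] else healthy(state)
```

In the real agent this routing is expressed as a conditional edge in the LangGraph state machine rather than a plain `if`.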
Sumin Kim — Applied Statistics, Yonsei University