sumin-world/observability-agent
Observability Agent

LangGraph-powered agent that analyzes Prometheus metrics, detects anomalies, and generates structured incident summaries.

What This Is

An AI agent built with LangGraph that connects to a Prometheus instance, queries metrics using PromQL, applies anomaly detection, and produces structured incident analysis reports. Designed to complement container-observability-stack.

Architecture

+------------------+      +-----------------+      +------------------+
|   User Query     |----->|   LangGraph     |----->|   Prometheus     |
|  "analyze memory |      |   Agent Graph   |      |   (PromQL API)   |
|   last 30min"    |      |                 |<-----|                  |
+------------------+      |  +----------+   |      +------------------+
                          |  | analyze  |   |
                          |  | detect   |   |      +------------------+
                          |  | summarize|   |----->|   LLM (OpenAI)   |
                          |  +----------+   |<-----|   Structured     |
                          +-----------------+      |   Output         |
                                                   +------------------+
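The Prometheus leg of the diagram is a plain HTTP round trip. As a minimal sketch, the range query that `src/promql.py` issues could be built like this (the endpoint and parameters are the standard Prometheus HTTP API; the `query_range_url` helper and the hard-coded address are illustrative, matching the Quickstart's localhost:9090):

```python
from urllib.parse import urlencode

PROM_URL = "http://localhost:9090"  # Prometheus address assumed in the Quickstart

def query_range_url(query: str, start: float, end: float, step: str = "30s") -> str:
    """Build a Prometheus /api/v1/query_range URL for a PromQL expression."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{PROM_URL}/api/v1/query_range?{params}"

# Example: memory series for the go-api container over a 30-minute window
url = query_range_url(
    'container_memory_usage_bytes{name="go-api"}',
    start=1700000000, end=1700001800,
)
```

The JSON response's `data.result[*].values` field then holds the `[timestamp, value]` pairs that the detection step consumes.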

Features

  • PromQL Query Execution: Fetches metrics (memory, error rate, latency) from Prometheus HTTP API
  • Anomaly Detection: Statistical analysis with z-score and threshold-based detection
  • Structured Incident Summary: LLM generates severity, root cause hypothesis, and recommended actions
  • Multi-Step Agent Graph: LangGraph state machine with conditional routing based on findings
  • CLI Interface: Run from terminal with natural language queries
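The z-score check from the second bullet can be sketched with the standard library alone (the 2.0 threshold matches the example output below; the function name and exact windowing here are illustrative, not the repo's `src/detector.py` API):

```python
from statistics import mean, stdev

def zscore_anomalies(series: list[float], threshold: float = 2.0) -> list[int]:
    """Return indices of points whose z-score against the whole series exceeds the threshold."""
    if len(series) < 2:
        return []  # not enough data to estimate spread
    mu, sigma = mean(series), stdev(series)
    if sigma == 0:
        return []  # flat series: nothing can be anomalous
    return [i for i, v in enumerate(series) if abs(v - mu) / sigma > threshold]

# A stable memory series with one spike: only the spike is flagged
print(zscore_anomalies([32.0, 33.0, 31.0, 32.0, 33.0, 31.0, 32.0, 148.0]))  # → [7]
```

Note that with very short series a single outlier can never exceed a z-score of (n-1)/√n, so the 2.0 threshold needs at least six samples to fire at all.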

Quickstart

# Prerequisites: Prometheus running at localhost:9090
# (from container-observability-stack: docker compose up -d)

# Install dependencies
pip install -r requirements.txt

# Set API key
export OPENAI_API_KEY="your-key"

# Run analysis
python -m src.main "analyze memory usage for go-api in the last 30 minutes"
python -m src.main "check error rate and latency"
python -m src.main "full incident analysis"

Example Output

=== Incident Analysis ===
Severity: P2 (Warning)
Metric: container_memory_usage_bytes{name="go-api"}

Findings:
  - Memory increased from 32MB to 148MB over 25 minutes
  - Growth rate: 4.6MB/min (linear, consistent with leak pattern)
  - Z-score: 3.2 (anomalous, >2.0 threshold)

Root Cause Hypothesis:
  Unbounded memory accumulation in application heap.
  Pattern consistent with append-only data structure without eviction.

Recommended Actions:
  1. Capture pprof heap profile: curl localhost:8080/debug/pprof/heap > heap.prof
  2. Identify top allocator: go tool pprof -top heap.prof
  3. Clear leak store: curl -X POST localhost:8080/reset
  4. Investigate leakHandler in main.go for unbounded slice growth

Project Structure

observability-agent/
├── src/
│   ├── __init__.py
│   ├── main.py              # CLI entry point
│   ├── graph.py             # LangGraph agent definition
│   ├── nodes.py             # Agent node functions (fetch, analyze, summarize)
│   ├── promql.py            # Prometheus HTTP API client
│   ├── detector.py          # Anomaly detection (z-score, threshold)
│   └── models.py            # Pydantic models for structured output
├── tests/
│   ├── __init__.py
│   ├── test_promql.py
│   ├── test_detector.py
│   └── test_graph.py
├── data/
│   └── sample_metrics.json  # Sample Prometheus response for testing
├── requirements.txt
├── .gitignore
└── README.md

Tech Stack

  • Agent Framework: LangGraph (stateful multi-step agent)
  • LLM: OpenAI GPT-4 (structured output with Pydantic)
  • Metrics: Prometheus HTTP API (PromQL)
  • Anomaly Detection: NumPy / SciPy (z-score, rolling statistics)
  • Language: Python 3.11+

How It Works

  1. Parse — User query is interpreted to determine which metrics to analyze
  2. Fetch — PromQL queries are executed against Prometheus HTTP API
  3. Detect — Statistical anomaly detection is applied to the returned time series
  4. Summarize — LLM generates structured incident summary with severity, hypothesis, and actions
  5. Route — If anomalies found, agent deepens analysis; otherwise reports healthy status
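Steps 3-5 can be illustrated without the LangGraph dependency as plain node functions joined by a conditional edge (the node names echo `src/nodes.py`, but the wiring and placeholder threshold check here are a simplification, not the repo's actual graph):

```python
def detect(state: dict) -> dict:
    # Placeholder check; the real logic lives in src/detector.py
    state["anomalies"] = [v for v in state["series"] if v > state["threshold"]]
    return state

def route(state: dict) -> str:
    """Conditional edge (step 5): anomalies trigger the summarize branch."""
    return "summarize" if state["anomalies"] else "healthy"

def summarize(state: dict) -> dict:
    state["report"] = f"{len(state['anomalies'])} anomalous point(s) found"
    return state

def healthy(state: dict) -> dict:
    state["report"] = "all metrics within normal range"
    return state

NODES = {"summarize": summarize, "healthy": healthy}

state = {"series": [32.0, 35.0, 148.0], "threshold": 100.0}
state = detect(state)
state = NODES[route(state)](state)
print(state["report"])  # → 1 anomalous point(s) found
```

In the actual agent, LangGraph manages this state passing and branching declaratively; the sketch only shows the control flow.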

Author

Sumin Kim — Applied Statistics, Yonsei University
