sumin-world/observability-agent
Observability Agent

LangGraph-powered agent that analyzes Prometheus metrics, detects anomalies, and generates structured incident summaries.

What This Is

An AI agent built with LangGraph that connects to a Prometheus instance, queries metrics using PromQL, applies anomaly detection, and produces structured incident analysis reports. Designed to complement container-observability-stack.

Architecture

+------------------+      +-----------------+      +------------------+
|   User Query     |----->|   LangGraph     |----->|   Prometheus     |
|  "analyze memory |      |   Agent Graph   |      |   (PromQL API)   |
|   last 30min"    |      |                 |<-----|                  |
+------------------+      |  +----------+   |      +------------------+
                          |  | analyze  |   |
                          |  | detect   |   |      +------------------+
                          |  | summarize|   |----->|   LLM (OpenAI)   |
                          |  +----------+   |<-----|   Structured     |
                          +-----------------+      |   Output         |
                                                   +------------------+
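The Prometheus leg of the diagram is a plain HTTP round trip. As a minimal sketch, the range query that `src/promql.py` issues could be built like this (the endpoint and parameters are the standard Prometheus HTTP API; the `query_range_url` helper and the hard-coded address are illustrative, matching the Quickstart's localhost:9090):

```python
from urllib.parse import urlencode

PROM_URL = "http://localhost:9090"  # Prometheus address assumed in the Quickstart

def query_range_url(query: str, start: float, end: float, step: str = "30s") -> str:
    """Build a Prometheus /api/v1/query_range URL for a PromQL expression."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{PROM_URL}/api/v1/query_range?{params}"

# Example: memory series for the go-api container over a 30-minute window
url = query_range_url(
    'container_memory_usage_bytes{name="go-api"}',
    start=1700000000, end=1700001800,
)
```

The JSON response's `data.result[*].values` field then holds the `[timestamp, value]` pairs that the detection step consumes.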

Features

  • PromQL Query Execution: Fetches metrics (memory, error rate, latency) from Prometheus HTTP API
  • Anomaly Detection: Statistical analysis with z-score and threshold-based detection
  • Structured Incident Summary: LLM generates severity, root cause hypothesis, and recommended actions
  • Multi-Step Agent Graph: LangGraph state machine with conditional routing based on findings
  • CLI Interface: Run from terminal with natural language queries
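The z-score check from the second bullet can be sketched with the standard library alone (the 2.0 threshold matches the example output below; the function name and exact windowing here are illustrative, not the repo's `src/detector.py` API):

```python
from statistics import mean, stdev

def zscore_anomalies(series: list[float], threshold: float = 2.0) -> list[int]:
    """Return indices of points whose z-score against the whole series exceeds the threshold."""
    if len(series) < 2:
        return []  # not enough data to estimate spread
    mu, sigma = mean(series), stdev(series)
    if sigma == 0:
        return []  # flat series: nothing can be anomalous
    return [i for i, v in enumerate(series) if abs(v - mu) / sigma > threshold]

# A stable memory series with one spike: only the spike is flagged
print(zscore_anomalies([32.0, 33.0, 31.0, 32.0, 33.0, 31.0, 32.0, 148.0]))  # → [7]
```

Note that with very short series a single outlier can never exceed a z-score of (n-1)/√n, so the 2.0 threshold needs at least six samples to fire at all.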

Quickstart

# Prerequisites: Prometheus running at localhost:9090
# (from container-observability-stack: docker compose up -d)

# Install dependencies
pip install -r requirements.txt

# Set API key
export OPENAI_API_KEY="your-key"

# Run analysis
python -m src.main "analyze memory usage for go-api in the last 30 minutes"
python -m src.main "check error rate and latency"
python -m src.main "full incident analysis"

Example Output

=== Incident Analysis ===
Severity: P2 (Warning)
Metric: container_memory_usage_bytes{name="go-api"}

Findings:
  - Memory increased from 32MB to 148MB over 25 minutes
  - Growth rate: 4.6MB/min (linear, consistent with leak pattern)
  - Z-score: 3.2 (anomalous, >2.0 threshold)

Root Cause Hypothesis:
  Unbounded memory accumulation in application heap.
  Pattern consistent with append-only data structure without eviction.

Recommended Actions:
  1. Capture pprof heap profile: curl localhost:8080/debug/pprof/heap > heap.prof
  2. Identify top allocator: go tool pprof -top heap.prof
  3. Clear leak store: curl -X POST localhost:8080/reset
  4. Investigate leakHandler in main.go for unbounded slice growth

Project Structure

observability-agent/
├── src/
│   ├── __init__.py
│   ├── main.py              # CLI entry point
│   ├── graph.py             # LangGraph agent definition
│   ├── nodes.py             # Agent node functions (fetch, analyze, summarize)
│   ├── promql.py            # Prometheus HTTP API client
│   ├── detector.py          # Anomaly detection (z-score, threshold)
│   └── models.py            # Pydantic models for structured output
├── tests/
│   ├── __init__.py
│   ├── test_promql.py
│   ├── test_detector.py
│   └── test_graph.py
├── data/
│   └── sample_metrics.json  # Sample Prometheus response for testing
├── requirements.txt
├── .gitignore
└── README.md

Tech Stack

  • Agent Framework: LangGraph (stateful multi-step agent)
  • LLM: OpenAI GPT-4 (structured output with Pydantic)
  • Metrics: Prometheus HTTP API (PromQL)
  • Anomaly Detection: NumPy / SciPy (z-score, rolling statistics)
  • Language: Python 3.11+

How It Works

  1. Parse — User query is interpreted to determine which metrics to analyze
  2. Fetch — PromQL queries are executed against Prometheus HTTP API
  3. Detect — Statistical anomaly detection is applied to the returned time series
  4. Summarize — LLM generates structured incident summary with severity, hypothesis, and actions
  5. Route — If anomalies found, agent deepens analysis; otherwise reports healthy status
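Steps 3-5 can be illustrated without the LangGraph dependency as plain node functions joined by a conditional edge (the node names echo `src/nodes.py`, but the wiring and placeholder threshold check here are a simplification, not the repo's actual graph):

```python
def detect(state: dict) -> dict:
    # Placeholder check; the real logic lives in src/detector.py
    state["anomalies"] = [v for v in state["series"] if v > state["threshold"]]
    return state

def route(state: dict) -> str:
    """Conditional edge (step 5): anomalies trigger the summarize branch."""
    return "summarize" if state["anomalies"] else "healthy"

def summarize(state: dict) -> dict:
    state["report"] = f"{len(state['anomalies'])} anomalous point(s) found"
    return state

def healthy(state: dict) -> dict:
    state["report"] = "all metrics within normal range"
    return state

NODES = {"summarize": summarize, "healthy": healthy}

state = {"series": [32.0, 35.0, 148.0], "threshold": 100.0}
state = detect(state)
state = NODES[route(state)](state)
print(state["report"])  # → 1 anomalous point(s) found
```

In the actual agent, LangGraph manages this state passing and branching declaratively; the sketch only shows the control flow.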

Author

Sumin Kim — Applied Statistics, Yonsei University
