---

## Summary

You now have a **production-grade monitoring system** for RAG applications:

✅ **Metrics collection** with Prometheus client library  
✅ **Structured logging** for cloud environments  
✅ **Automatic instrumentation** via decorators  
✅ **Visualization** with Grafana dashboards  
✅ **Alert templates** for critical issues  

**Next steps:**
1. Run `docker compose -f docker-compose.monitoring.yml up -d`
2. Start your RAG service with metrics enabled
3. Import `grafana_dash.json` into Grafana
4. Set up 3-5 critical alerts
5. Monitor for 1 week and adjust thresholds

**Remember:**
- Start simple (3-5 metrics, 3-5 alerts)
- Tune thresholds based on real traffic
- Consider managed alternatives if infrastructure burden is too high
- Monitor your monitoring (storage, query performance)

**Total setup time:** 8-12 hours initial + 2-4 hours/month maintenance  
**Value:** Real-time visibility into performance, costs, and quality 🚀

### Common Failure Modes to Watch For

**1. Metric Cardinality Explosion**
- **Problem:** Using user IDs or query text as labels creates a separate time series for every distinct value, which can quickly reach millions of series
- **Solution:** Use bounded labels (endpoint, model, error_type) and aggregate high-cardinality data in logs

**2. Dashboard Query Timeouts**
- **Problem:** Complex PromQL queries on large time ranges
- **Solution:** Use recording rules to pre-compute expensive aggregations

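**3. Memory Growth from Unbounded Labels**
- **Problem:** Labels with unbounded values (timestamps, UUIDs) force both the client library and Prometheus to hold an ever-growing set of series in memory
- **Solution:** Keep the number of label-value combinations per metric bounded (under ~1,000 is a reasonable target)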

**4. Alert Fatigue**
- **Problem:** Too many alerts or poorly calibrated thresholds
- **Solution:** Start with 3-5 critical alerts, tune thresholds over 1 week of real traffic

**5. Metric Gaps After Deployment**
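- **Problem:** Restarting the instrumented service resets its in-memory counters and gauges, which shows up as gaps or sudden drops on dashboards
- **Solution:** Query counters through `rate()` or `increase()`, which tolerate resets, and use Prometheus federation or remote storage when you need longer-term persistence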
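
Failure modes 1 and 3 both come down to keeping label values bounded. Here is a minimal sketch of one way to do that before a value ever reaches a metric. The allow-list and the helper names (`bounded_label`, `error_label`) are hypothetical and not part of `m2_3_monitoring.py`.

```python
from typing import Iterable

# Hypothetical allow-list; substitute the error types your service actually raises.
ALLOWED_ERROR_TYPES = frozenset({"RateLimitError", "TimeoutError", "ValueError"})


def bounded_label(value: str, allowed: Iterable[str], fallback: str = "other") -> str:
    """Return value if it is in the allow-list, else a single fallback bucket."""
    return value if value in allowed else fallback


def error_label(exc: BaseException) -> str:
    """Map an exception to a low-cardinality label using its class name only."""
    return bounded_label(type(exc).__name__, ALLOWED_ERROR_TYPES)


print(error_label(ValueError("user 8f3c sent a bad query")))  # ValueError
print(error_label(KeyError("sess_12345")))                    # other
```

The exception message (which may contain user IDs or session IDs) never becomes a label; it belongs in the structured logs instead.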

### Example Alert Rules

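Here are 5 critical alerts to set up. They are written as Prometheus alerting rules (Alertmanager handles routing and notification); the same expressions can also be used in Grafana Alerts: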

```yaml
groups:
  - name: rag-alerts  # placeholder group name
    rules:
      # Alert 1: High Latency (aggregate buckets by le before taking the quantile)
      - alert: HighQueryLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(rag_query_latency_seconds_bucket[5m]))) > 2
        for: 5m
        annotations:
          summary: "p95 latency exceeds 2 seconds"

      # Alert 2: High Error Rate (sum both sides so the label sets match)
      - alert: HighErrorRate
        expr: sum(rate(rag_requests_total{status="error"}[5m])) / sum(rate(rag_requests_total[5m])) > 0.05
        for: 2m
        annotations:
          summary: "Error rate above 5%"

      # Alert 3: Low Cache Hit Rate
      - alert: LowCacheHitRate
        expr: rag_cache_hit_rate < 0.30
        for: 10m
        annotations:
          summary: "Cache hit rate below 30%"

      # Alert 4: High Cost Burn (increase() gives spend over the hour; x24 projects a day)
      - alert: HighCostBurn
        expr: sum(increase(rag_total_cost_usd[1h])) * 24 > 100
        for: 15m
        annotations:
          summary: "Daily spend projected to exceed $100"

      # Alert 5: Rate Limit Warning
      - alert: RateLimitWarning
        expr: rag_rate_limit_remaining < 100
        for: 1m
        annotations:
          summary: "Less than 100 API calls remaining"
```
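
To sanity-check the cost-burn threshold, it helps to work through the projection by hand. The sketch below extrapolates the growth of a cumulative cost counter over a window to a full day. The sample values are hypothetical, and `projected_daily_spend` is an illustrative helper, not something the monitoring module provides.

```python
def projected_daily_spend(cost_start_usd: float, cost_end_usd: float,
                          window_seconds: float) -> float:
    """Extrapolate a cumulative cost counter's growth over a window to 24 hours."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    # Per-second growth, which is what PromQL's rate() reports for a counter
    per_second = (cost_end_usd - cost_start_usd) / window_seconds
    return per_second * 86_400  # seconds in a day


# Hypothetical samples: the counter grew from $12.00 to $17.00 in one hour.
spend = projected_daily_spend(12.00, 17.00, window_seconds=3600)
print(f"Projected daily spend: ${spend:.2f}")  # Projected daily spend: $120.00
print("Alert would fire" if spend > 100 else "Within budget")
```

Because the projection extrapolates from a single hour, a short traffic burst can push it over the threshold; the `for: 15m` clause is what keeps brief spikes from paging anyone.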

### Importing the Dashboard

**Steps in Grafana:**

1. Navigate to http://localhost:3000
2. Log in (admin/admin)
3. Click **+** → **Import dashboard**
4. Upload `grafana_dash.json`
5. Select Prometheus as the data source
6. Click **Import**

**Result:** You'll see all 10 panels populated with real-time data from your metrics endpoint!

In [None]:
# Dashboard panels we've configured
dashboard_panels = {
    'Panel': [
        'Query Latency (p50/p95/p99)',
        'Request Rate & Error Rate',
        'Token Usage (Input vs Output)',
        'Cost per Query & Total Spend',
        'Cache Hit Rate %',
        'Error Rate %',
        'Active Requests',
        'Rate Limit Remaining',
        'Response Relevance Score',
        'Error Breakdown by Type'
    ],
    'Visualization': [
        'Time series graph',
        'Time series graph',
        'Time series graph',
        'Time series graph',
        'Gauge',
        'Stat panel',
        'Stat panel',
        'Gauge',
        'Time series graph',
        'Pie chart'
    ],
    'Purpose': [
        'Track performance trends',
        'Monitor traffic & failures',
        'Understand token consumption',
        'Control spending',
        'Optimize caching strategy',
        'Set SLA alerts',
        'Detect traffic spikes',
        'Prevent rate limit hits',
        'Monitor answer quality',
        'Debug error patterns'
    ]
}

df_panels = pd.DataFrame(dashboard_panels)
print(df_panels.to_string(index=False))

# Expected: Table of 10 dashboard panels

## 7. Dashboards & Alerts (Conceptual)

Once Grafana is running, you can import `grafana_dash.json` to get a pre-built dashboard.

### Prometheus Scrape Configuration

The `prometheus.yml` file tells Prometheus where to collect metrics:

```yaml
scrape_configs:
  - job_name: 'rag-service'
    static_configs:
      - targets: ['host.docker.internal:8000']
    scrape_interval: 10s
```

**Key settings:**
- `targets`: Where to scrape (our metrics endpoint)
- `scrape_interval`: How often to collect (10s = every 10 seconds)
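- `retention`: How long to store data. This is not a `prometheus.yml` setting; it is set with the `--storage.tsdb.retention.time` startup flag (Prometheus defaults to 15 days unless the flag overrides it, for example to 30 days)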

**Prometheus queries (PromQL):**
- `rag_query_latency_seconds_bucket` → Raw cumulative bucket counters behind the histogram
- `histogram_quantile(0.95, ...)` → Estimate p95 latency from those buckets
- `rate(rag_requests_total[5m])` → Per-second request rate, averaged over the last 5 minutes
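
You can also run these queries from Python through Prometheus's HTTP API (`/api/v1/query`), which is handy for quick checks outside Grafana. This is a minimal sketch assuming Prometheus is reachable at `http://localhost:9090`; the helper names are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROMETHEUS_URL = "http://localhost:9090"  # assumption: the local stack from this notebook


def build_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build an instant-query URL for Prometheus's /api/v1/query endpoint."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def extract_values(payload: dict) -> list[float]:
    """Pull the sample values out of an instant-vector response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    return [float(series["value"][1]) for series in payload["data"]["result"]]


def run_query(promql: str) -> list[float]:
    """Execute the query against a running Prometheus (requires the stack to be up)."""
    with urlopen(build_query_url(promql), timeout=5) as response:
        return extract_values(json.load(response))


P95_QUERY = (
    "histogram_quantile(0.95, "
    "sum by (le) (rate(rag_query_latency_seconds_bucket[5m])))"
)
print(build_query_url(P95_QUERY))
```

Once the stack is running, `run_query(P95_QUERY)` returns the current p95 estimate as a list with one value per result series.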

In [None]:
import os

# Check if docker-compose file exists
compose_file = "docker-compose.monitoring.yml"
prometheus_config = "prometheus.yml"

if os.path.exists(compose_file) and os.path.exists(prometheus_config):
    print(f"✓ Found {compose_file}")
    print(f"✓ Found {prometheus_config}")
    print("\nReady to start monitoring stack!")
    print("\nRun in terminal:")
    print(f"  docker compose -f {compose_file} up -d")
    print("\nThen visit:")
    print("  • Prometheus: http://localhost:9090")
    print("  • Grafana: http://localhost:3000 (admin/admin)")
else:
    print("⚠ Docker compose files not found")
    print("  Expected: docker-compose.monitoring.yml and prometheus.yml")

# Expected: Success message with URLs

### Starting the Monitoring Stack

**If Docker is installed**, run:

```bash
# Start Prometheus + Grafana
docker compose -f docker-compose.monitoring.yml up -d

# Check status
docker compose -f docker-compose.monitoring.yml ps
```

**Services:**
- 🔍 **Prometheus**: http://localhost:9090 (metrics database)
- 📊 **Grafana**: http://localhost:3000 (login: admin/admin)

**If Docker is NOT running:**
- You can still use the metrics endpoint for testing
- Cloud providers offer managed Prometheus/Grafana alternatives
- See README.md for installation instructions

## 6. Prometheus + Grafana Setup

To visualize metrics, we need to run Prometheus (metrics storage) and Grafana (dashboards).

### Example JSON Log Output

```json
{
  "timestamp": "2025-11-06T12:34:56.789Z",
  "level": "INFO",
  "logger": "__main__",
  "message": "Response completed",
  "service": "rag-service",
  "environment": "development",
  "event_type": "response",
  "duration_ms": 1250.5,
  "input_tokens": 450,
  "output_tokens": 180,
  "cost_usd": 0.0234,
  "success": true,
  "model": "gpt-4"
}
```

**Benefits:**
- Cloud logging services can parse and index these automatically
- Query by `cost_usd > 0.05` or `error_type = "RateLimitError"`
- Correlate with metrics using timestamps
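
For a sense of what produces output like this, here is a minimal sketch of a JSON formatter built on the standard `logging` module. It is one possible design, not the actual implementation of `StructuredLogger`; the `JsonFormatter` class and the `fields` convention are assumptions made for illustration.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str, environment: str):
        super().__init__()
        self.static_fields = {"service": service, "environment": environment}

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            **self.static_fields,
            # Custom fields arrive via logger.info(..., extra={"fields": {...}})
            **getattr(record, "fields", {}),
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="rag-service", environment="development"))
demo_logger = logging.getLogger("structured_demo")
demo_logger.addHandler(handler)
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # avoid duplicate output through the root logger

demo_logger.info(
    "Response completed",
    extra={"fields": {"event_type": "response", "duration_ms": 1250.5, "cost_usd": 0.0234}},
)
```

Keeping every event on one line as a flat JSON object is what lets cloud log services index fields such as `cost_usd` without custom parsing rules.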

In [None]:
from src.m2_3_monitoring import StructuredLogger

# Create a structured logger
logger = StructuredLogger(__name__)

# Log a request
logger.log_request(
    query="What are the latest sales figures?",
    user_id="alice",
    session_id="sess_12345",
    endpoint="/api/query"
)

# Simulate processing and log response
logger.log_response(
    duration_ms=1250.5,
    tokens={'input': 450, 'output': 180},
    cost=0.0234,
    success=True,
    model="gpt-4",
    cache_hit=False
)

# Log an error scenario
try:
    raise ValueError("Simulated error: API rate limit exceeded")
except Exception as e:
    logger.log_error(e, context={
        'user_id': 'alice',
        'endpoint': '/api/query',
        'retry_attempt': 3
    })

print("\n✓ Structured logs emitted (check console output above)")
print("  Each log is a JSON object with timestamp, level, service, and custom fields")

# Expected: 3 JSON log lines printed to console

## 5. Structured Logging

Production systems need **structured logs** (JSON format) for:
- Cloud log aggregation (CloudWatch, Google Cloud Logging (formerly Stackdriver), Datadog)
- Searchability by fields (user_id, error_type, cost, etc.)
- Correlation with metrics (matching timestamps)

Traditional logs: `"User alice made query at 2023-10-15"` ❌  
Structured logs: `{"timestamp": "...", "user_id": "alice", "event": "query"}` ✅

### How the Decorator Works

The `@monitored_query` decorator automatically:

1. ✅ **Starts a timer** when the function is called
2. ✅ **Increments active_requests** gauge
3. ✅ **Records latency** in the histogram when complete
4. ✅ **Extracts token counts** from the return value
5. ✅ **Calculates cost** based on token usage
6. ✅ **Records relevance scores** if present
7. ✅ **Catches errors** and records them separately

**No manual metric recording needed** - just return a dict with the right keys!
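
To make those steps concrete, here is a simplified sketch of how such a decorator could be structured. It uses plain in-memory containers instead of Prometheus metrics so it runs without dependencies, and the names (`monitored`, `recorded`, `PRICE_PER_1K`) and prices are hypothetical; the real `@monitored_query` may differ.

```python
import functools
import time
from collections import defaultdict

# In-memory stand-ins for Prometheus metrics, so the sketch runs without dependencies.
recorded = defaultdict(list)
state = {"active_requests": 0}

# Hypothetical per-1K-token prices; replace with your provider's real rates.
PRICE_PER_1K = {"input": 0.03, "output": 0.06}


def monitored(operation: str):
    """Decorator factory: time the call, track in-flight requests, record usage."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            state["active_requests"] += 1
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
            except Exception as exc:
                # Record the error type, then let the caller see the exception
                recorded[f"{operation}_errors"].append(type(exc).__name__)
                raise
            finally:
                # Runs on success and failure alike
                state["active_requests"] -= 1
                recorded[f"{operation}_latency_s"].append(time.perf_counter() - start)
            if isinstance(result, dict):
                tokens_in = result.get("input_tokens", 0)
                tokens_out = result.get("output_tokens", 0)
                cost = (tokens_in * PRICE_PER_1K["input"]
                        + tokens_out * PRICE_PER_1K["output"]) / 1000
                recorded[f"{operation}_cost_usd"].append(cost)
                if "relevance_score" in result:
                    recorded[f"{operation}_relevance"].append(result["relevance_score"])
            return result
        return wrapper
    return decorator


@monitored("demo")
def fake_query(question: str) -> dict:
    return {"answer": "stub", "input_tokens": 1000, "output_tokens": 500,
            "relevance_score": 0.9}


fake_query("hello")
print(dict(recorded))
```

The `finally` block is the important design choice: latency is recorded and the in-flight count is decremented even when the wrapped function raises, so failed requests do not leave the gauge stuck.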

In [None]:
from src.m2_3_monitoring import metrics, monitored_query, track_cache_operation
import time
import random

@monitored_query(operation="rag_pipeline", model="gpt-4")
def simulated_rag_query(query: str, use_cache: bool = False):
    """
    Simulates a RAG query with retrieval, LLM generation, and metrics.
    """
    
    # Simulate cache check
    cache_hit = use_cache and random.random() > 0.3
    track_cache_operation(hit=cache_hit, cache_type="semantic")
    
    if cache_hit:
        # Fast path - return cached result
        time.sleep(0.05)
        return {
            'answer': 'Cached answer',
            'input_tokens': 450,
            'output_tokens': 120,
            'relevance_score': 0.88,
            'cached': True
        }
    
    # Simulate retrieval (100-500ms)
    retrieval_time = random.uniform(0.1, 0.5)
    metrics.retrieval_latency.observe(retrieval_time)
    time.sleep(retrieval_time)
    
    # Simulate LLM generation (1-3s)
    llm_time = random.uniform(1.0, 3.0)
    metrics.llm_latency.observe(llm_time)
    time.sleep(llm_time)
    
    # Return realistic metrics
    return {
        'answer': f'Generated answer for: {query[:30]}...',
        'input_tokens': random.randint(400, 1500),
        'output_tokens': random.randint(100, 500),
        'relevance_score': random.uniform(0.7, 0.95),
        'cached': False
    }

# Run sample queries
print("Running 5 sample RAG queries...\n")

for i in range(5):
    use_cache = (i % 2 == 0)  # Alternate cache usage
    result = simulated_rag_query(
        query=f"What is the status of project {i}?",
        use_cache=use_cache
    )
    
    print(f"Query {i+1}: tokens={result['input_tokens']}+{result['output_tokens']}, "
          f"relevance={result['relevance_score']:.2f}, cached={result['cached']}")

print("\n✓ Metrics recorded and available at /metrics endpoint")

# Expected: 5 query results with varying token counts and relevance scores

## 4. Instrumenting a RAG Query

Let's simulate a complete RAG pipeline and automatically record metrics using the `@monitored_query` decorator.

### What Prometheus Sees

When Prometheus scrapes the `/metrics` endpoint, it gets text output like:

```
# HELP rag_query_latency_seconds End-to-end RAG query latency
# TYPE rag_query_latency_seconds histogram
rag_query_latency_seconds_bucket{le="0.1",operation="query"} 5
rag_query_latency_seconds_bucket{le="0.5",operation="query"} 12
rag_query_latency_seconds_bucket{le="1.0",operation="query"} 18
rag_query_latency_seconds_bucket{le="+Inf",operation="query"} 20
rag_query_latency_seconds_sum{operation="query"} 15.3
rag_query_latency_seconds_count{operation="query"} 20
```

This is the **Prometheus exposition format** - a simple text format that Prometheus scrapes and stores. Grafana does not read this endpoint directly; it queries Prometheus using PromQL.
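
Because the format is plain text, it is easy to inspect by hand. The sketch below parses lines from the sample above and derives the mean latency from the `_sum` and `_count` series. It is a simplified parser for illustration only: it ignores optional timestamps and label values that contain spaces.

```python
SAMPLE = """\
# HELP rag_query_latency_seconds End-to-end RAG query latency
# TYPE rag_query_latency_seconds histogram
rag_query_latency_seconds_bucket{le="0.1",operation="query"} 5
rag_query_latency_seconds_bucket{le="0.5",operation="query"} 12
rag_query_latency_seconds_bucket{le="1.0",operation="query"} 18
rag_query_latency_seconds_sum{operation="query"} 15.3
rag_query_latency_seconds_count{operation="query"} 20
"""


def parse_samples(text: str) -> dict[str, float]:
    """Map each 'name{labels}' series to its value, skipping comments and blanks."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples


samples = parse_samples(SAMPLE)
total = samples['rag_query_latency_seconds_sum{operation="query"}']
count = samples['rag_query_latency_seconds_count{operation="query"}']
print(f"Mean latency: {total / count:.3f}s over {count:.0f} requests")
# Mean latency: 0.765s over 20 requests
```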

In [None]:
from src.m2_3_monitoring import start_metrics_server, metrics
import config

# Start the Prometheus metrics HTTP server
# This will serve metrics at http://localhost:8000/metrics
port = config.METRICS_PORT
success = start_metrics_server(port)

if success:
    print(f"✓ Metrics endpoint running at http://localhost:{port}/metrics")
    print("✓ Prometheus will scrape this endpoint at the scrape_interval set in prometheus.yml (10s here)")
else:
    print(f"⚠ Metrics server may already be running on port {port}")

# Expected: Success message with URL

## 3. Start Metrics Endpoint

Our `m2_3_monitoring.py` module provides a simple `start_metrics_server()` function that exposes metrics on `/metrics`.

### Understanding Metric Types

**Prometheus supports 4 metric types:**

1. **Counter** - Only goes up (total requests, cumulative cost)
2. **Gauge** - Can go up/down (active connections, cache hit rate)
3. **Histogram** - Samples observations and buckets them (latency, token counts)
4. **Summary** - Similar to histogram but calculates quantiles client-side (less common)

**Why histograms for latency?**
- Allows querying p50/p95/p99 percentiles
- Better than averages (which hide outliers)
- Example: p95 = 2s means "95% of requests complete within 2 seconds"
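
It is worth seeing how a percentile comes out of bucket counts, because the result is an estimate, not an exact value. The sketch below mirrors the linear interpolation that PromQL's `histogram_quantile()` applies within a bucket. The bucket counts are hypothetical and `estimate_quantile` is an illustrative helper.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with (+Inf, total).
    """
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile lies beyond the last finite bucket: report that bound.
                return lower_bound
            span = count - lower_count
            fraction = (rank - lower_count) / span if span else 0.0
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts for 20 requests.
buckets = [(0.1, 5), (0.5, 12), (1.0, 18), (math.inf, 20)]
print(f"p50 ≈ {estimate_quantile(0.50, buckets):.3f}s")  # p50 ≈ 0.386s
print(f"p95 ≈ {estimate_quantile(0.95, buckets):.3f}s")  # p95 ≈ 1.000s
```

Note that p95 is capped at 1.0s here: two requests were slower than the largest finite bucket, so the histogram cannot say how slow they were. Choose bucket boundaries that extend past your alert thresholds.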

In [None]:
# Define our metric categories
metrics_catalog = {
    'Category': [
        'Performance', 'Performance', 'Performance',
        'Cost', 'Cost',
        'Quality', 'Quality',
        'System Health', 'System Health', 'System Health'
    ],
    'Metric': [
        'Query Latency (p50/p95/p99)', 'Retrieval Time', 'LLM Generation Time',
        'Cost per Query', 'Daily Spend by Model',
        'Relevance Score', 'Context Precision',
        'Cache Hit Rate', 'Error Rate', 'Rate Limit Headroom'
    ],
    'Type': [
        'Histogram', 'Histogram', 'Histogram',
        'Histogram', 'Counter',
        'Histogram', 'Histogram',
        'Gauge', 'Counter', 'Gauge'
    ],
    'Alert Threshold': [
        'p95 > 2s', 'p95 > 500ms', 'p95 > 5s',
        'p95 > $0.10', 'Daily > $100',
        'p50 < 0.7', 'p50 < 0.6',
        '< 30%', '> 5%', '< 100 requests'
    ]
}

df_metrics = pd.DataFrame(metrics_catalog)
print(df_metrics.to_string(index=False))

# Expected: Table of 10 metrics with types and alert thresholds

## 2. Metrics We'll Collect & Why

A production RAG system needs 4 categories of metrics: **Performance**, **Cost**, **Quality**, and **System Health**.

### When Self-Hosted Monitoring is Overkill

**Don't use Prometheus/Grafana if:**
- You're processing < 500 queries/day → Use CloudWatch or Google Cloud Logging (formerly Stackdriver) logs
- Your team has no DevOps experience → Use Datadog/New Relic
- You need results in < 2 hours → Start with basic logging first

**DO use it when:**
- High query volumes justify the infrastructure cost
- You need granular control over data retention
- Your team can maintain the monitoring stack
- You want industry-standard open-source tools

In [None]:
import pandas as pd

# Three monitoring approaches compared
approaches = {
    'Approach': ['Self-Hosted (Prometheus)', 'Managed APM', 'Native Cloud Logging'],
    'Setup Time': ['8-12 hours', '30 minutes', '1 hour'],
    'Monthly Cost': ['$50-200', '$15-31/host', '$5-20'],
    'Storage Growth': ['1-5 GB/day', 'Vendor-managed', '~100 MB/day'],
    'Best For': ['>1K queries/day', 'Non-technical teams', '<500 queries/day'],
    'Learning Curve': ['Steep', 'Minimal', 'Low']
}

df = pd.DataFrame(approaches)
print(df.to_string(index=False))

# Expected: Table showing 3 approaches with tradeoffs

## 1. Reality Check & When This is Overkill

Before diving into Prometheus and Grafana, let's understand the **trade-offs** of self-hosted monitoring.

# M2.3 — Production Monitoring Dashboard

**Goal:** Implement production-grade monitoring with Prometheus & Grafana for RAG systems.

**Topics:**
- Performance metrics (latency, tokens, costs)
- Structured logging for cloud environments
- Dashboard visualization and alerting strategies
- When to use managed alternatives

**Duration:** ~40 minutes

---