# Module 7.2: Application Performance Monitoring

**Duration:** 38 minutes  
**Prerequisites:** Level 1 M2.3 (Prometheus/Grafana) + Level 2 M7.1 (OpenTelemetry Tracing)

## Learning Objectives

By the end of this notebook, you will:
- Integrate Datadog APM with existing OpenTelemetry tracing (without double instrumentation)
- Profile production code to find bottlenecks down to the function level (with <5% overhead)
- Detect memory leaks and CPU hotspots in live systems (without crashing production)
- Optimize database queries using APM query analysis
- Understand when APM is overkill and what cheaper alternatives exist

## 📚 Learning Arc

**Purpose**: Master Application Performance Monitoring (APM) to identify code-level bottlenecks in production RAG systems beyond what distributed tracing reveals.

**Concepts Covered**:
- Datadog APM integration with OpenTelemetry bridge (M7.1 compatibility)
- Production-safe profiling configuration (1% capture, 10% sampling, <5% CPU overhead)
- Memory leak detection with tracemalloc and continuous monitoring
- Cost optimization strategies and alternative solutions (Grafana Tempo, py-spy)
- Decision framework: When APM is essential vs. when it's premature optimization

**After Completing**: You'll be able to deploy APM in production without crushing performance, detect memory leaks before they cause OOM kills, and make informed cost/benefit decisions about APM tooling.

**Context in Track L3.M7**: Builds on M7.1 (Distributed Tracing) by adding function-level profiling. APM shows *why* a span is slow, while tracing shows *which* spans are slow.

## Section 2: Prerequisites & Setup

This section verifies your environment is ready for APM profiling and establishes OFFLINE mode for local development.

### Starting Point Verification

Your Level 2 M7.1 system currently has:
- OpenTelemetry tracing showing request flows
- Traces visible in Jaeger UI
- Spans tagged with custom attributes
- Trace sampling configured (10-20%)

**The gap:** When Jaeger shows a slow span, you can't see what's happening inside at the code level. APM fills this gap with function-level profiling.

### Dependencies Installation

We'll use these libraries (all optional - the module works without them):
- `ddtrace`: Datadog APM library (agentless approach)
- `py-spy`: Low-overhead profiling tool for spot checks
- `memory-profiler`: Memory leak detection utilities

# Environment setup and OFFLINE mode check
import os
import sys

# OFFLINE mode for L3 consistency
OFFLINE = os.getenv("OFFLINE", "false").lower() == "true"
APM_ENABLED = os.getenv("APM_ENABLED", "false").lower() == "true"

if OFFLINE or not APM_ENABLED:
    print("⚠️  Running in OFFLINE/APM_DISABLED mode")
    print("   Telemetry will not be exported to Datadog")
    print("   The module will demonstrate APM concepts without external services")
    print()

print("Checking dependencies...")

# Check ddtrace (optional - APM will be disabled if missing)
try:
    import ddtrace
    print(f"✓ ddtrace: {ddtrace.__version__}")
except ImportError:
    print("⚠️  ddtrace not available - APM features disabled")

# Check OpenTelemetry (M7.1 prerequisite)
try:
    import opentelemetry
    print("✓ OpenTelemetry: available")
except ImportError:
    print("⚠️  OpenTelemetry not available - install for M7.1 compatibility")

# Expected: ddtrace 2.x.x or higher, OpenTelemetry available
# Note: APM will work without these libraries in demonstration mode

## Section 3: Theory Foundation

### APM vs Tracing

**Analogy: Debugging a Traffic Jam**

- **Metrics** (Prometheus): "Highway has 1,000 cars/hour, average speed 20mph"
- **Tracing** (OpenTelemetry): "Car #47 took 45 minutes from entrance to exit, passing through zones A → B → C"
- **APM**: "Car #47 spent 30 of those 45 minutes stopped in zone B because the left lane was blocked by a stalled truck at mile marker 23.7"

### How APM Works

```
Your Python App
├── OpenTelemetry (Traces)
│   └── Span: "process_query" - 2.5s
│
└── Datadog APM (Profiling)
    └── WITHIN that span:
        ├── Function: embedding_model() - 200ms
        ├── Function: chunk_filter() - 2.1s ⚠️
        │   └── Line 187: nested loop - 1.8s ⚠️⚠️
        └── Function: format_response() - 200ms
```

**Process:**
1. APM agent samples your Python process (default: 100 samples/second)
2. Each sample captures the call stack (which functions are executing)
3. Over time, you get a statistical profile: "85% of time is in chunk_filter()"
4. APM correlates this with your OpenTelemetry traces
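
The sampling process above can be sketched with the standard library alone. This is a minimal illustration of statistical profiling, not Datadog's implementation: `sample_stacks`, `profile_by_sampling`, and the stand-in `chunk_filter` are hypothetical names, and the 5 ms interval is an arbitrary choice.

```python
import collections
import sys
import threading
import time


def sample_stacks(thread_id, counts, stop_event, interval=0.005):
    """Every `interval` seconds, record which function the target thread is running."""
    while not stop_event.is_set():
        frame = sys._current_frames().get(thread_id)
        if frame is not None:
            counts[frame.f_code.co_name] += 1
        time.sleep(interval)


def chunk_filter(n=300):
    # Deliberately quadratic stand-in for the hot function in the diagram above.
    total = 0
    for i in range(n):
        for j in range(n):
            total += i ^ j
    return total


def profile_by_sampling(fn, duration=0.3):
    """Call `fn` repeatedly for `duration` seconds while a background thread samples it."""
    counts = collections.Counter()
    stop_event = threading.Event()
    sampler = threading.Thread(
        target=sample_stacks, args=(threading.get_ident(), counts, stop_event)
    )
    sampler.start()
    deadline = time.perf_counter() + duration
    while time.perf_counter() < deadline:
        fn()
    stop_event.set()
    sampler.join()
    return counts


counts = profile_by_sampling(chunk_filter)
hot_function, hits = counts.most_common(1)[0]
print(f"{hot_function}: {hits / sum(counts.values()):.0%} of samples")
```

Because the profiler only looks at the stack a few hundred times per second, its cost stays roughly constant no matter how much work the profiled code does. That is the property that makes continuous profiling viable in production.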

### Key Distinction

**APM COMPLEMENTS tracing, doesn't replace it:**
- Tracing: Shows request flow between services (the 'what' and 'where')
- APM: Shows code execution within a service (the 'why')

## Section 4: Hands-On Implementation

In this section, you'll initialize APM and run profiled RAG queries to see code-level bottleneck detection in action.

### Step 1: Initialize APM Manager

The APM Manager handles:
- Datadog tracer configuration with production-safe defaults
- OpenTelemetry compatibility bridge (reuses M7.1 traces)
- Continuous profiler startup/shutdown with graceful degradation
- Production safety limits (max 5% CPU overhead, 1% profiling capture)

In [None]:
# Initialize APM Manager
# apm_config is assumed to be exported by the same module as apm_manager
from src.l3_m7_application_performance_monitoring import apm_manager, apm_config

print("Initializing APM...")
success = apm_manager.initialize()

if success:
    print("✅ APM initialized successfully")
    print(f"   Service: {apm_config.DD_SERVICE}")
    print(f"   Environment: {apm_config.DD_ENV}")
else:
    print("⚠️  APM initialization skipped")
    print("   Reason: No DD_API_KEY configured or ddtrace not installed")
    print("   Pipeline will work without APM profiling")

# Expected: APM initializes if keys configured, otherwise gracefully skips
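
The graceful-degradation behavior described above can be sketched as follows. This is not the module's `apm_manager`; `initialize_profiler` is a hypothetical helper that assumes the profiler should stay off in OFFLINE mode, when `DD_API_KEY` is unset, or when `ddtrace` is not installed.

```python
import os


def initialize_profiler(env=os.environ):
    """Start Datadog's continuous profiler, or return False so the app runs without it."""
    if env.get("OFFLINE", "false").lower() == "true":
        return False  # local development: never export telemetry
    if not env.get("DD_API_KEY"):
        return False  # no credentials configured
    try:
        from ddtrace.profiling import Profiler
    except ImportError:
        return False  # ddtrace is optional in this module
    Profiler().start()  # service and environment come from DD_SERVICE / DD_ENV
    return True


started = initialize_profiler({"OFFLINE": "true"})
print("profiler started:", started)
```

Returning a boolean instead of raising keeps the RAG pipeline usable in every environment, at the cost of making a misconfiguration easy to miss. Logging the reason for a `False` result would be a sensible addition.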

### Step 2: Profiled RAG Pipeline

The ProfiledRAGPipeline demonstrates:
- Custom profiling with `@tracer.wrap()` decorators
- Span tagging (user_id, query_length)
- Exception tracking
- O(n²) bottleneck simulation for APM detection

**What APM will show:**
```
Span: rag.query - 2,547ms
├─ Span: rag.embed_query - 201ms
├─ Span: rag.search_vectordb - 304ms
├─ Span: rag.process_context - 1,893ms ⚠️
│  └─ Profile: _remove_overlapping_chunks() - 1,750ms
│     └─ Line 503: nested loop hotspot
└─ Span: rag.generate_response - 503ms
```
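
A minimal sketch of the kind of code that produces a hotspot like the one above. The overlap rule and the `(start, end)` chunk representation are assumptions for illustration, and `traced` is a hypothetical helper that falls back to a no-op when `ddtrace` is missing; only `tracer.wrap()` comes from the library.

```python
try:
    from ddtrace import tracer  # real tracer when ddtrace is installed
except ImportError:
    tracer = None


def traced(name):
    """Use tracer.wrap() when ddtrace is available, otherwise leave the function untouched."""
    if tracer is not None:
        return tracer.wrap(name=name)
    return lambda fn: fn


def overlaps(a, b):
    """True when two (start, end) character ranges share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


@traced("rag.process_context")
def remove_overlapping_chunks(chunks):
    """Keep a chunk only if it overlaps nothing already kept: O(n^2) comparisons."""
    kept = []
    for chunk in chunks:
        if not any(overlaps(chunk, other) for other in kept):
            kept.append(chunk)
    return kept


chunks = [(0, 100), (50, 150), (100, 200), (180, 260), (200, 300)]
print(remove_overlapping_chunks(chunks))
```

With a handful of chunks the nested comparison is invisible. With thousands of retrieved chunks it dominates the span, which is the pattern a profiler attributes to a single line.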

## Section 5: Memory Profiling & Leak Detection

Memory leaks are production incidents waiting to happen. This section demonstrates continuous memory monitoring to catch leaks before they cause OOM kills.

### Why Memory Leaks Matter

Memory leaks are silent killers in production:
- Gradual memory growth over hours/days
- Process eventually OOM killed (Out Of Memory)
- Difficult to debug without profiling tools

### Detection Strategy

1. **Baseline tracking**: Record memory at startup
2. **Periodic sampling**: Check memory every N requests
3. **Growth analysis**: Alert if growth >10MB/hour
4. **Leak identification**: Use objgraph to find growing objects
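
Step 3 of the strategy above is simple arithmetic over the periodic samples. A minimal sketch, assuming samples are stored as `(seconds_since_start, memory_mb)` pairs; `growth_mb_per_hour` and the sample values are hypothetical.

```python
GROWTH_ALERT_MB_PER_HOUR = 10.0  # threshold from the detection strategy above


def growth_mb_per_hour(samples):
    """Average growth rate between the oldest and newest (seconds, memory_mb) samples."""
    (t0, m0), (t1, m1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (m1 - m0) / ((t1 - t0) / 3600)


# Baseline at startup, then one sample every 30 minutes.
samples = [(0, 512.0), (1800, 519.5), (3600, 527.0)]
rate = growth_mb_per_hour(samples)
if rate > GROWTH_ALERT_MB_PER_HOUR:
    print(f"possible leak: memory growing at {rate:.1f} MB/hour")
```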

### Memory Profiling with tracemalloc

Python's built-in tracemalloc module provides:
- Memory allocation tracking
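- Overhead that grows with allocation rate and stored traceback depth (this module targets <5% CPU impact; measure it on your workload)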
- Peak memory and growth metrics
- Line-level attribution
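
A minimal sketch of these four capabilities using only `tracemalloc`. `MemoryTracker` is a hypothetical class, not the module's `MemoryProfiledComponent`, and the growing list simulates a leak.

```python
import tracemalloc

MB = 1024 * 1024


class MemoryTracker:
    """Record a baseline at construction and report growth relative to it."""

    def __init__(self):
        tracemalloc.start()
        self.baseline, _ = tracemalloc.get_traced_memory()

    def stats(self):
        current, peak = tracemalloc.get_traced_memory()
        return {
            "current_mb": current / MB,
            "peak_mb": peak / MB,
            "growth_mb": (current - self.baseline) / MB,
        }


tracker = MemoryTracker()

leaky_cache = []
for request_id in range(200):
    leaky_cache.append(bytearray(10_000))  # retained forever: the simulated leak

stats = tracker.stats()
# Line-level attribution: the source line responsible for the most live memory.
top_allocation = tracemalloc.take_snapshot().statistics("lineno")[0]
tracemalloc.stop()

print(f"growth: {stats['growth_mb']:.2f} MB")
print(top_allocation)
```

The top entry of the snapshot points at the `append` line, which is the information you need when a growth alert fires.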

## Section 6: Reality Check

This section is critical: understanding what APM can't do and when it breaks is as important as knowing what it can do. Don't skip this.

### What This DOESN'T Do

1. **Replace code optimization**: APM shows you the problem, you still have to fix it
2. **Eliminate load testing**: Profiling shows behavior under load, not capacity limits
3. **Diagnose network issues**: Use distributed tracing (M7.1) for cross-service problems

### Trade-offs You Accepted

Every tool has costs. APM's trade-offs:
1. **Performance overhead**: 2-5% CPU even with conservative sampling (1% profiling, 10% traces)
2. **Cost**: $51-100/month minimum, rising to $300+ at scale
3. **Complexity**: 300+ lines of config code, requires profiling expertise
4. **Data privacy**: Telemetry sent to Datadog (third-party service)

### When This Approach Breaks

Real-world failure scenarios with solutions:

**Scenario 1: APM Overhead Crushes Performance**
- Production-safe config: 1% profiling, 10% sampling
- If you increase to 10% profiling + 100% sampling → 5-15% slowdown
- **Solution**: Always use production-safe defaults, load test first

**Scenario 2: Cost Explosion**
- Expected: $51/month for small deployment
- At 100K requests/hour: $300-500/month due to per-span fees ($5 per 1M spans)
- **Solution**: Adaptive sampling (reduce rate at high traffic)

**Scenario 3: Memory Leak Not Detected**
- APM samples periodically, may miss slow retention leaks
- Leaks that grow <1MB/hour are hard to detect
- **Solution**: Long-term monitoring (hours/days), use objgraph for deep analysis

## Section 7: Alternative Solutions

### Alternative 1: Open-Source APM (Grafana Tempo + Grafana)

**Cost**: $0 (self-hosted) or $50/month (Grafana Cloud)

**Pros:**
- Full control over data
- No vendor lock-in
- Integrates with existing Grafana dashboards

**Cons:**
- Manual setup (2-3 days)
- Less powerful profiling than Datadog
- No automatic code-level flame graphs

**When to use**: Budget <$100/month, need data sovereignty, already using Grafana

---

### Alternative 2: Cloud Provider APM

**Options:**
- AWS X-Ray: $5 per 1M requests
- GCP Cloud Profiler: Free for GCP users
- Azure Application Insights: Pay-per-use

**Pros:**
- Native cloud integration
- Simpler if already on AWS/GCP/Azure
- Often cheaper at low scale

**Cons:**
- Vendor lock-in
- Limited cross-cloud visibility
- Less powerful than Datadog

**When to use**: Single cloud deployment, already invested in cloud ecosystem

---

### Alternative 3: Manual Profiling (py-spy)

**Cost**: $0

**Pros:**
- Zero overhead when not profiling
- On-demand profiling
- Great for one-time investigations

**Cons:**
- Manual process
- No continuous monitoring
- No correlation with traces

**When to use**: Low traffic, occasional debugging, tight budget

**Example:**
```bash
# Attach to a running Python process by PID (stop with Ctrl-C to write the flame graph)
py-spy record -o profile.svg --pid 12345

# Launch app.py under py-spy and record for 60 seconds
py-spy record -d 60 -o profile.svg -- python app.py
```

## Section 8: When NOT to Use APM

### Scenario 1: Low Traffic (<1,000 requests/day)

**Why NOT to use APM:**
- Insufficient data for meaningful profiling patterns
- Sampling 10% of 1,000 requests = 100 traces (not enough)
- APM costs ($51/month) exceed infrastructure costs

**What to use instead:**
- py-spy for one-time profiling
- Manual logging during development
- Wait until traffic grows >1K requests/hour

---

### Scenario 2: Pre-Optimization (No Known Performance Problem)

**Why NOT to use APM:**
- Premature optimization wastes time
- No baseline to compare against
- APM overhead without benefit

**What to do instead:**
- Wait until P95 latency crosses threshold (e.g., 3s)
- Use basic metrics (Prometheus) to identify issues first
- Add APM when you have specific bottleneck to investigate

**Decision rule:** Only add APM when you have a known performance problem that basic metrics can't diagnose

---

### Scenario 3: Tight Budget (<$100/month total infrastructure)

**Why NOT to use APM:**
- APM minimum: $51/month
- At $100 total budget, APM is 50%+ of costs
- Open-source alternatives available

**What to use instead:**
- Grafana Tempo (self-hosted, $0)
- py-spy (manual profiling, $0)
- Cloud provider APM if already on AWS/GCP (often cheaper)

---

### Scenario 4: Highly Sensitive Data (Healthcare, Finance, Government)

**Why NOT to use APM:**
- Telemetry sent to third-party (Datadog)
- May violate data sovereignty requirements
- Compliance concerns (HIPAA, PCI-DSS)

**What to use instead:**
- Self-hosted APM (Grafana Tempo)
- On-premise profiling tools
- Cloud provider APM in same region/jurisdiction

---

### Summary: Use APM When You Have ALL of These

```
✓ High traffic (>1K requests/hour)
✓ Known performance problems (P95 >3s)
✓ Adequate budget ($50-200/month)
✓ Data privacy clearance for third-party telemetry
```

If missing ANY of the above, consider alternatives.
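
The checklist above can be expressed as a small function that reports which conditions are missing. A sketch under the thresholds stated in this section; `apm_blockers` and its parameter names are hypothetical.

```python
def apm_blockers(requests_per_hour, p95_seconds, apm_budget_usd, third_party_telemetry_ok):
    """Return the reasons APM is not yet justified; an empty list means all four conditions hold."""
    blockers = []
    if requests_per_hour <= 1_000:
        blockers.append("traffic at or below 1K requests/hour")
    if p95_seconds <= 3.0:
        blockers.append("no known performance problem (P95 at or below 3s)")
    if apm_budget_usd < 50:
        blockers.append("APM budget below $50/month")
    if not third_party_telemetry_ok:
        blockers.append("no clearance to send telemetry to a third party")
    return blockers


ready = apm_blockers(5_000, 4.2, 120, True)
not_ready = apm_blockers(800, 1.0, 20, False)
print("ready:", not ready)
for reason in not_ready:
    print("blocked:", reason)
```

Returning the reasons rather than a bare boolean makes the output usable in a design review, where the question is usually which condition to fix first.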

## Section 9: Common Failures

### Failure 1: APM Overhead Crushing Performance (5-15% slowdown)

**How it happens:**
```python
# ❌ WRONG - Too aggressive
DD_PROFILING_CAPTURE_PCT = 10  # 10% profiling
DD_TRACE_SAMPLE_RATE = 1.0      # 100% sampling
```

**Symptom:**
- P95 latency increased from 800ms to 1.2s (50% slowdown)
- CPU usage increased from 70% to 95%

**The fix:**
```python
# ✅ CORRECT - Production-safe
DD_PROFILING_CAPTURE_PCT = 1   # 1% profiling
DD_TRACE_SAMPLE_RATE = 0.1      # 10% sampling
DD_PROFILING_MAX_TIME_USAGE_PCT = 5  # Safety limit
```
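
One way to keep an aggressive configuration from reaching production is to check the settings at startup. A minimal sketch, assuming the values are supplied as environment variables; `SAFE_LIMITS` and `unsafe_settings` are hypothetical names, and the limits are the production-safe values from the fix above.

```python
# Upper bounds taken from the production-safe configuration above.
SAFE_LIMITS = {
    "DD_PROFILING_CAPTURE_PCT": 1.0,
    "DD_TRACE_SAMPLE_RATE": 0.1,
    "DD_PROFILING_MAX_TIME_USAGE_PCT": 5.0,
}


def unsafe_settings(env):
    """Return a message for every setting in `env` that exceeds its safe limit."""
    problems = []
    for name, limit in SAFE_LIMITS.items():
        raw = env.get(name)
        if raw is None:
            continue  # unset: the library default applies, nothing to compare
        if float(raw) > limit:
            problems.append(f"{name}={raw} exceeds {limit}")
    return problems


aggressive = {"DD_PROFILING_CAPTURE_PCT": "10", "DD_TRACE_SAMPLE_RATE": "1.0"}
for problem in unsafe_settings(aggressive):
    print(problem)
```

In a service you would pass `os.environ` and refuse to start, or log loudly, when the list is not empty.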

---

### Failure 2: Profiling Crashes Application (OOM)

**How it happens:**
```python
# ❌ WRONG - memory_profiler in production
from memory_profiler import profile

@profile  # Line-by-line tracking = huge overhead
def process_batch(docs):
    # 10,000 docs × 2MB snapshot = 20GB memory
    ...  # body omitted; the decorator is the problem, not the batch logic
```

**The fix:**
```python
# ✅ CORRECT - Use Datadog's sampling-based profiling
DD_PROFILING_MEMORY_ENABLED = True
# No decorator needed - automatic sampling
```

---

### Failure 3: Memory Leak Detection Challenges

**Why leaks are hard to detect:**
- Slow retention leaks (<1MB/hour)
- APM samples periodically, may miss gradual growth
- Need long-term monitoring (hours/days)

**Solution:**
```python
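import objgraph

objgraph.show_growth(limit=10)  # First call records the baseline object counts

# Run long-term monitoring (monitor_memory_leak is assumed to be defined by this module's source)
results = monitor_memory_leak(iterations=100)

# Second call prints only the object types whose counts grew during the run
objgraph.show_growth(limit=10)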
```

---

### Failure 4: Query Optimization Complexity

**Challenge:**
- EXPLAIN ANALYZE shows sequential scan
- Adding index doesn't help
- Query planner decisions are complex

**Solution:**
- Start simple, optimize incrementally
- Test indexes in staging with production data volumes
- Use APM as starting point, not final answer
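
Checking whether the planner actually uses a new index is the first step before and after any change. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` as a self-contained stand-in for PostgreSQL's `EXPLAIN ANALYZE`; the `chunks` table, its columns, and the `query_plan` helper are hypothetical.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable plan steps SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # the fourth column holds the step description


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, doc_id INTEGER, body TEXT)")
conn.executemany(
    "INSERT INTO chunks (doc_id, body) VALUES (?, ?)",
    [(i % 100, f"chunk {i}") for i in range(5000)],
)

sql = "SELECT body FROM chunks WHERE doc_id = ?"
before = query_plan(conn, sql, (42,))

conn.execute("CREATE INDEX idx_chunks_doc_id ON chunks (doc_id)")
after = query_plan(conn, sql, (42,))

print("before:", before)
print("after: ", after)
```

If the plan still shows a full scan after the index exists, the index does not match the query's predicate, and adding more indexes will not help until that mismatch is understood.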

---

### Failure 5: APM Cost Explosion ($500+ bill)

**How it happens:**
- Expected: $51/month
- Actual: $523/month
- Cause: High traffic + 100% sampling = 100M spans/month

**The fix:**
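```python
# Adaptive sampling based on traffic
def choose_trace_sample_rate(requests_per_hour):
    """Lower the trace sample rate once traffic passes 10K requests/hour."""
    if requests_per_hour > 10000:
        return 0.01  # 1% for high traffic
    return 0.1       # 10% for normal traffic

requests_per_hour = 25_000  # example value; read this from your metrics
DD_TRACE_SAMPLE_RATE = choose_trace_sample_rate(requests_per_hour)

# Monitor span count in Datadog billing dashboard
# Set budget alerts at $100, $200 thresholds
```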
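
The span-fee arithmetic behind this failure can be sketched as follows. The $5 per 1M spans rate comes from this module; the request volume and the five spans per request are assumed values chosen to reproduce the 100M spans/month figure, and `monthly_span_fee` is a hypothetical helper that ignores the base subscription.

```python
def monthly_span_fee(requests_per_month, spans_per_request, sample_rate,
                     usd_per_million_spans=5.0):
    """Estimate per-span fees for one month at a given trace sample rate."""
    spans = requests_per_month * spans_per_request * sample_rate
    return spans / 1_000_000 * usd_per_million_spans


# Assumed workload: 20M requests/month, 5 spans per request.
full_sampling = monthly_span_fee(20_000_000, 5, sample_rate=1.0)  # 100M spans
ten_percent = monthly_span_fee(20_000_000, 5, sample_rate=0.1)    # 10M spans

print(f"100% sampling: ${full_sampling:,.0f}/month in span fees")
print(f" 10% sampling: ${ten_percent:,.0f}/month in span fees")
```

Span fees scale linearly with the sample rate, so the sample rate is the first control to reach for when the bill surprises you.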

## Section 10: Decision Card

### ✅ BENEFIT

Deep code-level profiling reveals bottlenecks down to specific function calls and line numbers. Reduces debugging time from hours to minutes by showing CPU hotspots, memory leaks, and slow queries with flame graphs. Correlates performance issues with traces from M7.1.

---

### ❌ LIMITATION

Adds 2-5% CPU overhead in production even with conservative sampling (1% profiling, 10% trace sampling). Cost scales rapidly: $51/month minimum, rising to $300+/month at 100K requests/hour due to per-span analysis fees ($5 per 1M spans). Memory profiling shows allocations but struggles to detect slow retention-based leaks.

---

### 💰 COST

- **Time to implement**: 2-4 hours for initial setup, 1-2 days for production tuning
- **Monthly cost**: $51-100 for small deployments (1-3 hosts), $300-800 for medium scale (10-15 hosts, 10M spans/day)
- **Complexity**: 300+ lines of APM config code, requires understanding of profiling overhead vs visibility trade-offs

---

### 🤔 USE WHEN

- Traffic exceeds 1K requests/hour with known performance problems (P95 >3s)
- Budget allows $50-200/month for APM
- Team of 3+ engineers who will actively monitor dashboards
- No compliance restrictions on sending telemetry to third-party services (Datadog)

---

### 🚫 AVOID WHEN

- Traffic below 1K requests/hour (insufficient data for profiling patterns - use py-spy instead)
- Budget under $100/month total (APM would be 50%+ of costs - use open-source Grafana Tempo)
- Processing sensitive data requiring full data sovereignty (use self-hosted APM)
- No known performance issues yet (premature optimization - wait until P95 crosses 3s)

---

### Decision Framework

```
Choose APM when you have ALL of:
✓ High traffic (>1K requests/hour)
✓ Known performance problems
✓ Adequate budget ($50-200/month)
✓ Data privacy clearance
```

**Save this card** - you'll reference it when deciding between Datadog APM, open-source alternatives, or manual profiling approaches.

## Section 11: Summary & Next Steps

### What You Built Today

- Full Datadog APM integration with OpenTelemetry bridge (connecting your M7.1 tracing)
- Continuous profiling configured to capture 1% of requests with <5% CPU overhead
- Memory leak detection system using tracemalloc
- Production-safe configuration with cost controls

### What You Learned

✅ How APM complements tracing by showing function-level bottlenecks (not just span-level)  
✅ Safe production profiling configuration (1% capture, 10% sampling, 5% max CPU)  
✅ Memory leak detection patterns and tools  
✅ When APM is overkill (low traffic, tight budgets, no known problems)  
✅ Cost management strategies (adaptive sampling, span filtering)

### Key Takeaways

1. **APM is not a replacement for optimization** - it shows you the problem, you fix it
2. **Start conservative** - 1% profiling, 10% sampling, increase only if safe
3. **Monitor costs** - APM can get expensive at scale ($300+/month)
4. **Know when NOT to use it** - low traffic, tight budgets, premature optimization

### Production Checklist

Before deploying APM to production:

- [ ] API keys configured in `.env`
- [ ] Sampling rates production-safe (≤10% traces, ≤1% profiling)
- [ ] Safety limits set (`DD_PROFILING_MAX_TIME_USAGE_PCT=5`)
- [ ] Load tested with APM enabled
- [ ] APM overhead measured (<5% CPU increase)
- [ ] Cost monitoring dashboard created
- [ ] Budget alerts configured ($100, $200 thresholds)
- [ ] Rollback plan documented

### Next Steps

1. **Explore Datadog UI**: If configured, visit https://app.datadoghq.com/apm/traces
2. **Run load tests**: Generate traffic and observe APM profiling in real-time
3. **Optimize bottlenecks**: Use APM to find and fix slow code paths
4. **Next module**: Module 7.3 - Error Tracking & Root Cause Analysis

---

**Congratulations!** You've mastered Application Performance Monitoring for RAG systems.

In [None]:
# Final Demo: Complete APM Workflow
print("=" * 60)
print("Module 7.2: Application Performance Monitoring - Complete Demo")
print("=" * 60)
print()

# 1. Check APM status
print("1. APM Status:")
print(f"   Initialized: {apm_manager.is_initialized}")
print(f"   Configured: {apm_config.is_configured}")
print()

# 2. Run sample query
# `pipeline` must already exist, e.g. the ProfiledRAGPipeline instance from Step 2
print("2. Processing sample query...")
query = "What are the compliance requirements?"
result = pipeline.process_query(query, "demo_user")
print("   ✅ Query processed successfully")
print()

# 3. Check memory
from src.l3_m7_application_performance_monitoring import MemoryProfiledComponent
profiler = MemoryProfiledComponent()
stats = profiler.get_memory_stats()
print("3. Memory Statistics:")
print(f"   Current: {stats['current_mb']:.2f} MB")
print(f"   Peak: {stats['peak_mb']:.2f} MB")
print(f"   Growth: {stats['growth_mb']:.2f} MB")
print()

# 4. Summary
print("=" * 60)
print("Demo Complete!")
print()
print("Next steps:")
print("- View traces in Datadog UI (if configured)")
print("- Run load tests to generate profiling data")
print("- Explore APM flame graphs and bottleneck analysis")
print("=" * 60)

# Expected: All components work, APM captures profiling data if configured