# Bridge L3.M7.1 → L3.M7.2 Readiness Notebook

## Purpose

This bridge validates your transition from **M7.1 (Distributed Tracing)** to **M7.2 (Application Performance Monitoring)**. 

M7.1 gave you request-level visibility: you can see that `openai.generate` took 3.5 seconds in a trace. M7.2 adds function-level visibility: you'll discover which specific function inside that span is the bottleneck. This matters because distributed tracing shows WHAT is slow, but APM reveals WHY — enabling targeted optimizations instead of guesswork.

## Concepts Covered

This bridge runs four readiness checks, covering only what M7.2 needs on top of M7.1:

- Jaeger UI accessibility and trace availability
- 10% sampling rate configuration to avoid APM cost overruns
- Custom span attributes working (foundation for M7.2's advanced attributes)
- OpenTelemetry + Datadog dependencies installed for M7.2 integration

## After Completing This Notebook

You will be able to:

- Confirm Jaeger UI is running and displaying traces from M7.1
- Verify 10% sampling configuration to prevent M7.2 cost overruns
- Validate custom span attributes (foundation for M7.2's function-level metadata)
- Test OpenTelemetry and Datadog SDK imports needed for M7.2 APM

## Context in Track

**Bridge L3.M7.1 → L3.M7.2** within Module 7 (Distributed Tracing & Advanced Observability)

This 8-10 minute bridge validates your M7.1 distributed tracing setup before advancing to M7.2's application performance monitoring with Datadog APM.

---

## Run Locally

**Windows (PowerShell):**
```powershell
$env:PYTHONPATH = "$PWD"; jupyter notebook
```

**macOS/Linux:**
```bash
PYTHONPATH="$PWD" jupyter notebook
```

Then open this notebook and run all cells sequentially.

## Section 1: RECAP - What M7.1 Accomplished

### Achievement 1: OpenTelemetry Instrumentation Working End-to-End
✓ OpenTelemetry SDK configured with OTLP exporter  
✓ BatchSpanProcessor exporting every 5 seconds  
✓ Jaeger backend running at localhost:16686  
✓ Auto-instrumented FastAPI with FastAPIInstrumentor  
✓ Manual spans added to RAG pipeline (retrieve, rerank, generate)  

**Key Win:** Can now answer "Where did 4.2 seconds go in THIS specific request?"

### Achievement 2: Trace Visualization in Jaeger UI
✓ Deployed Jaeger and viewed first traces  
✓ Parent span (POST /query) with child spans for each operation  
✓ Span attributes showing llm.model, user_id, cost_usd  
✓ Filtering capability (e.g., duration >2s)  

### Achievement 3: Sampling Strategy Configured
✓ Implemented 10% head-based sampling  
✓ Reduced average tracing overhead from 20ms to 2ms per request (90% reduction)  
✓ Balanced cost vs coverage (1,000 traces/day at 10K requests/day)  
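
The figures above follow from simple arithmetic. The sketch below reproduces them under one stated assumption: tracing overhead is paid only by sampled requests and is then averaged across all requests. The function name and parameters are illustrative, not part of the M7.1 code.

```python
def sampling_estimates(requests_per_day, sample_rate, overhead_ms_per_traced_request):
    """Estimate trace volume and average overhead for head-based sampling.

    Assumption: only sampled requests pay the tracing overhead.
    """
    traces_per_day = round(requests_per_day * sample_rate)
    avg_overhead_ms = overhead_ms_per_traced_request * sample_rate
    reduction_pct = (1 - sample_rate) * 100
    return traces_per_day, avg_overhead_ms, reduction_pct


# Inputs taken from the recap: 10K requests/day, 10% sampling, 20ms at 100% sampling
traces, overhead, reduction = sampling_estimates(10_000, 0.10, 20.0)
print(f"{traces} traces/day, {overhead:.1f} ms average overhead, {reduction:.0f}% reduction")
```

Changing `sample_rate` shows the trade-off directly: a higher rate gives more coverage but raises both trace volume and average overhead in proportion.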

### Achievement 4: Production Observability Deployed
✓ Every request gets a trace_id  
✓ Context propagates through all services  
✓ Traces retained for 30 days  

**What you gained:** Request-level observability — can trace slow requests from FastAPI → Pinecone → OpenAI

## Section 2: Readiness Check #1 - Jaeger Showing Traces

**Checkpoint:** Verify Jaeger UI is accessible and showing traces  
**Location:** http://localhost:16686  
**Success Criteria:** ≥10 traces from last hour with waterfall view

**Impact if failing:** M7.2 APM traces cannot be correlated with your M7.1 traces unless Jaeger is working. Catching this now saves about 2 hours of troubleshooting.

### Test Jaeger Connectivity

The following cell attempts an HTTP request to the Jaeger UI endpoint. If Jaeger isn't running, the test gracefully skips with a warning.

```python
import requests

# Check whether the Jaeger UI is reachable
try:
    response = requests.get("http://localhost:16686", timeout=3)
    if response.status_code == 200:
        print("✓ Jaeger UI accessible at http://localhost:16686")
    else:
        print(f"⚠️ Jaeger UI returned status {response.status_code}")
except requests.RequestException as e:
    # Connection refused, timeout, and similar network errors land here
    print(f"⚠️ Skipping (no Jaeger service): {type(e).__name__}")

# Expected:
# ✓ Jaeger UI accessible at http://localhost:16686
# (Manual verification needed: Open UI and check for ≥10 traces)
```
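
The "≥10 traces from last hour" criterion can also be checked from code. The sketch below is a rough approach, assuming the JSON endpoint that backs the Jaeger UI (`/api/traces` with `service`, `start`, `end`, and `limit` parameters, timestamps in microseconds). That endpoint is not a stable public API, so treat the parameters and response shape as assumptions. `SERVICE_NAME` is a hypothetical placeholder for your own service name.

```python
import json
import time
import urllib.parse
import urllib.request

JAEGER_URL = "http://localhost:16686"  # from the checkpoint above
SERVICE_NAME = "rag-api"               # hypothetical: replace with your service name


def build_trace_query(service, lookback_s=3600, limit=20, now_s=None):
    """Build a trace-search URL covering the last `lookback_s` seconds."""
    now_s = time.time() if now_s is None else now_s
    params = {
        "service": service,
        "start": int((now_s - lookback_s) * 1_000_000),  # microseconds (assumed unit)
        "end": int(now_s * 1_000_000),
        "limit": limit,
    }
    return f"{JAEGER_URL}/api/traces?{urllib.parse.urlencode(params)}"


def count_traces(payload):
    """Count traces in a response, assuming they are listed under 'data'."""
    return len(payload.get("data") or [])


def recent_trace_count(service):
    """Return the number of recent traces, or None if Jaeger is unreachable."""
    try:
        with urllib.request.urlopen(build_trace_query(service), timeout=3) as resp:
            return count_traces(json.load(resp))
    except (OSError, ValueError):  # network errors, timeouts, malformed JSON
        return None
```

Calling `recent_trace_count(SERVICE_NAME)` returns `None` when Jaeger is not running; a result of 10 or more would meet the success criterion. The waterfall view still needs a visual check in the UI.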

## Section 3: Readiness Check #2 - Sampling Configured

**Checkpoint:** Verify sampling rate is set to 10% (0.10)  
**Location:** Check `tracer.py` or equivalent configuration  
**Success Criteria:** `TraceIdRatioBased(0.10)` or similar 10% sampling configured

**Impact if failing:** 100% sampling overwhelms the APM backend and drives up costs. Verifying the rate now avoids unexpected bills.

### Search for Tracer Configuration Files

The following cell scans your project for tracer configuration files (e.g., `tracer.py`) to help you locate where sampling is configured.

```python
import glob

# Search for tracer configuration files. tracer.py matches both patterns,
# so collect results in a set to avoid listing it twice.
patterns = ("**/tracer.py", "**/*trace*.py")
tracer_files = sorted({f for p in patterns for f in glob.glob(p, recursive=True)})

if tracer_files:
    print(f"Found {len(tracer_files)} tracer config file(s):")
    for f in tracer_files[:3]:  # Show first 3
        print(f"  - {f}")
    print("⚠️ Manual check needed: Verify TraceIdRatioBased(0.10) in config")
else:
    print("⚠️ No tracer.py found - manual verification required")

# Expected:
# Found 1 tracer config file(s):
#   - tracer.py
# ⚠️ Manual check needed: Verify TraceIdRatioBased(0.10) in config
```
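
Part of the manual check can be automated with a text search. The sketch below looks for literal `TraceIdRatioBased(...)` calls in source text and extracts the ratio. It is a heuristic: it only finds numeric literals, so a rate passed through a variable or an environment variable will not be detected. The helper names are illustrative.

```python
import re

# Matches TraceIdRatioBased(0.10) and TraceIdRatioBased(rate=0.10) with a numeric literal
SAMPLER_CALL = re.compile(r"TraceIdRatioBased\(\s*(?:rate\s*=\s*)?([0-9]*\.?[0-9]+)\s*\)")


def find_sampling_ratios(source):
    """Return every literal sampling ratio found in the given source text."""
    return [float(match) for match in SAMPLER_CALL.findall(source)]


def check_sampling(source, expected=0.10):
    """Describe whether the source configures the expected sampling ratio."""
    ratios = find_sampling_ratios(source)
    if not ratios:
        return "no literal TraceIdRatioBased(...) call found"
    if all(abs(r - expected) < 1e-9 for r in ratios):
        return f"sampling ratio is {expected:.2f}"
    return f"unexpected sampling ratio(s): {ratios}"


# Example with an inline snippet; in practice, pass the contents of your tracer file
print(check_sampling("sampler = ParentBased(TraceIdRatioBased(0.10))"))
```

To scan a real file, read it first, for example `check_sampling(open("tracer.py").read())`, where `tracer.py` stands for whichever configuration file your project uses.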

## Section 4: Readiness Check #3 - Custom Span Attributes

**Checkpoint:** Verify custom attributes (llm.model, llm.prompt_tokens, llm.completion_tokens) appear in traces  
**Location:** Jaeger UI → Click on `openai.generate` span → Attributes section  
**Success Criteria:** Attributes show model name and token counts

**Impact if failing:** M7.2 adds more attributes (function names, memory). If the basic attributes don't work, the advanced ones won't either. Catching this now saves 1-2 hours of debugging.

### Print Manual Verification Instructions

The following cell outputs step-by-step instructions for manually verifying custom span attributes in the Jaeger UI.

```python
# Print the manual verification steps (attributes must be inspected in the Jaeger UI)
print("⚠️ Manual verification required:")
print("  1. Open Jaeger UI at http://localhost:16686")
print("  2. Select a trace and click on 'openai.generate' span")
print("  3. Verify attributes section shows:")
print("     - llm.model (e.g., 'gpt-4')")
print("     - llm.prompt_tokens (e.g., 1850)")
print("     - llm.completion_tokens (e.g., 420)")

# Expected:
# ⚠️ Manual verification required:
# (followed by the numbered steps printed above)
```
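
If you export a trace as JSON from Jaeger, the attribute check can be scripted. The sketch below assumes the common Jaeger JSON layout, where each span has an `operationName` and a `tags` list of `{"key": ..., "value": ...}` entries; verify that against your own export. The sample trace is made up for illustration.

```python
REQUIRED_ATTRIBUTES = ("llm.model", "llm.prompt_tokens", "llm.completion_tokens")


def missing_attributes(trace, span_name="openai.generate", required=REQUIRED_ATTRIBUTES):
    """Return required attribute keys absent from the first span named `span_name`.

    Returns None when the trace contains no span with that name.
    """
    for span in trace.get("spans", []):
        if span.get("operationName") == span_name:
            present = {tag["key"] for tag in span.get("tags", [])}
            return [key for key in required if key not in present]
    return None


# Illustrative trace in the assumed Jaeger JSON shape (values are made up)
sample_trace = {
    "spans": [
        {"operationName": "POST /query", "tags": []},
        {
            "operationName": "openai.generate",
            "tags": [
                {"key": "llm.model", "type": "string", "value": "gpt-4"},
                {"key": "llm.prompt_tokens", "type": "int64", "value": 1850},
            ],
        },
    ]
}

print(missing_attributes(sample_trace))
```

An empty list means every required attribute is present; here the sample span deliberately omits `llm.completion_tokens` to show a failing case.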

## Section 5: Readiness Check #4 - Dependencies Installed

**Checkpoint:** Verify OpenTelemetry and Datadog dependencies are installed  
**Required packages:** `opentelemetry-api`, `opentelemetry-sdk`, `ddtrace`  
**Success Criteria:** Import test succeeds without errors

**Impact if failing:** M7.2 uses both OpenTelemetry and the Datadog SDK, so missing dependencies cause import errors. Catching this now saves about 30 minutes of troubleshooting.

### Test Required Package Imports

The following cell attempts to import OpenTelemetry and Datadog packages. If packages are missing, it prints installation instructions.

```python
# Test dependency imports
deps = {"opentelemetry": False, "ddtrace": False}

try:
    # Import concrete API and SDK modules: `opentelemetry` is a namespace
    # package, so a bare `import opentelemetry` can succeed without the SDK
    import opentelemetry.trace
    import opentelemetry.sdk.trace
    deps["opentelemetry"] = True
    print("✓ opentelemetry installed")
except ImportError:
    print("✗ opentelemetry NOT installed (run: pip install opentelemetry-api opentelemetry-sdk)")

try:
    import ddtrace
    deps["ddtrace"] = True
    print("✓ ddtrace installed")
except ImportError:
    print("✗ ddtrace NOT installed (run: pip install ddtrace)")

if all(deps.values()):
    print("\n✓ All dependencies ready for M7.2")
else:
    missing = [name for name, ok in deps.items() if not ok]
    print(f"\n✗ Missing dependencies: {', '.join(missing)}")

# Expected:
# ✓ opentelemetry installed
# ✓ ddtrace installed
#
# ✓ All dependencies ready for M7.2
```
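
It can also help to record which versions are installed without importing the packages. The sketch below uses the standard library's `importlib.metadata`; the distribution names are the pip package names from the install hints above.

```python
from importlib import metadata

DISTRIBUTIONS = ("opentelemetry-api", "opentelemetry-sdk", "ddtrace")


def installed_version(dist_name):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


for name in DISTRIBUTIONS:
    version = installed_version(name)
    print(f"{name}: {version or 'NOT installed'}")
```

This only reads package metadata, so it reports what pip installed but does not prove the imports work; it complements the import test rather than replacing it.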

## Section 6: CALL-FORWARD - What M7.2 Will Introduce

### The Gap: Traces Show WHAT is Slow, Not WHY

**Current state (M7.1):**  
You see `openai.generate` took 3,500ms in Jaeger — but you don't know WHY.

**Questions you CAN'T answer yet:**
- Is 3.5s normal for GPT-4 with 1850 tokens? Or is this abnormal?
- Are you sending redundant data in the prompt (wasting tokens)?
- Which FUNCTION inside `generate_response()` is the bottleneck?
- Is there a memory leak degrading performance over time?

### M7.2: Application Performance Monitoring with Datadog APM

**Three capabilities you'll gain:**

**1. Function-Level Performance Breakdown**
- See which function inside `openai.generate` span is slow
- Automatic profiling: no manual instrumentation needed
- Example: Discover `serialize_prompt()` takes 800ms (can optimize serialization)
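
To make "function-level breakdown" concrete before M7.2, here is a minimal sketch that uses Python's built-in `cProfile` as a stand-in for Datadog's profiler (it is not the Datadog API). The functions `serialize_prompt`, `call_model`, and `generate_response` are hypothetical, and the sleeps are scaled-down placeholders for real work.

```python
import cProfile
import pstats
import time


def serialize_prompt():
    time.sleep(0.08)  # placeholder for slow serialization


def call_model():
    time.sleep(0.02)  # placeholder for the model call


def generate_response():
    serialize_prompt()
    call_model()


def cumulative_seconds(func):
    """Profile one call of `func` and return cumulative seconds per function name."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    # stats maps (file, line, name) -> (call count, primitive calls, total time, cumulative time, callers)
    stats = pstats.Stats(profiler).stats
    return {name: ct for (_file, _line, name), (_cc, _nc, _tt, ct, _callers) in stats.items()}


timings = cumulative_seconds(generate_response)
for name in ("generate_response", "serialize_prompt", "call_model"):
    print(f"{name:<20} {timings[name] * 1000:6.0f} ms")
```

A trace would show only the total for `generate_response`; the per-function rows are the extra detail a profiler contributes.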

**2. Memory Leak Detection**
- Monitor heap growth over time
- Identify objects not being garbage collected
- Example: Find that `document_cache` dict grows unbounded (add eviction policy)
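
The unbounded-cache example can be sketched with the standard library. This uses `tracemalloc` to measure heap growth (a stand-in for what an APM memory view would chart, not Datadog's tooling) and a small LRU cache adapted from the `OrderedDict` recipe in the Python documentation as one possible eviction policy. The class name and sizes are illustrative.

```python
import tracemalloc
from collections import OrderedDict


class BoundedCache(OrderedDict):
    """Dict that evicts the least recently set entry once it exceeds maxsize."""

    def __init__(self, maxsize=100):
        self.maxsize = maxsize
        super().__init__()

    def __setitem__(self, key, value):
        if key in self:
            self.move_to_end(key)
        super().__setitem__(key, value)
        if len(self) > self.maxsize:
            del self[next(iter(self))]  # drop the oldest entry


def heap_growth_bytes(cache, n_docs=2_000, doc_size=1_000):
    """Insert n_docs entries and return how much traced memory grew."""
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    for i in range(n_docs):
        cache[f"doc-{i}"] = "x" * doc_size
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before


unbounded_growth = heap_growth_bytes({})
bounded_growth = heap_growth_bytes(BoundedCache(maxsize=100))
print(f"unbounded: {unbounded_growth / 1024:.0f} KiB, bounded: {bounded_growth / 1024:.0f} KiB")
```

The plain dict keeps every entry, so its growth scales with the number of documents, while the bounded cache levels off near `maxsize` entries.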

**3. Database Query Optimization**
- See exact SQL queries being executed
- Identify N+1 query problems
- Example: Discover 50 individual SELECT queries (should be 1 JOIN)
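
The N+1 pattern is easy to reproduce with an in-memory SQLite database. In this sketch the `documents` and `chunks` tables are hypothetical, and `set_trace_callback` records each executed statement, roughly what a query-level APM view would list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE chunks (id INTEGER PRIMARY KEY, document_id INTEGER, body TEXT);
""")
conn.executemany("INSERT INTO documents VALUES (?, ?)", [(i, f"doc-{i}") for i in range(50)])
conn.executemany("INSERT INTO chunks VALUES (?, ?, ?)", [(i, i, f"chunk-{i}") for i in range(50)])
conn.commit()

executed = []  # every SQL statement SQLite runs from here on
conn.set_trace_callback(executed.append)


def titles_n_plus_one():
    """One query for the chunks, then one more query per chunk."""
    rows = conn.execute("SELECT id, document_id FROM chunks ORDER BY id").fetchall()
    return [
        conn.execute("SELECT title FROM documents WHERE id = ?", (doc_id,)).fetchone()[0]
        for _chunk_id, doc_id in rows
    ]


def titles_with_join():
    """The same result from a single JOIN."""
    sql = "SELECT d.title FROM chunks c JOIN documents d ON d.id = c.document_id ORDER BY c.id"
    return [title for (title,) in conn.execute(sql)]


def count_selects(func):
    """Run func and count the SELECT statements it triggered."""
    executed.clear()
    result = func()
    return result, sum(1 for stmt in executed if stmt.lstrip().upper().startswith("SELECT"))


slow_titles, slow_count = count_selects(titles_n_plus_one)
fast_titles, fast_count = count_selects(titles_with_join)
print(f"N+1 pattern: {slow_count} SELECTs, JOIN: {fast_count} SELECT")
```

Both functions return the same titles; only the number of round trips differs, which is the signal to look for in query-level monitoring.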

### M7.2 Setup: One-Line Integration
```bash
ddtrace-run python main.py
```
- Automatic instrumentation: FastAPI, Redis, OpenAI, Pinecone
- Correlates with your existing OpenTelemetry traces
- Function-level flame graphs showing CPU time

### Driving Question for M7.2:
**"Traces show `openai.generate` took 3.5 seconds. Which FUNCTION inside that span is the bottleneck?"**

Ready to dive deeper into code-level observability!