# Notebook 16: End-to-End Production Observability Stack

**Complete Integration: CUDA Inference + OpenTelemetry + Unified Visualizations**

---

## Objectives Demonstrated

✅ **CUDA Inference** (GPU 0) - Production-grade inference pipeline

✅ **LLM Observability** (GPU 0) - Full OpenTelemetry + llama.cpp metrics

✅ **Unified Visualizations** (GPU 1) - Graphistry 2D + Plotly 3D/2D integrated dashboard

---

## Overview

This notebook integrates the three core objectives of llamatelemetry into a single production observability stack. It combines:
- CUDA-optimized LLM inference on GPU 0
- Multi-layer observability (OpenTelemetry + llama.cpp + GPU metrics)
- Unified visualization dashboard mixing Graphistry graph viz + Plotly charts

**What You'll Build:**
- Production-ready inference pipeline with full instrumentation
- Multi-source telemetry collection (traces, metrics, logs, GPU stats)
- Unified dashboard showing:
  - Request trace graphs (Graphistry 2D)
  - Performance metrics charts (Plotly 2D)
  - 3D model internals visualization (Plotly 3D)
  - Real-time monitoring panels
- Complete observability stack deployment

**Time:** 45 minutes

**Difficulty:** Expert

**VRAM:** GPU 0: 6-10 GB, GPU 1: 3-5 GB

---

## Part 1: Environment Setup (5 min)

### Cell 1: Install llamatelemetry v0.1.0

In [None]:
# Install llamatelemetry v0.1.0
!pip install -q --no-cache-dir git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0

import llamatelemetry
print(f"✅ llamatelemetry {llamatelemetry.__version__} installed")

### Cell 2: Install Observability Stack

In [None]:
# Install OpenTelemetry and monitoring tools
!pip install -q \
    opentelemetry-api==1.37.0 \
    opentelemetry-sdk==1.37.0 \
    opentelemetry-exporter-otlp-proto-grpc==1.37.0 \
    opentelemetry-instrumentation \
    pynvml requests

print("✅ Observability stack installed")

### Cell 3: Install Visualization Stack

In [None]:
# Install visualization tools
!pip install -q \
    plotly pandas numpy \
    pygraphistry \
    umap-learn scikit-learn

# Install RAPIDS for GPU-accelerated graph analytics (optional)
!pip install -q --extra-index-url=https://pypi.nvidia.com "cugraph-cu12==25.6.*" "cudf-cu12==25.6.*"

print("✅ Visualization stack installed")

### Cell 4: Verify Dual GPU Setup

In [None]:
# Verify dual GPU environment
!nvidia-smi --query-gpu=index,name,memory.total,compute_cap --format=csv,noheader

import torch
print(f"\nFound {torch.cuda.device_count()} GPUs:")
for i in range(torch.cuda.device_count()):
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")

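if torch.cuda.device_count() >= 2:
    print("\n✅ Dual GPU environment verified")
    print("   GPU 0 → LLM Inference + Observability")
    print("   GPU 1 → Unified Visualization Dashboard")
else:
    print("\n⚠️ Fewer than 2 GPUs detected; the GPU 1 visualization steps assume a second device")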

---

## Part 2: Multi-Layer Observability Setup (10 min)

### Cell 5: Configure Resource Attributes (GPU Context)

In [None]:
from opentelemetry.sdk.resources import Resource

resource = Resource.create({
    "service.name": "llamatelemetry-production",
    "service.version": "0.1.0",
    "deployment.environment": "kaggle",
    "host.name": "kaggle-t4-dual",
    "gpu.model": "Tesla T4",
    "gpu.count": 2,
    "gpu.compute_capability": "7.5",
    "llm.framework": "llama.cpp",
    "llm.backend": "gguf",
})

print("✅ Resource attributes configured")
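
The GPU attributes in Cell 5 are hard-coded for a dual Tesla T4 host. As a hedged alternative, the sketch below derives them from the same `nvidia-smi` query used in Cell 4. The helper names `parse_gpu_query` and `detect_gpu_attributes` are hypothetical, and the parser assumes every GPU is the same model and that no field contains a comma.

```python
import subprocess

def parse_gpu_query(csv_text: str) -> dict:
    """Parse `nvidia-smi --query-gpu=index,name,memory.total,compute_cap
    --format=csv,noheader` output into flat resource attributes."""
    rows = [
        [field.strip() for field in line.split(",")]
        for line in csv_text.strip().splitlines()
        if line.strip()
    ]
    if not rows:
        return {"gpu.count": 0}
    # Assumption: all GPUs are identical, so the first row describes them all.
    _, name, _, compute_cap = rows[0]
    return {
        "gpu.model": name,
        "gpu.count": len(rows),
        "gpu.compute_capability": compute_cap,
    }

def detect_gpu_attributes() -> dict:
    """Run nvidia-smi; return an empty mapping when it is unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=index,name,memory.total,compute_cap",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return {}
    return parse_gpu_query(out)

# Illustrative sample of the CSV output for two T4 cards.
sample = "0, Tesla T4, 15360 MiB, 7.5\n1, Tesla T4, 15360 MiB, 7.5\n"
print(parse_gpu_query(sample))
```

The returned mapping could be merged into the dictionary passed to `Resource.create` so the attributes follow the actual hardware.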

### Cell 6: Setup Complete OpenTelemetry Stack

In [None]:
from opentelemetry import trace, metrics, _logs
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk._logs import LoggerProvider
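from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    ConsoleSpanExporter,
)
# Import InMemorySpanExporter from the submodule where it is defined
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter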
from opentelemetry.sdk.metrics.export import (
    PeriodicExportingMetricReader,
    ConsoleMetricExporter,
)
from opentelemetry.sdk._logs.export import (
    BatchLogRecordProcessor,
    ConsoleLogExporter,
)

# Tracing
memory_span_exporter = InMemorySpanExporter()
tracer_provider = TracerProvider(resource=resource)
tracer_provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
tracer_provider.add_span_processor(BatchSpanProcessor(memory_span_exporter))
trace.set_tracer_provider(tracer_provider)
tracer = trace.get_tracer(__name__)

# Metrics
meter_provider = MeterProvider(
    resource=resource,
    metric_readers=[
        PeriodicExportingMetricReader(
            ConsoleMetricExporter(),
            export_interval_millis=10000,
        )
    ],
)
metrics.set_meter_provider(meter_provider)
meter = metrics.get_meter(__name__)

# Logging
logger_provider = LoggerProvider(resource=resource)
logger_provider.add_log_record_processor(BatchLogRecordProcessor(ConsoleLogExporter()))
_logs.set_logger_provider(logger_provider)
logger = _logs.get_logger(__name__)

print("✅ OpenTelemetry stack initialized")

### Cell 7: Create Custom Instruments

In [None]:
# Counters
request_counter = meter.create_counter(
    "llm.requests.total",
    description="Total LLM requests",
    unit="1",
)

error_counter = meter.create_counter(
    "llm.errors.total",
    description="Total LLM errors",
    unit="1",
)

# Histograms
latency_histogram = meter.create_histogram(
    "llm.request.duration",
    description="Request latency distribution",
    unit="ms",
)

token_histogram = meter.create_histogram(
    "llm.tokens.count",
    description="Token count distribution",
    unit="{token}",
)

# Observable Gauges
def get_gpu_memory_callback(options):
    """Callback for GPU memory observable gauge"""
    import pynvml
    try:
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        memory_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        yield metrics.Observation(
            value=memory_info.used / 1024**2,  # MB
            attributes={"gpu.id": "0"}
        )
    except pynvml.NVMLError:
        return  # NVML unavailable: emit no observation this cycle

gpu_memory_gauge = meter.create_observable_gauge(
    "gpu.memory.used",
    callbacks=[get_gpu_memory_callback],
    description="GPU memory usage",
    unit="MB",
)

print("✅ Custom instruments created")
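
A histogram instrument such as `llm.request.duration` does not keep each measurement; with explicit-bucket aggregation it keeps one count per bucket. The following is a minimal sketch of that bookkeeping in plain Python, not the SDK's implementation. `BOUNDARIES_MS` and `bucketize` are illustrative names, the boundary values are an assumption chosen to resemble common millisecond defaults, and each bucket is treated as upper-inclusive.

```python
from bisect import bisect_left

# Assumed bucket boundaries in milliseconds (illustrative, not read from the SDK).
BOUNDARIES_MS = [0, 5, 10, 25, 50, 75, 100, 250, 500, 750,
                 1000, 2500, 5000, 7500, 10000]

def bucketize(values, boundaries=BOUNDARIES_MS):
    """Count values per bucket; bucket i covers (boundaries[i-1], boundaries[i]]."""
    counts = [0] * (len(boundaries) + 1)  # last slot is the overflow bucket
    for value in values:
        counts[bisect_left(boundaries, value)] += 1
    return counts

# Hypothetical request latencies in milliseconds.
latencies_ms = [420.0, 480.5, 510.2, 950.0, 1200.0, 3100.0]
counts = bucketize(latencies_ms)

# Print only the non-empty buckets with their upper bound.
for index, count in enumerate(counts):
    if count:
        upper = BOUNDARIES_MS[index] if index < len(BOUNDARIES_MS) else float("inf")
        print(f"<= {upper} ms: {count}")
```

Because only bucket counts survive, percentiles derived from an exported histogram are estimates whose resolution depends on the boundaries.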

### Cell 8: Define Unified Metrics Collector

In [None]:
import requests
import time
import threading
from collections import defaultdict
import pandas as pd
import pynvml

class UnifiedMetricsCollector:
    """Collects metrics from all observability sources"""

    def __init__(self, server_url: str, tracer, memory_exporter):
        self.server_url = server_url
        self.tracer = tracer
        self.memory_exporter = memory_exporter
        self.running = False
        self.lock = threading.Lock()

        # Storage
        self.otel_spans = []
        self.llama_metrics = defaultdict(list)
        self.gpu_metrics = []
        self.model_internals = {}
        self.timestamps = []

        # Initialize PyNVML
        try:
            pynvml.nvmlInit()
        except pynvml.NVMLError:
            pass  # GPU metrics will be skipped if NVML is unavailable

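    def collect_otel_spans(self):
        """Snapshot spans from the memory exporter"""
        # get_finished_spans() returns every span exported so far, so replace
        # the stored list; extending it would duplicate spans on every cycle
        spans = self.memory_exporter.get_finished_spans()
        with self.lock:
            self.otel_spans = list(spans)
        return len(spans)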

    def collect_llama_metrics(self):
        """Poll llama.cpp /metrics endpoint"""
        try:
            response = requests.get(f"{self.server_url}/metrics", timeout=2)
            if response.status_code == 200:
                # Parse Prometheus metrics (simplified)
                metrics = {}
                for line in response.text.split("\n"):
                    if line.startswith("llamacpp:"):
                        parts = line.split()
                        if len(parts) >= 2:
                            name = parts[0]
                            value = float(parts[1])
                            metrics[name] = value

                with self.lock:
                    for key, value in metrics.items():
                        self.llama_metrics[key].append(value)
                return metrics
        except (requests.RequestException, ValueError):
            pass  # Server unreachable or a value failed to parse
        return {}

    def collect_gpu_metrics(self):
        """Collect GPU metrics via PyNVML"""
        try:
            handle = pynvml.nvmlDeviceGetHandleByIndex(0)
            utilization = pynvml.nvmlDeviceGetUtilizationRates(handle)
            memory = pynvml.nvmlDeviceGetMemoryInfo(handle)
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # W

            gpu_data = {
                "timestamp": time.time(),
                "utilization": utilization.gpu,
                "memory_used_mb": memory.used / 1024**2,
                "memory_total_mb": memory.total / 1024**2,
                "temperature_c": temp,
                "power_w": power,
            }

            with self.lock:
                self.gpu_metrics.append(gpu_data)
            return gpu_data
        except pynvml.NVMLError:
            return {}

    def collect_all(self):
        """Single collection cycle across all sources"""
        timestamp = time.time()

        otel_count = self.collect_otel_spans()
        llama_metrics = self.collect_llama_metrics()
        gpu_data = self.collect_gpu_metrics()

        with self.lock:
            self.timestamps.append(timestamp)

        return {
            "timestamp": timestamp,
            "otel_spans": otel_count,
            "llama_metrics": len(llama_metrics),
            "gpu_data": bool(gpu_data),
        }

    def start_background_collection(self, interval: float = 1.0):
        """Start continuous collection in background"""
        self.running = True

        def collect_loop():
            while self.running:
                self.collect_all()
                time.sleep(interval)

        thread = threading.Thread(target=collect_loop, daemon=True)
        thread.start()
        print(f"📊 Started unified metrics collection (interval={interval}s)")

    def stop_background_collection(self):
        """Stop collection"""
        self.running = False
        print("⏹️ Stopped metrics collection")

    def get_summary(self):
        """Get collection summary"""
        with self.lock:
            return {
                "total_spans": len(self.otel_spans),
                "llama_metrics_count": len(self.llama_metrics),
                "gpu_samples": len(self.gpu_metrics),
                "collection_duration": self.timestamps[-1] - self.timestamps[0] if self.timestamps else 0,
            }

# Initialize unified collector (will be used after server starts)
print("✅ Unified metrics collector class defined")
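
The parser inside `collect_llama_metrics` splits each line on whitespace, which is enough for unlabeled samples. A slightly more defensive sketch is shown here, assuming the endpoint returns the Prometheus text exposition format: it skips comment lines, tolerates label sets and optional timestamps, and ignores values that do not parse. The function name `parse_prometheus_text` and the sample payload are hypothetical.

```python
def parse_prometheus_text(text: str, prefix: str = "llamacpp:") -> dict:
    """Return {sample_name: value} for samples whose name starts with prefix."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or HELP/TYPE comment
        if "}" in line:
            # Labeled sample: name{label="v"} value [timestamp]
            head, _, rest = line.rpartition("}")
            name = head + "}"
        else:
            # Unlabeled sample: name value [timestamp]
            parts = line.split(None, 1)
            name, rest = parts[0], (parts[1] if len(parts) > 1 else "")
        fields = rest.split()
        if not name.startswith(prefix) or not fields:
            continue
        try:
            samples[name] = float(fields[0])
        except ValueError:
            continue  # skip one bad value instead of dropping the whole scrape
    return samples

# Hypothetical payload for illustration only.
sample_payload = """\
# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total 1234
llamacpp:requests_processing{slot="0"} 1 1700000000000
process_cpu_seconds_total 3.5
llamacpp:broken not-a-number
"""
print(parse_prometheus_text(sample_payload))
```

Skipping a single malformed line keeps the rest of the scrape, whereas an exception raised inside the original loop discards every metric from that poll.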

---

## Part 3: Start Instrumented Inference Pipeline (5 min)

### Cell 9: Download GGUF Model

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="unsloth/Qwen2.5-3B-Instruct-GGUF",
    filename="Qwen2.5-3B-Instruct-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",
)
print(f"✅ Model: {model_path}")

### Cell 10: Start llama-server with Full Instrumentation

In [None]:
from llamatelemetry.server import ServerManager

server = ServerManager(server_url="http://127.0.0.1:8090")
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    tensor_split="1.0,0.0",  # GPU 0 only
    flash_attn=1,
    n_parallel=8,  # 8 parallel slots
    port=8090,
    extra_args=["--metrics", "--slots"],  # Enable observability endpoints
)
print("✅ Server started with full instrumentation")
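
Cell 11 starts polling right after `start_server` returns. Whether `ServerManager` already blocks until the model is loaded is not shown here, so a readiness poll is a cheap safeguard. The sketch below assumes llama-server's `/health` endpoint answers HTTP 200 once the model is ready; `http_ok` and `wait_until_ready` are hypothetical helpers, not part of llamatelemetry.

```python
import time
import urllib.error
import urllib.request

def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True when the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and non-2xx responses.
        return False

def wait_until_ready(probe, timeout_s: float = 60.0, interval_s: float = 0.5) -> bool:
    """Call probe() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)

# Usage against the server from Cell 10 (not executed here):
# ready = wait_until_ready(lambda: http_ok("http://127.0.0.1:8090/health"))
```

Passing the probe as a callable keeps the retry loop testable without a running server.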

### Cell 11: Start Background Metrics Collection

In [None]:
# Initialize collector now that server is running
collector = UnifiedMetricsCollector(
    server_url="http://127.0.0.1:8090",
    tracer=tracer,
    memory_exporter=memory_span_exporter,
)

collector.start_background_collection(interval=1.0)
time.sleep(3)  # Let it collect initial data
print(f"📊 Collecting metrics... {collector.get_summary()}")
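
`UnifiedMetricsCollector` stops its loop by flipping a boolean, so `stop_background_collection` returns while the thread may still be sleeping or in the middle of a collection cycle. One alternative, sketched under the assumption that a clean shutdown matters, is a `threading.Event` that both interrupts the sleep and lets the caller join the thread. `PeriodicWorker` is a hypothetical class, not part of the notebook's API.

```python
import threading
import time

class PeriodicWorker:
    """Run task() every interval_s seconds until stop() is called."""

    def __init__(self, task, interval_s: float = 1.0):
        self._task = task
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = None

    def start(self):
        if self._thread is not None and self._thread.is_alive():
            return  # already running
        self._stop.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while not self._stop.is_set():
            self._task()
            # wait() returns early as soon as stop() sets the event.
            self._stop.wait(self._interval_s)

    def stop(self, join_timeout_s: float = 5.0):
        self._stop.set()
        if self._thread is not None:
            self._thread.join(join_timeout_s)

# Demonstration with a trivial task.
ticks = []
worker = PeriodicWorker(lambda: ticks.append(time.monotonic()), interval_s=0.01)
worker.start()
time.sleep(0.1)
worker.stop()
print(f"task ran {len(ticks)} times")
```

With the collector from Cell 11, the equivalent call would be `PeriodicWorker(collector.collect_all, interval_s=1.0)`.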

### Cell 12: Create Production Inference Client

In [None]:
from llamatelemetry.api import LlamaCppClient
from opentelemetry.trace import Status, StatusCode
import time

class ProductionLLMClient:
    """Production LLM client with full instrumentation"""

    def __init__(self, base_url: str, tracer, meter):
        self.client = LlamaCppClient(base_url)
        self.tracer = tracer
        self.request_counter = request_counter
        self.latency_histogram = latency_histogram
        self.token_histogram = token_histogram

    def chat_completion(self, messages: list, **kwargs):
        model = kwargs.get("model", "unknown")
        max_tokens = kwargs.get("max_tokens", 100)
        temperature = kwargs.get("temperature", 0.7)

        with self.tracer.start_as_current_span(
            name=f"llm.chat.{model}",
            kind=trace.SpanKind.CLIENT,
        ) as span:
            try:
                span.set_attribute("llm.system", "llama.cpp")
                span.set_attribute("llm.model", model)
                span.set_attribute("llm.request.max_tokens", max_tokens)
                span.set_attribute("llm.request.temperature", temperature)
                span.set_attribute("llm.request.messages", len(messages))

                start_time = time.time()
                response = self.client.chat.completions.create(
                    messages=messages,
                    **kwargs
                )
                latency_ms = (time.time() - start_time) * 1000

                finish_reason = response.choices[0].finish_reason
                content = response.choices[0].message.content

                span.set_attribute("llm.response.finish_reason", finish_reason)
                span.set_attribute("llm.response.length", len(content))

                self.request_counter.add(
                    1,
                    attributes={
                        "model": model,
                        "finish_reason": finish_reason,
                        "status": "success",
                    }
                )
                self.latency_histogram.record(
                    latency_ms,
                    attributes={"model": model, "status": "success"}
                )

                if hasattr(response, 'usage'):
                    input_tokens = getattr(response.usage, 'prompt_tokens', 0)
                    output_tokens = getattr(response.usage, 'completion_tokens', 0)

                    span.set_attribute("llm.usage.input_tokens", input_tokens)
                    span.set_attribute("llm.usage.output_tokens", output_tokens)

                    self.token_histogram.record(
                        input_tokens,
                        attributes={"model": model, "token_type": "input"}
                    )
                    self.token_histogram.record(
                        output_tokens,
                        attributes={"model": model, "token_type": "output"}
                    )

                span.set_status(Status(StatusCode.OK))
                return response

            except Exception as e:
                span.set_status(Status(StatusCode.ERROR, str(e)))
                span.record_exception(e)
                self.request_counter.add(
                    1,
                    attributes={
                        "model": model,
                        "status": "error",
                        "error_type": type(e).__name__,
                    }
                )
                raise

client = ProductionLLMClient("http://127.0.0.1:8090", tracer, meter)
print("✅ Production LLM client initialized")
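
`chat_completion` records latency only on the success path and measures it with `time.time()`, which can jump if the system clock is adjusted. A hedged sketch of a timing helper built on the monotonic `time.perf_counter()` follows; it reports a duration for both outcomes. The name `timed` and the `record(ms, status)` callback signature are assumptions made for illustration.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(record):
    """Measure the wrapped block and call record(elapsed_ms, status)."""
    start = time.perf_counter()
    status = "success"
    try:
        yield
    except Exception:
        status = "error"
        raise  # the caller still sees the original exception
    finally:
        record((time.perf_counter() - start) * 1000.0, status)

# Demonstration: collect measurements in a list instead of a histogram.
measurements = []

with timed(lambda ms, status: measurements.append((ms, status))):
    time.sleep(0.01)

try:
    with timed(lambda ms, status: measurements.append((ms, status))):
        raise ValueError("simulated failure")
except ValueError:
    pass

print([status for _, status in measurements])
```

In the client, the callback could forward to `latency_histogram.record(ms, attributes={"status": status})` so failed requests also appear in the latency distribution.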

### Cell 13: Generate Sample Load

In [None]:
test_prompts = [
    "Explain CUDA programming",
    "What is quantization?",
    "Describe transformer architecture",
    "How does FlashAttention work?",
    "What is GGUF format?",
]

print("🚀 Generating sample requests...")
for i, prompt in enumerate(test_prompts * 3):  # 15 total requests
    response = client.chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
    )
    print(f"  Request {i+1}/15 complete")
    time.sleep(0.5)

print(f"✅ Generated load. Metrics: {collector.get_summary()}")
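
The loop above sends requests one at a time, so at most one of the eight slots configured with `n_parallel=8` is busy. To exercise the slots, requests can be issued from a thread pool. The sketch below is one way to do that with the standard library; `run_concurrent` is a hypothetical helper and `send` stands for any callable that performs a single request.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_concurrent(send, prompts, max_workers: int = 8):
    """Call send(prompt) for each prompt in parallel.

    Returns (results, errors): results keeps the input order with None for
    failed calls; errors is a list of (index, exception) pairs.
    """
    results = [None] * len(prompts)
    errors = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(send, p): i for i, p in enumerate(prompts)}
        for future in as_completed(futures):
            index = futures[future]
            try:
                results[index] = future.result()
            except Exception as exc:
                errors.append((index, exc))
    return results, errors

# Demonstration with a stand-in for the real request function.
def fake_send(prompt):
    if prompt == "bad":
        raise RuntimeError("simulated failure")
    return prompt.upper()

demo_results, demo_errors = run_concurrent(fake_send, ["a", "bad", "c"])
print(demo_results, [index for index, _ in demo_errors])

# With the client from Cell 12 (not executed here):
# run_concurrent(
#     lambda p: client.chat_completion(
#         messages=[{"role": "user", "content": p}], max_tokens=100),
#     test_prompts * 3,
# )
```

Matching `max_workers` to the number of server slots avoids queueing more requests than the server can process at once.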

---

## Part 4: Unified Visualization Dashboard (GPU 1) (20 min)

### Cell 14: Switch to GPU 1

In [None]:
import os
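# Only affects CUDA contexts created after this point and child processes
# started later; an already running llama-server is unaffected.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
print("🔄 Switched to GPU 1 for visualizations")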

### SECTION 1: Request Trace Graphs (Graphistry 2D)

### Cell 15: Setup Graphistry

In [None]:
import graphistry
from kaggle_secrets import UserSecretsClient

secrets = UserSecretsClient()
graphistry.register(
    api=3,
    username=secrets.get_secret("Graphistry_Username"),
    personal_key_id=secrets.get_secret("Graphistry_Personal_Key_ID"),
    personal_key_secret=secrets.get_secret("Graphistry_Personal_Key_Secret"),
)

print("✅ Graphistry configured")

### Cell 16: Transform Spans to Graph Data

In [None]:
import pandas as pd

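# Flush batched spans, then copy the list so the background collector
# cannot change it while it is being iterated
tracer_provider.force_flush()
collector.collect_otel_spans()
with collector.lock:
    spans = list(collector.otel_spans)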

span_data = []
for span in spans:
    span_data.append({
        "span_id": format(span.context.span_id, "016x"),
        "parent_span_id": format(span.parent.span_id, "016x") if span.parent else None,
        "trace_id": format(span.context.trace_id, "032x"),
        "name": span.name,
        "duration_ms": (span.end_time - span.start_time) / 1_000_000,
        "status": span.status.status_code.name,
        "model": span.attributes.get("llm.model", "unknown") if span.attributes else "unknown",
    })

df_spans = pd.DataFrame(span_data)

edges = []
for _, span in df_spans.iterrows():
    if span["parent_span_id"]:
        edges.append({
            "source": span["parent_span_id"],
            "destination": span["span_id"],
        })

df_edges = pd.DataFrame(edges) if edges else pd.DataFrame(columns=["source", "destination"])

print(f"📊 Spans: {len(df_spans)}, Edges: {len(df_edges)}")
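
`ProductionLLMClient` opens a single span per request and never creates child spans, so every span collected here is a root and `df_edges` is expected to be empty. The plain-Python sketch below shows the same parent-to-child edge construction on hypothetical span dictionaries, with an optional fallback (an assumption, not something Graphistry requires) that links each root span to its trace ID so the graph still has edges.

```python
def build_edges(spans, link_roots_to_trace=False):
    """Build source/destination edges from span dicts.

    Each span dict needs "span_id", "parent_span_id" (or None), and "trace_id".
    """
    known_ids = {s["span_id"] for s in spans}
    edges = []
    for s in spans:
        parent = s.get("parent_span_id")
        if parent in known_ids:
            edges.append({"source": parent, "destination": s["span_id"]})
        elif parent is None and link_roots_to_trace:
            # Synthetic node per trace so that flat traces are still connected.
            edges.append({"source": "trace:" + s["trace_id"],
                          "destination": s["span_id"]})
        # Spans whose parent was never exported are left unconnected.
    return edges

# Hypothetical spans: one nested pair and one orphan.
nested_spans = [
    {"span_id": "a1", "parent_span_id": None, "trace_id": "t1"},
    {"span_id": "b2", "parent_span_id": "a1", "trace_id": "t1"},
    {"span_id": "c3", "parent_span_id": "zz", "trace_id": "t2"},
]
# Flat spans, as produced by one span per request.
flat_spans = [
    {"span_id": "r1", "parent_span_id": None, "trace_id": "t1"},
    {"span_id": "r2", "parent_span_id": None, "trace_id": "t2"},
]

print(build_edges(nested_spans))
print(build_edges(flat_spans))
print(build_edges(flat_spans, link_roots_to_trace=True))
```

Wrapping the HTTP call in a nested `tracer.start_as_current_span(...)` inside `chat_completion` would be another way to obtain real parent-child edges.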

### Cell 17: Create Graphistry Trace Visualization

In [None]:
if len(df_edges) > 0:
    g = graphistry.edges(df_edges, "source", "destination")
    g = g.nodes(df_spans, "span_id")
    g = g.bind(
        point_title="name",
        point_size="duration_ms",
        point_color="status",
    )
    g = g.encode_point_color("status", categorical_mapping={
        "OK": "#4CAF50", "ERROR": "#F44336", "UNSET": "#9E9E9E"
    }, as_categorical=True)

    url_traces = g.plot(render=False)
    print(f"🔗 Trace Graph: {url_traces}")
else:
    print("⚠️ No trace edges available for visualization")
    url_traces = None

### SECTION 2: Performance Metrics (Plotly 2D)

### Cell 18: Prepare Metrics DataFrames

In [None]:
with collector.lock:
    df_gpu = pd.DataFrame(collector.gpu_metrics)
    # Copy so the background collector cannot change the list mid-loop
    spans_snapshot = list(collector.otel_spans)

if len(df_gpu) > 0:
    df_gpu["timestamp"] = pd.to_datetime(df_gpu["timestamp"], unit="s")

# Create metrics from spans
span_metrics = []
for span in spans_snapshot:
    attrs = span.attributes or {}
    span_metrics.append({
        "timestamp": pd.to_datetime(span.start_time, unit="ns"),
        "duration_ms": (span.end_time - span.start_time) / 1_000_000,
        "input_tokens": attrs.get("llm.usage.input_tokens", 0),
        "output_tokens": attrs.get("llm.usage.output_tokens", 0),
        "status": span.status.status_code.name,
    })

df_span_metrics = pd.DataFrame(span_metrics)

print(f"📊 GPU samples: {len(df_gpu)}, Span metrics: {len(df_span_metrics)}")

### Cell 19: Create Comprehensive 2D Metrics Dashboard

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

fig = make_subplots(
    rows=3, cols=2,
    subplot_titles=(
        "Latency Distribution (ms)",
        "GPU Utilization Over Time (%)",
        "Token Usage (Input vs Output)",
        "GPU Memory Usage (MB)",
        "Request Success Rate",
        "GPU Temperature & Power"
    ),
    specs=[
        [{"type": "histogram"}, {"type": "scatter"}],
        [{"type": "scatter"}, {"type": "scatter"}],
        [{"type": "bar"}, {"type": "scatter"}]
    ],
    vertical_spacing=0.12,
)

# 1. Latency histogram
if len(df_span_metrics) > 0:
    fig.add_trace(
        go.Histogram(
            x=df_span_metrics["duration_ms"],
            nbinsx=30,
            name="Latency",
            marker_color="blue",
        ),
        row=1, col=1
    )

# 2. GPU utilization over time
if len(df_gpu) > 0:
    fig.add_trace(
        go.Scatter(
            x=df_gpu["timestamp"],
            y=df_gpu["utilization"],
            mode="lines",
            name="GPU %",
            line=dict(color="green"),
            fill="tozeroy",
        ),
        row=1, col=2
    )

# 3. Token usage scatter
if len(df_span_metrics) > 0:
    fig.add_trace(
        go.Scatter(
            x=df_span_metrics["input_tokens"],
            y=df_span_metrics["output_tokens"],
            mode="markers",
            name="Tokens",
            marker=dict(
                size=df_span_metrics["duration_ms"] / 10,
                color=df_span_metrics["duration_ms"],
                colorscale="Viridis",
                showscale=True,
            ),
        ),
        row=2, col=1
    )

# 4. GPU memory usage
if len(df_gpu) > 0:
    fig.add_trace(
        go.Scatter(
            x=df_gpu["timestamp"],
            y=df_gpu["memory_used_mb"],
            mode="lines+markers",
            name="Memory MB",
            line=dict(color="red"),
        ),
        row=2, col=2
    )

# 5. Success rate bar
if len(df_span_metrics) > 0:
    status_counts = df_span_metrics["status"].value_counts()
    fig.add_trace(
        go.Bar(
            x=status_counts.index,
            y=status_counts.values,
            name="Requests",
            marker_color=["green" if s == "OK" else "red" for s in status_counts.index],
        ),
        row=3, col=1
    )

# 6. Temperature and power
if len(df_gpu) > 0:
    fig.add_trace(
        go.Scatter(
            x=df_gpu["timestamp"],
            y=df_gpu["temperature_c"],
            mode="lines",
            name="Temp °C",
            line=dict(color="orange"),
        ),
        row=3, col=2
    )
    fig.add_trace(
        go.Scatter(
            x=df_gpu["timestamp"],
            y=df_gpu["power_w"],
            mode="lines",
            name="Power W",
            line=dict(color="purple"),
            # Shares the temperature axis: add_trace(row=, col=) assigns the subplot's own axes
        ),
        row=3, col=2
    )

fig.update_layout(
    title_text="📊 Performance Metrics Dashboard (2D)",
    showlegend=True,
    height=900,
)

fig.show()

### SECTION 3: Model Internals 3D (Plotly 3D)

### Cell 20: Extract Token Embeddings (Synthetic Demo)

In [None]:
import numpy as np
from sklearn.decomposition import PCA

# Simulate 100 tokens with 768-dim embeddings
np.random.seed(42)
n_tokens = 100
embedding_dim = 768

# Create synthetic embeddings (replace with actual GGUF extraction in production)
embeddings = np.random.randn(n_tokens, embedding_dim)

# Add semantic clustering (simulate word categories)
categories = ["tech", "math", "science", "language", "general"]
token_categories = np.random.choice(categories, size=n_tokens)

# Project to 3D using PCA
pca = PCA(n_components=3)
embeddings_3d = pca.fit_transform(embeddings)

df_embeddings = pd.DataFrame({
    "x": embeddings_3d[:, 0],
    "y": embeddings_3d[:, 1],
    "z": embeddings_3d[:, 2],
    "token_id": range(n_tokens),
    "category": token_categories,
})

print(f"📊 Embedded {n_tokens} tokens to 3D space")

### Cell 21: Create 3D Embedding Visualization

In [None]:
import plotly.express as px

fig_3d = px.scatter_3d(
    df_embeddings,
    x="x", y="y", z="z",
    color="category",
    hover_data=["token_id"],
    title="Token Embedding Space (3D PCA Projection)",
    labels={"x": "PC1", "y": "PC2", "z": "PC3"},
    opacity=0.7,
)

fig_3d.update_traces(marker=dict(size=5))
fig_3d.update_layout(height=700)
fig_3d.show()

### Cell 22: 3D Attention Heatmap (Surface Plot)

In [None]:
# Create synthetic attention weights (replace with actual extraction)
attention_heads = 8
seq_length = 64

# Simulate attention weights for one head
attention_weights = np.random.rand(seq_length, seq_length)
attention_weights = (attention_weights + attention_weights.T) / 2  # Symmetric

fig_attn = go.Figure(data=[go.Surface(
    z=attention_weights,
    colorscale="RdBu",
    colorbar=dict(title="Attention"),
)])

fig_attn.update_layout(
    title="Attention Weight Heatmap (Head 0, 3D Surface)",
    scene=dict(
        xaxis_title="Query Position",
        yaxis_title="Key Position",
        zaxis_title="Attention Score",
    ),
    height=600,
)

fig_attn.show()
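
The matrix above is uniform noise made symmetric, which is fine for testing the surface plot but does not look like real attention. In a decoder-only transformer each row is a softmax over the keys at or before the query position, so rows sum to 1 and the upper triangle is zero. A standard-library sketch of such synthetic weights follows; `causal_softmax_attention` is a hypothetical helper and the scores are still random, not extracted from the model.

```python
import math
import random

def causal_softmax_attention(seq_length: int, seed: int = 42):
    """Return a seq_length x seq_length list of causal, row-normalized weights."""
    rng = random.Random(seed)
    weights = []
    for query in range(seq_length):
        # A query at position `query` may only attend to keys 0..query.
        scores = [rng.gauss(0.0, 1.0) for _ in range(query + 1)]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(score - peak) for score in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        row += [0.0] * (seq_length - query - 1)  # masked future positions
        weights.append(row)
    return weights

demo_weights = causal_softmax_attention(8)
print([round(sum(row), 6) for row in demo_weights])
```

The nested list can be passed to `go.Surface(z=...)` in place of `attention_weights`.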

### SECTION 4: Real-Time Monitoring Panel

### Cell 23: Live Monitoring Dashboard

In [None]:
# Create live monitoring panel (static snapshot)
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# Get latest values
latest_gpu = df_gpu.iloc[-1] if len(df_gpu) > 0 else {}
total_requests = len(df_span_metrics)
success_rate = (df_span_metrics["status"] == "OK").mean() * 100 if len(df_span_metrics) > 0 else 0
avg_latency = df_span_metrics["duration_ms"].mean() if len(df_span_metrics) > 0 else 0

# Create indicator panel
fig_monitor = make_subplots(
    rows=2, cols=2,
    specs=[
        [{"type": "indicator"}, {"type": "indicator"}],
        [{"type": "indicator"}, {"type": "indicator"}]
    ],
    subplot_titles=("GPU Utilization", "Success Rate", "Avg Latency", "Temperature")
)

# GPU Utilization Gauge
fig_monitor.add_trace(
    go.Indicator(
        mode="gauge+number",
        value=latest_gpu.get("utilization", 0),
        title={"text": "GPU %"},
        gauge={"axis": {"range": [0, 100]}, "bar": {"color": "green"}},
    ),
    row=1, col=1
)

# Success Rate Gauge
fig_monitor.add_trace(
    go.Indicator(
        mode="gauge+number+delta",
        value=success_rate,
        title={"text": "Success %"},
        delta={"reference": 100},
        gauge={"axis": {"range": [0, 100]}, "bar": {"color": "blue"}},
    ),
    row=1, col=2
)

# Latency Gauge
fig_monitor.add_trace(
    go.Indicator(
        mode="number+delta",
        value=avg_latency,
        title={"text": "Avg Latency (ms)"},
        delta={"reference": 500, "relative": False},
    ),
    row=2, col=1
)

# Temperature Gauge
fig_monitor.add_trace(
    go.Indicator(
        mode="gauge+number",
        value=latest_gpu.get("temperature_c", 0),
        title={"text": "Temp °C"},
        gauge={
            "axis": {"range": [0, 100]},
            "bar": {"color": "orange"},
            "threshold": {"line": {"color": "red", "width": 4}, "thickness": 0.75, "value": 80},
        },
    ),
    row=2, col=2
)

fig_monitor.update_layout(
    title_text="🔴 LIVE Monitoring Panel",
    height=600,
)

fig_monitor.show()

---

## Part 5: Summary & Analysis (5 min)

### Cell 24: Print Observability Summary

In [None]:
summary = collector.get_summary()

print("=" * 60)
print("📊 PRODUCTION OBSERVABILITY STACK SUMMARY")
print("=" * 60)

print(f"\n✅ OpenTelemetry:")
print(f"  Total Spans: {summary['total_spans']}")
print(f"  Collection Duration: {summary['collection_duration']:.2f}s")

print(f"\n✅ llama.cpp Metrics:")
print(f"  Metric Types Collected: {summary['llama_metrics_count']}")

print(f"\n✅ GPU Monitoring:")
print(f"  Samples Collected: {summary['gpu_samples']}")
if len(df_gpu) > 0:
    print(f"  Avg GPU Utilization: {df_gpu['utilization'].mean():.2f}%")
    print(f"  Peak Memory: {df_gpu['memory_used_mb'].max():.2f} MB")
    print(f"  Max Temperature: {df_gpu['temperature_c'].max():.2f}°C")

print(f"\n✅ Request Statistics:")
print(f"  Total Requests: {len(df_span_metrics)}")
if len(df_span_metrics) > 0:
    print(f"  Success Rate: {success_rate:.2f}%")
    print(f"  Avg Latency: {avg_latency:.2f}ms")
    print(f"  P95 Latency: {df_span_metrics['duration_ms'].quantile(0.95):.2f}ms")
    print(f"  Total Tokens: {df_span_metrics['input_tokens'].sum() + df_span_metrics['output_tokens'].sum()}")

print("\n" + "=" * 60)
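
The P95 figure comes from `Series.quantile(0.95)`, which by default interpolates linearly between the two nearest ranked samples. The sketch below reproduces that calculation in plain Python to make the arithmetic explicit; `percentile` is a hypothetical helper. With only 15 requests, P95 lies between the two slowest samples, so it is a rough indicator at best.

```python
def percentile(values, q: float) -> float:
    """Linear-interpolation percentile for q in [0, 1]."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(values)
    position = (len(ordered) - 1) * q       # fractional rank
    lower = int(position)                   # index at or below the rank
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

# Hypothetical latencies in milliseconds.
demo_latencies = [100.0, 200.0, 300.0, 400.0, 500.0]
print(f"P50: {percentile(demo_latencies, 0.50):.1f} ms")
print(f"P95: {percentile(demo_latencies, 0.95):.1f} ms")
```

For five samples the fractional rank at q = 0.95 is 3.8, so the result sits 80% of the way from the fourth to the fifth sorted value.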

### Cell 25: Visualization Links Summary

In [None]:
print("\n🎨 UNIFIED DASHBOARD COMPONENTS:")
print(f"\n1️⃣ Trace Graphs (Graphistry 2D):")
if url_traces:
    print(f"   {url_traces}")
else:
    print("   Not available (no trace edges)")
print(f"\n2️⃣ Performance Metrics (Plotly 2D): ✅ Rendered above")
print(f"\n3️⃣ Model Internals (Plotly 3D): ✅ Rendered above")
print(f"\n4️⃣ Real-Time Monitoring: ✅ Rendered above")

print("\n" + "=" * 60)
print("🏆 PRODUCTION OBSERVABILITY STACK COMPLETE!")
print("=" * 60)
print("\n✅ ALL THREE OBJECTIVES ACHIEVED:")
print("   1. CUDA Inference (GPU 0)")
print("   2. LLM Observability (GPU 0)")
print("   3. Unified Visualizations (GPU 1)")

---

## Part 6: Cleanup

### Cell 26: Stop All Services

In [None]:
# Stop metrics collection
collector.stop_background_collection()

# Stop server
server.stop_server()

print("✅ All services stopped. Observability stack demo complete!")