# Notebook 15: Real-Time Inference Monitoring & Performance Analysis

**Live Performance Dashboards with llama.cpp Metrics + Plotly**

---

## Objectives Demonstrated

✅ **CUDA Inference** (GPU 0) - Continuous inference workload

✅ **LLM Observability** (GPU 0) - llama.cpp /metrics endpoint + CUDA monitoring

✅ **Visualizations** (GPU 1) - Real-time Plotly dashboards with live updates

---

## Overview

This notebook demonstrates **real-time performance monitoring** of LLM inference by continuously polling llama.cpp's built-in `/metrics` endpoint and NVIDIA's GPU metrics, then visualizing them as live-updating Plotly dashboards on GPU 1.

**What You'll Learn:**
- Access llama.cpp's Prometheus `/metrics` endpoint
- Monitor GPU utilization with `nvidia-smi` and `pynvml`
- Poll llama.cpp `/slots` endpoint for request queue monitoring
- Create live-updating Plotly dashboards with `plotly.graph_objects.FigureWidget`
- Identify performance bottlenecks and optimization opportunities
- Benchmark different configurations (batch size, context length, etc.)
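
To make the first bullet concrete, here is a minimal sketch of what a Prometheus text payload looks like and how its lines break down. The sample payload and its values are invented for illustration, and `parse_prometheus_text` is a hypothetical helper. The metric names use the `llamacpp:` prefix, but the exact set depends on the llama.cpp version you run.

```python
import re

# Illustrative payload only: the values are invented, and the exact metric
# names depend on the llama.cpp version.
SAMPLE = """\
# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total 1234
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 41.5
"""

# name, optional {labels}, value, optional integer timestamp
LINE = re.compile(r"^([a-zA-Z_:][\w:]*)(?:\{[^}]*\})?\s+(\S+)(?:\s+\d+)?$")

def parse_prometheus_text(text):
    """Return {metric_name: value}, skipping comments and malformed lines."""
    parsed = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if not match:
            continue
        try:
            parsed[match.group(1)] = float(match.group(2))
        except ValueError:
            continue  # value was not a number
    return parsed

parsed = parse_prometheus_text(SAMPLE)
print(parsed)
```

Note that labels are dropped here, so two samples of the same metric with different labels would overwrite each other; that is acceptable for a single-model server but not in general.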

**Time:** 30 minutes

**Difficulty:** Intermediate-Advanced

**VRAM:** GPU 0: 5-8 GB, GPU 1: 1-2 GB

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python

import numpy as np
import pandas as pd
import os

# Input data files are available in the read-only "../input/" directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

## Part 1: Setup & Dependencies

In [None]:
# ==============================================================================
# Step 1: Verify Dual GPU Environment
# ==============================================================================

import subprocess

print("="*70)
print("🔍 SPLIT-GPU ENVIRONMENT CHECK")
print("="*70)

result = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,name,memory.total,memory.free", "--format=csv,noheader"],
    capture_output=True, text=True
)

# Drop empty lines so a failed nvidia-smi call is not counted as one GPU
gpus = [line for line in result.stdout.strip().split('\n') if line.strip()]
print(f"\n📊 Detected {len(gpus)} GPU(s):")
for gpu in gpus:
    print(f"   {gpu}")

if len(gpus) >= 2:
    print("\n✅ Dual T4 ready for split-GPU operation!")
    print("   GPU 0 → llama-server (GGUF model inference)")
    print("   GPU 1 → Real-time dashboards (Plotly)")
else:
    print("\n⚠️ Need 2 GPUs for split operation")

In [None]:
# ==============================================================================
# Step 2: Install llamatelemetry v0.1.0
# ==============================================================================
print("📦 Installing dependencies...")

# Install llamatelemetry v0.1.0
!pip install -q --no-cache-dir git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0

# Install monitoring packages
!pip install -q plotly pandas numpy pynvml requests

# Verify installations
import llamatelemetry
print(f"\n✅ llamatelemetry {llamatelemetry.__version__} installed")

try:
    import plotly
    print(f"✅ Plotly {plotly.__version__}")
except ImportError as e:
    print(f"⚠️ Plotly: {e}")

try:
    import pynvml
    print("✅ PyNVML installed")
except ImportError as e:
    print(f"⚠️ PyNVML: {e}")

## Part 2: Start Instrumented Server
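
For orientation, the sketch below assembles roughly the command line you would use to launch `llama-server` directly with the same settings as the cell in this part. How `ServerManager` maps its keyword arguments to flags is not shown in this notebook, so treat `build_llama_server_cmd` as a hypothetical helper and not as a description of llamatelemetry internals. Flash attention is left out because its flag syntax differs between llama.cpp versions.

```python
import shlex

def build_llama_server_cmd(model_path, host="127.0.0.1", port=8090,
                           ctx_size=4096, batch_size=512,
                           gpu_layers=99, tensor_split="1.0,0.0"):
    """Assemble a llama-server argument list (illustrative helper only)."""
    return [
        "llama-server",
        "-m", model_path,
        "--host", host,
        "--port", str(port),
        "-c", str(ctx_size),             # context size in tokens
        "-b", str(batch_size),           # logical batch size
        "-ngl", str(gpu_layers),         # layers to offload to the GPU
        "--tensor-split", tensor_split,  # "1.0,0.0" keeps all weights on GPU 0
        "--metrics",                     # expose the Prometheus /metrics endpoint
    ]

cmd = build_llama_server_cmd(
    "/kaggle/working/models/Qwen2.5-3B-Instruct-Q4_K_M.gguf"
)
print(shlex.join(cmd))
```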

In [None]:
# ==============================================================================
# Step 3: Download GGUF Model
# ==============================================================================
from huggingface_hub import hf_hub_download

# Create models directory
os.makedirs("/kaggle/working/models", exist_ok=True)

# Download model
print("Downloading model...")
model_path = hf_hub_download(
    repo_id="bartowski/Qwen2.5-3B-Instruct-GGUF",
    filename="Qwen2.5-3B-Instruct-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",
)
print(f"✓ Model downloaded: {model_path}")

In [None]:
# ==============================================================================
# Step 4: Start Server with Metrics Enabled
# ==============================================================================
from llamatelemetry.server import ServerManager
import torch

# Check GPUs
print(f"\nFound {torch.cuda.device_count()} GPUs:")
for i in range(torch.cuda.device_count()):
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")

# Start server with metrics enabled
server = ServerManager(server_url="http://127.0.0.1:8090")

server.start_server(
    model_path=model_path,
    gpu_layers=99,
    tensor_split="1.0,0.0",  # GPU 0 only
    flash_attn=1,
    port=8090,
    host="127.0.0.1",
    ctx_size=4096,
    batch_size=512,
    # Enable metrics endpoint
    extra_args=["--metrics"],
)

print("\n✓ Server running on http://127.0.0.1:8090")
print("✓ GPU 0: Used for LLM")
print("✓ GPU 1: Free for visualizations")
print("✓ Metrics endpoint: /metrics")

## Part 3: Metrics Collection Infrastructure
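
Polling a server that is still loading its model produces empty or failed samples, so it can help to wait for readiness first. The sketch below polls the llama.cpp server's `/health` endpoint until it answers with HTTP 200. The helper names (`http_ok`, `wait_until_ready`) and the timeout defaults are assumptions made for this example; only the standard library is used.

```python
import time
import urllib.error
import urllib.request

def http_ok(url, timeout=2.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status such as 503
        return False

def wait_until_ready(base_url, probe=http_ok, timeout_s=60.0, poll_s=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll `<base_url>/health` until it succeeds or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe(f"{base_url}/health"):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

Usage would look like `wait_until_ready("http://127.0.0.1:8090")` before the first `collect_once()` call. The `probe`, `sleep`, and `clock` parameters exist only so the loop can be exercised without a running server.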

In [None]:
# ==============================================================================
# Step 5: Define Metrics Collector
# ==============================================================================
import requests
import time
import re
from typing import Dict, List
from collections import defaultdict
import threading

class LlamaMetricsCollector:
    """Collects metrics from llama.cpp server endpoints"""

    def __init__(self, base_url: str = "http://127.0.0.1:8090"):
        self.base_url = base_url
        self.metrics_history = defaultdict(list)
        self.slots_history = []
        self.gpu_metrics_history = []
        self.timestamps = []
        self.running = False
        self.lock = threading.Lock()

    def parse_prometheus_metrics(self, text: str) -> Dict[str, float]:
        """Parse Prometheus-format metrics from /metrics endpoint"""
        metrics = {}

        # Parse metric lines (format: metric_name{labels} value)
        for line in text.split("\n"):
            if line.startswith("#") or not line.strip():
                continue

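            # Metric names may contain colons (e.g. "llamacpp:..."), may carry
            # a {label="..."} block, and values may use scientific notation
            match = re.match(
                r"([a-zA-Z_:][\w:]*)(?:\{[^}]*\})?\s+([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)",
                line,
            )
            if match:
                metric_name, value = match.groups()
                metrics[metric_name] = float(value)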

        return metrics

    def fetch_server_metrics(self) -> Dict[str, float]:
        """Fetch metrics from /metrics endpoint"""
        try:
            response = requests.get(f"{self.base_url}/metrics", timeout=2)
            if response.status_code == 200:
                return self.parse_prometheus_metrics(response.text)
        except Exception as e:
            print(f"Error fetching metrics: {e}")
        return {}

    def fetch_slots_info(self) -> List[Dict]:
        """Fetch slot information from /slots endpoint"""
        try:
            response = requests.get(f"{self.base_url}/slots", timeout=2)
            if response.status_code == 200:
                return response.json()
        except Exception as e:
            print(f"Error fetching slots: {e}")
        return []

    def fetch_gpu_metrics(self) -> Dict[str, float]:
        """Fetch GPU metrics using pynvml"""
        try:
            import pynvml

            # nvmlInit() is reference-counted, so repeated calls are safe;
            # a failure propagates to the outer handler below
            pynvml.nvmlInit()

            # Get GPU 0 handle (NVML indices ignore CUDA_VISIBLE_DEVICES)
            handle = pynvml.nvmlDeviceGetHandleByIndex(0)

            # Query metrics
            utilization = pynvml.nvmlDeviceGetUtilizationRates(handle)
            memory_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
            temperature = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            power_draw = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # mW to W

            return {
                "gpu_utilization": utilization.gpu,  # %
                "memory_utilization": utilization.memory,  # %
                "memory_used_mb": memory_info.used / 1024**2,  # bytes to MB
                "memory_total_mb": memory_info.total / 1024**2,
                "temperature_c": temperature,
                "power_draw_w": power_draw,
            }
        except Exception as e:
            print(f"Error fetching GPU metrics: {e}")
            return {}

    def collect_once(self):
        """Collect all metrics at current timestamp"""
        timestamp = time.time()

        # Fetch from all sources
        server_metrics = self.fetch_server_metrics()
        slots_info = self.fetch_slots_info()
        gpu_metrics = self.fetch_gpu_metrics()

        # Store with lock
        with self.lock:
            self.timestamps.append(timestamp)

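            # Store server metrics, padding with NaN so every series stays the
            # same length as self.timestamps even when a poll fails or a
            # metric first appears late
            n = len(self.timestamps)
            known = list(self.metrics_history)
            new = [k for k in server_metrics if k not in self.metrics_history]
            for key in known + new:
                series = self.metrics_history[key]
                series.extend([float("nan")] * (n - 1 - len(series)))
                series.append(server_metrics.get(key, float("nan")))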

            # Store slots info
            self.slots_history.append({
                "timestamp": timestamp,
                "slots": slots_info,
                "num_processing": sum(1 for s in slots_info if s.get("is_processing", False)),
                "num_idle": sum(1 for s in slots_info if not s.get("is_processing", False)),
            })

            # Store GPU metrics
            gpu_record = {"timestamp": timestamp, **gpu_metrics}
            self.gpu_metrics_history.append(gpu_record)

    def start_background_collection(self, interval: float = 1.0):
        """Start background thread for continuous collection"""
        self.running = True

        def collect_loop():
            while self.running:
                self.collect_once()
                time.sleep(interval)

        thread = threading.Thread(target=collect_loop, daemon=True)
        thread.start()
        print(f"📊 Started metrics collection (interval={interval}s)")

    def stop_background_collection(self):
        """Stop background collection"""
        self.running = False
        print("⏹️ Stopped metrics collection")

    def get_dataframe(self, metric_name: str) -> pd.DataFrame:
        """Get metric history as pandas DataFrame"""
        with self.lock:
            if metric_name not in self.metrics_history:
                return pd.DataFrame()

            return pd.DataFrame({
                "timestamp": pd.to_datetime(self.timestamps, unit="s"),
                "value": self.metrics_history[metric_name],
            })

    def get_gpu_dataframe(self) -> pd.DataFrame:
        """Get GPU metrics history as DataFrame"""
        with self.lock:
            if not self.gpu_metrics_history:
                return pd.DataFrame()
            return pd.DataFrame(self.gpu_metrics_history)

# Initialize collector
collector = LlamaMetricsCollector()
print("✅ Metrics collector initialized")

In [None]:
# ==============================================================================
# Step 6: Test Metrics Collection
# ==============================================================================
# Test single collection
collector.collect_once()

print("\n📊 Server Metrics:")
for key in list(collector.metrics_history.keys())[:10]:
    print(f"  {key}: {collector.metrics_history[key][-1]}")

print("\n🎰 Slots Info:")
if collector.slots_history:
    latest = collector.slots_history[-1]
    print(f"  Processing: {latest['num_processing']}")
    print(f"  Idle: {latest['num_idle']}")

print("\n🖥️ GPU Metrics:")
if collector.gpu_metrics_history:
    latest = collector.gpu_metrics_history[-1]
    for key, value in latest.items():
        if key != "timestamp":
            print(f"  {key}: {value:.2f}")

In [None]:
# ==============================================================================
# Step 7: Start Background Collection
# ==============================================================================
# Start collecting metrics in background
collector.start_background_collection(interval=1.0)

# Let it collect for a few seconds
time.sleep(5)

print(f"📈 Collected {len(collector.timestamps)} data points")

## Part 4: Generate Continuous Inference Load
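
A load generator that sleeps a fixed interval after every request drifts below its target rate, because the request latency is added on top of the sleep. One common alternative is deadline-based pacing: keep a schedule of send times and sleep only until the next one. The `Pacer` class below is a hypothetical sketch of that idea and is not part of llamatelemetry. With a single blocking client, the achieved rate is still capped at one request per request latency.

```python
import time

class Pacer:
    """Hold a target request rate by sleeping until the next scheduled send."""

    def __init__(self, qps, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / qps
        self.clock = clock
        self.sleep = sleep
        self.next_deadline = clock()

    def wait(self):
        now = self.clock()
        delay = self.next_deadline - now
        if delay > 0:
            # On schedule: sleep off the remainder and keep the fixed grid
            self.sleep(delay)
            self.next_deadline += self.interval
        else:
            # Running late: restart the schedule from now instead of
            # sending a burst of requests to catch up
            self.next_deadline = now + self.interval
```

In a load loop this would be used as `pacer.wait()` immediately before each request.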

In [None]:
# ==============================================================================
# Step 8: Define Load Generator
# ==============================================================================
from llamatelemetry.api import LlamaCppClient
import random

class InferenceLoadGenerator:
    """Generates continuous inference requests"""

    def __init__(self, base_url: str, prompts: List[str]):
        self.client = LlamaCppClient(base_url)
        self.prompts = prompts
        self.running = False
        self.request_count = 0
        self.error_count = 0
        self.lock = threading.Lock()

    def generate_request(self):
        """Generate single inference request"""
        try:
            prompt = random.choice(self.prompts)
            response = self.client.chat.completions.create(
                messages=[{"role": "user", "content": prompt}],
                max_tokens=random.randint(50, 150),
                temperature=random.uniform(0.5, 0.9),
            )

            with self.lock:
                self.request_count += 1

            return response

        except Exception as e:
            with self.lock:
                self.error_count += 1
            print(f"❌ Request error: {e}")
            return None

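    def start_continuous_load(self, qps: float = 2.0):
        """Start generating load at up to `qps` requests per second.

        Requests are sent one at a time, so the achieved rate is capped at
        1 / request latency whenever a request takes longer than 1 / qps.
        """
        self.running = True

        def load_loop():
            interval = 1.0 / qps
            while self.running:
                started = time.time()
                self.generate_request()
                # Sleep only for what is left of the interval after the request
                time.sleep(max(0.0, interval - (time.time() - started)))

        thread = threading.Thread(target=load_loop, daemon=True)
        thread.start()
        print(f"🚀 Started load generation (target QPS={qps})")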

    def stop_continuous_load(self):
        """Stop load generation"""
        self.running = False
        print(f"⏹️ Stopped load generation (Succeeded: {self.request_count}, Errors: {self.error_count})")

# Define test prompts
test_prompts = [
    "Explain how CUDA kernels work",
    "What is quantization in neural networks?",
    "Describe the transformer architecture",
    "How does attention mechanism work?",
    "What are the benefits of GGUF format?",
    "Explain FlashAttention optimization",
    "What is tensor parallelism?",
    "How does KV cache improve inference?",
    "Describe NCCL in distributed training",
    "What is mixed precision training?",
]

# Initialize load generator
load_gen = InferenceLoadGenerator("http://127.0.0.1:8090", test_prompts)
print("✅ Load generator initialized")

In [None]:
# ==============================================================================
# Step 9: Start Generating Load
# ==============================================================================
load_gen.start_continuous_load(qps=2.0)  # target rate; the actual rate depends on request latency

# Let it run for a bit
time.sleep(10)

print(f"📊 Requests completed: {load_gen.request_count}")
print(f"❌ Errors: {load_gen.error_count}")

## Part 5: Live Plotly Dashboards (GPU 1)

In [None]:
# ==============================================================================
# Step 10: Switch to GPU 1
# ==============================================================================
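# Note: this only affects CUDA libraries initialized after this point and child
# processes started later. The llama-server process keeps running on GPU 0, and
# NVML (used by the collector) ignores CUDA_VISIBLE_DEVICES.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
print("🔄 Switched to GPU 1 for visualizations")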

In [None]:
# ==============================================================================
# Step 11: Create Live Dashboard with Plotly FigureWidget
# ==============================================================================
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from IPython.display import display

# Create subplots
fig = make_subplots(
    rows=3, cols=2,
    subplot_titles=(
        "Token Generation Rate (tokens/sec)",
        "GPU Utilization (%)",
        "Request Processing Time (ms)",
        "GPU Memory Usage (MB)",
        "Active Slots",
        "GPU Temperature (°C) & Power (W)"
    ),
    specs=[
        [{"type": "scatter"}, {"type": "scatter"}],
        [{"type": "scatter"}, {"type": "scatter"}],
        [{"type": "scatter"}, {"type": "scatter"}]
    ],
    vertical_spacing=0.12,
    horizontal_spacing=0.15,
)

# Initialize traces
fig.add_trace(go.Scatter(x=[], y=[], mode="lines", name="Tokens/sec", line=dict(color="green")), row=1, col=1)
fig.add_trace(go.Scatter(x=[], y=[], mode="lines", name="GPU %", line=dict(color="blue")), row=1, col=2)
fig.add_trace(go.Scatter(x=[], y=[], mode="lines", name="Latency", line=dict(color="orange")), row=2, col=1)
fig.add_trace(go.Scatter(x=[], y=[], mode="lines", name="Memory MB", line=dict(color="red")), row=2, col=2)
fig.add_trace(go.Scatter(x=[], y=[], mode="lines+markers", name="Active", line=dict(color="purple")), row=3, col=1)
fig.add_trace(go.Scatter(x=[], y=[], mode="lines", name="Temp °C", line=dict(color="darkred")), row=3, col=2)
fig.add_trace(go.Scatter(x=[], y=[], mode="lines", name="Power W", line=dict(color="darkorange")), row=3, col=2)

# Configure layout
fig.update_layout(
    title_text="🔴 LIVE LLM Performance Dashboard",
    showlegend=True,
    height=900,
)

# Create FigureWidget for live updates
fig_widget = go.FigureWidget(fig)
display(fig_widget)
print("✅ Live dashboard created")

In [None]:
# ==============================================================================
# Step 12: Dashboard Update Loop
# ==============================================================================
from datetime import datetime

def update_dashboard():
    """Update dashboard with latest metrics"""

    # Get GPU metrics
    df_gpu = collector.get_gpu_dataframe()
    if not df_gpu.empty:
        timestamps = pd.to_datetime(df_gpu["timestamp"], unit="s")

        # Update GPU utilization
        with fig_widget.batch_update():
            fig_widget.data[1].x = timestamps
            fig_widget.data[1].y = df_gpu["gpu_utilization"]

            # Update GPU memory
            fig_widget.data[3].x = timestamps
            fig_widget.data[3].y = df_gpu["memory_used_mb"]

            # Update temperature and power
            fig_widget.data[5].x = timestamps
            fig_widget.data[5].y = df_gpu["temperature_c"]
            fig_widget.data[6].x = timestamps
            fig_widget.data[6].y = df_gpu["power_draw_w"]

    # Get server metrics (if available)
    # Note: metric names vary between llama.cpp versions; adjust as needed
    with collector.lock:
        metric_keys = list(collector.metrics_history.keys())
    # Prefer the average generation throughput gauge, then fall back to any
    # metric whose name mentions both "token" and "sec"
    candidates = [k for k in metric_keys if k.endswith("predicted_tokens_seconds")]
    candidates += [k for k in metric_keys if "token" in k.lower() and "sec" in k.lower()]
    if candidates:
        df_tokens = collector.get_dataframe(candidates[0])
        if not df_tokens.empty:
            with fig_widget.batch_update():
                fig_widget.data[0].x = df_tokens["timestamp"]
                fig_widget.data[0].y = df_tokens["value"]

    # Get slots info (snapshot under the lock so x and y stay the same length
    # while the background thread keeps appending)
    with collector.lock:
        slots_snapshot = list(collector.slots_history)
    if slots_snapshot:
        slots_times = [pd.Timestamp(s["timestamp"], unit="s") for s in slots_snapshot]
        slots_active = [s["num_processing"] for s in slots_snapshot]

        with fig_widget.batch_update():
            fig_widget.data[4].x = slots_times
            fig_widget.data[4].y = slots_active

# Update every 2 seconds
print("🔄 Starting live dashboard updates...")
for i in range(30):  # Update 30 times (60 seconds total)
    update_dashboard()
    time.sleep(2)

print("✅ Dashboard updates complete")

## Part 6: Performance Analysis

In [None]:
# ==============================================================================
# Step 13: Calculate Performance Statistics
# ==============================================================================
df_gpu = collector.get_gpu_dataframe()

if not df_gpu.empty:
    print("📊 Performance Statistics\n")

    print("GPU Utilization:")
    print(f"  Mean: {df_gpu['gpu_utilization'].mean():.2f}%")
    print(f"  P50:  {df_gpu['gpu_utilization'].quantile(0.50):.2f}%")
    print(f"  P95:  {df_gpu['gpu_utilization'].quantile(0.95):.2f}%")
    print(f"  Max:  {df_gpu['gpu_utilization'].max():.2f}%")

    print("\nGPU Memory:")
    print(f"  Mean: {df_gpu['memory_used_mb'].mean():.2f} MB")
    print(f"  Max:  {df_gpu['memory_used_mb'].max():.2f} MB")

    print("\nTemperature:")
    print(f"  Mean: {df_gpu['temperature_c'].mean():.2f}°C")
    print(f"  Max:  {df_gpu['temperature_c'].max():.2f}°C")

In [None]:
# ==============================================================================
# Step 14: Request Statistics
# ==============================================================================
# request_count is incremented only on success, so add errors for the total
total_requests = load_gen.request_count + load_gen.error_count
print("\n🚀 Load Generator Statistics:")
print(f"  Total Requests: {total_requests}")
print(f"  Errors: {load_gen.error_count}")
print(f"  Success Rate: {load_gen.request_count / max(total_requests, 1) * 100:.2f}%")

## Part 7: Cleanup

In [None]:
# ==============================================================================
# Step 15: Stop Everything
# ==============================================================================
# Stop load generation
load_gen.stop_continuous_load()

# Stop metrics collection
collector.stop_background_collection()

# Stop server
server.stop_server()

print("✅ Cleanup complete!")

---

## Key Learnings

### **1. llama.cpp Metrics**
- ✅ `/metrics` endpoint provides Prometheus-format metrics
- ✅ Token generation throughput (tokens/second)
- ✅ Request processing statistics
- ✅ Cache hit rates
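
Several Prometheus metrics are cumulative counters that only increase, so throughput over an interval has to be derived from the difference between two polls. The sketch below shows that conversion for a list of samples. `counter_rate` is a hypothetical helper, and the sample numbers are invented; a cumulative generated-token count is the kind of series it is meant for.

```python
def counter_rate(timestamps, values):
    """Per-interval rate of a cumulative counter.

    Returns one entry per consecutive pair of samples. An entry is None when
    the counter went backwards (for example after a server restart) or when
    the timestamps do not advance.
    """
    rates = []
    for i in range(1, len(values)):
        dt = timestamps[i] - timestamps[i - 1]
        dv = values[i] - values[i - 1]
        if dt <= 0 or dv < 0:
            rates.append(None)
        else:
            rates.append(dv / dt)
    return rates

# Invented samples: 40 tokens in the first second, 60 in the next two
print(counter_rate([0.0, 1.0, 3.0], [0, 40, 100]))  # [40.0, 30.0]
```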

### **2. GPU Monitoring**
- ✅ PyNVML for programmatic GPU metrics access
- ✅ Utilization, memory, temperature, power draw
- ✅ Real-time monitoring at 1-second intervals
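
Where PyNVML is not available, the same readings can be taken from `nvidia-smi` in CSV mode. The sketch below queries a handful of fields and parses the output into dictionaries. The helper names are assumptions for this example, and the sample line in the usage note uses invented values.

```python
import subprocess

FIELDS = ["index", "utilization.gpu", "memory.used", "temperature.gpu", "power.draw"]

def parse_nvidia_smi_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != len(FIELDS):
            continue  # skip lines that do not match the query
        row = {}
        for name, raw in zip(FIELDS, parts):
            try:
                row[name] = float(raw)
            except ValueError:
                row[name] = None  # e.g. "[N/A]" for an unsupported field
        rows.append(row)
    return rows

def query_gpus():
    """Run nvidia-smi once and return the parsed rows."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_nvidia_smi_csv(result.stdout)
```

For example, `parse_nvidia_smi_csv("0, 87, 5321, 64, 58.21")` yields a single row with `utilization.gpu` equal to `87.0`. Spawning a process per sample is heavier than an NVML call, so this suits coarse polling intervals better than sub-second ones.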

### **3. Request Queue Monitoring**
- ✅ `/slots` endpoint shows request queue state
- ✅ Number of processing vs idle slots
- ✅ Per-slot token generation progress
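
A single snapshot of busy and idle slots is noisy; averaging over the polling history gives a steadier picture of how saturated the server is. The sketch below computes the fraction of slot samples that were busy. It assumes each slot dictionary carries a boolean `is_processing` field, as the collector in this notebook does; other server versions may report slot state differently. `slot_occupancy` is a hypothetical helper.

```python
def slot_occupancy(history):
    """Fraction of polled slot samples that were busy.

    `history` is a list of polls, each poll being the list of slot dicts
    returned by one /slots request.
    """
    busy = 0
    total = 0
    for slots in history:
        for slot in slots:
            total += 1
            if slot.get("is_processing", False):
                busy += 1
    return busy / total if total else 0.0

# Invented history: two polls of a two-slot server, plus one failed poll
history = [
    [{"is_processing": True}, {"is_processing": False}],
    [{"is_processing": True}, {"is_processing": True}],
    [],
]
print(slot_occupancy(history))  # 0.75
```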

### **4. Live Visualization**
- ✅ Plotly FigureWidget for real-time updates
- ✅ Multi-panel dashboards with synchronized timelines
- ✅ Efficient batch updates for smooth rendering
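
Re-sending the full history on every refresh gets slower as a session runs longer. One way to keep updates cheap is to plot only a fixed-size window of recent points. The `RollingSeries` class below is a hypothetical sketch of that idea using `collections.deque`; the default window size is an arbitrary choice.

```python
from collections import deque

class RollingSeries:
    """Keep only the most recent `max_points` (x, y) samples."""

    def __init__(self, max_points=300):
        # deque(maxlen=...) silently discards the oldest item when full
        self.x = deque(maxlen=max_points)
        self.y = deque(maxlen=max_points)

    def append(self, x, y):
        self.x.append(x)
        self.y.append(y)

    def as_lists(self):
        """Return plain lists, ready to assign to a trace's x and y."""
        return list(self.x), list(self.y)

series = RollingSeries(max_points=3)
for i in range(5):
    series.append(i, i * 10)
print(series.as_lists())  # ([2, 3, 4], [20, 30, 40])
```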

### **5. Performance Analysis**
- ✅ Identify bottlenecks (GPU, memory, queue depth)
- ✅ Optimize batch size and concurrency
- ✅ Monitor thermal throttling and power limits
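
As a rough illustration of how these signals can be combined, the sketch below labels a sample by whichever resource looks closest to its limit. The function name and the 90% thresholds are assumptions chosen for the example, not tuned values, and a real diagnosis should look at trends over time and not at a single sample.

```python
def classify_bottleneck(gpu_util_pct, mem_used_mb, mem_total_mb,
                        busy_slots, total_slots, threshold=0.90):
    """Return a coarse label for the most constrained resource.

    The checks run in order, so memory pressure takes priority over GPU
    saturation, which takes priority over slot exhaustion.
    """
    mem_frac = mem_used_mb / mem_total_mb if mem_total_mb else 0.0
    slot_frac = busy_slots / total_slots if total_slots else 0.0
    if mem_frac >= threshold:
        return "memory"
    if gpu_util_pct / 100.0 >= threshold:
        return "gpu"
    if slot_frac >= threshold:
        return "slots"
    return "headroom"

print(classify_bottleneck(95, 5000, 15000, busy_slots=1, total_slots=4))  # gpu
```

A "slots" result with low GPU utilization suggests that raising concurrency may help, while a "gpu" result suggests the device itself is the limit.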

---

## Next Steps

- **Notebook 16:** End-to-end production observability stack
- Export metrics to Prometheus/Grafana
- Set up alerting for performance degradation
- Implement auto-scaling based on queue depth
- A/B test different model configurations
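
For the alerting item, a simple starting point is to fire only when a metric stays above a threshold for several consecutive samples, which filters out brief spikes. The sketch below implements that rule; `sustained_breach` and its defaults are hypothetical, and the sample values are invented.

```python
def sustained_breach(values, threshold, min_consecutive=5):
    """Return True if `values` exceeds `threshold` for at least
    `min_consecutive` samples in a row."""
    run = 0
    for value in values:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False

# Invented temperature samples in °C: three readings in a row above 83
print(sustained_breach([80, 85, 86, 87, 70, 90], threshold=83, min_consecutive=3))  # True
```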

---

**🎯 Objectives Achieved:**

✅ CUDA Inference (GPU 0) - Continuous workload

✅ LLM Observability (GPU 0) - Full metrics collection

✅ Plotly Visualizations (GPU 1) - Live dashboards