# Production Deployment & Best Practices

**Duration:** ~25 min | **Platform:** Kaggle dual Tesla T4

This notebook covers **production-ready patterns** for LLM deployment:
secret management, server lifecycle, error handling, health checks,
session tracking, and graceful shutdown.

### What you'll learn
1. Secret management on Kaggle
2. Server lifecycle with context managers
3. Error handling patterns
4. Health check loops and auto-restart
5. Per-user session tracking
6. Graceful shutdown procedures

In [None]:
!pip install -q git+https://github.com/llamatelemetry/llamatelemetry.git@v1.0.0

import llamatelemetry

# Production-grade initialization
llamatelemetry.init(
    service_name="production-app",
    environment="kaggle",
    sampling="always_on",    # Use "ratio" with sampling_ratio=0.1 in production
    redact=False,            # Set True to redact prompts in traces
    enable_gpu=True,
)
print(f"llamatelemetry {llamatelemetry.version()} — production config")

## Secret Management

Use Kaggle secrets to securely manage API keys and tokens.
Never hardcode credentials in notebooks.

In [None]:
from llamatelemetry.kaggle import (
    auto_load_secrets,
    setup_huggingface_auth,
    auto_configure_grafana_cloud,
)

# Auto-load all available Kaggle secrets
secrets = auto_load_secrets()
print("Loaded secrets:")
for key, value in secrets.items():
    status = "set" if value else "not found"
    print(f"  {key}: {status}")

# Setup HuggingFace auth for model downloads
hf_ok = setup_huggingface_auth()
print(f"\nHuggingFace auth: {'configured' if hf_ok else 'not available'}")

# Setup Grafana Cloud for telemetry export
try:
    grafana_ok = auto_configure_grafana_cloud()
    print(f"Grafana Cloud: {'configured' if grafana_ok else 'not available'}")
except Exception as e:
    print(f"Grafana Cloud: {e}")
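
Outside Kaggle, the same "no hardcoded credentials" rule is usually met by reading secrets from environment variables. The following is a minimal, stdlib-only sketch of that idea. `read_secret` and `mask` are hypothetical helpers (not part of llamatelemetry), and `GRAFANA_CLOUD_TOKEN` is an illustrative variable name.

```python
import os


def read_secret(name, env=None):
    """Return the secret called `name`, or None if it is unset or empty."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    return value or None


def mask(value, visible=4):
    """Hide all but the last `visible` characters so logs never leak a token."""
    if not value:
        return "<not set>"
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


# Illustrative values only; a real deployment would pass nothing and read os.environ.
demo_env = {"HF_TOKEN": "hf_example_token_1234"}
for key in ("HF_TOKEN", "GRAFANA_CLOUD_TOKEN"):
    print(f"{key}: {mask(read_secret(key, demo_env))}")
```

Logging only a masked value keeps the "set / not found" feedback from the cell above without ever printing the credential itself.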

## Server Lifecycle Management

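`ServerManager` can be used as a context manager for automatic cleanup on errors.
The cell below uses the explicit lifecycle instead, because later cells reuse the running server.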

In [None]:
from llamatelemetry.llama import ServerManager, LlamaCppClient
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="bartowski/google_gemma-3-1b-it-GGUF",
    filename="google_gemma-3-1b-it-Q4_K_M.gguf",
    cache_dir="/root/.cache/huggingface",
)

# Pattern 1: Explicit lifecycle
mgr = ServerManager()
mgr.start_server(model_path=model_path, gpu_layers=99, ctx_size=2048)
mgr.wait_until_ready(timeout=60)
client = LlamaCppClient(base_url="http://127.0.0.1:8090")
print("Server started (explicit lifecycle)")

# In production, add a `finally:` block that stops the server.
# It is omitted here because later cells reuse the running server.
try:
    resp = client.chat.completions.create(
        messages=[{"role": "user", "content": "Hello"}],
        max_tokens=32,
    )
    print(f"Response: {resp.choices[0].message.content}")
except Exception as e:
    print(f"Error: {e}")

# Keep server running for remaining cells
print("Server running for remaining examples")
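
The cleanup guarantee described above can be sketched with `contextlib`. This is an illustration of the pattern under stated assumptions, not llamatelemetry's own implementation: `running_server` is a hypothetical wrapper that only relies on the `start_server()` and `stop_server()` method names used in this notebook, and `StubManager` is a stand-in so the example runs without a GPU.

```python
from contextlib import contextmanager


@contextmanager
def running_server(mgr, **start_kwargs):
    """Start a server on entry and always stop it on exit, even on error."""
    mgr.start_server(**start_kwargs)
    try:
        yield mgr
    finally:
        mgr.stop_server()


class StubManager:
    """Stand-in for a real server manager so the sketch runs anywhere."""

    def __init__(self):
        self.events = []

    def start_server(self, **kwargs):
        self.events.append("start")

    def stop_server(self):
        self.events.append("stop")


stub = StubManager()
try:
    with running_server(stub, model_path="model.gguf"):
        raise RuntimeError("inference failed")
except RuntimeError:
    pass

print(stub.events)  # ['start', 'stop']
```

The `finally` clause is what makes the stop unconditional; a readiness wait such as `wait_until_ready()` would go between the start call and the `yield`.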

## Error Handling Patterns

Wrap inference in `@workflow`, record each attempt in its own span, and either retry with exponential backoff or fall back to a canned response.

In [None]:
import time

@llamatelemetry.workflow(name="resilient-inference")
def resilient_inference(client, prompt, max_retries=3):
    """Inference with retry and error recording."""
    last_error = None

    for attempt in range(max_retries):
        try:
            with llamatelemetry.span("attempt", attempt=attempt + 1):
                resp = client.chat.completions.create(
                    messages=[{"role": "user", "content": prompt}],
                    max_tokens=64, temperature=0.7,
                )
                return resp.choices[0].message.content
        except Exception as e:
            last_error = e
            if attempt + 1 < max_retries:  # No point waiting after the final attempt
                with llamatelemetry.span("retry-wait", error=str(e)):
                    time.sleep(2 ** attempt)  # Exponential backoff: 1 s, 2 s, 4 s, ...

    # All attempts failed
    return f"[Error after {max_retries} attempts: {last_error}]"

# Test resilient inference
result = resilient_inference(client, "What is error handling in production systems?")
print(f"Result: {result}")

# Test with graceful degradation
@llamatelemetry.workflow(name="degraded-inference")
def inference_with_fallback(client, prompt):
    """Try LLM first, fall back to a canned response."""
    try:
        resp = client.chat.completions.create(
            messages=[{"role": "user", "content": prompt}],
            max_tokens=64, temperature=0.7,
        )
        return {"source": "llm", "text": resp.choices[0].message.content}
    except Exception:
        # Any failure degrades to a canned reply; callers can tell via "source"
        return {"source": "fallback", "text": "I'm currently unable to process your request. Please try again later."}

result = inference_with_fallback(client, "Explain graceful degradation.")
print(f"Source: {result['source']}")
print(f"Text: {result['text']}")
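
The retry loop above waits `2 ** attempt` seconds between attempts. The sketch below isolates that schedule and adds two common refinements that are not in the original cell: an upper cap, and optional random jitter so that many clients do not retry in lockstep. `backoff_delays` is a hypothetical helper.

```python
import random


def backoff_delays(max_retries, base=1.0, cap=30.0, jitter=False, rng=random):
    """Return the wait before each retry: base * 2**attempt, capped at `cap`."""
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))
        if jitter:
            # "Full jitter": pick uniformly in [0, delay] to spread out retries.
            delay = rng.uniform(0.0, delay)
        delays.append(delay)
    return delays


print(backoff_delays(5))            # [1.0, 2.0, 4.0, 8.0, 16.0]
print(backoff_delays(7, cap=10.0))  # [1.0, 2.0, 4.0, 8.0, 10.0, 10.0, 10.0]
```

Keeping the schedule in a pure function makes it easy to unit test without sleeping.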

## Health Check Loop

Implement periodic health checks with automatic restart on failure.

In [None]:
import time

def health_check_loop(client, mgr, model_path, check_interval=5, max_checks=5):
    """Periodic health check with auto-restart."""
    consecutive_failures = 0
    max_failures = 3

    for i in range(max_checks):
        try:
            health = client.health()
            print(f"  Check {i+1}: {health.status} (idle={health.slots_idle}, processing={health.slots_processing})")
            consecutive_failures = 0
        except Exception as e:
            consecutive_failures += 1
            print(f"  Check {i+1}: FAILED ({e})")

            if consecutive_failures >= max_failures:
                print(f"  {max_failures} consecutive failures — restarting server...")
                try:
                    mgr.stop_server()
                except Exception:
                    pass
                mgr.start_server(model_path=model_path, gpu_layers=99, ctx_size=2048)
                mgr.wait_until_ready(timeout=60)
                consecutive_failures = 0
                print("  Server restarted successfully")

        time.sleep(check_interval)

    print("Health check loop complete")

health_check_loop(client, mgr, model_path, check_interval=2, max_checks=5)
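
The restart decision in the loop above ("restart after N consecutive failures") can be separated from the I/O so it is testable on its own. This is one possible sketch; `FailureTracker` is a hypothetical class, not a llamatelemetry API.

```python
class FailureTracker:
    """Counts consecutive failed checks and signals when a restart is due."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.consecutive = 0

    def record(self, ok):
        """Record one check result; return True when a restart should happen."""
        if ok:
            self.consecutive = 0
            return False
        self.consecutive += 1
        if self.consecutive >= self.max_failures:
            self.consecutive = 0  # Reset so the restarted server gets a fresh count
            return True
        return False


tracker = FailureTracker(max_failures=3)
results = [True, False, False, True, False, False, False]
print([tracker.record(ok) for ok in results])
# [False, False, False, False, False, False, True]
```

Note that a single successful check resets the counter, so two failures followed by a success never trigger a restart.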

## Session Tracking

Use `session()` to group requests by user or conversation for per-user observability.

In [None]:
# Simulate multiple user sessions
users = [
    ("user-123", ["What is Python?", "How do I install pip?"]),
    ("user-456", ["Explain Docker containers.", "What is Kubernetes?"]),
    ("user-789", ["What is GPU computing?"]),
]

for user_id, questions in users:
    with llamatelemetry.session(f"session-{user_id}", user_id=user_id):
        print(f"\n--- {user_id} ---")
        for q in questions:
            with llamatelemetry.span("user-query", question=q[:50]):
                resp = client.chat.completions.create(
                    messages=[{"role": "user", "content": q}],
                    max_tokens=32, temperature=0.7,
                )
                print(f"  Q: {q}")
                print(f"  A: {resp.choices[0].message.content[:80]}...")

print("\nAll sessions complete — traces grouped by session_id in your backend.")

## Graceful Shutdown

Always follow the proper shutdown sequence to ensure all telemetry is exported
and resources are released.

In [None]:
import signal

def graceful_shutdown(mgr, timeout_s=5.0):
    """Production shutdown sequence."""
    print("Starting graceful shutdown...")

    # Step 1: Stop accepting new requests
    print("  1. Stopping server...")
    try:
        mgr.stop_server()
        print("     Server stopped")
    except Exception as e:
        print(f"     Server stop warning: {e}")

    # Step 2: Flush pending telemetry
    print("  2. Flushing telemetry...")
    llamatelemetry.flush(timeout_s=timeout_s)
    print("     Telemetry flushed")

    # Step 3: Shutdown SDK
    print("  3. Shutting down SDK...")
    llamatelemetry.shutdown(timeout_s=timeout_s)
    print("     SDK shutdown complete")

    # Step 4: Verify GPU resources released
    print("  4. Verifying resource cleanup...")
    try:
        from llamatelemetry.gpu import snapshot
        snaps = snapshot()
        for s in snaps:
            print(f"     GPU {s.gpu_id}: {s.mem_used_mb} MB used")
    except Exception:
        print("     GPU check skipped (SDK shut down)")

    print("\nGraceful shutdown complete.")

graceful_shutdown(mgr)
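
In a long-running service, a shutdown sequence like this is typically triggered by a termination signal rather than called by hand. The sketch below shows one way to wire that up with the standard `signal` module. `make_shutdown_handler` and `install_handlers` are hypothetical helpers, and the lambda stands in for a call such as `graceful_shutdown(mgr)`.

```python
import signal


def make_shutdown_handler(shutdown_fn):
    """Build a signal handler that runs `shutdown_fn` at most once."""
    state = {"done": False}

    def handler(signum, frame):
        if state["done"]:
            return  # Ignore repeated signals while shutting down
        state["done"] = True
        shutdown_fn()

    return handler


def install_handlers(handler, signals=(signal.SIGTERM,)):
    """Register `handler` for each signal; must run in the main thread."""
    for sig in signals:
        signal.signal(sig, handler)


calls = []
handler = make_shutdown_handler(lambda: calls.append("shutdown"))

# Simulate two SIGTERM deliveries by calling the handler directly.
handler(signal.SIGTERM, None)
handler(signal.SIGTERM, None)
print(calls)  # ['shutdown']
```

At startup a service would call `install_handlers(handler)`. Only `SIGTERM` is registered by default here because notebook kernels use `SIGINT` to interrupt cells.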

## Production Deployment Checklist

### Before Deployment
- [ ] Model validated with `parse_gguf_header()` and `validate_gguf()`
- [ ] VRAM requirements calculated for target hardware
- [ ] Tensor split configured for multi-GPU
- [ ] Context window sized for workload
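
For the VRAM and context-window items, a rough estimate can be sketched as model weights plus KV cache plus a fixed allowance. The KV-cache term uses the usual transformer formula (keys and values, per layer, per position). All model dimensions and the 512 MiB overhead below are illustrative assumptions, not measurements of any particular model.

```python
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) per layer, per position, per KV head."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


def estimate_vram_mib(model_file_bytes, kv_bytes, overhead_mib=512):
    """Weights + KV cache + an assumed allowance for buffers and runtime."""
    return (model_file_bytes + kv_bytes) / 2**20 + overhead_mib


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, f16 cache.
kv = kv_cache_bytes(n_layers=32, n_ctx=2048, n_kv_heads=8, head_dim=128)
print(kv / 2**20)                         # 256.0 (MiB)
print(estimate_vram_mib(4 * 2**30, kv))   # 4864.0 (MiB) for a 4 GiB model file
```

Because the KV cache grows linearly with `n_ctx`, doubling the context window doubles that term, which is why context size belongs in the same calculation as the weights.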

### Secrets & Auth
- [ ] `HF_TOKEN` set for model downloads
- [ ] `GRAFANA_CLOUD_*` set for telemetry export
- [ ] `GRAPHISTRY_*` set for visualization (optional)
- [ ] No hardcoded credentials in code

### Observability
- [ ] Sampling strategy configured (`ratio` with 1-10%)
- [ ] Prompt redaction enabled if handling user data
- [ ] Custom metrics defined for business KPIs
- [ ] Session tracking for per-user observability
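
As background for the sampling item, ratio sampling is commonly implemented by comparing part of the trace ID against a threshold, so every service makes the same keep/drop decision for a given trace. The following is a simplified sketch of that idea, not the algorithm llamatelemetry necessarily uses. `should_sample` is a hypothetical function.

```python
import random


def should_sample(trace_id, ratio):
    """Keep a trace if the low 64 bits of its ID fall below ratio * 2**64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    threshold = int(ratio * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < threshold


# Simulate 10,000 random 128-bit trace IDs with a fixed seed.
rng = random.Random(0)
trace_ids = [rng.getrandbits(128) for _ in range(10_000)]
kept = sum(should_sample(t, 0.1) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces at ratio 0.1")
```

The decision depends only on the trace ID, so repeated calls for the same trace always agree, and a ratio of 0.1 keeps roughly one trace in ten.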

### Reliability
- [ ] Error handling with retries and fallbacks
- [ ] Health check loop with auto-restart
- [ ] Graceful shutdown with `flush()` + `shutdown()`
- [ ] GPU monitoring with `start_sampler()`

### Performance
- [ ] Inference latency benchmarked
- [ ] Throughput tested under load
- [ ] Memory usage profiled over time
- [ ] Flash attention enabled for memory savings