# Quick Start with llamatelemetry v1.2.0

**Duration:** ~10 min | **Platform:** Kaggle dual Tesla T4

This notebook walks you through the fundamentals of **llamatelemetry** — a CUDA-first
OpenTelemetry Python SDK for LLM inference on Kaggle.

### What you'll learn
1. Install and initialize the SDK
2. Verify GPU availability
3. Download a GGUF model
4. Start a llama-server and run inference
5. Monitor GPU memory usage

In [None]:
# Install llamatelemetry v1.2.0 from GitHub
!pip install -q git+https://github.com/llamatelemetry/llamatelemetry.git@v1.2.0

## Initialize the SDK

A single `init()` call configures tracing, GPU monitoring, and the llama-server runtime.

In [None]:
import llamatelemetry

llamatelemetry.init(service_name="quickstart")
print(f"llamatelemetry {llamatelemetry.version()}")

# Verify GPU availability
devices = llamatelemetry.gpu.list_devices()
for d in devices:
    print(f"  GPU {d.id}: {d.name} — {d.memory_total_mb} MB VRAM (SM {d.compute_capability})")
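
As an independent cross-check of the device list, the driver's own `nvidia-smi` tool can be queried directly. The sketch below is not part of llamatelemetry: `parse_gpu_csv` and `query_gpus` are hypothetical helper names, and the code assumes `nvidia-smi` is on the `PATH` (it returns an empty list otherwise).

```python
import shutil
import subprocess

# Real nvidia-smi flags: one CSV row per GPU, no header, numbers without units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse `index, name, memory.total` CSV rows into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        parts = [field.strip() for field in line.split(",")]
        if len(parts) < 3:
            continue  # skip blank or malformed rows
        gpus.append(
            {
                "index": int(parts[0]),
                "name": ", ".join(parts[1:-1]),
                "memory_total_mib": int(parts[-1]),
            }
        )
    return gpus


def query_gpus():
    """Return GPU info from nvidia-smi, or [] if the tool is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    return parse_gpu_csv(result.stdout) if result.returncode == 0 else []


for gpu in query_gpus():
    print(f"GPU {gpu['index']}: {gpu['name']} ({gpu['memory_total_mib']} MiB)")
```

On a dual-T4 Kaggle session this should list two devices; if it disagrees with `list_devices()`, the accelerator setting of the notebook is the first thing to check.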

## Download a GGUF Model

We'll use **Gemma-3 1B Q4_K_M**, a compact 4-bit quantized model that loads quickly and fits easily on a single T4.

In [None]:
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="bartowski/google_gemma-3-1b-it-GGUF",
    filename="google_gemma-3-1b-it-Q4_K_M.gguf",
    cache_dir="/root/.cache/huggingface",
)
print(f"Model downloaded: {model_path}")
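
A quick sanity check on the downloaded file can catch a truncated or mislabeled download before the server tries to load it. GGUF files start with the four ASCII bytes `GGUF` followed by a 32-bit version number. The sketch below reads just those eight bytes; `read_gguf_version` is a hypothetical helper, not a llamatelemetry or `huggingface_hub` function, and it assumes the usual little-endian layout.

```python
import struct


def read_gguf_version(path):
    """Return the GGUF version if the file starts with the GGUF magic, else None."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        return None
    # Bytes 4..7 hold the format version as an unsigned 32-bit little-endian integer.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

In the notebook this would be called as `read_gguf_version(model_path)`; a return value of `None` suggests the file is not a valid GGUF model.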

## Start the Server

`ServerManager` wraps the bundled llama-server binary. It handles process lifecycle,
health checks, and readiness polling automatically.

In [None]:
from llamatelemetry.llama import ServerManager

mgr = ServerManager()
mgr.start_server(model_path=model_path, gpu_layers=99, ctx_size=2048)
mgr.wait_until_ready(timeout=60)
print("Server is ready!")
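
To illustrate what readiness polling involves, here is a minimal sketch of the general pattern: call a health check repeatedly until it succeeds or a deadline passes. This is not `ServerManager`'s actual implementation. `wait_until` and `server_is_healthy` are hypothetical names, and the URL assumes the server listens on port 8090 (the port used by the client in this notebook) and exposes llama-server's `/health` endpoint.

```python
import time
import urllib.error
import urllib.request


def wait_until(check, timeout=60.0, interval=0.5):
    """Call check() until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def server_is_healthy(url="http://127.0.0.1:8090/health"):
    """True if the health endpoint answers with HTTP 200 (assumed URL)."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status such as 503 while loading.
        return False
```

Calling `wait_until(server_is_healthy, timeout=60)` would then play a role similar to `mgr.wait_until_ready(timeout=60)`. Using `time.monotonic()` keeps the deadline unaffected by system clock adjustments.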

## Run Inference

`LlamaCppClient` provides both the native llama.cpp completion API and an
OpenAI-compatible chat API.

In [None]:
from llamatelemetry.llama import LlamaCppClient

# Assumes llama-server listens on port 8090; change this if the server was started on another port
client = LlamaCppClient(base_url="http://127.0.0.1:8090")

# --- Chat completion ---
response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Explain GPU tensor parallelism in two sentences."}],
    max_tokens=128,
    temperature=0.7,
)
print("Chat response:")
print(response.choices[0].message.content)

# --- Streaming ---
print("\nStreaming response:")
for chunk in client.chat.completions.create(
    messages=[{"role": "user", "content": "What is GGUF format?"}],
    max_tokens=128,
    stream=True,
):
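    choice = chunk.choices[0]
    # OpenAI-style chat chunks carry text in delta.content; fall back to .text otherwise
    delta = getattr(choice, "delta", None)
    text = getattr(delta, "content", None) or getattr(choice, "text", None)
    if text:
        print(text, end="", flush=True)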
print()
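
For context on what an OpenAI-compatible streaming response looks like on the wire: the server sends server-sent events, one `data: {...}` line per chunk, with the generated text in `choices[0].delta.content` and a final `data: [DONE]` marker. The sketch below parses such lines without any client library. `iter_stream_text` is a hypothetical helper and the sample lines are made up for illustration, not captured from a real server.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style chat-completion SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Illustrative sample only; a real stream carries additional fields per chunk.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "GGUF is "}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "a file format."}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

The first chunk of a stream typically carries only the role, which is why empty or missing `content` values are skipped.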

## Monitor GPU

`gpu.snapshot()` returns a point-in-time reading of every GPU's utilization,
memory, power, and temperature.

In [None]:
snapshots = llamatelemetry.gpu.snapshot()
for s in snapshots:
    print(
        f"GPU {s.gpu_id}: {s.mem_used_mb}/{s.mem_total_mb} MB "
        f"({s.utilization_pct}% util, {s.temp_c}°C, {s.power_w:.0f} W)"
    )

# Graceful shutdown — stops the server and flushes telemetry
mgr.stop_server()
llamatelemetry.shutdown()
print("\nDone — all resources released.")
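
A single snapshot shows memory at one instant. To see the peak over a whole run, several readings can be collected while the server is still running (before the shutdown call) and then reduced per GPU. The sketch below shows one way to do that reduction. `Reading` is a hypothetical stand-in that mirrors three of the snapshot fields printed above, and the numbers are invented for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a snapshot entry; only the fields used here.
Reading = namedtuple("Reading", ["gpu_id", "mem_used_mb", "mem_total_mb"])


def peak_memory(readings):
    """Map gpu_id -> (peak used MB, percent of total memory at that peak)."""
    peaks = {}
    for r in readings:
        best = peaks.get(r.gpu_id)
        if best is None or r.mem_used_mb > best[0]:
            pct = round(100 * r.mem_used_mb / r.mem_total_mb, 1)
            peaks[r.gpu_id] = (r.mem_used_mb, pct)
    return peaks


# Invented readings for two GPUs with 15360 MB each.
history = [
    Reading(0, 1200, 15360),
    Reading(1, 300, 15360),
    Reading(0, 1536, 15360),
]
print(peak_memory(history))
```

With real data, `history` would be built by appending the relevant fields from each `gpu.snapshot()` call made during inference.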