# Advanced Server Configuration

**Duration:** ~15 min | **Platform:** Kaggle dual Tesla T4

This notebook explores the full server configuration surface: presets, custom
parameters, health monitoring, tensor-split modes, and performance benchmarking.

### What you'll learn
1. Use built-in server presets for Kaggle
2. Start servers with custom configuration
3. Monitor server health
4. Choose tensor-split modes for dual GPUs
5. Benchmark inference performance

In [None]:
!pip install -q git+https://github.com/llamatelemetry/llamatelemetry.git@v1.2.0

import llamatelemetry
llamatelemetry.init(service_name="server-config")

## Server Presets for Kaggle

llamatelemetry ships with optimized presets for common GPU configurations.
Each preset specifies context size, batch size, tensor split, and other parameters
tuned for that hardware.

In [None]:
from llamatelemetry.kaggle import ServerPreset, get_preset_config

# List all available presets
for preset in ServerPreset:
    config = get_preset_config(preset)
    print(f"{preset.name:20s}  ctx={config.ctx_size:>5d}  batch={config.batch_size:>4d}  "
          f"gpu_layers={config.gpu_layers:>3d}  split={config.tensor_split or 'none'}")

## Quick Start with Presets

`quick_start()` is a one-liner that selects the right preset, starts the server,
and waits for readiness.

In [None]:
from huggingface_hub import hf_hub_download
from llamatelemetry.llama import quick_start

model_path = hf_hub_download(
    repo_id="bartowski/google_gemma-3-1b-it-GGUF",
    filename="google_gemma-3-1b-it-Q4_K_M.gguf",
    cache_dir="/root/.cache/huggingface",
)

mgr = quick_start(model_path, preset="kaggle_t4_dual")
print("Server started with Kaggle dual T4 preset")

## Custom Configuration

For fine-grained control, pass individual parameters to `start_server()`.

In [None]:
from llamatelemetry.llama import ServerManager

# Stop the preset server first
mgr.stop_server()

# Start with custom parameters
mgr = ServerManager()
mgr.start_server(
    model_path=model_path,
    port=8090,
    gpu_layers=99,
    ctx_size=4096,         # larger context window
    batch_size=1024,       # larger batch for throughput
    ubatch_size=256,       # micro-batch for memory efficiency
    n_parallel=2,          # concurrent request slots
    tensor_split="0.5,0.5",  # equal split across 2 GPUs
    flash_attn=True,       # enable flash attention
)
mgr.wait_until_ready(timeout=60)
print("Custom server ready")
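
How `ctx_size` and `n_parallel` interact is easy to miss. The sketch below assumes the llama.cpp server divides the total context evenly across its request slots, which is how it has traditionally behaved; check this against the build you are running. The helpers `per_slot_ctx` and `check_batch_sizes` are hypothetical names used for illustration, not part of llamatelemetry.

```python
def per_slot_ctx(ctx_size: int, n_parallel: int) -> int:
    """Context tokens available to each slot, assuming an even split."""
    if ctx_size <= 0 or n_parallel <= 0:
        raise ValueError("ctx_size and n_parallel must be positive")
    return ctx_size // n_parallel


def check_batch_sizes(batch_size: int, ubatch_size: int) -> bool:
    """True when the micro-batch is positive and fits inside the logical batch."""
    return 0 < ubatch_size <= batch_size


# Values from the custom configuration above
print(per_slot_ctx(4096, 2))         # 2048
print(check_batch_sizes(1024, 256))  # True
```

Under that assumption, each of the two slots can hold about 2048 tokens of prompt plus completion, so raising `n_parallel` without raising `ctx_size` shortens the usable context per request.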

## Health Monitoring

Use the `LlamaCppClient` to query server health, slot status, and performance metrics.

In [None]:
from llamatelemetry.llama import LlamaCppClient

client = LlamaCppClient(base_url="http://127.0.0.1:8090")

# Health status
health = client.health()
print(f"Status: {health.status}")
print(f"Idle slots: {health.slots_idle}, Processing: {health.slots_processing}")

# Server properties
props = client.props()
print("\nServer properties:")
for k, v in props.items():
    print(f"  {k}: {v}")

# Slot details
slots = client.slots.list()
for slot in slots:
    print(f"\nSlot {slot.id}: processing={slot.is_processing}, ctx={slot.n_ctx}")
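
For scripts that do not use `LlamaCppClient`, the same readiness check can be sketched with the standard library alone. This assumes the server exposes llama.cpp's `GET /health` endpoint, which returns a JSON body with `"status": "ok"` once the model is loaded and an HTTP error while it is still loading. The function names and the injectable `fetch`, `sleep`, and `clock` parameters are illustrative choices, not part of any library.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_health(base_url: str, timeout: float = 2.0) -> dict:
    """GET {base_url}/health and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return json.load(resp)


def wait_for_ok(base_url, timeout=60.0, interval=1.0,
                fetch=fetch_health, sleep=time.sleep, clock=time.monotonic):
    """Poll the health endpoint until it reports 'ok' or the timeout expires."""
    deadline = clock() + timeout
    while True:
        try:
            if fetch(base_url).get("status") == "ok":
                return True
        except (urllib.error.URLError, OSError, ValueError):
            pass  # server unreachable, still loading, or body was not JSON
        if clock() >= deadline:
            return False
        sleep(interval)


# Example (requires a running server):
# wait_for_ok("http://127.0.0.1:8090", timeout=60)
```

Passing `fetch` as a parameter keeps the polling loop testable without a live server.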

## Tensor Split Modes

On dual-GPU systems, the tensor-split setting controls what proportion of the
model's layers each GPU holds. llamatelemetry provides named modes for common
configurations.

In [None]:
from llamatelemetry.kaggle import TensorSplitMode

print("Available tensor-split modes:")
for mode in TensorSplitMode:
    split_str = mode.to_string() or "N/A"
    print(f"  {mode.name:15s}  → {split_str}")

# Recommended for dual T4:
print(f"\nRecommended for dual T4: {TensorSplitMode.DUAL_50_50.value}")
print(f"For LLM+RAPIDS split: {TensorSplitMode.NONE.value} (single GPU)")
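
To build intuition for what a split string means, the following sketch normalizes the proportions and estimates how many layers would land on each GPU. It is illustrative arithmetic only: the real assignment inside llama.cpp may differ, and the layer counts and the helper names `parse_split` and `layers_per_gpu` are assumptions made for the example.

```python
def parse_split(split: str) -> list[float]:
    """Turn a string such as '0.5,0.5' or '3,1' into fractions that sum to 1."""
    parts = [float(p) for p in split.split(",")]
    total = sum(parts)
    if total <= 0 or any(p < 0 for p in parts):
        raise ValueError(f"invalid tensor split: {split!r}")
    return [p / total for p in parts]


def layers_per_gpu(n_layers: int, split: str) -> list[int]:
    """Approximate layer counts per GPU using cumulative rounded boundaries."""
    counts, assigned, cumulative = [], 0, 0.0
    for fraction in parse_split(split):
        cumulative += fraction
        boundary = round(cumulative * n_layers)
        counts.append(boundary - assigned)
        assigned = boundary
    return counts


print(layers_per_gpu(26, "0.5,0.5"))  # [13, 13]
print(layers_per_gpu(32, "3,1"))      # [24, 8]
```

The proportions are relative, so `"3,1"` and `"0.75,0.25"` describe the same split.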

## Performance Benchmarking

Measure inference latency and throughput by running multiple prompts.

In [None]:
import time

prompts = [
    "Explain the difference between Q4_K_M and Q5_K_M quantization.",
    "What are the benefits of tensor parallelism for LLM inference?",
    "How does flash attention reduce memory usage?",
    "Describe the GGUF file format in one paragraph.",
]

results = []
for prompt in prompts:
    t0 = time.perf_counter()
    resp = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
        temperature=0.7,
    )
    # End-to-end latency: covers prompt processing as well as token generation
    elapsed_ms = (time.perf_counter() - t0) * 1000
    tokens = resp.usage.completion_tokens
    tps = tokens / (elapsed_ms / 1000) if elapsed_ms > 0 else 0.0
    results.append((elapsed_ms, tokens, tps))
    print(f"  {elapsed_ms:7.0f} ms | {tokens:3d} tokens | {tps:5.1f} tok/s")

# Mean latency and mean of the per-request throughputs across all prompts
avg_ms = sum(r[0] for r in results) / len(results)
avg_tps = sum(r[2] for r in results) / len(results)
print(f"\nAverage: {avg_ms:.0f} ms, {avg_tps:.1f} tok/s")

# Cleanup
mgr.stop_server()
llamatelemetry.shutdown()
print("Done.")
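
Averaging per-request tok/s can overstate throughput when request durations vary, because short requests count as much as long ones. A minimal sketch of a summary that reports the median latency and the aggregate throughput (total tokens divided by total time) is shown below. It works on the in-memory `results` list of `(elapsed_ms, tokens, tps)` tuples, so it needs no running server; `summarize` is a hypothetical helper and the sample numbers are made up.

```python
import statistics


def summarize(results):
    """Summarize a list of (elapsed_ms, tokens, tps) tuples."""
    if not results:
        raise ValueError("no results to summarize")
    latencies = sorted(r[0] for r in results)
    total_tokens = sum(r[1] for r in results)
    total_s = sum(latencies) / 1000
    return {
        "median_ms": statistics.median(latencies),
        "max_ms": latencies[-1],
        "aggregate_tps": total_tokens / total_s if total_s > 0 else 0.0,
    }


# Made-up sample: one fast and one slow request of equal length
sample = [(1000.0, 100, 100.0), (3000.0, 100, 33.3)]
print(summarize(sample))
# {'median_ms': 2000.0, 'max_ms': 3000.0, 'aggregate_tps': 50.0}
```

For this sample the mean of the per-request rates is about 66.7 tok/s, while the aggregate rate is 50 tok/s, which is the figure that reflects how long the whole batch took.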