# Multi-GPU Inference with Tensor Split

**Duration:** ~20 min | **Platform:** Kaggle dual Tesla T4 (2 × 16 GB VRAM, about 15 GB usable each)

This notebook shows how to use both T4 GPUs for LLM inference, compare
tensor-split strategies, and run an LLM alongside RAPIDS workloads with one
GPU dedicated to each.

### What you'll learn
1. Verify and configure dual GPUs
2. Compare and benchmark tensor-split strategies
3. Run split-GPU mode (LLM on GPU 0, RAPIDS on GPU 1)
4. Pin work to specific GPUs with context managers
5. Monitor GPUs in the background during inference

In [None]:
!pip install -q git+https://github.com/llamatelemetry/llamatelemetry.git@v1.0.0

import llamatelemetry
from llamatelemetry.gpu import list_devices, snapshot, start_sampler

llamatelemetry.init(service_name="multi-gpu")

# Verify dual GPUs
devices = list_devices()
print(f"Found {len(devices)} GPU(s):")
for d in devices:
    print(f"  GPU {d.id}: {d.name} — {d.memory_total_mb} MB (SM {d.compute_capability})")
assert len(devices) >= 2, "This notebook requires 2 GPUs"

## Tensor-Split Strategies

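Tensor-split controls how model weights are distributed across GPUs. The value
is a comma-separated list of proportions, one per GPU in device order:
`0.5,0.5` gives each GPU roughly half the model, while `0.7,0.3` places about
70% of it on GPU 0.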
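
To make the proportions concrete, the sketch below normalizes a split string and estimates how many layers each GPU would receive. It is an illustration only: `parse_tensor_split`, `layers_per_gpu`, and the 32-layer count are hypothetical, and the backend's real assignment may round differently or account for tensors that are not part of a layer.

```python
from itertools import accumulate


def parse_tensor_split(spec: str) -> list[float]:
    """Turn a string such as "0.7,0.3" or "3,1" into fractions that sum to 1."""
    weights = [float(part) for part in spec.split(",")]
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError(f"invalid tensor split: {spec!r}")
    return [w / total for w in weights]


def layers_per_gpu(n_layers: int, fractions: list[float]) -> list[int]:
    """Estimate per-GPU layer counts by rounding cumulative boundaries."""
    bounds = [round(c * n_layers) for c in accumulate(fractions)]
    bounds[-1] = n_layers  # guard against floating-point drift in the last boundary
    starts = [0] + bounds[:-1]
    return [end - start for start, end in zip(starts, bounds)]


# Hypothetical 32-layer model
for spec in ("0.5,0.5", "0.7,0.3", "1.0,0.0"):
    print(spec, layers_per_gpu(32, parse_tensor_split(spec)))
```

Because the values are normalized, `"3,1"` and `"0.75,0.25"` describe the same split in this sketch.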

In [None]:
import time
from huggingface_hub import hf_hub_download
from llamatelemetry.llama import ServerManager, LlamaCppClient

model_path = hf_hub_download(
    repo_id="bartowski/google_gemma-3-1b-it-GGUF",
    filename="google_gemma-3-1b-it-Q4_K_M.gguf",
    cache_dir="/root/.cache/huggingface",
)

# Start with 50/50 split
mgr = ServerManager()
mgr.start_server(model_path=model_path, gpu_layers=99, tensor_split="0.5,0.5", ctx_size=2048)
mgr.wait_until_ready(timeout=60)

# Check memory distribution
snaps = snapshot()
print("Memory after 50/50 split:")
for s in snaps:
    print(f"  GPU {s.gpu_id}: {s.mem_used_mb}/{s.mem_total_mb} MB")

mgr.stop_server()

## Performance Comparison

Benchmark inference across different tensor-split configurations to find the
optimal setting for your model and workload.
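
When comparing configurations, it helps to compute latency and throughput from every run rather than from a single response. The helper below is a minimal sketch of that arithmetic; `summarize_runs` is a hypothetical name, and the timings in the example are made up for illustration.

```python
from statistics import mean


def summarize_runs(runs: list[tuple[float, int]]) -> dict[str, float]:
    """Summarize (elapsed_seconds, completion_tokens) pairs from repeated runs."""
    if not runs:
        raise ValueError("need at least one run")
    total_s = sum(seconds for seconds, _ in runs)
    total_tokens = sum(tokens for _, tokens in runs)
    return {
        "avg_ms": mean(seconds for seconds, _ in runs) * 1000,
        "tok_per_s": total_tokens / total_s if total_s > 0 else 0.0,
    }


# Made-up timings for illustration, not measured results
example = summarize_runs([(2.0, 128), (2.2, 128), (1.8, 120)])
print(f"{example['avg_ms']:.0f} ms avg, {example['tok_per_s']:.1f} tok/s")
```

Dividing total tokens by total time weights each run by its duration, which avoids skew when a run stops early and returns fewer tokens.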

In [None]:
splits_to_test = [
    ("50/50", "0.5,0.5"),
    ("70/30", "0.7,0.3"),
    ("100/0", "1.0,0.0"),
]

benchmark_prompt = "Explain how tensor parallelism works for large language models in detail."
benchmark_results = []

for name, split in splits_to_test:
    mgr = ServerManager()
    mgr.start_server(model_path=model_path, gpu_layers=99, tensor_split=split, ctx_size=2048)
    mgr.wait_until_ready(timeout=60)
    client = LlamaCppClient(base_url="http://127.0.0.1:8090")

    # Warm-up
    client.chat.completions.create(
        messages=[{"role": "user", "content": "Hi"}], max_tokens=16
    )

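    # Benchmark: time three runs and record the tokens each one generates
    times, token_counts = [], []
    for _ in range(3):
        t0 = time.perf_counter()
        resp = client.chat.completions.create(
            messages=[{"role": "user", "content": benchmark_prompt}],
            max_tokens=128, temperature=0.7,
        )
        times.append(time.perf_counter() - t0)
        token_counts.append(resp.usage.completion_tokens)

    avg_s = sum(times) / len(times)
    # End-to-end throughput over all runs (includes prompt processing time)
    total_s = sum(times)
    tps = sum(token_counts) / total_s if total_s > 0 else 0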
    mem = snapshot()
    benchmark_results.append((name, avg_s * 1000, tps, mem))
    print(f"  {name}: {avg_s*1000:.0f} ms avg, {tps:.1f} tok/s")

    mgr.stop_server()
    time.sleep(2)

print("\nBenchmark complete.")

## Split-GPU Mode (LLM + RAPIDS)

Run the LLM server on GPU 0 while reserving GPU 1 for RAPIDS/cuDF analytics.
Use `tensor_split="1.0,0.0"` to confine the model to GPU 0.
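
One way to confirm that the model stayed on GPU 0 is to compare per-GPU memory before and after the server starts. The sketch below assumes snapshot objects expose `gpu_id` and `mem_used_mb`, as the objects printed elsewhere in this notebook do. `GpuSnap`, `memory_delta_mb`, and all numbers are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class GpuSnap:
    """Hypothetical stand-in for a snapshot with the two fields used here."""
    gpu_id: int
    mem_used_mb: int


def memory_delta_mb(before, after) -> dict[int, int]:
    """Return the change in used memory per GPU between two snapshot lists."""
    base = {s.gpu_id: s.mem_used_mb for s in before}
    return {s.gpu_id: s.mem_used_mb - base.get(s.gpu_id, 0) for s in after}


# Illustrative numbers only, not measurements
before = [GpuSnap(0, 300), GpuSnap(1, 300)]
after = [GpuSnap(0, 1500), GpuSnap(1, 310)]
delta = memory_delta_mb(before, after)
busy = [gpu for gpu, mb in delta.items() if mb > 100]
print(f"delta per GPU: {delta}; GPUs that grew by more than 100 MB: {busy}")
```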

In [None]:
from llamatelemetry.kaggle import rapids_gpu

# LLM on GPU 0 only
mgr = ServerManager()
mgr.start_server(model_path=model_path, gpu_layers=99, tensor_split="1.0,0.0", ctx_size=2048)
mgr.wait_until_ready(timeout=60)
client = LlamaCppClient(base_url="http://127.0.0.1:8090")

# RAPIDS work on GPU 1
with rapids_gpu(1):
    try:
        import cudf
        df = cudf.DataFrame({"text": ["hello", "world"], "score": [0.9, 0.8]})
        print(f"cuDF DataFrame on GPU 1: {len(df)} rows")
    except ImportError:
        print("cuDF not available — RAPIDS context still set CUDA_VISIBLE_DEVICES=1")

# LLM inference on GPU 0; the server process stays resident there while the notebook uses GPU 1
resp = client.chat.completions.create(
    messages=[{"role": "user", "content": "Summarize split-GPU architecture."}],
    max_tokens=64,
)
print(f"\nLLM response: {resp.choices[0].message.content}")

## GPU Context Managers

llamatelemetry provides context managers to temporarily pin operations to specific GPUs.
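
A context manager of this kind can be sketched in a few lines with the standard library. This is not llamatelemetry's implementation, only an assumption about the general pattern: `visible_gpus` is a hypothetical name that sets `CUDA_VISIBLE_DEVICES` on entry and restores the previous value on exit.

```python
import os
from contextlib import contextmanager


@contextmanager
def visible_gpus(gpu_ids):
    """Temporarily set CUDA_VISIBLE_DEVICES, then restore the previous value."""
    previous = os.environ.get("CUDA_VISIBLE_DEVICES")
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    try:
        yield
    finally:
        # Restore the exact prior state, including "variable was unset"
        if previous is None:
            os.environ.pop("CUDA_VISIBLE_DEVICES", None)
        else:
            os.environ["CUDA_VISIBLE_DEVICES"] = previous


with visible_gpus([1]):
    print("inside:", os.environ["CUDA_VISIBLE_DEVICES"])
print("after:", os.environ.get("CUDA_VISIBLE_DEVICES", "unset"))
```

The variable is read when CUDA is first initialized, so changing it affects subprocesses launched and libraries initialized afterwards, not CUDA contexts that already exist in the current process.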

In [None]:
from llamatelemetry.kaggle import GPUContext, llm_gpu, single_gpu
import os

# Explicit GPU context
with GPUContext(gpu_ids=[0, 1]):
    print(f"Both GPUs visible: CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES', 'all')}")

with single_gpu(0):
    print(f"GPU 0 only: CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES', 'all')}")

with llm_gpu([0]):
    print(f"LLM GPU context: CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES', 'all')}")

print(f"Restored: CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES', 'all')}")

## GPU Monitoring During Inference

`start_sampler()` launches a background thread that periodically captures GPU
metrics, allowing you to profile memory and utilization over time.
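
The background-sampling pattern itself can be sketched with a thread and an event. The class below is a generic illustration, not the library's code: `BackgroundSampler` and its `read_fn` parameter are hypothetical, and the stand-in reader returns a timestamp instead of querying the GPU driver.

```python
import threading
import time


class BackgroundSampler:
    """Collect readings from `read_fn` on a background thread until stopped."""

    def __init__(self, read_fn, interval_ms=500):
        self._read_fn = read_fn
        self._interval_s = interval_ms / 1000
        self._stop_event = threading.Event()
        self._samples = []
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()
        return self

    def _run(self):
        # Sample once immediately, then once per interval until stop() is called
        self._samples.append(self._read_fn())
        while not self._stop_event.wait(self._interval_s):
            self._samples.append(self._read_fn())

    def stop(self):
        self._stop_event.set()
        self._thread.join()
        return list(self._samples)


# Stand-in reader: a real one would query the GPU driver instead
sampler = BackgroundSampler(lambda: {"t": time.time()}, interval_ms=10).start()
time.sleep(0.05)  # the "workload"
readings = sampler.stop()
print(f"collected {len(readings)} readings")
```

Waiting on the event rather than calling `time.sleep` lets `stop()` return promptly instead of blocking for a full interval.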

In [None]:
# Start background GPU sampler
handle = start_sampler(interval_ms=500)

# Run inference workload
for i in range(5):
    client.chat.completions.create(
        messages=[{"role": "user", "content": f"Tell me fact #{i+1} about neural networks."}],
        max_tokens=64,
    )

# Retrieve collected snapshots
handle.stop()
samples = handle.get_snapshots()
print(f"Collected {len(samples)} GPU snapshots during inference:")
for s in samples[:6]:  # show first 6
    print(f"  t={s.timestamp:.1f}  GPU {s.gpu_id}: {s.mem_used_mb} MB, {s.utilization_pct}% util")

## Summary

| Strategy | Use Case | Tensor Split |
|----------|----------|-------------|
| **50/50** | Maximum model size | `0.5,0.5` |
| **70/30** | GPU 0 holds more of the model (useful when GPUs differ in speed or free memory) | `0.7,0.3` |
| **100/0** | LLM + RAPIDS split-GPU | `1.0,0.0` |
| **Single GPU** | Small models, other GPU free | `None` + `main_gpu=0` |

In [None]:
mgr.stop_server()
llamatelemetry.shutdown()
print("Done.")