# Running Large Models (13B-70B) on Kaggle

**Duration:** ~35 min | **Platform:** Kaggle dual Tesla T4 (2 × 16 GB cards, about 15 GB usable each)

This notebook covers deploying **large models** (13B to 70B parameters) on Kaggle's
dual T4 GPUs using aggressive quantization and optimal tensor-split configuration.

### What you'll learn
1. VRAM planning for large models
2. Deploy 13B models with Q4_K_M
3. Deploy 70B models with IQ3_XS
4. Memory optimization techniques
5. Context window tuning

In [None]:
!pip install -q git+https://github.com/llamatelemetry/llamatelemetry.git@v1.2.0

import llamatelemetry
from llamatelemetry.gpu import list_devices, snapshot, start_sampler

llamatelemetry.init(service_name="large-models")

# GPU inventory
devices = list_devices()
total_vram = sum(d.memory_total_mb for d in devices)
print(f"GPUs: {len(devices)}")
for d in devices:
    print(f"  GPU {d.id}: {d.name} — {d.memory_total_mb} MB")
print(f"Total VRAM: {total_vram} MB ({total_vram/1024:.1f} GB)")

## VRAM Planning for Large Models

Estimate the footprint with `VRAM (GB) ≈ params (billions) × bits_per_weight / 8 + KV cache + runtime overhead`.

On dual T4 (about 30 GB usable in total, 15 GB per card), the budget has to cover:
- Model weights
- KV cache (grows with context length)
- Runtime overhead (~500 MB per GPU)
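
Because the weights are split across two cards, the budget has to hold on each GPU, not only in total. A minimal sketch of that check follows. It assumes weights and KV cache are divided in proportion to the tensor split and that each card pays a fixed overhead; `per_gpu_budget` and the 15 GB per-card capacity are illustrative assumptions, not part of llamatelemetry.

```python
def per_gpu_budget(model_gb, kv_cache_gb, split=(0.5, 0.5),
                   gpu_capacity_gb=15.0, overhead_per_gpu_gb=0.5):
    """Return a list of (used_gb, free_gb) tuples, one per GPU.

    Assumes weights and KV cache are divided in proportion to `split`
    and that every GPU pays the same fixed runtime overhead.
    """
    total = sum(split)
    budget = []
    for share in split:
        frac = share / total  # normalize so (3, 1) behaves like (0.75, 0.25)
        used = (model_gb + kv_cache_gb) * frac + overhead_per_gpu_gb
        budget.append((used, gpu_capacity_gb - used))
    return budget


# Example: 13B at ~4.5 bits per weight is 13 * 4.5 / 8 GB of weights,
# with an assumed 0.5 GB KV cache.
for gpu, (used, free) in enumerate(per_gpu_budget(13 * 4.5 / 8, kv_cache_gb=0.5)):
    print(f"GPU {gpu}: {used:.2f} GB used, {free:+.2f} GB free")
```

A negative `free` value on either card means the configuration will not load even if the two-card total looks sufficient.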

In [None]:
def estimate_vram(params_b, bpw, ctx_size=2048):
    """Rough total VRAM estimate in GB.

    params_b: parameter count in billions
    bpw: average bits per weight of the quantization
    ctx_size: context length in tokens
    """
    model_gb = params_b * bpw / 8  # billions of params x bits / 8 = GB of weights
    # Heuristic KV cache: ~0.5 MB per 1K tokens of context per billion params
    kv_cache_gb = (ctx_size / 1024) * params_b * 0.0005
    overhead_gb = 1.0  # runtime overhead, ~0.5 GB on each of the two GPUs
    return model_gb + kv_cache_gb + overhead_gb

DUAL_T4_GB = 30.0

scenarios = [
    ("Gemma-3 4B",    4,  "Q4_K_M", 4.5, 2048),
    ("Llama-2 7B",    7,  "Q4_K_M", 4.5, 2048),
    ("Llama-2 7B",    7,  "Q5_K_M", 5.5, 4096),
    ("Llama-2 13B",   13, "Q4_K_M", 4.5, 2048),
    ("Llama-2 13B",   13, "Q4_K_M", 4.5, 4096),
    ("CodeLlama 34B", 34, "Q4_K_M", 4.5, 2048),
    ("Llama-3.1 70B", 70, "IQ3_XS", 3.3, 512),
    ("Llama-3.1 70B", 70, "IQ3_XS", 3.3, 1024),
    ("Llama-3.1 70B", 70, "Q4_K_M", 4.5, 2048),
]

print(f"{'Model':<20} {'Quant':<10} {'Ctx':<6} {'Est VRAM':<10} {'Fits?':<6} Notes")
print("-" * 80)
for name, params, quant, bpw, ctx in scenarios:
    vram = estimate_vram(params, bpw, ctx)
    fits = vram <= DUAL_T4_GB
    margin = DUAL_T4_GB - vram
    note = f"{margin:+.1f} GB margin" if fits else "DOES NOT FIT"
    print(f"{name:<20} {quant:<10} {ctx:<6} {vram:<10.1f} {'Yes' if fits else 'NO':<6} {note}")
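
The KV-cache term in `estimate_vram` is a coarse heuristic. When a model's architecture is known, the cache can be sized more directly: every token stores one key and one value vector per layer per KV head. The sketch below assumes an unquantized 16-bit cache and a single sequence, and uses the published Llama-3.1-70B shape (80 layers, 8 KV heads, head dimension 128) as its example. `kv_cache_gb` is an illustrative helper, not part of llamatelemetry.

```python
def kv_cache_gb(ctx_size, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size in GiB for one sequence of ctx_size tokens."""
    # Factor of 2: one key vector and one value vector per layer per KV head
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return ctx_size * bytes_per_token / 1024**3


# Llama-3.1-70B shape: 80 layers, 8 KV heads (grouped-query attention), head_dim 128
for ctx in (512, 1024, 2048, 4096):
    gb = kv_cache_gb(ctx, n_layers=80, n_kv_heads=8, head_dim=128)
    print(f"ctx={ctx:5d}: {gb:.2f} GiB KV cache")
```

Under these assumptions a 70B model needs about 0.16 GiB of cache at 512 tokens and 1.25 GiB at 4096, which is why the context has to stay small once the weights occupy nearly all of the available memory.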

## Deploying 13B Models

A 13B model at Q4_K_M needs about 7.3 GB for weights, so it fits comfortably on dual T4 and leaves room for larger context windows.

In [None]:
import time
from huggingface_hub import hf_hub_download
from llamatelemetry.llama import ServerManager, LlamaCppClient
from llamatelemetry.kaggle import TensorSplitMode

# This demo downloads a small 1B model as a stand-in so the cell runs quickly.
# To deploy an actual 13B model, swap in for example:
#   repo_id="TheBloke/CodeLlama-13B-Instruct-GGUF"
#   filename="codellama-13b-instruct.Q4_K_M.gguf"
model_path = hf_hub_download(
    repo_id="bartowski/google_gemma-3-1b-it-GGUF",
    filename="google_gemma-3-1b-it-Q4_K_M.gguf",
    cache_dir="/root/.cache/huggingface",
)

# Dual GPU with 50/50 split
split = TensorSplitMode.DUAL_50_50
mgr = ServerManager()
mgr.start_server(
    model_path=model_path,
    gpu_layers=99,
    tensor_split=split.to_string(),
    ctx_size=4096,
    batch_size=512,
)
mgr.wait_until_ready(timeout=120)

client = LlamaCppClient(base_url="http://127.0.0.1:8090")

# Benchmark
t0 = time.perf_counter()
resp = client.chat.completions.create(
    messages=[{"role": "user", "content": "Write a Python function to compute Fibonacci numbers using dynamic programming."}],
    max_tokens=256, temperature=0.7,
)
elapsed = time.perf_counter() - t0

print(f"Response ({resp.usage.completion_tokens} tokens in {elapsed:.1f}s, {resp.usage.completion_tokens/elapsed:.1f} tok/s):")
print(resp.choices[0].message.content)

# Memory distribution
mem = snapshot()
for s in mem:
    print(f"\nGPU {s.gpu_id}: {s.mem_used_mb}/{s.mem_total_mb} MB")

mgr.stop_server()

## Deploying 70B Models with IQ3_XS

A 70B IQ3_XS model needs about 28.9 GB for weights alone, so it barely fits in the 30 GB of a dual T4. Key settings:
- **Tensor split**: 50/50 across both GPUs
- **Context**: 512-1024 max (KV cache is expensive at 70B)
- **Batch size**: Small to reduce memory spikes

In [None]:
# 70B deployment configuration (download separately if needed)
# model_70b = hf_hub_download(
#     repo_id="bartowski/Meta-Llama-3.1-70B-Instruct-GGUF",
#     filename="Meta-Llama-3.1-70B-Instruct-IQ3_XS.gguf",
#     cache_dir="/root/.cache/huggingface",
# )

# Optimal 70B configuration for dual T4:
config_70b = {
    "gpu_layers": 99,
    "tensor_split": TensorSplitMode.DUAL_50_50.to_string(),
    "ctx_size": 512,       # minimal context to save VRAM
    "batch_size": 256,     # small batch to reduce peaks
    "ubatch_size": 64,     # small micro-batch
    "n_parallel": 1,       # single slot only
    "flash_attn": True,    # save memory with flash attention
}

print("70B IQ3_XS configuration for dual T4:")
for k, v in config_70b.items():
    print(f"  {k}: {v}")

print(f"\nEstimated VRAM: {estimate_vram(70, 3.3, 512):.1f} GB / {DUAL_T4_GB:.0f} GB available")
print("Note: this cell only prints the configuration; download the 70B model to deploy with it")
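
For readers who launch llama.cpp's `llama-server` directly instead of going through `ServerManager`, the settings above map roughly onto command-line options. The sketch below only builds an argument list and does not start anything. The flag spellings are assumptions based on llama.cpp's long options, and the form of `--flash-attn` has changed between releases (a bare switch in older builds, an `on|off|auto` value in newer ones), so check `llama-server --help` for your build. How `ServerManager` translates these keys internally is not shown in this notebook; `build_server_args`, `FLAG_MAP`, and the model path are hypothetical.

```python
# Assumed mapping from the config keys used in this notebook
# to llama-server long options (verify with `llama-server --help`).
FLAG_MAP = {
    "gpu_layers": "--n-gpu-layers",
    "tensor_split": "--tensor-split",
    "ctx_size": "--ctx-size",
    "batch_size": "--batch-size",
    "ubatch_size": "--ubatch-size",
    "n_parallel": "--parallel",
}


def build_server_args(model_path, config, port=8090):
    """Build an argv list for llama-server from a config dict (nothing is launched)."""
    args = ["llama-server", "--model", model_path, "--port", str(port)]
    for key, flag in FLAG_MAP.items():
        if key in config:
            args += [flag, str(config[key])]
    if config.get("flash_attn"):
        # Older builds accept the bare switch; newer ones may expect a value
        args.append("--flash-attn")
    return args


demo_config = {
    "gpu_layers": 99,
    "tensor_split": "0.5,0.5",
    "ctx_size": 512,
    "batch_size": 256,
    "ubatch_size": 64,
    "n_parallel": 1,
    "flash_attn": True,
}
print(" ".join(build_server_args("model.gguf", demo_config)))
```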

## Memory Optimization Techniques

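Sample GPU memory in the background while requests run. The gap between minimum and peak usage on each GPU shows how much headroom the current context and batch settings leave.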

In [None]:
# Restart with the demo model for monitoring
mgr = ServerManager()
mgr.start_server(model_path=model_path, gpu_layers=99, tensor_split="0.5,0.5", ctx_size=2048)
mgr.wait_until_ready(timeout=60)
client = LlamaCppClient(base_url="http://127.0.0.1:8090")

# Background GPU sampling during inference
handle = start_sampler(interval_ms=200)

# Generate varying-length responses
for length in [32, 64, 128, 256]:
    resp = client.chat.completions.create(
        messages=[{"role": "user", "content": "Describe the history of computing in detail."}],
        max_tokens=length, temperature=0.7,
    )

handle.stop()
samples = handle.get_snapshots()

# Analyze memory patterns
if samples:
    gpu0_mem = [s.mem_used_mb for s in samples if s.gpu_id == 0]
    gpu1_mem = [s.mem_used_mb for s in samples if s.gpu_id == 1]

    if gpu0_mem:
        print(f"GPU 0 memory: min={min(gpu0_mem)} MB, max={max(gpu0_mem)} MB, avg={sum(gpu0_mem)//len(gpu0_mem)} MB")
    if gpu1_mem:
        print(f"GPU 1 memory: min={min(gpu1_mem)} MB, max={max(gpu1_mem)} MB, avg={sum(gpu1_mem)//len(gpu1_mem)} MB")
    print(f"Collected {len(samples)} samples over {samples[-1].timestamp - samples[0].timestamp:.1f}s")
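
The analysis above hard-codes GPU 0 and GPU 1. A small grouping helper generalizes it to any number of devices. The sketch assumes each sample exposes `gpu_id` and `mem_used_mb`, as the snapshots above do; the `Sample` dataclass is a stand-in so the block runs on its own, and its values are placeholders rather than measurements.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    """Stand-in for a GPU snapshot with the two fields used here."""
    gpu_id: int
    mem_used_mb: int


def summarize_memory(samples):
    """Group samples by GPU and return min/max/avg/swing (all in MB) per device."""
    by_gpu = defaultdict(list)
    for s in samples:
        by_gpu[s.gpu_id].append(s.mem_used_mb)
    return {
        gpu: {
            "min": min(vals),
            "max": max(vals),
            "avg": sum(vals) / len(vals),
            "swing": max(vals) - min(vals),  # peak minus idle usage
        }
        for gpu, vals in sorted(by_gpu.items())
    }


demo = [Sample(0, 900), Sample(1, 850), Sample(0, 1100), Sample(1, 950)]
for gpu, stats in summarize_memory(demo).items():
    print(f"GPU {gpu}: {stats}")
```

A large `swing` relative to the free memory on a card suggests lowering the batch or context size before moving to a bigger model.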

## Context Window Tuning

The KV cache grows linearly with the context window, so larger contexts consume more VRAM. Measure a few sizes to find the largest context that fits your model and available memory.

In [None]:
context_sizes = [512, 1024, 2048, 4096]
ctx_results = []

for ctx in context_sizes:
    # Stop the server left running by the previous step and pause briefly so VRAM is released
    mgr.stop_server()
    time.sleep(2)

    mgr = ServerManager()
    mgr.start_server(model_path=model_path, gpu_layers=99, tensor_split="0.5,0.5", ctx_size=ctx)
    mgr.wait_until_ready(timeout=60)
    client = LlamaCppClient(base_url="http://127.0.0.1:8090")

    mem = snapshot()
    total_mem = sum(s.mem_used_mb for s in mem)

    t0 = time.perf_counter()
    resp = client.chat.completions.create(
        messages=[{"role": "user", "content": "Explain quantum computing."}],
        max_tokens=64, temperature=0.7,
    )
    elapsed = time.perf_counter() - t0

    ctx_results.append((ctx, total_mem, elapsed * 1000))
    print(f"  ctx={ctx:5d}: {total_mem:5d} MB VRAM, {elapsed*1000:.0f} ms latency")

print("\nRecommendation: Use the largest context that leaves ≥2 GB headroom.")
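
The headroom rule can be applied mechanically to the collected measurements. This sketch assumes result tuples shaped like `ctx_results`, that is `(ctx, total_used_mb, latency_ms)`, and a total capacity given in MB. `pick_context` is an illustrative helper, and the demo numbers are placeholders, not measurements.

```python
def pick_context(results, total_vram_mb, headroom_mb=2048):
    """Return the largest ctx whose measured usage leaves >= headroom_mb free, else None."""
    ok = [
        ctx
        for ctx, used_mb, _latency_ms in results
        if total_vram_mb - used_mb >= headroom_mb
    ]
    return max(ok) if ok else None


# Placeholder values for illustration only
demo_results = [(512, 27000, 900.0), (1024, 28200, 950.0), (2048, 29500, 1100.0)]
print(pick_context(demo_results, total_vram_mb=30 * 1024))
```

Returning `None` when nothing qualifies makes the "does not fit with headroom" case explicit instead of silently choosing the smallest context.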

## Performance Comparison

| Model | Quant | VRAM (weights) | Context | Speed | Use Case |
|-------|-------|------|---------|-------|----------|
| 7B | Q4_K_M | ~4 GB | 4096 | Fast | General tasks |
| 13B | Q4_K_M | ~7 GB | 2048-4096 | Good | Better quality |
| 34B | Q4_K_M | ~19 GB | 1024-2048 | Moderate | Specialized tasks |
| 70B | IQ3_XS | ~29 GB | 512-1024 | Slow | Maximum capability |

### Recommendations
- **Start with 7B Q4_K_M** for development and prototyping
- **Use 13B Q4_K_M** when quality matters and speed is acceptable
- **Reserve 70B IQ3_XS** for tasks where model capability is critical
- **Always use flash attention** (`flash_attn=True`) for memory savings

In [None]:
mgr.stop_server()
llamatelemetry.shutdown()
print("Done.")