# Module 02: Disaggregated Prefill-Decode Serving

> **Goal**: Run prefill on GPU 0 and decode on GPU 1, observe KV cache transfer via NIXL/RDMA.

---

## What is Disaggregated Serving?

LLM inference has two phases with very different characteristics:

| Phase | What it does | Bottleneck | GPU Utilization |
|-------|--------------|------------|------------------|
| **Prefill** | Process entire prompt, generate KV cache | Compute-bound | High (matrix multiplications) |
| **Decode** | Generate tokens one-by-one using KV cache | Memory-bound | Low (memory bandwidth limited) |

**The Problem**: Running both phases on the same GPU leads to inefficiency:
- During prefill: GPU compute is saturated, decode requests wait
- During decode: GPU compute is underutilized, prefill requests wait
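
A rough way to see why the two phases behave so differently is arithmetic intensity: FLOPs performed per byte of weights read from memory. The sketch below is a back-of-the-envelope model, not a measurement. It looks at a single linear layer, ignores activation and KV cache traffic, and assumes 2-byte (FP16/BF16) weights; the function name and layer sizes are illustrative.

```python
def arithmetic_intensity(tokens: int, d_in: int, d_out: int, bytes_per_param: int = 2) -> float:
    """FLOPs per byte of weight traffic for one linear layer (simplified model)."""
    flops = 2 * tokens * d_in * d_out              # one multiply and one add per weight per token
    weight_bytes = d_in * d_out * bytes_per_param  # weights are read once per forward pass
    return flops / weight_bytes


# Prefill: all prompt tokens pass through the layer in a single forward pass.
prefill = arithmetic_intensity(tokens=512, d_in=1024, d_out=1024)
# Decode: one new token per step (a single sequence, no batching).
decode = arithmetic_intensity(tokens=1, d_in=1024, d_out=1024)

print(f"prefill: {prefill:.0f} FLOPs/byte")
print(f"decode:  {decode:.0f} FLOPs/byte")
```

Under these assumptions a 512-token prefill does 512 times more work per byte loaded than a single-sequence decode step, which is why decode tends to be limited by memory bandwidth unless many sequences are batched together.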

**The Solution**: Disaggregated serving separates these phases onto different workers:

```
┌─────────────────┐     ┌──────────────────┐
│  Prefill Worker │     │  Decode Worker   │
│     (GPU 0)     │     │     (GPU 1)      │
│                 │     │                  │
│ • Process prompt│────▶│ • Generate tokens│
│ • Generate KV   │ KV  │ • Use KV cache   │
│                 │cache│                  │
└─────────────────┘     └──────────────────┘
```

### The Bootstrap Server

Each worker runs a **bootstrap server** - an HTTP endpoint that coordinates the KV cache transfer handshake:

```
┌─────────────────┐                    ┌─────────────────┐
│  Prefill Worker │                    │  Decode Worker  │
│     (GPU 0)     │                    │     (GPU 1)     │
│                 │                    │                 │
│ ┌─────────────┐ │  1. GET /route     │                 │
│ │ Bootstrap   │◀─────────────────────│                 │
│ │ Server      │ │  "Where should I   │                 │
│ │ (HTTP)      │ │   fetch KV from?"  │                 │
│ └─────────────┘ │                    │                 │
│                 ├──2. Route info────▶│                 │
│                 │  (memory addrs,    │                 │
│                 │   rank info)       │                 │
│                 │                    │                 │
│  KV Cache ──────┼──3. RDMA/NIXL─────▶│  KV Cache       │
│  (GPU memory)   │  Direct Transfer   │  (GPU memory)   │
└─────────────────┘                    └─────────────────┘
```

**The handshake flow:**
1. Decode worker receives a request needing KV cache from prefill
2. Decode queries prefill's bootstrap server (`/route`) to get memory addresses and rank info
3. NIXL performs direct GPU-to-GPU transfer via RDMA (fast, bypasses CPU)

The bootstrap server handles **coordination metadata**, not actual data. The KV cache transfer uses RDMA for speed.
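
The split between control path and data path can be sketched in plain Python. This is a toy, in-process simulation and not Dynamo's or NIXL's actual API: the class names, the `RouteInfo` fields, and the `read_memory` method are all hypothetical stand-ins chosen to mirror the three handshake steps.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class RouteInfo:
    """Hypothetical metadata a bootstrap server might return (field names are illustrative)."""
    rank: int
    buffer_addr: int  # where the KV cache lives in the prefill worker's memory
    num_bytes: int


class ToyPrefillWorker:
    def __init__(self, kv_cache: bytes, buffer_addr: int = 0x1000):
        self.memory = {buffer_addr: kv_cache}  # stand-in for registered GPU memory
        self.route_info = RouteInfo(rank=0, buffer_addr=buffer_addr, num_bytes=len(kv_cache))

    def handle_route(self) -> str:
        """Control path: what a GET /route handler would send back (metadata only)."""
        return json.dumps(asdict(self.route_info))

    def read_memory(self, addr: int, num_bytes: int) -> bytes:
        """Data path: stand-in for the direct memory read that NIXL would perform."""
        return self.memory[addr][:num_bytes]


class ToyDecodeWorker:
    def fetch_kv(self, prefill: ToyPrefillWorker) -> bytes:
        # Steps 1-2: ask the bootstrap server where the KV cache is.
        route = RouteInfo(**json.loads(prefill.handle_route()))
        # Step 3: pull the bytes directly, without going through the HTTP endpoint.
        return prefill.read_memory(route.buffer_addr, route.num_bytes)


prefill = ToyPrefillWorker(kv_cache=b"\x01\x02\x03\x04")
kv = ToyDecodeWorker().fetch_kv(prefill)
print(len(kv), "bytes transferred; route payload:", prefill.handle_route())
```

The point of the sketch is that the JSON payload stays a few dozen bytes no matter how large the KV cache is; only the second call moves the cache itself.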

---

## Step 1: Verify Two GPUs Available

In [12]:
import subprocess

print("=" * 60)
print("GPU CHECK FOR DISAGGREGATED SERVING")
print("=" * 60)

result = subprocess.run(
    ['nvidia-smi', '--query-gpu=index,name,memory.total,memory.free', '--format=csv,noheader'],
    capture_output=True, text=True, timeout=5
)

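# Drop empty lines so a failed nvidia-smi call is reported as 0 GPUs instead of crashing below
gpus = [line for line in result.stdout.strip().split('\n') if line]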
print(f"\nFound {len(gpus)} GPU(s):\n")

for gpu in gpus:
    parts = gpu.split(', ')
    idx, name, total, free = parts
    print(f"  GPU {idx}: {name}")
    print(f"          Memory: {free} free / {total} total")
    print()

if len(gpus) >= 2:
    print("✓ Two GPUs available - ready for disaggregated serving!")
    print("  • GPU 0 → Prefill Worker")
    print("  • GPU 1 → Decode Worker")
else:
    print("✗ Need at least 2 GPUs for disaggregated serving")
    print("  This notebook requires 2 GPUs on the same node.")

GPU CHECK FOR DISAGGREGATED SERVING

Found 2 GPU(s):

  GPU 0: NVIDIA GeForce RTX 5090
          Memory: 32109 MiB free / 32607 MiB total

  GPU 1: NVIDIA GeForce RTX 5090
          Memory: 32109 MiB free / 32607 MiB total

✓ Two GPUs available - ready for disaggregated serving!
  • GPU 0 → Prefill Worker
  • GPU 1 → Decode Worker


---

## Step 2: Verify Infrastructure (etcd)

Make sure the etcd instance from Module 01 is still running.

In [13]:
import urllib.request
import json

print("=" * 60)
print("INFRASTRUCTURE CHECK")
print("=" * 60)

try:
    with urllib.request.urlopen("http://localhost:2379/health", timeout=5) as resp:
        print("✓ etcd: OK")
except Exception as e:
    print(f"✗ etcd: Not responding - {e}")
    print("\n⚠️  Start infrastructure first:")
    print("    Run Module 01 Step 3, or:")
    print("    docker start dynamo-etcd")

INFRASTRUCTURE CHECK
✓ etcd: OK


---

## Step 3: Stop Any Existing Dynamo Processes

Clean up from previous sessions.

In [14]:
%%bash
echo "=== Finding Dynamo processes ==="
pgrep -af 'python -m dynamo' || echo "No Dynamo processes found"

echo -e "\n=== Finding processes on port 8000 ==="
fuser -v 8000/tcp 2>&1 || echo "No processes on port 8000"

echo -e "\n=== Sending SIGTERM to Dynamo processes ==="
pkill -f 'python -m dynamo' && echo "Sent SIGTERM to Dynamo processes" || echo "No Dynamo processes to kill"
sleep 2

echo -e "\n=== Sending SIGKILL to any remaining Dynamo processes ==="
pkill -9 -f 'python -m dynamo' && echo "Sent SIGKILL to remaining processes" || echo "No remaining processes"

echo -e "\n=== Force killing processes on port 8000 ==="
fuser -k -9 8000/tcp 2>&1 && echo "Killed processes on port 8000" || echo "No processes on port 8000 to kill"

sleep 3

echo -e "\n=== Verifying cleanup ==="
pgrep -af 'python -m dynamo' && echo "WARNING: Some processes still running!" || echo "✓ All Dynamo processes stopped"
fuser -v 8000/tcp 2>&1 && echo "WARNING: Port 8000 still in use!" || echo "✓ Port 8000 is free"
echo -e "\n✓ Cleanup complete"

=== Finding Dynamo processes ===
No Dynamo processes found

=== Finding processes on port 8000 ===
No processes on port 8000

=== Sending SIGTERM to Dynamo processes ===
No Dynamo processes to kill

=== Sending SIGKILL to any remaining Dynamo processes ===
No remaining processes

=== Force killing processes on port 8000 ===
No processes on port 8000 to kill

=== Verifying cleanup ===
✓ All Dynamo processes stopped
✓ Port 8000 is free

✓ Cleanup complete


---

## Step 4: Install KV Cache Transfer Backend

Disaggregated serving requires a backend to transfer KV cache between GPUs.

| Backend | Description |
|---------|-------------|
| **NIXL** | NVIDIA's native transfer library for Dynamo (recommended) |
| **Mooncake** | Third-party option |

Both work on single-node setups. We'll use **NIXL** since it's designed for Dynamo.

In [15]:
!uv pip install nixl
print("\n✓ NIXL (NVIDIA Inference Xfer Library) installed")

Using Python 3.12.3 environment at: /root/src/github.com/sara4dev/ai-dynamo-the-hard-way/.venv
Audited 1 package in 22ms

✓ NIXL (NVIDIA Inference Xfer Library) installed


---

## Step 5: Launch Disaggregated Workers

We'll start three processes:
1. **Frontend** - HTTP API endpoint
2. **Decode Worker** (GPU 1) - Generates tokens using KV cache
3. **Prefill Worker** (GPU 0) - Processes prompts, generates KV cache

Key flags:
- `--disaggregation-mode prefill` or `decode` - which role this worker plays
- `--disaggregation-transfer-backend nixl` - use NIXL for KV cache transfer
- `--host 0.0.0.0` - **required** so the bootstrap server binds to all interfaces (not just localhost)
- `CUDA_VISIBLE_DEVICES` - assign specific GPU to each worker

> ⚠️ **Important**: Without `--host 0.0.0.0`, the bootstrap server binds to `127.0.0.1` only, causing "Connection refused" errors when workers try to communicate via the external IP.
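
The flags above can also be assembled programmatically, which makes the difference between the two roles explicit: only the mode and the GPU index change. This is a minimal sketch; `worker_command` is a hypothetical helper, and it uses only the `dynamo.sglang` flags and environment variable already listed in this step, plus `--model-path` for the model name.

```python
import os
import shlex


def worker_command(mode: str, gpu: int, model: str = "Qwen/Qwen3-0.6B"):
    """Return (env, argv) for one worker; mode is 'prefill' or 'decode'."""
    if mode not in ("prefill", "decode"):
        raise ValueError(f"unknown disaggregation mode: {mode!r}")
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)}  # pin the worker to one GPU
    argv = [
        "python", "-m", "dynamo.sglang",
        "--model-path", model,
        "--host", "0.0.0.0",  # bootstrap server binds to all interfaces
        "--disaggregation-mode", mode,
        "--disaggregation-transfer-backend", "nixl",
    ]
    return env, argv


# Print the equivalent shell command for each role without launching anything.
for mode, gpu in (("decode", 1), ("prefill", 0)):
    env, argv = worker_command(mode, gpu)
    print(f"CUDA_VISIBLE_DEVICES={env['CUDA_VISIBLE_DEVICES']} {shlex.join(argv)}")
```

The returned pair can be passed to `subprocess.Popen(argv, env=env)` if you prefer launching the workers from Python rather than from a `%%bash` cell.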

In [16]:
%%bash
MODEL="Qwen/Qwen3-0.6B"

echo "============================================================"
echo "LAUNCHING DISAGGREGATED SERVING"
echo "============================================================"

# Start Frontend
echo ""
echo "[1/3] Starting Frontend..."
python -m dynamo.frontend > /tmp/dynamo_frontend.log 2>&1 &
echo "      PID: $!"

# Start Decode Worker on GPU 1 (decode workers start first in Dynamo)
# NOTE: --host 0.0.0.0 is required so the bootstrap server binds to all interfaces
echo ""
echo "[2/3] Starting Decode Worker (GPU 1)..."
CUDA_VISIBLE_DEVICES=1 python -m dynamo.sglang \
    --model-path $MODEL \
    --host 0.0.0.0 \
    --attention-backend flashinfer \
    --disaggregation-mode decode \
    --disaggregation-transfer-backend nixl \
    > /tmp/dynamo_decode.log 2>&1 &
echo "      PID: $!"

# Start Prefill Worker on GPU 0
# NOTE: --host 0.0.0.0 is required so the bootstrap server binds to all interfaces
echo ""
echo "[3/3] Starting Prefill Worker (GPU 0)..."
CUDA_VISIBLE_DEVICES=0 python -m dynamo.sglang \
    --model-path $MODEL \
    --host 0.0.0.0 \
    --attention-backend flashinfer \
    --disaggregation-mode prefill \
    --disaggregation-transfer-backend nixl \
    > /tmp/dynamo_prefill.log 2>&1 &
echo "      PID: $!"

echo ""
echo "✓ All processes started"
echo "  Logs: /tmp/dynamo_frontend.log, /tmp/dynamo_prefill.log, /tmp/dynamo_decode.log"
echo ""
echo "⏳ Wait ~60s for models to load on both GPUs..."

LAUNCHING DISAGGREGATED SERVING

[1/3] Starting Frontend...
      PID: 2334384

[2/3] Starting Decode Worker (GPU 1)...
      PID: 2334385

[3/3] Starting Prefill Worker (GPU 0)...
      PID: 2334386

✓ All processes started
  Logs: /tmp/dynamo_frontend.log, /tmp/dynamo_prefill.log, /tmp/dynamo_decode.log

⏳ Wait ~60s for models to load on both GPUs...


---

## Step 6: Wait for Workers to Register

In [17]:
import urllib.request
import json
import time

print("Waiting for disaggregated workers to start...")
print("(This may take ~60-90 seconds for both models to load)\n")

MAX_WAIT = 180
INTERVAL = 10
elapsed = 0

while elapsed < MAX_WAIT:
    try:
        with urllib.request.urlopen('http://localhost:8000/health', timeout=5) as response:
            health = json.loads(response.read())
            instances = health.get('instances', [])
            
            # Count worker types
            prefill_workers = [i for i in instances if 'prefill' in str(i).lower()]
            decode_workers = [i for i in instances if 'decode' in str(i).lower() or 'backend' in str(i.get('component', '')).lower()]
            
            print(f"[{elapsed}s] Frontend: ✓ | Workers registered: {len(instances)}")
            
            # Disaggregated serving needs at least one worker of each role
            if prefill_workers and decode_workers:
                print(f"\n✓ Disaggregated setup ready!")
                print(f"\nRegistered instances:")
                for inst in instances:
                    print(f"  • {inst.get('component', 'unknown')}/{inst.get('endpoint', 'unknown')}")
                    print(f"    Transport: {inst.get('transport', {})}")
                break
                
    except Exception as e:
        print(f"[{elapsed}s] Waiting for frontend...")
    
    time.sleep(INTERVAL)
    elapsed += INTERVAL

if elapsed >= MAX_WAIT:
    print(f"\n⚠️  Timeout after {MAX_WAIT}s")
    print("\nCheck logs for errors:")
    print("  tail -50 /tmp/dynamo_prefill.log")
    print("  tail -50 /tmp/dynamo_decode.log")

Waiting for disaggregated workers to start...
(This may take ~60-90 seconds for both models to load)

[0s] Waiting for frontend...
[10s] Frontend: ✓ | Workers registered: 0
[20s] Frontend: ✓ | Workers registered: 0
[30s] Frontend: ✓ | Workers registered: 0
[40s] Frontend: ✓ | Workers registered: 0
[50s] Frontend: ✓ | Workers registered: 0
[60s] Frontend: ✓ | Workers registered: 0
[70s] Frontend: ✓ | Workers registered: 0
[80s] Frontend: ✓ | Workers registered: 0
[90s] Frontend: ✓ | Workers registered: 2

✓ Disaggregated setup ready!

Registered instances:
  • backend/generate
    Transport: {'tcp': '192.168.1.180:36463/18f9c1b0f95b2ca/generate'}
  • prefill/generate
    Transport: {'tcp': '192.168.1.180:36349/18f9c1b0f95b2c4/generate'}
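
Each registered instance advertises a TCP transport string. The helper below splits it into its parts so you can check which address a worker is reachable on. The `host:port/instance_id/endpoint` layout is inferred from the output above rather than from a documented format, and `parse_tcp_transport` is a hypothetical name.

```python
from typing import NamedTuple


class TcpTransport(NamedTuple):
    host: str
    port: int
    instance_id: str
    endpoint: str


def parse_tcp_transport(address: str) -> TcpTransport:
    """Split 'host:port/instance_id/endpoint' (layout inferred from the /health output)."""
    hostport, instance_id, endpoint = address.split("/", 2)
    host, port = hostport.rsplit(":", 1)
    return TcpTransport(host, int(port), instance_id, endpoint)


t = parse_tcp_transport("192.168.1.180:36463/18f9c1b0f95b2ca/generate")
print(t)
print("advertised on loopback:", t.host.startswith("127."))
```

Here both workers advertise the machine's external IP rather than a loopback address, which is consistent with the need for `--host 0.0.0.0` in Step 5.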


---

## Step 7: Check etcd Registrations

Let's see how both workers registered themselves in etcd.

In [18]:
import urllib.request
import json
import base64

print("=" * 60)
print("ETCD WORKER REGISTRATIONS")
print("=" * 60)

try:
    req = urllib.request.Request(
        "http://localhost:2379/v3/kv/range",
        data=json.dumps({
            "key": base64.b64encode(b"v1/").decode(),
            "range_end": base64.b64encode(b"v10").decode()
        }).encode(),
        headers={'Content-Type': 'application/json'}
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        data = json.loads(resp.read())
        
        if 'kvs' in data and data['kvs']:
            for kv in data['kvs']:
                key = base64.b64decode(kv['key']).decode()
                value = json.loads(base64.b64decode(kv['value']).decode())
                
                # Highlight prefill vs decode
                if 'prefill' in key.lower():
                    print(f"\n🔵 PREFILL: {key}")
                elif 'decode' in key.lower() or 'backend' in key.lower():
                    print(f"\n🟢 DECODE: {key}")
                else:
                    print(f"\n⚪ {key}")
                    
                print(f"   Type: {value.get('type', 'unknown')}")
                if 'transport' in value:
                    print(f"   Transport: {value['transport']}")
        else:
            print("No workers registered yet")
            
except Exception as e:
    print(f"Error querying etcd: {e}")

ETCD WORKER REGISTRATIONS

🟢 DECODE: v1/instances/dynamo/backend/generate/18f9c1b0f95b2ca
   Type: Endpoint
   Transport: {'tcp': '192.168.1.180:36463/18f9c1b0f95b2ca/generate'}

🔵 PREFILL: v1/instances/dynamo/prefill/generate/18f9c1b0f95b2c4
   Type: Endpoint
   Transport: {'tcp': '192.168.1.180:36349/18f9c1b0f95b2c4/generate'}

🟢 DECODE: v1/mdc/dynamo/backend/generate/18f9c1b0f95b2ca
   Type: Model

🔵 PREFILL: v1/mdc/dynamo/prefill/generate/18f9c1b0f95b2c4
   Type: Model
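
The query in this step uses `v1/` as the key and `v10` as the range end. That pair is etcd's usual way of expressing "every key with this prefix": the range end is the prefix with its last byte incremented (`/` is 0x2F, `0` is 0x30). The sketch below generalizes that computation so you can narrow the scan, for example to `v1/instances/`. The helper names are illustrative.

```python
import base64


def prefix_range_end(prefix: bytes) -> bytes:
    """Smallest key greater than every key that starts with `prefix`."""
    end = bytearray(prefix)
    for i in reversed(range(len(end))):
        if end[i] < 0xFF:
            end[i] += 1
            return bytes(end[: i + 1])
    # No upper bound exists; etcd reads a range_end of b"\x00" as "all keys >= key".
    return b"\x00"


def range_request(prefix: bytes) -> dict:
    """JSON body for POST /v3/kv/range; the JSON gateway expects base64-encoded keys."""
    return {
        "key": base64.b64encode(prefix).decode(),
        "range_end": base64.b64encode(prefix_range_end(prefix)).decode(),
    }


print(prefix_range_end(b"v1/"))
print(range_request(b"v1/instances/"))
```

Passing `range_request(b"v1/instances/")` as the request body would return only the endpoint registrations and skip the `v1/mdc/` model entries.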


---

## Step 8: Test Disaggregated Inference

Let's send a request and verify the disaggregated setup is working.

In [19]:
import urllib.request
import json

print("=" * 60)
print("TESTING DISAGGREGATED INFERENCE")
print("=" * 60)

payload = {
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Explain quantum computing in 2 sentences."}],
    "max_tokens": 100
}

try:
    req = urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={'Content-Type': 'application/json'}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        result = json.loads(resp.read())
        content = result['choices'][0]['message']['content']
        print(f"\n✓ Inference complete!")
        print(f"\nResponse:\n{content}")
except Exception as e:
    print(f"\n✗ Inference failed: {e}")

TESTING DISAGGREGATED INFERENCE

✓ Inference complete!

Response:
<think>
Okay, the user wants me to explain quantum computing in two sentences. Let me start with the basics. I know quantum computing uses qubits instead of classical bits. Qubits are fundamental because they can be in a superposition state, right? That's a key point. So first sentence: "Quantum computing uses qubits, which are quantum mechanical particles, to enable computing that can process multiple possibilities at once."

Now the second sentence. I need to mention the interference and entang


In [20]:
# Check logs to verify disaggregation is working
!echo "=== Prefill Worker Log (last 5 lines) ===" && tail -5 /tmp/dynamo_prefill.log | grep -E "(Prefill|INFO)" || echo "No prefill activity"
!echo -e "\n=== Decode Worker Log (last 5 lines) ===" && tail -5 /tmp/dynamo_decode.log | grep -E "(Decode|INFO)" || echo "No decode activity"

=== Prefill Worker Log (last 5 lines) ===
2026-02-01T23:25:14.782379Z  INFO dynamo_llm::hub: ModelExpress download completed successfully for model: Qwen/Qwen3-0.6B
2026-02-01T23:25:14.807789Z  INFO _core: Registered base model 'Qwen/Qwen3-0.6B' MDC
2026-02-01T23:25:14.808546Z  INFO register._register_llm_with_runtime_config: Successfully registered LLM with runtime config
2026-02-01T23:25:14.808844Z  INFO register.register_llm_with_readiness_gate: Model registration succeeded; processing queued requests
2026-02-01T23:25:23.957274Z  INFO scheduler_metrics_mixin.log_prefill_stats: Prefill batch, #new-seq: 1, #new-token: 17, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #prealloc-req: 0, #inflight-req: 0, input throughput (token/s): 0.00, 

=== Decode Worker Log (last 5 lines) ===
2026-02-01T23:25:2
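
The `Prefill batch` line in the prefill log is the clearest sign that the prompt was processed on GPU 0: `#new-token: 17` matches the tokenized prompt. If you want to check this programmatically, the sketch below pulls the counters out of such a line. It assumes the comma-separated `name: number` layout seen in the log above, which may differ between versions; `parse_prefill_stats` is a hypothetical helper.

```python
import re

ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")  # color codes emitted by the logger


def parse_prefill_stats(line: str) -> dict:
    """Extract 'name: number' pairs from a 'Prefill batch' log line."""
    line = ANSI_ESCAPE.sub("", line)
    marker = "Prefill batch,"
    if marker not in line:
        return {}
    stats = {}
    for field in line.split(marker, 1)[1].split(","):
        name, sep, value = field.rpartition(":")
        if not sep:
            continue  # empty trailing field after the last comma
        try:
            stats[name.strip()] = float(value)
        except ValueError:
            continue  # skip anything that is not numeric
    return stats


line = (
    "2026-02-01T23:25:23.957274Z  INFO scheduler_metrics_mixin.log_prefill_stats: "
    "Prefill batch, #new-seq: 1, #new-token: 17, #cached-token: 0, token usage: 0.00, "
    "#running-req: 0, #queue-req: 0, #prealloc-req: 0, #inflight-req: 0, "
    "input throughput (token/s): 0.00, "
)
stats = parse_prefill_stats(line)
print("new tokens:", stats["#new-token"], "| cached tokens:", stats["#cached-token"])
```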

---

## Step 9: Verify GPU Utilization

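Let's confirm both GPUs are being used during inference by monitoring utilization while sending requests. With a 0.6B model and a short prompt, prefill finishes within milliseconds, so polling every 0.5 s may miss it and report 0% for GPU 0.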

In [21]:
import subprocess
import time
import urllib.request
import json
import threading

def monitor_gpus(duration=10):
    """Monitor GPU utilization for a duration"""
    print("=" * 60)
    print("GPU UTILIZATION DURING INFERENCE")
    print("=" * 60)
    
    samples = []
    start = time.time()
    
    while time.time() - start < duration:
        result = subprocess.run(
            ['nvidia-smi', '--query-gpu=index,utilization.gpu,memory.used', '--format=csv,noheader,nounits'],
            capture_output=True, text=True
        )
        samples.append(result.stdout.strip())
        time.sleep(0.5)
    
    # Parse and summarize
    gpu0_util = []
    gpu1_util = []
    
    for sample in samples:
        for line in sample.split('\n'):
            parts = line.split(', ')
            if len(parts) >= 2:
                idx, util = int(parts[0]), int(parts[1])
                if idx == 0:
                    gpu0_util.append(util)
                elif idx == 1:
                    gpu1_util.append(util)
    
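    print()
    for label, utils in (("GPU 0 (Prefill):", gpu0_util), ("GPU 1 (Decode):", gpu1_util)):
        if utils:  # avoid ZeroDivisionError when nvidia-smi returned no samples
            print(f"{label:<16} avg {sum(utils)/len(utils):.1f}% | max {max(utils)}%")
        else:
            print(f"{label:<16} no samples collected")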

def send_inference_request():
    time.sleep(1)
    payload = {
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "Write a haiku about computers."}],
        "max_tokens": 50
    }
    try:
        req = urllib.request.Request(
            "http://localhost:8000/v1/chat/completions",
            data=json.dumps(payload).encode(),
            headers={'Content-Type': 'application/json'}
        )
        urllib.request.urlopen(req, timeout=60)
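    except Exception as e:
        # Report failures instead of silently swallowing them
        print(f"Inference request failed: {e}")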

# Start inference in background and monitor GPUs
thread = threading.Thread(target=send_inference_request)
thread.start()
monitor_gpus(duration=15)
thread.join()

GPU UTILIZATION DURING INFERENCE

GPU 0 (Prefill): avg 0.0% | max 0%
GPU 1 (Decode):  avg 0.8% | max 20%


---

## Understanding the Request Flow

```
┌──────────┐    ┌──────────┐    ┌──────────────┐    ┌──────────────┐
│  Client  │───▶│ Frontend │───▶│ Decode Worker│◀──▶│Prefill Worker│
│          │    │  (:8000) │    │   (GPU 1)    │    │   (GPU 0)    │
└──────────┘    └──────────┘    └──────────────┘    └──────────────┘
                     │                 │                   │
                     │                 │    KV Cache       │
                     ▼                 │◀───Transfer──────▶│
                  ┌──────┐             │   (RDMA/NIXL)     │
                  │ etcd │◀────────────┴───────────────────┘
                  └──────┘         Worker Discovery
```

### Detailed Flow

1. **Client → Frontend**: HTTP request arrives
2. **Frontend → Decode Worker**: Routes to decode worker (discovered via etcd)
3. **Decode → Prefill**: Decode forwards prompt to prefill for KV cache generation
4. **Prefill processing**: Processes prompt, generates KV cache on GPU 0
5. **KV Transfer**: KV cache transferred via RDMA directly to GPU 1 memory
6. **Decode generates**: Decode worker generates tokens using transferred KV cache
7. **Response**: Tokens stream back to client

### Key Components

- **etcd**: Service discovery - workers register themselves, frontend finds them
- **Bootstrap Server**: HTTP endpoint on each worker for KV transfer handshake
- **NIXL/RDMA**: Direct GPU-to-GPU memory transfer (fast, bypasses CPU)
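
To get a feel for how much data the KV transfer step moves, you can estimate the cache size from the model shape: one key and one value vector per token, per layer, per KV head. The sketch below is an estimate only. The layer count, KV head count, and head dimension are assumed values for illustration and should be checked against the model's `config.json`; 2 bytes per element assumes FP16/BF16.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache for `tokens` positions (one K and one V per layer per KV head)."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem


# ASSUMED model dimensions for illustration; verify against the model's config.json.
LAYERS, KV_HEADS, HEAD_DIM = 28, 8, 128

# 17 tokens is the prompt length reported in the prefill log; 2048 is a longer prompt.
for tokens in (17, 2048):
    size = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"{tokens:>5} tokens -> {size / 2**20:.2f} MiB")
```

Under these assumptions the 17-token test prompt transfers under 2 MiB, while a 2048-token prompt transfers 224 MiB, so transfer cost grows linearly with prompt length.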

---

## Key Takeaways

1. **Disaggregated serving** separates compute-bound prefill from memory-bound decode
2. **etcd** is used for worker discovery - both workers register independently
3. **Bootstrap server** on each worker handles KV transfer coordination via HTTP
4. **NIXL/RDMA** transfers KV cache directly between GPU memories (fast!)
5. **Both GPUs active** - each specialized for its phase

### Performance Benefits

| Metric | Single Worker | Disaggregated |
|--------|---------------|---------------|
| Prefill throughput | Limited by decode | Dedicated GPU |
| Decode latency | Interrupted by prefill | Uninterrupted |
| GPU utilization | Mixed workload | Optimized per-phase |

---

## Cleanup

In [22]:
%%bash
echo "Stopping Dynamo processes..."
pkill -f 'python -m dynamo' 2>/dev/null || true
echo "✓ Cleanup complete"

# Show GPU memory freed
sleep 2
echo -e "\nGPU memory after cleanup:"
nvidia-smi --query-gpu=index,memory.used,memory.free --format=csv

Stopping Dynamo processes...
✓ Cleanup complete

GPU memory after cleanup:
index, memory.used [MiB], memory.free [MiB]
0, 2 MiB, 32109 MiB
1, 2 MiB, 32109 MiB
