# Module 04: Exploring NATS in AI Dynamo

> **Goal**: Understand when and how Dynamo uses NATS, and observe KV-aware routing in action.

---

## Overview

Dynamo uses NATS for **KV cache event distribution** and **router coordination**, but by default **not for the inference hot path**.

### When Does Dynamo Use NATS?

| Use Case | Enabled By | Purpose |
|----------|-----------|--------|
| **KV Cache Events** | `--router-mode kv` | Notify router about cached/evicted blocks |
| **Router Replica Sync** | `--router-replica-sync` | Sync routing decisions between routers |
| **Request Transport** | `--request-plane nats` | Route requests via NATS (optional, slower) |

### When NATS is NOT Used

| Use Case | What's Used Instead |
|----------|--------------------|
| Inference requests (default request plane) | TCP (direct connection) |
| Service discovery | etcd (or Kubernetes) |
| Disaggregated KV transfer | NIXL/RDMA (GPU-to-GPU) |

### Monitoring NATS

Open a **separate terminal** and run:

```bash
docker run --rm -it --network host natsio/nats-box nats sub ">" --server localhost:4222
```

This subscribes to ALL NATS messages (`>` is the NATS wildcard that matches every subject). Keep it running while you work through this notebook.
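
For reference, NATS subjects are dot-separated tokens, and a subscription may use two wildcards: `*` matches exactly one token and `>` matches one or more trailing tokens. The sketch below reimplements that matching rule in plain Python so you can predict what a narrower filter than `">"` would show. It is an illustration only, not the NATS server's code, and the subject names in it are hypothetical.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a NATS subject matches a subscription pattern.

    '*' matches exactly one token; '>' matches one or more trailing tokens.
    """
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, tok in enumerate(p_tokens):
        if tok == ">":
            # '>' must be the last token and needs at least one token to consume
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if tok != "*" and tok != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


# Hypothetical subject names, for illustration only
print(subject_matches(">", "namespace.dynamo.kv_events"))            # True
print(subject_matches("namespace.*", "namespace.dynamo.kv_events"))  # False
print(subject_matches("namespace.>", "namespace.dynamo.kv_events"))  # True
```

Once you know which subjects carry the events you care about, replacing `">"` with a narrower pattern keeps the nats-box terminal readable.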

---

## Step 1: Verify GPUs and Infrastructure

We need 2 GPUs for two workers, plus etcd and NATS running.

In [24]:
import urllib.request
import json
import socket
import subprocess
import time

print("=" * 60)
print("GPU CHECK")
print("=" * 60)

try:
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=index,name,memory.free', '--format=csv,noheader'],
        capture_output=True, text=True, timeout=5
    )
    # Ignore blank lines so an empty result counts as zero GPUs, not one
    gpus = [line for line in result.stdout.strip().split('\n') if line]
except (FileNotFoundError, subprocess.TimeoutExpired):
    gpus = []  # nvidia-smi is missing or did not respond in time

print(f"\nFound {len(gpus)} GPU(s):")
for gpu in gpus:
    print(f"  {gpu}")

if len(gpus) >= 2:
    print("\n✓ Two GPUs available!")
else:
    print("\n⚠️  Need 2 GPUs for this demo")

print("\n" + "=" * 60)
print("INFRASTRUCTURE CHECK")
print("=" * 60)

def check_port(port):
    """Return True if something accepts TCP connections on localhost:port."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(2)
            return sock.connect_ex(('localhost', port)) == 0
    except OSError:
        return False

try:
    with urllib.request.urlopen("http://localhost:2379/health", timeout=5) as resp:
        print("✓ etcd: Running")
except Exception:
    print("✗ etcd: Not running (docker start dynamo-etcd)")

for label, port in [("NATS", 4222), ("NATS Monitor", 8222)]:
    up = check_port(port)  # probe each port once
    print(f"{'✓' if up else '✗'} {label}: {'Running' if up else 'Not running'}")

GPU CHECK

Found 2 GPU(s):
  0, NVIDIA GeForce RTX 5090, 2456 MiB
  1, NVIDIA GeForce RTX 5090, 2324 MiB

✓ Two GPUs available!

INFRASTRUCTURE CHECK
✓ etcd: Running
✓ NATS: Running
✓ NATS Monitor: Running


---

## Step 2: Stop Any Existing Dynamo Processes

In [46]:
%%bash
echo "=== Finding Dynamo processes ==="
pgrep -af 'python -m dynamo' || echo "No Dynamo processes found"

echo -e "\n=== Finding processes on port 8000 ==="
fuser -v 8000/tcp 2>&1 || echo "No processes on port 8000"

echo -e "\n=== Sending SIGTERM to Dynamo processes ==="
pkill -f 'python -m dynamo' && echo "Sent SIGTERM to Dynamo processes" || echo "No Dynamo processes to kill"
sleep 2

echo -e "\n=== Sending SIGKILL to any remaining Dynamo processes ==="
pkill -9 -f 'python -m dynamo' && echo "Sent SIGKILL to remaining processes" || echo "No remaining processes"

echo -e "\n=== Force killing processes on port 8000 ==="
fuser -k -9 8000/tcp 2>&1 && echo "Killed processes on port 8000" || echo "No processes on port 8000 to kill"

sleep 3

echo -e "\n=== Verifying cleanup ==="
pgrep -af 'python -m dynamo' && echo "WARNING: Some processes still running!" || echo "✓ All Dynamo processes stopped"
fuser -v 8000/tcp 2>&1 && echo "WARNING: Port 8000 still in use!" || echo "✓ Port 8000 is free"
echo -e "\n✓ Cleanup complete"

=== Finding Dynamo processes ===


3186630 python -m dynamo.frontend --router-mode kv
3186631 python -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --host 0.0.0.0 --attention-backend flashinfer --page-size 16
3186632 python -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --host 0.0.0.0 --attention-backend flashinfer --page-size 16

=== Finding processes on port 8000 ===
                     USER        PID ACCESS COMMAND
8000/tcp:            root      3186630 F.... python

=== Sending SIGTERM to Dynamo processes ===
Sent SIGTERM to Dynamo processes

=== Sending SIGKILL to any remaining Dynamo processes ===
Sent SIGKILL to remaining processes

=== Force killing processes on port 8000 ===
No processes on port 8000 to kill

=== Verifying cleanup ===
✓ All Dynamo processes stopped
✓ Port 8000 is free

✓ Cleanup complete


---

## Step 3: Launch Two Workers with KV-Aware Routing

For **KV-aware routing** to work, we need **multiple workers**. The router picks a worker for each request based on which one already has the request's prefix cached.

**Key flags:**
- Frontend: `--router-mode kv` - Enable KV-aware routing
- Workers: `--page-size 16` - Set the KV cache block size (the router discovers this from each worker's model deployment card, MDC)
- **Workers: `--kv-events-config` - CRITICAL! Enables KV cache event publishing**
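
Because each worker binds its own ZMQ endpoint (`tcp://*:<port>`), two workers on the same host need different ports. The helper below is a minimal sketch of generating the per-worker JSON used in the launch commands of this notebook. The function name `kv_events_config` and the "base port plus index" scheme are assumptions made for illustration; only the three JSON keys and the ports 5557 and 5558 come from this notebook.

```python
import json

BASE_PORT = 5557  # first ZMQ port used in this notebook


def kv_events_config(worker_index: int, base_port: int = BASE_PORT) -> str:
    """Build the JSON string passed to --kv-events-config for one worker."""
    return json.dumps(
        {
            "publisher": "zmq",
            "topic": "kv-events",
            "endpoint": f"tcp://*:{base_port + worker_index}",
        },
        separators=(",", ":"),  # compact form, as in the launch commands
    )


for i in range(2):
    print(f"worker {i + 1}: {kv_events_config(i)}")
```

Generating the string rather than typing it by hand avoids the easy mistake of giving both workers the same port.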

**How KV Events Flow:**
```
Worker 1 (GPU 0)                    Worker 2 (GPU 1)
     │                                   │
     │ ZMQ publish                       │ ZMQ publish
     │ tcp://*:5557                      │ tcp://*:5558
     ▼                                   ▼
┌─────────────────────────────────────────────────┐
│             NATS JetStream                      │
│         (kv-events topic)                       │
└───────────────────────┬─────────────────────────┘
                        │ subscribe
                        ▼
               ┌─────────────────┐
               │    KV Router    │
               │ (radix tree of  │
               │  cached blocks) │
               └─────────────────┘
```

Without `--kv-events-config`, workers don't advertise their cached blocks → router sees 0 cached blocks!
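
To make the consequence concrete, here is a toy sketch of a router-side index that is built purely from events. The event dictionaries, the `worker_id` and `block_hashes` field names, and the `BlockRemoved` event type are hypothetical stand-ins (this notebook only shows `BlockStored`), and Dynamo's actual router keeps a radix tree rather than flat sets.

```python
from collections import defaultdict


class CachedBlockIndex:
    """Toy per-worker view of cached KV blocks, updated only from events."""

    def __init__(self):
        self.blocks = defaultdict(set)  # worker_id -> set of block hashes

    def apply(self, event: dict) -> None:
        worker = event["worker_id"]
        if event["type"] == "BlockStored":
            self.blocks[worker].update(event["block_hashes"])
        elif event["type"] == "BlockRemoved":  # hypothetical eviction event
            self.blocks[worker].difference_update(event["block_hashes"])

    def cached_count(self, worker_id: str) -> int:
        return len(self.blocks.get(worker_id, ()))


index = CachedBlockIndex()
index.apply({"type": "BlockStored", "worker_id": "w1", "block_hashes": [11, 12, 13]})
index.apply({"type": "BlockRemoved", "worker_id": "w1", "block_hashes": [13]})
print(index.cached_count("w1"))  # 2
print(index.cached_count("w2"))  # 0: no events published, so nothing is known
```

A worker that never publishes events looks exactly like `w2` here: it may hold plenty of cached blocks, but the index has no way to know.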

```
                    Request: "Hello, how are you?"
                              │
                              ▼
                    ┌───────────────────┐
                    │     KV Router     │
                    │                   │
                    │ Which worker has  │
                    │ this prefix?      │
                    └─────────┬─────────┘
                              │
              ┌───────────────┴───────────────┐
              ▼                               ▼
    ┌─────────────────┐             ┌─────────────────┐
    │   Worker 1      │             │   Worker 2      │
    │   (GPU 0)       │             │   (GPU 1)       │
    │ Has: "Hello"    │             │ Has: "What is"  │
    └────────┬────────┘             └─────────────────┘
             │
             ▼
    Route to Worker 1 (cache hit!)
```
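
The routing decision in the diagram can be sketched as "count how many leading blocks of the request each worker already holds, then pick the best match". The code below is a simplified illustration under stated assumptions: it uses fake integer token IDs, Python's built-in `hash` for chained block hashes, and overlap as the only criterion. All function names are hypothetical, and a production router would typically also weigh worker load.

```python
PAGE_SIZE = 16  # tokens per KV block, matching --page-size 16


def block_hashes(tokens, page_size=PAGE_SIZE):
    """Hash each full block together with everything before it (prefix hash)."""
    hashes, prev = [], 0
    for start in range(0, len(tokens) - page_size + 1, page_size):
        prev = hash((prev, tuple(tokens[start:start + page_size])))
        hashes.append(prev)
    return hashes


def matched_blocks(request_hashes, cached):
    """Count leading blocks of the request that the worker already holds."""
    count = 0
    for h in request_hashes:
        if h not in cached:
            break
        count += 1
    return count


def pick_worker(tokens, worker_caches):
    """Return the worker with the longest cached prefix (ties: first listed)."""
    req = block_hashes(tokens)
    return max(worker_caches, key=lambda w: matched_blocks(req, worker_caches[w]))


prompt_a = list(range(40))        # 40 fake token ids -> 2 full blocks
prompt_b = list(range(100, 140))
caches = {
    "worker-1": set(block_hashes(prompt_a)),
    "worker-2": set(block_hashes(prompt_b)),
}
print(pick_worker(prompt_a + [999], caches))  # worker-1
print(pick_worker(prompt_b, caches))          # worker-2
```

Chaining each block's hash with the previous one means two requests only share a block hash when everything before that block is identical, which is what makes this a prefix match rather than a substring match.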

In [47]:
%%bash
MODEL="Qwen/Qwen3-0.6B"

echo "============================================================"
echo "LAUNCHING TWO WORKERS WITH KV-AWARE ROUTING"
echo "============================================================"

echo ""
echo "[1/3] Starting Frontend (--router-mode kv)..."
python -m dynamo.frontend --router-mode kv > /tmp/dynamo_frontend.log 2>&1 &
echo "      PID: $!"

echo ""
echo "[2/3] Starting Worker 1 (GPU 0) with KV events on port 5557..."
CUDA_VISIBLE_DEVICES=0 python -m dynamo.sglang \
    --model-path $MODEL \
    --host 0.0.0.0 \
    --attention-backend flashinfer \
    --page-size 16 \
    --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:5557"}' \
    > /tmp/dynamo_worker1.log 2>&1 &
echo "      PID: $!"

echo ""
echo "[3/3] Starting Worker 2 (GPU 1) with KV events on port 5558..."
CUDA_VISIBLE_DEVICES=1 python -m dynamo.sglang \
    --model-path $MODEL \
    --host 0.0.0.0 \
    --attention-backend flashinfer \
    --page-size 16 \
    --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:5558"}' \
    > /tmp/dynamo_worker2.log 2>&1 &
echo "      PID: $!"

echo ""
echo "✓ All processes started"
echo "  Logs: /tmp/dynamo_frontend.log, /tmp/dynamo_worker1.log, /tmp/dynamo_worker2.log"
echo ""
echo "⏳ Wait ~60-90s for models to load..."

LAUNCHING TWO WORKERS WITH KV-AWARE ROUTING

[1/3] Starting Frontend (--router-mode kv)...
      PID: 3208584

[2/3] Starting Worker 1 (GPU 0) with KV events on port 5557...
      PID: 3208585

[3/3] Starting Worker 2 (GPU 1) with KV events on port 5558...
      PID: 3208586

✓ All processes started
  Logs: /tmp/dynamo_frontend.log, /tmp/dynamo_worker1.log, /tmp/dynamo_worker2.log

⏳ Wait ~60-90s for models to load...


---

## Step 4: Wait for Workers to Register

In [48]:
print("Waiting for workers to start...")
print("(This may take ~60-90 seconds)\n")

MAX_WAIT = 180
INTERVAL = 10
elapsed = 0

while elapsed < MAX_WAIT:
    try:
        with urllib.request.urlopen('http://localhost:8000/health', timeout=5) as response:
            health = json.loads(response.read())
            # 'instances' lists every registered endpoint (e.g. kv-router/generate),
            # so this count can be higher than the number of workers
            instances = health.get('instances', [])

            print(f"[{elapsed}s] Frontend: ✓ | Workers: {len(instances)}")

            if len(instances) >= 2:
                print(f"\n✓ Two workers ready!")
                for inst in instances:
                    print(f"  • {inst.get('component', '?')}/{inst.get('endpoint', '?')}")
                break
    except Exception:
        # Frontend not reachable yet (or returned something unexpected)
        print(f"[{elapsed}s] Waiting...")

    time.sleep(INTERVAL)
    elapsed += INTERVAL

if elapsed >= MAX_WAIT:
    print(f"\n⚠️  Timeout. Check logs: tail /tmp/dynamo_worker1.log")

Waiting for workers to start...
(This may take ~60-90 seconds)

[0s] Frontend: ✓ | Workers: 0
[10s] Frontend: ✓ | Workers: 0
[20s] Frontend: ✓ | Workers: 0
[30s] Frontend: ✓ | Workers: 0
[40s] Frontend: ✓ | Workers: 0
[50s] Frontend: ✓ | Workers: 0
[60s] Frontend: ✓ | Workers: 0
[70s] Frontend: ✓ | Workers: 0
[80s] Frontend: ✓ | Workers: 0
[90s] Frontend: ✓ | Workers: 3

✓ Two workers ready!
  • backend/generate
  • backend/generate
  • kv-router/generate
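
Note that the count of 3 includes the `kv-router/generate` endpoint, not just the two workers. A small filter, sketched below, separates them. It assumes that worker instances report `component == "backend"`, as in the output of this run; other deployments may name components differently, and `split_instances` is a hypothetical helper.

```python
def split_instances(instances):
    """Separate worker endpoints from other registered components."""
    workers = [i for i in instances if i.get("component") == "backend"]
    others = [i for i in instances if i.get("component") != "backend"]
    return workers, others


# Shape mirrors the three instances printed in this run
sample = [
    {"component": "backend", "endpoint": "generate"},
    {"component": "backend", "endpoint": "generate"},
    {"component": "kv-router", "endpoint": "generate"},
]
workers, others = split_instances(sample)
print(f"Workers: {len(workers)}, other components: {len(others)}")
```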


---

## Step 5: Check NATS Server Status

Let's see what's happening on NATS before we send requests.

In [29]:
print("=" * 60)
print("NATS SERVER STATUS")
print("=" * 60)

try:
    with urllib.request.urlopen("http://localhost:8222/varz", timeout=5) as resp:
        varz = json.loads(resp.read())
        print(f"\nServer: {varz.get('server_name', '?')}")
        print(f"Uptime: {varz.get('uptime', '?')}")
        print(f"Connections: {varz.get('connections', 0)}")
        print(f"Messages In: {varz.get('in_msgs', 0):,}")
        print(f"Messages Out: {varz.get('out_msgs', 0):,}")
except Exception as e:
    print(f"Error: {e}")

print("\n" + "=" * 60)
print("JETSTREAM STREAMS")
print("=" * 60)

try:
    with urllib.request.urlopen("http://localhost:8222/jsz?streams=true", timeout=5) as resp:
        jsz = json.loads(resp.read())
        print(f"\nStreams: {jsz.get('streams', 0)}")
        print(f"Consumers: {jsz.get('consumers', 0)}")
        
        for account in jsz.get('account_details', []):
            for stream in account.get('stream_detail', []):
                name = stream.get('name', '?')
                msgs = stream.get('state', {}).get('messages', 0)
                print(f"  • {name}: {msgs} messages")
except Exception as e:
    print(f"Error: {e}")

NATS SERVER STATUS

Server: NBOM62WZSE5VHSNVZDZHEPUP4355H2LINXONTVB4DDOFY7AMRSMNJYQW
Uptime: 8h50m21s
Connections: 2
Messages In: 3
Messages Out: 21

JETSTREAM STREAMS

Streams: 0
Consumers: 0


---

## Step 6: Demonstrate KV-Aware Routing

Now let's send requests and observe routing behavior.

**Watch your nats-box terminal** for `BlockStored` events!

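**Troubleshooting: KV events not appearing.** If you don't see `BlockStored` events in NATS, confirm that both workers were launched with `--kv-events-config` (Step 3) and check the worker logs in `/tmp/dynamo_worker1.log` and `/tmp/dynamo_worker2.log`.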
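
A minimal diagnostic sketch, assuming the setup used in this notebook: workers bind ZMQ on ports 5557 and 5558, and the NATS monitoring endpoint listens on port 8222. The function names and hint messages are illustrative, and a passing check does not prove that events are flowing. It only rules out the most common misconfigurations.

```python
import json
import socket
import urllib.request

ZMQ_PORTS = [5557, 5558]  # endpoints from the launch commands in Step 3
JSZ_URL = "http://localhost:8222/jsz?streams=true"


def port_open(port, host="localhost"):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(2)
            return sock.connect_ex((host, port)) == 0
    except OSError:
        return False


def collect():
    """Gather live state: ZMQ port status and the JetStream summary."""
    ports = {p: port_open(p) for p in ZMQ_PORTS}
    try:
        with urllib.request.urlopen(JSZ_URL, timeout=5) as resp:
            jsz = json.loads(resp.read())
    except Exception:
        jsz = None  # monitoring endpoint unreachable
    return ports, jsz


def diagnose(ports, jsz):
    """Turn the collected state into human-readable hints."""
    hints = []
    for port, is_open in ports.items():
        if not is_open:
            hints.append(
                f"Nothing listening on ZMQ port {port}: "
                "was the worker started with --kv-events-config?"
            )
    if jsz is None:
        hints.append("NATS monitoring endpoint unreachable on port 8222")
    elif jsz.get("streams", 0) == 0:
        hints.append("JetStream reports 0 streams: no KV events stored yet")
    return hints or ["No obvious problems found"]


if __name__ == "__main__":
    for hint in diagnose(*collect()):
        print(hint)
```

Keeping `diagnose` separate from `collect` means the hint logic can be exercised with hand-written inputs, without any services running.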

In [49]:
def send_request(prompt):
    """Send inference request to Dynamo."""
    url = "http://localhost:8000/v1/chat/completions"
    payload = {
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 30
    }
    start = time.time()
    try:
        req = urllib.request.Request(
            url,
            data=json.dumps(payload).encode(),
            headers={'Content-Type': 'application/json'}
        )
        with urllib.request.urlopen(req, timeout=120) as resp:
            result = json.loads(resp.read())
            elapsed = time.time() - start
            return result["choices"][0]["message"]["content"], elapsed
    except Exception as e:
        return f"Error: {e}", 0

PROMPT_A = "Explain the theory of relativity in simple terms."
PROMPT_B = "What is the capital of Japan?"

print("=" * 60)
print("KV-AWARE ROUTING DEMO")
print("=" * 60)
print("\n🔍 Watch your nats-box terminal for BlockStored events!\n")

# Request 1: New prompt, creates KV cache
print(f"[1] '{PROMPT_A[:40]}...'")
resp, t = send_request(PROMPT_A)
print(f"    Time: {t:.2f}s | Response: {resp[:50]}...\n")
time.sleep(1)

# Request 2: Same prompt, should hit cache on same worker
print(f"[2] SAME PROMPT (should route to same worker)")
resp, t = send_request(PROMPT_A)
print(f"    Time: {t:.2f}s | Response: {resp[:50]}...\n")
time.sleep(1)

# Request 3: Different prompt
print(f"[3] '{PROMPT_B}'")
resp, t = send_request(PROMPT_B)
print(f"    Time: {t:.2f}s | Response: {resp[:50]}...\n")
time.sleep(1)

# Request 4: Repeat prompt B
print(f"[4] SAME AS [3] (should route to same worker)")
resp, t = send_request(PROMPT_B)
print(f"    Time: {t:.2f}s | Response: {resp[:50]}...\n")

print("=" * 60)
print("Check your nats-box terminal for the KV events!")

KV-AWARE ROUTING DEMO

🔍 Watch your nats-box terminal for BlockStored events!

[1] 'Explain the theory of relativity in simp...'
    Time: 3.44s | Response: <think>
Okay, so I need to explain relativity in s...

[2] SAME PROMPT (should route to same worker)
    Time: 0.42s | Response: <think>
Okay, let me try to explain relativity the...

[3] 'What is the capital of Japan?'
    Time: 3.52s | Response: <think>
Okay, so the user is asking about Japan's ...

[4] SAME AS [3] (should route to same worker)
    Time: 0.37s | Response: <think>
Okay, the user is asking about the capital...

Check your nats-box terminal for the KV events!
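
As a quick worked calculation, the repeat requests in this run were roughly 8x to 10x faster than the first request for each prompt. The snippet below computes the ratios from the timings printed above. Treat the numbers with caution: the first request to each worker may also include one-time warm-up cost, so latency alone does not prove a cache hit. The `BlockStored` events in the nats-box terminal are the more direct evidence.

```python
# (first request, repeated request) latencies in seconds, from the run above
timings = {"PROMPT_A": (3.44, 0.42), "PROMPT_B": (3.52, 0.37)}


def speedup(first, repeat):
    """Ratio of first-request latency to repeat-request latency."""
    return first / repeat


for name, (first, repeat) in timings.items():
    print(f"{name}: {speedup(first, repeat):.1f}x faster on repeat")
```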


---

## Step 7: Check NATS After Requests

Let's see how NATS state changed after sending requests.

In [None]:
print("=" * 60)
print("NATS CONNECTIONS")
print("=" * 60)

try:
    with urllib.request.urlopen("http://localhost:8222/connz", timeout=5) as resp:
        connz = json.loads(resp.read())
        connections = connz.get('connections', [])
        print(f"\nActive connections: {len(connections)}\n")
        
        for conn in connections:
            name = conn.get('name', 'unnamed')
            subs = conn.get('subscriptions', 0)
            msgs_in = conn.get('in_msgs', 0)
            msgs_out = conn.get('out_msgs', 0)
            print(f"  [{conn.get('cid')}] {name}")
            print(f"      Subs: {subs}, In: {msgs_in:,}, Out: {msgs_out:,}")
except Exception as e:
    print(f"Error: {e}")

print("\n" + "=" * 60)
print("JETSTREAM AFTER REQUESTS")
print("=" * 60)

try:
    with urllib.request.urlopen("http://localhost:8222/jsz?streams=true", timeout=5) as resp:
        jsz = json.loads(resp.read())
        print(f"\nTotal messages stored: {jsz.get('messages', 0)}")
        
        for account in jsz.get('account_details', []):
            for stream in account.get('stream_detail', []):
                name = stream.get('name', '?')
                msgs = stream.get('state', {}).get('messages', 0)
                subjects = stream.get('config', {}).get('subjects', [])
                print(f"\n  Stream: {name}")
                print(f"    Messages: {msgs}")
                print(f"    Subjects: {subjects}")
except Exception as e:
    print(f"Error: {e}")
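
To quantify "how NATS state changed", one option is to snapshot the `/varz` counters before and after the requests and subtract them. The helper below is a sketch of that idea. `diff_counters` is a hypothetical name, the `before` values are taken from the Step 5 output, and the `after` values are made-up placeholders rather than measured results.

```python
COUNTERS = ("connections", "in_msgs", "out_msgs")


def diff_counters(before: dict, after: dict) -> dict:
    """Return after - before for each tracked /varz counter."""
    return {k: after.get(k, 0) - before.get(k, 0) for k in COUNTERS}


before = {"connections": 2, "in_msgs": 3, "out_msgs": 21}   # values from Step 5
after = {"connections": 4, "in_msgs": 40, "out_msgs": 95}   # placeholder values
print(diff_counters(before, after))
```

In the notebook you would store the `varz` dictionary from Step 5 in a variable and pass a fresh `/varz` response as `after`.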

---

## Cleanup

In [None]:
%%bash
echo "Stopping Dynamo processes..."
pkill -f 'python -m dynamo' 2>/dev/null || true
sleep 2
echo "✓ Cleanup complete"

echo ""
nvidia-smi --query-gpu=index,memory.used,memory.free --format=csv

---

## Summary: NATS in Dynamo

### Architecture

```
┌────────────────────────────────────────────────────────────────────┐
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────────────┐   │
│  │   Frontend  │────▶│  KV Router  │────▶│      Workers        │   │
│  │   (:8000)   │ TCP │             │ TCP │  (GPU 0, GPU 1)     │   │
│  └─────────────┘     └──────┬──────┘     └──────────┬──────────┘   │
│                             │                       │              │
│                             │ subscribe             │ publish      │
│                             ▼                       ▼              │
│                    ┌────────────────────────────────────────┐      │
│                    │           NATS JetStream               │      │
│                    │  • KV cache events (BlockStored)       │      │
│                    │  • Router uses to pick best worker     │      │
│                    └────────────────────────────────────────┘      │
│                                                                    │
│  Request data → TCP (fast)                                         │
│  KV events → NATS (coordination)                                   │
└────────────────────────────────────────────────────────────────────┘
```

### Monitoring Commands

```bash
# Subscribe to all NATS messages
docker run --rm -it --network host natsio/nats-box nats sub ">" --server localhost:4222

# Real-time monitoring
docker run --rm -it --network host natsio/nats-box nats-top -s localhost:4222

# List JetStream streams
docker run --rm --network host natsio/nats-box nats stream ls --server localhost:4222
```

### Key Takeaways

1. **NATS is for events, not data**: KV cache *events* flow through NATS; actual data uses TCP/NIXL
2. **KV-aware routing**: Router subscribes to events to know which worker has which prefix
3. **Same prefix → same worker**: Requests with similar prompts route to workers with cached prefixes
4. **nats-box is your friend**: Simple CLI for real-time NATS monitoring