# Module 01: Setup and First Inference

> **Goal**: Get Dynamo running and make your first inference request in under 15 minutes.

---

## Step 1: Verify Your Environment

Let's make sure the basics are in place: Python, a GPU, and the CUDA toolkit.

In [3]:
# Quick environment check
import subprocess
import sys

print("=" * 60)
print("ENVIRONMENT CHECK")
print("=" * 60)

# Python version
print(f"\n✓ Python: {sys.version.split()[0]}")

# GPU check
try:
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=name,memory.total,driver_version', '--format=csv,noheader'],
        capture_output=True, text=True, timeout=5
    )
    if result.returncode == 0:
        for line in result.stdout.strip().split('\n'):
            print(f"✓ GPU: {line}")
    else:
        print("✗ No GPU detected")
except Exception as e:
    print(f"✗ GPU check failed: {e}")

# CUDA check
try:
    result = subprocess.run(['nvcc', '--version'], capture_output=True, text=True, timeout=5)
    if result.returncode == 0:
        cuda_version = [l for l in result.stdout.split('\n') if 'release' in l][0]
        print(f"✓ CUDA: {cuda_version.split('release')[-1].strip().rstrip(',').strip()}")
    else:
        print("! nvcc returned an error (may still work with runtime)")
except Exception:
    # nvcc is missing, timed out, or printed output in an unexpected format
    print("! CUDA toolkit not found (may still work with runtime)")

print("\n" + "=" * 60)

ENVIRONMENT CHECK

✓ Python: 3.12.3
✓ GPU: NVIDIA GB10, [N/A], 580.95.05
✓ CUDA: 13.0, V13.0.88
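
The GPU line above is comma-separated, and a field the driver cannot report shows up as `[N/A]` (as the memory total does here). A minimal sketch of turning one such line into a dictionary; `parse_gpu_line` is a hypothetical helper, not part of any library, and it assumes the three fields queried in the cell above:

```python
def parse_gpu_line(line):
    """Parse one line of
    `nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader`
    into a dict. Hypothetical helper for illustration."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 3:
        raise ValueError(f"expected 3 comma-separated fields, got {len(fields)}: {line!r}")
    name, memory_total, driver_version = fields
    return {
        "name": name,
        # nvidia-smi prints "[N/A]" for values it cannot report
        "memory_total": None if memory_total == "[N/A]" else memory_total,
        "driver_version": driver_version,
    }


gpu = parse_gpu_line("NVIDIA GB10, [N/A], 580.95.05")
print(gpu)
```

A `None` memory total is worth handling explicitly if later cells size batches or caches from it.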



---

## Step 2: Install Dynamo

Dynamo can be installed from PyPI. We'll use SGLang as our inference backend.

In [3]:
# Install Dynamo with SGLang (as per official quickstart)
# Note: This may take a few minutes

%pip install "ai-dynamo[sglang]" --quiet

print("\n" + "=" * 60)
print("Verifying installation...")
print("=" * 60)

# Verify Python imports
try:
    import dynamo
    print("✓ Dynamo Python package imported successfully")
except ImportError as e:
    print(f"✗ Dynamo import failed: {e}")

try:
    import torch
    print(f"✓ PyTorch version: {torch.__version__}")
    print(f"✓ CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"✓ CUDA device: {torch.cuda.get_device_name(0)}")
except ImportError as e:
    print(f"✗ PyTorch import failed: {e}")

try:
    import sglang
    print(f"✓ SGLang version: {sglang.__version__}")
except ImportError as e:
    print(f"✗ SGLang import failed: {e}")

Note: you may need to restart the kernel to use updated packages.

Verifying installation...
✓ Dynamo Python package imported successfully
✓ PyTorch version: 2.9.1+cu130
✓ CUDA available: True
✓ CUDA device: NVIDIA GB10
✓ SGLang version: 0.5.6.post2
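
Importing a package is one way to verify it. Another is to read the installed distribution metadata, which works without loading the package. A minimal sketch using the standard library's `importlib.metadata`; `installed_version` is a hypothetical helper, `ai-dynamo` comes from the install command above, and `sglang` and `torch` are assumed to be the distribution names in this environment:

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string for a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names other than "ai-dynamo" are assumptions about this environment
for dist in ("ai-dynamo", "sglang", "torch"):
    version = installed_version(dist)
    print(f"{dist}: {version if version is not None else 'not installed'}")
```

This can be handy right after `%pip install`, when a kernel restart may be needed before a fresh import picks up the new packages.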


---

## Step 3: Start Infrastructure (etcd + NATS)

Dynamo needs two services running:
- **etcd**: Service discovery (workers register here)
- **NATS**: Event messaging (KV cache notifications)

We'll use Docker for quick setup:

In [7]:
%%bash
# Start etcd if not running
if ! docker ps | grep -q dynamo-etcd; then
    echo "Starting etcd..."
    docker run -d \
        --name dynamo-etcd \
        --restart unless-stopped \
        -p 2379:2379 \
        -p 2380:2380 \
        quay.io/coreos/etcd:v3.5.17 \
        /usr/local/bin/etcd \
        --name etcd0 \
        --advertise-client-urls http://0.0.0.0:2379 \
        --listen-client-urls http://0.0.0.0:2379 \
        --initial-advertise-peer-urls http://0.0.0.0:2380 \
        --listen-peer-urls http://0.0.0.0:2380 \
        --initial-cluster etcd0=http://0.0.0.0:2380
    echo "✓ etcd started"
else
    echo "✓ etcd already running"
fi

# Start NATS if not running
if ! docker ps | grep -q dynamo-nats; then
    echo "Starting NATS..."
    docker run -d \
        --name dynamo-nats \
        --restart unless-stopped \
        -p 4222:4222 \
        -p 8222:8222 \
        nats:latest \
        -js -m 8222
    echo "✓ NATS started"
else
    echo "✓ NATS already running"
fi

# Wait for services to be ready
sleep 2
echo ""
echo "Infrastructure status:"
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}" | grep -E "(NAMES|dynamo)"

✓ etcd already running
✓ NATS already running

Infrastructure status:
NAMES         STATUS          PORTS
dynamo-nats   Up 48 minutes   0.0.0.0:4222->4222/tcp, [::]:4222->4222/tcp, 0.0.0.0:8222->8222/tcp, [::]:8222->8222/tcp, 6222/tcp
dynamo-etcd   Up 48 minutes   0.0.0.0:2379-2380->2379-2380/tcp, [::]:2379-2380->2379-2380/tcp


In [8]:
# Verify services are responding
import urllib.request
import json

print("Checking service health...\n")

# Check etcd
try:
    with urllib.request.urlopen('http://localhost:2379/health', timeout=5) as response:
        data = json.loads(response.read())
        print(f"✓ etcd: {data}")
except Exception as e:
    print(f"✗ etcd not responding: {e}")

# Check NATS
try:
    with urllib.request.urlopen('http://localhost:8222/healthz', timeout=5) as response:
        status = response.read().decode()
        print(f"✓ NATS: {status}")
except Exception as e:
    print(f"✗ NATS not responding: {e}")

Checking service health...

✓ etcd: {'health': 'true', 'reason': ''}
✓ NATS: {"status":"ok"}
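
A fixed `sleep` can be too short on a cold start and wastefully long otherwise. A minimal sketch of polling a health endpoint until it answers; `http_ok` and `wait_until_ready` are hypothetical helpers, and the default deadline and interval are arbitrary choices:

```python
import time
import urllib.request


def http_ok(url, timeout=2):
    """Return True if a GET to `url` succeeds with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses
        return False


def wait_until_ready(url, deadline_s=30.0, interval_s=0.5, probe=http_ok):
    """Call `probe(url)` repeatedly until it returns True or `deadline_s` elapses.

    Returns True if the service became ready, False if the deadline passed.
    """
    stop_at = time.monotonic() + deadline_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= stop_at:
            return False
        time.sleep(interval_s)
```

With the endpoints from the cell above, `wait_until_ready("http://localhost:2379/health")` and `wait_until_ready("http://localhost:8222/healthz")` would block until each service responds or 30 seconds pass. The `probe` parameter exists only so the loop can be exercised without a live service.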


---

## Step 4: Download a Model

We'll use a small model for quick testing: **Qwen3-0.6B**

This is small enough to load quickly but demonstrates all Dynamo functionality.

In [10]:
# Model configuration
MODEL_NAME = "Qwen/Qwen3-0.6B"

print(f"Model: {MODEL_NAME}")
print("\nThis model will be downloaded on first use.")
print("For faster testing, we'll let the backend handle the download.")

Model: Qwen/Qwen3-0.6B

This model will be downloaded on first use.
For faster testing, we'll let the backend handle the download.
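
If the backend fetches weights through the Hugging Face Hub (an assumption; this notebook only says the backend handles the download), the files land in the Hub cache, by default under `~/.cache/huggingface/hub` in a directory named `models--<org>--<name>`. A minimal sketch for checking whether a model is already cached; `hub_cache_dir` and `is_model_cached` are hypothetical helpers, not Hub APIs:

```python
import os
from pathlib import Path


def hub_cache_dir(env=None):
    """Resolve the Hub cache directory from HF_HUB_CACHE or HF_HOME, else the default."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"])
    hf_home = env.get("HF_HOME") or str(Path.home() / ".cache" / "huggingface")
    return Path(hf_home) / "hub"


def is_model_cached(model_name, cache_dir=None):
    """Return True if at least one snapshot of `model_name` exists in the cache."""
    cache_dir = hub_cache_dir() if cache_dir is None else Path(cache_dir)
    repo_dir = cache_dir / ("models--" + model_name.replace("/", "--"))
    snapshots = repo_dir / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())


print(is_model_cached("Qwen/Qwen3-0.6B"))
```

A `False` result here suggests the first worker start will spend extra time downloading before the model loads.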


---

## Step 5: Start Dynamo Frontend and Worker

Dynamo separates the frontend from the workers, and each runs as its own process:
- **Frontend**: Handles HTTP requests and routes them to workers
- **Worker**: Runs the model inference (SGLang backend)

The cell below will start both processes in the background:

In [14]:
%%bash
# Kill any existing Dynamo processes and free port 8000
pkill -f 'python -m dynamo' 2>/dev/null || true
fuser -k 8000/tcp 2>/dev/null || true
sleep 2

# Start the frontend
python -m dynamo.frontend > /tmp/dynamo_frontend.log 2>&1 &

# Start the SGLang worker
python -m dynamo.sglang --model-path Qwen/Qwen3-0.6B > /tmp/dynamo_worker.log 2>&1 &

echo "Started Dynamo frontend and SGLang worker in background"
echo "Wait ~30s for the model to load before continuing"
echo "Check logs: /tmp/dynamo_frontend.log and /tmp/dynamo_worker.log"

Started Dynamo frontend and SGLang worker in background
Wait ~30s for the model to load before continuing
Check logs: /tmp/dynamo_frontend.log and /tmp/dynamo_worker.log


In [15]:
%%bash
# Wait for services to start and check logs
echo "Waiting for Dynamo to start..."
sleep 10

echo ""
echo "=== Frontend Log (last 20 lines) ==="
tail -20 /tmp/dynamo_frontend.log 2>/dev/null || echo "No frontend log yet"

echo ""
echo "=== Worker Log (last 20 lines) ==="
tail -20 /tmp/dynamo_worker.log 2>/dev/null || echo "No worker log yet"

echo ""
echo "=== Health Check ==="
curl -s http://localhost:8000/health && echo " - Frontend is UP" || echo "Frontend not responding yet"

Waiting for Dynamo to start...

=== Frontend Log (last 20 lines) ===
2026-02-01T19:47:40.738379Z  INFO dynamo_runtime::distributed: Initializing KV store discovery backend
2026-02-01T19:47:40.738508Z  INFO dynamo_runtime::pipeline::network::manager: Initializing NetworkManager with TCP request plane mode=tcp host=192.168.1.76 port=OS-assigned
2026-02-01T19:47:40.739994Z  INFO dynamo_llm::http::service::service_v2: Starting HTTP(S) service protocol="HTTP" address="0.0.0.0:8000"

=== Worker Log (last 20 lines) ===
    from .communication_op import *
  File "/root/src/github.com/sara4dev/ai-dynamo-the-hard-way/.venv/lib/python3.12/site-packages/vllm/distributed/communication_op.py", line 9, in <module>
    from .parallel_state import get_tp_group
  File "/root/src/github.com/sara4dev/ai-dynamo-the-hard-way/.venv/
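
Because both processes run in the background, a crash during startup only shows up in the log files. A minimal sketch that scans a log for the start of a Python traceback, so a failed worker is noticed before any request is sent; `first_traceback` and `check_log` are hypothetical helpers, and the log path comes from the cell above:

```python
from pathlib import Path

TRACEBACK_MARKER = "Traceback (most recent call last):"


def first_traceback(log_text, context_lines=20):
    """Return the first traceback marker line plus up to `context_lines`
    following lines, or None if the log contains no Python traceback."""
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if TRACEBACK_MARKER in line:
            return "\n".join(lines[i:i + 1 + context_lines])
    return None


def check_log(path):
    """Summarize whether the log file at `path` contains a traceback."""
    log_file = Path(path)
    if not log_file.exists():
        return f"{path}: no log yet"
    found = first_traceback(log_file.read_text(errors="replace"))
    if found is None:
        return f"{path}: no traceback"
    return f"{path}: traceback found\n{found}"


print(check_log("/tmp/dynamo_worker.log"))
```

Scanning from the top matters here: `tail -20` can show only the end of a long traceback, while the first marker line shows where the failure began.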

---

## Step 6: Your First Inference Request! 🎉

Now let's send a request using the OpenAI-compatible API:

In [None]:
import urllib.request
import json
import time

# Dynamo endpoint
DYNAMO_URL = "http://localhost:8000/v1/chat/completions"

# Request payload (OpenAI-compatible format)
payload = {
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
        {"role": "user", "content": "What is the capital of France? Answer in one sentence."}
    ],
    "max_tokens": 50,
    "temperature": 0.7
}

print("Sending request to Dynamo...")
print(f"Prompt: {payload['messages'][0]['content']}")
print("-" * 50)

start_time = time.time()

try:
    req = urllib.request.Request(
        DYNAMO_URL,
        data=json.dumps(payload).encode(),
        headers={'Content-Type': 'application/json'}
    )
    with urllib.request.urlopen(req, timeout=60) as response:
        result = json.loads(response.read())
        
    elapsed = time.time() - start_time
    
    # Extract response
    answer = result['choices'][0]['message']['content']
    usage = result.get('usage', {})
    
    print(f"\n✓ Response received in {elapsed:.2f}s")
    print(f"\nAnswer: {answer}")
    print(f"\nTokens: {usage.get('prompt_tokens', '?')} prompt + {usage.get('completion_tokens', '?')} completion")
    
except urllib.error.URLError as e:
    print(f"\n✗ Connection failed: {e}")
    print("\nMake sure the Dynamo frontend and worker are running (Step 5)")
except Exception as e:
    print(f"\n✗ Error: {e}")
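
The response follows the OpenAI chat-completions shape, so the fields read in the cell above can be collected by a small helper. A minimal sketch; `summarize_chat_response` is a hypothetical name, and the sample dictionary holds made-up values for illustration, not output captured from Dynamo:

```python
def summarize_chat_response(result):
    """Extract the answer text, finish reason, and token counts from a
    chat-completions response dict."""
    choice = result["choices"][0]
    usage = result.get("usage") or {}
    return {
        "answer": choice["message"]["content"],
        # "length" means generation stopped at max_tokens; "stop" means it ended on its own
        "finish_reason": choice.get("finish_reason"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
    }


# Illustrative response in the OpenAI chat-completions format (made-up values)
sample = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "The capital of France is Paris."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 20, "completion_tokens": 8},
}

summary = summarize_chat_response(sample)
print(summary["answer"])
if summary["finish_reason"] == "length":
    print("(answer was cut off at max_tokens)")
```

Checking `finish_reason` is useful with a small `max_tokens` such as the 50 used above, since a truncated answer otherwise looks like a complete one.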

---

## Step 7: Test Streaming Response

Dynamo supports streaming for real-time token generation:

In [None]:
import urllib.request
import json
import time

DYNAMO_URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
        {"role": "user", "content": "Count from 1 to 10, one number per line."}
    ],
    "max_tokens": 100,
    "stream": True  # Enable streaming
}

print("Streaming response:")
print("-" * 50)

start_time = time.time()
first_token_time = None

try:
    req = urllib.request.Request(
        DYNAMO_URL,
        data=json.dumps(payload).encode(),
        headers={'Content-Type': 'application/json'}
    )
    
    with urllib.request.urlopen(req, timeout=60) as response:
        for line in response:
            line = line.decode().strip()
            if line.startswith('data: '):
                data = line[6:]  # Remove 'data: ' prefix
                if data == '[DONE]':
                    break
                try:
                    chunk = json.loads(data)
                    delta = chunk['choices'][0].get('delta', {})
                    content = delta.get('content', '')
                    if content:
                        if first_token_time is None:
                            first_token_time = time.time()
                        print(content, end='', flush=True)
                except json.JSONDecodeError:
                    pass
    
    total_time = time.time() - start_time
    
    print("\n\n" + "-" * 50)
    if first_token_time is not None:
        print(f"Time to First Token (TTFT): {first_token_time - start_time:.3f}s")
    else:
        # Avoid reporting a misleading 0.000s when no content arrived
        print("Time to First Token (TTFT): n/a (no content received)")
    print(f"Total time: {total_time:.2f}s")
    
except Exception as e:
    print(f"\n✗ Error: {e}")
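
The streaming loop above parses server-sent events: each payload line starts with `data: `, carries one JSON chunk, and the stream ends with `data: [DONE]`. A minimal sketch that isolates this parsing in a generator so it can be tried without a running server; `iter_content_deltas` is a hypothetical helper, and the sample lines are hand-written in the OpenAI streaming format rather than captured from Dynamo:

```python
import json


def iter_content_deltas(lines):
    """Yield text fragments from an iterable of raw SSE lines (bytes)."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data: "):
            continue  # skip blank separators and non-data fields
        data = line[len("data: "):]
        if data == "[DONE]":
            return  # end-of-stream sentinel
        try:
            chunk = json.loads(data)
        except json.JSONDecodeError:
            continue  # ignore malformed chunks, as the cell above does
        choices = chunk.get("choices") or []
        if not choices:
            continue  # a chunk without choices carries no text
        content = choices[0].get("delta", {}).get("content")
        if content:
            yield content


# Hand-written sample stream (illustrative, not captured output)
sample_stream = [
    b'data: {"choices": [{"delta": {"role": "assistant"}}]}\n',
    b"\n",
    b'data: {"choices": [{"delta": {"content": "1\\n"}}]}\n',
    b'data: {"choices": [{"delta": {"content": "2"}}]}\n',
    b"data: [DONE]\n",
    b'data: {"choices": [{"delta": {"content": "ignored"}}]}\n',
]

print("".join(iter_content_deltas(sample_stream)))
```

Since an HTTP response object iterates over lines, passing `response` from `urllib.request.urlopen` in place of `sample_stream` should work the same way.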

---

## Step 8: Verify System State

Let's check the full system status:

In [None]:
import subprocess

print("=" * 60)
print("SYSTEM STATUS")
print("=" * 60)

# GPU utilization
print("\n📊 GPU Status:")
!nvidia-smi --query-gpu=name,memory.used,memory.total,utilization.gpu --format=csv

# Docker containers
print("\n🐳 Infrastructure Containers:")
!docker ps --filter name=dynamo --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

# Health endpoints
print("\n🏥 Health Checks:")
import urllib.request
import json

services = [
    ("Dynamo Frontend", "http://localhost:8000/health"),
    ("etcd", "http://localhost:2379/health"),
    ("NATS", "http://localhost:8222/healthz"),
]

for name, url in services:
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            print(f"  ✓ {name}: OK")
    except Exception:
        print(f"  ✗ {name}: Not responding")

---

## 🎉 Congratulations!

You now have a working Dynamo setup:

| Component | Status | Port |
|-----------|--------|------|
| etcd (Service Discovery) | Running | 2379 |
| NATS (Messaging) | Running | 4222 |
| Dynamo Frontend (HTTP API) | Running | 8000 |
| Dynamo Worker (SGLang) | Running | n/a (reached through the frontend) |

### What's Next?

- **Module 02**: Deep dive into the Frontend (HTTP handling, routing)
- **Module 03**: Compare different backends (vLLM, SGLang, TensorRT-LLM)
- **Module 04**: Understanding etcd service discovery in detail

---

## Cleanup (Optional)

Run this cell to stop the Dynamo processes when you're done:

In [None]:
%%bash
# Stop Dynamo processes
pkill -f 'python -m dynamo' || true
echo "Dynamo processes stopped"

# Uncomment below to also stop infrastructure containers
# docker stop dynamo-etcd dynamo-nats
# docker rm dynamo-etcd dynamo-nats