snonux/hypr

hypr

Automates Hyperstack GPU VM lifecycle: create, bootstrap, WireGuard tunnel, and vLLM inference. Runs two A100 VMs concurrently — each serving a different model — with Pi coding agents connected to each.

Architecture

  ┌─────────────┐
  │   Laptop    │
  └──────┬──────┘
         │ SSH (into bhyve VM, not the host)
         │
  FreeBSD physical host (earth)
  ┌─────────────────────────────────────────────────────────────────────────┐
  │                                                                         │
  │  FreeBSD bhyve VM  (isolation layer)          192.168.3.2 / wg1        │
  │  ┌───────────────────────────────────────────────────────────────────┐  │
  │  │                          ▲                                        │  │
  │  │                          │ SSH                                    │  │
  │  │  tmux session  (tmux attach)                                      │  │
  │  │  ┌─────────────────────────────────────────────────────────────┐  │  │
  │  │  │  window 0                                                   │  │  │
  │  │  │  ┌───────────────────────┬─────────────────────────┐        │  │  │
  │  │  │  │ pane 0: pi-nemotron   │ pane 1: pi-coder        │        │  │  │
  │  │  │  │                       │                         │        │  │  │
  │  │  │  │ Pi                    │ Pi                      │        │  │  │
  │  │  │  │ Nemotron-3-Super      │ Qwen3-Coder-Next        │        │  │  │
  │  │  │  └──────────┬────────────┘└────────────┬───────────┘        │  │  │
  │  │  │             │ OpenAI API               │ OpenAI API         │  │  │
  │  │  │             │/v1/chat/completions      │/v1/chat/completions│  │  │
  │  │  └─────────────┼──────────────────────────┼────────────────────┘  │  │
  │  │                │                          │                       │  │
  │  └────────────────┼──────────────────────────┼───────────────────────┘  │
  │                   │ WireGuard wg1            │ WireGuard wg1            │
  └───────────────────┼──────────────────────────┼──────────────────────────┘
                      │ 192.168.3.0/24           │ 192.168.3.0/24
                      │ UDP :56710               │ UDP :56710
                      ▼                          ▼
  ┌──────────────────────────┐  ┌──────────────────────────┐
  │ VM1 (A100 80GB)          │  │ VM2 (A100 80GB)          │
  │ 192.168.3.1              │  │ 192.168.3.3              │
  │ hyperstack1.wg1          │  │ hyperstack2.wg1          │
  │                          │  │                          │
  │ vLLM :11434              │  │ vLLM :11434              │
  │ Nemotron-3-Super 120B    │  │ Qwen3-Coder-Next 80B     │
  │ (Mamba+MoE, AWQ-4bit)    │  │ (MoE, AWQ-4bit)          │
  └──────────────────────────┘  └──────────────────────────┘

WireGuard topology:

  • Interface wg1 on earth carries traffic to both VMs simultaneously
  • earth is 192.168.3.2; VM1 is .1; VM2 is .3; tunnel port is 56710/udp
  • Adding VM2 to an existing wg1 tunnel: wg1-setup.sh adds a second [Peer] block without disturbing VM1
  • vLLM on each VM listens on 0.0.0.0:11434, firewalled to 192.168.3.0/24 (WireGuard subnet only)
  • Pi connects directly to each VM's vLLM over the tunnel — no proxy or load balancer

Why Pi

  • Bring-your-own model — connects to any OpenAI-compatible endpoint; no translation proxy needed between Pi and vLLM
  • Custom providers via models.json — define hyperstack, hyperstack1, and hyperstack2 providers once; fish abbreviations route to the right VM
  • Project-local config — symlink this repo's pi/ directory to ~/.pi; Pi picks up models.json, settings.json, extensions, and skills automatically
  • TypeScript extensions — custom behaviour (web search, loop scheduler, ask-mode) lives in pi/agent/extensions/ and loads from the symlink
  • Minimal core — no built-in sub-agents, plan mode, or permission popups; fast TUI with mid-session model switching via Ctrl+L

Prerequisites

  • Hyperstack account with API key in ~/.hyperstack
  • SSH key registered in Hyperstack as earth (or change ssh.hyperstack_key_name in the TOML)
  • Review [network].allowed_ssh_cidrs and [network].allowed_wireguard_cidrs in your TOML. The secure default is ["auto"], which resolves your current public egress IP to /32. Set explicit CIDRs or HYPERSTACK_OPERATOR_CIDR if you deploy from a different network.
  • WireGuard setup script: wg1-setup.sh (present in this directory)
  • Ruby with toml-rb gem: bundle install
  • Pi coding agent installed

WireGuard setup

hyperstack.rb runs wg1-setup.sh automatically during create / create-both. This section explains the tunnel design for reference and manual troubleshooting.

Tunnel design

earth (192.168.3.2)
  /etc/wireguard/wg1.conf
  [Interface]  Address = 192.168.3.2/24
  [Peer]  # VM1 — AllowedIPs = 192.168.3.1/32, Endpoint = <vm1-public-ip>:56710
  [Peer]  # VM2 — AllowedIPs = 192.168.3.3/32, Endpoint = <vm2-public-ip>:56710

A single wg1 interface on earth carries traffic to both VMs. Each VM is a separate [Peer] block. Adding VM2 to an existing tunnel with VM1 already running leaves VM1's peer untouched.
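
Expanded into a full config, that looks roughly like the following (a sketch: keys are elided, and exact options such as keepalives or the listen port depend on what wg1-setup.sh actually writes):

```ini
# /etc/wireguard/wg1.conf on earth (illustrative sketch, key material elided)
[Interface]
Address = 192.168.3.2/24
ListenPort = 56710
PrivateKey = <earth-private-key>

# VM1
[Peer]
PublicKey = <vm1-public-key>
AllowedIPs = 192.168.3.1/32
Endpoint = <vm1-public-ip>:56710

# VM2 (appended by the second wg1-setup.sh run; VM1's block is untouched)
[Peer]
PublicKey = <vm2-public-key>
AllowedIPs = 192.168.3.3/32
Endpoint = <vm2-public-ip>:56710
```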

Manual setup

# VM1 (first VM — generates fresh keys, writes /etc/wireguard/wg1.conf from scratch)
./wg1-setup.sh <vm1-public-ip>

# VM2 (additional VM — adds a [Peer] block to the existing wg1.conf)
./wg1-setup.sh <vm2-public-ip> 192.168.3.3 hyperstack2.wg1

Verify the tunnel

# Show active peers and handshake times (both VMs should appear)
sudo wg show wg1

# Ping each VM through the tunnel
ping -c 3 192.168.3.1   # VM1
ping -c 3 192.168.3.3   # VM2

# Check vLLM is reachable over the tunnel
curl http://hyperstack1.wg1:11434/v1/models
curl http://hyperstack2.wg1:11434/v1/models

Restart / recover

# Restart tunnel locally (e.g. after network change)
sudo systemctl restart wg-quick@wg1

# Restart tunnel on VM after a reboot (ssh via public IP since WireGuard is down)
ssh ubuntu@<vm-public-ip> 'sudo systemctl start wg-quick@wg1'

# Re-run setup when VM IP changes (e.g. after delete + recreate)
./wg1-setup.sh <new-vm1-public-ip>
./wg1-setup.sh <new-vm2-public-ip> 192.168.3.3 hyperstack2.wg1

Quickstart (two-VM setup)

# Deploy both VMs in parallel, set up WireGuard + vLLM (~10 min)
ruby hyperstack.rb create-both

# Verify both VMs are working
ruby hyperstack.rb --config hyperstack-vm1.toml test
ruby hyperstack.rb --config hyperstack-vm2.toml test

# Launch Pi coding agents — one per terminal (fish abbreviations from hyperstack.fish)
pi-hyperstack-nemotron   # Nemotron-3-Super 120B on VM1
pi-hyperstack-coder      # Qwen3-Coder-Next on VM2

# Tear down both VMs
ruby hyperstack.rb delete-both

Using Pi

Pi is the coding agent frontend used with this setup. Each Hyperstack VM runs a vLLM instance; Pi connects to it directly over the WireGuard tunnel.

Installation

Install Pi from pi.dev, then link the project-local config into place:

ln -s /path/to/hyperstack/pi ~/.pi

This symlink makes Pi pick up pi/agent/models.json and pi/agent/settings.json from this repo as its agent configuration, so the Hyperstack providers and model definitions are available without any manual config editing.
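
As an illustration of what such a provider entry contains (a sketch only: the field names below are assumptions, not Pi's actual models.json schema; see pi/agent/models.json in this repo for the real shape):

```json
{
  "providers": {
    "hyperstack1": {
      "baseUrl": "http://hyperstack1.wg1:11434/v1",
      "apiKey": "EMPTY",
      "models": ["cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit"]
    }
  }
}
```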

Fish shell abbreviations

Source hyperstack.fish or copy the abbreviations into your Fish config:

abbr pi-hyperstack         pi --model hyperstack/openai/gpt-oss-120b
abbr pi-hyperstack-nemotron pi --model hyperstack1/cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit
abbr pi-hyperstack-coder    pi --model hyperstack2/bullpoint/Qwen3-Coder-Next-AWQ-4bit

Then launch a session after the VM(s) are up:

pi-hyperstack            # single-VM → GPT-OSS 120B on hyperstack.wg1
pi-hyperstack-nemotron   # two-VM → Nemotron-3-Super 120B on VM1
pi-hyperstack-coder      # two-VM → Qwen3-Coder-Next 80B on VM2

Model configuration (pi/agent/models.json)

Three providers are defined, one per setup, each pointing at its vLLM endpoint over WireGuard:

| Provider | Base URL | Primary model |
|---|---|---|
| hyperstack | http://hyperstack.wg1:11434/v1 | GPT-OSS 120B (single-VM) |
| hyperstack1 | http://hyperstack1.wg1:11434/v1 | Nemotron-3-Super 120B |
| hyperstack2 | http://hyperstack2.wg1:11434/v1 | Qwen3-Coder-Next 80B |

All model presets from the TOML configs are registered under each provider, so any model can be run on any VM after a model switch (see Switching models).

Settings (pi/agent/settings.json)

{
  "defaultProvider": "openai",
  "defaultModel": "gpt-4.1"
}

The default provider/model is OpenAI so that bare pi uses OpenAI rather than a Hyperstack VM. Use the fish abbreviations above to route to a specific VM.

Hot-switching models within Pi

After loading a different model on a VM with model switch (see Switching models), tell Pi to use it without restarting the session:

model switch hyperstack1/openai/gpt-oss-120b

Pi sends subsequent requests to the new model ID immediately; the provider base URL stays the same.
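
To confirm what a VM is actually serving, independently of Pi, you can query its /v1/models endpoint over the tunnel and extract the model IDs. The response body below is an illustrative shape, not captured output:

```shell
# In practice the response comes from:
#   curl -s http://hyperstack1.wg1:11434/v1/models
resp='{"object":"list","data":[{"id":"openai/gpt-oss-120b","object":"model"}]}'

# Pull out the served model IDs
echo "$resp" | grep -o '"id":"[^"]*"'
```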

Extensions

Custom extensions live in pi/agent/extensions/ and are loaded automatically via the ~/.pi symlink.

| Extension | Purpose |
|---|---|
| web-search | web_search and web_fetch tools — DuckDuckGo search + page fetching, no API key |
| ask-mode | /ask command — restricts the model to read-only exploration tools |
| loop-scheduler | /loop and /watch commands — recurring prompts plus condition-driven prompts |
| inline-bash | !{cmd} syntax — expands shell output inline before sending to the model |
| session-name | Auto-names sessions from the first message |
| modal-editor | Opens an external editor ($VISUAL) for composing long prompts |
| handoff | Compacts and hands off context to a fresh session |
| fresh-subagent | Spawns a sub-agent in a clean context for isolated tasks |
| reload-runtime | /reload-runtime command — hot-reloads extensions without restarting Pi |
| nemotron-tool-repair | Repairs malformed tool calls from Nemotron models |
| agent-plan-mode | Integrates task management into Pi sessions |

Web search

The web-search extension registers two LLM-callable tools:

  • web_search — searches DuckDuckGo and returns up to 8 results (title, URL, snippet)
  • web_fetch — fetches a URL and returns up to 12,000 characters of readable text

Example prompts:

Search for the vLLM 0.9.0 changelog
Find the Qwen3-Coder model card and summarize the recommended vLLM flags

No API key or account required. Uses DuckDuckGo's free HTML endpoint.

Single-VM setup

A single VM can be deployed with the default config (GPT-OSS 120B):

ruby hyperstack.rb create                # uses hyperstack-vm.toml
ruby hyperstack.rb test
pi-hyperstack                            # fish abbreviation → hyperstack/openai/gpt-oss-120b
ruby hyperstack.rb delete

VM configuration

| Config file | Default model | WireGuard IP | Hostname |
|---|---|---|---|
| hyperstack-vm1.toml | Nemotron-3-Super 120B (AWQ-4bit) | 192.168.3.1 | hyperstack1.wg1 |
| hyperstack-vm2.toml | Qwen3-Coder-Next 80B (AWQ-4bit) | 192.168.3.3 | hyperstack2.wg1 |
| hyperstack-vm.toml | GPT-OSS 120B (single-VM mode) | 192.168.3.1 | hyperstack.wg1 |

Each VM has independent state files so they can be managed separately:

ruby hyperstack.rb --config hyperstack-vm1.toml status
ruby hyperstack.rb --config hyperstack-vm2.toml status

Switching models

Each VM has named model presets in its TOML config. Hot-switch without reprovisioning:

ruby hyperstack.rb --config hyperstack-vm1.toml model switch qwen3-coder-next
ruby hyperstack.rb --config hyperstack-vm2.toml model switch nemotron-super

Available presets (both VMs share the same set):

| Preset | Model | VRAM | Context |
|---|---|---|---|
| nemotron-super | Nemotron-3-Super 120B (Mamba+MoE, 12B active) | ~60 GB | 131K |
| qwen3-coder-next | Qwen3-Coder-Next 80B (MoE, AWQ-4bit) | ~45 GB | 262K |
| gpt-oss-120b | GPT-OSS 120B (MoE, MXFP4) | ~65 GB | 131K |
| gpt-oss-20b | GPT-OSS 20B (MoE, MXFP4) | ~14 GB | 65K |
| qwen25-coder-32b | Qwen2.5-Coder-32B-Instruct (AWQ) | ~18 GB | 32K |
| qwen3-coder-30b | Qwen3-Coder-30B-A3B (MoE, AWQ) | ~18 GB | 65K |
| deepseek-r1-32b | DeepSeek-R1-Distill-Qwen-32B (AWQ) | ~18 GB | 32K |
| qwen3-32b | Qwen3-32B (AWQ) | ~18 GB | 32K |
| devstral | Devstral-Small-2507 (AWQ-4bit) | ~15 GB | 32K |

CLI reference

ruby hyperstack.rb [--config path] <command> [options]

Commands:
  create       Deploy a new VM and run full provisioning
  create-both  Deploy VM1 + VM2 in parallel (uses hyperstack-vm1/vm2.toml)
  delete       Destroy the tracked VM
  delete-both  Destroy both VM1 and VM2
  status       Show VM and WireGuard status
  watch        Live dashboard: vLLM + GPU stats for all active VMs (refreshes every 5 s)
  test         Run end-to-end inference tests (vLLM)
  model switch <preset>  Hot-switch the running vLLM model

create / create-both options:
  --replace          Delete existing tracked VM before creating
  --dry-run          Print the plan without making changes
  --vllm / --no-vllm      Override config: enable/disable vLLM setup
  --ollama / --no-ollama  Override config: enable/disable Ollama setup

Configuration

Edit hyperstack-vm1.toml / hyperstack-vm2.toml (or hyperstack-vm.toml for single-VM). Key sections:

| Section | Purpose |
|---|---|
| [vm] | Flavor, image, environment name |
| [vllm] | Model, container settings, and vLLM runtime options |
| [vllm.presets.*] | Named model presets for hot-switching |
| [ollama] | Ollama settings (disabled by default; set install = true to use instead) |
| [network] | Ports, WireGuard subnet, allowed CIDRs |
| [wireguard] | Auto-setup script path |

allowed_ssh_cidrs and allowed_wireguard_cidrs accept either explicit CIDRs such as ["203.0.113.4/32"] or ["auto"]. auto resolves the current public operator IP at runtime; set HYPERSTACK_OPERATOR_CIDR to override that detection when needed.
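
A manual equivalent of the auto behaviour, as a sketch (the automation's exact resolution method and the IP service used here are assumptions):

```shell
# Resolve the current public egress IP and pin it as a /32 CIDR,
# mirroring what allowed_ssh_cidrs = ["auto"] does at runtime.
ip=$(curl -fsS https://ifconfig.me)      # any what-is-my-IP service works
cidr="${ip}/32"
export HYPERSTACK_OPERATOR_CIDR="$cidr"  # override point honoured by the tooling
echo "$HYPERSTACK_OPERATOR_CIDR"
```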

SSH host keys are pinned per state file in <state>.known_hosts. delete and --replace clear that trust file for intentional reprovisioning; unexpected host key changes now fail closed.

Automated setup reference

hyperstack.rb handles the full VM lifecycle automatically. All steps below (VM creation, WireGuard tunnel, vLLM Docker container) run in a single command.

Single-VM setup

# Deploy VM, configure WireGuard tunnel, pull and start vLLM (~10 min)
ruby hyperstack.rb create

# Run end-to-end inference test over the tunnel
ruby hyperstack.rb test

# Launch Pi coding agent connected to GPT-OSS 120B on the VM
pi-hyperstack   # fish abbreviation from hyperstack.fish

# Tear down the VM and remove WireGuard peer
ruby hyperstack.rb delete

Two-VM setup

# Deploy both VMs in parallel, set up tunnel and vLLM on each (~10 min)
ruby hyperstack.rb create-both

# Test each VM individually
ruby hyperstack.rb --config hyperstack-vm1.toml test
ruby hyperstack.rb --config hyperstack-vm2.toml test

# Launch Pi coding agents — one per terminal
pi-hyperstack-nemotron   # fish abbreviation → Nemotron-3-Super 120B on VM1
pi-hyperstack-coder      # fish abbreviation → Qwen3-Coder-Next 80B on VM2

# Tear down both VMs
ruby hyperstack.rb delete-both

Hot-switching models without reprovisioning

# Switch the running vLLM container to a different model preset
ruby hyperstack.rb --config hyperstack-vm1.toml model switch qwen3-coder-next
ruby hyperstack.rb --config hyperstack-vm2.toml model switch nemotron-super

See the VM configuration and Switching models sections for available presets and config options.

Manual vLLM Docker setup

This section covers manual vLLM deployment for debugging or running outside the automation. The hyperstack.rb provisioner handles all of this automatically.

Prerequisites

  • VM with NVIDIA GPU, CUDA ≥ 12.x, driver ≥ 535, and Docker with nvidia-container-toolkit
  • WireGuard wg1 tunnel configured (see wg1-setup.sh)
  • If Ollama was previously running: sudo systemctl stop ollama && sudo systemctl disable ollama

Storage setup

Model cache on ephemeral NVMe (fast; re-downloads if lost on VM restart):

sudo mkdir -p /ephemeral/hug
sudo chmod -R 0777 /ephemeral/hug

Run the vLLM container

The model downloads on first start (~45 GB, ~2.5 min). Cold start after download: ~4–5 min.

docker pull vllm/vllm-openai:latest

docker run -d \
  --gpus all \
  --ipc=host \
  --network host \
  --name vllm_qwen3 \
  --restart always \
  -v /ephemeral/hug:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model bullpoint/Qwen3-Coder-Next-AWQ-4bit \
  --tensor-parallel-size 1 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.92 \
  --max-model-len 262144 \
  --host 0.0.0.0 \
  --port 11434

Key flags:

| Flag | Purpose |
|---|---|
| --gpus all | Expose all GPUs to the container |
| --ipc=host | Shared memory required by CUDA (avoids /dev/shm limits) |
| --network host | Host networking so port 11434 is reachable over the WireGuard tunnel |
| --restart always | Auto-restart the container on VM reboot |
| -v /ephemeral/hug:... | Model cache on fast ephemeral NVMe |
| --tensor-parallel-size 1 | Single GPU (use 2/4 for multi-GPU) |
| --enable-auto-tool-choice | Enable function/tool calling |
| --tool-call-parser qwen3_coder | Parser for Qwen3-Coder tool format |
| --enable-prefix-caching | Block-level KV cache reuse across requests |
| --gpu-memory-utilization 0.92 | Use 92% of VRAM; rest for OS/overhead |
| --max-model-len 262144 | Full 256k context window |
| --host 0.0.0.0 | Bind to all interfaces (WireGuard access requires this) |
| --port 11434 | Reuse Ollama's default port so existing firewall rules keep working |

Verify startup

# Wait for "Application startup complete"
docker logs -f vllm_qwen3 2>&1 | grep -E "startup complete|Error"

# Confirm model is loaded
curl -s http://localhost:11434/v1/models | python3 -m json.tool

# Quick inference test
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{"model":"bullpoint/Qwen3-Coder-Next-AWQ-4bit",
       "messages":[{"role":"user","content":"Hello"}],
       "max_tokens":50}'

Firewall

sudo ufw allow from 192.168.3.0/24 to any port 11434 proto tcp comment 'vLLM via wg1'

Client configuration

Use the VM's WireGuard IP (.1 for VM1, .3 for VM2):

# VM1 (hyperstack1.wg1 = 192.168.3.1)
OPENAI_BASE_URL=http://192.168.3.1:11434/v1 OPENAI_API_KEY=EMPTY pi

# VM2 (hyperstack2.wg1 = 192.168.3.3)
OPENAI_BASE_URL=http://192.168.3.3:11434/v1 OPENAI_API_KEY=EMPTY pi

Replacing the running container

To serve a different model, stop the current container and start a new one:

docker stop vllm_qwen3 && docker rm vllm_qwen3

# Example: smaller 30B model (fits easily, faster)
docker run -d \
  --gpus all --ipc=host --network host \
  --name vllm_qwen3_30b --restart always \
  -v /ephemeral/hug:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3-Coder-30B-AWQ \
  --tensor-parallel-size 1 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.92 --max-model-len 131072 \
  --host 0.0.0.0 --port 11434

Why vLLM instead of Ollama

  • FlashAttention v2: ~1.5–2× faster prefill for long prompts
  • Block-level prefix caching: partial KV cache reuse even when the prompt changes mid-sequence (Ollama requires an exact prefix match from token 0)
  • Chunked prefill: can interleave prefill and decode
  • Marlin kernels for AWQ MoE quantization

Monitoring vLLM

The watch command provides a built-in terminal dashboard that polls all active VMs every 5 seconds:

ruby hyperstack.rb watch

When two VMs are active the panels are shown side-by-side; a single VM uses a vertical layout. Press Ctrl-C to exit.

Each VM panel shows:

| Row | Source | What it means |
|---|---|---|
| GPU header | nvidia-smi | Device index, name, temperature, power draw |
| util bar | nvidia-smi | GPU compute utilisation % |
| VRAM bar | nvidia-smi | GPU memory used / total |
| throughput | vLLM engine log | Rolling-average prefill tok/s and decode tok/s |
| requests | vLLM engine log | Running / waiting / swapped request counts |
| KV cache bar | vLLM engine log | GPU KV-cache fill % |
| cache hits bar | vLLM engine log | Prefix-cache hit rate % |

Stats are collected via a single SSH call per VM over the WireGuard tunnel (hyperstack1.wg1 etc.). nvidia-smi provides hardware metrics; vLLM engine stats are read from docker logs --tail 200 filtered to the "Engine 0" line that vLLM emits every few seconds.

For lower-level ad-hoc inspection:

# Live engine stats (throughput, KV cache, prefix cache hit rate)
ssh ubuntu@<vm-ip> 'docker logs -f vllm_nemotron_super 2>&1 | grep "Engine 0"'

# GPU stats (every 5 s)
ssh ubuntu@<vm-ip> 'nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used --format=csv -l 5'

# Last-minute stats (one-shot, no follow)
ssh ubuntu@<vm-ip> 'docker logs --since 1m vllm_nemotron_super 2>&1 | grep "Engine 0"'

# Request-level monitoring
ssh ubuntu@<vm-ip> 'docker logs -f vllm_nemotron_super 2>&1 | grep "POST"'

Engine metrics key fields:

| Field | Meaning |
|---|---|
| Avg prompt throughput | Prefill speed (tokens/s) — higher is faster |
| Avg generation throughput | Decode speed (tokens/s) |
| GPU KV cache usage | % of KV cache memory in use (proportional to active context vs max capacity) |
| Prefix cache hit rate | % of prompt tokens served from cache |
| Running / Waiting | Active and queued request counts |
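
For scripting, those fields can be pulled out of a stats line with standard tools. The log line below is an illustrative shape, not verbatim vLLM output:

```shell
# Illustrative "Engine 0" stats line (shape only, not captured vLLM output)
line='Engine 0: Avg prompt throughput: 8234.1 tokens/s, Avg generation throughput: 52.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.2%, Prefix cache hit rate: 41.0%'

# Extract decode speed and KV cache usage
echo "$line" | grep -o 'Avg generation throughput: [0-9.]*'
echo "$line" | grep -o 'GPU KV cache usage: [0-9.]*%'
```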

Healthy baseline (H100 SXM 80GB, Nemotron-3-Super-120B AWQ):

| Metric | Expected |
|---|---|
| Prefill throughput | 5,000–11,000 tok/s |
| Decode throughput | 20–100 tok/s (varies with batch size) |
| KV cache usage | 2–5% for typical sessions |
| Temperature | 50–70°C under load, <50°C idle |
| Power | ~100 W idle, 300–350 W under load per GPU |

Warning signs:

  • Waiting > 0 for extended periods — requests queuing, model overloaded
  • KV cache usage near 100% — context too long, reduce --max-model-len
  • Decode throughput < 20 tok/s sustained — possible thermal throttling
  • Prefill throughput < 2,000 tok/s — check for CPU offload or driver issues

Troubleshooting

| Problem | Fix |
|---|---|
| OOM on startup with --max-model-len 262144 | Reduce to 131072 or 65536 |
| Prefix cache hit rate stays at 0% | Normal when prompts vary heavily turn-to-turn |
| vLLM container won't start (CUDA mismatch) | Check nvidia-smi; vLLM requires CUDA ≥ 12.x and driver ≥ 535 |
| Still OOM after reducing context | Lower gpu_memory_utilization to 0.85 or use a smaller model |

VRAM sizing guide

Rule of thumb for a single A100 80 GB at 92% utilization (~75 GiB usable):

| Model size (params) | AWQ 4-bit VRAM | Max context (remaining for KV) |
|---|---|---|
| 7–8B | ~5 GiB | 262k+ (plenty of KV headroom) |
| 14B | ~9 GiB | 262k+ (plenty of KV headroom) |
| 30–32B | ~18 GiB | 262k (~57 GiB for KV cache) |
| 70–80B (MoE, 3B active) | ~45 GiB | 262k (~27 GiB for KV cache) |
| 70B (dense) | ~38 GiB | 131k (~37 GiB for KV cache) |
| 120B+ | won't fit | use multi-GPU or a smaller quant |
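
The arithmetic behind the last column is simple subtraction; a sketch (the 3 GiB runtime overhead figure is an assumption, not a measured value):

```shell
# KV headroom = usable VRAM - quantized weights - runtime overhead (GiB)
usable=75     # A100 80 GB at 92% utilization, per the rule of thumb above
weights=45    # e.g. a 70-80B MoE at AWQ 4-bit
overhead=3    # assumed CUDA/runtime overhead
echo "$(( usable - weights - overhead )) GiB for KV cache"
```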

Supported quantization formats:

  • AWQ (recommended): fast Marlin kernels, good quality
  • GPTQ: similar to AWQ, widely available
  • FP8: 8-bit, needs Hopper+ GPUs (H100/H200)
  • BF16/FP16: full precision, needs more VRAM

Search HuggingFace for vLLM-compatible quantized models: https://huggingface.co/models?search=<model-name>+awq

Performance characteristics

Measured on A100 80 GB PCIe (single GPU) with Qwen3-Coder-Next AWQ 4-bit:

| Metric | vLLM (AWQ 4-bit) | Ollama (Q4_K_M) |
|---|---|---|
| Prefill throughput | 5,000–11,000 tok/s | ~1,000 tok/s (est.) |
| Decode throughput | 40–99 tok/s | ~40 tok/s |
| Per-turn latency | ~10–15 s | ~28 s (32k ctx) |
| Context window | 262k (full, no truncation) | 32k (was truncating) |
| VRAM usage | 75 GiB (more KV cache) | 52–61 GiB |

Photo enhancement (ComfyUI)

A separate VM setup (hyperstack-vm-photo.toml) runs ComfyUI on an L40 GPU for Photolemur-style automatic photo enhancement. No prompts needed — drop photos in, get enhanced photos out.

How it works

The pipeline runs Real-ESRGAN x4plus in "enhance in place" mode: upscale 4× (noise reduction, sharpening, colour correction) → scale back to the original resolution. Output is saved as JPEG at quality 92, so file sizes stay close to the originals.

Quickstart

# Provision the L40 VM (~$1/hr, ~8 min first-time setup including model download)
ruby hyperstack.rb --config hyperstack-vm-photo.toml create

# Check connectivity
ruby photo-enhance.rb --test

# Enhance all photos in a directory (outputs <name>_enhanced.jpg alongside originals)
ruby photo-enhance.rb --indir ~/Pictures/my-album

# Watch mode: process new arrivals automatically
ruby photo-enhance.rb --indir ~/Pictures/my-album --watch

# Destroy VM when done
ruby hyperstack.rb --config hyperstack-vm-photo.toml delete

Configuration (hyperstack-vm-photo.toml)

| Key | Default | Description |
|---|---|---|
| [vm].flavor_name | n3-L40x1 | Hyperstack GPU flavor (L40 48 GB, ~$1/hr) |
| [network].wireguard_server_ip | 192.168.3.4 | WireGuard IP (after VM1=.1, VM2=.3) |
| [comfyui].port | 8188 | ComfyUI REST API port (WireGuard subnet only) |
| [comfyui].models_dir | /ephemeral/comfyui/models | Model weights (ephemeral NVMe) |
| [comfyui].models | ["RealESRGAN_x4plus"] | Pre-downloaded models |

Custom workflows

The workflow JSON lives at workflows/photo-enhance.json. The NODE_INPUT_IMAGE placeholder is substituted at runtime by photo-enhance.rb with the uploaded filename. Swap in any ComfyUI-compatible workflow (e.g. add SUPIR for deeper restoration) by editing the JSON or passing --workflow path/to/other.json.
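
The substitution itself is plain string replacement; a shell sketch of the idea (photo-enhance.rb performs the equivalent in Ruby, and the filename here is hypothetical):

```shell
# Replace the NODE_INPUT_IMAGE placeholder with the uploaded filename.
# Inline JSON stands in for workflows/photo-enhance.json.
echo '{"inputs":{"image":"NODE_INPUT_IMAGE"}}' \
  | sed 's/NODE_INPUT_IMAGE/IMG_0001.jpg/'
```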

Performance (L40 48 GB)

| Operation | Time per photo |
|---|---|
| Real-ESRGAN enhance + scale back | ~50–60 s |
| Upload + download overhead | ~3 s |

About

My "local" LLM setup with Hyperstack.
