snonux/hypr

hypr

Automates Hyperstack GPU VM lifecycle: create, bootstrap, WireGuard tunnel, and vLLM inference. Runs two A100 VMs concurrently — each serving a different model — with Pi coding agents connected to each.

Architecture

  ┌─────────────┐
  │   Laptop    │
  └──────┬──────┘
         │ SSH (into bhyve VM, not the host)
         │
  FreeBSD physical host (earth)
  ┌─────────────────────────────────────────────────────────────────────────┐
  │                                                                         │
  │  FreeBSD bhyve VM  (isolation layer)          192.168.3.2 / wg1        │
  │  ┌───────────────────────────────────────────────────────────────────┐  │
  │  │                          ▲                                        │  │
  │  │                          │ SSH                                    │  │
  │  │  tmux session  (tmux attach)                                      │  │
  │  │  ┌─────────────────────────────────────────────────────────────┐  │  │
  │  │  │  window 0                                                   │  │  │
  │  │  │  ┌───────────────────────┬─────────────────────────┐        │  │  │
  │  │  │  │ pane 0: pi-nemotron   │ pane 1: pi-coder        │        │  │  │
  │  │  │  │                       │                         │        │  │  │
  │  │  │  │ Pi                    │ Pi                      │        │  │  │
  │  │  │  │ Nemotron-3-Super      │ Qwen3-Coder-Next        │        │  │  │
  │  │  │  └──────────┬────────────┘└────────────┬───────────┘        │  │  │
  │  │  │             │ OpenAI API               │ OpenAI API         │  │  │
  │  │  │             │/v1/chat/completions      │/v1/chat/completions│  │  │
  │  │  └─────────────┼──────────────────────────┼────────────────────┘  │  │
  │  │                │                          │                       │  │
  │  └────────────────┼──────────────────────────┼───────────────────────┘  │
  │                   │ WireGuard wg1            │ WireGuard wg1            │
  └───────────────────┼──────────────────────────┼──────────────────────────┘
                      │ 192.168.3.0/24           │ 192.168.3.0/24
                      │ UDP :56710               │ UDP :56710
                      ▼                          ▼
  ┌──────────────────────────┐  ┌──────────────────────────┐
  │ VM1 (A100 80GB)          │  │ VM2 (A100 80GB)          │
  │ 192.168.3.1              │  │ 192.168.3.3              │
  │ hyperstack1.wg1          │  │ hyperstack2.wg1          │
  │                          │  │                          │
  │ vLLM :11434              │  │ vLLM :11434              │
  │ Nemotron-3-Super 120B    │  │ Qwen3-Coder-Next 80B     │
  │ (Mamba+MoE, AWQ-4bit)    │  │ (MoE, AWQ-4bit)          │
  └──────────────────────────┘  └──────────────────────────┘

WireGuard topology:

  • Interface wg1 on earth carries traffic to both VMs simultaneously
  • earth is 192.168.3.2; VM1 is .1; VM2 is .3; tunnel port is 56710/udp
  • Adding VM2 to an existing wg1 tunnel: wg1-setup.sh adds a second [Peer] block without disturbing VM1
  • vLLM on each VM listens on 0.0.0.0:11434, firewalled to 192.168.3.0/24 (WireGuard subnet only)
  • Pi connects directly to each VM's vLLM over the tunnel — no proxy or load balancer

Why Pi

  • Bring-your-own model — connects to any OpenAI-compatible endpoint; no translation proxy needed between Pi and vLLM
  • Custom providers via models.json — define hyperstack, hyperstack1, and hyperstack2 providers once; fish abbreviations route to the right VM
  • Project-local config — symlink this repo's pi/ directory to ~/.pi; Pi picks up models.json, settings.json, extensions, and skills automatically
  • TypeScript extensions — custom behaviour (web search, loop scheduler, ask-mode) lives in pi/agent/extensions/ and loads from the symlink
  • Minimal core — no built-in sub-agents, plan mode, or permission popups; fast TUI with mid-session model switching via Ctrl+L

Prerequisites

  • Hyperstack account with API key in ~/.hyperstack
  • SSH key registered in Hyperstack as earth (or change ssh.hyperstack_key_name in the TOML)
  • Review [network].allowed_ssh_cidrs and [network].allowed_wireguard_cidrs in your TOML. The secure default is ["auto"], which resolves your current public egress IP to /32. Set explicit CIDRs or HYPERSTACK_OPERATOR_CIDR if you deploy from a different network.
  • WireGuard setup script: wg1-setup.sh (present in this directory)
  • Ruby with toml-rb gem: bundle install
  • Pi coding agent installed

WireGuard setup

hyperstack.rb runs wg1-setup.sh automatically during create / create-both. This section explains the tunnel design for reference and manual troubleshooting.

Tunnel design

earth (192.168.3.2)
  /etc/wireguard/wg1.conf
  [Interface]  Address = 192.168.3.2/24
  [Peer]  # VM1 — AllowedIPs = 192.168.3.1/32, Endpoint = <vm1-public-ip>:56710
  [Peer]  # VM2 — AllowedIPs = 192.168.3.3/32, Endpoint = <vm2-public-ip>:56710

A single wg1 interface on earth carries traffic to both VMs. Each VM is a separate [Peer] block. Adding VM2 to an existing tunnel with VM1 already running leaves VM1's peer untouched.
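
Expanded into a full config, that looks roughly like the following (a sketch: keys are elided, and exact options such as keepalives or the listen port depend on what wg1-setup.sh actually writes):

```ini
# /etc/wireguard/wg1.conf on earth (illustrative sketch, key material elided)
[Interface]
Address = 192.168.3.2/24
ListenPort = 56710
PrivateKey = <earth-private-key>

# VM1
[Peer]
PublicKey = <vm1-public-key>
AllowedIPs = 192.168.3.1/32
Endpoint = <vm1-public-ip>:56710

# VM2 (appended by the second wg1-setup.sh run; VM1's block is untouched)
[Peer]
PublicKey = <vm2-public-key>
AllowedIPs = 192.168.3.3/32
Endpoint = <vm2-public-ip>:56710
```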

Manual setup

# VM1 (first VM — generates fresh keys, writes /etc/wireguard/wg1.conf from scratch)
./wg1-setup.sh <vm1-public-ip>

# VM2 (additional VM — adds a [Peer] block to the existing wg1.conf)
./wg1-setup.sh <vm2-public-ip> 192.168.3.3 hyperstack2.wg1

Verify the tunnel

# Show active peers and handshake times (both VMs should appear)
sudo wg show wg1

# Ping each VM through the tunnel
ping -c 3 192.168.3.1   # VM1
ping -c 3 192.168.3.3   # VM2

# Check vLLM is reachable over the tunnel
curl http://hyperstack1.wg1:11434/v1/models
curl http://hyperstack2.wg1:11434/v1/models

Restart / recover

# Restart tunnel locally (e.g. after network change)
sudo systemctl restart wg-quick@wg1

# Restart tunnel on VM after a reboot (ssh via public IP since WireGuard is down)
ssh ubuntu@<vm-public-ip> 'sudo systemctl start wg-quick@wg1'

# Re-run setup when VM IP changes (e.g. after delete + recreate)
./wg1-setup.sh <new-vm1-public-ip>
./wg1-setup.sh <new-vm2-public-ip> 192.168.3.3 hyperstack2.wg1

Quickstart (two-VM setup)

# Deploy both VMs in parallel, set up WireGuard + vLLM (~10 min)
ruby hyperstack.rb create-both

# Verify both VMs are working
ruby hyperstack.rb --config hyperstack-vm1.toml test
ruby hyperstack.rb --config hyperstack-vm2.toml test

# Launch Pi coding agents — one per terminal (fish abbreviations from hyperstack.fish)
pi-hyperstack-nemotron   # Nemotron-3-Super 120B on VM1
pi-hyperstack-coder      # Qwen3-Coder-Next on VM2

# Tear down both VMs
ruby hyperstack.rb delete-both

Using Pi

Pi is the coding agent frontend used with this setup. Each Hyperstack VM runs a vLLM instance; Pi connects to it directly over the WireGuard tunnel.

Installation

Install Pi from pi.dev, then link the project-local config into place:

ln -s /path/to/hyperstack/pi ~/.pi

This symlink makes Pi pick up pi/agent/models.json and pi/agent/settings.json from this repo as its agent configuration, so the Hyperstack providers and model definitions are available without any manual config editing.
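
As an illustration of what such a provider entry contains (a sketch only: the field names below are assumptions, not Pi's actual models.json schema; see pi/agent/models.json in this repo for the real shape):

```json
{
  "providers": {
    "hyperstack1": {
      "baseUrl": "http://hyperstack1.wg1:11434/v1",
      "apiKey": "EMPTY",
      "models": ["cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit"]
    }
  }
}
```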

Fish shell abbreviations

Source hyperstack.fish or copy the abbreviations into your Fish config:

abbr pi-hyperstack         pi --model hyperstack/openai/gpt-oss-120b
abbr pi-hyperstack-nemotron pi --model hyperstack1/cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit
abbr pi-hyperstack-coder    pi --model hyperstack2/bullpoint/Qwen3-Coder-Next-AWQ-4bit

Then launch a session after the VM(s) are up:

pi-hyperstack            # single-VM → GPT-OSS 120B on hyperstack.wg1
pi-hyperstack-nemotron   # two-VM → Nemotron-3-Super 120B on VM1
pi-hyperstack-coder      # two-VM → Qwen3-Coder-Next 80B on VM2

Model configuration (pi/agent/models.json)

Three providers are defined, one per setup, each pointing at its vLLM endpoint over WireGuard:

| Provider | Base URL | Primary model |
|---|---|---|
| hyperstack | http://hyperstack.wg1:11434/v1 | GPT-OSS 120B (single-VM) |
| hyperstack1 | http://hyperstack1.wg1:11434/v1 | Nemotron-3-Super 120B |
| hyperstack2 | http://hyperstack2.wg1:11434/v1 | Qwen3-Coder-Next 80B |

All model presets from the TOML configs are registered under each provider, so any model can be run on any VM after a model switch (see Switching models).

Settings (pi/agent/settings.json)

{
  "defaultProvider": "openai",
  "defaultModel": "gpt-4.1"
}

The default provider/model is OpenAI so that bare pi uses OpenAI rather than a Hyperstack VM. Use the fish abbreviations above to route to a specific VM.

Hot-switching models within Pi

After loading a different model on a VM with model switch (see Switching models), tell Pi to use it without restarting the session:

model switch hyperstack1/openai/gpt-oss-120b

Pi sends subsequent requests to the new model ID immediately; the provider base URL stays the same.
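
To confirm what a VM is actually serving, independently of Pi, you can query its /v1/models endpoint over the tunnel and extract the model IDs. The response body below is an illustrative shape, not captured output:

```shell
# In practice the response comes from:
#   curl -s http://hyperstack1.wg1:11434/v1/models
resp='{"object":"list","data":[{"id":"openai/gpt-oss-120b","object":"model"}]}'

# Pull out the served model IDs
echo "$resp" | grep -o '"id":"[^"]*"'
```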

Extensions

Custom extensions live in pi/agent/extensions/ and are loaded automatically via the ~/.pi symlink.

| Extension | Purpose |
|---|---|
| web-search | web_search and web_fetch tools — DuckDuckGo search + page fetching, no API key |
| ask-mode | /ask command — restricts the model to read-only exploration tools |
| loop-scheduler | /loop and /watch commands — recurring prompts plus condition-driven prompts |
| inline-bash | !{cmd} syntax — expands shell output inline before sending to the model |
| session-name | Auto-names sessions from the first message |
| modal-editor | Opens an external editor ($VISUAL) for composing long prompts |
| handoff | Compacts and hands off context to a fresh session |
| fresh-subagent | Spawns a sub-agent in a clean context for isolated tasks |
| reload-runtime | /reload-runtime command — hot-reloads extensions without restarting Pi |
| nemotron-tool-repair | Repairs malformed tool calls from Nemotron models |
| agent-plan-mode | Integrates task management into Pi sessions |

Web search

The web-search extension registers two LLM-callable tools:

  • web_search — searches DuckDuckGo and returns up to 8 results (title, URL, snippet)
  • web_fetch — fetches a URL and returns up to 12,000 characters of readable text

Example prompts:

Search for the vLLM 0.9.0 changelog
Find the Qwen3-Coder model card and summarize the recommended vLLM flags

No API key or account required. Uses DuckDuckGo's free HTML endpoint.

Single-VM setup

A single VM can be deployed with the default config (GPT-OSS 120B):

ruby hyperstack.rb create                # uses hyperstack-vm.toml
ruby hyperstack.rb test
pi-hyperstack                            # fish abbreviation → hyperstack/openai/gpt-oss-120b
ruby hyperstack.rb delete

VM configuration

| Config file | Default model | WireGuard IP | Hostname |
|---|---|---|---|
| hyperstack-vm1.toml | Nemotron-3-Super 120B (AWQ-4bit) | 192.168.3.1 | hyperstack1.wg1 |
| hyperstack-vm2.toml | Qwen3-Coder-Next 80B (AWQ-4bit) | 192.168.3.3 | hyperstack2.wg1 |
| hyperstack-vm.toml | GPT-OSS 120B (single-VM mode) | 192.168.3.1 | hyperstack.wg1 |

Each VM has independent state files so they can be managed separately:

ruby hyperstack.rb --config hyperstack-vm1.toml status
ruby hyperstack.rb --config hyperstack-vm2.toml status

Switching models

Each VM has named model presets in its TOML config. Hot-switch without reprovisioning:

ruby hyperstack.rb --config hyperstack-vm1.toml model switch qwen3-coder-next
ruby hyperstack.rb --config hyperstack-vm2.toml model switch nemotron-super

Available presets (both VMs share the same set):

| Preset | Model | VRAM | Context |
|---|---|---|---|
| nemotron-super | Nemotron-3-Super 120B (Mamba+MoE, 12B active) | ~60 GB | 131K |
| qwen3-coder-next | Qwen3-Coder-Next 80B (MoE, AWQ-4bit) | ~45 GB | 262K |
| gpt-oss-120b | GPT-OSS 120B (MoE, MXFP4) | ~65 GB | 131K |
| gpt-oss-20b | GPT-OSS 20B (MoE, MXFP4) | ~14 GB | 65K |
| qwen25-coder-32b | Qwen2.5-Coder-32B-Instruct (AWQ) | ~18 GB | 32K |
| qwen3-coder-30b | Qwen3-Coder-30B-A3B (MoE, AWQ) | ~18 GB | 65K |
| deepseek-r1-32b | DeepSeek-R1-Distill-Qwen-32B (AWQ) | ~18 GB | 32K |
| qwen3-32b | Qwen3-32B (AWQ) | ~18 GB | 32K |
| devstral | Devstral-Small-2507 (AWQ-4bit) | ~15 GB | 32K |

CLI reference

ruby hyperstack.rb [--config path] <command> [options]

Commands:
  create       Deploy a new VM and run full provisioning
  create-both  Deploy VM1 + VM2 in parallel (uses hyperstack-vm1/vm2.toml)
  delete       Destroy the tracked VM
  delete-both  Destroy both VM1 and VM2
  status       Show VM and WireGuard status
  watch        Live dashboard: vLLM + GPU stats for all active VMs (refreshes every 5 s)
  test         Run end-to-end inference tests (vLLM)
  model switch <preset>  Hot-switch the running vLLM model

create / create-both options:
  --replace          Delete existing tracked VM before creating
  --dry-run          Print the plan without making changes
  --vllm / --no-vllm      Override config: enable/disable vLLM setup
  --ollama / --no-ollama  Override config: enable/disable Ollama setup

Configuration

Edit hyperstack-vm1.toml / hyperstack-vm2.toml (or hyperstack-vm.toml for single-VM). Key sections:

| Section | Purpose |
|---|---|
| [vm] | Flavor, image, environment name |
| [vllm] | Model, container settings, and vLLM runtime options |
| [vllm.presets.*] | Named model presets for hot-switching |
| [ollama] | Ollama settings (disabled by default; set install = true to use instead) |
| [network] | Ports, WireGuard subnet, allowed CIDRs |
| [wireguard] | Auto-setup script path |

allowed_ssh_cidrs and allowed_wireguard_cidrs accept either explicit CIDRs such as ["203.0.113.4/32"] or ["auto"]. auto resolves the current public operator IP at runtime; set HYPERSTACK_OPERATOR_CIDR to override that detection when needed.
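
A manual equivalent of the auto behaviour, as a sketch (the automation's exact resolution method and the IP service used here are assumptions):

```shell
# Resolve the current public egress IP and pin it as a /32 CIDR,
# mirroring what allowed_ssh_cidrs = ["auto"] does at runtime.
ip=$(curl -fsS https://ifconfig.me)      # any what-is-my-IP service works
cidr="${ip}/32"
export HYPERSTACK_OPERATOR_CIDR="$cidr"  # override point honoured by the tooling
echo "$HYPERSTACK_OPERATOR_CIDR"
```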

SSH host keys are pinned per state file in <state>.known_hosts. delete and --replace clear that trust file for intentional reprovisioning; unexpected host key changes now fail closed.

Automated setup reference

hyperstack.rb handles the full VM lifecycle automatically. All steps below (VM creation, WireGuard tunnel, vLLM Docker container) run in a single command.

Single-VM setup

# Deploy VM, configure WireGuard tunnel, pull and start vLLM (~10 min)
ruby hyperstack.rb create

# Run end-to-end inference test over the tunnel
ruby hyperstack.rb test

# Launch Pi coding agent connected to GPT-OSS 120B on the VM
pi-hyperstack   # fish abbreviation from hyperstack.fish

# Tear down the VM and remove WireGuard peer
ruby hyperstack.rb delete

Two-VM setup

# Deploy both VMs in parallel, set up tunnel and vLLM on each (~10 min)
ruby hyperstack.rb create-both

# Test each VM individually
ruby hyperstack.rb --config hyperstack-vm1.toml test
ruby hyperstack.rb --config hyperstack-vm2.toml test

# Launch Pi coding agents — one per terminal
pi-hyperstack-nemotron   # fish abbreviation → Nemotron-3-Super 120B on VM1
pi-hyperstack-coder      # fish abbreviation → Qwen3-Coder-Next 80B on VM2

# Tear down both VMs
ruby hyperstack.rb delete-both

Hot-switching models without reprovisioning

# Switch the running vLLM container to a different model preset
ruby hyperstack.rb --config hyperstack-vm1.toml model switch qwen3-coder-next
ruby hyperstack.rb --config hyperstack-vm2.toml model switch nemotron-super

See the VM configuration and Switching models sections for available presets and config options.

Manual vLLM Docker setup

This section covers manual vLLM deployment for debugging or running outside the automation. The hyperstack.rb provisioner handles all of this automatically.

Prerequisites

  • VM with NVIDIA GPU, CUDA ≥ 12.x, driver ≥ 535, and Docker with nvidia-container-toolkit
  • WireGuard wg1 tunnel configured (see wg1-setup.sh)
  • If Ollama was previously running: sudo systemctl stop ollama && sudo systemctl disable ollama

Storage setup

Model cache on ephemeral NVMe (fast; re-downloads if lost on VM restart):

sudo mkdir -p /ephemeral/hug
sudo chmod -R 0777 /ephemeral/hug

Run the vLLM container

The model downloads on first start (~45 GB, ~2.5 min). Cold start after download: ~4–5 min.

docker pull vllm/vllm-openai:latest

docker run -d \
  --gpus all \
  --ipc=host \
  --network host \
  --name vllm_qwen3 \
  --restart always \
  -v /ephemeral/hug:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model bullpoint/Qwen3-Coder-Next-AWQ-4bit \
  --tensor-parallel-size 1 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.92 \
  --max-model-len 262144 \
  --host 0.0.0.0 \
  --port 11434

Key flags:

| Flag | Purpose |
|---|---|
| --gpus all | Expose all GPUs to the container |
| --ipc=host | Shared memory required by CUDA (avoids /dev/shm limits) |
| --network host | Host networking so port 11434 is reachable over the WireGuard tunnel |
| --restart always | Auto-restart the container on VM reboot |
| -v /ephemeral/hug:... | Model cache on fast ephemeral NVMe |
| --tensor-parallel-size 1 | Single GPU (use 2/4 for multi-GPU) |
| --enable-auto-tool-choice | Enable function/tool calling |
| --tool-call-parser qwen3_coder | Parser for Qwen3-Coder tool format |
| --enable-prefix-caching | Block-level KV cache reuse across requests |
| --gpu-memory-utilization 0.92 | Use 92% of VRAM; rest for OS/overhead |
| --max-model-len 262144 | Full 256k context window |
| --host 0.0.0.0 | Bind to all interfaces (WireGuard access requires this) |
| --port 11434 | Reuse Ollama's default port so existing firewall rules keep working |

Verify startup

# Wait for "Application startup complete"
docker logs -f vllm_qwen3 2>&1 | grep -E "startup complete|Error"

# Confirm model is loaded
curl -s http://localhost:11434/v1/models | python3 -m json.tool

# Quick inference test
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{"model":"bullpoint/Qwen3-Coder-Next-AWQ-4bit",
       "messages":[{"role":"user","content":"Hello"}],
       "max_tokens":50}'

Firewall

sudo ufw allow from 192.168.3.0/24 to any port 11434 proto tcp comment 'vLLM via wg1'

Client configuration

Use the VM's WireGuard IP (.1 for VM1, .3 for VM2):

# VM1 (hyperstack1.wg1 = 192.168.3.1)
OPENAI_BASE_URL=http://192.168.3.1:11434/v1 OPENAI_API_KEY=EMPTY pi

# VM2 (hyperstack2.wg1 = 192.168.3.3)
OPENAI_BASE_URL=http://192.168.3.3:11434/v1 OPENAI_API_KEY=EMPTY pi

Replacing the running container

To serve a different model, stop the current container and start a new one:

docker stop vllm_qwen3 && docker rm vllm_qwen3

# Example: smaller 30B model (fits easily, faster)
docker run -d \
  --gpus all --ipc=host --network host \
  --name vllm_qwen3_30b --restart always \
  -v /ephemeral/hug:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3-Coder-30B-AWQ \
  --tensor-parallel-size 1 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.92 --max-model-len 131072 \
  --host 0.0.0.0 --port 11434

Why vLLM instead of Ollama

  • FlashAttention v2: ~1.5–2× faster prefill for long prompts
  • Block-level prefix caching: partial KV cache reuse even when the prompt changes mid-sequence (Ollama requires an exact prefix match from token 0)
  • Chunked prefill: can interleave prefill and decode
  • Marlin kernels for AWQ MoE quantization

Monitoring vLLM

The watch command provides a built-in terminal dashboard that polls all active VMs every 5 seconds:

ruby hyperstack.rb watch

When two VMs are active the panels are shown side-by-side; a single VM uses a vertical layout. Press Ctrl-C to exit.

Each VM panel shows:

| Row | Source | What it means |
|---|---|---|
| GPU header | nvidia-smi | Device index, name, temperature, power draw |
| util bar | nvidia-smi | GPU compute utilisation % |
| VRAM bar | nvidia-smi | GPU memory used / total |
| throughput | vLLM engine log | Rolling-average prefill tok/s and decode tok/s |
| requests | vLLM engine log | Running / waiting / swapped request counts |
| KV cache bar | vLLM engine log | GPU KV-cache fill % |
| cache hits bar | vLLM engine log | Prefix-cache hit rate % |

Stats are collected via a single SSH call per VM over the WireGuard tunnel (hyperstack1.wg1 etc.). nvidia-smi provides hardware metrics; vLLM engine stats are read from docker logs --tail 200 filtered to the "Engine 0" line that vLLM emits every few seconds.

For lower-level ad-hoc inspection:

# Live engine stats (throughput, KV cache, prefix cache hit rate)
ssh ubuntu@<vm-ip> 'docker logs -f vllm_nemotron_super 2>&1 | grep "Engine 0"'

# GPU stats (every 5 s)
ssh ubuntu@<vm-ip> 'nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used --format=csv -l 5'

# Last-minute stats (one-shot, no follow)
ssh ubuntu@<vm-ip> 'docker logs --since 1m vllm_nemotron_super 2>&1 | grep "Engine 0"'

# Request-level monitoring
ssh ubuntu@<vm-ip> 'docker logs -f vllm_nemotron_super 2>&1 | grep "POST"'

Engine metrics key fields:

| Field | Meaning |
|---|---|
| Avg prompt throughput | Prefill speed (tokens/s) — higher is faster |
| Avg generation throughput | Decode speed (tokens/s) |
| GPU KV cache usage | % of KV cache memory in use (proportional to active context vs max capacity) |
| Prefix cache hit rate | % of prompt tokens served from cache |
| Running / Waiting | Active and queued request counts |
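
For scripting, those fields can be pulled out of a stats line with standard tools. The log line below is an illustrative shape, not verbatim vLLM output:

```shell
# Illustrative "Engine 0" stats line (shape only, not captured vLLM output)
line='Engine 0: Avg prompt throughput: 8234.1 tokens/s, Avg generation throughput: 52.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.2%, Prefix cache hit rate: 41.0%'

# Extract decode speed and KV cache usage
echo "$line" | grep -o 'Avg generation throughput: [0-9.]*'
echo "$line" | grep -o 'GPU KV cache usage: [0-9.]*%'
```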

Healthy baseline (H100 SXM 80GB, Nemotron-3-Super-120B AWQ):

| Metric | Expected |
|---|---|
| Prefill throughput | 5,000–11,000 tok/s |
| Decode throughput | 20–100 tok/s (varies with batch size) |
| KV cache usage | 2–5% for typical sessions |
| Temperature | 50–70°C under load, <50°C idle |
| Power | ~100 W idle, 300–350 W under load per GPU |

Warning signs:

  • Waiting > 0 for extended periods — requests queuing, model overloaded
  • KV cache usage near 100% — context too long, reduce --max-model-len
  • Decode throughput < 20 tok/s sustained — possible thermal throttling
  • Prefill throughput < 2,000 tok/s — check for CPU offload or driver issues

Troubleshooting

| Problem | Fix |
|---|---|
| OOM on startup with --max-model-len 262144 | Reduce to 131072 or 65536 |
| Prefix cache hit rate stays at 0% | Normal when prompts vary heavily turn-to-turn |
| vLLM container won't start (CUDA mismatch) | Check nvidia-smi; vLLM requires CUDA ≥ 12.x and driver ≥ 535 |
| Still OOM after reducing context | Lower gpu_memory_utilization to 0.85 or use a smaller model |

VRAM sizing guide

Rule of thumb for a single A100 80 GB at 92% utilization (~75 GiB usable):

| Model size (params) | AWQ 4-bit VRAM | Max context (remaining for KV) |
|---|---|---|
| 7–8B | ~5 GiB | 262k+ (plenty of KV headroom) |
| 14B | ~9 GiB | 262k+ (plenty of KV headroom) |
| 30–32B | ~18 GiB | 262k (~57 GiB for KV cache) |
| 70–80B (MoE, 3B active) | ~45 GiB | 262k (~27 GiB for KV cache) |
| 70B (dense) | ~38 GiB | 131k (~37 GiB for KV cache) |
| 120B+ | won't fit | use multi-GPU or a smaller quant |
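
The arithmetic behind the last column is simple subtraction; a sketch (the 3 GiB runtime overhead figure is an assumption, not a measured value):

```shell
# KV headroom = usable VRAM - quantized weights - runtime overhead (GiB)
usable=75     # A100 80 GB at 92% utilization, per the rule of thumb above
weights=45    # e.g. a 70-80B MoE at AWQ 4-bit
overhead=3    # assumed CUDA/runtime overhead
echo "$(( usable - weights - overhead )) GiB for KV cache"
```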

Supported quantization formats:

  • AWQ (recommended): fast Marlin kernels, good quality
  • GPTQ: similar to AWQ, widely available
  • FP8: 8-bit, needs Hopper+ GPUs (H100/H200)
  • BF16/FP16: full precision, needs more VRAM

Search HuggingFace for vLLM-compatible quantized models: https://huggingface.co/models?search=<model-name>+awq

Performance characteristics

Measured on A100 80 GB PCIe (single GPU) with Qwen3-Coder-Next AWQ 4-bit:

| Metric | vLLM (AWQ 4-bit) | Ollama (Q4_K_M) |
|---|---|---|
| Prefill throughput | 5,000–11,000 tok/s | ~1,000 tok/s (est.) |
| Decode throughput | 40–99 tok/s | ~40 tok/s |
| Per-turn latency | ~10–15 s | ~28 s (32k ctx) |
| Context window | 262k (full, no truncation) | 32k (was truncating) |
| VRAM usage | 75 GiB (more KV cache) | 52–61 GiB |

Photo enhancement (ComfyUI)

A separate VM setup (hyperstack-vm-photo.toml) runs ComfyUI on an L40 GPU for Photolemur-style automatic photo enhancement. No prompts needed — drop photos in, get enhanced photos out.

How it works

The pipeline runs Real-ESRGAN x4plus in "enhance in place" mode: upscale 4× (noise reduction, sharpening, colour correction) → scale back to the original resolution. Output is saved as JPEG at quality 92, so file sizes stay close to the originals.

Quickstart

# Provision the L40 VM (~$1/hr, ~8 min first-time setup including model download)
ruby hyperstack.rb --config hyperstack-vm-photo.toml create

# Check connectivity
ruby photo-enhance.rb --test

# Enhance all photos in a directory (outputs <name>_enhanced.jpg alongside originals)
ruby photo-enhance.rb --indir ~/Pictures/my-album

# Watch mode: process new arrivals automatically
ruby photo-enhance.rb --indir ~/Pictures/my-album --watch

# Destroy VM when done
ruby hyperstack.rb --config hyperstack-vm-photo.toml delete

Configuration (hyperstack-vm-photo.toml)

| Key | Default | Description |
|---|---|---|
| [vm].flavor_name | n3-L40x1 | Hyperstack GPU flavor (L40 48 GB, ~$1/hr) |
| [network].wireguard_server_ip | 192.168.3.4 | WireGuard IP (after VM1=.1, VM2=.3) |
| [comfyui].port | 8188 | ComfyUI REST API port (WireGuard subnet only) |
| [comfyui].models_dir | /ephemeral/comfyui/models | Model weights (ephemeral NVMe) |
| [comfyui].models | ["RealESRGAN_x4plus"] | Pre-downloaded models |

Custom workflows

The workflow JSON lives at workflows/photo-enhance.json. The NODE_INPUT_IMAGE placeholder is substituted at runtime by photo-enhance.rb with the uploaded filename. Swap in any ComfyUI-compatible workflow (e.g. add SUPIR for deeper restoration) by editing the JSON or passing --workflow path/to/other.json.
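
The substitution itself is plain string replacement; a shell sketch of the idea (photo-enhance.rb performs the equivalent in Ruby, and the filename here is hypothetical):

```shell
# Replace the NODE_INPUT_IMAGE placeholder with the uploaded filename.
# Inline JSON stands in for workflows/photo-enhance.json.
echo '{"inputs":{"image":"NODE_INPUT_IMAGE"}}' \
  | sed 's/NODE_INPUT_IMAGE/IMG_0001.jpg/'
```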

Performance (L40 48 GB)

| Operation | Time per photo |
|---|---|
| Real-ESRGAN enhance + scale back | ~50–60 s |
| Upload + download overhead | ~3 s |

About

My "local" LLM setup with Hyperstack.
