RTX 6000 Pro Wiki — Running Large LLMs on PCIe GPUs

Community-sourced knowledge base for running large language models (Qwen3.5-397B, MiniMax M2.5, MiMo-V2.5-Pro, Kimi-K2.5, Kimi-K2.6, GLM-5) on NVIDIA RTX PRO 6000 (Blackwell, SM120) GPUs in 2×, 4×, and 8× PCIe configurations without NVLink.

Synthesized from ~5,000 Discord messages, 300+ screenshots, and months of community experimentation.

Quick Links

Models

Model	Params	Active	Min GPUs	Best Decode / Notes	Page
Qwen3.5-397B	397B MoE	17B	4×	350 tok/s (8×, SGLang)	→
Qwen3.5-27B/122B	27B–122B	—	1×	—	→
MiniMax M2.5	229B MoE	—	2×	85–89 tok/s (NVFP4)	→
MiMo-V2.5-Pro	MoE	—	8×	TP8 NVFP4/MXFP8 + MTP/EAGLE	→
Kimi-K2.5	530B MoE	—	8×	101 tok/s (PCIe switch)	→
Kimi-K2.6	MoE	—	8×	Community image + MLA Eagle	→
Kimi-K2.6 v6	MoE	—	8×	LightSeek Eagle3.1 MLA + vLLM V2	→
Kimi-K2.6 v5	MoE	—	8×	CUDA 13.2 vLLM V2 + p/q MTP	→
DeepSeek-V4-Pro TP16 Lucifer	MoE	—	16×	Lucifer FP8 KV + MTP TP16 overlay	→
GLM-5	744B MoE	40B	8×	105 tok/s (MTP)	→
GLM-5.1	MoE	—	8×	vLLM b12x NSA/MTP port	→

Hardware & Topology

PCIe Topology — Switches, Turin vs Genoa, NUMA
PCIe Bandwidth — P2P measurements, BAR1, latency
GPU Configurations — 4×/8× builds, VRAM, power, rigs
ASUS ESC8000A-E13P + Broadcom Switches — Topology, ACS disable, P2P proof, benchmarks
ASRock WRX90 + 3× c-payne Switches (hierarchy) — Root switch, uniform BW, no collapse bug
ASRock WRX90 + 2× c-payne Switches (flat) — Flat topology, CPU-routed cross-switch, comparison
ASRock WRX90 + 4× c-payne Switches (16 GPU) — 16 GPUs across 4 switches, three cabling variants (2/3/4-root) compared
ASRockRack + EPYC Turin 9575F + 4× c-payne (16 GPU) — Same 16-GPU layout on Turin EPYC, no collapse, 204 GB/s aggregate WRITE

Inference Engines

vLLM — Config, MTP, model-specific commands
SGLang — Config, DCP, MOE backends
FlashInfer — CUTLASS, SM120, bug fixes

Optimization

NCCL Tuning — Env vars, P2P levels, graph XML fix, tuner plugin
PCIe Oneshot AllReduce — 5–11% faster decode, setup guide, benchmarks
NVFP4 Quantization — Setup, calibration, models
Speculative Decoding — MTP configs, EAGLE
Docker Images — Images, compose, custom builds
I/O Tuning (md RAID5) — stripe_cache, group_idle, Docker overlay2
GLM-5.1 vLLM b12x NSA/MTP Port — fast prefill, PCIe barriers, KV cache limits, and upstream delta vs b12x/SGLang

Community

Daily Summaries — Auto-generated daily digests of Discord activity

Results & Troubleshooting

Benchmark Results — Consolidated tables across all models
KLD Evaluation — Quantization quality (KL divergence vs FP8 reference)
Common Issues — Errors + fixes

Key Findings

MTP=2 is the sweet spot — +51–72% throughput across all models; MTP>3 is unstable
NCCL graph XML fix is still the public Turin recipe — the current upstream draft fix, NVIDIA/nccl#2127, aims to remove the pathological ring regression seen without the XML
PCIe switches dramatically help single-batch latency — 101 vs 60 tok/s for Kimi-K2.5
BF16 KV cache is mandatory on SM120 for GLM-5 — FP8 produces garbled output
SGLang is the only option for GLM-5 — vLLM lacks an SM120-compatible MLA + sparse attention backend
NVFP4 is native to SM120 — 2× decode speedup over FP8 for supported models
DCP is essential for Kimi-K2.5 long context — without it, decode at 200K context drops to <10 tok/s
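
To put the single-batch figures in per-token terms, the sketch below converts decode throughput into latency and a speedup ratio. It uses only the 101 and 60 tok/s Kimi-K2.5 numbers quoted in the list, and it assumes the lower figure is the run without a PCIe switch. The function names are illustrative and not part of any inference engine.

```python
def ms_per_token(tokens_per_second: float) -> float:
    """Convert single-stream decode throughput into per-token latency in ms."""
    if tokens_per_second <= 0:
        raise ValueError("throughput must be positive")
    return 1000.0 / tokens_per_second


def speedup(fast_tps: float, slow_tps: float) -> float:
    """Ratio of two decode throughputs (a value above 1 means fast_tps wins)."""
    if slow_tps <= 0:
        raise ValueError("throughput must be positive")
    return fast_tps / slow_tps


# Kimi-K2.5 single-batch decode figures quoted in Key Findings.
# Assumption: 101 tok/s is with a PCIe switch, 60 tok/s is without one.
with_switch_tps = 101.0
without_switch_tps = 60.0

print(f"with switch:    {ms_per_token(with_switch_tps):.1f} ms/token")
print(f"without switch: {ms_per_token(without_switch_tps):.1f} ms/token")
print(f"speedup:        {speedup(with_switch_tps, without_switch_tps):.2f}x")
```

Under that assumption the switch saves roughly 7 ms per generated token, which is the quantity a single interactive user actually feels.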

Hardware Overview

All results are on NVIDIA RTX PRO 6000 (Blackwell GB202, SM120):

96 GB GDDR7 per GPU (768 GB total for 8×)
PCIe 5.0 x16 (~64 GB/s per direction)
No NVLink — all inter-GPU communication via PCIe
Typical configs: AMD EPYC Turin/Genoa, 4× or 8× GPUs
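
The VRAM total and the PCIe figure above follow from simple arithmetic, reproduced in the sketch below. It assumes the standard PCIe 5.0 signalling rate of 32 GT/s per lane with 128b/130b line encoding. Packet and flow-control overhead is ignored, so measured P2P bandwidth will come in lower than this ceiling.

```python
GPU_VRAM_GB = 96            # per-GPU memory, from the list above
PCIE5_GT_PER_S = 32.0       # assumed PCIe 5.0 raw rate per lane (GT/s)
LANES = 16                  # x16 link
ENCODING = 128 / 130        # 128b/130b line-encoding efficiency


def total_vram_gb(num_gpus: int) -> int:
    """Aggregate VRAM across identical GPUs."""
    return num_gpus * GPU_VRAM_GB


def pcie_gb_per_s(gt_per_s: float = PCIE5_GT_PER_S, lanes: int = LANES) -> float:
    """Theoretical one-direction link bandwidth in GB/s.

    Each transfer carries one bit per lane, so divide by 8 for bytes.
    """
    return gt_per_s * lanes * ENCODING / 8


for n in (2, 4, 8):
    print(f"{n}x GPUs: {total_vram_gb(n)} GB VRAM")

raw = PCIE5_GT_PER_S * LANES / 8
print(f"PCIe 5.0 x16 raw:     {raw:.1f} GB/s per direction")
print(f"PCIe 5.0 x16 encoded: {pcie_gb_per_s():.1f} GB/s per direction")
```

The raw rate is the ~64 GB/s quoted above; after line encoding the ceiling is about 63 GB/s per direction.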

Contributing

This wiki is synthesized from Discord discussions. If you have corrections, additional benchmarks, or new configurations, please open an issue or PR.

Generated March 2026. Data sourced from community Discord server.
