local-inference-lab/rtx6kpro

RTX 6000 Pro Wiki — Running Large LLMs on PCIe GPUs

Community-sourced knowledge base for running large language models (Qwen3.5-397B, MiniMax M2.5, MiMo-V2.5-Pro, Kimi-K2.5, Kimi-K2.6, GLM-5) on NVIDIA RTX 6000 Pro (Blackwell, SM120) GPUs in 2×, 4×, and 8× PCIe configurations without NVLink.

Synthesized from ~5,000 Discord messages, 300+ screenshots, and months of community experimentation.

Quick Links

Models

| Model | Params | Active | Min GPUs | Best Decode | Page |
|---|---|---|---|---|---|
| Qwen3.5-397B | 397B MoE | 17B | | 350 tok/s (8×, SGLang) | |
| Qwen3.5-27B/122B | 27B–122B | | | | |
| MiniMax M2.5 | 229B MoE | | | 85-89 tok/s (NVFP4) | |
| MiMo-V2.5-Pro | MoE | | | TP8 NVFP4/MXFP8 + MTP/EAGLE | |
| Kimi-K2.5 | 530B MoE | | | 101 tok/s (PCIe switch) | |
| Kimi-K2.6 | MoE | | | Community image + MLA Eagle | |
| Kimi-K2.6 v6 | MoE | | | LightSeek Eagle3.1 MLA + vLLM V2 | |
| Kimi-K2.6 v5 | MoE | | | CUDA 13.2 vLLM V2 + p/q MTP | |
| DeepSeek-V4-Pro TP16 Lucifer | MoE | | 16× | Lucifer FP8 KV + MTP TP16 overlay | |
| GLM-5 | 744B MoE | 40B | | 105 tok/s (MTP) | |
| GLM-5.1 | MoE | | | vLLM b12x NSA/MTP port | |

Hardware & Topology

Inference Engines

  • vLLM — Config, MTP, model-specific commands
  • SGLang — Config, DCP, MOE backends
  • FlashInfer — CUTLASS, SM120, bug fixes

Optimization

Community

Results & Troubleshooting

Key Findings

  1. MTP=2 is the sweet spot: +51-72% throughput across all models; MTP>3 is unstable.
  2. The NCCL graph XML fix is still the public recipe for Turin systems. The current upstream draft fix is NVIDIA/nccl#2127, which aims to remove the pathological ring regression that occurs without the XML.
  3. PCIe switches substantially improve single-batch decode speed: 101 vs 60 tok/s for Kimi K2.5.
  4. A BF16 KV cache is mandatory on SM120 for GLM-5: an FP8 KV cache produces garbled output.
  5. SGLang is the only option for GLM-5: vLLM lacks an SM120-compatible MLA + sparse attention backend.
  6. NVFP4 is native to SM120: 2× decode speedup over FP8 for supported models.
  7. DCP is essential for Kimi K2.5 long context: without it, decode at 200K context drops to <10 tok/s.
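
For intuition on why a small MTP depth such as 2 can be a sweet spot (finding 1), here is a minimal sketch of the standard speculative-decoding estimate. It assumes each drafted token is accepted independently with probability `alpha`. The function name and the example acceptance rate of 0.7 are illustrative assumptions, not measurements from this wiki, and the model ignores the cost of producing the drafts.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model forward pass with k draft tokens.

    Assumes an independent per-token acceptance probability `alpha`
    (an illustrative simplification, not a measured value).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every draft is accepted, plus the one token the target always emits.
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    alpha = 0.7  # hypothetical acceptance rate, for illustration only
    previous = 0.0
    for k in range(0, 5):
        tokens = expected_tokens_per_step(alpha, k)
        # The marginal gain of each extra draft token is alpha^k, so it shrinks.
        print(f"k={k}: {tokens:.3f} tokens/step (gain {tokens - previous:.3f})")
        previous = tokens
```

Under this simplified model each additional draft token adds only `alpha**k` expected tokens, so the benefit of going from 2 to 3 or 4 drafts is smaller than the benefit of the first two, while the drafting overhead keeps growing.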

Hardware Overview

All results are on NVIDIA RTX PRO 6000 (Blackwell GB202, SM120):

  • 96 GB GDDR7 per GPU (768 GB total for 8×)
  • PCIe 5.0 x16 (~64 GB/s per direction)
  • No NVLink — all inter-GPU communication via PCIe
  • Typical configs: AMD EPYC Turin/Genoa, 4× or 8× GPUs
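
As a rough way to relate the 96 GB per-GPU figure to model size, the sketch below estimates how many GPUs are needed just to hold the weights. The bytes-per-parameter values, the 90% usable-memory fraction, and the power-of-two rounding (chosen to match the 2×, 4×, 8× configurations described here) are all assumptions. It ignores KV cache, activations, and quantization scale overhead, so real requirements will be higher.

```python
import math

GPU_MEM_GB = 96  # per-GPU GDDR7 capacity listed above

# Approximate bytes per parameter (assumption; NVFP4 block scales are ignored).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "nvfp4": 0.5}


def weight_gb(params_billion: float, dtype: str) -> float:
    """Weight footprint in GB: billions of params times bytes per param."""
    return params_billion * BYTES_PER_PARAM[dtype]


def min_gpus_for_weights(params_billion: float, dtype: str,
                         usable_fraction: float = 0.9) -> int:
    """Smallest GPU count whose usable memory holds the weights alone."""
    usable_gb = GPU_MEM_GB * usable_fraction
    return math.ceil(weight_gb(params_billion, dtype) / usable_gb)


def next_power_of_two(n: int) -> int:
    """Round a GPU count up to a power of two (assumed tensor-parallel sizes)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


if __name__ == "__main__":
    # Parameter counts taken from the Models table (397B and 744B).
    for name, params in [("Qwen3.5-397B", 397), ("GLM-5", 744)]:
        for dtype in ("nvfp4", "fp8"):
            n = min_gpus_for_weights(params, dtype)
            print(f"{name} {dtype}: {weight_gb(params, dtype):.0f} GB weights, "
                  f">= {n} GPUs, TP size {next_power_of_two(n)}")
```

The output is a lower bound only: long-context KV cache and runtime buffers can easily push a model that nominally fits on 4 GPUs onto 8.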

Contributing

This wiki is synthesized from Discord discussions. If you have corrections, additional benchmarks, or new configurations, please open an issue or PR.


Generated March 2026. Data sourced from community Discord server.
