GLM-5.2 on RTX 4090

This repo runs the full GLM-5.2, the 753B-parameter SOTA open-weights model, in native FP8 on consumer NVIDIA RTX 4090 GPUs. As far as we know, this is the first time a DeepSeek-Sparse-Attention (DSA) model has run correctly on Ada (sm_89). The stock sglang / vLLM stack hard-requires Hopper (H100/H200) or Blackwell (B200) for GLM-5.2: its sparse-attention kernels are gated to sm_90 / sm_100 with no Ada fallback. The core of the repo is ada_dsa.py, a drop-in module that ports that whole kernel stack to the 4090.

prompt:  "The capital of France is"
output:  "Paris. Distance from Paris to Lyon is 391 km, while direct flight time is 1 h 5 min"

Highlights

Full 753B model, full FP8 — not a distilled or int4 variant. The complete GLM-5.2-FP8 weights.
32× RTX 4090-48GB (4 nodes × 8), pipeline + tensor parallel — proven, coherent chat / reasoning / code.
~24 tok/s single-stream, ~656 tok/s aggregate (64 concurrent streams, CUDA-graph) — interactive single-user speed and real serving throughput for the full 753B on commodity cards.
Every ported kernel is validated against a reference, down to ~1e-6, including 0.999999 cosine similarity on the live model's real tensors.
Open — the kernels, the one-call installer, and the verification scripts are all here.
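
The validation claim above comes down to comparing a ported kernel's output with a reference output on the same inputs. Below is a minimal sketch of that kind of comparison in plain Python. The function names and sample values are illustrative assumptions, not the repo's own verification scripts, which work on real model tensors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def max_abs_error(a, b):
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))

# Hypothetical flattened outputs: a reference kernel vs. a ported kernel.
reference = [0.5, -1.25, 2.0, 0.125]
ported = [0.5000004, -1.2500003, 1.9999998, 0.1250001]

print("cosine similarity:", cosine_similarity(reference, ported))
print("max abs error:   ", max_abs_error(reference, ported))
```

Reporting both numbers is useful because cosine similarity ignores a uniform scale error, while the maximum absolute error catches it.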

Hardware sizing

The FP8 weights are ~753 GB, so the model has to be split across enough cards to hold the weights plus KV-cache and activations. Roughly:

| GPU | VRAM/card | GPUs needed | Layout | Status |
|---|---|---|---|---|
| RTX 4090 | 48 GB | 32 | TP=8 × PP=4 (4 nodes) | ✅ proven (this repo) |
| RTX 4090 | 24 GB | ~40–48 | TP=8 × PP=5–6 (5–6 nodes) | sizing estimate |
| RTX 5090 | 32 GB | ~32 | TP=8 × PP=4 (4 nodes) | sizing estimate¹ |

¹ The RTX 5090 is sm_120 (consumer Blackwell), which the stock sm_90/sm_100 DSA kernels also don't cover — so it needs this same port (widen the capability guard to include sm_120). Only the 4090-48GB config is tested here; the others are VRAM-fit estimates (assume ~6–8 GB/card reserved for KV + activations + CUDA context, more for larger context windows).
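
The VRAM-fit arithmetic behind these estimates can be sketched as follows. This is a rough model under stated assumptions: the ~753 GB weight figure and the 6–8 GB per-card reserve come from the text above, the GPU count is rounded up to whole 8-GPU nodes (TP=8), and `min_gpus` is a hypothetical helper name. It gives a lower bound for fitting the weights, not a tested configuration.

```python
import math

WEIGHTS_GB = 753  # FP8 weight footprint quoted above

def min_gpus(vram_gb, reserve_gb, tp=8):
    """Smallest GPU count (a multiple of `tp`) whose usable VRAM holds the weights.

    reserve_gb is the per-card budget for KV-cache, activations and CUDA
    context; the 6-8 GB range used below is the assumption from the footnote.
    """
    usable = vram_gb - reserve_gb
    if usable <= 0:
        raise ValueError("reserve exceeds per-card VRAM")
    raw = math.ceil(WEIGHTS_GB / usable)  # cards needed ignoring node size
    return math.ceil(raw / tp) * tp       # round up to whole TP groups

for name, vram in [("RTX 4090 24 GB", 24), ("RTX 5090 32 GB", 32), ("RTX 4090 48 GB", 48)]:
    counts = sorted({min_gpus(vram, r) for r in (6, 8)})
    print(f"{name}: at least {counts} GPUs")
```

For the 48 GB card this floor is 24 GPUs, fewer than the tested 32. The tested layout leaves more room per card for KV-cache and longer contexts than the bare minimum would.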

Usage

Tested against an sglang build with the nsa / tilelang DSA backend, in an environment where tilelang is available (we grafted tilelang 0.1.11 + tvm-ffi from KTransformers).

Put ada_dsa.py on PYTHONPATH.
Apply the sglang edits (idempotent, backs up originals):
```
python apply_sglang_patches.py
```
This (a) calls ada_dsa.apply_patches() from nsa_indexer.py to swap the SM90+/SM100 DSA kernels, (b) adds the one-line deep_gemm guard needed for CUDA-graph, and (c) applies the GLM-MoE-DSA index-sharing edit to models/deepseek_v2.py. (All three are documented in TECHNICAL.md if you'd rather do them by hand.)

Launch (example: 32× RTX 4090-48GB, TP=8 × PP=4):

export SGLANG_NSA_FUSE_TOPK=1            # use the ported fused page-mapping transforms
export SGLANG_ENABLE_JIT_DEEPGEMM=0
export SGLANG_PP_LAYER_PARTITION=18,20,20,20   # 78 layers across 4 pipeline stages

# --disable-shared-experts-fusion is REQUIRED on Ada (the MoE fix), see TECHNICAL.md
python -m sglang.launch_server \
  --model-path zai-org/GLM-5.2-FP8 \
  --tp-size 8 --pp-size 4 --nnodes 4 --dist-init-addr <rank0-ip>:30200 \
  --trust-remote-code --kv-cache-dtype fp8_e4m3 --mem-fraction-static 0.90 \
  --attention-backend nsa --nsa-decode-backend tilelang --nsa-prefill-backend tilelang \
  --fp8-gemm-backend triton \
  --disable-shared-experts-fusion \
  --tool-call-parser glm47 --reasoning-parser glm45 \
  --node-rank <0|1|2|3> --host 0.0.0.0 --port 8000

--disable-shared-experts-fusion is required on Ada. CUDA-graph is on by default here and gives the full speed (about 24 tok/s single-stream vs about 2.5 in eager); it needs one extra one-line guard in deep_gemm_wrapper/entrypoint.py (see TECHNICAL.md). If you'd rather not patch that, add --disable-cuda-graph and run eager. Configure NCCL transport (NCCL_P2P_DISABLE / NCCL_IB_DISABLE) to match your fabric.
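
If you change the pipeline layout, the layer partition must still have one entry per pipeline stage and add up to the model's 78 layers. A small pre-launch sanity check could look like the sketch below. `parse_layer_partition` is a hypothetical helper, and how sglang itself interprets SGLANG_PP_LAYER_PARTITION is not shown here; the code only checks the arithmetic.

```python
def parse_layer_partition(spec, num_layers, pp_size):
    """Parse a comma-separated per-stage layer count and check it is consistent."""
    parts = [int(tok) for tok in spec.split(",")]
    if len(parts) != pp_size:
        raise ValueError(f"expected {pp_size} stages, got {len(parts)}")
    if any(p <= 0 for p in parts):
        raise ValueError("every stage needs at least one layer")
    if sum(parts) != num_layers:
        raise ValueError(f"partition covers {sum(parts)} layers, model has {num_layers}")
    return parts

# Values from the launch example above: 78 layers, PP=4.
stages = parse_layer_partition("18,20,20,20", num_layers=78, pp_size=4)

# Turn per-stage counts into [start, end) layer ranges.
ranges = []
start = 0
for count in stages:
    ranges.append((start, start + count))
    start += count

print(ranges)  # [(0, 18), (18, 38), (38, 58), (58, 78)]
```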

How it works

ada_dsa.py monkeypatches GLM-5.2's SM90+/SM100-only DSA kernels (the lightning-indexer GEMM, the top-k + page-mapping, and the MLA sparse decode) with portable Triton plus a non-WGMMA tilelang path, and does so only on sub-Hopper GPUs. On top of that come one config fix for the MoE (--disable-shared-experts-fusion) and a small model-level patch so GLM's shared DSA layers reuse the previous full layer's top-k instead of recomputing it. Full write-up, kernel-by-kernel walkthrough, and the verification table: TECHNICAL.md.
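
The "only on sub-Hopper GPUs" guard can be pictured with the sketch below. It illustrates the general capability-gated monkeypatch pattern and is not the code in ada_dsa.py. `needs_port`, `patch_if_needed`, the `stock` namespace and `portable_topk` are hypothetical names, and the capability tuple is passed in explicitly so the example runs without a GPU. In a real process it would typically come from `torch.cuda.get_device_capability()`.

```python
import types

HOPPER = (9, 0)  # sm_90

def needs_port(capability, extra=()):
    """True when the stock sm_90 / sm_100 kernels would not cover this GPU.

    `extra` lists newer capabilities that also lack stock kernels,
    e.g. (12, 0) for sm_120.
    """
    cap = tuple(capability)
    return cap < HOPPER or cap in set(extra)

def patch_if_needed(target, replacements, capability, extra=()):
    """Swap attributes on `target` for portable versions, only where needed."""
    if not needs_port(capability, extra):
        return []
    patched = []
    for name, fn in replacements.items():
        if hasattr(target, name):
            setattr(target, name, fn)
            patched.append(name)
    return patched

# Hypothetical stand-in for a module that exposes a Hopper-only kernel.
stock = types.SimpleNamespace(topk=lambda scores, k: "sm90-only kernel")

def portable_topk(scores, k):
    """Plain-Python top-k: indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

patch_if_needed(stock, {"topk": portable_topk}, capability=(8, 9))  # Ada
print(stock.topk([0.1, 0.9, 0.4], 2))  # [1, 2]
```

Keeping the decision in a single predicate means that covering another architecture, such as the sm_120 case from the sizing footnote, only requires widening that one check.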

Status

GLM-5.2 runs correctly on consumer hardware where the stock stack hard-crashes, at about 24 tokens/sec single-stream (CUDA-graph; about 2.5 in eager mode), and ~656 tok/s aggregate across 64 concurrent streams. That's interactive speed for the full 753B model on commodity cards, and real serving throughput at batch. The portable indexer / top-k / page-transform stack is model-agnostic and should apply to other DSA models (e.g. DeepSeek-V3.2-style) with minor adjustment.

License

Apache-2.0.

Built by @renning22.
