Skip to content

tonyd2wild/DeepSeek-v4-Flash-DSpark-1M-NVFP4-KV-2x-DGX-Spark

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DeepSeek V4 Flash DSpark 1M NVFP4 KV on 2x DGX Spark

Self-contained two-node DGX Spark recipe for serving DeepSeek-V4-Flash-DSpark with vLLM TP=2, DSpark speculative decoding, and a 1M-token max model length using the experimental nvfp4_ds_mla KV-cache path.

This repo captures the validated 1M NVFP4 checkpoint and the 2026-06-30 agent-stability refresh:

  • max_model_len=1048576
  • kv_cache_dtype=nvfp4_ds_mla
  • reported KV pool: about 1.9M-2.0M tokens
  • reported max concurrency for 1M requests: about 1.8x-1.9x
  • single-stream decode stayed above 50 tok/s
  • direct prompts completed with no Chinese drift or repeated junk
  • 2/4/6 concurrent direct prompts completed cleanly
  • DSpark in-server concurrency patch validated at max_model_len=200000, max_num_seqs=16, with static C16 at 315.1 tok/s aggregate and staggered C16 at 205.0 tok/s aggregate

If you already deployed an older copy and saw agent garble, loops, Chinese drift, or prompt/tool XML leaking into replies, start with AGENT_GARBLE_FIX.md. The fix path keeps the 1M NVFP4 profile; it does not switch production to fp8 or a smaller fallback model.

Result

2026-06-30 Clean Agent-Serving Checkpoint

The latest clean endpoint was reproduced on Asusi/Spark4 before sending the model back through Hermes/OpenClaw-style harnesses.

Runtime:

  • endpoint tested: http://100.90.25.78:8888/v1
  • served model: deepseek-v4-flash-dspark
  • image used on that lane: vllm-dspark-runtime:mia-raf-pr1-nvfp4-keys-c
  • model path: /cache/huggingface/fraserprice/DeepSeek-V4-Flash-DSpark
  • kv_cache_dtype=nvfp4_ds_mla
  • max_model_len=1048576
  • max_num_seqs=6
  • max_num_batched_tokens=8192
  • gpu_memory_utilization=0.80
  • MTP_NUM_TOKENS=5
  • thinking=false
  • --generation-config vllm
  • --override-generation-config '{"temperature":0.0,"top_p":1.0}'
  • explicit per-node VLLM_HOST_IP values

Boot evidence:

GPU KV cache size: 1,990,142 tokens
Maximum concurrency for 1,048,576 tokens per request: 1.90x
Application startup complete.

Direct validation:

  • /v1/models reported "max_model_len": 1048576
  • deterministic sanity prompt returned NVFP4 DSPARK OK
  • five longer English prompts completed with no CJK drift and no repeated junk
  • code-gate server decode mean: 54.22 tok/s
  • 2/4/6 concurrent direct prompts all succeeded cleanly

Concurrency:

concurrency success aggregate tok/s stability
2 2/2 60.95 no CJK/repeat junk
4 4/4 83.21 no CJK/repeat junk
6 6/6 104.11 no CJK/repeat junk

The checkpoint note is in benchmarks/20260630-asusi-spark4-nvfp4-1m-agent-stability.md.

1M NVFP4 Profile

Validated on 2x DGX Spark, one GPU per node, TP=2, single stream.

Case server tok/s TTFC acceptance accepted/draft
p256/g64 54.46 0.506s 0.667 3.33
p256/g256 65.38 0.324s 0.718 3.59
p512/g64 56.26 2.738s 0.625 3.13
p512/g256 54.41 0.422s 0.550 2.75
p512/g256 warmup1 56.73 0.417s 0.585 2.92

Boot logs reported:

GPU KV cache size: 2,044,166 tokens
Maximum concurrency for 1,048,576 tokens per request: 1.95x

The API reported:

{"max_model_len":1048576}

The checkpoint note is in benchmarks/20260629-dspark-nvfp4-1m-context-checkpoint.md.

DSpark Concurrency Profile

Validated on the same 2x DGX Spark TP=2 deployment using Keys' DSpark concurrency patch, kv_cache_dtype=nvfp4_ds_mla, max_model_len=200000, max_num_seqs=16, MTP_NUM_TOKENS=5, and VLLM_DSPARK_GPU_REJECTED_CONTEXT_MASK=1.

Patch source:

The live fix documented here keeps kv_cache_dtype=nvfp4_ds_mla and refreshes the repo's already-vendored Keys overlay with the path-adjusted Patch 2b update from that commit. In Patch 2b, ragged query_start_loc detection no longer depends on num_rejected_tokens_gpu. Treat the service as validated only after the built-in OpenAI-compatible chat smoke request plus agent-client validation pass on the live service.

Static simultaneous batch, one TP=2 replica:

concurrency best aggregate tok/s per-stream tok/s acceptance
1 57.6 57.6 0.635
4 140.8 35.2 0.619
8 252.6 31.6 0.635
16 315.1 19.7 0.609

Staggered independent arrivals, one TP=2 replica:

concurrency success aggregate tok/s acceptance
4 4/4 109.2 0.544
8 8/8 147.3 0.534
16 16/16 205.0 0.567

Correctness sanity check: deterministic victim output remained byte-identical under churn. A medium-churn condense test measured 0.529 acceptance and 99.7 tok/s across the churn window.

The checkpoint note is in benchmarks/20260629-dspark-keys-concurrency-checkpoint.md.

Historical 60 tok/s DSpark Baseline

The older ~60 tok/s number was reproduced, but it is a separate diagnostic profile, not this repo's default 1M NVFP4 deployment:

  • image rebuilt from rafaelcaricio/vllm#1 commit 3519c3b88
  • max_model_len=262144
  • max_num_seqs=1
  • kv_cache_dtype=fp8
  • MTP_NUM_TOKENS=5
  • thinking=false
  • temperature=0.0, top_p=1.0
  • measured 63.97 tok/s on the code_completion gate with 67.9% DSpark acceptance

Use this to diagnose image/runtime drift. Do not confuse it with the production 1M NVFP4 path. The checkpoint note is in benchmarks/20260630-dspark-pr-head-262k-fp8-speed-baseline.md.

2026-06-29 Full-1M Concurrency Microbench

The 200K/16 profile above maximizes raw concurrency. For agent fleets that want the full 1M context ceiling AND concurrency, run max_model_len=1048576 with max_num_seqs=6. Every request can still grow to 1M while up to 6 sessions run at once, because the shared KV pool — not a per-slot reservation — is the real limit (see How the KV cache works).

Validated on the 2026-06-29 code-completion microbench deployment (NVFP4, max_model_len=1048576, max_num_seqs=6, VLLM_DSPARK_GPU_REJECTED_CONTEXT_MASK=1, VLLM_USE_B12X_WO_PROJECTION=1):

  • Boot: GPU KV cache size: 1,901,239 tokens, Maximum concurrency for 1,048,576 tokens per request: 1.81x
  • 6 concurrent requests: 6/6 success, ~182 tok/s aggregate (~30 tok/s per stream), no OOM / no preemption failures
  • Single-stream decode on this same profile: ~67 tok/s (code)

This is the right shape when most sessions sit far below 1M (typical agent turns) but you still want the 1M ceiling available. The newer 2026-06-30 agent-stability checkpoint above is the safer number to cite for Hermes/OpenClaw harness validation.

Higher concurrency is not free: under sustained pressure you can see added scheduler churn, prefill contention, and KV fragmentation. 1M/6 is validated for normal-length agent traffic; for guaranteed deep-context work under load, 1M/2 is conservative and 500K/4 is a balanced middle.

How the KV cache works (why 1M + concurrency is safe)

Three independent knobs, often confused:

knob what it is this build
KV cache pool total shared KV memory in tokens, sized from gpu_memory_utilization after weights load ~1.9–2.04M tokens (NVFP4)
max_model_len per-request ceiling — how long any one request may grow 1,048,576 (1M)
max_num_seqs concurrency cap — max active sequences the scheduler runs at once 6

The pool is shared and allocated on demand: PagedAttention hands KV blocks to each request as it generates tokens and frees them when it finishes. max_model_len and max_num_seqs are ceilings, not reservations — vLLM does NOT pre-allocate max_num_seqs × max_model_len of KV. So the real constraint is:

sum(live tokens across all active requests) <= KV pool (~1.9M)

Worked examples at 1M ceiling / 6 slots:

6 requests x  50k tokens =  300k   fits easily
6 requests x 200k tokens =  1.2M   fits
6 requests x 317k tokens =  1.9M   ~at the limit
2 requests x 1M   tokens =  2.0M   ~at the limit  (this is the "1.81x" boot number)
6 requests x 1M   tokens =  6.0M   impossible — excess requests queue/preempt

The boot log's Maximum concurrency for 1,048,576 tokens per request: 1.81x only means ~1.8 simultaneous full-1M requests fit. But agent turns are almost never near 1M, so 6 normal-length sessions share the pool comfortably while the 1M ceiling stays available for the rare long one. That is exactly why 1M + max_num_seqs=6 is safe: you are not reserving 6×1M, you are sharing one ~1.9M pool across short requests under a high ceiling.

Gotcha: gibberish, loops, Chinese drift, or prompt/XML leakage

If the model boots and basic prompts like hi work, but real agent traffic randomly turns into repeated characters, Chinese drift, leaked tool/schema XML, or Telegram-visible junk, do not assume the weights are bad.

On this deployment there are three checks to make before blaming the weights:

  1. Runtime concurrency safety: make sure the Keys Patch 2b logic is present in recipe/overlay/vllm/v1/spec_decode/dspark_proposer.py. The important behavior is that ragged query_start_loc handling does not depend on num_rejected_tokens_gpu, and the no-rejection path creates a zero rejected token tensor instead of falling through to unsafe request reshaping. Without this, concurrent DSpark requests can mix context.
  2. Runtime image provenance: make sure the image really contains the current DSpark overlay. A reused local tag named vllm-dspark-runtime:clean caused misleading failures even though a nearby PR-head image worked. Rebuild from the intended overlay commit when in doubt.
  3. Decode/fallback safety: for long OpenAI-compatible agent prompts, avoid unstable sampling and hidden fallback transitions. The server default should ignore the model card's sampling defaults and apply a small sampling floor:
{
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 40,
  "repetition_penalty": 1.05,
  "include_reasoning": false,
  "reasoning_effort": "none",
  "chat_template_kwargs": {
    "thinking": false,
    "enable_thinking": false
  }
}

The compose launcher now includes --generation-config vllm, builds --override-generation-config from the GENERATION_* env values, and sets thinking=false so default requests do not inherit unstable model-card sampling. Explicit client request parameters still win. For exact deterministic curl checks, send temperature: 0 in the request body.

Also clear agent fallback lists during validation. A model that looks fixed in direct vLLM tests can still appear poisoned if the orchestration layer silently falls back, reboots a session, or replays a stale prompt/tool transcript into the visible message stream. Keep OpenClaw/Hermes changes separate from model runtime validation unless you are deliberately testing that harness.

Validation gates to run after a live fix:

direct vLLM prompts: clean
direct concurrent vLLM prompts: clean
agent harness prompts: clean, DeepSeek, no fallback
MTP5 accepted-token positions 0..4 active

This keeps NVFP4 KV and MTP5. Do not switch to fp8 or drop to a smaller fallback model just to hide the symptom unless you intentionally accept the context and quality tradeoff.

Important Caveat

This is the Stage C padded NVFP4 path. It keeps DeepSeek V4's known-good 584-byte sparse-MLA cache envelope while routing the runtime through nvfp4_ds_mla.

It is not the unresolved true-layout 416-byte NVFP4 kernel fix. The true-layout experiments were useful for diagnosis but failed past roughly 411 real prompt tokens, so they are intentionally not presented here as the reproducible recipe.

Credits

See CREDITS.md for the full attribution and license notes.

This recipe stands on prior public work:

Our contribution here is the 1M NVFP4-KV checkpoint recipe, the Stage A/B/C runtime patches, sanitized two-node launch config, applying and validating Keys' concurrency patch on the NVFP4 profile, and measured benchmark artifacts from the validated runs.

License Notes

Repo scripts and docs are published under this repo's LICENSE. The vLLM overlay/runtime files are vLLM-derived and retain their Apache-2.0 lineage and SPDX headers where present. Base images, FlashInfer/TileLang/Triton/CUDA/NCCL, and model weights are separate upstream artifacts with their own licenses and usage terms.

Files

path purpose
recipe/overlay/ base DSpark vLLM overlay files
recipe/Dockerfile.dspark-runtime-overlay builds the base DSpark runtime overlay
recipe/nvfp4/Dockerfile.stage-a adds nvfp4_ds_mla dtype plumbing
recipe/nvfp4/Dockerfile.stage-b enables DeepSeek V4 nvfp4_ds_mla probe path
recipe/nvfp4/Dockerfile.stage-c switches DeepSeek V4 NVFP4 to the validated 584-byte padded envelope
docker-compose.dspark.yml two-node vLLM/DSpark service
.env.dspark.example sanitized cluster configuration template
build-dspark-vllm-runtime.sh builds the Stage C image locally and on the worker
prepare-dspark-model-cache.sh downloads/verifies the model cache
start-deepseek-v4-flash-dspark.sh worker-first launch and smoke test; honors worker path/cache/IP overrides
stop-deepseek-v4-flash-dspark.sh stops head and worker services
status-deepseek-v4-flash-dspark.sh shows head/worker container state
logs-deepseek-v4-flash-dspark.sh tails head/worker DSpark logs
smoke-deepseek-v4-flash-dspark.sh direct concurrent OpenAI-compatible smoke test
validate-dspark-config.sh renders and checks the local DSpark compose/env config
patches/keys-concurrency.patch full path-adjusted Keys concurrency patch reference
docs/PATCHES.md plain-English Patch 1 / Patch 2 / Patch 2b concurrency explanation
UPSTREAM_V024_STATUS.md current vLLM v0.24.0 vs DSpark PR #46995 upgrade notes
AGENT_GARBLE_FIX.md update path for older deployments that saw agent garble/drift/loops
scripts/agent_sanity_bench.py direct OpenAI-compatible 1/2/4/6 concurrency and garble check
scripts/capture_runtime.sh captures head/worker Docker inspect, ps, and log tails before/after changes
benchmarks/keys-concurrency/ benchmark scripts from Keys' patch repo
benchmarks/ measured 1M and concurrency checkpoint evidence

Quick Start

Run from the head node.

cp .env.dspark.example .env.dspark

Edit these values for your cluster:

  • WORKER_HOST
  • WORKER_SCRIPT_DIR if the worker checkout/deployment path differs from the head
  • MASTER_ADDR
  • NCCL_IB_HCA
  • NCCL_SOCKET_IFNAME
  • NCCL_IB_GID_INDEX
  • HF_CACHE
  • WORKER_HF_CACHE if the worker cache path differs from the head
  • VLLM_HOST_IP and WORKER_VLLM_HOST_IP for each node's fabric IP

Keep these agent-serving defaults unless you are deliberately experimenting:

  • VLLM_HOST=0.0.0.0 if Hermes/OpenClaw or another machine must reach the API
  • MAX_MODEL_LEN=1048576
  • MAX_NUM_SEQS=6
  • MTP_NUM_TOKENS=5
  • GENERATION_TEMPERATURE=0.6
  • GENERATION_TOP_P=0.95
  • GENERATION_TOP_K=40
  • GENERATION_REPETITION_PENALTY=1.05
  • VLLM_DSPARK_GPU_REJECTED_CONTEXT_MASK=1
  • VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0

Build the base overlay and Stage C NVFP4 image:

./build-dspark-vllm-runtime.sh

Prepare the model cache:

./prepare-dspark-model-cache.sh

Start the service:

./start-deepseek-v4-flash-dspark.sh

The API serves at:

http://HEAD_NODE_IP:8888/v1

For head-node-only tests, set VLLM_HOST=127.0.0.1. For Hermes/OpenClaw or another machine to use the endpoint, keep VLLM_HOST=0.0.0.0 and control access at the network/firewall layer.

Runtime Profile

1M Keys-Concurrency Profile

Core vLLM flags:

  • --tensor-parallel-size 2
  • --distributed-executor-backend mp
  • --nnodes 2
  • --kv-cache-dtype nvfp4_ds_mla
  • --block-size 256
  • --max-model-len 1048576
  • --max-num-seqs 6
  • --max-num-batched-tokens 8192
  • --gpu-memory-utilization 0.80
  • --speculative-config '{"method":"dspark","num_speculative_tokens":${MTP_NUM_TOKENS:-5}}'

Key runtime env:

  • VLLM_USE_B12X_MOE=1
  • VLLM_USE_B12X_WO_PROJECTION=1
  • VLLM_DSPARK_GPU_REJECTED_CONTEXT_MASK=1
  • VLLM_DSPARK_CONFIDENCE_SCHEDULER=off
  • VLLM_DSPARK_LOCAL_ARGMAX=1
  • VLLM_DSPARK_REPLICATE_MARKOV_W1=1
  • VLLM_DSPARK_FUSED_MARKOV_ARGMAX=0
  • VLLM_DSPARK_REFERENCE_KV_QUANT_DEQUANT=0
  • VLLM_DSV4_B12X_COMPRESSED_MLA=0
  • VLLM_DSV4_DSPARK_DEFER_TARGET_CAPTURE=0
  • B12X_W4A16_TC_DECODE=0

200k Concurrency Profile

For DSpark concurrency, use the included overlay files with Keys' concurrency patch and set:

  • MAX_MODEL_LEN=200000
  • MAX_NUM_SEQS=16
  • VLLM_USE_B12X_WO_PROJECTION=1
  • VLLM_DSPARK_GPU_REJECTED_CONTEXT_MASK=1

Run the static and staggered checks:

python3 benchmarks/keys-concurrency/bench_concurrent.py http://127.0.0.1:8888 1,4,8,16
python3 benchmarks/keys-concurrency/staggered_bench.py http://127.0.0.1:8888 16 0.4
python3 benchmarks/keys-concurrency/correctness_test.py http://127.0.0.1:8888

1M Single-Stream Legacy Profile

For conservative single-stream testing, set MAX_NUM_SEQS=1 and VLLM_USE_B12X_WO_PROJECTION=0. Keep MTP_NUM_TOKENS=5 unless you are deliberately running an experiment; upstream Mia and Keys both validate the DSpark path at MTP5.

Verify

After launch:

curl -fsS http://127.0.0.1:8888/v1/models

Confirm the returned model entry reports:

"max_model_len": 1048576

Then check logs:

docker compose --env-file .env.dspark -f docker-compose.dspark.yml logs vllm-dspark \
  | grep -E "GPU KV cache size|Maximum concurrency"

Expected 1M/6 checkpoint values are around:

GPU KV cache size: 1.9M-2.0M tokens
Maximum concurrency for 1,048,576 tokens per request: 1.8x-1.9x

Before pointing an agent harness at the endpoint, run the direct sanity bench:

DSPARK_BASE_URL=http://HEAD_NODE_IP:8888/v1 \
CONCURRENCY=1,2,4,6 \
python3 scripts/agent_sanity_bench.py

Every row should report bad_outputs: 0. If this direct test is clean but an agent still garbles, investigate the agent session, fallback list, or harness prompt replay before blaming the DSpark weights.

Capture runtime evidence before and after any fix:

scripts/capture_runtime.sh runtime-before-change
scripts/capture_runtime.sh runtime-after-change

Notes

  • The old speed checkpoint is single stream, not aggregate throughput.
  • The high-concurrency benchmark is aggregate throughput and was validated at max_model_len=200000, not full 1M context.
  • Full 1M context and high concurrency compete for the same KV pool. The 1M/6 profile is intended for normal agent traffic where most sessions sit far below the 1M ceiling; it is not six simultaneous full-1M requests.
  • To combine DSpark concurrency with longer context, pick a lower context target first, then raise concurrency slowly while watching boot logs, KV allocation, acceptance, and request errors.
  • 1M was validated as booted/advertised max_model_len with KV headroom and short-prompt speed probes. This repo does not claim a full 1M-token retrieval or correctness benchmark.
  • The measured probes were p256/p512 with g64/g256. Rebenchmark if you change sampling, batching, context length, WO projection, compressed MLA, or the confidence scheduler.
  • The current validated agent-serving profile is MAX_NUM_SEQS=6, MTP_NUM_TOKENS=5, VLLM_DSPARK_GPU_REJECTED_CONTEXT_MASK=1, VLLM_USE_B12X_WO_PROJECTION=1, and VLLM_DSV4_B12X_COMPRESSED_MLA=0.
  • Worker-first startup avoids a race during multi-node mp initialization.
  • Requires matching images on both nodes, correct NCCL/RoCE settings, and a two-node Blackwell-class/DGX Spark setup.
  • The API binds to 127.0.0.1 by default; exposing it is a deliberate security choice.
  • The next max-sequence ladder to try is approximately 1.25M, 1.5M, then 1.75M, with the same boot/log/speed gates. Raw KV math alone is not enough because DeepSeek V4 sparse MLA also allocates max-length-dependent workspaces.

About

DeepSeek V4 Flash DSpark 1M NVFP4 KV recipe for 2x DGX Spark

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages