
[RFC]: Tracking Spec Decode Support #27691

@MatthewBonanni

Motivation.

I've started testing how spec decode composes with other features. Each row in the table below represents one combination, and the Runs? column indicates whether that combination runs without issues. This issue will track the status for now.
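To make the setup concrete, here is a minimal sketch of how a single row (the L3-8B + EAGLE, TP=2, FLASH_ATTN combination) can be exercised with vLLM's offline `LLM` API. The checkpoints, draft model, and `num_speculative_tokens` below are illustrative assumptions, not necessarily what was used to produce the table:

```python
import os

from vllm import LLM, SamplingParams

# Pin the attention backend under test (here, the FLASH_ATTN rows).
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed L3-8B checkpoint
    tensor_parallel_size=2,
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",  # assumed EAGLE draft head
        "num_speculative_tokens": 3,  # illustrative; not taken from the table
    },
)

outputs = llm.generate(
    ["Explain speculative decoding in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```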

| Model | Hardware | Spec Decode | TP Size | DP Size | EP Enabled? | CUDA Graph Mode | DCP Size | DBO Enabled? | KV Cache DType | Attn Backend | Runs? | GSM8K Score | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| L3-8B | H100 | - | 1 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASH_ATTN | ✅ | 0.64 | |
| L3-8B | H100 | EAGLE | 1 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASH_ATTN | ✅ | 0.65 | |
| L3-8B | H100 | EAGLE | 2 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASH_ATTN | ✅ | 0.65 | |
| L3-8B | H100 | - | 2 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | TRITON_ATTN | ✅ | 0.64 | |
| L3-8B | H100 | EAGLE | 2 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | TRITON_ATTN | ❌ | - | IMA |
| L3-8B | H100 | - | 2 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASHINFER | ✅ | 0.65 | |
| L3-8B | H100 | EAGLE | 2 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASHINFER | ❌ | - | IMA |
| L3-8B | H100 | EAGLE | 2 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLEX_ATTENTION | ⚠️ | 0.65 | OOMs during benchmark (10 prompts works, 1000 fails) (memory leak?); also hits `recompile_limit reached` with `fullgraph=True`; GSM8K with 32 concurrent requests works |
| L3-8B | H100 | - | 1 | 2 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASH_ATTN | ✅ | 0.65 | |
| L3-8B | H100 | EAGLE | 1 | 2 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASH_ATTN | ❌ | - | Crashes on startup (`assert should_attempt_dp_padding == should_dp_pad`) |
| L3-8B | H100 | EAGLE | 1 | 2 | N | PIECEWISE | 1 | N | auto | FLASH_ATTN | ❌ | - | Crashes on startup (`assert should_attempt_dp_padding == should_dp_pad`) |
| L3-8B | H100 | EAGLE | 1 | 2 | N | NONE | 1 | N | auto | FLASH_ATTN | ❌ | - | Hangs during inference, even with DeepEP kernels |
| L3-8B | H100 | EAGLE | 1 | 1 | N | FULL_AND_PIECEWISE | 1 | N | fp8 | FLASH_ATTN | ✅ | 0.65 | |
| GPTOSS-20B | H100 | - | 1 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASH_ATTN | ✅ | 0.36 | |
| GPTOSS-20B | H100 | EAGLE3 | 2 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASH_ATTN | ✅ | 0.36 | Low acceptance rates (less than 10%) |
| Q3-32B | H100 | EAGLE3 | 2 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASH_ATTN | ✅ | 0.63 | |
| Q3-MoE | H100 | EAGLE3 | 8 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASH_ATTN | ✅ | 0.87 | |
| DSR1 | H200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASHMLA | | | |
| DSR1 | H200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASH_ATTN_MLA | | | |
| DSR1 | H200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 8 | N | auto | FLASH_ATTN_MLA | | | |
| DSR1 | H200 | MTP | 1 | 8 | Y | FULL_AND_PIECEWISE | 1 | N | auto | FLASH_ATTN_MLA | ❌ | - | Hangs during inference, even with DeepEP kernels |
| DSR1 | B200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASHMLA | ❌ | - | FlashMLA dense is Hopper-only |
| DSR1 | B200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | CUTLASS_MLA | ⚠️ | 0.96 | Works but uses prefill pathway, so performance will suffer |
| DSR1 | B200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASHINFER_MLA | ✅ | 0.96 | NOTE: supports q_len > 1; we should change `reorder_batch_threshold` (currently 1) |
| DSV3.2 | H200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASHMLA_SPARSE | | | |
| DSV3.2 | H200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 1 | N | fp8 | FLASHMLA_SPARSE | | | |
| DSV3.2 | H200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 8 | N | fp8 | FLASHMLA_SPARSE | | | |
| DSV3.2 | H200 | MTP | 1 | 8 | Y | FULL_AND_PIECEWISE | 1 | N | fp8 | FLASHMLA_SPARSE | | | |
| DSV3.2 | B200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASHMLA_SPARSE | | | |
| DSV3.2 | B200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 1 | N | fp8 | FLASHMLA_SPARSE | | | |
| DSV3.2 | B200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 8 | N | fp8 | FLASHMLA_SPARSE | | | |
| DSV3.2 | B200 | MTP | 1 | 8 | Y | FULL_AND_PIECEWISE | 1 | N | fp8 | FLASHMLA_SPARSE | | | |

Proposed Change.

The most significant outstanding issue is that I could not get spec decode to work with DP in any configuration.

Eventually, the goal is to have all of these configurations covered in CI somehow.
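A minimal sketch of what that coverage could look like, assuming a hypothetical `run_gsm8k()` harness (no such helper exists in-tree today; the score floors below are placeholders, not agreed-upon thresholds):

```python
import pytest


def run_gsm8k(model: str, spec: str | None, tp: int, backend: str) -> float:
    """Hypothetical harness: launch vLLM with this config, run GSM8K, return accuracy."""
    raise NotImplementedError  # placeholder; wire up lm-eval or an in-tree eval here


# One tuple per table row: (model, spec decode method, TP size, attn backend, score floor).
CASES = [
    ("meta-llama/Meta-Llama-3-8B-Instruct", None, 1, "FLASH_ATTN", 0.60),
    ("meta-llama/Meta-Llama-3-8B-Instruct", "eagle", 2, "FLASH_ATTN", 0.60),
]


@pytest.mark.parametrize("model,spec,tp,backend,floor", CASES)
def test_spec_decode_matrix(model, spec, tp, backend, floor):
    # A combination "passes" if it both runs and stays above its GSM8K floor.
    assert run_gsm8k(model, spec, tp, backend) >= floor
```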

Feedback Period.

No response

CC List.

@robertgshaw2-redhat

Any Other Things.

No response

