Motivation.
I've started trying to understand how spec decode composes with other features. Each row of the table below represents one combination, and the "Runs?" column indicates whether that combination runs without issues. This issue will track the status for now.
| Model | Hardware | Spec Decode | TP Size | DP Size | EP Enabled? | CUDA Graph Mode | DCP Size | DBO Enabled? | KV Cache DType | Attn Backend | Runs? | GSM8K Score | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| L3-8B | H100 | - | 1 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASH_ATTN | ✅ | 0.64 | |
| L3-8B | H100 | EAGLE | 1 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASH_ATTN | ✅ | 0.65 | |
| L3-8B | H100 | EAGLE | 2 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASH_ATTN | ✅ | 0.65 | |
| L3-8B | H100 | - | 2 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | TRITON_ATTN | ✅ | 0.64 | |
| L3-8B | H100 | EAGLE | 2 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | TRITON_ATTN | ❌ | - | IMA |
| L3-8B | H100 | - | 2 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASHINFER | ✅ | 0.65 | |
| L3-8B | H100 | EAGLE | 2 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASHINFER | ❌ | - | IMA |
| L3-8B | H100 | EAGLE | 2 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLEX_ATTENTION | | 0.65 | OOMs during benchmark (10 prompts works, 1000 fails) (memory leak?). Also hits "recompile_limit reached with fullgraph=True". GSM8K with 32 concurrent requests works. |
| L3-8B | H100 | - | 1 | 2 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASH_ATTN | ✅ | 0.65 | |
| L3-8B | H100 | EAGLE | 1 | 2 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASH_ATTN | ❌ | - | Crashes on startup (assert should_attempt_dp_padding == should_dp_pad) |
| L3-8B | H100 | EAGLE | 1 | 2 | N | PIECEWISE | 1 | N | auto | FLASH_ATTN | ❌ | - | Crashes on startup (assert should_attempt_dp_padding == should_dp_pad) |
| L3-8B | H100 | EAGLE | 1 | 2 | N | NONE | 1 | N | auto | FLASH_ATTN | ❌ | - | Hangs during inference, even with DeepEP kernels |
| L3-8B | H100 | EAGLE | 1 | 1 | N | FULL_AND_PIECEWISE | 1 | N | fp8 | FLASH_ATTN | ✅ | 0.65 | |
| GPTOSS-20B | H100 | - | 1 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASH_ATTN | ✅ | 0.36 | |
| GPTOSS-20B | H100 | EAGLE3 | 2 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASH_ATTN | ✅ | 0.36 | Low acceptance rates (less than 10%) |
| Q3-32B | H100 | EAGLE3 | 2 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASH_ATTN | ✅ | 0.63 | |
| Q3-MoE | H100 | EAGLE3 | 8 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASH_ATTN | ✅ | 0.87 | |
| DSR1 | H200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASHMLA | ✅ | | |
| DSR1 | H200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASH_ATTN_MLA | ✅ | | |
| DSR1 | H200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 8 | N | auto | FLASH_ATTN_MLA | ✅ | | |
| DSR1 | H200 | MTP | 1 | 8 | Y | FULL_AND_PIECEWISE | 1 | N | auto | FLASH_ATTN_MLA | ❌ | | Hangs during inference, even with DeepEP kernels |
| DSR1 | B200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASHMLA | ❌ | | FlashMLA dense is Hopper-only |
| DSR1 | B200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | CUTLASS_MLA | | 0.96 | Works but uses prefill pathway so performance will suffer |
| DSR1 | B200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASHINFER_MLA | ✅ | 0.96 | NOTE: supports q_len > 1, we should change reorder_batch_threshold (currently 1) |
| DSV3.2 | H200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASHMLA_SPARSE | | | |
| DSV3.2 | H200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 1 | N | fp8 | FLASHMLA_SPARSE | | | |
| DSV3.2 | H200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 8 | N | fp8 | FLASHMLA_SPARSE | | | |
| DSV3.2 | H200 | MTP | 1 | 8 | Y | FULL_AND_PIECEWISE | 1 | N | fp8 | FLASHMLA_SPARSE | | | |
| DSV3.2 | B200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 1 | N | auto | FLASHMLA_SPARSE | | | |
| DSV3.2 | B200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 1 | N | fp8 | FLASHMLA_SPARSE | | | |
| DSV3.2 | B200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 8 | N | fp8 | FLASHMLA_SPARSE | | | |
| DSV3.2 | B200 | MTP | 1 | 8 | Y | FULL_AND_PIECEWISE | 1 | N | fp8 | FLASHMLA_SPARSE | | | |

Proposed Change.
The most significant outstanding issue is that I could not get spec decode to work with DP in any configuration.
Eventually, the goal is to have all of these configurations covered in CI somehow.
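
As a starting point for CI coverage, each table row could be encoded as data and turned into a server launch command. The following is only a sketch under assumptions, not an existing vLLM test harness: the `Combo` dataclass, the `render` helper, the `<target-model>` / `<draft-model>` placeholders, and the default of 3 speculative tokens are all hypothetical. The CLI flag names and the `VLLM_ATTENTION_BACKEND` environment variable are my understanding of the current vLLM options and should be checked against the installed version.

```python
import json
import shlex
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Combo:
    """One row of the compatibility table (hypothetical helper type)."""
    model: str                          # placeholder for the target model ID
    spec_method: Optional[str] = None   # e.g. "eagle", "eagle3"; None = no spec decode
    draft_model: Optional[str] = None   # placeholder for the draft model ID, if any
    num_speculative_tokens: int = 3     # assumption: not stated in the issue
    tp: int = 1
    dp: int = 1
    ep: bool = False
    cudagraph_mode: str = "FULL_AND_PIECEWISE"
    dcp: int = 1
    dbo: bool = False
    kv_cache_dtype: str = "auto"
    attn_backend: str = "FLASH_ATTN"


def render(combo: Combo) -> Tuple[Dict[str, str], List[str]]:
    """Return (extra environment variables, argv) for one combination."""
    # Assumption: the attention backend is selected via this env var.
    env = {"VLLM_ATTENTION_BACKEND": combo.attn_backend}
    args = [
        "vllm", "serve", combo.model,
        "--tensor-parallel-size", str(combo.tp),
        "--data-parallel-size", str(combo.dp),
        "--decode-context-parallel-size", str(combo.dcp),
        "--kv-cache-dtype", combo.kv_cache_dtype,
        "--compilation-config",
        json.dumps({"cudagraph_mode": combo.cudagraph_mode}),
    ]
    if combo.ep:
        args.append("--enable-expert-parallel")
    if combo.dbo:
        args.append("--enable-dbo")
    if combo.spec_method is not None:
        spec = {
            "method": combo.spec_method,
            "num_speculative_tokens": combo.num_speculative_tokens,
        }
        if combo.draft_model is not None:
            spec["model"] = combo.draft_model
        args += ["--speculative-config", json.dumps(spec)]
    return env, args


# Example: the EAGLE + TP=2 + TRITON_ATTN row from the table.
example = Combo(
    model="<target-model>",
    spec_method="eagle",
    draft_model="<draft-model>",
    tp=2,
    attn_backend="TRITON_ATTN",
)
example_env, example_args = render(example)
print(example_env, shlex.join(example_args))
```

A list of `Combo` values could then feed a parametrized test that launches the server, runs a short GSM8K check, and records the outcome per row; combinations that are known to fail could be marked as expected failures until they are fixed.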
Feedback Period.
No response
CC List.
Any Other Things.
No response