# Speculative Decoding

SGLang provides EAGLE-based speculative decoding (EAGLE-2 and EAGLE-3). The implementation aims to maximize speed and efficiency and is considered to be among the fastest in open-source LLM engines.

### Performance Highlights

The table below shows the throughput gains that EAGLE-3 decoding achieves for LLaMA-Instruct 3.1 8B on MT-Bench.
For further details, see the [EAGLE3 paper](https://arxiv.org/pdf/2503.01840).

| Method | Throughput (tokens/s) |
|--------|-----------------------|
| SGLang (w/o speculative, 1x H100) | 158.34 |
| SGLang + EAGLE-2 (1x H100) | 244.10 |
| SGLang + EAGLE-3 (1x H100) | 373.25 |
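
As a quick reading of the table, the relative speedups can be computed directly from the reported throughputs. The snippet below is only arithmetic on the numbers above; the dictionary keys and variable names are illustrative.

```python
# Throughputs (tokens/s) copied from the table above.
throughput = {
    "baseline": 158.34,
    "EAGLE-2": 244.10,
    "EAGLE-3": 373.25,
}

# Speedup of each method relative to decoding without speculation.
speedup = {
    name: value / throughput["baseline"]
    for name, value in throughput.items()
}

for name, ratio in speedup.items():
    print(f"{name}: {ratio:.2f}x")
```

On this benchmark, EAGLE-2 is roughly 1.5x and EAGLE-3 roughly 2.4x faster than the baseline.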

## EAGLE Decoding

To enable EAGLE speculative decoding, the following parameters are relevant:
* `speculative_draft_model_path`: Specifies the draft model. This parameter is required.
* `speculative_num_steps`: Depth of autoregressive drafting. Larger values increase the speculation range but risk rejection cascades. Default is 5.
* `speculative_eagle_topk`: Branching factor per step. Larger values improve candidate diversity and tend to raise the acceptance rate, but increase memory and compute consumption. Default is 4.
* `speculative_num_draft_tokens`: Maximum parallel verification capacity. Larger values allow deeper tree evaluation but increase GPU memory usage. Default is 8.

These parameters are the same for EAGLE-2 and EAGLE-3.
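
To make the term "rejection cascade" concrete: within a single draft chain, verification accepts draft tokens only up to the first mismatch with the target model, so everything drafted after a rejected token is wasted. The following is a minimal sketch of greedy (temperature 0) verification of one chain, not SGLang's tree-based implementation; `count_accepted` and the token lists are hypothetical.

```python
def count_accepted(draft_tokens, target_tokens):
    """Return how many leading draft tokens match the target model's choices.

    Assumes greedy verification of a single chain: target_tokens[i] is the
    token the target model would emit at position i given the accepted prefix.
    """
    accepted = 0
    for drafted, expected in zip(draft_tokens, target_tokens):
        if drafted != expected:
            break  # the first rejection discards the rest of the chain
        accepted += 1
    return accepted


# A deeper draft (more steps) only helps if the early tokens are accepted.
draft = [11, 42, 7, 99, 3]
target = [11, 42, 8, 99, 3]
print(count_accepted(draft, target))  # prints 2
```

In this toy case five tokens were drafted but only two are kept, which is why a large `speculative_num_steps` does not pay off when the draft model diverges early.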

You can find the best combinations of these parameters with [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py).

In the examples below, we set `--cuda-graph-max-bs` to a small value for faster engine startup. For your own workloads, tune the above parameters together with `--cuda-graph-max-bs`, `--max-running-requests`, and `--mem-fraction-static` for the best performance.
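
The benchmark script automates the search. Purely as an illustration of what "trying combinations" means, the sketch below enumerates a small grid and renders each combination as command-line flags. The grid values and the helper name `speculative_flags` are hypothetical, and this is not the interface of `bench_speculative.py`.

```python
from itertools import product


def speculative_flags(num_steps, topk, num_draft_tokens):
    """Render one parameter combination as launch_server command-line flags."""
    return (
        f"--speculative-num-steps {num_steps} "
        f"--speculative-eagle-topk {topk} "
        f"--speculative-num-draft-tokens {num_draft_tokens}"
    )


# Hypothetical candidate values; choose ranges that suit your hardware.
steps_grid = [3, 5]
topk_grid = [4, 8]
draft_tokens_grid = [16, 64]

candidates = [
    speculative_flags(steps, topk, draft_tokens)
    for steps, topk, draft_tokens in product(steps_grid, topk_grid, draft_tokens_grid)
]

for flags in candidates:
    print(flags)
```

Each printed line can be appended to a `python3 -m sglang.launch_server ...` command like the ones in the sections below and then benchmarked on a representative workload.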

### EAGLE-2 Decoding

You can enable EAGLE-2 decoding by setting `--speculative-algorithm EAGLE` and choosing an appropriate model.

In [1]:
from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process

import openai

In [2]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 3 \
    --speculative-eagle-topk 4 --speculative-num-draft-tokens 16 --cuda-graph-max-bs 8 --log-level warning
"""
)

wait_for_server(f"http://localhost:{port}")

In [3]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")

In [4]:
terminate_process(server_process)
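
When comparing configurations by hand, it can help to time a request (before terminating the server) and convert the result into tokens per second. The helper below is a rough sketch: `measure_throughput` and `fake_request` are hypothetical names, and the stand-in request function would be replaced by a call like the `client.chat.completions.create(...)` request above that returns the number of completion tokens, for example from `response.usage.completion_tokens`.

```python
import time


def measure_throughput(send_request):
    """Time one request and return generated tokens per second.

    send_request is any callable that performs a request and returns the
    number of completion tokens it produced.
    """
    start = time.perf_counter()
    completion_tokens = send_request()
    elapsed = time.perf_counter() - start
    return completion_tokens / elapsed


def fake_request():
    """Stand-in for a real API call so the sketch runs without a server."""
    time.sleep(0.05)
    return 64


print(f"{measure_throughput(fake_request):.1f} tokens/s")
```

A single request is a noisy measurement; averaging over several prompts gives a more reliable comparison between parameter settings.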

### EAGLE-2 Decoding with `torch.compile`

You can also enable `torch.compile` for further optimizations and optionally set `--torch-compile-max-bs`:


In [5]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 \
    --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.6 \
    --enable-torch-compile --torch-compile-max-bs 2 --log-level warning
"""
)

wait_for_server(f"http://localhost:{port}")

In [6]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")

In [7]:
terminate_process(server_process)

### EAGLE-2 Decoding via Frequency-Ranked Speculative Sampling

By employing a truncated high-frequency token vocabulary in the draft model, EAGLE speculative decoding reduces the computational overhead of `lm_head` and accelerates the pipeline without quality degradation. For more details, check out [the paper](https://arxiv.org/pdf/2502.14856).
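
The idea can be illustrated with a toy example: the draft step scores only the rows of `lm_head` that belong to high-frequency tokens, so its cost scales with the subset size rather than the full vocabulary. The sketch below uses plain Python lists; `draft_next_token`, the toy weights, and the token ids are hypothetical and do not reflect SGLang's internals.

```python
def draft_next_token(hidden, lm_head_rows, hot_token_ids):
    """Greedy draft step restricted to a high-frequency vocabulary subset."""
    best_id, best_logit = None, float("-inf")
    for token_id in hot_token_ids:
        # Dot product of the hidden state with one lm_head row.
        logit = sum(h * w for h, w in zip(hidden, lm_head_rows[token_id]))
        if logit > best_logit:
            best_id, best_logit = token_id, logit
    return best_id


# Toy lm_head with a vocabulary of 6 tokens and hidden size 2.
lm_head_rows = [
    [0.1, 0.2],
    [0.9, 0.1],
    [0.0, 1.0],
    [2.0, 2.0],
    [0.5, 0.5],
    [-1.0, 0.0],
]
hidden = [1.0, 0.5]
hot_token_ids = [0, 1, 4]  # hypothetical high-frequency subset

print(draft_next_token(hidden, lm_head_rows, hot_token_ids))  # prints 1
```

Here the full-vocabulary argmax would be token 3, which the draft cannot propose because it lies outside the subset. As described in the paper, the target model still verifies drafts against the full vocabulary, so a restricted draft affects only how many tokens are accepted, not the final output.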

In our implementation, set `--speculative-token-map` to enable the optimization. You can get the high-frequency tokens used in FR-Spec from [this model](https://huggingface.co/thunlp/LLaMA3-Instruct-8B-FR-Spec), or download them directly from [this repo](https://github.com/thunlp/FR-Spec/tree/main?tab=readme-ov-file#prepare-fr-spec-vocabulary-subset).
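
The linked files are precomputed, so no extra work is needed to use them. Purely to illustrate what a high-frequency token subset is, the sketch below ranks token ids by how often they occur in a tokenized corpus and keeps the top entries. The function name and toy corpus are hypothetical, and this is neither the FR-Spec tooling nor its file format.

```python
from collections import Counter


def top_frequency_token_ids(token_id_corpus, vocab_subset_size):
    """Return the most frequent token ids, most frequent first."""
    counts = Counter(token_id_corpus)
    return [token_id for token_id, _ in counts.most_common(vocab_subset_size)]


# Toy corpus of token ids: 5 occurs four times, 2 three times, 9 and 7 once.
corpus = [5, 2, 5, 9, 2, 5, 7, 2, 5]
print(top_frequency_token_ids(corpus, 2))  # prints [5, 2]
```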

Thanks to [Weilin Zhao](https://github.com/Achazwl) and [Zhousx](https://github.com/Zhou-sx) for the contribution.
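To make the idea of a truncated draft vocabulary concrete, here is a minimal pure-Python sketch. It is not SGLang's or FR-Spec's implementation: the helper names (`build_token_map`, `truncated_lm_head`, `draft_next_token`), the toy weights, and the toy corpus are all assumptions made for illustration. It ranks tokens by corpus frequency, scores only the kept rows of the output projection, and maps the winning index back to a full-vocabulary token id.

```python
# Toy illustration of a frequency-truncated draft vocabulary.
# All names and numbers here are hypothetical, not SGLang or FR-Spec APIs.
from collections import Counter


def build_token_map(corpus_token_ids, keep):
    """Return the `keep` most frequent token ids, most frequent first."""
    counts = Counter(corpus_token_ids)
    return [tok for tok, _ in counts.most_common(keep)]


def truncated_lm_head(hidden, lm_head_rows, token_map):
    """Score only the rows of the output projection listed in token_map."""
    return [
        sum(h * w for h, w in zip(hidden, lm_head_rows[tok]))
        for tok in token_map
    ]


def draft_next_token(hidden, lm_head_rows, token_map):
    """Pick the best token in the reduced vocabulary and return its full-vocabulary id."""
    logits = truncated_lm_head(hidden, lm_head_rows, token_map)
    best_local = max(range(len(logits)), key=logits.__getitem__)
    return token_map[best_local]


# Full vocabulary of 6 tokens, hidden size 2.
lm_head_rows = [
    [0.1, 0.0],  # token 0
    [0.9, 0.2],  # token 1
    [0.0, 1.0],  # token 2
    [2.0, 2.0],  # token 3 (rare in the corpus, so it is dropped)
    [0.5, 0.5],  # token 4
    [0.3, 0.1],  # token 5
]
corpus = [1, 1, 1, 2, 2, 4, 4, 4, 4, 0, 5, 3]

token_map = build_token_map(corpus, keep=3)
hidden = [1.0, 0.5]
draft_token = draft_next_token(hidden, lm_head_rows, token_map)
print(token_map, draft_token)
```

In this toy setup the draft head computes three dot products instead of six. It can miss a rare token (token 3 would score highest over the full vocabulary), but the target model still verifies every draft with its full vocabulary, so such a miss only costs a rejected draft token.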

In [8]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3-8B-Instruct --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B --speculative-num-steps 5 \
    --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --speculative-token-map thunlp/LLaMA3-Instruct-8B-FR-Spec/freq_32768.pt \
    --mem-fraction 0.7 --cuda-graph-max-bs 2 --dtype float16 --log-level warning
"""
)

wait_for_server(f"http://localhost:{port}")


In [9]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")

In [10]:
terminate_process(server_process)

### EAGLE-3 Decoding

You can enable EAGLE-3 decoding by setting `--speculative-algorithm EAGLE3` and choosing an appropriate model.

In [11]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B --speculative-num-steps 5 \
    --speculative-eagle-topk 8 --speculative-num-draft-tokens 32 --mem-fraction 0.6 \
    --cuda-graph-max-bs 2 --dtype float16 --log-level warning
"""
)

wait_for_server(f"http://localhost:{port}")


In [12]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")

In [13]:
terminate_process(server_process)

## Multi Token Prediction

We support [MTP (Multi-Token Prediction)](https://arxiv.org/pdf/2404.19737) in SGLang through speculative decoding. The example below uses the XiaomiMiMo/MiMo-7B-RL model; for DeepSeek MTP usage, refer to the [DeepSeek doc](../basic_usage/deepseek.md#multi-token-prediction).

In [14]:
server_process, port = launch_server_cmd(
    """
    python3 -m sglang.launch_server --model-path XiaomiMiMo/MiMo-7B-RL --host 0.0.0.0 --trust-remote-code \
    --speculative-algorithm EAGLE --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
    --mem-fraction 0.5 --log-level warning
"""
)

wait_for_server(f"http://localhost:{port}")


In [15]:
import requests

url = f"http://localhost:{port}/v1/chat/completions"

data = {
    "model": "XiaomiMiMo/MiMo-7B-RL",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

response = requests.post(url, json=data)
print_highlight(response.json())

In [16]:
terminate_process(server_process)

## References

The EAGLE process is as follows:

- Within EAGLE, the draft model predicts the next feature vector, i.e. the last hidden state of the original LLM, using the feature sequence $(f_1, \dots, f_k)$ and the token sequence $(t_2, \dots, t_{k+1})$.
- The next token is then sampled from $p_{k+2}=\text{LMHead}(f_{k+1})$. Afterwards, the two sequences are extended in a tree style, branching out into multiple potential continuations with the branching factor per step controlled by the `speculative_eagle_topk` parameter, to ensure a more coherent connection of context, and are given as input again.
- EAGLE-2 additionally uses the draft model to evaluate how probable certain branches in the draft tree are, dynamically stopping the expansion of unlikely branches. After the expansion phase, reranking is employed to select only the top `speculative_num_draft_tokens` final nodes as draft tokens.
- EAGLE-3 removes the feature prediction objective, incorporates low and mid-layer features, and is trained in an on-policy manner.

Operating on features instead of tokens gives the draft model more regular inputs, and additionally passing the tokens from the next timestep minimizes the randomness effects of sampling; together these improve drafting accuracy. Furthermore, the dynamic adjustment of the draft tree and the selection of reranked final nodes further increase the acceptance rate of draft tokens. For more details, see the [EAGLE-2](https://arxiv.org/abs/2406.16858) and [EAGLE-3](https://arxiv.org/abs/2503.01840) papers.
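The tree expansion and reranking steps described above can be sketched in a few lines of pure Python. This is a toy model under stated assumptions, not SGLang's GPU implementation: `toy_draft_probs` is a hypothetical stand-in for the draft model, and `build_draft_tree` is an illustrative helper whose arguments merely mirror `speculative_num_steps`, `speculative_eagle_topk`, and `speculative_num_draft_tokens`.

```python
# Toy EAGLE-2 style draft tree: expand top-k children per step, then rerank.
# The draft "model" and all helper names are hypothetical.
import heapq
import math


def toy_draft_probs(prefix):
    """Stand-in for the draft model: a fixed distribution that depends on prefix length."""
    if len(prefix) % 2 == 0:
        return {"a": 0.6, "b": 0.3, "c": 0.1}
    return {"a": 0.2, "b": 0.7, "c": 0.1}


def build_draft_tree(num_steps, topk, num_draft_tokens, draft_probs=toy_draft_probs):
    """Return the selected draft nodes as (token_path, cumulative_log_prob) pairs."""
    frontier = [((), 0.0)]  # start from the last verified token (empty path)
    all_nodes = []
    for _ in range(num_steps):
        candidates = []
        for path, logp in frontier:
            probs = draft_probs(path)
            # Branch into the topk most probable next tokens of this node.
            best = heapq.nlargest(topk, probs.items(), key=lambda kv: kv[1])
            for tok, p in best:
                candidates.append((path + (tok,), logp + math.log(p)))
        all_nodes.extend(candidates)
        # Unlikely branches stop here: only the topk best nodes are expanded further.
        frontier = heapq.nlargest(topk, candidates, key=lambda node: node[1])
    # Rerank every drafted node and keep the most probable ones for verification.
    return heapq.nlargest(num_draft_tokens, all_nodes, key=lambda node: node[1])


tree = build_draft_tree(num_steps=3, topk=2, num_draft_tokens=4)
for path, logp in tree:
    print(path, round(math.exp(logp), 3))
```

Because a child's cumulative probability never exceeds its parent's, ranking by cumulative probability keeps the selected nodes connected in this example: every kept path has its parent path kept as well. Larger `topk` and `num_draft_tokens` values widen the candidate set at the cost of more draft and verification work, which is the trade-off the parameter descriptions point at.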


For guidance on how to train your own EAGLE model, please see the [EAGLE repo](https://github.com/SafeAILab/EAGLE/tree/main?tab=readme-ov-file#train).