# Speculative Decoding

SGLang provides EAGLE-based speculative decoding (EAGLE-2 and EAGLE-3). The implementation aims to maximize speed and efficiency and is considered to be among the fastest in open-source LLM engines.

### Performance Highlights

The table below shows the throughput gains that EAGLE-3 decoding achieves for LLaMA-Instruct 3.1 8B on MT-Bench.
For further details, see the [EAGLE3 paper](https://arxiv.org/pdf/2503.01840).

| Method | Throughput (tokens/s) |
|--------|-----------------------|
| SGLang (w/o speculative, 1x H100) | 158.34 |
| SGLang + EAGLE-2 (1x H100) | 244.10 |
| SGLang + EAGLE-3 (1x H100) | 373.25 |
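
The relative speedups implied by the table can be checked with a few lines of arithmetic. The snippet below only divides the throughput figures listed above by the baseline; the dictionary keys and the function name are shorthand chosen for this illustration.

```python
# Throughput figures (tokens/s) copied from the table above.
throughput = {
    "baseline": 158.34,
    "EAGLE-2": 244.10,
    "EAGLE-3": 373.25,
}


def speedup_over_baseline(method: str) -> float:
    """Return how many times faster `method` is than the non-speculative run."""
    return throughput[method] / throughput["baseline"]


for method in ("EAGLE-2", "EAGLE-3"):
    print(f"{method}: {speedup_over_baseline(method):.2f}x")
```

On these numbers, EAGLE-2 is about 1.54x and EAGLE-3 about 2.36x the baseline throughput.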

## EAGLE Decoding

To enable EAGLE speculative decoding, the following parameters are relevant:
* `speculative_draft_model_path`: Specifies the draft model. This parameter is required.
* `speculative_num_steps`: Depth of autoregressive drafting. Increases speculation range but risks rejection cascades. Default is 5.
* `speculative_eagle_topk`: Branching factor per step. Improves candidate diversity and leads to a higher acceptance rate, but also to higher memory and compute consumption. Default is 4.
* `speculative_num_draft_tokens`: Maximum parallel verification capacity. Allows deeper tree evaluation but leads to higher GPU memory usage. Default is 8.

These parameters are the same for EAGLE-2 and EAGLE-3.
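
The trade-off behind `speculative_num_steps` can be illustrated with a toy version of greedy verification for a single draft chain. This is a simplified sketch, not SGLang's implementation: the function name and token values are hypothetical, and real EAGLE verification scores a whole token tree in one forward pass of the target model. Under greedy decoding, drafted tokens are accepted up to the first mismatch with the target model's choice, and everything drafted after that mismatch is discarded.

```python
from typing import List


def accept_greedy(draft_tokens: List[int], target_tokens: List[int]) -> List[int]:
    """Toy greedy verification of one draft chain (hypothetical helper).

    draft_tokens:  tokens proposed by the draft model, one per drafting step.
    target_tokens: the target model's greedy choice at each drafted position,
                   plus one extra position after the last drafted token.
    Returns the tokens emitted by this verification step.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have one more entry than draft_tokens")

    emitted: List[int] = []
    for position, drafted in enumerate(draft_tokens):
        if drafted != target_tokens[position]:
            # First mismatch: emit the target's token and drop the rest of the draft.
            emitted.append(target_tokens[position])
            return emitted
        emitted.append(drafted)

    # Every drafted token matched, so the target's extra token is emitted as well.
    emitted.append(target_tokens[-1])
    return emitted


# Five drafted tokens, but the third one disagrees with the target model.
print(accept_greedy([5, 9, 2, 7, 1], [5, 9, 4, 7, 1, 3]))
```

In this example only the first two drafted tokens survive and the last two drafting steps are wasted, which is the kind of rejection cascade a deeper draft risks.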

You can find the best combinations of these parameters with [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py).

In the documentation below, we set `--cuda-graph-max-bs` to a small value for faster engine startup. For your own workloads, please tune the above parameters together with `--cuda-graph-max-bs`, `--max-running-requests`, and `--mem-fraction-static` for the best performance.
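
Before running a sweep, it can help to enumerate the combinations worth trying. The sketch below is not taken from `bench_speculative.py`; it is a hypothetical helper that builds a grid and drops combinations whose draft-token budget exceeds the number of nodes a draft tree could contain. The bound assumes EAGLE-2 style tree growth (the first step proposes `topk` candidates, and each later step expands `topk` kept nodes into `topk` children each), which may not match the checks SGLang actually performs.

```python
from itertools import product
from typing import Iterable, List, Tuple


def max_tree_nodes(num_steps: int, topk: int) -> int:
    """Upper bound on draft-tree size, root included (assumed growth model)."""
    return 1 + topk + (num_steps - 1) * topk * topk


def candidate_configs(
    steps: Iterable[int], topks: Iterable[int], draft_tokens: Iterable[int]
) -> List[Tuple[int, int, int]]:
    """Return (num_steps, topk, num_draft_tokens) triples that fit the bound."""
    return [
        (s, k, n)
        for s, k, n in product(steps, topks, draft_tokens)
        if n <= max_tree_nodes(s, k)
    ]


configs = candidate_configs(steps=[3, 5], topks=[1, 4], draft_tokens=[8, 16])
for num_steps, topk, num_draft_tokens in configs:
    print(
        f"--speculative-num-steps {num_steps} "
        f"--speculative-eagle-topk {topk} "
        f"--speculative-num-draft-tokens {num_draft_tokens}"
    )
```

Under this assumed bound, a chain (`topk = 1`) of depth 3 or 5 holds at most 4 or 6 tokens, so the budgets of 8 and 16 are filtered out for it and only the `topk = 4` combinations remain.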

### EAGLE-2 Decoding

You can enable EAGLE-2 decoding by setting `--speculative-algorithm EAGLE` and choosing an appropriate model.

In [1]:
from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process

import openai

In [2]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 3 \
    --speculative-eagle-topk 4 --speculative-num-draft-tokens 16 --cuda-graph-max-bs 8 --log-level warning
"""
)

wait_for_server(f"http://localhost:{port}")


In [3]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")
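
To compare speculative settings on your own prompts, one option is to time a request and divide the number of generated tokens by the elapsed time. The helper below is a rough sketch rather than a benchmark harness: `measure_tokens_per_second` is a hypothetical name, it times a single request including prefill and network overhead, and it assumes the callable returns the number of completion tokens.

```python
import time
from typing import Callable


def measure_tokens_per_second(send_request: Callable[[], int]) -> float:
    """Time one request and return generated tokens per second.

    send_request must perform the request and return the number of
    completion tokens it produced.
    """
    start = time.perf_counter()
    completion_tokens = send_request()
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        raise ValueError("elapsed time too small to measure")
    return completion_tokens / elapsed


# Example with the OpenAI-compatible client from the cell above
# (requires the server to be running):
#
# tps = measure_tokens_per_second(
#     lambda: client.chat.completions.create(
#         model="meta-llama/Llama-2-7b-chat-hf",
#         messages=[{"role": "user", "content": "List 3 countries and their capitals."}],
#         temperature=0,
#         max_tokens=64,
#     ).usage.completion_tokens
# )
# print(f"{tps:.1f} tokens/s")
```

A single short request is noisy, so averaging over several prompts gives a steadier figure.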

In [4]:
terminate_process(server_process)

### EAGLE-2 Decoding with `torch.compile`

You can also enable `torch.compile` for further optimizations and optionally set `--torch-compile-max-bs`:


In [5]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 \
    --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction-static 0.6 \
    --enable-torch-compile --torch-compile-max-bs 2 --log-level warning
"""
)

wait_for_server(f"http://localhost:{port}")


In [6]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")

In [7]:
terminate_process(server_process)

### EAGLE-2 Decoding via Frequency-Ranked Speculative Sampling

By using a truncated vocabulary of high-frequency tokens in the draft model, EAGLE speculative decoding reduces `lm_head` computational overhead and accelerates the pipeline without degrading quality. For more details, check out [the paper](https://arxiv.org/pdf/arXiv:2502.14856).
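
The idea can be sketched in plain Python. This is an illustration only, under stated assumptions: the function names are hypothetical, the corpus and weights are toy values, and neither the real token-map file format nor SGLang's kernels are reproduced here. The first function ranks token ids by frequency and keeps the most common ones; the second scores only that subset, so the cost of the `lm_head` product scales with the subset size rather than with the full vocabulary.

```python
from collections import Counter
from typing import List, Sequence


def top_frequency_token_ids(corpus: Sequence[Sequence[int]], keep: int) -> List[int]:
    """Return the `keep` most frequent token ids (ties broken by smaller id)."""
    counts = Counter(token for sequence in corpus for token in sequence)
    ranked = sorted(counts, key=lambda token: (-counts[token], token))
    return ranked[:keep]


def draft_argmax(
    hidden: Sequence[float],
    lm_head: Sequence[Sequence[float]],
    token_map: Sequence[int],
) -> int:
    """Pick the best token while scoring only the rows listed in token_map.

    lm_head holds one weight row per full-vocabulary token id; the returned
    value is a full-vocabulary id.
    """

    def logit(token_id: int) -> float:
        return sum(h * w for h, w in zip(hidden, lm_head[token_id]))

    return max(token_map, key=logit)


# Toy corpus over a 4-token vocabulary; tokens 2 and 3 are the most frequent.
corpus = [[1, 2, 2, 3], [2, 3, 3, 0]]
token_map = top_frequency_token_ids(corpus, keep=2)

# Toy lm_head weights (one row per token id) and a toy hidden state.
lm_head = [[0.1, 0.0], [0.9, 0.9], [0.2, 0.5], [0.4, 0.1]]
hidden = [1.0, 1.0]

print(token_map)
print(draft_argmax(hidden, lm_head, token_map))
```

Here the full-vocabulary argmax would be token 1, but the draft never scores it and proposes token 2 instead. Output quality is still preserved because the target model verifies every drafted token against its full vocabulary; a miss like this only costs a rejected draft token.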

In our implementation, set `--speculative-token-map` to enable the optimization. You can get the high-frequency tokens for FR-Spec from [this model](https://huggingface.co/thunlp/LLaMA3-Instruct-8B-FR-Spec), or download them directly from [this repo](https://github.com/thunlp/FR-Spec/tree/main?tab=readme-ov-file#prepare-fr-spec-vocabulary-subset).

Thanks for the contribution from [Weilin Zhao](https://github.com/Achazwl) and [Zhousx](https://github.com/Zhou-sx). 

In [8]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3-8B-Instruct --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B --speculative-num-steps 5 \
    --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 \
    --speculative-token-map thunlp/LLaMA3-Instruct-8B-FR-Spec/freq_32768.pt \
    --mem-fraction-static 0.7 --cuda-graph-max-bs 2 --dtype float16 --log-level warning
"""
)

wait_for_server(f"http://localhost:{port}")

In [9]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")

In [10]:
terminate_process(server_process)

### EAGLE-3 Decoding

You can enable EAGLE-3 decoding by setting `--speculative-algorithm EAGLE3` and choosing an appropriate model.

In [11]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B --speculative-num-steps 5 \
    --speculative-eagle-topk 8 --speculative-num-draft-tokens 32 --mem-fraction 0.6 \
    --cuda-graph-max-bs 2 --dtype float16 --log-level warning
"""
)

wait_for_server(f"http://localhost:{port}")

In [12]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")

In [13]:
terminate_process(server_process)

## Multi Token Prediction

We support [MTP (Multi-Token Prediction)](https://arxiv.org/pdf/2404.19737) in SGLang through speculative decoding. The example below uses the `XiaomiMiMo/MiMo-7B-RL` model; for DeepSeek MTP usage, refer to the [DeepSeek doc](../basic_usage/deepseek.md#multi-token-prediction).

In [14]:
server_process, port = launch_server_cmd(
    """
    python3 -m sglang.launch_server --model-path XiaomiMiMo/MiMo-7B-RL --host 0.0.0.0 --trust-remote-code \
    --speculative-algorithm EAGLE --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
    --mem-fraction 0.5 --log-level warning
"""
)

wait_for_server(f"http://localhost:{port}")

In [15]:
import requests

url = f"http://localhost:{port}/v1/chat/completions"

data = {
    "model": "XiaomiMiMo/MiMo-7B-RL",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

response = requests.post(url, json=data)
print_highlight(response.json())

In [16]:
terminate_process(server_process)

## References

The EAGLE process is as follows:

- Within EAGLE, the draft model predicts the next feature vector, i.e. the last hidden state of the original LLM, using the feature sequence $(f_1, ..., f_k)$ and the token sequence $(t_2, ..., t_{k+1})$.
- The next token is then sampled from $p_{k+2}=\text{LMHead}(f_{k+1})$. Afterwards, the two sequences are extended in a tree style (branching out into multiple potential continuations, with the branching factor per step controlled by the `speculative_eagle_topk` parameter) to ensure a more coherent connection of context, and are given as input again.
- EAGLE-2 additionally uses the draft model to evaluate how probable certain branches in the draft tree are, dynamically stopping the expansion of unlikely branches. After the expansion phase, reranking is employed to select only the top `speculative_num_draft_tokens` final nodes as draft tokens.
- EAGLE-3 removes the feature prediction objective, incorporates low- and mid-layer features, and is trained in an on-policy manner.

This enhances drafting accuracy in two ways: operating on features instead of tokens gives more regular inputs, and additionally passing the tokens from the next timestep minimizes randomness effects from sampling. Furthermore, the dynamic adjustment of the draft tree and the selection of reranked final nodes further increase the acceptance rate of draft tokens. For more details, see the [EAGLE-2](https://arxiv.org/abs/2406.16858) and [EAGLE-3](https://arxiv.org/abs/2503.01840) papers.
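The interplay of tree expansion and reranking can be sketched in a few lines of Python. This is a toy illustration rather than SGLang's implementation: `draft_probs` is a hypothetical stand-in for the draft model, a node's score is assumed to be the product of draft probabilities along its path, and the argument names only mirror `speculative_num_steps`, `speculative_eagle_topk`, and `speculative_num_draft_tokens`.

```python
import heapq


def expand_draft_tree(draft_probs, root, num_steps, topk, num_draft_tokens):
    """Toy EAGLE-2 style drafting.

    draft_probs(path) returns a dict mapping candidate token -> probability
    (hypothetical interface standing in for the draft model).
    Returns the selected nodes as (score, path) pairs, best first.
    """
    frontier = [(1.0, (root,))]
    all_nodes = []
    for _ in range(num_steps):
        children = []
        for score, path in frontier:
            probs = draft_probs(path)
            # Branch into the topk most probable continuations of this node.
            best = heapq.nlargest(topk, probs.items(), key=lambda kv: kv[1])
            for tok, p in best:
                children.append((score * p, path + (tok,)))
        all_nodes.extend(children)
        # Only the topk most probable new nodes are expanded further,
        # so unlikely branches stop growing.
        frontier = heapq.nlargest(topk, children, key=lambda node: node[0])
    # Reranking: keep the highest-scoring nodes across all depths.
    return heapq.nlargest(num_draft_tokens, all_nodes, key=lambda node: node[0])


def toy_draft_probs(path):
    # Hypothetical fixed distribution that ignores the path.
    return {"a": 0.6, "b": 0.3, "c": 0.1}


selected = expand_draft_tree(
    toy_draft_probs, "<root>", num_steps=2, topk=2, num_draft_tokens=4
)
for score, path in selected:
    print(f"{score:.2f}", " -> ".join(path))
```

Because a child's score never exceeds its parent's, the highest-scoring nodes tend to form a connected subtree, which is what the target model then verifies in a single forward pass.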


For guidance on how to train your own EAGLE model, please see the [EAGLE repo](https://github.com/SafeAILab/EAGLE/tree/main?tab=readme-ov-file#train).