# Speculative Decoding

SGLang provides an EAGLE-based (EAGLE-2/EAGLE-3) speculative decoding option. The implementation is optimized for speed and efficiency and is considered among the fastest in open-source LLM engines.

### Performance Highlights

The table below shows the throughput gains from EAGLE-3 decoding for LLaMA-Instruct 3.1 8B, tested on MT-Bench.
For further details, see the [EAGLE3 paper](https://arxiv.org/pdf/2503.01840).

| Method | Throughput (tokens/s) |
|--------|-----------------------|
| SGLang (w/o speculative, 1x H100) | 158.34 |
| SGLang + EAGLE-2 (1x H100) | 244.10 |
| SGLang + EAGLE-3 (1x H100) | 373.25 |
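
The relative speedups implied by these figures can be worked out directly. The short calculation below uses only the three throughput values reported in the table; the dictionary keys and the `speedup` helper are illustrative names.

```python
# Throughput figures (tokens/s) copied from the table in this section.
throughput = {
    "baseline": 158.34,  # SGLang without speculative decoding
    "EAGLE-2": 244.10,
    "EAGLE-3": 373.25,
}


def speedup(method: str, reference: str = "baseline") -> float:
    """Return how many times faster `method` is than `reference`."""
    return throughput[method] / throughput[reference]


for name in ("EAGLE-2", "EAGLE-3"):
    print(f"{name}: {speedup(name):.2f}x over the baseline")
```

In this benchmark, EAGLE-2 reaches roughly 1.54x and EAGLE-3 roughly 2.36x the baseline throughput.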

## EAGLE Decoding

The following parameters control EAGLE speculative decoding:
* `speculative_draft_model_path`: Path to the draft model. This parameter is required.
* `speculative_num_steps`: Depth of autoregressive drafting. A larger value extends the speculation range but increases the risk of rejection cascades. Default is 5.
* `speculative_eagle_topk`: Branching factor per step. A larger value improves candidate diversity and can raise the acceptance rate, at the cost of higher memory and compute consumption. Default is 4.
* `speculative_num_draft_tokens`: Maximum parallel verification capacity. A larger value allows deeper tree evaluation but increases GPU memory usage. Default is 8.

These parameters are the same for EAGLE-2 and EAGLE-3.
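
As a rough sketch of how these parameters map onto a launch command, the helper below assembles the command line from Python values. The function name `build_eagle_launch_command` is hypothetical and not part of SGLang; the flags are the ones used in the launch commands in this document, and the sample values are taken from the EAGLE-2 example.

```python
import shlex


def build_eagle_launch_command(
    model: str,
    draft_model: str,
    num_steps: int,
    eagle_topk: int,
    num_draft_tokens: int,
    algorithm: str = "EAGLE",
) -> str:
    """Assemble an `sglang.launch_server` command line (hypothetical helper)."""
    args = [
        "python3", "-m", "sglang.launch_server",
        "--model", model,
        "--speculative-algorithm", algorithm,
        "--speculative-draft-model-path", draft_model,
        "--speculative-num-steps", str(num_steps),
        "--speculative-eagle-topk", str(eagle_topk),
        "--speculative-num-draft-tokens", str(num_draft_tokens),
    ]
    # shlex.join quotes any argument that needs it for a POSIX shell.
    return shlex.join(args)


command = build_eagle_launch_command(
    model="meta-llama/Llama-2-7b-chat-hf",
    draft_model="lmsys/sglang-EAGLE-llama2-chat-7B",
    num_steps=3,
    eagle_topk=4,
    num_draft_tokens=16,
)
print(command)
```

Keeping the values in one place like this makes it easier to vary a single parameter between runs.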

You can find the best combinations of these parameters with [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py).
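
If you prefer to script a sweep yourself, one simple approach is to enumerate every combination of candidate values and benchmark each one. The sketch below only generates the combinations; the candidate values are taken from the examples in this document, `candidate_configs` is a hypothetical helper, and this is not a description of how `bench_speculative.py` works internally.

```python
from itertools import product


def candidate_configs(steps_options, topk_options, draft_token_options):
    """Yield one parameter dict per combination of the candidate values."""
    for steps, topk, draft_tokens in product(
        steps_options, topk_options, draft_token_options
    ):
        yield {
            "speculative_num_steps": steps,
            "speculative_eagle_topk": topk,
            "speculative_num_draft_tokens": draft_tokens,
        }


# Candidate values are illustrative; choose ranges that fit your GPU memory.
configs = list(candidate_configs([3, 5], [4, 8], [16, 64]))
print(f"{len(configs)} configurations to benchmark")
for config in configs[:2]:
    print(config)
```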

In the examples below, we set `--cuda-graph-max-bs` to a small value for faster engine startup. For your own workloads, tune the above parameters together with `--cuda-graph-max-bs`, `--max-running-requests`, and `--mem-fraction-static` for the best performance.

### EAGLE-2 Decoding

You can enable EAGLE-2 decoding by setting `--speculative-algorithm EAGLE` and choosing an appropriate model.

In [1]:
from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process

import openai

In [2]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 3 \
    --speculative-eagle-topk 4 --speculative-num-draft-tokens 16 --cuda-graph-max-bs 8 --log-level warning
"""
)

wait_for_server(f"http://localhost:{port}")

In [3]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")
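
To get a rough feel for decoding speed on your own prompts, you can time a request and divide the number of generated tokens by the elapsed time. The sketch below is a minimal illustration: `timed_completion` is a hypothetical helper, and a stub stands in for the server so the snippet runs on its own. It assumes the response exposes `usage.completion_tokens`, as the OpenAI-compatible chat completion response does.

```python
import time
from types import SimpleNamespace


def timed_completion(create_fn, **request_kwargs):
    """Call `create_fn(**request_kwargs)`; return (response, tokens per second)."""
    start = time.perf_counter()
    response = create_fn(**request_kwargs)
    elapsed = time.perf_counter() - start
    tokens = response.usage.completion_tokens
    rate = tokens / elapsed if elapsed > 0 else float("inf")
    return response, rate


def fake_create(**kwargs):
    """Stub standing in for `client.chat.completions.create` (illustration only)."""
    time.sleep(0.01)
    return SimpleNamespace(usage=SimpleNamespace(completion_tokens=64))


response, tokens_per_second = timed_completion(fake_create, max_tokens=64)
print(f"{tokens_per_second:.1f} tokens/s")
```

With a running server, pass `client.chat.completions.create` and the same keyword arguments as in the request above instead of the stub. A single timed request includes prefill and network overhead, so the result is not directly comparable to the benchmark throughput figures.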

In [4]:
terminate_process(server_process)

### EAGLE-2 Decoding with `torch.compile`

You can also enable `torch.compile` for further optimizations and optionally set `--torch-compile-max-bs`:


In [5]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 \
    --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction-static 0.6 \
    --enable-torch-compile --torch-compile-max-bs 2 --log-level warning
"""
)

wait_for_server(f"http://localhost:{port}")

In [6]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")

In [7]:
terminate_process(server_process)

### EAGLE-2 Decoding via Frequency-Ranked Speculative Sampling

By using a truncated, high-frequency token vocabulary in the draft model, EAGLE speculative decoding reduces the `lm_head` computational overhead and accelerates the pipeline without quality degradation. For more details, see [the paper](https://arxiv.org/pdf/2502.14856).
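
The idea can be illustrated with a small pure-Python sketch: pick the most frequent token ids from a corpus, let the draft head score only those, and map the winning index back to a full-vocabulary id. This is a conceptual illustration under simplified assumptions, not SGLang's or FR-Spec's implementation; the helper names and the toy corpus are hypothetical.

```python
from collections import Counter


def build_token_map(token_ids, keep):
    """Return the `keep` most frequent token ids.

    The position in the returned list is the reduced-vocabulary index;
    the value is the corresponding full-vocabulary token id.
    """
    counts = Counter(token_ids)
    return [token for token, _ in counts.most_common(keep)]


def reduced_argmax_to_full_id(reduced_logits, token_map):
    """Map the best-scoring reduced-vocabulary index back to a full token id."""
    best_index = max(range(len(reduced_logits)), key=reduced_logits.__getitem__)
    return token_map[best_index]


# Toy corpus of token ids: id 7 appears four times, id 3 three times.
corpus = [7, 3, 7, 9, 3, 7, 1, 3, 7]
token_map = build_token_map(corpus, keep=2)

# The draft head now scores 2 tokens instead of the full vocabulary.
draft_logits = [0.2, 1.5]
print(token_map, reduced_argmax_to_full_id(draft_logits, token_map))
```

The smaller the kept subset, the cheaper the draft `lm_head` projection, while the target model still verifies drafts against the full vocabulary.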

In our implementation, set `--speculative-token-map` to enable the optimization. You can get the high-frequency tokens for FR-Spec from [this model](https://huggingface.co/thunlp/LLaMA3-Instruct-8B-FR-Spec), or download them directly from [this repo](https://github.com/thunlp/FR-Spec/tree/main?tab=readme-ov-file#prepare-fr-spec-vocabulary-subset).

Thanks to [Weilin Zhao](https://github.com/Achazwl) and [Zhousx](https://github.com/Zhou-sx) for the contribution.

In [8]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3-8B-Instruct --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B --speculative-num-steps 5 \
    --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --speculative-token-map thunlp/LLaMA3-Instruct-8B-FR-Spec/freq_32768.pt \
    --mem-fraction-static 0.7 --cuda-graph-max-bs 2 --dtype float16 --log-level warning
"""
)

wait_for_server(f"http://localhost:{port}")

In [9]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")

In [10]:
terminate_process(server_process)

### EAGLE-3 Decoding

You can enable EAGLE-3 decoding by setting `--speculative-algorithm EAGLE3` and choosing an appropriate model.

In [11]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B --speculative-num-steps 5 \
    --speculative-eagle-topk 8 --speculative-num-draft-tokens 32 --mem-fraction-static 0.6 \
    --cuda-graph-max-bs 2 --dtype float16 --log-level warning
"""
)

wait_for_server(f"http://localhost:{port}")

In [12]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")

In [13]:
terminate_process(server_process)

## Multi Token Prediction

We support [MTP (Multi-Token Prediction)](https://arxiv.org/pdf/2404.19737) in SGLang through speculative decoding. We use the `XiaomiMiMo/MiMo-7B-RL` model as an example here. For DeepSeek MTP usage, refer to the [DeepSeek doc](../basic_usage/deepseek.md#multi-token-prediction).

In [14]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path XiaomiMiMo/MiMo-7B-RL --host 0.0.0.0 --trust-remote-code \
    --speculative-algorithm EAGLE --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
    --mem-fraction-static 0.5 --log-level warning
"""
)

wait_for_server(f"http://localhost:{port}")

In [15]:
import requests

url = f"http://localhost:{port}/v1/chat/completions"

data = {
    "model": "XiaomiMiMo/MiMo-7B-RL",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

response = requests.post(url, json=data)
print_highlight(response.json())

In [16]:
terminate_process(server_process)
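
The MTP launch command above drafts a single token per step (`--speculative-num-steps 1`, `--speculative-eagle-topk 1`). As a rough way to reason about what deeper drafting could buy, the sketch below estimates the expected number of tokens emitted per verification pass for a chain-shaped draft. It assumes every draft token is accepted independently with the same probability `alpha`, which is a simplification and not something SGLang reports in this form; `expected_tokens_per_step` is a hypothetical helper name.

```python
def expected_tokens_per_step(alpha, num_steps):
    """Expected tokens emitted per verification pass for a chain draft.

    Assumes each draft token is accepted independently with probability
    `alpha`; this is a modelling assumption, not a measured property.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    # The i-th term is the probability that the first i draft tokens were all
    # accepted; the i = 0 term is the token the target model always contributes.
    return sum(alpha**i for i in range(num_steps + 1))


# Compare a one-token draft with a five-token draft at a made-up acceptance rate.
for steps in (1, 5):
    print(steps, round(expected_tokens_per_step(0.8, steps), 3))
```

Under this model the gain from extra steps flattens quickly when `alpha` is low, which is one reason to benchmark parameter combinations on your own workload rather than simply maximizing the draft depth.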

## References

The EAGLE process is as follows:

- In EAGLE, the draft model predicts the next feature vector $f_{k+1}$, i.e. the last hidden state of the original LLM, from the feature sequence $(f_1, \dots, f_k)$ and the token sequence $(t_2, \dots, t_{k+1})$.
- The next token is then sampled from $p_{k+2}=\text{LMHead}(f_{k+1})$. Afterwards, the two sequences are extended in a tree style (branching into multiple potential continuations, with the branching factor per step controlled by the `speculative_eagle_topk` parameter) to ensure a more coherent connection of context, and are fed back as input.
- EAGLE-2 additionally uses the draft model to evaluate how probable certain branches in the draft tree are, dynamically stopping the expansion of unlikely branches. After the expansion phase, reranking is employed to select only the top `speculative_num_draft_tokens` final nodes as draft tokens.
- EAGLE-3 removes the feature prediction objective, incorporates low- and mid-layer features, and is trained in an on-policy manner.

This enhances drafting accuracy in two ways: operating on features instead of tokens gives the draft model more regular inputs, and additionally passing the tokens from the next timestep reduces the randomness introduced by sampling. Furthermore, the dynamic adjustment of the draft tree and the selection of reranked final nodes further increase the acceptance rate of draft tokens. For more details, see the [EAGLE-2](https://arxiv.org/abs/2406.16858) and [EAGLE-3](https://arxiv.org/abs/2503.01840) papers.
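
The tree expansion, reranking, and acceptance steps described above can be illustrated with a small toy in plain Python. This is only a sketch under stated assumptions, not SGLang's implementation: `toy_draft_probs` and `toy_target_token` are hypothetical stand-ins for the draft and target models, their outputs are made up, the arguments `topk` and `num_final` only loosely mirror `speculative_eagle_topk` and `speculative_num_draft_tokens`, and verification is shown for greedy decoding one path at a time, whereas a real engine verifies the whole tree in a single batched forward pass.

```python
import heapq


def toy_draft_probs(prefix):
    """Hypothetical draft model: next-token probabilities for a prefix."""
    # Made-up distribution that depends only on the prefix length.
    if len(prefix) % 2 == 0:
        return {"a": 0.6, "b": 0.3, "c": 0.1}
    return {"a": 0.2, "b": 0.45, "c": 0.35}


def toy_target_token(prefix):
    """Hypothetical target model: greedy next token for a prefix."""
    return "a" if len(prefix) % 2 == 0 else "c"


def build_draft_tree(num_steps, topk, num_final):
    """Expand a draft tree, then keep the `num_final` highest-scoring nodes."""
    frontier = [((), 1.0)]  # (token path from the root, cumulative probability)
    all_nodes = []
    for _ in range(num_steps):
        candidates = []
        for path, score in frontier:
            probs = toy_draft_probs(path)
            # Branch into the `topk` most likely next tokens of this node.
            best = heapq.nlargest(topk, probs.items(), key=lambda kv: kv[1])
            for token, p in best:
                candidates.append((path + (token,), score * p))
        all_nodes.extend(candidates)
        # Only the `topk` most probable nodes of this layer are expanded further.
        frontier = heapq.nlargest(topk, candidates, key=lambda n: n[1])
    # Rerank every drafted node and keep the best ones for verification.
    return heapq.nlargest(num_final, all_nodes, key=lambda n: n[1])


def verify(path):
    """Greedy verification: keep draft tokens while the target model agrees."""
    accepted = ()
    for token in path:
        expected = toy_target_token(accepted)
        if token != expected:
            # First mismatch: emit the target's own token and stop.
            return accepted + (expected,)
        accepted += (token,)
    # Every draft token matched; the target contributes one extra token.
    return accepted + (toy_target_token(accepted),)


draft = build_draft_tree(num_steps=3, topk=2, num_final=4)
outputs = [verify(path) for path, _ in draft]
best = max(outputs, key=len)
for (path, score), out in zip(draft, outputs):
    print("".join(path), round(score, 3), "->", "".join(out))
print("emitted:", "".join(best))
```

In this toy run the longest accepted output comes from a branch that was not the draft model's top choice at every step, which is the motivation for drafting a tree rather than a single chain. Every emitted token is one the target model would have produced on its own under greedy decoding.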


For guidance on how to train your own EAGLE model, please see the [EAGLE repo](https://github.com/SafeAILab/EAGLE/tree/main?tab=readme-ov-file#train).