# Speculative Decoding

SGLang provides EAGLE-based speculative decoding (EAGLE-2 and EAGLE-3). The implementation is optimized for speed and efficiency and is considered to be among the fastest in open-source LLM engines.

## Performance Highlights

The table below shows the throughput gains from EAGLE-3 decoding for LLaMA-Instruct 3.1 8B on MT-Bench: roughly 2.4x over the non-speculative baseline and about 1.5x over EAGLE-2.
For further details, see the [EAGLE3 paper](https://arxiv.org/pdf/2503.01840).

| Method | Throughput (tokens/s) |
|--------|-----------------------|
| SGLang (w/o speculative, 1x H100) | 158.34 |
| SGLang + EAGLE-2 (1x H100) | 244.10 |
| SGLang + EAGLE-3 (1x H100) | 373.25 |
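
As a quick check on the table, the relative speedups can be computed directly from the reported throughputs. The snippet below is plain arithmetic over the three figures above; the dictionary keys and the `speedup` helper are illustrative names only.

```python
# Throughput figures copied from the table above (tokens/s, 1x H100).
throughput = {
    "baseline": 158.34,
    "eagle2": 244.10,
    "eagle3": 373.25,
}


def speedup(method: str, reference: str = "baseline") -> float:
    """Return how many times faster `method` is than `reference`."""
    return throughput[method] / throughput[reference]


for name in ("eagle2", "eagle3"):
    print(f"{name}: {speedup(name):.2f}x over baseline")
```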

## EAGLE Decoding

The following parameters control EAGLE speculative decoding:
* `speculative_draft_model_path`: Path to the draft model. This parameter is required.
* `speculative_num_steps`: Depth of autoregressive drafting. A larger value extends the speculation range but increases the risk of rejection cascades. Default is 5.
* `speculative_eagle_topk`: Branching factor per step. A larger value improves candidate diversity and tends to raise the acceptance rate, at the cost of higher memory and compute consumption. Default is 4.
* `speculative_num_draft_tokens`: Maximum number of draft tokens verified in parallel. A larger value allows deeper tree evaluation but increases GPU memory usage. Default is 8.

These parameters are the same for EAGLE-2 and EAGLE-3.
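
On the command line, each parameter name is written with hyphens instead of underscores, as in the launch commands later in this document. A minimal sketch of that mapping follows; `build_speculative_flags` is a hypothetical helper, not part of SGLang, and the values are the ones used in the EAGLE-2 example.

```python
import shlex


def build_speculative_flags(algorithm: str, **params) -> str:
    """Turn keyword parameters into launch_server flags (hypothetical helper)."""
    parts = ["--speculative-algorithm", algorithm]
    for name, value in params.items():
        # speculative_num_steps -> --speculative-num-steps
        parts.append("--" + name.replace("_", "-"))
        parts.append(str(value))
    # Quote each piece so the result is safe to paste into a shell command.
    return " ".join(shlex.quote(p) for p in parts)


flags = build_speculative_flags(
    "EAGLE",
    speculative_draft_model_path="lmsys/sglang-EAGLE-llama2-chat-7B",
    speculative_num_steps=3,
    speculative_eagle_topk=4,
    speculative_num_draft_tokens=16,
)
print(flags)
```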

You can find the best combinations of these parameters with [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py).
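
When planning a sweep, it can help to enumerate the candidate combinations up front. The sketch below builds such a grid with `itertools.product`; the candidate values are assumptions chosen for illustration, and this is not how `bench_speculative.py` itself is implemented.

```python
from itertools import product

# Candidate values are illustrative assumptions, not recommended settings.
num_steps_options = [3, 5]
topk_options = [1, 4, 8]
num_draft_tokens_options = [8, 16, 32]

# One dictionary per combination, keyed by the parameter names listed above.
grid = [
    {
        "speculative_num_steps": steps,
        "speculative_eagle_topk": topk,
        "speculative_num_draft_tokens": draft_tokens,
    }
    for steps, topk, draft_tokens in product(
        num_steps_options, topk_options, num_draft_tokens_options
    )
]

print(f"{len(grid)} combinations to benchmark")
print(grid[0])
```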

In the examples below, `--cuda-graph-max-bs` is set to a small value so that the engine starts faster. For your own workloads, tune the parameters above together with `--cuda-graph-max-bs`, `--max-running-requests`, and `--mem-fraction-static` for the best performance.

### EAGLE-2 Decoding

You can enable EAGLE-2 decoding by setting `--speculative-algorithm EAGLE` and passing a matching draft model via `--speculative-draft-model-path`.

In [1]:
from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process

import openai

In [2]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 3 \
    --speculative-eagle-topk 4 --speculative-num-draft-tokens 16 --cuda-graph-max-bs 8 --log-level warning
"""
)

wait_for_server(f"http://localhost:{port}")

In [3]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")
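
To compare configurations on your own prompts, one option is to time the request and divide the completion token count by the elapsed time. The sketch below assumes an OpenAI-style response exposing `usage.completion_tokens`; `timed_call` and `decode_throughput` are hypothetical helpers, and a stand-in object replaces the live response so the snippet runs without a server.

```python
import time
from types import SimpleNamespace


def timed_call(fn):
    """Run fn() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def decode_throughput(response, elapsed_s: float) -> float:
    """Completion tokens per second for an OpenAI-style response object."""
    return response.usage.completion_tokens / elapsed_s


# Stand-in response so the sketch runs offline. With a live server, pass
# `lambda: client.chat.completions.create(...)` to timed_call instead.
fake = SimpleNamespace(usage=SimpleNamespace(completion_tokens=64))
response, elapsed = timed_call(lambda: fake)

# A fixed 0.5 s is used here only to keep the printed number stable.
print(f"{decode_throughput(response, 0.5):.1f} tokens/s")
```

A number measured this way covers the whole request, including prefill and HTTP overhead, so it is not directly comparable to the benchmark table above.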

In [4]:
terminate_process(server_process)

### EAGLE-2 Decoding with `torch.compile`

You can also enable `torch.compile` with `--enable-torch-compile` for further optimization, and optionally set `--torch-compile-max-bs`:


In [5]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 \
    --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction-static 0.6 \
    --enable-torch-compile --torch-compile-max-bs 2 --log-level warning
"""
)

wait_for_server(f"http://localhost:{port}")

In [6]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")

In [7]:
terminate_process(server_process)

### EAGLE-2 Decoding via Frequency-Ranked Speculative Sampling

By using a truncated vocabulary of high-frequency tokens in the draft model, EAGLE speculative decoding reduces `lm_head` computational overhead and accelerates the pipeline without quality degradation. For more details, see [the paper](https://arxiv.org/pdf/2502.14856).
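
To make the idea concrete, here is a toy, pure-Python sketch of scoring only a frequency-ranked subset of `lm_head` rows and mapping the winner back to a full-vocabulary token id. The weights, the `token_map` list, and both function names are made up for illustration; this is not SGLang's implementation.

```python
# Toy lm_head: vocabulary of 6 tokens, hidden size 2 (illustrative values).
lm_head = [
    [0.1, 0.0],  # token 0
    [0.9, 0.2],  # token 1
    [0.0, 0.1],  # token 2
    [0.4, 0.8],  # token 3
    [0.2, 0.2],  # token 4
    [0.7, 0.9],  # token 5
]
# Hypothetical token map: full-vocabulary ids of the high-frequency tokens.
token_map = [1, 3, 5]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def draft_next_token(hidden, weights, hot_ids):
    """Score only the rows in hot_ids, then map the winner to a full-vocab id."""
    reduced_logits = [dot(weights[i], hidden) for i in hot_ids]
    best = max(range(len(hot_ids)), key=reduced_logits.__getitem__)
    return hot_ids[best]


def full_next_token(hidden, weights):
    """Reference: greedy choice over the entire vocabulary."""
    logits = [dot(row, hidden) for row in weights]
    return max(range(len(weights)), key=logits.__getitem__)


hidden = [1.0, 0.5]
print(draft_next_token(hidden, lm_head, token_map), full_next_token(hidden, lm_head))
```

In this toy case the draft step computes 3 dot products instead of 6 and still picks the same token. Because the target model verifies every drafted token, a token missing from the subset would be expected to lower the acceptance rate rather than change the final output.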

In our implementation, set `--speculative-token-map` to enable this optimization. You can get the high-frequency tokens used by FR-Spec from [this model](https://huggingface.co/thunlp/LLaMA3-Instruct-8B-FR-Spec), or download them directly from [this repo](https://github.com/thunlp/FR-Spec/tree/main?tab=readme-ov-file#prepare-fr-spec-vocabulary-subset).

Thanks to [Weilin Zhao](https://github.com/Achazwl) and [Zhousx](https://github.com/Zhou-sx) for this contribution.

In [8]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3-8B-Instruct --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B --speculative-num-steps 5 \
    --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --speculative-token-map thunlp/LLaMA3-Instruct-8B-FR-Spec/freq_32768.pt \
    --mem-fraction-static 0.7 --cuda-graph-max-bs 2 --dtype float16 --log-level warning
"""
)

wait_for_server(f"http://localhost:{port}")

In [9]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")

In [10]:
terminate_process(server_process)

### EAGLE-3 Decoding

You can enable EAGLE-3 decoding by setting `--speculative-algorithm EAGLE3` and passing a matching EAGLE-3 draft model via `--speculative-draft-model-path`.

In [11]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B --speculative-num-steps 5 \
    --speculative-eagle-topk 8 --speculative-num-draft-tokens 32 --mem-fraction-static 0.6 \
    --cuda-graph-max-bs 2 --dtype float16 --log-level warning
"""
)

wait_for_server(f"http://localhost:{port}")



In [12]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")

In [13]:
terminate_process(server_process)

## Multi Token Prediction

SGLang supports [Multi-Token Prediction (MTP)](https://arxiv.org/pdf/2404.19737) through speculative decoding. The example below uses the XiaomiMiMo/MiMo-7B-RL model. For DeepSeek MTP usage, refer to the [DeepSeek documentation](../basic_usage/deepseek.md#multi-token-prediction).

In [14]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path XiaomiMiMo/MiMo-7B-RL --host 0.0.0.0 --trust-remote-code \
    --speculative-algorithm EAGLE --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
    --mem-fraction-static 0.5 --log-level warning
"""
)

wait_for_server(f"http://localhost:{port}")



In [15]:
import requests

url = f"http://localhost:{port}/v1/chat/completions"

data = {
    "model": "XiaomiMiMo/MiMo-7B-RL",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

response = requests.post(url, json=data)
print_highlight(response.json())

In [16]:
terminate_process(server_process)

## References

The EAGLE process works as follows:

- In EAGLE, the draft model predicts the next feature vector, i.e. the last hidden state of the original LLM, using the feature sequence $(f_1, \dots, f_k)$ and the token sequence $(t_2, \dots, t_{k+1})$.
- The next token is then sampled from $p_{k+2}=\text{LMHead}(f_{k+1})$. Afterwards, the two sequences are extended in a tree style, branching out into multiple potential continuations, with the branching factor per step controlled by the `speculative_eagle_topk` parameter. This keeps the context more coherently connected, and the extended sequences are fed back in as input.
- EAGLE-2 additionally uses the draft model to evaluate how probable certain branches in the draft tree are, dynamically stopping the expansion of unlikely branches. After the expansion phase, reranking is employed to select only the top `speculative_num_draft_tokens` final nodes as draft tokens.
- EAGLE-3 removes the feature prediction objective, incorporates low- and mid-layer features, and is trained in an on-policy manner.
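
The expansion and reranking steps above can be illustrated with a small, self-contained sketch. This is not SGLang's implementation: `toy_draft_probs` is a hypothetical stand-in for the draft model, `build_draft_tree` is an illustrative helper, and the arguments only mirror the roles of `speculative_num_steps`, `speculative_eagle_topk`, and `speculative_num_draft_tokens`. How the engine counts the root token toward the draft budget may differ from this simplification.

```python
import heapq
import math


def toy_draft_probs(prefix):
    """Hypothetical draft model: (token, probability) pairs for a prefix.

    A real EAGLE draft head derives these from hidden-state features; here
    the distribution is a fixed pattern rotated by the prefix length.
    """
    vocab = ["a", "b", "c", "d"]
    weights = [0.5, 0.3, 0.15, 0.05]
    shift = len(prefix) % len(vocab)
    return [(vocab[(i + shift) % len(vocab)], w) for i, w in enumerate(weights)]


def build_draft_tree(prefix, num_steps, topk, num_draft_tokens,
                     draft_probs=toy_draft_probs):
    """Expand a draft tree level by level, then rerank and keep the best nodes.

    Each node is (cumulative log-probability, token path from the root).
    """
    frontier = [(0.0, ())]  # root: empty path with log-probability 0
    all_nodes = []
    for _ in range(num_steps):
        children = []
        for logp, path in frontier:
            probs = draft_probs(tuple(prefix) + path)
            # Each expanded node contributes its `topk` most likely children.
            for token, p in heapq.nlargest(topk, probs, key=lambda tp: tp[1]):
                children.append((logp + math.log(p), path + (token,)))
        all_nodes.extend(children)
        # Only the `topk` most probable nodes of this level are expanded further.
        frontier = heapq.nlargest(topk, children, key=lambda n: n[0])
    # Rerank every candidate and keep the `num_draft_tokens` most probable ones.
    return heapq.nlargest(num_draft_tokens, all_nodes, key=lambda n: n[0])


if __name__ == "__main__":
    for logp, path in build_draft_tree(("x",), num_steps=3, topk=2, num_draft_tokens=3):
        print(path, round(math.exp(logp), 3))
```

Because cumulative probabilities can only shrink along a path, a node never outranks its parent, which is why keeping the highest-scoring nodes tends to yield a connected subtree that the target model can verify in a single pass.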

Operating on features rather than tokens gives the draft model more regular inputs, and additionally passing in the tokens from the next timestep reduces the randomness introduced by sampling. Together, these improve drafting accuracy. The dynamic adjustment of the draft tree and the selection of reranked final nodes further increase the acceptance rate of draft tokens. For more details, see the [EAGLE-2](https://arxiv.org/abs/2406.16858) and [EAGLE-3](https://arxiv.org/abs/2503.01840) papers.
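
To make "acceptance" concrete, the sketch below shows greedy verification of a single draft path, as commonly described for speculative decoding at temperature 0. It is a simplification under stated assumptions: `greedy_verify`, `toy_target_next_token`, and `REFERENCE` are hypothetical names, the target model is replaced by a lookup into a fixed sequence, and a real engine scores all draft positions (and a whole tree rather than one path) in a single forward pass.

```python
# Hypothetical "ground truth" that the toy target model always continues with.
REFERENCE = ["the", "cat", "sat", "on", "the", "mat"]


def toy_target_next_token(prefix):
    """Stand-in for the target model's greedy next token after `prefix`."""
    return REFERENCE[len(prefix)]


def greedy_verify(draft_tokens, target_next_token):
    """Return the tokens emitted after verifying one draft path.

    Draft tokens are accepted while they match the target's greedy choice.
    At the first mismatch the target's token is emitted instead, and if the
    whole draft matches, the target contributes one extra token.
    """
    emitted = []
    for token in draft_tokens:
        expected = target_next_token(emitted)
        if token != expected:
            emitted.append(expected)  # replace the rejected draft token
            return emitted
        emitted.append(token)  # draft token accepted
    emitted.append(target_next_token(emitted))  # extra token after full acceptance
    return emitted


if __name__ == "__main__":
    print(greedy_verify(["the", "cat", "ran"], toy_target_next_token))
    print(greedy_verify(["the", "cat"], toy_target_next_token))
```

Each verification step therefore emits at least one token, and a better draft model raises the number of tokens emitted per target-model pass.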


For guidance on how to train your own EAGLE model, please see the [EAGLE repo](https://github.com/SafeAILab/EAGLE/tree/main?tab=readme-ov-file#train).