# SGLang Native APIs

Apart from the OpenAI-compatible APIs, the SGLang Runtime also provides native server APIs. This document introduces the following APIs:

- `/generate` (text generation model)
- `/get_model_info`
- `/get_server_info`
- `/health`
- `/health_generate`
- `/flush_cache`
- `/update_weights_from_disk`
- `/encode` (embedding model)
- `/classify` (reward model)

We mainly use `requests` to test these APIs in the following examples. You can also use `curl`.
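For readers who prefer `curl`, the request shape is the same as with `requests`: a plain GET, or a POST with a JSON body. The sketch below prints the `curl` command that corresponds to such a call. `to_curl` is a hypothetical helper written for this illustration (it is not part of SGLang), and port `30000` is only a placeholder for the port returned by `launch_server_cmd`.

```python
import json
import shlex


def to_curl(url, payload=None):
    """Return a curl command equivalent to a GET (no payload) or a JSON POST."""
    if payload is None:
        return f"curl {shlex.quote(url)}"
    body = json.dumps(payload)
    return (
        f"curl -X POST {shlex.quote(url)} "
        f"-H 'Content-Type: application/json' "
        f"-d {shlex.quote(body)}"
    )


# Placeholder port; in this notebook the real port comes from launch_server_cmd.
print(to_curl("http://localhost:30000/health"))
print(to_curl("http://localhost:30000/generate", {"text": "What is the capital of France?"}))
```

The printed commands can be pasted into a shell once a server is running on the chosen port.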

## Launch A Server

In [1]:
import requests
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

from sglang.utils import wait_for_server, print_highlight, terminate_process


server_process, port = launch_server_cmd(
    "python -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --host 0.0.0.0"
)

wait_for_server(f"http://localhost:{port}")

INFO 02-19 22:14:44 __init__.py:190] Automatically detected platform cuda.


2025-02-19 22:14:46,587 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend


[2025-02-19 22:14:47] server_args=ServerArgs(model_path='meta-llama/Llama-3.2-1B-Instruct', tokenizer_path='meta-llama/Llama-3.2-1B-Instruct', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Llama-3.2-1B-Instruct', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, host='0.0.0.0', port=30044, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, stream_output=False, random_seed=520123744, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False,

INFO 02-19 22:14:51 __init__.py:190] Automatically detected platform cuda.


INFO 02-19 22:14:51 __init__.py:190] Automatically detected platform cuda.


2025-02-19 22:14:53,314 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-02-19 22:14:53,410 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend


[2025-02-19 22:14:54 TP0] Init torch distributed begin.


[2025-02-19 22:14:55 TP0] Load weight begin. avail mem=50.95 GB


[2025-02-19 22:14:55 TP0] Using model weights format ['*.safetensors']
[2025-02-19 22:14:55 TP0] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.06it/s]

[2025-02-19 22:14:56 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=48.52 GB
[2025-02-19 22:14:56 TP0] KV Cache is allocated. K size: 21.20 GB, V size: 21.20 GB.
[2025-02-19 22:14:56 TP0] Memory pool end. avail mem=4.08 GB


[2025-02-19 22:14:56 TP0] Capture cuda graph begin. This can take up to several minutes.
  0%|          | 0/23 [00:00<?, ?it/s]

2025-02-19 22:14:57,303 - INFO - flashinfer.jit: Loading JIT ops: batch_decode_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_64_head_dim_vo_64_posenc_0_use_swa_False_use_logits_cap_False
2025-02-19 22:14:57,328 - INFO - flashinfer.jit: Finished loading JIT ops: batch_decode_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_64_head_dim_vo_64_posenc_0_use_swa_False_use_logits_cap_False


100%|██████████| 23/23 [00:06<00:00,  3.70it/s]
[2025-02-19 22:15:02 TP0] Capture cuda graph end. Time elapsed: 6.22 s


[2025-02-19 22:15:03 TP0] max_total_num_tokens=1389614, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4097, context_len=131072
[2025-02-19 22:15:03] INFO:     Started server process [1839263]
[2025-02-19 22:15:03] INFO:     Waiting for application startup.
[2025-02-19 22:15:03] INFO:     Application startup complete.
[2025-02-19 22:15:03] INFO:     Uvicorn running on http://0.0.0.0:30044 (Press CTRL+C to quit)


[2025-02-19 22:15:03] INFO:     127.0.0.1:37514 - "GET /v1/models HTTP/1.1" 200 OK


[2025-02-19 22:15:04] INFO:     127.0.0.1:37528 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-02-19 22:15:04 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0


2025-02-19 22:15:04,973 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_64_head_dim_vo_64_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-02-19 22:15:05,000 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_64_head_dim_vo_64_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False


[2025-02-19 22:15:05] INFO:     127.0.0.1:37542 - "POST /generate HTTP/1.1" 200 OK
[2025-02-19 22:15:05] The server is fired up and ready to roll!




NOTE: Typically, the server runs in a separate terminal. In this notebook, we run the server and notebook code together, so their outputs are combined. To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue. We are running these notebooks in a parallel CI environment, so the throughput is not representative of the actual performance.
                    


## Generate (text generation model)
Generate completions. This is similar to the `/v1/completions` endpoint of the OpenAI API. Detailed parameters are described in the [sampling parameters](../references/sampling_params.md) reference.

In [2]:
url = f"http://localhost:{port}/generate"
data = {"text": "What is the capital of France?"}

response = requests.post(url, json=data)
print_highlight(response.json())

[2025-02-19 22:15:08 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 1, cache hit rate: 6.67%, token usage: 0.00, #running-req: 0, #queue-req: 0
2025-02-19 22:15:08,747 - INFO - flashinfer.jit: Loading JIT ops: cascade
2025-02-19 22:15:08,758 - INFO - flashinfer.jit: Finished loading JIT ops: cascade
[2025-02-19 22:15:08 TP0] Decode batch. #running-req: 1, #token: 41, token usage: 0.00, gen throughput (token/s): 7.21, #queue-req: 0


[2025-02-19 22:15:08 TP0] Decode batch. #running-req: 1, #token: 81, token usage: 0.00, gen throughput (token/s): 359.53, #queue-req: 0
[2025-02-19 22:15:09] INFO:     127.0.0.1:37558 - "POST /generate HTTP/1.1" 200 OK
{'text': ' Paris\nWhat is the capital of Spain? Madrid\nWhat is the capital of Italy? Rome\nWhat is the capital of India?\nDo you think this is acceptable? \nNo. I would expect a multilingual, diverse and imaginative designer to be able to design a nation. One cannot just place a brief, acohesive language on a map and expect people to understand the bigger picture. What do you think?', 'meta_info': {'id': '8e671af5c8ae46f8bb5e9585354bb949', 'finish_reason': {'type': 'stop', 'matched': 128009}, 'prompt_tokens': 8, 'completion_tokens': 85, 'cached_tokens': 1}}
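The request above relies on default sampling settings, which is why the completion runs on well past the answer. As a minimal sketch, the helpers below build a `/generate` body with explicit `sampling_params` and pull the usual fields out of a response. The helper names are hypothetical, the default values are arbitrary choices for illustration, and the sampling parameters reference linked above remains the authoritative list of accepted keys.

```python
def build_generate_payload(prompt, max_new_tokens=32, temperature=0.0):
    """Build a /generate request body with explicit sampling parameters."""
    return {
        "text": prompt,
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }


def summarize_generation(response_json):
    """Extract the commonly inspected fields from a /generate response."""
    meta = response_json["meta_info"]
    return {
        "text": response_json["text"],
        "finish_type": meta["finish_reason"]["type"],
        "prompt_tokens": meta["prompt_tokens"],
        "completion_tokens": meta["completion_tokens"],
    }


# Response shape copied from the output above, with the text shortened.
sample = {
    "text": " Paris",
    "meta_info": {
        "id": "8e671af5c8ae46f8bb5e9585354bb949",
        "finish_reason": {"type": "stop", "matched": 128009},
        "prompt_tokens": 8,
        "completion_tokens": 85,
        "cached_tokens": 1,
    },
}

print(build_generate_payload("What is the capital of France?"))
print(summarize_generation(sample))
```

With a running server, the payload would be sent as `requests.post(url, json=build_generate_payload(...))` and the result of `response.json()` passed to `summarize_generation`.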


## Get Model Info

Get basic information about the served model. The response contains:

- `model_path`: The path/name of the model.
- `is_generation`: Whether the model is served as a generation model (`True`) or as an embedding model (`False`).
- `tokenizer_path`: The path/name of the tokenizer.

In [3]:
url = f"http://localhost:{port}/get_model_info"

response = requests.get(url)
response_json = response.json()
print_highlight(response_json)
assert response_json["model_path"] == "meta-llama/Llama-3.2-1B-Instruct"
assert response_json["is_generation"] is True
assert response_json["tokenizer_path"] == "meta-llama/Llama-3.2-1B-Instruct"
assert response_json.keys() == {"model_path", "is_generation", "tokenizer_path"}

[2025-02-19 22:15:09] INFO:     127.0.0.1:50414 - "GET /get_model_info HTTP/1.1" 200 OK
{'model_path': 'meta-llama/Llama-3.2-1B-Instruct', 'tokenizer_path': 'meta-llama/Llama-3.2-1B-Instruct', 'is_generation': True}
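One practical use of `is_generation` is to decide which native endpoint to call, since `/encode` is only available for embedding models (see the Encode section). The following is a small sketch under that assumption; `native_endpoint_for` is a hypothetical name, and reward models, which this document also launches with `--is-embedding`, would use `/classify` instead.

```python
def native_endpoint_for(model_info):
    """Pick /generate for generation models and /encode otherwise."""
    return "/generate" if model_info["is_generation"] else "/encode"


# Model info copied from the output above.
model_info = {
    "model_path": "meta-llama/Llama-3.2-1B-Instruct",
    "tokenizer_path": "meta-llama/Llama-3.2-1B-Instruct",
    "is_generation": True,
}

print(native_endpoint_for(model_info))
```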


## Get Server Info
Get server information, including CLI arguments, token limits, and memory pool sizes.
- Note: `get_server_info` merges the following deprecated endpoints:
  - `get_server_args`
  - `get_memory_pool_size`
  - `get_max_total_num_tokens`

In [4]:
# get_server_info

url = f"http://localhost:{port}/get_server_info"

response = requests.get(url)
print_highlight(response.text)

[2025-02-19 22:15:09] INFO:     127.0.0.1:50424 - "GET /get_server_info HTTP/1.1" 200 OK
{"model_path":"meta-llama/Llama-3.2-1B-Instruct","tokenizer_path":"meta-llama/Llama-3.2-1B-Instruct","tokenizer_mode":"auto","load_format":"auto","trust_remote_code":false,"dtype":"auto","kv_cache_dtype":"auto","quantization_param_path":null,"quantization":null,"context_length":null,"device":"cuda","served_model_name":"meta-llama/Llama-3.2-1B-Instruct","chat_template":null,"is_embedding":false,"revision":null,"skip_tokenizer_init":false,"host":"0.0.0.0","port":30044,"mem_fraction_static":0.88,"max_running_requests":null,"max_total_tokens":null,"chunked_prefill_size":8192,"max_prefill_tokens":16384,"schedule_policy":"lpm","schedule_conservativeness":1.0,"cpu_offload_gb":0,"prefill_only_one_req":false,"tp_size":1,"stream_interval":1,"stream_output":false,"random_seed":520123744,"constrained_json_whitespace_pattern":null,"watchdog_timeout":300,"download_dir":null,"base_gpu_id":0,"log_level":"info","lo
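The raw response is a long JSON object, so it is often easier to parse it and keep only the fields of interest. The sketch below assumes nothing beyond the response being valid JSON; `pick_server_fields` is a hypothetical helper, and the sample string contains only a few fields that are visible in the output above.

```python
import json


def pick_server_fields(server_info_text, keys):
    """Parse the /get_server_info body and keep only the requested keys.

    Keys missing from the response map to None instead of raising.
    """
    info = json.loads(server_info_text)
    return {key: info.get(key) for key in keys}


# A shortened stand-in for response.text, using values shown above.
sample_text = (
    '{"model_path": "meta-llama/Llama-3.2-1B-Instruct", "port": 30044, '
    '"mem_fraction_static": 0.88, "chunked_prefill_size": 8192}'
)

print(pick_server_fields(sample_text, ["port", "mem_fraction_static", "tp_size"]))
```

With a live server, `response.text` from the cell above can be passed in directly.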

## Health Check
- `/health`: Check the health of the server.
- `/health_generate`: Check the health of the server by generating one token.

In [5]:
url = f"http://localhost:{port}/health_generate"

response = requests.get(url)
print_highlight(response.text)

[2025-02-19 22:15:09 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 0, cache hit rate: 6.25%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-19 22:15:09] INFO:     127.0.0.1:50434 - "GET /health_generate HTTP/1.1" 200 OK



In [6]:
url = f"http://localhost:{port}/health"

response = requests.get(url)
print_highlight(response.text)

[2025-02-19 22:15:09] INFO:     127.0.0.1:50442 - "GET /health HTTP/1.1" 200 OK
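Both health endpoints return an empty body, so the HTTP status code carries the result. As a rough sketch of how `/health` might be used for readiness polling outside this notebook, the helper below retries until the endpoint answers with 200. The function names, retry count, and delay are assumptions made for illustration; inside the notebook, `wait_for_server` from `sglang.utils` already covers this. The standard library is used here so the block has no dependencies, and `requests.get` would work equally well as the probe.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status.
        return False


def wait_until_healthy(base_url, probe=http_ok, attempts=30, delay=1.0, sleep=time.sleep):
    """Poll `{base_url}/health` until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(f"{base_url}/health"):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A call such as `wait_until_healthy(f"http://localhost:{port}")` returns `True` as soon as the server responds. The `probe` and `sleep` parameters exist so the loop can be exercised without a server.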



## Flush Cache

Flush the radix cache. The flush is also triggered automatically when the model weights are updated through the `/update_weights_from_disk` API. When there are running or waiting requests, the flush is not performed.

In [7]:
# flush cache

url = f"http://localhost:{port}/flush_cache"

response = requests.post(url)
print_highlight(response.text)

[2025-02-19 22:15:09] INFO:     127.0.0.1:50454 - "POST /flush_cache HTTP/1.1" 200 OK
Cache flushed.
Please check backend logs for more details. (When there are running or waiting requests, the operation will not be performed.)

[2025-02-19 22:15:09 TP0] Cache flushed successfully!


## Update Weights From Disk

Update model weights from disk without restarting the server. This only works when the new weights have the same architecture and parameter size as the model being served.

SGLang supports the `update_weights_from_disk` API for continuous evaluation during training: save a checkpoint to disk, then update the server weights from disk.


In [8]:
# successful update with same architecture and size

url = f"http://localhost:{port}/update_weights_from_disk"
data = {"model_path": "meta-llama/Llama-3.2-1B"}

response = requests.post(url, json=data)
print_highlight(response.text)
assert response.json()["success"] is True
assert response.json()["message"] == "Succeeded to update model weights."
assert response.json().keys() == {"success", "message"}

[2025-02-19 22:15:09] Start update_weights. Load format=auto
[2025-02-19 22:15:09 TP0] Update engine weights online from disk begin. avail mem=7.71 GB


[2025-02-19 22:15:09 TP0] Using model weights format ['*.safetensors']


[2025-02-19 22:15:09 TP0] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.49it/s]

[2025-02-19 22:15:09 TP0] Update weights end.
[2025-02-19 22:15:09 TP0] Cache flushed successfully!
[2025-02-19 22:15:09] INFO:     127.0.0.1:50464 - "POST /update_weights_from_disk HTTP/1.1" 200 OK
{"success":true,"message":"Succeeded to update model weights."}


In [9]:
# failed update, here caused by a model path that does not exist

url = f"http://localhost:{port}/update_weights_from_disk"
data = {"model_path": "meta-llama/Llama-3.2-1B-wrong"}

response = requests.post(url, json=data)
response_json = response.json()
print_highlight(response_json)
assert response_json["success"] is False
assert response_json["message"] == (
    "Failed to get weights iterator: "
    "meta-llama/Llama-3.2-1B-wrong"
    " (repository not found)."
)

[2025-02-19 22:15:09] Start update_weights. Load format=auto
[2025-02-19 22:15:09 TP0] Update engine weights online from disk begin. avail mem=7.71 GB


[2025-02-19 22:15:10 TP0] Failed to get weights iterator: meta-llama/Llama-3.2-1B-wrong (repository not found).
[2025-02-19 22:15:10] INFO:     127.0.0.1:50474 - "POST /update_weights_from_disk HTTP/1.1" 400 Bad Request
{'success': False, 'message': 'Failed to get weights iterator: meta-llama/Llama-3.2-1B-wrong (repository not found).'}
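The two cells above show that a failed update is reported twice: through the HTTP status (400 instead of 200) and through `success: false` in the body. In a training loop it may be convenient to turn that into an exception. The following is a minimal sketch of such a check; `check_weight_update` is a hypothetical helper, and the sample bodies are copied from the outputs above.

```python
def check_weight_update(status_code, body):
    """Return the server message on success, raise RuntimeError otherwise."""
    if status_code == 200 and body.get("success") is True:
        return body["message"]
    raise RuntimeError(
        f"weight update failed (HTTP {status_code}): {body.get('message')}"
    )


ok_body = {"success": True, "message": "Succeeded to update model weights."}
bad_body = {
    "success": False,
    "message": "Failed to get weights iterator: meta-llama/Llama-3.2-1B-wrong (repository not found).",
}

print(check_weight_update(200, ok_body))
try:
    check_weight_update(400, bad_body)
except RuntimeError as err:
    print(err)
```

With a live server, the arguments would be `response.status_code` and `response.json()`.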


## Encode (embedding model)

Encode text into embeddings. Note that this API is only available for [embedding models](openai_api_embeddings.html#openai-apis-embedding) and raises an error for generation models.
Therefore, we launch a new server to serve an embedding model.

In [10]:
terminate_process(server_process)

embedding_process, port = launch_server_cmd(
    """
python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct \
    --host 0.0.0.0 --is-embedding
"""
)

wait_for_server(f"http://localhost:{port}")

INFO 02-19 22:15:13 __init__.py:190] Automatically detected platform cuda.


2025-02-19 22:15:15,783 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend


[2025-02-19 22:15:16] server_args=ServerArgs(model_path='Alibaba-NLP/gte-Qwen2-7B-instruct', tokenizer_path='Alibaba-NLP/gte-Qwen2-7B-instruct', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=None, device='cuda', served_model_name='Alibaba-NLP/gte-Qwen2-7B-instruct', chat_template=None, is_embedding=True, revision=None, skip_tokenizer_init=False, host='0.0.0.0', port=30045, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, stream_output=False, random_seed=316518618, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=Fals

INFO 02-19 22:15:20 __init__.py:190] Automatically detected platform cuda.
INFO 02-19 22:15:20 __init__.py:190] Automatically detected platform cuda.


2025-02-19 22:15:22,247 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-02-19 22:15:22,258 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend


[2025-02-19 22:15:22 TP0] Downcasting torch.float32 to torch.float16.


[2025-02-19 22:15:23 TP0] Overlap scheduler is disabled for embedding models.
[2025-02-19 22:15:23 TP0] Downcasting torch.float32 to torch.float16.
[2025-02-19 22:15:23 TP0] Init torch distributed begin.


[2025-02-19 22:15:24 TP0] Load weight begin. avail mem=78.84 GB


[2025-02-19 22:15:24 TP0] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/7 [00:00<?, ?it/s]


Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:13<00:00,  1.89s/it]

[2025-02-19 22:15:38 TP0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=48.91 GB


[2025-02-19 22:15:38 TP0] KV Cache is allocated. K size: 19.73 GB, V size: 19.73 GB.
[2025-02-19 22:15:38 TP0] Memory pool end. avail mem=8.02 GB


[2025-02-19 22:15:38 TP0] max_total_num_tokens=738737, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2886, context_len=131072
[2025-02-19 22:15:38] INFO:     Started server process [1842187]
[2025-02-19 22:15:38] INFO:     Waiting for application startup.
[2025-02-19 22:15:38] INFO:     Application startup complete.
[2025-02-19 22:15:38] INFO:     Uvicorn running on http://0.0.0.0:30045 (Press CTRL+C to quit)


[2025-02-19 22:15:39] INFO:     127.0.0.1:47258 - "GET /v1/models HTTP/1.1" 200 OK


[2025-02-19 22:15:39] INFO:     127.0.0.1:47270 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-02-19 22:15:39 TP0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0


2025-02-19 22:15:40,539 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-02-19 22:15:40,562 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[2025-02-19 22:15:40] INFO:     127.0.0.1:47274 - "POST /encode HTTP/1.1" 200 OK
[2025-02-19 22:15:40] The server is fired up and ready to roll!






In [11]:
# successful encode for embedding model

url = f"http://localhost:{port}/encode"
data = {"model": "Alibaba-NLP/gte-Qwen2-7B-instruct", "text": "Once upon a time"}

response = requests.post(url, json=data)
response_json = response.json()
print_highlight(f"Text embedding (first 10): {response_json['embedding'][:10]}")

[2025-02-19 22:15:44 TP0] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-19 22:15:44] INFO:     127.0.0.1:47276 - "POST /encode HTTP/1.1" 200 OK
Text embedding (first 10): [0.00830841064453125, 0.000698089599609375, -0.008087158203125, -0.0006761550903320312, 0.01438140869140625, -0.00899505615234375, 0.012359619140625, 0.00208282470703125, 0.00620269775390625, -0.003021240234375]
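A common next step with embeddings is to compare two texts by cosine similarity. The sketch below shows that computation in plain Python on short made-up vectors. It is an illustration of one way to use the `embedding` field, not something the `/encode` API does itself, and `cosine_similarity` is a helper defined here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for two responses' "embedding" lists.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

With a live server, the two arguments would be the `embedding` lists returned by two separate `/encode` requests.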


In [12]:
terminate_process(embedding_process)

## Classify (reward model)

SGLang Runtime also supports reward models. Here we use a reward model to score the quality of a pair of generations for the same prompt.

In [13]:
# Note that SGLang now treats embedding models and reward models as the same type of models.
# This will be updated in the future.

reward_process, port = launch_server_cmd(
    """
python -m sglang.launch_server --model-path Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 --host 0.0.0.0 --is-embedding
"""
)

wait_for_server(f"http://localhost:{port}")

INFO 02-19 22:15:48 __init__.py:190] Automatically detected platform cuda.


2025-02-19 22:15:50,117 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend


[2025-02-19 22:15:50] server_args=ServerArgs(model_path='Skywork/Skywork-Reward-Llama-3.1-8B-v0.2', tokenizer_path='Skywork/Skywork-Reward-Llama-3.1-8B-v0.2', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=None, device='cuda', served_model_name='Skywork/Skywork-Reward-Llama-3.1-8B-v0.2', chat_template=None, is_embedding=True, revision=None, skip_tokenizer_init=False, host='0.0.0.0', port=30047, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, stream_output=False, random_seed=109241283, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False

INFO 02-19 22:15:54 __init__.py:190] Automatically detected platform cuda.


INFO 02-19 22:15:54 __init__.py:190] Automatically detected platform cuda.


2025-02-19 22:15:56,704 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-02-19 22:15:56,713 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend


[2025-02-19 22:15:57 TP0] Overlap scheduler is disabled for embedding models.
[2025-02-19 22:15:58 TP0] Init torch distributed begin.


[2025-02-19 22:15:58 TP0] Load weight begin. avail mem=78.84 GB


[2025-02-19 22:15:59 TP0] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:04<00:00,  1.17s/it]



[2025-02-19 22:16:04 TP0] Load weight end. type=LlamaForSequenceClassification, dtype=torch.bfloat16, avail mem=64.72 GB
[2025-02-19 22:16:04 TP0] KV Cache is allocated. K size: 27.63 GB, V size: 27.63 GB.
[2025-02-19 22:16:04 TP0] Memory pool end. avail mem=8.35 GB


[2025-02-19 22:16:05 TP0] max_total_num_tokens=452712, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072
[2025-02-19 22:16:05] INFO:     Started server process [1845583]
[2025-02-19 22:16:05] INFO:     Waiting for application startup.
[2025-02-19 22:16:05] INFO:     Application startup complete.
[2025-02-19 22:16:05] INFO:     Uvicorn running on http://0.0.0.0:30047 (Press CTRL+C to quit)


[2025-02-19 22:16:05] INFO:     127.0.0.1:46730 - "GET /v1/models HTTP/1.1" 200 OK


[2025-02-19 22:16:06] INFO:     127.0.0.1:46744 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-02-19 22:16:06 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0


2025-02-19 22:16:06,755 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-02-19 22:16:06,776 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False


[2025-02-19 22:16:06] INFO:     127.0.0.1:46748 - "POST /encode HTTP/1.1" 200 OK
[2025-02-19 22:16:06] The server is fired up and ready to roll!






In [14]:
from transformers import AutoTokenizer

PROMPT = (
    "What is the range of the numeric output of a sigmoid node in a neural network?"
)

RESPONSE1 = "The output of a sigmoid node is bounded between -1 and 1."
RESPONSE2 = "The output of a sigmoid node is bounded between 0 and 1."

CONVS = [
    [{"role": "user", "content": PROMPT}, {"role": "assistant", "content": RESPONSE1}],
    [{"role": "user", "content": PROMPT}, {"role": "assistant", "content": RESPONSE2}],
]

tokenizer = AutoTokenizer.from_pretrained("Skywork/Skywork-Reward-Llama-3.1-8B-v0.2")
prompts = tokenizer.apply_chat_template(CONVS, tokenize=False)

url = f"http://localhost:{port}/classify"
data = {"model": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2", "text": prompts}

responses = requests.post(url, json=data).json()
for response in responses:
    print_highlight(f"reward: {response['embedding'][0]}")

[2025-02-19 22:16:10 TP0] Prefill batch. #new-seq: 1, #new-token: 68, #cached-token: 1, cache hit rate: 1.32%, token usage: 0.00, #running-req: 0, #queue-req: 0
2025-02-19 22:16:10,948 - INFO - flashinfer.jit: Loading JIT ops: cascade
2025-02-19 22:16:10,960 - INFO - flashinfer.jit: Finished loading JIT ops: cascade
[2025-02-19 22:16:10 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 62, cache hit rate: 43.45%, token usage: 0.00, #running-req: 1, #queue-req: 0
[2025-02-19 22:16:11] INFO:     127.0.0.1:58284 - "POST /classify HTTP/1.1" 200 OK
reward: -24.25
reward: 0.9375
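In the output above, the correct answer (bounded between 0 and 1) receives the higher reward. A typical follow-up is to rank the candidates by that score. The sketch below does this on the values printed above, assuming, as in the cell, that each item of the `/classify` response carries its score in `embedding[0]`. `rank_by_reward` is a hypothetical helper.

```python
def rank_by_reward(candidates, classify_results):
    """Pair each candidate with its reward and sort from best to worst."""
    rewards = [result["embedding"][0] for result in classify_results]
    return sorted(zip(candidates, rewards), key=lambda pair: pair[1], reverse=True)


candidates = [
    "The output of a sigmoid node is bounded between -1 and 1.",
    "The output of a sigmoid node is bounded between 0 and 1.",
]
# Rewards copied from the output above, in the same order as the candidates.
classify_results = [{"embedding": [-24.25]}, {"embedding": [0.9375]}]

for text, reward in rank_by_reward(candidates, classify_results):
    print(f"{reward:>8}: {text}")
```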


In [15]:
terminate_process(reward_process)

## Skip Tokenizer and Detokenizer

SGLang Runtime can also skip the tokenizer and detokenizer. This is useful, for example, when integrating with an RLHF workflow.

In [16]:
tokenizer_free_server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --skip-tokenizer-init
"""
)

wait_for_server(f"http://localhost:{port}")

INFO 02-19 22:16:15 __init__.py:190] Automatically detected platform cuda.


2025-02-19 22:16:16,848 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend


[2025-02-19 22:16:17] server_args=ServerArgs(model_path='meta-llama/Llama-3.2-1B-Instruct', tokenizer_path='meta-llama/Llama-3.2-1B-Instruct', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Llama-3.2-1B-Instruct', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=True, host='127.0.0.1', port=30048, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, stream_output=False, random_seed=872285751, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False

INFO 02-19 22:16:21 __init__.py:190] Automatically detected platform cuda.


INFO 02-19 22:16:21 __init__.py:190] Automatically detected platform cuda.


2025-02-19 22:16:23,042 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-02-19 22:16:23,203 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend


[2025-02-19 22:16:23 TP0] Init torch distributed begin.


[2025-02-19 22:16:24 TP0] Load weight begin. avail mem=78.84 GB


[2025-02-19 22:16:24 TP0] Using model weights format ['*.safetensors']
[2025-02-19 22:16:25 TP0] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.41it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.40it/s]

[2025-02-19 22:16:25 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=76.41 GB
[2025-02-19 22:16:25 TP0] KV Cache is allocated. K size: 33.48 GB, V size: 33.48 GB.
[2025-02-19 22:16:25 TP0] Memory pool end. avail mem=7.41 GB


[2025-02-19 22:16:25 TP0] Capture cuda graph begin. This can take up to several minutes.
  0%|          | 0/23 [00:00<?, ?it/s]

2025-02-19 22:16:26,575 - INFO - flashinfer.jit: Loading JIT ops: batch_decode_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_64_head_dim_vo_64_posenc_0_use_swa_False_use_logits_cap_False
2025-02-19 22:16:26,599 - INFO - flashinfer.jit: Finished loading JIT ops: batch_decode_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_64_head_dim_vo_64_posenc_0_use_swa_False_use_logits_cap_False


100%|██████████| 23/23 [00:05<00:00,  3.84it/s]
[2025-02-19 22:16:31 TP0] Capture cuda graph end. Time elapsed: 6.00 s
[2025-02-19 22:16:31 TP0] max_total_num_tokens=2193825, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4097, context_len=131072
[2025-02-19 22:16:31] INFO:     Started server process [1847939]
[2025-02-19 22:16:31] INFO:     Waiting for application startup.
[2025-02-19 22:16:31] INFO:     Application startup complete.
[2025-02-19 22:16:31] INFO:     Uvicorn running on http://127.0.0.1:30048 (Press CTRL+C to quit)


[2025-02-19 22:16:32] INFO:     127.0.0.1:58126 - "GET /v1/models HTTP/1.1" 200 OK


[2025-02-19 22:16:32] INFO:     127.0.0.1:58140 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-02-19 22:16:32 TP0] Prefill batch. #new-seq: 1, #new-token: 3, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0


2025-02-19 22:16:33,469 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_64_head_dim_vo_64_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-02-19 22:16:33,495 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_64_head_dim_vo_64_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[2025-02-19 22:16:33] INFO:     127.0.0.1:58142 - "POST /generate HTTP/1.1" 200 OK
[2025-02-19 22:16:33] The server is fired up and ready to roll!




NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and the notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
We run these notebooks in a parallel CI environment, so the throughput shown here is not representative of actual performance.
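
The startup log above reports `K size: 33.48 GB` together with `max_total_num_tokens=2193825`. As a rough sanity check, the sketch below reproduces that figure from the model shape. The layer count (16) and KV head count (8) are assumptions taken from the published `meta-llama/Llama-3.2-1B-Instruct` configuration, not from the log; the head dimension (64) and `bfloat16` dtype do appear in the log. The helper name `kv_bytes_per_token` is hypothetical.

```python
# Assumed model shape (from the published Llama-3.2-1B config, not from the log).
NUM_LAYERS = 16
NUM_KV_HEADS = 8
# These two values are visible in the log (head_dim_qk_64, dtype=torch.bfloat16).
HEAD_DIM = 64
DTYPE_BYTES = 2

# Reported by the server at startup.
MAX_TOTAL_NUM_TOKENS = 2_193_825


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes needed to store the K (or, equally, the V) entries of one token."""
    return num_layers * num_kv_heads * head_dim * dtype_bytes


per_token = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, DTYPE_BYTES)
k_cache_gib = per_token * MAX_TOTAL_NUM_TOKENS / 2**30

print(f"K bytes per token: {per_token}")    # prints "K bytes per token: 16384"
print(f"K cache size: {k_cache_gib:.2f}")   # prints "K cache size: 33.48"
```

The result matches the logged value only when dividing by 2**30, which suggests the "GB" in the log is measured in binary units. The V cache has the same shape, hence the identical size.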
                    


In [17]:
from transformers import AutoTokenizer

# Tokenize on the client and send token IDs instead of text.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

input_text = "What is the capital of France?"

input_tokens = tokenizer.encode(input_text)
print_highlight(f"Input Text: {input_text}")
print_highlight(f"Tokenized Input: {input_tokens}")

# Pass the prompt as `input_ids`; stop generation at the tokenizer's EOS token.
response = requests.post(
    f"http://localhost:{port}/generate",
    json={
        "input_ids": input_tokens,
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 256,
            "stop_token_ids": [tokenizer.eos_token_id],
        },
        "stream": False,
    },
)
output = response.json()
output_tokens = output["token_ids"]

# Decode the returned token IDs locally; special tokens are kept in the text.
output_text = tokenizer.decode(output_tokens, skip_special_tokens=False)
print_highlight(f"Tokenized Output: {output_tokens}")
print_highlight(f"Decoded Output: {output_text}")
print_highlight(f"Finish Reason: {output['meta_info']['finish_reason']}")

Input Text: What is the capital of France?
Tokenized Input: [128000, 3923, 374, 279, 6864, 315, 9822, 30]
[2025-02-19 22:16:37 TP0] Prefill batch. #new-seq: 1, #new-token: 8, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-19 22:16:37 TP0] Decode batch. #running-req: 1, #token: 41, token usage: 0.00, gen throughput (token/s): 6.87, #queue-req: 0


[2025-02-19 22:16:37 TP0] Decode batch. #running-req: 1, #token: 81, token usage: 0.00, gen throughput (token/s): 365.20, #queue-req: 0
[2025-02-19 22:16:37] INFO:     127.0.0.1:58144 - "POST /generate HTTP/1.1" 200 OK
Tokenized Output: [12366, 627, 791, 6864, 315, 9822, 374, 12366, 13, 12366, 374, 279, 1455, 95551, 3363, 304, 9822, 323, 374, 3967, 369, 1202, 9257, 3925, 11, 1989, 11, 11401, 11, 323, 36105, 13, 1102, 374, 1101, 2162, 311, 1690, 11495, 61024, 1778, 439, 279, 469, 3168, 301, 22703, 11, 44564, 41798, 57829, 11, 323, 279, 9928, 49606, 16730, 13, 12366, 374, 264, 5526, 31070, 9284, 323, 374, 3629, 14183, 311, 439, 279, 330, 13020, 315, 8828, 1, 4245, 311, 1202, 15360, 449, 279, 92931, 323, 279, 8753, 22910, 13, 128009]
Decoded Output:  Paris.
The capital of France is Paris. Paris is the most populous city in France and is known for its rich history, art, fashion, and cuisine. It is also home to many famous landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Lo
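
The request above can be wrapped in small helpers so the payload shape is built in one place. The following is a minimal offline sketch, not part of the SGLang API: `build_generate_payload` and `strip_trailing_stop_token` are hypothetical names, and the stop token ID `128009` is taken from the last element of the tokenized output shown above. The output IDs in the demo are an abbreviated sample (first two tokens plus the final stop token), not a full server response.

```python
from typing import Any, Dict, List, Sequence


def build_generate_payload(
    input_ids: Sequence[int],
    stop_token_ids: Sequence[int],
    max_new_tokens: int = 256,
) -> Dict[str, Any]:
    """Build the JSON body for a token-in, token-out /generate request."""
    return {
        "input_ids": list(input_ids),
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": max_new_tokens,
            "stop_token_ids": list(stop_token_ids),
        },
        "stream": False,
    }


def strip_trailing_stop_token(
    token_ids: Sequence[int], stop_token_ids: Sequence[int]
) -> List[int]:
    """Drop a final stop token so it does not show up in the decoded text."""
    if token_ids and token_ids[-1] in set(stop_token_ids):
        return list(token_ids[:-1])
    return list(token_ids)


# Token IDs copied from the example above.
input_ids = [128000, 3923, 374, 279, 6864, 315, 9822, 30]
payload = build_generate_payload(input_ids, stop_token_ids=[128009])

# Abbreviated sample of the returned IDs: two content tokens and the stop token.
sample_output_ids = [12366, 627, 128009]
print(strip_trailing_stop_token(sample_output_ids, [128009]))  # prints [12366, 627]
```

With a running server, `payload` could be passed as the `json=` argument of `requests.post`, and the stripped IDs handed to `tokenizer.decode`. Stripping the stop token is an alternative to decoding with `skip_special_tokens=True` when other special tokens should remain visible.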

In [18]:
terminate_process(tokenizer_free_server_process)