# Sending Requests
This notebook is a quick-start guide to sending chat completion requests to SGLang after installation.

- For Vision Language Models, see [OpenAI APIs - Vision](../backend/openai_api_vision.ipynb).
- For Embedding Models, see [OpenAI APIs - Embedding](../backend/openai_api_embeddings.ipynb) and [Encode (embedding model)](../backend/native_api.html#Encode-(embedding-model)).
- For Reward Models, see [Classify (reward model)](../backend/native_api.html#Classify-(reward-model)).

## Launch A Server

In [1]:
from sglang.test.test_utils import is_in_ci
from sglang.utils import wait_for_server, print_highlight, terminate_process

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

# This is equivalent to running the following command in your terminal

# python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0

server_process, port = launch_server_cmd(
    """
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
 --host 0.0.0.0
"""
)

wait_for_server(f"http://localhost:{port}")

INFO 02-19 22:17:44 __init__.py:190] Automatically detected platform cuda.


2025-02-19 22:17:45,903 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend


[2025-02-19 22:17:46] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Meta-Llama-3.1-8B-Instruct', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, host='0.0.0.0', port=30050, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, stream_output=False, random_seed=532364776, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable

INFO 02-19 22:17:50 __init__.py:190] Automatically detected platform cuda.
INFO 02-19 22:17:50 __init__.py:190] Automatically detected platform cuda.


2025-02-19 22:17:52,055 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-02-19 22:17:52,093 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend


[2025-02-19 22:17:53 TP0] Init torch distributed begin.


[2025-02-19 22:17:54 TP0] Load weight begin. avail mem=78.84 GB


[2025-02-19 22:17:54 TP0] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.32it/s]

[2025-02-19 22:17:58 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=63.75 GB


[2025-02-19 22:17:58 TP0] KV Cache is allocated. K size: 27.14 GB, V size: 27.14 GB.
[2025-02-19 22:17:58 TP0] Memory pool end. avail mem=8.37 GB
[2025-02-19 22:17:58 TP0] Capture cuda graph begin. This can take up to several minutes.
  0%|          | 0/23 [00:00<?, ?it/s]

2025-02-19 22:17:59,177 - INFO - flashinfer.jit: Loading JIT ops: batch_decode_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False
2025-02-19 22:17:59,200 - INFO - flashinfer.jit: Finished loading JIT ops: batch_decode_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False


100%|██████████| 23/23 [00:06<00:00,  3.54it/s]
[2025-02-19 22:18:05 TP0] Capture cuda graph end. Time elapsed: 6.49 s


[2025-02-19 22:18:05 TP0] max_total_num_tokens=444696, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072
[2025-02-19 22:18:05] INFO:     Started server process [1855414]
[2025-02-19 22:18:05] INFO:     Waiting for application startup.
[2025-02-19 22:18:05] INFO:     Application startup complete.
[2025-02-19 22:18:05] INFO:     Uvicorn running on http://0.0.0.0:30050 (Press CTRL+C to quit)


[2025-02-19 22:18:06] INFO:     127.0.0.1:32800 - "GET /v1/models HTTP/1.1" 200 OK


[2025-02-19 22:18:06] INFO:     127.0.0.1:32816 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-02-19 22:18:06 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0


2025-02-19 22:18:07,132 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-02-19 22:18:07,154 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False


[2025-02-19 22:18:07] INFO:     127.0.0.1:32822 - "POST /generate HTTP/1.1" 200 OK
[2025-02-19 22:18:07] The server is fired up and ready to roll!




NOTE: Typically, the server runs in a separate terminal. In this notebook, the server and the notebook code run together, so their outputs are combined. To improve clarity, server logs are displayed in the original black color, while notebook outputs are highlighted in blue. These notebooks run in a parallel CI environment, so the throughput shown is not representative of actual performance.
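
The launch cell blocks on `wait_for_server` until the server answers. As a rough sketch of what such a readiness check can look like outside the notebook, the block below polls `/v1/models` (the endpoint that receives a `GET` in the log above) using only the standard library. The helper names `http_ok` and `wait_until_ready` and the `probe` parameter are hypothetical and not part of SGLang, and this is not the implementation of `wait_for_server`; it also assumes that an HTTP 200 from `/v1/models` means the server is ready.

```python
import time
import urllib.request


def http_ok(url: str) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        # URLError and HTTPError are subclasses of OSError, so this also
        # covers refused connections and non-2xx responses.
        return False


def wait_until_ready(base_url, timeout_s=300.0, interval_s=1.0, probe=http_ok):
    """Poll `<base_url>/v1/models` until the probe succeeds or time runs out.

    Hypothetical helper for illustration. Returns True when the probe
    succeeded and False when the timeout expired first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(f"{base_url}/v1/models"):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

With a server started in another terminal, a call such as `wait_until_ready(f"http://localhost:{port}")` would return once the endpoint responds. The `probe` argument exists only so the loop can be exercised without a running server.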
                    


## Using cURL


In [2]:
import subprocess, json

curl_command = f"""
curl -s http://localhost:{port}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "messages": [{{"role": "user", "content": "What is the capital of France?"}}]}}'
"""

response = json.loads(subprocess.check_output(curl_command, shell=True))
print_highlight(response)

[2025-02-19 22:18:11 TP0] Prefill batch. #new-seq: 1, #new-token: 41, #cached-token: 1, cache hit rate: 2.04%, token usage: 0.00, #running-req: 0, #queue-req: 0
2025-02-19 22:18:11,423 - INFO - flashinfer.jit: Loading JIT ops: cascade
2025-02-19 22:18:11,436 - INFO - flashinfer.jit: Finished loading JIT ops: cascade
[2025-02-19 22:18:11] INFO:     127.0.0.1:46942 - "POST /v1/chat/completions HTTP/1.1" 200 OK
{'id': '75fd138fc7de4011bf3fff1a7c529137', 'object': 'chat.completion', 'created': 1740003491, 'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': 'The capital of France is Paris.', 'tool_calls': None}, 'logprobs': None, 'finish_reason': 'stop', 'matched_stop': 128009}], 'usage': {'prompt_tokens': 42, 'total_tokens': 50, 'completion_tokens': 8, 'prompt_tokens_details': None}}
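
The response is a plain dictionary once parsed, so the answer text and token counts can be read directly from it. The sketch below uses the response printed above as sample data; `summarize_chat_completion` is a hypothetical helper name, and the field layout is assumed to match that sample.

```python
# Sample data: the chat completion response shown in the output above.
response = {
    "id": "75fd138fc7de4011bf3fff1a7c529137",
    "object": "chat.completion",
    "created": 1740003491,
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The capital of France is Paris.",
                "tool_calls": None,
            },
            "logprobs": None,
            "finish_reason": "stop",
            "matched_stop": 128009,
        }
    ],
    "usage": {
        "prompt_tokens": 42,
        "total_tokens": 50,
        "completion_tokens": 8,
        "prompt_tokens_details": None,
    },
}


def summarize_chat_completion(resp: dict) -> dict:
    """Pull the first choice's text and the token usage out of a response."""
    choice = resp["choices"][0]
    usage = resp["usage"]
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
    }


summary = summarize_chat_completion(response)
print(summary["text"])
```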


## Using Python Requests

In [3]:
import requests

url = f"http://localhost:{port}/v1/chat/completions"

data = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

response = requests.post(url, json=data)
print_highlight(response.json())

[2025-02-19 22:18:11 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 41, cache hit rate: 46.15%, token usage: 0.00, #running-req: 0, #queue-req: 0


[2025-02-19 22:18:11] INFO:     127.0.0.1:46948 - "POST /v1/chat/completions HTTP/1.1" 200 OK
{'id': '62a3a855df5f4d3396f3f0d6fe579ce5', 'object': 'chat.completion', 'created': 1740003491, 'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': 'The capital of France is Paris.', 'tool_calls': None}, 'logprobs': None, 'finish_reason': 'stop', 'matched_stop': 128009}], 'usage': {'prompt_tokens': 42, 'total_tokens': 50, 'completion_tokens': 8, 'prompt_tokens_details': None}}
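
If `requests` is not installed, the same call can be made with the standard library. The following is a minimal sketch under that assumption; `build_chat_request` and `send_chat_request` are hypothetical helper names, and the payload mirrors the one used in the cell above.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_content: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_content}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(req: urllib.request.Request, timeout_s: float = 60.0) -> dict:
    """Send the request and decode the JSON body.

    urlopen raises urllib.error.HTTPError on non-2xx status codes.
    """
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server from this notebook running, `send_chat_request(build_chat_request(f"http://localhost:{port}", "meta-llama/Meta-Llama-3.1-8B-Instruct", "What is the capital of France?"))` would return a dictionary shaped like the output above.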


## Using OpenAI Python Client

In [4]:
import openai

client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print_highlight(response)

[2025-02-19 22:18:11 TP0] Prefill batch. #new-seq: 1, #new-token: 13, #cached-token: 30, cache hit rate: 53.73%, token usage: 0.00, #running-req: 0, #queue-req: 0


[2025-02-19 22:18:11 TP0] Decode batch. #running-req: 1, #token: 60, token usage: 0.00, gen throughput (token/s): 6.26, #queue-req: 0


[2025-02-19 22:18:12] INFO:     127.0.0.1:46954 - "POST /v1/chat/completions HTTP/1.1" 200 OK
ChatCompletion(id='144c7dba10a046e08099a4d16a0e8779', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Here are 3 countries and their capitals:\n\n1. Country: Japan\n   Capital: Tokyo\n\n2. Country: Australia\n   Capital: Canberra\n\n3. Country: Brazil\n   Capital: Brasília', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None), matched_stop=128009)], created=1740003492, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=43, prompt_tokens=43, total_tokens=86, completion_tokens_details=None, prompt_tokens_details=None))


### Streaming

In [5]:
import openai

client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

# Use stream=True for streaming responses
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
    stream=True,
)

# Handle the streaming output
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

[2025-02-19 22:18:12] INFO:     127.0.0.1:46964 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-02-19 22:18:12 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 42, cache hit rate: 64.41%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-19 22:18:12 TP0] Decode batch. #running-req: 1, #token: 57, token usage: 0.00, gen throughput (token/s): 79.92, #queue-req: 0
Here are 3 countries and their capitals:

1. Country: Japan
   Capital: Tokyo

2. Country: Australia
   Capital: Canberra

3. Country: Brazil
   Capital: Brasília
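
When the full text is needed after streaming finishes, the deltas can be collected and joined instead of only printed. The sketch below shows that pattern on stand-in objects built with `SimpleNamespace`, so it runs without a server; `collect_stream` and `fake_chunk` are hypothetical names, and the stand-ins only imitate the `choices[0].delta.content` attribute path used in the loop above.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Join the non-empty `delta.content` pieces of a chat completion stream."""
    parts = []
    for chunk in chunks:
        # Assumption: some chunks may carry no choices; skip those defensively.
        if not chunk.choices:
            continue
        piece = chunk.choices[0].delta.content
        if piece:  # content can be None or "" on some chunks
            parts.append(piece)
    return "".join(parts)


def fake_chunk(content):
    """Build a stand-in object with the same attribute path as a stream chunk."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


# The first few pieces from the streamed output above, plus an empty chunk.
demo_chunks = [fake_chunk(None)] + [
    fake_chunk(piece) for piece in ["Here", " are", " ", "3", " countries"]
]
full_text = collect_stream(demo_chunks)
print(full_text)
```

Passing the `response` iterator from the cell above to `collect_stream` in place of `demo_chunks` should yield the complete reply as one string.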

## Using Native Generation APIs

You can also use the native `/generate` endpoint with requests, which provides more flexibility. An API reference is available at [Sampling Parameters](../references/sampling_params.md).

In [6]:
import requests

response = requests.post(
    f"http://localhost:{port}/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)

print_highlight(response.json())

[2025-02-19 22:18:12 TP0] Prefill batch. #new-seq: 1, #new-token: 3, #cached-token: 3, cache hit rate: 63.93%, token usage: 0.00, #running-req: 0, #queue-req: 0


[2025-02-19 22:18:12 TP0] Decode batch. #running-req: 1, #token: 17, token usage: 0.00, gen throughput (token/s): 81.69, #queue-req: 0


[2025-02-19 22:18:13] INFO:     127.0.0.1:46980 - "POST /generate HTTP/1.1" 200 OK
{'text': ' a city of romance, art, fashion, and cuisine. Paris is a must-visit destination for anyone who loves history, architecture, and culture. From the', 'meta_info': {'id': '8a25f367da6a47c194bc302b16d72017', 'finish_reason': {'type': 'length', 'length': 32}, 'prompt_tokens': 6, 'completion_tokens': 32, 'cached_tokens': 3}}
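
The native response carries the generated text under `text` and bookkeeping under `meta_info`. In the output above, `finish_reason` has type `length` with `length` equal to 32, which matches `max_new_tokens`, and the text ends mid-sentence. The sketch below checks for that case using the printed response as sample data; `hit_length_limit` is a hypothetical helper name, and the field layout is assumed to match the sample.

```python
# Sample data: the /generate response shown in the output above.
response = {
    "text": (
        " a city of romance, art, fashion, and cuisine. Paris is a must-visit "
        "destination for anyone who loves history, architecture, and culture. From the"
    ),
    "meta_info": {
        "id": "8a25f367da6a47c194bc302b16d72017",
        "finish_reason": {"type": "length", "length": 32},
        "prompt_tokens": 6,
        "completion_tokens": 32,
        "cached_tokens": 3,
    },
}


def hit_length_limit(resp: dict) -> bool:
    """Return True when generation stopped because of the token limit."""
    return resp["meta_info"]["finish_reason"]["type"] == "length"


if hit_length_limit(response):
    used = response["meta_info"]["completion_tokens"]
    print(f"Output was cut off after {used} tokens; consider raising max_new_tokens.")
```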



### Streaming

In [7]:
import requests, json

response = requests.post(
    f"http://localhost:{port}/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
        "stream": True,
    },
    stream=True,
)

prev = 0
for chunk in response.iter_lines(decode_unicode=False):
    chunk = chunk.decode("utf-8")
    if chunk and chunk.startswith("data:"):
        if chunk == "data: [DONE]":
            break
        data = json.loads(chunk[5:].strip("\n"))
        output = data["text"]
        print(output[prev:], end="", flush=True)
        prev = len(output)

[2025-02-19 22:18:13] INFO:     127.0.0.1:46982 - "POST /generate HTTP/1.1" 200 OK
[2025-02-19 22:18:13 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 5, cache hit rate: 64.55%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-19 22:18:13 TP0] Decode batch. #running-req: 1, #token: 25, token usage: 0.00, gen throughput (token/s): 82.78, #queue-req: 0
 a city of romance, art, fashion, and cuisine. Paris is a must-visit destination for anyone who loves history, architecture, and culture. From the
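
The loop above relies on each `data:` event carrying the text generated so far, which is why it prints only `output[prev:]`. As a sketch of that parsing logic in isolation, the generator below turns such lines into newly added pieces. `iter_new_text` is a hypothetical helper name, and the demo lines are hand-written stand-ins in the same `data: {...}` shape the loop expects, not captured server output.

```python
import json


def iter_new_text(lines):
    """Yield only the newly generated text from cumulative `data:` events."""
    prev = 0
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything unexpected
        body = line[len("data:"):].strip()
        if body == "[DONE]":
            break
        text = json.loads(body)["text"]
        yield text[prev:]
        prev = len(text)


# Hand-written stand-in events; each one repeats the text produced so far.
demo_lines = [
    b'data: {"text": " a"}',
    b"",
    b'data: {"text": " a city"}',
    b"",
    b'data: {"text": " a city of"}',
    b"",
    b"data: [DONE]",
]
pieces = list(iter_new_text(demo_lines))
print("".join(pieces))
```

In the notebook cell, `response.iter_lines(decode_unicode=False)` could be passed in place of `demo_lines`.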

In [8]:
terminate_process(server_process)