# OpenAI APIs - Completions

SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models.
A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/api-reference).

This tutorial covers the following popular APIs:

- `chat/completions`
- `completions`
- `batches`

Check out other tutorials to learn about vision APIs for vision-language models and embedding APIs for embedding models.

## Launch A Server

Launch the server in your terminal and wait for it to initialize.

In [1]:
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

from sglang.utils import wait_for_server, print_highlight, terminate_process


server_process, port = launch_server_cmd(
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0"
)

wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")

INFO 02-19 22:00:36 __init__.py:190] Automatically detected platform cuda.


2025-02-19 22:00:38,096 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend


[2025-02-19 22:00:38] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Meta-Llama-3.1-8B-Instruct', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, host='0.0.0.0', port=30018, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, stream_output=False, random_seed=688718194, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable

INFO 02-19 22:00:42 __init__.py:190] Automatically detected platform cuda.


INFO 02-19 22:00:42 __init__.py:190] Automatically detected platform cuda.


2025-02-19 22:00:44,115 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-02-19 22:00:44,167 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend


[2025-02-19 22:00:45 TP0] Init torch distributed begin.


[2025-02-19 22:00:46 TP0] Load weight begin. avail mem=78.84 GB


[2025-02-19 22:00:46 TP0] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.12it/s]


Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.68it/s]


Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.35it/s]


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.23it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.28it/s]

[2025-02-19 22:00:50 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=63.75 GB


[2025-02-19 22:00:50 TP0] KV Cache is allocated. K size: 27.14 GB, V size: 27.14 GB.
[2025-02-19 22:00:50 TP0] Memory pool end. avail mem=8.37 GB
[2025-02-19 22:00:50 TP0] Capture cuda graph begin. This can take up to several minutes.
  0%|          | 0/23 [00:00<?, ?it/s]

2025-02-19 22:00:51,607 - INFO - flashinfer.jit: Loading JIT ops: batch_decode_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False
2025-02-19 22:00:51,635 - INFO - flashinfer.jit: Finished loading JIT ops: batch_decode_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False


100%|██████████| 23/23 [00:06<00:00,  3.68it/s]
[2025-02-19 22:00:57 TP0] Capture cuda graph end. Time elapsed: 6.26 s


[2025-02-19 22:00:57 TP0] max_total_num_tokens=444696, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072
[2025-02-19 22:00:57] INFO:     Started server process [1785669]
[2025-02-19 22:00:57] INFO:     Waiting for application startup.
[2025-02-19 22:00:57] INFO:     Application startup complete.
[2025-02-19 22:00:57] INFO:     Uvicorn running on http://0.0.0.0:30018 (Press CTRL+C to quit)


[2025-02-19 22:00:58] INFO:     127.0.0.1:51162 - "GET /v1/models HTTP/1.1" 200 OK


[2025-02-19 22:00:58] INFO:     127.0.0.1:51178 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-02-19 22:00:58 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0


2025-02-19 22:00:59,476 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-02-19 22:00:59,498 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False


[2025-02-19 22:00:59] INFO:     127.0.0.1:51186 - "POST /generate HTTP/1.1" 200 OK
[2025-02-19 22:00:59] The server is fired up and ready to roll!




NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
We are running those notebooks in a CI parallel environment, so the throughput is not representative of the actual performance.
Server started on http://localhost:30018


## Chat Completions

### Usage

The server implements the OpenAI-compatible Chat Completions API.
It will automatically apply the chat template specified in the Hugging Face tokenizer, if one is available.
You can also specify a custom chat template with `--chat-template` when launching the server.

In [2]:
import openai

client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")

[2025-02-19 22:01:03 TP0] Prefill batch. #new-seq: 1, #new-token: 42, #cached-token: 1, cache hit rate: 2.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
2025-02-19 22:01:03,728 - INFO - flashinfer.jit: Loading JIT ops: cascade
2025-02-19 22:01:03,740 - INFO - flashinfer.jit: Finished loading JIT ops: cascade


[2025-02-19 22:01:04 TP0] Decode batch. #running-req: 1, #token: 76, token usage: 0.00, gen throughput (token/s): 6.38, #queue-req: 0
[2025-02-19 22:01:04] INFO:     127.0.0.1:50534 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Response: ChatCompletion(id='737e32e9104f4a758611d64eeb3da09a', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Here are 3 countries and their capitals:\n\n1. Country: Japan\n   Capital: Tokyo\n\n2. Country: Australia\n   Capital: Canberra\n\n3. Country: Brazil\n   Capital: Brasília', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None), matched_stop=128009)], created=1740002464, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=43, prompt_tokens=43, total_tokens=86, completion_tokens_details=None, prompt_tokens_details=None))
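
Because the endpoint follows the OpenAI request format, the same call can in principle be made without the `openai` package. The sketch below only builds the HTTP request with the standard library and does not send it. The port `30000` and the helper name `build_chat_request` are illustrative assumptions, and the `Authorization` header simply mirrors the placeholder key used above.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, **params):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Placeholder key, mirroring api_key="None" in the client above.
            "Authorization": "Bearer None",
        },
        method="POST",
    )


request = build_chat_request(
    "http://127.0.0.1:30000/v1",  # assumed port; use the one printed at launch
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    [{"role": "user", "content": "List 3 countries and their capitals."}],
    temperature=0,
    max_tokens=64,
)
print(request.get_method(), request.full_url)
# With a running server, it could be sent with urllib.request.urlopen(request).
```

Keeping request construction separate from sending makes it easy to inspect the exact JSON body before it reaches the server.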


### Parameters

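The chat completions API accepts the parameters of the OpenAI Chat Completions API. Refer to the [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat/create) for details. In the run recorded here, the server log reports that `frequency_penalty` and `presence_penalty` are ignored under the default overlap scheduler unless the server is launched with `--disable-overlap`.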

Here is an example of a detailed chat completion request:

In [3]:
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a knowledgeable historian who provides concise responses.",
        },
        {"role": "user", "content": "Tell me about ancient Rome"},
        {
            "role": "assistant",
            "content": "Ancient Rome was a civilization centered in Italy.",
        },
        {"role": "user", "content": "What were their major achievements?"},
    ],
    temperature=0.3,  # Lower temperature for more focused responses
    max_tokens=128,  # Reasonable length for a concise response
    top_p=0.95,  # Slightly higher for better fluency
    presence_penalty=0.2,  # Mild penalty to avoid repetition
    frequency_penalty=0.2,  # Mild penalty for more natural language
    n=1,  # Single response is usually more stable
    seed=42,  # Keep for reproducibility
)

print_highlight(response.choices[0].message.content)

[2025-02-19 22:01:04 TP0] Prefill batch. #new-seq: 1, #new-token: 51, #cached-token: 25, cache hit rate: 20.63%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-19 22:01:04 TP0] frequency_penalty, presence_penalty, and repetition_penalty are not supported when using the default overlap scheduler. They will be ignored. Please add `--disable-overlap` when launching the server if you need these features. The speed will be slower in that case.


[2025-02-19 22:01:04 TP0] Decode batch. #running-req: 1, #token: 106, token usage: 0.00, gen throughput (token/s): 76.69, #queue-req: 0


[2025-02-19 22:01:05 TP0] Decode batch. #running-req: 1, #token: 146, token usage: 0.00, gen throughput (token/s): 83.17, #queue-req: 0


[2025-02-19 22:01:05 TP0] Decode batch. #running-req: 1, #token: 186, token usage: 0.00, gen throughput (token/s): 82.52, #queue-req: 0


[2025-02-19 22:01:05] INFO:     127.0.0.1:50534 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Ancient Rome's major achievements include:

1. **Engineering and Architecture**: Developed concrete, aqueducts, roads, bridges, and monumental buildings like the Colosseum and Pantheon.
2. **Law and Governance**: Created the Twelve Tables, a foundation for modern law, and the Roman Republic's system of governance, which influenced Western democracy.
3. **Military Conquests**: Expanded the empire through extensive military campaigns, creating a vast network of roads and trade routes.
4. **Language and Literature**: Developed Latin, which became the foundation for many modern languages, and produced influential writers like Cicero and Virgil.
5.


Streaming mode is also supported.

In [4]:
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

[2025-02-19 22:01:05] INFO:     127.0.0.1:50534 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-02-19 22:01:05 TP0] Prefill batch. #new-seq: 1, #new-token: 10, #cached-token: 30, cache hit rate: 33.73%, token usage: 0.00, #running-req: 0, #queue-req: 0
It looks like we're starting a test. What would you like to do first?
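
When the full text is needed after streaming, the deltas can be accumulated instead of printed. The following is a minimal sketch; `collect_stream` and `fake_chunk` are hypothetical helpers, and the stand-in chunks only imitate the `choices[0].delta.content` shape used in the loop above.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Concatenate the text deltas of a chat completion stream into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices (assumption)
            continue
        delta = chunk.choices[0].delta.content
        if delta is not None:
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for a streamed chunk, used only for this illustration."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo_stream = [fake_chunk("This "), fake_chunk(None), fake_chunk("is a test")]
full_text = collect_stream(demo_stream)
print(full_text)
```

With a live server, the `stream` object returned by `client.chat.completions.create(..., stream=True)` would take the place of `demo_stream`.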

## Completions

### Usage

The Completions API is similar to the Chat Completions API, but it takes a plain `prompt` instead of `messages` and does not apply a chat template.

In [5]:
response = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    prompt="List 3 countries and their capitals.",
    temperature=0,
    max_tokens=64,
    n=1,
    stop=None,
)

print_highlight(f"Response: {response}")

[2025-02-19 22:01:06 TP0] Prefill batch. #new-seq: 1, #new-token: 8, #cached-token: 1, cache hit rate: 32.57%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-19 22:01:06 TP0] Decode batch. #running-req: 1, #token: 13, token usage: 0.00, gen throughput (token/s): 74.52, #queue-req: 0


[2025-02-19 22:01:06 TP0] Decode batch. #running-req: 1, #token: 53, token usage: 0.00, gen throughput (token/s): 87.26, #queue-req: 0


Response: Completion(id='559c445ef4004d26a740c5b5e64887df', choices=[CompletionChoice(finish_reason='length', index=0, logprobs=None, text=' 1. 2. 3.\n1.  United States - Washington D.C. 2.  Japan - Tokyo 3.  Australia - Canberra\nList 3 countries and their capitals. 1. 2. 3.\n1.  China - Beijing 2.  Brazil - Bras', matched_stop=None)], created=1740002466, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=64, prompt_tokens=9, total_tokens=73, completion_tokens_details=None, prompt_tokens_details=None))
[2025-02-19 22:01:06] INFO:     127.0.0.1:50534 - "POST /v1/completions HTTP/1.1" 200 OK
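
The response above reports `finish_reason='length'`, which means generation stopped because it reached `max_tokens` rather than a natural end. A small sketch of detecting this on the client side follows; `truncated_choices` is a hypothetical helper, and the stand-in object replaces a real response so the snippet runs without a server.

```python
from types import SimpleNamespace


def truncated_choices(response):
    """Return the indices of choices that stopped because of the token limit."""
    return [c.index for c in response.choices if c.finish_reason == "length"]


# Stand-in for a Completion object, imitating only the fields read above.
demo_response = SimpleNamespace(
    choices=[
        SimpleNamespace(index=0, finish_reason="length", text=" 1. 2. 3."),
        SimpleNamespace(index=1, finish_reason="stop", text=" done"),
    ]
)
cut_off = truncated_choices(demo_response)
if cut_off:
    print(f"Choices {cut_off} hit max_tokens; consider raising the limit.")
```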



### Parameters

The completions API accepts the parameters of the OpenAI Completions API. Refer to the [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions/create) for details.

Here is an example of a detailed completions request:

In [6]:
response = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    prompt="Write a short story about a space explorer.",
    temperature=0.7,  # Moderate temperature for creative writing
    max_tokens=150,  # Longer response for a story
    top_p=0.9,  # Balanced diversity in word choice
    stop=["\n\n", "THE END"],  # Multiple stop sequences
    presence_penalty=0.3,  # Encourage novel elements
    frequency_penalty=0.3,  # Reduce repetitive phrases
    n=1,  # Generate one completion
    seed=123,  # For reproducible results
)

print_highlight(f"Response: {response}")

[2025-02-19 22:01:06 TP0] Prefill batch. #new-seq: 1, #new-token: 9, #cached-token: 1, cache hit rate: 31.35%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-19 22:01:06 TP0] frequency_penalty, presence_penalty, and repetition_penalty are not supported when using the default overlap scheduler. They will be ignored. Please add `--disable-overlap` when launching the server if you need these features. The speed will be slower in that case.


[2025-02-19 22:01:07 TP0] Decode batch. #running-req: 1, #token: 30, token usage: 0.00, gen throughput (token/s): 81.59, #queue-req: 0


[2025-02-19 22:01:07 TP0] Decode batch. #running-req: 1, #token: 70, token usage: 0.00, gen throughput (token/s): 86.44, #queue-req: 0


[2025-02-19 22:01:08 TP0] Decode batch. #running-req: 1, #token: 110, token usage: 0.00, gen throughput (token/s): 85.18, #queue-req: 0


[2025-02-19 22:01:08 TP0] Decode batch. #running-req: 1, #token: 150, token usage: 0.00, gen throughput (token/s): 83.97, #queue-req: 0
[2025-02-19 22:01:08] INFO:     127.0.0.1:50534 - "POST /v1/completions HTTP/1.1" 200 OK
Response: Completion(id='040ab5ec975344b08a14b9a5f8db5cb8', choices=[CompletionChoice(finish_reason='length', index=0, logprobs=None, text=" Captain Jaxon had been traveling through the galaxy for 42 years, and he had seen some incredible sights. But nothing could have prepared him for the strange, glowing planet that appeared on his radar.\nAs he drew closer, he could see that the planet was covered in a thick, iridescent mist that seemed to pulse with an otherworldly energy. The ship's instruments were picking up strange readings, and Captain Jaxon couldn't shake the feeling that he was being watched.\nHe decided to take a small shuttle down to the planet's surface, and as he descended through the mist, he felt a strange, tingling sensation in his fingers. The sh

## Structured Outputs (JSON, Regex, EBNF)

For the OpenAI-compatible structured outputs API, refer to [Structured Outputs](https://docs.sglang.ai/backend/structured_outputs.html#OpenAI-Compatible-API).


## Batches

The Batches API is also supported for both chat completions and completions. You can upload your requests in `jsonl` files, create a batch job, and retrieve the results when the batch job is completed (which takes longer but costs less).

The batches APIs are:

- `batches`
- `batches/{batch_id}/cancel`
- `batches/{batch_id}`

Here is an example of a batch job for chat completions. Batch jobs for completions work similarly.


In [7]:
import json
import time
from openai import OpenAI

client = OpenAI(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

requests = [
    {
        "custom_id": "request-1",
        "method": "POST",
        "url": "/chat/completions",
        "body": {
            "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
            "messages": [
                {"role": "user", "content": "Tell me a joke about programming"}
            ],
            "max_tokens": 50,
        },
    },
    {
        "custom_id": "request-2",
        "method": "POST",
        "url": "/chat/completions",
        "body": {
            "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
            "messages": [{"role": "user", "content": "What is Python?"}],
            "max_tokens": 50,
        },
    },
]

input_file_path = "batch_requests.jsonl"

with open(input_file_path, "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

with open(input_file_path, "rb") as f:
    file_response = client.files.create(file=f, purpose="batch")

batch_response = client.batches.create(
    input_file_id=file_response.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print_highlight(f"Batch job created with ID: {batch_response.id}")

[2025-02-19 22:01:08] INFO:     127.0.0.1:50548 - "POST /v1/files HTTP/1.1" 200 OK
[2025-02-19 22:01:08] INFO:     127.0.0.1:50548 - "POST /v1/batches HTTP/1.1" 200 OK
Batch job created with ID: batch_f071cf49-61ea-4ca3-9466-f4011a291933


[2025-02-19 22:01:08 TP0] Prefill batch. #new-seq: 2, #new-token: 20, #cached-token: 60, cache hit rate: 44.53%, token usage: 0.00, #running-req: 0, #queue-req: 0
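
Each line of the input file is one JSON object with `custom_id`, `method`, `url`, and `body` fields, as in the `requests` list above. The sketch below builds such lines programmatically. `make_batch_line` and `to_jsonl` are hypothetical helper names, and the uniqueness check on `custom_id` follows the OpenAI batch convention; whether the server enforces it is an assumption.

```python
import json


def make_batch_line(custom_id, messages, model, max_tokens=50):
    """Build one batch request in the shape used by the example above."""
    return {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/chat/completions",
        "body": {"model": model, "messages": messages, "max_tokens": max_tokens},
    }


def to_jsonl(batch_lines):
    """Serialize requests as JSON Lines, rejecting duplicate custom_id values."""
    ids = [line["custom_id"] for line in batch_lines]
    if len(ids) != len(set(ids)):
        raise ValueError("custom_id values must be unique within a batch file")
    return "".join(json.dumps(line) + "\n" for line in batch_lines)


prompts = ["Tell me a joke about programming", "What is Python?"]
jsonl_text = to_jsonl(
    [
        make_batch_line(
            f"request-{i}",
            [{"role": "user", "content": prompt}],
            "meta-llama/Meta-Llama-3.1-8B-Instruct",
        )
        for i, prompt in enumerate(prompts, start=1)
    ]
)
print(jsonl_text)
```

The resulting string can be written to `batch_requests.jsonl` and uploaded exactly as in the cell above.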


In [8]:
while batch_response.status not in ["completed", "failed", "cancelled"]:
    time.sleep(3)
    print(f"Batch job status: {batch_response.status}...trying again in 3 seconds...")
    batch_response = client.batches.retrieve(batch_response.id)

if batch_response.status == "completed":
    print("Batch job completed successfully!")
    print(f"Request counts: {batch_response.request_counts}")

    result_file_id = batch_response.output_file_id
    file_response = client.files.content(result_file_id)
    result_content = file_response.read().decode("utf-8")

    results = [
        json.loads(line) for line in result_content.split("\n") if line.strip() != ""
    ]

    for result in results:
        print_highlight(f"Request {result['custom_id']}:")
        print_highlight(f"Response: {result['response']}")

    print_highlight("Cleaning up files...")
    # Only delete the result file ID since file_response is just content
    client.files.delete(result_file_id)
else:
    print_highlight(f"Batch job failed with status: {batch_response.status}")
    if hasattr(batch_response, "errors"):
        print_highlight(f"Errors: {batch_response.errors}")

[2025-02-19 22:01:09 TP0] Decode batch. #running-req: 1, #token: 69, token usage: 0.00, gen throughput (token/s): 86.28, #queue-req: 0


Batch job status: validating...trying again in 3 seconds...
Batch job completed successfully!
Request counts: BatchRequestCounts(completed=2, failed=0, total=2)
[2025-02-19 22:01:11] INFO:     127.0.0.1:50548 - "GET /v1/batches/batch_f071cf49-61ea-4ca3-9466-f4011a291933 HTTP/1.1" 200 OK
[2025-02-19 22:01:11] INFO:     127.0.0.1:50548 - "GET /v1/files/backend_result_file-eff98512-f34c-418f-b582-9af69fb66aae/content HTTP/1.1" 200 OK
Request request-1:
Response: {'status_code': 200, 'request_id': 'request-1', 'body': {'id': 'request-1', 'object': 'chat.completion', 'created': 1740002469, 'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'choices': {'index': 0, 'message': {'role': 'assistant', 'content': 'Why do programmers prefer dark mode?\n\nBecause light attracts bugs.', 'tool_calls': None}, 'logprobs': None, 'finish_reason': 'stop', 'matched_stop': 128009}, 'usage': {'prompt_tokens': 41, 'completion_tokens': 13, 'total_tokens': 54}, 'system_fingerprint': None}}
Request request-2:
Resp
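
Results come back as JSON Lines and can be matched to the original requests through `custom_id`. The sketch below indexes a result file by that field. The sample text only imitates the shape printed above (where `choices` appears as a single object), and `index_results` is a hypothetical helper.

```python
import json


def index_results(result_content):
    """Map custom_id -> assistant text for each non-empty line of a result file."""
    indexed = {}
    for line in result_content.splitlines():
        if not line.strip():
            continue
        result = json.loads(line)
        choices = result["response"]["body"]["choices"]
        # The output above shows a single object; accept a list as well.
        first = choices[0] if isinstance(choices, list) else choices
        indexed[result["custom_id"]] = first["message"]["content"]
    return indexed


sample = json.dumps(
    {
        "custom_id": "request-1",
        "response": {
            "status_code": 200,
            "body": {
                "choices": {
                    "message": {
                        "role": "assistant",
                        "content": "Because light attracts bugs.",
                    }
                }
            },
        },
    }
)
answers = index_results(sample + "\n\n")
print(answers["request-1"])
```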

It takes a while to complete the batch job. You can use these two APIs to retrieve the batch job status or cancel the batch job.

1. `batches/{batch_id}`: Retrieve the batch job status.
2. `batches/{batch_id}/cancel`: Cancel the batch job.
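
Retrieving the status repeatedly until the job finishes can be wrapped in a small polling helper with a timeout. This is a sketch under stated assumptions: `wait_for_batch` is a hypothetical name, the terminal statuses are the ones checked earlier in this tutorial, and a fake `retrieve` function stands in for `client.batches.retrieve` so the snippet runs without a server.

```python
import time
from types import SimpleNamespace

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_batch(retrieve, batch_id, poll_interval=3.0, timeout=60.0, sleep=time.sleep):
    """Poll retrieve(batch_id) until the job reaches a terminal status or times out."""
    deadline = time.monotonic() + timeout
    while True:
        job = retrieve(batch_id)
        if job.status in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Batch {batch_id} still {job.status!r} after {timeout}s")
        sleep(poll_interval)


# Fake retrieve function that reports "validating" twice, then "completed".
statuses = iter(["validating", "validating", "completed"])


def fake_retrieve(batch_id):
    return SimpleNamespace(id=batch_id, status=next(statuses))


final_job = wait_for_batch(fake_retrieve, "batch_demo", sleep=lambda seconds: None)
print(final_job.status)
```

Against a live server, the call would look like `wait_for_batch(client.batches.retrieve, batch_job.id)`.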

Here is an example of checking the batch job status.

In [9]:
import json
import time
from openai import OpenAI

client = OpenAI(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

requests = []
for i in range(20):
    requests.append(
        {
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/chat/completions",
            "body": {
                "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
                "messages": [
                    {
                        "role": "system",
                        "content": f"{i}: You are a helpful AI assistant",
                    },
                    {
                        "role": "user",
                        "content": "Write a detailed story about topic. Make it very long.",
                    },
                ],
                "max_tokens": 64,
            },
        }
    )

input_file_path = "batch_requests.jsonl"
with open(input_file_path, "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

with open(input_file_path, "rb") as f:
    uploaded_file = client.files.create(file=f, purpose="batch")

batch_job = client.batches.create(
    input_file_id=uploaded_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print_highlight(f"Created batch job with ID: {batch_job.id}")
print_highlight(f"Initial status: {batch_job.status}")

time.sleep(10)

max_checks = 5
for i in range(max_checks):
    batch_details = client.batches.retrieve(batch_id=batch_job.id)

    print_highlight(
        f"Batch job details (check {i+1} / {max_checks}) // ID: {batch_details.id} // Status: {batch_details.status} // Created at: {batch_details.created_at} // Input file ID: {batch_details.input_file_id} // Output file ID: {batch_details.output_file_id}"
    )
    print_highlight(
        f"<strong>Request counts: Total: {batch_details.request_counts.total} // Completed: {batch_details.request_counts.completed} // Failed: {batch_details.request_counts.failed}</strong>"
    )

    time.sleep(3)

[2025-02-19 22:01:11] INFO:     127.0.0.1:37248 - "POST /v1/files HTTP/1.1" 200 OK
Created batch job with ID: batch_643a24f8-7808-43d2-a2c0-5dab11846ee3
Initial status: validating
[2025-02-19 22:01:11] INFO:     127.0.0.1:37248 - "POST /v1/batches HTTP/1.1" 200 OK
[2025-02-19 22:01:11 TP0] Prefill batch. #new-seq: 20, #new-token: 600, #cached-token: 500, cache hit rate: 45.27%, token usage: 0.00, #running-req: 0, #queue-req: 0


[2025-02-19 22:01:12 TP0] Decode batch. #running-req: 20, #token: 1025, token usage: 0.00, gen throughput (token/s): 138.97, #queue-req: 0


[2025-02-19 22:01:12 TP0] Decode batch. #running-req: 20, #token: 1825, token usage: 0.00, gen throughput (token/s): 1487.60, #queue-req: 0


Batch job details (check 1 / 5) // ID: batch_643a24f8-7808-43d2-a2c0-5dab11846ee3 // Status: completed // Created at: 1740002471 // Input file ID: backend_input_file-b87c4c9e-09d0-40be-8252-9ed7af9c221e // Output file ID: backend_result_file-7f31f05e-eb9a-4f60-a7fb-894cf26df358
[2025-02-19 22:01:21] INFO:     127.0.0.1:45894 - "GET /v1/batches/batch_643a24f8-7808-43d2-a2c0-5dab11846ee3 HTTP/1.1" 200 OK

<strong>Request counts: Total: 20 // Completed: 20 // Failed: 0</strong>


[2025-02-19 22:01:24] INFO:     127.0.0.1:45894 - "GET /v1/batches/batch_643a24f8-7808-43d2-a2c0-5dab11846ee3 HTTP/1.1" 200 OK
Batch job details (check 2 / 5) // ID: batch_643a24f8-7808-43d2-a2c0-5dab11846ee3 // Status: completed // Created at: 1740002471 // Input file ID: backend_input_file-b87c4c9e-09d0-40be-8252-9ed7af9c221e // Output file ID: backend_result_file-7f31f05e-eb9a-4f60-a7fb-894cf26df358
<strong>Request counts: Total: 20 // Completed: 20 // Failed: 0</strong>


Batch job details (check 3 / 5) // ID: batch_643a24f8-7808-43d2-a2c0-5dab11846ee3 // Status: completed // Created at: 1740002471 // Input file ID: backend_input_file-b87c4c9e-09d0-40be-8252-9ed7af9c221e // Output file ID: backend_result_file-7f31f05e-eb9a-4f60-a7fb-894cf26df358
[2025-02-19 22:01:27] INFO:     127.0.0.1:45894 - "GET /v1/batches/batch_643a24f8-7808-43d2-a2c0-5dab11846ee3 HTTP/1.1" 200 OK

<strong>Request counts: Total: 20 // Completed: 20 // Failed: 0</strong>


[2025-02-19 22:01:30] INFO:     127.0.0.1:45894 - "GET /v1/batches/batch_643a24f8-7808-43d2-a2c0-5dab11846ee3 HTTP/1.1" 200 OK
Batch job details (check 4 / 5) // ID: batch_643a24f8-7808-43d2-a2c0-5dab11846ee3 // Status: completed // Created at: 1740002471 // Input file ID: backend_input_file-b87c4c9e-09d0-40be-8252-9ed7af9c221e // Output file ID: backend_result_file-7f31f05e-eb9a-4f60-a7fb-894cf26df358
<strong>Request counts: Total: 20 // Completed: 20 // Failed: 0</strong>


Batch job details (check 5 / 5) // ID: batch_643a24f8-7808-43d2-a2c0-5dab11846ee3 // Status: completed // Created at: 1740002471 // Input file ID: backend_input_file-b87c4c9e-09d0-40be-8252-9ed7af9c221e // Output file ID: backend_result_file-7f31f05e-eb9a-4f60-a7fb-894cf26df358
[2025-02-19 22:01:33] INFO:     127.0.0.1:45894 - "GET /v1/batches/batch_643a24f8-7808-43d2-a2c0-5dab11846ee3 HTTP/1.1" 200 OK

<strong>Request counts: Total: 20 // Completed: 20 // Failed: 0</strong>


Here is an example of cancelling a batch job.

In [10]:
import json
import time
from openai import OpenAI
import os

client = OpenAI(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

requests = []
for i in range(5000):
    requests.append(
        {
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/chat/completions",
            "body": {
                "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
                "messages": [
                    {
                        "role": "system",
                        "content": f"{i}: You are a helpful AI assistant",
                    },
                    {
                        "role": "user",
                        "content": "Write a detailed story about topic. Make it very long.",
                    },
                ],
                "max_tokens": 128,
            },
        }
    )

input_file_path = "batch_requests.jsonl"
with open(input_file_path, "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

with open(input_file_path, "rb") as f:
    uploaded_file = client.files.create(file=f, purpose="batch")

batch_job = client.batches.create(
    input_file_id=uploaded_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print_highlight(f"Created batch job with ID: {batch_job.id}")
print_highlight(f"Initial status: {batch_job.status}")

time.sleep(10)

try:
    cancelled_job = client.batches.cancel(batch_id=batch_job.id)
    print_highlight(f"Cancellation initiated. Status: {cancelled_job.status}")
    assert cancelled_job.status == "cancelling"

    # Monitor the cancellation process
    while cancelled_job.status not in ["failed", "cancelled"]:
        time.sleep(3)
        cancelled_job = client.batches.retrieve(batch_job.id)
        print_highlight(f"Current status: {cancelled_job.status}")

    # Verify final status
    assert cancelled_job.status == "cancelled"
    print_highlight("Batch job successfully cancelled")

except Exception as e:
    print_highlight(f"Error during cancellation: {e}")
    raise e

finally:
    try:
        del_response = client.files.delete(uploaded_file.id)
        if del_response.deleted:
            print_highlight("Successfully cleaned up input file")
        if os.path.exists(input_file_path):
            os.remove(input_file_path)
            print_highlight("Successfully deleted local batch_requests.jsonl file")
    except Exception as e:
        print_highlight(f"Error cleaning up: {e}")
        raise e

[2025-02-19 22:01:36] INFO:     127.0.0.1:53150 - "POST /v1/files HTTP/1.1" 200 OK
Created batch job with ID: batch_27d621b0-8b73-4419-8d25-128479417a73
Initial status: validating
[2025-02-19 22:01:36] INFO:     127.0.0.1:53150 - "POST /v1/batches HTTP/1.1" 200 OK


[2025-02-19 22:01:37 TP0] Prefill batch. #new-seq: 93, #new-token: 2210, #cached-token: 2905, cache hit rate: 54.37%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-19 22:01:37 TP0] Prefill batch. #new-seq: 274, #new-token: 8192, #cached-token: 6850, cache hit rate: 48.20%, token usage: 0.01, #running-req: 93, #queue-req: 1743


[2025-02-19 22:01:38 TP0] Prefill batch. #new-seq: 274, #new-token: 8192, #cached-token: 6825, cache hit rate: 47.07%, token usage: 0.02, #running-req: 366, #queue-req: 4361


[2025-02-19 22:01:38 TP0] Prefill batch. #new-seq: 274, #new-token: 8192, #cached-token: 6825, cache hit rate: 46.60%, token usage: 0.04, #running-req: 639, #queue-req: 4088


[2025-02-19 22:01:39 TP0] Prefill batch. #new-seq: 274, #new-token: 8192, #cached-token: 7011, cache hit rate: 46.49%, token usage: 0.06, #running-req: 912, #queue-req: 3815


[2025-02-19 22:01:39 TP0] Prefill batch. #new-seq: 274, #new-token: 8192, #cached-token: 7098, cache hit rate: 46.47%, token usage: 0.08, #running-req: 1185, #queue-req: 3542


[2025-02-19 22:01:40 TP0] Prefill batch. #new-seq: 274, #new-token: 8192, #cached-token: 7098, cache hit rate: 46.47%, token usage: 0.10, #running-req: 1458, #queue-req: 3269


[2025-02-19 22:01:40 TP0] Prefill batch. #new-seq: 274, #new-token: 8192, #cached-token: 7098, cache hit rate: 46.46%, token usage: 0.12, #running-req: 1731, #queue-req: 2996


[2025-02-19 22:01:41 TP0] Prefill batch. #new-seq: 45, #new-token: 1336, #cached-token: 1144, cache hit rate: 46.45%, token usage: 0.14, #running-req: 2004, #queue-req: 2952


[2025-02-19 22:01:46] INFO:     127.0.0.1:46936 - "POST /v1/batches/batch_27d621b0-8b73-4419-8d25-128479417a73/cancel HTTP/1.1" 200 OK
Cancellation initiated. Status: cancelling


[2025-02-19 22:01:49] INFO:     127.0.0.1:46936 - "GET /v1/batches/batch_27d621b0-8b73-4419-8d25-128479417a73 HTTP/1.1" 200 OK
Current status: cancelled
Batch job successfully cancelled
[2025-02-19 22:01:49] INFO:     127.0.0.1:46936 - "DELETE /v1/files/backend_input_file-07c4c820-24d8-4155-a7e8-a00f06e2949c HTTP/1.1" 200 OK
Successfully cleaned up input file
Successfully deleted local batch_requests.jsonl file
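
The local cleanup in the `finally` block can also be delegated to a temporary directory, so the JSONL file is removed even if an exception is raised partway through. The sketch below uses only the standard library; the upload step is shown as a comment because it needs a running server, and the request lines are shortened placeholders.

```python
import json
import os
import tempfile

# Shortened placeholder requests, only to have something to write.
batch_lines = [{"custom_id": f"request-{i}", "method": "POST"} for i in range(3)]

with tempfile.TemporaryDirectory() as tmp_dir:
    input_file_path = os.path.join(tmp_dir, "batch_requests.jsonl")
    with open(input_file_path, "w") as f:
        for line in batch_lines:
            f.write(json.dumps(line) + "\n")
    existed_inside = os.path.exists(input_file_path)
    # With a running server, the upload would go here, for example:
    # with open(input_file_path, "rb") as f:
    #     uploaded_file = client.files.create(file=f, purpose="batch")

# The directory and the file in it are removed when the block exits.
print(existed_inside, os.path.exists(input_file_path))
```

This only covers the local file; the uploaded copy on the server still has to be removed with `client.files.delete`, as in the example above.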


In [11]:
terminate_process(server_process)