# OpenAI Compatible API

SGLang provides an OpenAI-compatible API to ease migration from OpenAI services. It covers the following endpoints:

- `chat/completions`
- `completions`
- `batches`
- `embeddings` (refer to [embedding_model.ipynb](embedding_model.ipynb))

## Chat Completions

### Usage

As in [send_request.ipynb](send_request.ipynb), we can send a chat completion request to the SGLang server using the OpenAI API format.

In [38]:
from sglang.utils import execute_shell_command, wait_for_server, terminate_process

server_process = execute_shell_command(
    """
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--port 30000 --host 0.0.0.0 --log-level warning
"""
)

wait_for_server("http://localhost:30000")
print("Server is ready. Proceeding with the next steps.")

Server is ready. Proceeding with the next steps.


In [39]:
import openai

# Always assign an api_key, even if not specified during server initialization.
# Setting an API key during server initialization is strongly recommended.

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

# Chat completion example

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response)

ChatCompletion(id='e854540ec7914b2d8c712f16fd9ed2ca', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Here are 3 countries and their capitals:\n\n1. **Country:** Japan\n**Capital:** Tokyo\n\n2. **Country:** Australia\n**Capital:** Canberra\n\n3. **Country:** Brazil\n**Capital:** Brasília', refusal=None, role='assistant', function_call=None, tool_calls=None), matched_stop=128009)], created=1730012326, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=46, prompt_tokens=49, total_tokens=95, prompt_tokens_details=None))


### Parameters

The chat completions API accepts the following parameters (refer to [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat/create) for more details):

- `messages`: List of messages in the conversation, each containing `role` and `content`
- `model`: The model identifier to use for completion
- `max_tokens`: Maximum number of tokens to generate in the response
- `temperature`: Controls randomness (0-2). Lower values make output more focused and deterministic
- `top_p`: Alternative to temperature. Controls diversity via nucleus sampling
- `n`: Number of chat completion choices to generate
- `stream`: If true, partial message deltas will be sent as they become available
- `stop`: Sequences where the API will stop generating further tokens
- `presence_penalty`: Penalizes new tokens based on their presence in the text so far (-2.0 to 2.0)
- `frequency_penalty`: Penalizes new tokens based on their frequency in the text so far (-2.0 to 2.0)
- `logit_bias`: Modify the likelihood of specified tokens appearing in the completion
- `logprobs`: Include log probabilities of tokens in the response
- `top_logprobs`: Number of most likely tokens to return probabilities for
- `seed`: Random seed for deterministic results
- `response_format`: Specify the format of the response (e.g., JSON)
- `stream_options`: Additional options for streaming responses
- `user`: A unique identifier representing your end-user
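
The ranges listed above can be checked on the client before a request is sent. The following is a minimal sketch, not part of SGLang or the OpenAI client: the helper name `check_sampling_params` is hypothetical, and it only covers the three parameters whose ranges are stated in the list.

```python
from typing import Any

# Ranges copied from the parameter list above; names here are illustrative only.
_RANGES = {
    "temperature": (0.0, 2.0),
    "presence_penalty": (-2.0, 2.0),
    "frequency_penalty": (-2.0, 2.0),
}


def check_sampling_params(**params: Any) -> dict[str, Any]:
    """Return params unchanged, raising ValueError if a documented range is violated."""
    for name, (low, high) in _RANGES.items():
        value = params.get(name)
        if value is not None and not (low <= value <= high):
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
    return params


kwargs = check_sampling_params(temperature=0.3, presence_penalty=0.2, max_tokens=100)
print(kwargs)
```

The returned dictionary could then be unpacked into `client.chat.completions.create(model=..., messages=..., **kwargs)`.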

Here is an example of a detailed chat completion request:

In [40]:
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a knowledgeable historian who provides concise responses.",
        },
        {"role": "user", "content": "Tell me about ancient Rome"},
        {
            "role": "assistant",
            "content": "Ancient Rome was a civilization centered in Italy.",
        },
        {"role": "user", "content": "What were their major achievements?"},
    ],
    temperature=0.3,  # Lower temperature for more focused responses
    max_tokens=100,  # Reasonable length for a concise response
    top_p=0.95,  # Slightly higher for better fluency
    stop=["\n\n"],  # Simple stop sequence
    presence_penalty=0.2,  # Mild penalty to avoid repetition
    frequency_penalty=0.2,  # Mild penalty for more natural language
    n=1,  # Single response is usually more stable
    seed=42,  # Keep for reproducibility
    stream=True,  # Keep streaming for real-time output
)

for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")

Ancient Rome's major achievements include:
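
The loop above prints each delta as it arrives. If the full text is needed afterwards, the deltas can be accumulated instead. This is a minimal sketch: `collect_stream` and `_fake_chunk` are hypothetical names, and the fake chunks only mimic the `choices[0].delta.content` shape used above so the code runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Concatenate the `delta.content` pieces of a streamed chat completion."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        parts.append(chunk.choices[0].delta.content or "")
    return "".join(parts)


def _fake_chunk(text):
    # Stand-in for a streamed chunk; a real stream comes from the client.
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [_fake_chunk("Ancient "), _fake_chunk(None), _fake_chunk("Rome")]
full_text = collect_stream(fake_stream)
print(full_text)
```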

## Completions

### Usage

The Completions API is similar to the Chat Completions API, but it takes a `prompt` parameter instead of `messages`. Refer to [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions/create) for more details.

In [41]:
response = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    prompt="List 3 countries and their capitals.",
    temperature=0,
    max_tokens=64,
    n=1,
    stop=None,
)
print(response)

Completion(id='a6e07198f4b445baa0fb08a2178ceb59', choices=[CompletionChoice(finish_reason='length', index=0, logprobs=None, text=' 1. 2. 3.\n1.  United States - Washington D.C. 2.  Japan - Tokyo 3.  Australia - Canberra\nList 3 countries and their capitals. 1. 2. 3.\n1.  China - Beijing 2.  Brazil - Bras', matched_stop=None)], created=1730012328, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=64, prompt_tokens=9, total_tokens=73, prompt_tokens_details=None))


### Parameters

The completions API accepts the following parameters:

- `model`: The model identifier to use for completion
- `prompt`: Input text to generate completions for. Can be a string, array of strings, or token arrays
- `best_of`: Number of completions to generate server-side; only the best one is returned
- `echo`: If true, the prompt will be included in the response
- `frequency_penalty`: Penalizes new tokens based on their frequency in the text so far (-2.0 to 2.0)
- `logit_bias`: Modify the likelihood of specified tokens appearing in the completion
- `logprobs`: Include log probabilities of tokens in the response
- `max_tokens`: Maximum number of tokens to generate in the response (default: 16)
- `n`: Number of completion choices to generate
- `presence_penalty`: Penalizes new tokens based on their presence in the text so far (-2.0 to 2.0)
- `seed`: Random seed for deterministic results
- `stop`: Sequences where the API will stop generating further tokens
- `stream`: If true, partial completion deltas will be sent as they become available
- `stream_options`: Additional options for streaming responses
- `suffix`: Text to append to the completion
- `temperature`: Controls randomness (0-2). Lower values make output more focused and deterministic
- `top_p`: Alternative to temperature. Controls diversity via nucleus sampling
- `user`: A unique identifier representing your end-user
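
To make the effect of `stop` concrete, the sketch below reproduces it on the client side. It is only an illustration of the semantics, not how the server implements stopping, and it assumes the stop sequence itself is excluded from the returned text, as in the OpenAI API. The function name `truncate_at_stop` is hypothetical.

```python
def truncate_at_stop(text: str, stop: list[str]) -> str:
    """Cut text at the earliest occurrence of any stop sequence (sequence excluded)."""
    cut = len(text)
    for seq in stop:
        if not seq:  # ignore empty stop strings
            continue
        idx = text.find(seq)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


sample = "Once upon a time.\n\nSecond paragraph. THE END"
truncated = truncate_at_stop(sample, ["\n\n", "THE END"])
print(repr(truncated))
```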

Here is an example of a detailed completions request:

In [42]:
response = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    prompt="Write a short story about a space explorer.",
    temperature=0.7,  # Moderate temperature for creative writing
    max_tokens=150,  # Longer response for a story
    top_p=0.9,  # Balanced diversity in word choice
    stop=["\n\n", "THE END"],  # Multiple stop sequences
    presence_penalty=0.3,  # Encourage novel elements
    frequency_penalty=0.3,  # Reduce repetitive phrases
    n=1,  # Generate one completion
    seed=123,  # For reproducible results
    stream=True,  # Stream the response
)

for chunk in response:
    print(chunk.choices[0].text or "", end="")

 Space explorer, Captain Orion Blackwood, had been traveling through the galaxy for 12 years, searching for a new home for humanity. His ship, the Aurora, had been his home for so long that he barely remembered what it was like to walk on solid ground.
As he navigated through the dense asteroid field, the ship's computer, S.A.R.A. (Self-Aware Reasoning Algorithm), alerted him to a strange reading on one of the asteroids. Captain Blackwood's curiosity was piqued, and he decided to investigate further.
"Captain, I'm detecting unusual energy signatures emanating from the asteroid," S.A.R.A. said. "It's unlike anything I've seen before."
Captain Blackwood's eyes narrowed as

## Batches

SGLang implements the batches API for both chat completions and completions. You upload your requests in a `jsonl` file, create a batch job, and retrieve the results once the job has completed (which takes longer but costs less).

The batches APIs are:

- `batches`
- `batches/{batch_id}/cancel`
- `batches/{batch_id}`

Here is an example of a batch job for chat completions; completions work similarly.


In [43]:
import json
import time
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="None")

requests = [
    {
        "custom_id": "request-1",
        "method": "POST",
        "url": "/chat/completions",
        "body": {
            "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
            "messages": [
                {"role": "user", "content": "Tell me a joke about programming"}
            ],
            "max_tokens": 50,
        },
    },
    {
        "custom_id": "request-2",
        "method": "POST",
        "url": "/chat/completions",
        "body": {
            "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
            "messages": [{"role": "user", "content": "What is Python?"}],
            "max_tokens": 50,
        },
    },
]

input_file_path = "batch_requests.jsonl"

with open(input_file_path, "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

with open(input_file_path, "rb") as f:
    file_response = client.files.create(file=f, purpose="batch")

batch_response = client.batches.create(
    input_file_id=file_response.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print(f"Batch job created with ID: {batch_response.id}")

Batch job created with ID: batch_03d7f74f-dffe-4c26-b5e7-bb9fb5cb89ff


In [44]:
while batch_response.status not in ["completed", "failed", "cancelled"]:
    time.sleep(3)
    print(f"Batch job status: {batch_response.status}...trying again in 3 seconds...")
    batch_response = client.batches.retrieve(batch_response.id)

if batch_response.status == "completed":
    print("Batch job completed successfully!")
    print(f"Request counts: {batch_response.request_counts}")

    result_file_id = batch_response.output_file_id
    file_response = client.files.content(result_file_id)
    result_content = file_response.read().decode("utf-8")

    results = [
        json.loads(line) for line in result_content.split("\n") if line.strip() != ""
    ]

    for result in results:
        print(f"\nRequest {result['custom_id']}:")
        print(f"Response: {result['response']}")

    print("\nCleaning up files...")
    # Only delete the result file ID since file_response is just content
    client.files.delete(result_file_id)
else:
    print(f"Batch job failed with status: {batch_response.status}")
    if hasattr(batch_response, "errors"):
        print(f"Errors: {batch_response.errors}")

Batch job status: validating...trying again in 3 seconds...
Batch job completed successfully!
Request counts: BatchRequestCounts(completed=2, failed=0, total=2)

Request request-1:
Response: {'status_code': 200, 'request_id': 'request-1', 'body': {'id': 'request-1', 'object': 'chat.completion', 'created': 1730012333, 'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'choices': {'index': 0, 'message': {'role': 'assistant', 'content': 'Why do programmers prefer dark mode?\n\nBecause light attracts bugs.'}, 'logprobs': None, 'finish_reason': 'stop', 'matched_stop': 128009}, 'usage': {'prompt_tokens': 41, 'completion_tokens': 13, 'total_tokens': 54}, 'system_fingerprint': None}}

Request request-2:
Response: {'status_code': 200, 'request_id': 'request-2', 'body': {'id': 'request-2', 'object': 'chat.completion', 'created': 1730012333, 'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'choices': {'index': 0, 'message': {'role': 'assistant', 'content': '**What is Python?**\n\nPython is a high
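
The result lines printed above can be reduced to a mapping from `custom_id` to the assistant's reply. This is a minimal sketch with a hypothetical helper name, `extract_replies`. It assumes the line layout seen in the output above (`custom_id`, then `response` → `body` → `choices`), and it accepts `choices` either as a single dict, as printed above, or as a list, as in the OpenAI format.

```python
import json


def extract_replies(result_content: str) -> dict[str, str]:
    """Map each custom_id to the assistant text found in a batch result file."""
    replies = {}
    for line in result_content.split("\n"):
        if not line.strip():
            continue
        result = json.loads(line)
        choices = result["response"]["body"]["choices"]
        # Accept both a single choice dict and a list of choices.
        first = choices[0] if isinstance(choices, list) else choices
        replies[result["custom_id"]] = first["message"]["content"]
    return replies


# Hand-written sample line so the sketch runs without a server.
sample_line = json.dumps(
    {
        "custom_id": "request-1",
        "response": {
            "status_code": 200,
            "body": {
                "choices": {
                    "index": 0,
                    "message": {"role": "assistant", "content": "Because light attracts bugs."},
                }
            },
        },
    }
)
replies = extract_replies(sample_line + "\n")
print(replies)
```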

A batch job can take a while to complete. You can use the following two APIs to retrieve the job's status or to cancel it.

1. `batches/{batch_id}`: Retrieve the batch job status.
2. `batches/{batch_id}/cancel`: Cancel the batch job.
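
Repeated status checks can be wrapped in a small polling helper with a timeout. The sketch below is an assumption-laden illustration: `wait_for_batch` is a hypothetical name, the terminal statuses are the ones used elsewhere in this notebook, and a fake status sequence stands in for the server so the code runs on its own.

```python
import time
from types import SimpleNamespace
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_batch(retrieve: Callable[[], object], interval: float = 3.0, timeout: float = 600.0):
    """Call `retrieve()` until the job reaches a terminal status or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        job = retrieve()
        if job.status in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"batch still '{job.status}' after {timeout} seconds")
        time.sleep(interval)


# Fake status sequence so the sketch runs without a server.
_statuses = iter(["validating", "in_progress", "completed"])
job = wait_for_batch(lambda: SimpleNamespace(status=next(_statuses)), interval=0.0)
print(job.status)
```

Against a running server, the callable would be something like `lambda: client.batches.retrieve(batch_job.id)`.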

Here is an example of checking the batch job status.

In [45]:
import json
import time
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="None")

requests = []
for i in range(100):
    requests.append(
        {
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/chat/completions",
            "body": {
                "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
                "messages": [
                    {
                        "role": "system",
                        "content": f"{i}: You are a helpful AI assistant",
                    },
                    {
                        "role": "user",
                        "content": "Write a detailed story about topic. Make it very long.",
                    },
                ],
                "max_tokens": 500,
            },
        }
    )

input_file_path = "batch_requests.jsonl"
with open(input_file_path, "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

with open(input_file_path, "rb") as f:
    uploaded_file = client.files.create(file=f, purpose="batch")

batch_job = client.batches.create(
    input_file_id=uploaded_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print(f"Created batch job with ID: {batch_job.id}")
print(f"Initial status: {batch_job.status}")

time.sleep(10)

max_checks = 5
for i in range(max_checks):
    batch_details = client.batches.retrieve(batch_id=batch_job.id)
    print(f"Batch job details (check {i+1}/{max_checks}):")
    print(f"ID: {batch_details.id}")
    print(f"Status: {batch_details.status}")
    print(f"Created at: {batch_details.created_at}")
    print(f"Input file ID: {batch_details.input_file_id}")
    print(f"Output file ID: {batch_details.output_file_id}")

    print("Request counts:")
    print(f"Total: {batch_details.request_counts.total}")
    print(f"Completed: {batch_details.request_counts.completed}")
    print(f"Failed: {batch_details.request_counts.failed}")

    time.sleep(3)

Created batch job with ID: batch_6b9625ac-9ebc-4c4f-bfd5-f84f88b0100d
Initial status: validating
Batch job details (check 1/5):
ID: batch_6b9625ac-9ebc-4c4f-bfd5-f84f88b0100d
Status: in_progress
Created at: 1730012334
Input file ID: backend_input_file-8203d42a-109c-4573-9663-13b5d9cb6a2b
Output file ID: None
Request counts:
Total: 0
Completed: 0
Failed: 0
Batch job details (check 2/5):
ID: batch_6b9625ac-9ebc-4c4f-bfd5-f84f88b0100d
Status: in_progress
Created at: 1730012334
Input file ID: backend_input_file-8203d42a-109c-4573-9663-13b5d9cb6a2b
Output file ID: None
Request counts:
Total: 0
Completed: 0
Failed: 0
Batch job details (check 3/5):
ID: batch_6b9625ac-9ebc-4c4f-bfd5-f84f88b0100d
Status: in_progress
Created at: 1730012334
Input file ID: backend_input_file-8203d42a-109c-4573-9663-13b5d9cb6a2b
Output file ID: None
Request counts:
Total: 0
Completed: 0
Failed: 0
Batch job details (check 4/5):
ID: batch_6b9625ac-9ebc-4c4f-bfd5-f84f88b0100d
Status: completed
Created at: 1730012334
I

Here is an example of cancelling a batch job.

In [46]:
import json
import time
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="None")

requests = []
for i in range(500):
    requests.append(
        {
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/chat/completions",
            "body": {
                "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
                "messages": [
                    {
                        "role": "system",
                        "content": f"{i}: You are a helpful AI assistant",
                    },
                    {
                        "role": "user",
                        "content": "Write a detailed story about topic. Make it very long.",
                    },
                ],
                "max_tokens": 500,
            },
        }
    )

input_file_path = "batch_requests.jsonl"
with open(input_file_path, "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

with open(input_file_path, "rb") as f:
    uploaded_file = client.files.create(file=f, purpose="batch")

batch_job = client.batches.create(
    input_file_id=uploaded_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print(f"Created batch job with ID: {batch_job.id}")
print(f"Initial status: {batch_job.status}")

time.sleep(10)

try:
    cancelled_job = client.batches.cancel(batch_id=batch_job.id)
    print(f"Cancellation initiated. Status: {cancelled_job.status}")
    assert cancelled_job.status == "cancelling"

    # Monitor the cancellation process
    while cancelled_job.status not in ["failed", "cancelled"]:
        time.sleep(3)
        cancelled_job = client.batches.retrieve(batch_job.id)
        print(f"Current status: {cancelled_job.status}")

    # Verify final status
    assert cancelled_job.status == "cancelled"
    print("Batch job successfully cancelled")

except Exception as e:
    print(f"Error during cancellation: {e}")
    raise

finally:
    try:
        del_response = client.files.delete(uploaded_file.id)
        if del_response.deleted:
            print("Successfully cleaned up input file")
    except Exception as e:
        print(f"Error cleaning up: {e}")
        raise

Created batch job with ID: batch_3d2dd881-ad84-465a-85ee-6d5991794e5e
Initial status: validating
Cancellation initiated. Status: cancelling
Current status: cancelled
Batch job successfully cancelled
Successfully cleaned up input file


In [47]:
terminate_process(server_process)