# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an additional HTTP server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio

If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
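
Code such as `asyncio.run(main())`, which appears later in this document, fails by default when an event loop is already running in the current thread, as is the case inside IPython and Jupyter. The following standard-library sketch reproduces that error without SGLang; the coroutine names `inner` and `outer` are placeholders chosen for illustration.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, similar to a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` beforehand patches asyncio so that such nested calls are permitted.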

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Mike. I am a student in the 10th grade. I have a dream to become a doctor. I want to help sick people. I have a lot of homework to do on Saturdays. I have some friends who want to study too, but I don't want to spend a lot of time on them. 

The first week of summer vacation, I had a meeting with some of my friends, and I told them about my dream. They all agreed to help me. We had a nice dinner, and we talked about my dream and our homework. When the time came for the summer vacation, we decided to take a
Prompt: The president of the United States is
Generated text:  a politician and a member of the legislative branch of the federal government of the United States. They are elected by the people of the United States and serve a four-year term, during which time they are not able to be elected again. The president is the most powerful official in the executive branch of the federal government of the United States. They are the first choice of 

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [age] year old, and I have [number] years of experience in [industry]. I'm a [gender] and I'm [height] inches tall. I have [weight] pounds of body weight. I'm [eye color] and I have [hair color]. I'm [gender] and I have [hair color]. I'm [gender] and I have [hair color]. I'm [gender] and I have

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French Parliament building. Paris is a bustling metropolis with a rich cultural heritage and is a major tourist destination. The city is also known for its cuisine, including French cuisine, and its fashion industry. Paris is a

Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more sophisticated, it is likely to become more integrated with human intelligence. This could lead to more sophisticated forms of AI that can learn and adapt to new situations, and more human-like interactions with AI.

2. Greater emphasis on ethical considerations: As AI becomes more advanced, there will be a greater emphasis on ethical considerations. This could lead to more stringent regulations and guidelines for AI development and use, and a greater focus on ensuring that AI is used in a way that is fair, just, and beneficial for all.

3. Increased use of
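
The "overlap removal" named in the cell above can be pictured with a small sketch. It assumes that consecutive streamed chunks may repeat text that was already received, and it drops the repeated prefix before appending. The helper names `strip_overlap` and `merge_chunks` and the sample chunks are hypothetical; this is not necessarily how `stream_and_merge` is implemented.

```python
def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while removing text repeated across chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hypothetical chunks with overlapping boundaries (not real engine output).
sample_chunks = ["The capital", " capital of France", " France is Paris"]
merged_text = merge_chunks(sample_chunks)
print(merged_text)
```

Searching from the longest candidate overlap downward keeps the merged text free of duplicated fragments, at the cost of a scan per chunk.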



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I am a [occupation or hobby] who has [reason for being an expert in this field]. I have [number of years of experience] years of experience in this field, and I have always been passionate about [occupation or hobby]. I am [age] years old, and I live in [city or country]. I love [reason for my love for this field]. I have a goal to [what I want to achieve or accomplish in the field]. I am [gender] and I am [race] - [nationality or ethnicity]. What would you like to know about me? I am here to share

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre Dame Cathedral, and Louvre Museum. It is also home to the most populous city in Europe, with an estimated population of over 6 million people. The city is known for its historical signific
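
One reason to prefer the asynchronous API is that the event loop stays free for other coroutines while a request is pending. The sketch below shows that pattern with `asyncio.gather`. Here `fake_async_generate` is a hypothetical stand-in that only mimics the `{"text": ...}` result shape seen above, so the example runs without a model or GPU.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    """Hypothetical stand-in for an async engine call; returns one dict per prompt."""
    await asyncio.sleep(0.01)  # simulate waiting on the engine
    return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(log):
    """Unrelated work that runs while generation is pending."""
    for i in range(3):
        log.append(f"tick {i}")
        await asyncio.sleep(0)


async def main():
    log = []
    prompts = ["The capital of France is", "The future of AI is"]
    outputs, _ = await asyncio.gather(
        fake_async_generate(prompts, {"temperature": 0.8, "top_p": 0.95}),
        heartbeat(log),
    )
    return outputs, log


outputs, log = asyncio.run(main())
for output in outputs:
    print(output["text"])
```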

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I am an [insert your profession] who has a passion for [insert one or two words to describe your hobby or interest]. I enjoy [insert one or two things that you do for fun], and I am always up for a good [insert one or two words to describe your favorite thing to do]. What makes you unique? Can you share any interesting or surprising facts about yourself that you would like to share? I am excited to meet you! Let's get to know each other better! 😊😊😊

Hey there, my name is [Your Name]! I’m an [insert your profession]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Therefore, the answer is Paris.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  bright and promising, with numerous potential applications and advancements shaping the way we live, work, and communicate. Here are some potential trends in AI that are likely to shape the future:

1. Increased automation and AI-integrated systems: As AI technology continues to advance, we can expect to see more and more automation in various industries. This includes the integration of AI into manufacturing, healthcare, transportation, and more. These AI-integrated systems will likely become more sophisticated, with the ability to learn and adapt to new situations, increasing efficiency and productivity.

2. Enhanced human-computer interaction: AI is already making significant strides in enhancing human-computer
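
To print a stream without repeats, a consumer only needs to emit text it has not shown yet. The sketch below assumes, purely for illustration, that each streamed chunk carries the full text generated so far. `fake_stream` and `new_text_only` are hypothetical names, and the real `async_stream_and_merge` helper may work differently.

```python
import asyncio


async def fake_stream():
    """Hypothetical stand-in stream: each chunk holds the full text so far."""
    for text in [" Paris", " Paris.", " Paris.\n\nTherefore"]:
        await asyncio.sleep(0)
        yield {"text": text}


async def new_text_only(stream):
    """Yield only the part of each cumulative chunk that is new."""
    seen = 0
    async for chunk in stream:
        text = chunk["text"]
        if len(text) > seen:
            yield text[seen:]
            seen = len(text)


async def main():
    pieces = []
    async for piece in new_text_only(fake_stream()):
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()
    return pieces


pieces = asyncio.run(main())
```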




```python
llm.shutdown()
```
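
In a standalone script, it can help to guarantee that `shutdown()` runs even when generation raises. A minimal sketch using `try`/`finally` follows. `StubEngine` and `run_batch` are hypothetical; the stub only mirrors the `generate` and `shutdown` method names used in this document, so the example runs without a model.

```python
class StubEngine:
    """Hypothetical stand-in exposing only generate() and shutdown()."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


def run_batch(engine, prompts, sampling_params):
    """Run one batch and always release the engine afterwards."""
    try:
        return engine.generate(prompts, sampling_params)
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    run_batch(engine, [], {"temperature": 0.8})
except ValueError as exc:
    print(f"generation failed: {exc}")
print("engine shut down:", engine.is_shut_down)
```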