# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
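
To make the idea of a custom server concrete, here is a minimal sketch that uses only the Python standard library. It is not the linked `custom_server.py`; the names `EchoEngine`, `handle_generate`, `make_handler`, and `serve`, and the JSON request shape, are assumptions made for illustration. The only engine behavior it relies on is the one shown later in this document: `generate(prompts, sampling_params)` returns one dict with a `"text"` key per prompt.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" (echo) {p}"} for p in prompts]


def handle_generate(engine, body: bytes):
    """Turn a JSON request body into (status_code, JSON response bytes)."""
    try:
        payload = json.loads(body)
        prompts = payload["prompts"]
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected a JSON object with a 'prompts' list"}
        return 400, json.dumps(error).encode()
    if not isinstance(prompts, list) or not all(isinstance(p, str) for p in prompts):
        error = {"error": "'prompts' must be a list of strings"}
        return 400, json.dumps(error).encode()
    outputs = engine.generate(prompts, payload.get("sampling_params", {}))
    return 200, json.dumps({"texts": [o["text"] for o in outputs]}).encode()


def make_handler(engine):
    """Build a request handler class bound to one engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            status, response = handle_generate(engine, self.rfile.read(length))
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(response)

    return GenerateHandler


def serve(engine, host="127.0.0.1", port=8000):
    """Block forever, answering POST requests with engine completions."""
    HTTPServer((host, port), make_handler(engine)).serve_forever()
```

Keeping `handle_generate` separate from the HTTP wiring means the request logic can be exercised without opening a socket. In a real deployment you would pass an actual engine instance to `serve` instead of `EchoEngine`.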



## Nest Asyncio
To use the **Offline Engine** in IPython, or in any other code that already runs inside an event loop, add the following first:
```python
import nest_asyncio

nest_asyncio.apply()

```
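
If the same script has to run both as a plain program and inside a notebook, one option is to apply the patch only when an event loop is already running. The sketch below shows that check; the helper names `in_running_loop` and `apply_nest_asyncio_if_needed` are hypothetical and are not part of SGLang or `nest_asyncio`.

```python
import asyncio


def in_running_loop() -> bool:
    """Return True when called from code executing inside an event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # asyncio raises RuntimeError when no loop is running in this thread.
        return False
    return True


def apply_nest_asyncio_if_needed() -> bool:
    """Patch asyncio for nested use only when a loop is already running."""
    if not in_running_loop():
        return False
    import nest_asyncio  # imported lazily so plain scripts do not need it

    nest_asyncio.apply()
    return True
```

Called from a regular script, `apply_nest_asyncio_if_needed()` returns `False` and changes nothing; called from a notebook cell, it applies the patch and returns `True`.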

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Lucas and I am writing to you today to talk about my favorite place to go to shop and dine. My favorite place to go to shop and dine is the Côte d’Azur. The Côte d’Azur is a beautiful country located on the west coast of France. It is in the Mediterranean Sea, and it is bordered by the Mediterranean Sea to the east, the English Channel to the south, and the Atlantic Ocean to the west. The Côte d’Azur is a beautiful place to shop and dine.

The Côte d’Azur has an amazing culture. It has a rich history of being a
Prompt: The president of the United States is
Generated text:  a person. Is there a president in China?

A. No  
B. Yes  
C. Information is lacking  
D. Insufficient information
To determine whether the president of the United States is also a person, we need to consider the definitions provided and the nature of a president.

1. **Definition of a President**: The president of the United States is a formal elected official w
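
The engine already schedules a batch internally, so slicing the prompt list is not needed for throughput. It can still be convenient when a very large job should save partial results as it goes. The sketch below assumes only what the cell above shows, namely that `generate(prompts, sampling_params)` returns one dict with a `"text"` key per prompt. `FakeEngine`, `chunked`, and `generate_in_chunks` are hypothetical names used for illustration.

```python
from typing import Any, Dict, Iterator, List, Sequence


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine; returns one dict per prompt."""

    def generate(self, prompts: List[str], sampling_params: Dict[str, Any]):
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def generate_in_chunks(engine, prompts, sampling_params, chunk_size=2) -> List[str]:
    """Run the prompts slice by slice and return the texts in prompt order."""
    texts: List[str] = []
    for chunk in chunked(prompts, chunk_size):
        outputs = engine.generate(list(chunk), sampling_params)
        texts.extend(output["text"] for output in outputs)
        # A real job could write `texts` to disk here as a checkpoint.
    return texts
```

With a real engine you would pass `llm` in place of `FakeEngine()` and choose a chunk size much larger than 2.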

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Occupation]. I'm a [Skill] who has always been [What motivates you to be a [Skill]]. I'm passionate about [What's your favorite hobby or activity]. I'm always looking for new experiences and learning new things. I'm a [What's your favorite thing about [Occupation]] and I'm always eager to learn more about it. I'm a [What's your favorite [Occupation] activity] and I'm always looking for ways to improve my skills and knowledge. I'm a [What's your favorite [Occupation]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in Europe and the second-largest city in the world by population. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe. The city is also famous for its rich history, including the French Revolution, the French Revolution, and the French Revolution. Paris is a cultural and political center of France and a major tourist destination. It is home to many famous museums, theaters, and art galleries. The city is also known for its cuisine, including French cuisine, and its fashion industry. Paris is a vibrant and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI becomes more advanced, we can expect to see even more sophisticated applications in healthcare, such as personalized medicine, drug discovery, and disease diagnosis.

2. Increased use of AI in finance: AI is already being used in finance to improve risk management, fraud detection, and trading algorithms. As AI becomes more advanced
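
The phrase "overlap removal" refers to merging streamed chunks so that text repeated at a chunk boundary appears only once. The sketch below illustrates one straightforward way to do that. It is an independent illustration and should not be read as the implementation of `stream_and_merge`; the function names `trim_overlap` and `merge_chunks` are chosen here for the example.

```python
from typing import Iterable


def trim_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that already ends `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing text repeated across chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(merge_chunks(["Hello, wor", "world", "d!"]))  # prints "Hello, world!"
```

A purely textual check like this cannot tell an accidental overlap from a legitimate repetition: a chunk that genuinely starts with the characters that end the previous text would be shortened as well.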



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name], and I'm a [insert profession/role]! I'm passionate about [insert passion/focus on]. I enjoy [insert hobbies/activities that make me happy] and I'm always looking for new ways to improve my skills and knowledge. I'm a [insert any positive trait that makes me stand out, like being patient, friendly, or curious]. I'm excited to meet you and see what I can do for you! 🌟✨

Feel free to add any details that could help me get a sense of you or your background. [Insert any relevant information or anecdotes that could enhance your self-int

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located on the western coast of the country. It serves as the capital and largest city, with an estimated population of over 2.3 million people. Paris is known for its historical architecture, iconic landma
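
When prompts come from several callers rather than one list, you may want to issue them as separate awaitable requests while capping how many are in flight. The sketch below shows that pattern with `asyncio.Semaphore` and `asyncio.gather`. It assumes only the call shape used above, where `async_generate` takes a list of prompts and returns one dict per prompt. `FakeAsyncEngine` and `generate_bounded` are hypothetical names.

```python
import asyncio
from typing import Any, Dict, List


class FakeAsyncEngine:
    """Hypothetical stand-in for sgl.Engine with an async generate method."""

    async def async_generate(self, prompts: List[str], sampling_params: Dict[str, Any]):
        await asyncio.sleep(0)  # yield control, as a real request would
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


async def generate_bounded(engine, prompts, sampling_params, max_concurrency=2) -> List[str]:
    """Issue one request per prompt, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(prompt: str) -> str:
        async with semaphore:
            outputs = await engine.async_generate([prompt], sampling_params)
            return outputs[0]["text"]

    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(run_one(p) for p in prompts))
```

Passing the whole list in a single `async_generate` call, as in the cell above, is simpler when all prompts are known up front.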

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm [Age]. I'm a [occupation] with [type of work or career]. I'm [characteristic 1] in [characteristic 2]. I'm [characteristic 3]. I'm a [characteristic 4] personality type. I'm a [name of your favorite movie, book, TV show, etc.] and I enjoy [reason why]. I'm [name of your favorite hobby, sport, or activity]. I love [activity or hobby that brings me joy]. And I'm [characteristic 5]. I'm [characteristic 6]. I'm [name

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Please answer the following question based on the information provided in the passage:
What is the capital of France? To answer the question "What is the capital of France? ", I will follow these steps:

1. Identify the key information provided in the passage.
2. Extract the specific answer based on that information.

Step 1: The key information provided in the passage is:
"The capital of France is Paris."

Step 2: From this information, I can extract the specific answer to the question. The capital of France is Paris.

Therefore, the answer is: The capital of France is Paris.

However, it's

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be one of rapid progress, innovation, and change. Here are some possible future trends in AI:

1. Emphasis on ethical considerations: As the AI industry continues to evolve, it will become increasingly important to consider the ethical implications of AI. This will involve developing policies and regulations that will guide the development and use of AI, as well as addressing issues such as bias, accountability, and transparency.

2. Increased focus on sustainability: As concerns about climate change and environmental degradation become more acute, AI is likely to be seen as a key player in addressing these issues. This could lead to increased investment in AI technology that is designed

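The streaming loop in this section prints each chunk as it arrives. If you also need the complete text afterwards, you can accumulate the chunks while forwarding them to a callback. The sketch below works with any async iterator of strings. `fake_stream` is a hypothetical stand-in for `async_stream_and_merge(llm, prompt, sampling_params)`, and `collect_stream` is a helper name chosen for this example.

```python
import asyncio
from typing import AsyncIterator, Callable, List, Optional


async def fake_stream(text: str, size: int = 4) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming call; yields slices of `text`."""
    for start in range(0, len(text), size):
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield text[start:start + size]


async def collect_stream(
    chunks: AsyncIterator[str],
    on_chunk: Optional[Callable[[str], None]] = None,
) -> str:
    """Forward each chunk to `on_chunk` (if given) and return the full text."""
    parts: List[str] = []
    async for chunk in chunks:
        if on_chunk is not None:
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)
```

For example, `await collect_stream(stream, lambda c: print(c, end="", flush=True))` reproduces the live printing above while also returning the merged text.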

In [6]:
llm.shutdown()
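
In a script, an exception raised before the final cell would skip `llm.shutdown()`. One way to guard against that is a small context manager that always calls `shutdown()` on exit. The sketch below relies only on the `shutdown()` method shown above; `FakeEngine` and `managed_engine` are hypothetical names, and the fake engine exists so the example runs without a GPU.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion of {p!r}>"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and call its shutdown() even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


with managed_engine(FakeEngine()) as engine:
    outputs = engine.generate(["The capital of France is"], {"temperature": 0.8})

print(engine.is_shut_down)  # prints True
```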