# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), you need to add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
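
A likely reason for this requirement is that environments such as IPython already run an event loop, and the standard library refuses to start a second one inside it. The sketch below reproduces that kind of failure with plain `asyncio` and no SGLang at all. The names `inner` and `outer` are hypothetical and exist only for illustration, and the exact error raised inside the engine may differ.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are now inside a running event loop, much like a notebook cell.
    coro = inner()
    try:
        # Starting a second loop from inside a running one is rejected.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # discard the coroutine that was never started
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches the loop so that such nested calls are allowed, which is why it is applied before the engine is used in these environments.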

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Xiaoxue. I'm a senior 2 at a middle school in China, and I'm studying a subject called high school Chinese. Now I'm very interested in reading classic literature. My reading list is as follows:

1. A Dream of Red Mansions
2. The Scholars
3. Water Margin
4. Romance of the Three Kingdoms
5. The Great Illusion

I'm very curious about the academic value of reading classic literature. What do you think? I'd like to know more about it. If possible, I'd like to learn about it through different learning methods such as discussion groups or reading groups
Prompt: The president of the United States is
Generated text:  a political leader who has the power to appoint, remove, and supervise the work of other government officials. He also has the power to grant pardons, to declare war, to maintain the Union, to protect the rights of citizens, to execute laws passed by Congress, and to declare war. Can you tell me what is the main function of the president i

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic Eiffel Tower, Notre-Dame Cathedral, and diverse cultural scene. 

This statement encapsulates the main features and attractions of Paris, highlighting its historical significance, architectural landmarks, and cultural richness. It provides a brief overview of the city's importance in French culture and society. 

To ensure accuracy, the statement should be clear and concise, using appropriate terminology and avoiding any potentially sensitive information. It should also be well-structured, with a logical flow from the introduction to the main point. 

Please provide the statement in a markdown format, including the capital city, its main features, and a brief explanation

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and decision-making processes.

2. Enhanced privacy and security: As AI becomes more integrated with human intelligence, there will be a need for greater privacy and security measures to protect against data breaches and other forms of cyber threats.

3. Increased focus on ethical considerations: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical considerations and the responsible development and use of AI.

4. Greater reliance on machine learning: As AI becomes
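
The "overlap removal" mentioned in the cell above can be pictured as follows: when a streamed chunk begins with text that has already been received, only the new part is appended. The code below is a minimal sketch of that idea in plain Python. The helpers `strip_overlap` and `merge_chunks` and the sample stream are hypothetical, and this is not claimed to be how `stream_and_merge` is implemented.

```python
from typing import Iterable


def strip_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that is already a suffix of `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks, skipping text that was already received."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hypothetical stream in which consecutive chunks overlap.
stream = ["The capital", "capital of France", "France is Paris."]
print(merge_chunks(stream))
```

One limitation of this heuristic is that it cannot tell a genuine repetition from an overlap, so a chunk that legitimately repeats the end of the previous text would be shortened.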



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text: ... [Name], and I'm a/an [age] year old. I'm a/an [occupation] [career], but I've always been an [extraordinary trait]. I enjoy [something I do well, such as [activity], [substance], or [imagination].] I'm [job title] [character traits], and I always strive to [something I do well, such as [challenge], [skill], or [success].] And I'm always eager to [what, such as [future], [challenge], or [improve].]

I hope you can relate to my unique characteristics and dedication to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

This statement is factually accurate. Paris, officially known as the "City of Love" due to its iconic Eiffel Tower, is the largest city in France and the second largest city in the European Union by population. It is home to the French Parliament, the Eiffel Tower, the Louvre Museum, and 
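
A practical benefit of an asynchronous call is that the caller can await several requests at once instead of one after another. The sketch below illustrates that pattern with `asyncio.gather`. It does not use SGLang: `StubAsyncEngine` is a hypothetical stand-in whose `async_generate` takes a single prompt and sleeps instead of running a model, so its signature and return value are assumptions made for this example only.

```python
import asyncio
import time


class StubAsyncEngine:
    """Hypothetical stand-in for an engine; it sleeps instead of running a model."""

    async def async_generate(self, prompt: str, sampling_params: dict) -> dict:
        await asyncio.sleep(0.1)  # simulated inference latency
        return {"text": f" <completion for: {prompt}>"}


async def run_concurrently(engine, prompts, sampling_params):
    # Create one coroutine per prompt and wait for all of them together.
    # gather returns the results in the same order as the prompts.
    pending = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*pending)


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
start = time.perf_counter()
outputs = asyncio.run(
    run_concurrently(StubAsyncEngine(), prompts, {"temperature": 0.8})
)
elapsed = time.perf_counter() - start

for prompt, output in zip(prompts, outputs):
    print(f"{prompt!r} -> {output['text']!r}")
print(f"elapsed: {elapsed:.2f}s")
```

With the stub, the three simulated requests overlap, so the total time stays close to one request's latency rather than the sum of all three.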

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am [Your Age/Position]. I am a [Your Profession] with [Your Education/Experience] and [Your Passion/Interest]. I have always been driven to do great things and pursue my passions. I believe in using my unique skills and talents to make a difference in the world. I am passionate about [Your Passion/Interest], and I am excited to be here today and share my experiences and knowledge with others. Thank you for having me. Good luck with your career! I'm glad to be here with you. What is your passion or interest? As an AI language model, I don

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the largest and most populous city, located in the south of the country and known as the "City of Light."

Annotate the given information using the following format: "The capital of France is Paris, located in [country] and known as the [city name]."

These sentences accurately summarize the information provided and provide a concise statement of the capital city's location and historical significance. The sentence uses the correct format and includes the country and city name for clarity and professionalism. Additionally, it includes a brief note about the city's nickname, "City of Light, " which is often used to describe the vibrant and modern

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain, but here are some possible trends we can expect to see in the coming years:

1. Increased autonomy: AI will become more intelligent and autonomous, allowing for more complex and efficient decision-making processes.

2. Increased ethics and transparency: AI will be developed to be more transparent, accountable, and ethical, and will be used in a variety of ways that are more beneficial to society.

3. Increased human influence: AI will become more integrated into our lives, with AI making more decisions that we wouldn't normally make, and humans playing a more significant role in AI development.

4. Increased focus on AI in education: There will


```python
llm.shutdown()
```
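
In a standalone script, it can help to make sure the engine is shut down even when generation raises an exception. The sketch below shows one way to do that with `try`/`finally`. `StubEngine` is a hypothetical stand-in that only records whether `shutdown()` was called; it is not the SGLang engine, and its `generate` signature is an assumption made for this example.

```python
class StubEngine:
    """Hypothetical stand-in that only records whether shutdown() was called."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def generate(self, prompt: str) -> dict:
        if not prompt:
            raise ValueError("prompt must not be empty")
        return {"text": f" <completion for: {prompt}>"}

    def shutdown(self) -> None:
        self.is_shut_down = True


engine = StubEngine()
try:
    engine.generate("")  # fails on purpose to exercise the cleanup path
except ValueError as exc:
    print(f"generation failed: {exc}")
finally:
    engine.shutdown()  # runs on success and on failure alike

print("shut down:", engine.is_shut_down)
```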