# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you use the **Offline Engine** in IPython or in other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
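
The patch is needed because an event loop that is already running (as in a notebook kernel) normally refuses to start a second one. The sketch below reproduces that refusal with the standard library only; `inner` and `outer` are illustrative names, and `outer` stands in for code that already executes inside a loop.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempt to start a nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Without a patch such as `nest_asyncio.apply()`, the nested `asyncio.run` call is rejected with a `RuntimeError`.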

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

[2025-12-29 02:28:40] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-29 02:28:40] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-29 02:28:40] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-29 02:28:42] INFO server_args.py:1564: Attention backend not specified. Use fa3 backend by default.
[2025-12-29 02:28:42] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.51it/s]
Capturing batches (bs=1 avail_mem=74.55 GB): 100%|██████████| 20/20 [00:01<00:00, 15.97it/s]


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Susan and I live in the city. I like to travel. I have a lot of friends who go to the city. They like to travel, too. They are all very nice people. We often go on trips together. I go to the city by car. My friend goes to the city by train. She likes to take a train trip. One day, she said, "I am going to take a train to the mountains in the mountains. I am going to get a lot of exercise on the train. I am going to eat lots of fruits and vegetables to stay healthy. " I am going to the mountains in the
Prompt: The president of the United States is
Generated text:  seeking to establish a policy that may impact the United States’ relationship with Iran, as they have been notorious for their nuclear program. The president has been in an emotional state and has been very distraught about the situation, leading to a sense of urgency. Is it possible for the president to implement a policy that would lead to significant consequences, such as causing 
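
For offline batch jobs it is often convenient to persist the results rather than print them. The following is a minimal sketch, assuming each output is a dict with a `text` field as in the loop above. `FakeEngine` and `write_jsonl` are hypothetical helpers introduced only so the snippet runs without a GPU; in a real run the outputs would come from `llm.generate`.

```python
import io
import json


class FakeEngine:
    """Hypothetical stand-in; it only mimics the list-of-dicts output shape."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def write_jsonl(prompts, outputs, fh):
    """Write one JSON object per line, pairing each prompt with its text."""
    for prompt, output in zip(prompts, outputs):
        fh.write(json.dumps({"prompt": prompt, "text": output["text"]}) + "\n")


prompts = ["Hello, my name is", "The capital of France is"]
outputs = FakeEngine().generate(prompts, {"temperature": 0.8, "top_p": 0.95})

buffer = io.StringIO()  # swap for open("results.jsonl", "w") in a real run
write_jsonl(prompts, outputs, buffer)

# Read the records back to confirm the round trip.
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(records[0]["prompt"])
```

One record per line keeps the file appendable, so a long batch job can be resumed without rewriting earlier results.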

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your interests and experiences. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your interests and experiences. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your interests and experiences. What can you tell me about yourself? [Name] is a [job title] at [company name]. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also home to the French Parliament, the French National Library, and the French Academy of Sciences. Paris is a bustling city with a rich cultural heritage and is a popular tourist destination. The city is also known for its cuisine, including French cuisine, and is home to many museums and art galleries. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly. The city is also known for its annual festivals and events, including the Eiffel Tower Festival and the Louvre

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies will continue to improve, leading to more sophisticated and accurate AI systems that can perform a wide range of tasks with increasing accuracy and efficiency. Some potential future trends in AI include:

1. Increased integration with other technologies: As AI becomes more integrated with other technologies, such as sensors, IoT devices, and blockchain, it will become even more powerful and capable. This integration will allow AI systems to learn from a wider range of data and make more informed decisions.

2. Enhanced privacy and security: As AI systems become more sophisticated
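
The "overlap removal" mentioned in the cell above can be pictured as follows. This is only a sketch of the general idea, not the implementation behind `stream_and_merge`: the helper names and the sample chunks are assumptions made for illustration. Each new chunk is compared against the text collected so far, and any prefix of the chunk that already ends the collected text is dropped.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Accumulate streamed chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks: the second one repeats the tail of the first.
chunks = [" Paris is the", " is the capital", " of France."]
merged = merge_chunks(chunks)
print(merged)
```

A purely textual check like this cannot tell an accidental overlap from a legitimate repetition (for example a model that really emits the same word twice), so it is best treated as a heuristic.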



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a/an [age] year old [gender] who is in the process of [occupation or skill], [insert the skill or profession], and has an interest in [describe an activity or hobby]. I am always [describe a trait or quality] and [insert an example of how you use this trait]. What kind of person are you? I am a/an [insert what type of person you are], and my personality type is [insert personality type]. I enjoy [describe an activity or hobby], [insert why you enjoy this activity or hobby]. I also have an interest in [describe an activity or

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest and most populous city in the European Union, with a population of over 2.2 million people. Paris is known for its beautiful architecture, world-class museums, and rich culinary scene. It is home to m
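
An awaitable generation call is mainly useful when generation has to share an event loop with other work. As a rough sketch of that pattern, the snippet below issues several requests concurrently with `asyncio.gather`. `fake_async_generate` is a hypothetical stand-in that only imitates the `{"text": ...}` output shape so the example runs without an engine.

```python
import asyncio


async def fake_async_generate(prompt, sampling_params):
    """Hypothetical stand-in for an awaitable generation call."""
    await asyncio.sleep(0.01)  # simulate waiting on the engine
    return {"text": f" <completion for: {prompt}>"}


async def run_all(prompts, sampling_params):
    # Schedule every request at once; gather returns results in input order.
    tasks = [fake_async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


prompts = ["Hello, my name is", "The capital of France is"]
outputs = asyncio.run(run_all(prompts, {"temperature": 0.8, "top_p": 0.95}))
for prompt, output in zip(prompts, outputs):
    print(prompt, "->", output["text"])
```

Because `asyncio.gather` preserves input order, the results can still be paired with their prompts using `zip`.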

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name] and I am [Your Profession]! I specialize in [Your Specialty or Expertise]. My goal is to help people like you understand the world around us and make it a better place. I believe that through my unique approach to teaching and communication, I can make a positive impact on the lives of others. I am passionate about using my knowledge and skills to inspire and motivate people to think critically and creatively. I am excited to embark on this journey with you and bring my expertise and experiences to your personal growth. Let's connect and learn more about each other! [Your Name] [Your Profession] [Your Specialty



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Task: Translate the French sentence "la ville de Paris est la capitale de la France" into English.
Translation: The capital of France is Paris.

Explanation: This is a simple translation of the given French sentence to English. It maintains the same meaning and structure as the original. The key components are the verb "est" (is) and the article "la" (the). The plural subject "villes" (villages) is translated to "villes" in English, which agrees in gender with "villages" in the French sentence. The French word "capital" is translated to



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain and complex, and there is no one clear trend. However, there are a few possible trends that are likely to shape AI in the coming years:

1. Increased focus on ethical and social implications: As AI becomes more prevalent in everyday life, there will be an increased focus on its impact on society and the people who use and rely on it. This could lead to more stringent regulations and standards, as well as greater awareness and responsibility around the development and use of AI.

2. Development of more advanced and flexible AI: As AI continues to advance, there may be an increased focus on developing more advanced and flexible AI that can
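
When streaming, it is common to show chunks as they arrive and also keep the full text for later use. A minimal sketch of that consumer loop is shown below. `fake_stream` is a hypothetical async generator with made-up chunks; in the cell above its role is played by `async_stream_and_merge`.

```python
import asyncio


async def fake_stream(prompt):
    """Hypothetical async generator that yields text chunks for a prompt."""
    for chunk in [" Paris", ".", " It", " is", " in", " France", "."]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield chunk


async def collect(prompt):
    pieces = []
    async for chunk in fake_stream(prompt):
        print(chunk, end="", flush=True)  # show progress as chunks arrive
        pieces.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(pieces)


full_text = asyncio.run(collect("The capital of France is"))
```

Collecting the pieces in a list and joining once at the end avoids repeated string concatenation on long generations.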




In [6]:
llm.shutdown()
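
In a script, as opposed to a notebook, it can help to guarantee that `shutdown()` runs even when generation fails. One possible pattern is a small context manager, sketched here under the assumption that the engine only needs a `shutdown()` call for cleanup. `FakeEngine` and `managed` are hypothetical names used so the example runs without a GPU.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in exposing only the shutdown() call used above."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def managed(engine):
    """Yield the engine and always call shutdown(), even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed(engine) as llm:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass  # the error is handled here; cleanup already ran

print(engine.closed)
```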