# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
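
As a rough illustration of the custom-server idea, the sketch below wraps an engine in a plain request handler. `EchoEngine` and `handle_request` are hypothetical stand-ins so the snippet runs without a GPU or SGLang installed. Only the `generate(prompt, sampling_params)` call shape and the `text` field of the result are taken from the examples in this document, and the single-prompt return value is assumed to mirror the list case. For a real implementation, refer to the linked `custom_server.py`.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs anywhere."""

    def generate(self, prompt, sampling_params=None):
        # A real engine would run the model; this stub just echoes the prompt.
        return {"text": f" (completion for: {prompt})"}


def handle_request(engine, body: str) -> str:
    """Turn a JSON request body into a JSON response using the engine."""
    payload = json.loads(body)
    output = engine.generate(payload["prompt"], payload.get("sampling_params"))
    return json.dumps({"text": output["text"]})


engine = EchoEngine()
response = handle_request(engine, json.dumps({"prompt": "Hello"}))
print(response)
```

Keeping the handler independent of any web framework means the same function could be called from whichever server layer you choose.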



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
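
For context, the patch matters because `asyncio.run()` refuses to start when an event loop is already running, which is the situation inside IPython or Jupyter. The stdlib-only sketch below reproduces that failure without SGLang; `inner` and `outer` are illustrative names, not part of any library.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running loop, similar to a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```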

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Hannah, I'm 14 years old. I have a difficult time focusing on my studies and I feel like I can't seem to concentrate on anything. I'm not good at music, I don't like the color red. I feel like I'm missing something in school and I'm not sure what. It's like I'm missing something in the curriculum that I don't understand. I don't really enjoy speaking and I'm not good at vocal practice. I'm pretty good at sports, I like to ride my bike, and I enjoy hiking. I'm a good reader, I can do math, I'm good with
Prompt: The president of the United States is
Generated text:  a political leader, and the president of the Philippines is also a political leader. Therefore, the president of the Philippines is the president of the United States. Is this argument valid? To determine if the argument is valid, we need to analyze whether it follows logically from the premises provided. The argument is as follows:

1. President of the United States (U. S. president
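
For offline batch jobs it is often convenient to persist each prompt with its completion. The sketch below shows one possible way to do that as JSON Lines. The `outputs` list here is a hand-written stand-in for the value returned by `llm.generate(prompts, sampling_params)`, and the `prompt`/`completion` field names are an arbitrary choice for illustration.

```python
import json

prompts = ["Hello, my name is", "The capital of France is"]
# Stand-in for llm.generate(...): one dict per prompt with a "text" field,
# matching how the example above reads output['text'].
outputs = [{"text": " Hannah."}, {"text": " Paris."}]

# Guard against silently dropping items when the lengths differ.
assert len(prompts) == len(outputs)

records = [
    {"prompt": prompt, "completion": output["text"]}
    for prompt, output in zip(prompts, outputs)
]
jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```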

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your profession or role]. I enjoy [insert a short description of your hobbies or interests]. What brings you to [company name] and what makes you a good fit for the position? I'm a [insert a short description of your personality or character traits]. I'm always looking for new challenges and opportunities to grow and learn. What do you think makes you a good fit for the position at [company name]?

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also the birthplace of French literature and cuisine, and is a major cultural and economic center. Paris is home to many world-renowned museums, including the Louvre, the Musée d'Orsay, and the Musée Rodin. It is also a popular tourist destination, with millions of visitors annually. Paris is known for its diverse culture, including its rich history, art, and cuisine, and is a major hub for international business and diplomacy. The city is also home to

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies are expected to continue to improve and become more integrated into our daily lives, from self-driving cars and robots to personalized medicine and virtual assistants. Additionally, there is a growing emphasis on ethical considerations and the responsible use of AI, as concerns about bias, privacy, and security continue to grow. As AI becomes more integrated into our daily lives, it is likely to have a significant impact on society and the economy, and will require ongoing development and improvement to ensure that it is used in a responsible and ethical manner.
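
The "overlap removal" mentioned in the example can be pictured as follows: when a streamed chunk repeats text that has already been received, only the new suffix is appended. The sketch below illustrates that idea on a hand-made stream. `drop_repeated_prefix`, `merge_chunks`, and `fake_stream` are hypothetical names, and this is not necessarily how `stream_and_merge` is implemented in SGLang.

```python
def drop_repeated_prefix(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that already ends existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks, skipping text that overlaps what we have."""
    merged = ""
    for chunk in chunks:
        merged += drop_repeated_prefix(merged, chunk["text"])
    return merged


# Hypothetical stream in which consecutive chunks repeat some text.
fake_stream = [
    {"text": " Paris"},
    {"text": " Paris is"},
    {"text": " is the capital"},
]
merged = merge_chunks(fake_stream)
print(repr(merged))
```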



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am [Age]. I am a [Occupation/Interest] who has always been fascinated by [Reason for Interest], and I am always eager to learn more about [Subject/Topic]. I enjoy meeting new people and exploring different cultures. I am always looking for opportunities to grow and learn, and I am always open to challenges and new experiences. What's your name? What's your age? What's your occupation or interest? What's your reason for being interested in the subject or topic? What are your hobbies and activities outside of work? How do you like to spend your free time? What are your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city and the seat of government of the country.

That's correct. Paris is the capital of France, serving as the nation's political, cultural, and economic cente
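
An asynchronous interface also makes it possible to issue several requests concurrently from your own coroutine code. The sketch below shows the general `asyncio.gather` pattern with a stub; `FakeAsyncEngine` and `generate_all` are hypothetical, and whether per-prompt calls are preferable to passing the whole list in one call (as the example above does) is not something this document establishes.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in exposing an async_generate coroutine."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0)  # yield control, as real asynchronous work would
        return {"text": f" <completion of {prompt!r}>"}


async def generate_all(engine, prompts, sampling_params):
    # One coroutine per prompt; gather returns results in input order.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


prompts = ["The capital of France is", "The future of AI is"]
results = asyncio.run(
    generate_all(FakeAsyncEngine(), prompts, {"temperature": 0.8})
)
for prompt, result in zip(prompts, results):
    print(prompt, "->", result["text"])
```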

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert fictional character's name]. I'm excited to share this humble introduction with you, and I hope that you'll take the time to meet me and learn more about my life and experiences.
I look forward to meeting you and helping you learn more about myself. Do you have any questions or would you like to learn more about my background and experiences? Let me know, and I'll be here to answer your questions and provide you with the information you need. I'm here to help, so don't hesitate to reach out. Have a great day! [Your Name] [Your Contact Information] [Your Email] [Your LinkedIn



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

However, please paraphrase the sentence "The capital of France is Paris" in the following sentence to make it more precise: "The central political and cultural center of France, Paris is the capital of the country."

Additionally, please provide an example sentence that uses this sentence structure. The city is famous for its historical landmarks, vibrant nightlife, and delicious cuisine.

Lastly, please provide a table that shows the population density of Paris and the nearest city to it in terms of population.

To make the task more challenging, please also provide an Excel table that shows the population of Paris, France and the nearest city to



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain and unpredictable. However, there are several trends that experts predict will shape the future of this rapidly evolving field:

1. Increased automation: AI is expected to continue automating tasks in industries such as manufacturing, transportation, and healthcare. This automation could result in job losses but also create new opportunities for human workers to perform tasks that are no longer possible through AI.

2. Personalized experiences: AI will continue to enable the creation of more personalized experiences for individuals. This includes the ability to tailor online advertisements to individual preferences and interests, as well as the ability to provide personalized healthcare recommendations based on an individual's health data.

3. Autonomous




```python
llm.shutdown()
```
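
In a script, an exception raised during inference would skip an explicit `shutdown()` call at the end of the file. One possible safeguard is a small context manager built on `try`/`finally`, sketched below. `FakeEngine` and `managed_engine` are hypothetical helpers for illustration; only the `shutdown()` call mirrors the engine method shown above.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in; only shutdown() mirrors the call shown above."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def managed_engine(factory):
    """Create an engine and guarantee shutdown() when the block exits."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()  # runs even if the body raises


try:
    with managed_engine(FakeEngine) as llm:
        raise ValueError("simulated failure during inference")
except ValueError:
    pass

print(llm.closed)
```

With a real engine, `factory` could be a callable such as `lambda: sgl.Engine(model_path=...)`.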