# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note: to use the **Offline Engine** in IPython, or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
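
For context, the snippet below is a minimal standard-library sketch (it does not use SGLang) of the restriction that `nest_asyncio` works around: `asyncio.run` refuses to start when an event loop is already running, which is the situation inside an IPython session. The coroutine names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like code
    # executed in an IPython cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: the standard library rejects this
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` beforehand patches `asyncio` so that nested calls of this kind are permitted.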

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Terence, and I'm 26 years old. I'm a computer programmer. I'm very good at it. I enjoy solving problems. I enjoy being an independent person and learning by myself. I like to make code and program. 

I graduated from university in 2012 and got my first job in 2013. I'm now an independent programmer. 

What are my strengths and weaknesses in terms of programming? What are my career aspirations? What do you think is the most important thing to be a programmer? Please give me some examples. 
I am curious about your thoughts on the following question
Prompt: The president of the United States is
Generated text:  a member of the executive branch of the government.
A. 错误
B. 正确
答案:

B

The president of the United States has no power to appoint or remove members of the executive branch.
A. 错误
B. 正确
答案:

A

风力发电机组应在具备____及以上环境条件下运行，风力发电机组的发电效率相对较高。
A. 6级
B. 5级
C. 4级
D. 3级
答案:

A

____是机场的主体，是机场的经济来源。
A. 旅客

Prompt: The capital of France is
Generated te

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic Eiffel Tower and the annual Eiffel Tower Festival. It is the largest city in France and the second-largest city in the European Union. Paris is a cultural and historical center with a rich history dating back to the Roman Empire and the French Revolution. The city is known for its vibrant nightlife, art, and cuisine, and is a popular tourist destination. It is also home to many world-renowned museums, including the Louvre and the Musée d'Orsay. Paris is a major transportation hub, with the Eiffel Tower serving as a landmark and the metro system serving as

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some potential trends that are likely to shape the future of AI:

1. Increased automation and artificial intelligence: As AI becomes more advanced, it is likely to become more integrated into our daily lives, from the way we work to the way we communicate. This could lead to increased automation and artificial intelligence, which could potentially replace human workers in certain industries.

2. Improved privacy and security: As AI becomes more advanced, there is a risk that it could be used for malicious purposes, such as hacking or
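
The name `stream_and_merge` and the "overlap removal" message printed above suggest that text repeated between successive streamed chunks is dropped before the chunks are concatenated. The following is a minimal sketch of that idea, not the SGLang implementation: `trim_overlap_sketch` and `merge_chunks_sketch` are hypothetical helpers, and the `fake_stream` list stands in for a real engine stream.

```python
def trim_overlap_sketch(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found, keep the chunk unchanged


def merge_chunks_sketch(chunks):
    """Concatenate chunks, removing text that overlaps what is already merged."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats the tail of the previous one.
fake_stream = ["The capital", " capital of France", " France is Paris."]
merged_text = merge_chunks_sketch(fake_stream)
print(merged_text)
```

A suffix/prefix heuristic like this one cannot tell an accidental overlap from text that the model legitimately repeats, so it is best read as an illustration of the concept rather than a drop-in replacement.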



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Alex. I'm a computer programmer with a degree in computer science from University College Dublin. I work as a software engineer at a leading tech company, and I enjoy solving complex problems and working with big data. I'm always looking for new opportunities to learn and grow, and I'm eager to stay up to date with the latest technologies and trends in the field. I have a talent for problem-solving and a passion for innovation, and I'm always willing to go the extra mile to help others achieve their goals. So, if you're looking for a reliable and talented software engineer to work with, I'm your guy! #self-int

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its rich history, unique architecture, and vibrant cultural scene. It is also known as the "City of Light" due to its iconic Eiffel Tower
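
To illustrate the asynchronous calling pattern without requiring a GPU, the sketch below swaps the engine for a hypothetical `StubEngine` whose `async_generate` merely sleeps and echoes the prompt. Only the `await ... async_generate(...)` shape mirrors the cell above; the stub class, its delay, and its output text are assumptions made for illustration. It shows how several independent requests can be awaited together with `asyncio.gather`, so that their waiting time overlaps.

```python
import asyncio
import time


class StubEngine:
    """Hypothetical stand-in for an engine; not part of SGLang."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.1)  # pretend to do inference
        return {"text": f" <completion for: {prompt}>"}


async def run_concurrently(engine, prompts, sampling_params):
    # Create one coroutine per prompt and await them together;
    # gather returns results in the same order as the inputs.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


prompts = ["Hello, my name is", "The capital of France is"]
start = time.perf_counter()
outputs = asyncio.run(run_concurrently(StubEngine(), prompts, {"temperature": 0.8}))
elapsed = time.perf_counter() - start

for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
print(f"Elapsed: {elapsed:.2f}s")
```

With the real engine, passing the whole list of prompts to a single `async_generate` call, as in the cell above, already hands batching to the engine; the `gather` pattern is mainly relevant when requests originate from separate coroutines.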

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name] and I am a [Age] year old, [Occupation] [Your occupation] [Your profession]. I have [number] years of experience in [Your field] and am always looking to learn new things. I have a passion for [your passion] and always strive to make a positive impact. I have a clean, organized life, enjoy music, and have a loyal following. I'm always looking for ways to stay fresh in my routine, and I'm always eager to try something new. I'm a [favorite hobby] who loves to [favorite hobby] and spend a lot of time doing it



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

To elaborate on this statement, Paris is the largest city in France and serves as the capital city for the country. It is located in the south-western part of the country and is known for its rich history, fashion, and culture. The city is also home to numerous museums, art galleries, and historic landmarks, such as the Eiffel Tower and the Notre-Dame Cathedral. In addition to its historical and cultural significance, Paris is also known for its vibrant and diverse nightlife, making it a popular destination for tourists and locals alike. Overall, Paris is a city that is both iconic and unparalleled, and continues to



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be marked by many different trends, depending on the specific developments and research in the field. Here are a few possibilities:

1. Increased focus on ethical AI: As the impact of AI on society becomes more apparent, there will be an increased focus on ethical AI practices. This will involve creating AI systems that are transparent, accountable, and responsible for their actions.

2. Autonomous vehicles: Autonomous vehicles are likely to become increasingly common in the coming years, especially in cities and in rural areas. These vehicles will be able to navigate and drive themselves, which will require AI-based systems that are highly sophisticated.

3. Increased use of AI
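
As a GPU-free illustration of the consumption pattern used in this section, the sketch below replaces the engine stream with a hypothetical async generator, `fake_token_stream`, and collects its chunks with `async for`. It is not SGLang code, and it assumes that every chunk is already a clean increment; for a real engine, the cell above relies on `async_stream_and_merge`, which its own comment describes as overlap-aware.

```python
import asyncio


async def fake_token_stream(chunks, delay=0.01):
    """Hypothetical stand-in for a streaming engine call."""
    for chunk in chunks:
        await asyncio.sleep(delay)  # simulate waiting for the next token
        yield chunk


async def collect_stream(stream):
    """Print chunks as they arrive and return the concatenated text."""
    pieces = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        pieces.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(pieces)


chunks = [" Paris", ".", " It", " is", " the", " capital", " of", " France", "."]
full_text = asyncio.run(collect_stream(fake_token_stream(chunks)))
```

Because each chunk is printed with `end=""`, the text appears on one line as it is produced, which is how the streamed output reads in a terminal.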




```python
llm.shutdown()
```