# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed, working example is available as a Python script in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
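
As a rough illustration of the custom-server use case, the sketch below wraps an engine in a framework-agnostic request handler. `handle_generate_request` and the request and response field names (`prompts`, `sampling_params`, `texts`, `error`) are hypothetical, chosen for this example. The only engine behavior assumed is the `generate(prompts, sampling_params)` call and the `text` field of each output, as used in the batch inference cells of this document. The linked `custom_server` script remains the reference example.

```python
from typing import Any, Dict


def handle_generate_request(engine: Any, body: Dict[str, Any]) -> Dict[str, Any]:
    """Turn a decoded JSON request body into a JSON-serializable response.

    `engine` is any object exposing generate(prompts, sampling_params) that
    returns one dict with a "text" key per prompt. The request and response
    field names used here are hypothetical.
    """
    prompts = body.get("prompts")
    if not isinstance(prompts, list) or not all(isinstance(p, str) for p in prompts):
        return {"error": "'prompts' must be a list of strings"}
    if not prompts:
        # Nothing to generate; skip the engine call entirely.
        return {"texts": []}

    sampling_params = body.get("sampling_params") or {}
    outputs = engine.generate(prompts, sampling_params)
    return {"texts": [output["text"] for output in outputs]}
```

A web framework route would decode the JSON body, call this function with a shared engine instance, and serialize the returned dict.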



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, first apply `nest_asyncio`:
```python
import nest_asyncio

nest_asyncio.apply()

```
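
The patch is only relevant when an event loop is already running, as is the case in IPython and Jupyter. A minimal sketch of a guard that applies it conditionally follows. `in_running_loop` and `apply_nest_asyncio_if_needed` are hypothetical helper names, not part of SGLang or `nest_asyncio`.

```python
import asyncio


def in_running_loop() -> bool:
    """Return True if called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


def apply_nest_asyncio_if_needed() -> bool:
    """Apply nest_asyncio only when a loop is already running.

    Returns True if the patch was applied, False otherwise.
    """
    if not in_running_loop():
        return False
    import nest_asyncio  # third-party; imported lazily so plain scripts do not need it

    nest_asyncio.apply()
    return True
```

In a plain Python script no loop is running at import time, so the helper returns `False` and leaves `asyncio` untouched.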

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Amanda and I live in Houston, Texas. My daughter is a high school senior and she has just completed her AP Physics B course. She has been feeling very anxious and restless for the past two weeks and I thought I would give you a short summary of what I learned about her anxiety.
I was fortunate to have Amanda as my daughter’s middle school and high school teacher. I have had the pleasure of hearing her speak in her final year of high school and she is a fantastic speaker.
The experience with Amanda was helpful for me. She has a wonderful voice and can convey a message very clearly. She is very patient and is able to
Prompt: The president of the United States is
Generated text:  trying to decide how many military weapons to have. He has a total of $40,000 to spend. He has to buy weapons from the company, military advisors, and special advisors. Each military advisor costs $30,000, and each special advisor costs $50,000. He also has a limit of 
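
For prompt lists too large to hold all results at once, one option is to submit them in fixed-size chunks and consume results incrementally. The sketch below assumes only the `generate(prompts, sampling_params)` call and the `text` output field shown above. `generate_in_chunks` and `chunk_size` are hypothetical names for this example, and the default of 256 is arbitrary.

```python
from typing import Any, Dict, Iterator, List, Tuple


def generate_in_chunks(
    engine: Any,
    prompts: List[str],
    sampling_params: Dict[str, Any],
    chunk_size: int = 256,
) -> Iterator[Tuple[str, str]]:
    """Yield (prompt, generated_text) pairs, submitting chunk_size prompts per call."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        # One engine call per chunk; results come back in prompt order.
        outputs = engine.generate(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            yield prompt, output["text"]
```

Because the function is a generator, results from early chunks can be written to disk before later chunks are submitted.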

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short, positive description of your personality or skills]. And what's your favorite hobby or activity? I love [insert a short, positive description of your favorite hobby or activity]. And what's your favorite book or movie? I love [insert a short, positive description of your favorite book or movie]. And what's your favorite place to go? I love [insert a short, positive description of your favorite place to go]. And

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as "La Ville de Paris" or "La Grande-Bretagne". It is the largest city in France and the second-largest city in the European Union. Paris is known for its rich history, art, and culture, as well as its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also a major transportation hub, with many major highways and rail lines connecting the city to other parts of France and the world. Paris is a popular tourist destination, with millions of visitors each year. The city is also home to many important institutions and organizations, including the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and decision-making processes. This could lead to more sophisticated and personalized AI systems that can better understand and respond to human emotions and preferences.

2. Greater reliance on data: AI is likely to become more data-driven, with machines being able to learn from large amounts of data to improve their performance. This could lead to more efficient and effective
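
The cell above relies on `stream_and_merge` from `sglang.utils` and describes the result as streaming "with overlap removal". The following sketch shows one way such overlap removal could work. It is an illustration under that assumption, not necessarily the library's implementation, and `trim_overlap` and `merge_chunks` are names chosen for this example.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first so the first match is the largest.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged
```

This suffix-prefix heuristic handles both cumulative chunks and partially overlapping ones, but it cannot tell an accidental overlap from a genuine repetition: merging `"ha"` followed by `"ha"` yields `"ha"`.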



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Character Name], and I'm an [occupation] with [years of experience]. I'm [Number] of years in the industry. I'm passionate about [interest/field of interest] and I'm always ready to learn new things and grow as a professional. I'm [Number] of years in the industry. I'm committed to [commitment to industry], [commitment to professional development], [commitment to customer service]. I'm [Number] of years in the industry. I'm an [Number] of years in the industry. I'm [Number] of years in the industry. I'm an [Number

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

A. True
B. False

A. True

France's capital city, Paris, is a bustling metropolis with a rich history and vibrant culture. It is known for its iconic landmarks such as the Eiffel Tower and Notre-Dame Cathedral, as well as its cultural insti
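
One reason to use the asynchronous API is to keep several independent requests in flight from a single event loop. The sketch below assumes only the awaitable `async_generate(prompts, sampling_params)` call and the `text` output field shown above. `generate_groups_concurrently` is a hypothetical helper name.

```python
import asyncio
from typing import Any, Dict, List


async def generate_groups_concurrently(
    engine: Any,
    prompt_groups: List[List[str]],
    sampling_params: Dict[str, Any],
) -> List[List[str]]:
    """Submit each prompt group as its own request and await them together.

    Returns the generated texts grouped in the same order as prompt_groups.
    """
    requests = [engine.async_generate(group, sampling_params) for group in prompt_groups]
    # asyncio.gather preserves the order of its arguments in the result list.
    results = await asyncio.gather(*requests)
    return [[output["text"] for output in outputs] for outputs in results]
```

Inside an `async def main()` like the one above, this would be called as `await generate_groups_concurrently(llm, [prompts_a, prompts_b], sampling_params)`, where `prompts_a` and `prompts_b` are placeholder lists.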

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert character's name]. I'm a [insert profession or skill level] who is always looking for the next great adventure. I enjoy solving puzzles, playing board games, and learning new languages. I also love cooking and trying new cuisines. And I'm a big fan of [insert a favorite book, movie, or game]. I'm excited to embark on new challenges and make new friends on the journey. [Insert character's name]! 🌟🌟🌟🌟. Looking forward to the adventure! 💖💰💪💪. 💖👀👀👀. 🌟🌟🌟🌟. 🔥💡💥.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the capital city of France, located on the Loire River in the northwestern part of the country. It is the second-largest city in France by population, after Paris, and the largest city by area. The city is known for its rich history, art, culture, and fashion. It is home to many iconic landmarks such as the Eiffel Tower, the Louvre Museum, the Notre-Dame Cathedral, and the Arc de Triomphe. Paris is a major international city that is home to many important businesses, institutions, and attractions. Its history can be traced back to ancient Roman times, and it

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by several key trends that are shaping the technology's development and implications for society:

1. Increased Real-World Applications: The development of AI will continue to advance beyond its current applications in fields like healthcare, finance, and manufacturing. We may see more advanced AI systems that can interact with human users more naturally, improving their understanding and interaction with the world around them.

2. Integration with Human Wisdom: AI will continue to be integrated with human wisdom in areas like language processing, natural language generation, and decision-making. AI will be able to learn from feedback, improve its performance, and adapt to new situations, making it
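
Code that consumes a stream often needs both the incremental chunks and the final text. The sketch below assumes only that the source can be iterated with `async for` and yields text chunks, as `async_stream_and_merge` is used in the cell above. `collect_stream`, `on_chunk`, and the stand-in `fake_stream` are hypothetical names for this example.

```python
import asyncio
from typing import AsyncIterable, AsyncIterator, Callable, List, Optional


async def collect_stream(
    chunks: AsyncIterable[str],
    on_chunk: Optional[Callable[[str], None]] = None,
) -> str:
    """Forward each chunk to on_chunk as it arrives and return the full text."""
    parts: List[str] = []
    async for chunk in chunks:
        if on_chunk is not None:
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)


async def fake_stream() -> AsyncIterator[str]:
    """Stand-in chunk source for trying the helper without an engine."""
    for piece in (" Paris", ".", " It", " is"):
        await asyncio.sleep(0)  # yield control, as a real stream would between chunks
        yield piece
```

With a live engine, the stand-in would be replaced by the stream from the cell above, for example `await collect_stream(async_stream_and_merge(llm, prompt, sampling_params), on_chunk=lambda c: print(c, end="", flush=True))`.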




In [6]:
llm.shutdown()
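
In scripts, it can be convenient to guarantee that `shutdown()` runs even when generation raises an exception. The sketch below wraps any engine-like object in a context manager. `managed_engine` and its `factory` argument are hypothetical, and the only engine method assumed is the `shutdown()` call shown above.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def managed_engine(factory: Callable[[], Any]) -> Iterator[Any]:
    """Create an engine with factory() and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        # Runs on normal exit and when the with-block raises.
        engine.shutdown()
```

With the engine from this document, usage would look like `with managed_engine(lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")) as llm:` followed by the generation calls.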