# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
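
The patch is needed because `asyncio.run` refuses to start while an event loop is already running in the current thread, which is the situation inside IPython and Jupyter. The sketch below reproduces that error using only the standard library, without SGLang; `inner` and `outer` are illustrative names, not part of any API.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # Inside a coroutine an event loop is already running, much like in a
    # Jupyter cell. asyncio rejects a nested asyncio.run() call here.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


nested_error = asyncio.run(outer())
print(nested_error)
```

Calling `nest_asyncio.apply()` beforehand patches asyncio so that such nested calls are permitted.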

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Liselle and I'm a teacher at the Seattle Waldorf School in Bellevue, Washington. I love teaching young children and helping them develop a strong emotional foundation and nurturing a love of learning. I've worked in a variety of educational environments including a private preschool, early childhood education center, and a gifted education school, and I have experience working with children who have special needs. I use a strengths-based approach to teaching and work collaboratively with parents and educators. I have an M.S. in Early Childhood Education from the University of Washington and a B.S. in Early Childhood Education from St. Mary's College of Maryland. My teaching
Prompt: The president of the United States is
Generated text:  a sitting person, and the current president is Barack Obama. Which of the following statements is correct? ( )
A: Barack Obama is the president of the United States
B: Obama is the president of the United States

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [occupation] who has been [number of years] in the industry. I'm passionate about [reason for passion], and I'm always looking for ways to [action or achievement]. I'm a [type of person] who is [character trait or quality] and I'm always [character trait or quality]. I'm [character trait or quality] and I'm always [character trait or quality]. I'm [character trait or quality] and I'm always [character trait or quality]. I'm [character trait or quality] and I'm always [character trait or quality]. I'm [character

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic Eiffel Tower, Notre-Dame Cathedral, and vibrant nightlife. It is also a major center for French culture, politics, and arts. Paris is a city of contrasts, with its rich history and modernity. Its status as the world's most populous city is due to its large population and diverse cultural influences. The city is also known for its annual Eiffel Tower Festival, which attracts millions of visitors each year. Paris is a city of contrasts, with its rich history and modernity. Its status as the world's most populous city is due to its large population and diverse cultural influences. Its

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased accuracy and precision: AI is likely to continue to improve its ability to process and analyze large amounts of data, leading to more accurate predictions and more precise solutions to complex problems.

2. Integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing for more sophisticated and nuanced decision-making.

3. Personalization and customization: AI is likely to become more personalized and customizable, allowing for more efficient and effective use of resources.

4. Ethical and responsible development: As AI becomes more prevalent in various industries, there will be a growing
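
The "overlap removal" mentioned in the cell above can be pictured with a small standalone sketch. It assumes that a streamed chunk may repeat text that was already received, and it drops the repeated prefix before appending. The function `merge_without_overlap` and the sample chunks are hypothetical illustrations; this is not the implementation of `stream_and_merge`.

```python
def merge_without_overlap(existing: str, new_chunk: str) -> str:
    """Append new_chunk to existing, dropping the longest prefix of
    new_chunk that already appears at the end of existing."""
    max_len = min(len(existing), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return existing + new_chunk[size:]
    return existing + new_chunk


# Hypothetical chunks, some of which repeat earlier text.
chunks = [" Paris", " Paris, known", " known for its", " history"]
merged = ""
for chunk in chunks:
    merged = merge_without_overlap(merged, chunk)
print(merged)
```

A suffix/prefix match like this is a heuristic: it would also trim text that the model legitimately repeats at a chunk boundary.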



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [First name] and I'm [Last name]. I'm a [occupation], [background] with over [number of years] of experience in [specific field]. I enjoy [professionally enjoyable activity or hobby]. I'm an [occupation] who is always [personality trait]. I believe in [core belief or value]. I'm passionate about [something], and I'm committed to [something else]. I'm excited to bring [something to the table] to [specific context]. I'm confident in [strength or area of expertise]. I'm always eager to learn and grow, and I'm always willing to share what I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the third largest city in the European Union. 

This statement is factual and concise, providing the essential details about Paris' location and significance in French history, culture, and politics. It avoids making any a
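
Because the call is awaited, other coroutines can make progress while generation is pending. The sketch below illustrates that general asyncio pattern with `asyncio.gather`. `fake_async_generate` is a hypothetical stand-in that only echoes the prompt, so the snippet runs without SGLang or a GPU; whether issuing several `llm.async_generate` calls this way is beneficial is an assumption, not something this document states.

```python
import asyncio


async def fake_async_generate(prompt: str) -> dict:
    # Hypothetical stand-in for an awaitable generation call; it only
    # echoes the prompt after a short pause and does not use SGLang.
    await asyncio.sleep(0.01)
    return {"text": f" <completion for: {prompt}>"}


async def run_concurrently(prompts):
    # One coroutine per prompt; gather returns results in input order.
    pending = [fake_async_generate(p) for p in prompts]
    return await asyncio.gather(*pending)


demo_prompts = ["The capital of France is", "The future of AI is"]
results = asyncio.run(run_concurrently(demo_prompts))
for prompt, result in zip(demo_prompts, results):
    print(f"Prompt: {prompt}\nGenerated text: {result['text']}")
```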

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [Career] in [Field] who has been a [Role] for [Number] years. I have always been passionate about [What motivates me] and have been dedicated to [What I have achieved] in this field. I am a [What is my job title] who always strive to [What I try to achieve in my work]. I am passionate about [What I enjoy about my work] and always strive to [What I try to improve on]. I am always ready to learn and adapt to new challenges. Thank you! [Name]. [Name] [Name]. [Name



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its rich history and stunning architecture.

Paris is the capital city of France, known for its rich history and stunning architecture. The city is famous for its iconic landmarks such as Notre-Dame Cathedral, Eiffel Tower, Louvre Museum, and Montmartre, and is a major transportation hub with the headquarters of major companies and the French Parliament. Other notable landmarks include the Louvre Museum, Musée d'Orsay, and the Parc des Jeunes. The city is also home to the Eiffel Tower, which is a UNESCO World Heritage site. Paris is a vibrant and culturally rich city with a



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  promising and potentially transformative, with potential applications in virtually every sector and enabling technological advancements across fields like healthcare, transportation, finance, and more. Here are some possible future trends that may influence AI further:

1. Increased focus on ethical AI: With the increasing awareness of the impact of AI on society, there is a growing emphasis on ethical and responsible development of AI systems. Governments and organizations are investing more resources to develop AI that is transparent, accountable, and respectful of human rights and privacy.

2. AI's role in personalized medicine: AI is increasingly being used to analyze large amounts of medical data to identify patterns and predict patient outcomes. This



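
The streaming cell above consumes an asynchronous generator with `async for`, printing each piece as it arrives. The sketch below shows that consumption pattern in isolation and also accumulates the pieces into a full string. `fake_token_stream` is a hypothetical stand-in that splits a fixed sentence on spaces; it does not use SGLang and does not reflect how real tokens are produced.

```python
import asyncio


async def fake_token_stream(text: str):
    # Hypothetical stand-in for a streaming generation call: yields one
    # space-delimited piece at a time and does not use SGLang.
    for piece in text.split(" "):
        await asyncio.sleep(0)  # give other tasks a chance to run
        yield piece + " "


async def collect(text: str) -> str:
    pieces = []
    async for piece in fake_token_stream(text):
        print(piece, end="", flush=True)  # show output as it arrives
        pieces.append(piece)
    print()  # new line once the stream is exhausted
    return "".join(pieces).rstrip()


full_text = asyncio.run(collect("Paris is the capital of France"))
```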

In [6]:
llm.shutdown()
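
In a standalone script, one way to make sure the engine is shut down even when generation raises is a `try`/`finally` block. The sketch below demonstrates the pattern with `StubEngine`, a hypothetical stand-in that exposes only `generate()` and `shutdown()`, so it runs without SGLang; with a real engine the same structure would wrap the `llm.generate(...)` calls.

```python
class StubEngine:
    """Hypothetical stand-in exposing only generate() and shutdown()."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self):
        self.closed = True


engine = StubEngine()
try:
    outputs = engine.generate(["Hello, my name is"])
finally:
    # Runs whether generate() succeeded or raised.
    engine.shutdown()

print(engine.closed)
```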