# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
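
To illustrate the idea of a custom server without reproducing the linked script, here is a minimal, framework-agnostic sketch of a request handler that wraps an engine. It assumes only what this document shows: an object with `generate(prompts, sampling_params)` that returns a list of dicts with a `"text"` key. The `EchoEngine` class, the `handle_generate` function, and the JSON field names (`prompts`, `sampling_params`, `texts`) are hypothetical and exist only so the sketch runs without a GPU.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs anywhere."""

    def generate(self, prompts, sampling_params):
        # Mimic the output shape used in this document: one dict per prompt.
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


def handle_generate(engine, body: bytes) -> bytes:
    """Decode a JSON request, call the engine, and encode a JSON response."""
    request = json.loads(body)
    prompts = request["prompts"]
    sampling_params = request.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    return json.dumps({"texts": [o["text"] for o in outputs]}).encode()


response = handle_generate(EchoEngine(), b'{"prompts": ["hi"]}')
print(response.decode())
```

In a real server, a web framework would call a handler like this with the request body and an actual `sgl.Engine` instance in place of the stub.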



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
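
If the same script may run both in a plain Python process and inside a notebook, one option is to apply the patch only when an event loop is already running. The following is a sketch under that assumption, not part of SGLang; the helper names `loop_is_running` and `apply_nest_asyncio_if_needed` are hypothetical.

```python
import asyncio


def loop_is_running() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


def apply_nest_asyncio_if_needed() -> bool:
    """Patch asyncio only when a loop is already running; report whether we patched."""
    if not loop_is_running():
        return False
    import nest_asyncio  # third-party package, imported only when needed

    nest_asyncio.apply()
    return True


print("patched:", apply_nest_asyncio_if_needed())
```

In a plain script no loop is running at import time, so the function returns `False` and leaves asyncio untouched.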

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Marco, I'm an ex-ASV student and I'm now an AI assistant. 

I am here to help you with any information or task that you would like me to assist you with. Please feel free to ask me anything and I will do my best to assist you.

What is the weather forecast for today? 

And I would like to know the weather in your city. Can you please share it with me?

Certainly! Could you please tell me the name of your city? That way I can provide you with the weather forecast for it. 

I hope you have a pleasant day! 😊

I'm sorry,
Prompt: The president of the United States is
Generated text:  married to a social worker and they have a daughter. She's 53 and she has children of her own. What's this saying something about?

This statement is saying that the president and social worker are having a child together and the woman is married to her social worker's husband. The fact that the woman is 53 and the social worker is her spouse, along with the fact th
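
For offline batch jobs it is often useful to persist results rather than print them. The sketch below pairs each prompt with its output and serializes the pairs as JSON Lines. It assumes only the output shape shown above (a list of dicts with a `"text"` key); the helper name `to_jsonl` and the sample data are hypothetical.

```python
import json


def to_jsonl(prompts, outputs) -> str:
    """Serialize prompt/completion pairs as one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": p, "text": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Hand-written sample in the same shape as the result of llm.generate(...).
sample_prompts = ["The capital of France is"]
sample_outputs = [{"text": " Paris."}]
print(to_jsonl(sample_prompts, sample_outputs))
```

With a real engine, you would pass `prompts` and the list returned by `llm.generate(prompts, sampling_params)` instead of the sample data.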

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history dating back to the Roman Empire and the Middle Ages. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also famous for its fashion industry, with Paris Fashion Week being one of the largest in the world. Paris is a cultural and economic center of France and a major tourist destination. It is home to many famous museums, theaters, and restaurants. The city is also known for its cuisine, with its famous dishes such as croissants, escarg

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased automation and robotics: AI is likely to become more prevalent in manufacturing, transportation, and other industries, leading to increased automation and robotics. This could lead to job losses in some sectors, but also create new opportunities for workers in areas such as data analysis and software development.

2. AI-powered healthcare: AI is already being used in healthcare to diagnose and treat diseases, and it has the potential to revolutionize the field. AI-powered healthcare could lead to more accurate diagnoses
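
The cell above describes its output as generated "with overlap removal". As a rough illustration of what overlap-aware merging can mean, and not necessarily how `stream_and_merge` is implemented, the sketch below drops the part of each new chunk that repeats the end of the text collected so far. The names `strip_overlap` and `merge_chunks` are hypothetical.

```python
from typing import Iterable


def strip_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that already ends `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks while skipping repeated overlap."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


merged = merge_chunks(["Hello", "lo wor", "world"])
print(merged)  # Hello world
```

A caveat of this simple approach is that text which legitimately repeats across a chunk boundary would also be trimmed, so it only fits streams whose chunks are known to overlap.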



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm currently a [Your Profession] in [Your Profession's Country]. I'm a passionate advocate for [Your Profession's Cause or Mission], and I'm eager to contribute to the world in any way I can. Please tell me something interesting about yourself.

You've probably already identified yourself as the most suitable answer to this question, but I will ask you anyway: What is your favorite movie, book, or musician? And why? I am here to learn about you, so please tell me about yourself. Let me know if you would like me to generate a list of interesting facts about you, or if

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, and it is the largest city in both France and Europe. Its population is around 2.7 million, and it is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Ca
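
A common reason to use the asynchronous API is to keep several requests in flight at once. The sketch below issues several batch calls concurrently with `asyncio.gather`. It assumes an engine-like object whose `async_generate(prompts, sampling_params)` can be awaited concurrently and returns a list of dicts with a `"text"` key; `StubAsyncEngine` and `generate_groups` are hypothetical names used so the example runs without a model.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in that mimics the awaitable call used above."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real engine call would
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def generate_groups(engine, groups, sampling_params):
    """Run one async_generate call per prompt group concurrently."""
    tasks = [engine.async_generate(group, sampling_params) for group in groups]
    # gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*tasks)


groups = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(
    generate_groups(StubAsyncEngine(), groups, {"temperature": 0.8, "top_p": 0.95})
)
for group, outputs in zip(groups, results):
    for prompt, output in zip(group, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```

Inside IPython or Jupyter, `asyncio.run` needs the `nest_asyncio` patch described earlier in this document.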

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a [Occupation] who loves [Your Character's Hobby/Interests/Opportunities].

Thank you for taking the time to meet me! Let me know if you'd like any information on my character or what I do. I'm looking forward to hearing from you! (I'll write out any questions you might have)

It's nice to meet someone like you. Thanks again for taking the time to talk to me! Let's keep in touch! (I'll write out my own short reply)

[Name] 😊

---

**Note:** I've used [Occupation]



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light. It is a bustling metropolis with a rich cultural history and a renowned art and music scene. Paris is home to many iconic landmarks such as the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and the Seine River. It is also a popular tourist destination, with millions of visitors annually. Paris is known for its romantic atmosphere and its elegant, ancient architecture, which is a reflection of the country's rich history and culture. Its economy is strong, with significant investments in the creative industries and the services sector. Paris is the second-largest city in the European Union



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and diverse, and there are many potential trends that could shape the technology and its applications. Some potential trends that are emerging are:

1. Increased use of AI in healthcare: AI is already being used to improve the accuracy and efficiency of medical diagnoses and treatments. As the use of AI continues to grow, it may become increasingly important in healthcare to harness its power to improve patient care.

2. AI in the manufacturing industry: The manufacturing industry is often criticized for its high levels of environmental and labor costs. However, there is growing interest in using AI to optimize production processes and reduce waste. AI can help manufacturers identify and reduce inefficiencies
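
The streaming cell prints each chunk as it arrives. If you also need the full text afterwards, you can accumulate the chunks while iterating. The sketch below shows this pattern with a hypothetical `fake_stream` async generator standing in for a chunk stream such as the one returned by `async_stream_and_merge`; `collect` is likewise a hypothetical helper.

```python
import asyncio
from typing import AsyncIterator, Iterable


async def fake_stream(chunks: Iterable[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for an engine stream: yields chunks one by one."""
    for chunk in chunks:
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield chunk


async def collect(stream: AsyncIterator[str]) -> str:
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()
    return "".join(parts)


text = asyncio.run(collect(fake_stream([" Paris", ",", " also", " known"])))
```

Joining the chunks at the end assumes they do not overlap, which is what the "no repeats" label in the cell above suggests for this stream.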




In [6]:
llm.shutdown()
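
If generation raises an exception, the `llm.shutdown()` call above is never reached. One way to guard against that is to wrap the engine in a context manager that always shuts it down. This is a sketch that assumes only the `shutdown()` method shown above; `StubEngine` and `managed` are hypothetical names used so the example runs without a model.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing the same shutdown() method as the engine."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def managed(engine):
    """Yield the engine and call shutdown() even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed(engine) as llm_like:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print("engine closed:", engine.closed)  # engine closed: True
```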