# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
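
The underlying issue is that `asyncio.run()` refuses to start when an event loop is already running, which is the situation inside IPython and Jupyter. The following standard-library sketch reproduces that error without SGLang; `inner` and `outer` are illustrative names, not part of any library. `nest_asyncio.apply()` patches `asyncio` so that such nested calls are allowed.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```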

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Kim and I'm a college student. I was born in 1995 and have been married since 2014. I'm a full-time student at Western Michigan University. I have no children and am trying to have one. I use my parents' joint-tenancy property as the only asset for my children. My goal is to have a child. I haven't tried to get pregnant and have been trying to for the past 18 years. It's been like 10 years since we started trying. The last time I tried, my husband and I didn't know if we were pregnant, but
Prompt: The president of the United States is
Generated text:  a person who leads the country. The President of the United States is the head of government in the United States. The President is elected by the people to a term of four years. The office of the President of the United States is known as the President of the United States. This term is a very short term, only one year. The term of the President of the United States can change from year to year.
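
The `temperature` and `top_p` entries in `sampling_params` control how random the sampled text is. As a rough illustration only (this is a toy sketch in plain Python, not SGLang's sampler, and the logits are made up), temperature rescales the scores before the softmax, and top-p keeps the smallest set of most likely tokens whose cumulative probability reaches the threshold:

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for idx in order:
        kept.append(idx)
        cumulative += probs[idx]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the surviving candidates sum to 1.
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for four tokens
warm = top_p_filter(apply_temperature(toy_logits, 0.8), 0.95)
cool = top_p_filter(apply_temperature(toy_logits, 0.2), 0.9)
print(sorted(warm), sorted(cool))
```

With the higher temperature and threshold, more candidate tokens survive the filter; with the lower settings the distribution collapses toward the single most likely token, which matches the more repetitive, deterministic outputs seen at low temperature.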

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Age] year old [Gender] [Occupation]. I'm a [Skill or Trait] who has always been [Positive Trait] in my [Field of Interest]. I'm passionate about [What I Love to Do], and I'm always looking for ways to [What I Want to Improve]. I'm [What I Want to Do Next], and I'm excited to [What I Want to Achieve]. I'm a [What I Want to Do Next], and I'm excited to [What I Want to Achieve]. I'm a [What I Want to Do Next], and I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as "La Ville-Marie" or "La Ville de Paris". It is the largest city in France and the third largest in the world, with a population of over 2. 8 million people. Paris is known for its rich history, art, and culture, and is a major tourist destination. It is also home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. Paris is a cultural and economic hub of France and plays a significant role in the country's political and social life. It is also a major hub for international trade and diplomacy

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we interact with technology and the world around us. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased automation and robotics: As AI technology continues to advance, we are likely to see an increase in automation and robotics in various industries. This could lead to the creation of more efficient and productive machines that can perform tasks that were previously done by humans.

2. AI-powered healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI technology continues to advance, we are likely to
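
The "overlap removal" mentioned in the banner refers to merging streamed chunks so that text repeated at a chunk boundary is not emitted twice. The sketch below shows one way such trimming could work. It is an assumption for illustration, not the code in `sglang.utils`; `trim_overlap` and `merge_chunks` are hypothetical names and the sample chunks are invented.

```python
def trim_overlap(existing_text, new_chunk):
    """Return the part of new_chunk that does not repeat the end of existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, dropping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks; real chunks would come from the engine's streaming iterator.
sample_chunks = ["The capital", "capital of", " of France"]
print(merge_chunks(sample_chunks))  # The capital of France
```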



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name], and I'm a [job title or occupation] who specializes in [specialization or area of expertise]. I enjoy [reason for interest in] and am always looking for new challenges and opportunities to learn and grow. 

Feel free to add any personal anecdotes or experiences that might add depth to your character and help readers better understand who you are as a person and a professional. However, keep in mind that this is a neutral self-introduction, so avoid using any personal information or any potentially sensitive topics. Your goal should be to create a clear and concise profile that showcases your professional skills and interests in a positive and welcoming manner

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

(Note: The statement is taken from the official Wikipedia page on Paris.)
The statement
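
The benefit of the asynchronous call is that other coroutines can make progress while generation is awaited. The sketch below illustrates this with a stand-in class so that it runs without a GPU. `FakeEngine` and `heartbeat` are hypothetical; only the `async_generate(prompts, sampling_params)` call shape and the `output["text"]` field are taken from the example above.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for an engine with an async batch call."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks):
    """Other work that keeps running while generation is awaited."""
    for _ in range(3):
        ticks.append("tick")
        await asyncio.sleep(0.001)


async def run_batch():
    engine = FakeEngine()
    ticks = []
    # Both coroutines share the event loop; neither blocks the other.
    outputs, _ = await asyncio.gather(
        engine.async_generate(["A", "B"], {"temperature": 0.8}),
        heartbeat(ticks),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(run_batch())
print(len(outputs), len(ticks))
```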

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  __________ and I'm a/an ________. I'm a/an ________.
My name is ________ and I'm a/an ________. I'm a/an ________. What are your hobbies and interests? What are your strengths and weaknesses?
Thank you for asking! I'm excited to meet you and learn more about you.
I'm a/an ________. I'm a/an ________. What are your hobbies and interests? What are your strengths and weaknesses?
Hello! My name is ______ and I'm a/an ________. I'm a/an _________. What are your hobbies and interests? What are your strengths and weaknesses?
I'm a



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light.
Paris is the largest city in France and the country's capital. It is a large, wealthy, and cosmopolitan city, famous for its museums, iconic landmarks, and French cuisine. The city has a rich history and culture, including its historic center and medieval walled city walls. Paris is also one of the world's most important financial and media centers. It is home to many world-renowned institutions, including the Louvre, Notre-Dame Cathedral, and the Eiffel Tower. The city has a vibrant nightlife and is a popular destination for tourists, filmmakers, and fashionistas



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by a combination of technological advancements, regulatory changes, and human preferences. Here are some possible trends that could emerge in the near future:

1. Increased integration with other technologies: AI is already becoming more integrated with other technologies like the Internet of Things (IoT) and the blockchain, and there's potential for this to continue expanding. This could lead to AI-powered smart cities, self-driving cars, and other applications that interact with the physical world in novel ways.

2. Greater transparency and accountability: As AI systems become more complex and rely on larger datasets, there will be an increased need for transparency and accountability. This
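
In the streaming case, each chunk is handled as soon as it is produced, which is why the text arrives in small pieces rather than as one string. The sketch below shows the same `async for` consumption pattern against a stand-in generator. `fake_stream` and `collect` are hypothetical names used only for illustration and do not reflect how `async_stream_and_merge` is implemented.

```python
import asyncio


async def fake_stream(text):
    """Hypothetical async generator that yields one word-sized chunk at a time."""
    for word in text.split(" "):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word + " "


async def collect(text):
    pieces = []
    async for chunk in fake_stream(text):
        # A real client would print or forward each chunk here.
        pieces.append(chunk)
    return "".join(pieces).rstrip()


result = asyncio.run(collect("Paris is the capital of France"))
print(result)  # Paris is the capital of France
```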




```python
llm.shutdown()
```
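
Because `shutdown()` is called explicitly as the last step, an exception raised earlier in a script would skip it. One possible way to guard against that, sketched here with a stand-in class, is to wrap the engine in a context manager. `FakeEngine` and `managed_engine` are hypothetical helpers, not part of SGLang; only the `shutdown()` call mirrors the cell above.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in; only the shutdown() call mirrors the text above."""

    def __init__(self):
        self.closed = False

    def generate(self, prompt):
        if not prompt:
            raise ValueError("empty prompt")
        return {"text": prompt.upper()}

    def shutdown(self):
        self.closed = True


@contextmanager
def managed_engine(factory):
    """Create an engine and always call shutdown(), even if the body raises."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


try:
    with managed_engine(FakeEngine) as engine:
        handle = engine
        engine.generate("")  # raises, but shutdown still runs
except ValueError:
    pass

print(handle.closed)
```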