# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other environment that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
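
The patch is needed because environments such as IPython already run an event loop, and plain `asyncio` refuses to start a second one inside it. The sketch below reproduces that failure with the standard library only; the function names `inner` and `outer` are illustrative and not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # A loop is already running here, as it is inside an IPython kernel.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second loop in the same thread
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

With `nest_asyncio.apply()` in effect, the nested call is allowed to run instead of raising.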

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  [My name], and I am a freshman at [University of] in [City]. I'm interested in learning about your teaching philosophy and education, and I am seeking your expertise on developing a mentorship program for high school students that caters to their diverse needs and learning styles. Can you share any insights or strategies you have for creating an effective mentorship program? Additionally, could you provide some examples of successful mentorship programs and what makes them effective?
Certainly, I would be happy to share some insights and strategies for developing a mentorship program. One of the key aspects of a successful mentorship program is to ensure that the mentor
Prompt: The president of the United States is
Generated text:  a member of the ________.
A. upper house of the House of Representatives
B. lower house of the House of Representatives
C. Supreme Court
D. Congress
Answer:
A

The essential attribute of a commodity is ____.
A. Use 
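
For offline batch jobs it is often convenient to persist the results. The following is a minimal sketch, assuming each output is a dict with a `text` key as in the loop above. The `demo_outputs` list is hand-written stand-in data, and `to_jsonl` is a hypothetical helper that is not part of SGLang.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/output pairs as one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": p, "text": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Stand-in data shaped like the engine output used above (assumption).
demo_prompts = ["The capital of France is", "The future of AI is"]
demo_outputs = [{"text": " Paris."}, {"text": " bright."}]

jsonl = to_jsonl(demo_prompts, demo_outputs)
print(jsonl)
```

With a real engine, `demo_outputs` would be replaced by the list returned from `llm.generate(prompts, sampling_params)`.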

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Job Title] at [Company Name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [Job Title] at [Company Name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [Job Title] at [Company Name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [Job Title] at [Company Name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. Paris is known for its rich history, art, and cuisine, and is a popular tourist destination. It is also home to the French Parliament and the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing for more sophisticated and nuanced decision-making. This could lead to more personalized and context-aware AI systems that can better understand and respond to human emotions and behaviors.

2. Enhanced machine learning capabilities: AI systems are likely to become even more powerful and capable, with the ability to learn and adapt to new data and situations. This could lead to more efficient and effective AI systems that can handle a wider range of tasks and applications.

3. Increased focus on ethical and social implications:



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [age] year old [gender] [race]. I'm an [occupation] and have always loved [job title]. I've always been [goals]. I'm constantly learning and improving, so I'm always trying to [what I hope to do]. I'm passionate about [interests or activities]. I'm a [personality type]. How would you describe me? [Describe your personality traits and characteristics in one sentence or two. Why do you think they describe you best? How do you think they describe you?]
[Name]
[Name]
Name:
[Name]
[Name]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the largest city in France and serves as its cultural, political, and economic center. It is also the oldest city in the world, having existed since the 8th century. Paris is a major tourist destination and is home to many iconic landmarks such as th
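
Because `async_generate` is awaited, it can share an event loop with other coroutines. The sketch below shows the general `asyncio.gather` pattern for waiting on several requests together. It uses a stand-in `FakeEngine` (hypothetical, so the example runs without a GPU) that only mimics the call shape used above; how the real engine schedules concurrent calls is not covered here.

```python
import asyncio


class FakeEngine:
    """Stand-in for the engine; only mimics the call shape used above."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do work
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Create one request per batch and wait for all of them together.
    tasks = [engine.async_generate(b, sampling_params) for b in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(run_batches(FakeEngine(), batches, {"temperature": 0.8}))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(prompt, "->", output["text"])
```

`asyncio.gather` returns results in the order the awaitables were passed, so each result list lines up with its batch.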

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a/an [Occupation]. I currently live in [Location] with [Person]. I've always been fascinated by [Questionable Area]. I've always dreamed of [Aesthetic}. I enjoy [Favorite Activity]. I'm a/an [Function]. 

Please add a little more information about [Occupation] that would make the introduction even more interesting. For example, if you're interested in hobbies, tell me about a hobby you're passionate about, or if you have any particular interests in particular professions. That way, the reader will understand that I have a personal interest in the character and not just a


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris, the historic and world-renowned capital of France, is known for its stunning architecture, rich history, and vibrant culture. The city is home to many famous landmarks, including the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral, and hosts numerous cultural events and festivals throughout the year. Paris is a popular destination for tourists, business travelers, and locals alike, making it a key center of French culture and politics. Its vibrant nightlife, elegant dining scene, and unique food offerings have made it a major tourist destination in the world. Paris is a city of contrasts, from its bustling streets and historic architecture


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid growth, development, and integration of new technologies and applications. Some potential future trends in AI include:

1. Increased integration with human-computer interaction: As AI becomes more advanced and able to perform complex tasks, it will be able to interact with humans in a more natural way. This could mean more seamless and intuitive interaction between AI and humans, leading to improved efficiency and productivity.

2. Increased use of AI in healthcare: AI can be used to analyze medical data, diagnose diseases, and develop new treatments. This could lead to more accurate and effective medical treatments, potentially reducing the need for human doctors and improving




In [6]:
llm.shutdown()
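
In a standalone script, one may want `shutdown()` to run even if generation raises. A minimal sketch of that pattern follows, using a context manager around a stand-in engine. `FakeEngine` and `engine_session` are hypothetical names introduced for illustration; the only call taken from this document is `shutdown()`.

```python
from contextlib import contextmanager


class FakeEngine:
    """Stand-in engine whose generate() always fails, to exercise cleanup."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        raise RuntimeError("simulated failure")

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    # Yield the engine and always call shutdown() on exit, even on error.
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with engine_session(engine) as llm_demo:
        llm_demo.generate(["hi"], {})
except RuntimeError as exc:
    print("generation failed:", exc)
print("shut down:", engine.is_shut_down)
```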