# Offline Engine API

SGLang provides a direct inference engine that does not require an HTTP server. This is useful when an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
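
For context, `asyncio` refuses to start an event loop from inside one that is already running, which is the situation in IPython and Jupyter. The following standard-library sketch reproduces that restriction without SGLang; `inner` and `outer` are hypothetical names used only for illustration.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    coro = inner()
    try:
        # A loop is already running here, so a nested asyncio.run() is rejected.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` patches `asyncio` so that such nested calls are allowed.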

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Jane. I'm a 15 year old girl and I'm studying at a high school in Ningbo. I have a lot of problems at school because I think my teacher is not very good. One day my teacher asks me to explain something to the class. I don't understand. I don't know how to explain it. I am very sad and I hope that I can find someone to help me. One day I find someone I know, and we talk together. She is very nice and she can help me a lot. I want to thank her, but she says she can't help me. I don't understand
Prompt: The president of the United States is
Generated text:  very interested in the number of students enrolled in different schools. The president is also interested in how many of those students attend schools with more than 300 students. He wants to know how many students attend schools with more than 300 students. 

Given that there are 500 students in the first grade, 1000 students in the second grade, 1500 students in the third grade, and 2000 stu
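
The `temperature` and `top_p` values in `sampling_params` shape how each next token is drawn. As a rough illustration only, the sketch below applies textbook temperature scaling and nucleus (top-p) filtering to a made-up set of logits. `nucleus_filter` and `toy_logits` are hypothetical names, and this is not SGLang's sampling code.

```python
import math


def nucleus_filter(logits: dict, temperature: float, top_p: float) -> dict:
    """Return the renormalized distribution over the smallest set of tokens
    whose cumulative probability reaches top_p, after temperature scaling."""
    # Temperature scaling: lower values sharpen the distribution.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    # Numerically stable softmax.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: value / total for tok, value in exps.items()}
    # Keep the most likely tokens until their cumulative mass reaches top_p.
    kept = {}
    cumulative = 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= top_p:
            break
    kept_total = sum(kept.values())
    return {tok: prob / kept_total for tok, prob in kept.items()}


# Made-up logits for illustration only.
toy_logits = {"Paris": 4.0, "Lyon": 2.0, "Nice": 1.0, "Rome": 0.0}
sharp = nucleus_filter(toy_logits, temperature=0.2, top_p=0.9)
flat = nucleus_filter(toy_logits, temperature=0.8, top_p=0.95)
print(sharp)
print(flat)
```

With these toy numbers, the lower temperature leaves a single candidate token, while the higher temperature and larger `top_p` keep two.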

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [Age] year old [Occupation]. I am a [Type of Character] who is [Describe your character's personality traits]. I am [Describe your character's hobbies and interests]. I am [Describe your character's strengths and weaknesses]. I am [Describe your character's goals and aspirations]. I am [Describe your character's personality type]. I am [Describe your character's overall personality]. I am [Describe your character's overall personality type]. I am [Describe your character's overall personality type]. I am [Describe your character's overall personality type]. I am [Describe your character's overall

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French National Museum, and the French Parliament building. Paris is a bustling metropolis with a rich cultural heritage and is a popular tourist destination. Its history dates back to the Roman Empire and is known for its rich history, art, and architecture. It is a city that has been a center of politics, culture, and commerce for centuries. Paris is a city that has played a significant role in shaping French identity and is a major economic and cultural hub

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased integration with human intelligence: As AI becomes more advanced, it is likely to become more integrated with human intelligence, allowing it to learn and adapt to new situations and tasks. This could lead to more efficient and effective use of AI in various fields, such as healthcare, transportation, and manufacturing.

2. Greater emphasis on ethical considerations: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical considerations. This could lead to more rigorous testing and
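
The "overlap removal" mentioned in the cell above can be pictured as follows: when a streamed chunk begins with text that has already been emitted, only the new suffix is appended. This is a minimal sketch of that general idea, not necessarily how `stream_and_merge` is implemented. `strip_repeated_prefix`, `merge_stream`, and the sample chunks are hypothetical.

```python
def strip_repeated_prefix(existing: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the tail of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks) -> str:
    """Concatenate chunks while dropping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_repeated_prefix(merged, chunk)
    return merged


# Hypothetical chunks whose boundaries overlap.
chunks = [" Paris is", " is the capital", " capital of France."]
merged = merge_stream(chunks)
print(merged)
```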



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name] and I am a [Your Profession/Role]. I am passionate about [Your Passion/Interest/Technology/Commitment]. I have [Your Achievements/Hobbies/Resolutions/Job Goals]. I am committed to [Your Goals/Goals/Commitment/Interests/Passion]. I believe that [Your Values/Character/Philosophy]. How can I be helpful to you? [Your Name] is a [Your Profession/Role] who is passionate about [Your Passion/Interest/Technology/Commitment]. He has [Achievements/Hobbies/Resolutions/Job Goals] and he is

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the largest city in France and the second-largest city in Europe. The city is known for its rich history, beautiful architecture, and diverse cultural scene. It is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Muse
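
One reason to prefer the asynchronous interface is that an awaitable call can share an event loop with other coroutines. The sketch below shows the general `asyncio.gather` pattern, which returns results in input order even when tasks finish out of order. `fake_generate` is a hypothetical stand-in for an awaitable such as `llm.async_generate`, so the example runs without a model or GPU.

```python
import asyncio


async def fake_generate(prompt: str, delay: float) -> dict:
    """Hypothetical stand-in for an awaitable generation call."""
    await asyncio.sleep(delay)
    return {"text": f" <completion for: {prompt}>"}


async def run_batch(prompts):
    # Earlier prompts get longer delays, so they finish last.
    tasks = [
        fake_generate(prompt, delay=0.01 * (len(prompts) - index))
        for index, prompt in enumerate(prompts)
    ]
    # gather() preserves the order of its arguments in the result list.
    return await asyncio.gather(*tasks)


demo_prompts = ["Hello, my name is", "The capital of France is"]
results = asyncio.run(run_batch(demo_prompts))
for prompt, result in zip(demo_prompts, results):
    print(f"Prompt: {prompt}\nGenerated text: {result['text']}")
```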

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am [Name's occupation]! I have been working at [Company Name] for [number] years now, and I am constantly striving to improve myself and learn new things. I am a strong and dedicated individual who is always looking for ways to enhance my skills and make a positive impact on the world. If you have any questions or need assistance, feel free to reach out! [Name]
(Please include your full name and occupation)

Hello, my name is [Name] and I am [Name's occupation]! I have been working at [Company Name] for [number] years now,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city of light and art. It is located on the Seine river and is the most populous city in the European Union, with an estimated population of over 2.7 million people. Paris is also known as the “City of a Million”, which refers to its size and population. The city is famous for its historical landmarks such as the Eiffel Tower, Notre-Dame Cathedral, Louvre Museum, and Notre Dame Cathedral. It is also known for its iconic landmarks such as the Arc de Triomphe, Eiffel Tower, and the Louvre Museum, and for its annual Eiffel Tower Tour

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by continuous innovation, greater accessibility, and more complex applications. Here are some possible trends:

1. Improved accuracy: As AI continues to improve its ability to learn from data and generalize its decisions, it is expected to become more accurate and reliable.

2. Increased efficiency: With the ability to perform tasks more quickly and efficiently, AI is expected to be more efficient and cost-effective in many applications.

3. Personalization: AI will continue to be used to personalize experiences, such as recommending products, services, or entertainment based on an individual's preferences and past behavior.

4. Increased autonomy: AI is expected to become more

In [6]:
llm.shutdown()
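
In a standalone script, an exception raised before `llm.shutdown()` would skip the call. One possible way to guard against that is a small context manager that always calls `shutdown()` on exit. This is a sketch under stated assumptions: `StubEngine` and `managed` are hypothetical, and the stub exposes only a `shutdown()` method so the example runs without SGLang.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in that exposes only a shutdown() method."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed(engine):
    """Yield the engine and call shutdown() even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed(engine) as active_engine:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.is_shut_down)
```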