# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
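
As a rough illustration of the custom-server use case, the sketch below wraps an engine in a framework-agnostic request handler. `StubEngine` and `handle_generate` are hypothetical names introduced here so the example runs without a GPU. The only assumption about the real engine is the `generate(prompts, sampling_params)` call returning one dict with a `"text"` key per prompt, as used later in this document. The linked `custom_server.py` may be organized quite differently.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so this sketch runs without a GPU."""

    def generate(self, prompts, sampling_params):
        # Mirror the shape used in this document: one dict with "text" per prompt.
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


def handle_generate(engine, request_body: str) -> str:
    """Turn a JSON request body into a JSON response using the engine."""
    try:
        payload = json.loads(request_body)
    except json.JSONDecodeError:
        return json.dumps({"error": "request body must be valid JSON"})

    prompt = payload.get("prompt") if isinstance(payload, dict) else None
    if not isinstance(prompt, str) or not prompt:
        return json.dumps({"error": "'prompt' must be a non-empty string"})

    # Fall back to the sampling parameters used elsewhere in this document.
    sampling_params = payload.get(
        "sampling_params", {"temperature": 0.8, "top_p": 0.95}
    )
    outputs = engine.generate([prompt], sampling_params)
    return json.dumps({"text": outputs[0]["text"]})


engine = StubEngine()  # in real use: sgl.Engine(model_path=...)
print(handle_generate(engine, '{"prompt": "The capital of France is"}'))
```

Keeping the handler independent of any web framework means the same function can be called from whichever HTTP library the custom server uses.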



## Nest Asyncio
To use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following first:
```python
import nest_asyncio

nest_asyncio.apply()

```
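
The reason this is needed: IPython already runs an event loop, and the standard library's `asyncio.run()` refuses to start inside a running loop. The sketch below reproduces that failure using only the standard library (no SGLang involved). `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested run: rejected by stock asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"nested asyncio.run() failed: {type(exc).__name__}"
    return "nested asyncio.run() succeeded"


result = asyncio.run(outer())
print(result)
```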

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Guo Jiu. I am a senior in the first year of high school. I am not only a student, but also a student of Chinese. I have three best friends - Wang Ming, Liu Hua, and Sun Tao. I am very clever and have a lot of questions in English. I love taking photos, watching cartoons, and playing video games. I am studying in a middle school. I study very hard and I get good grades in all subjects. I like to play basketball, and I also like to ride my bicycle. I like to eat Chinese food. I really like to have fun. I am very happy
Prompt: The president of the United States is
Generated text:  trying to decide whether to visit the moon or visit a small town. He is considering the following options:

Option A: Visit the moon, which costs $200,000, but the president believes that the benefits outweigh the costs and will return with a return on investment of 15%.

Option B: Visit a small town, which costs $300,000, and the president believes that the benefits ar
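
Because `llm.generate` returns one result per prompt (each a dict with a `"text"` field, as the loop above relies on), batch results are easy to persist. The sketch below writes prompt/completion pairs as JSON Lines. `save_results` and the sample `outputs` list are hypothetical and stand in for real engine output.

```python
import json
import os
import tempfile


def save_results(prompts, outputs, path):
    """Write one JSON object per line pairing each prompt with its generated text."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    with open(path, "w", encoding="utf-8") as f:
        for prompt, output in zip(prompts, outputs):
            record = {"prompt": prompt, "completion": output["text"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hypothetical data shaped like the return value of llm.generate(prompts, ...).
prompts = ["The capital of France is", "The future of AI is"]
outputs = [{"text": " Paris."}, {"text": " bright."}]

path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
save_results(prompts, outputs, path)
with open(path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
print(records)
```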

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your interests and experiences. Let's chat! [Name] [Job Title] [Company Name] [Company Address] [Company Phone Number] [Company Email] [Company Website] [Company LinkedIn Profile] [Company Twitter Profile] [Company Facebook Profile] [Company Instagram Profile] [Company GitHub Profile] [Company LinkedIn Profile] [Company Twitter Profile] [Company Facebook Profile] [Company Instagram Profile] [Company LinkedIn Profile] [Company Twitter Profile] [Company Facebook Profile] [Company Instagram Profile]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also famous for its rich history, including the French Revolution and the French Revolution Museum. Paris is a bustling city with a diverse population and is home to many cultural institutions, including the Louvre Museum, the Musée d'Orsay, and the Musée d'Art Moderne. The city is also known for its cuisine, including French cuisine, and is home to many famous restaurants and bars. Paris is a vibrant and exciting city that is a must-visit for anyone interested in French

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a greater emphasis on ethical considerations. This will include issues such as bias, transparency, accountability, and the impact of AI on society.

2. Integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn and adapt to new situations. This will require significant advances in machine learning and natural language processing.

3. Development of new AI technologies: There will be
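
The "overlap removal" mentioned in the cell above refers to merging streamed chunks without duplicating text when a chunk repeats the tail of what has already been received. The sketch below illustrates that idea on plain strings. `strip_repeated_prefix` and `merge_chunks` are hypothetical helpers written for this example, and the actual `stream_and_merge` helper may work differently.

```python
def strip_repeated_prefix(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that already ends `existing`."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_repeated_prefix(merged, chunk)
    return merged


# Hypothetical stream in which each chunk repeats part of the previous one.
chunks = [" Paris is", " is the capital", " capital of France."]
print(merge_chunks(chunks))
```

One caveat of this simple approach: a chunk that legitimately repeats the preceding text is also trimmed, so it is only appropriate when repeats at chunk boundaries are known to be artifacts of streaming.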



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Alex. I love to write and I'm really good at it. How can I help you today? To get started, what can I expect from our conversation? Hello, my name is Alex. I love writing and I'm really good at it. How can I help you today? I'm excited to get to know you better. Is there a particular genre of writing that you're interested in? I'm always looking for new ideas and fresh perspectives to work with. What kind of writing do you enjoy doing? Writing is a really fun and rewarding activity for me. I enjoy pushing myself to be better at my craft and I'm

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is the largest city in France by population and is the economic and cultural center of the country. It is located in the Île de France region and is known for its historical landmarks, art, music, and cuisine
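
One reason to prefer the asynchronous API is that generation can be awaited alongside other coroutines. The sketch below shows the pattern with `asyncio.gather`. `StubAsyncEngine` is a hypothetical stand-in (it only sleeps and echoes) so the example runs without a model, and it assumes nothing about the real engine beyond the `async_generate(prompts, sampling_params)` call used above.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; sleeps instead of running a model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def generate_batches(engine, batches, sampling_params):
    """Await several batches concurrently; results keep the order of `batches`."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(
    generate_batches(StubAsyncEngine(), batches, {"temperature": 0.8, "top_p": 0.95})
)
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```

For a fixed list of prompts, a single `async_generate` call as shown in the cell above is simpler. Gathering is mainly useful when requests arrive independently.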

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [job title] at [company]. I'm a dedicated [job title] who has always been passionate about [job title]. I love [job title] because [reason why I love it]. I'm constantly learning and growing, and I'm always looking for opportunities to contribute to the company and the industry. I'm excited about the future of [job title] and I'm looking forward to what it will be like to work with you.
I'm a [job title] at [company] who loves [job title]. I have always been passionate about [job title] and have always wanted


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The statement can be summarized as follows: "Paris serves as the administrative and cultural center of France."

(Note: This is a factual statement about Paris, which is the capital city of France, and not a complex or abstract statement.)

Please note that while Paris is indeed the capital of France, it is not the only administrative and cultural center of the country. The statement provided is a general overview of Paris' role as the capital.

For a more comprehensive answer, a more detailed description of Paris's role as the capital, including its historical significance and current status, would be more appropriate. For example, Paris is home


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by several key trends, including:

1. Increased emphasis on ethical considerations: As more companies and governments begin to recognize the potential risks of AI, there will be an increasing focus on how AI can be developed, deployed, and used in a responsible and ethical way. This will likely lead to more stringent regulations and standards, as well as a greater emphasis on transparency and accountability in AI systems.

2. Deep learning and big data: AI will continue to benefit from advances in deep learning and big data technologies, which will allow systems to learn and adapt more effectively. This will also lead to new ways of processing and analyzing large



```python
llm.shutdown()
```