# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
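The reason for this patch can be illustrated with the standard library alone. The sketch below uses a hypothetical `blocking_call` helper as a stand-in for any synchronous API that drives an event loop internally; whether the engine does exactly this is an assumption, not something this page states. Called from plain code the helper works, but called from inside a running loop (the situation in IPython or Jupyter) it raises `RuntimeError`.

```python
import asyncio


def blocking_call():
    """Hypothetical synchronous helper that starts its own event loop."""

    async def work():
        return "done"

    coro = work()
    try:
        return asyncio.run(coro)
    except RuntimeError as exc:
        # asyncio.run() refuses to start while another loop is running.
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


async def notebook_cell():
    # Simulates code executing inside an already-running event loop.
    return blocking_call()


outside = blocking_call()              # no loop running yet: succeeds
inside = asyncio.run(notebook_cell())  # loop already running: fails
print(outside)
print(inside)
```

`nest_asyncio.apply()` patches asyncio so that this kind of re-entrant call is allowed.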

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

[2025-11-09 09:42:07] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.


[2025-11-09 09:42:07] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.


[2025-11-09 09:42:07] INFO utils.py:164: NumExpr defaulting to 16 threads.


[2025-11-09 09:42:09] INFO trace.py:52: opentelemetry package is not installed, tracing disabled








[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.51it/s]



Capturing batches (bs=1 avail_mem=76.73 GB): 100%|██████████| 20/20 [00:00<00:00, 22.56it/s]


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Michele and I am a 22 year old woman. I have been feeling really sad lately and I've had a lot of emotional turmoil. I'm very isolated and have few friends. I have a partner, and we recently got divorced. We are still trying to decide what to do. My partner is currently under a lot of pressure, and I've been feeling like I'm not good enough. I've been having trouble sleeping and am not functioning well at work. I don't have a job and I'm not married and I'm worried that I'll be fired. I'm also feeling depressed and have a lot of grief.
Prompt: The president of the United States is
Generated text:  5 feet 3 inches tall. If it's a certain holiday, the president walks down the street at a speed of 3 feet per second. Assuming it is a holiday, how many seconds will it take for the president to walk 1000 feet? To determine how long it will take for the president to walk 1000 feet, we first need to find out how long it takes him to walk 1 foot at a s
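The `temperature` and `top_p` values passed above control how the next token is sampled. As a rough illustration of what these two parameters conventionally mean, here is a minimal pure-Python sketch on a toy four-token distribution. The helper names and the toy logits are made up for this example, and the code is not SGLang's sampler, only the textbook form of temperature scaling followed by nucleus (top-p) filtering.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens
probs = apply_temperature(toy_logits, temperature=0.8)
nucleus = top_p_filter(probs, top_p=0.95)
print(sorted(nucleus))  # indices of tokens that remain eligible
```

A lower temperature sharpens the distribution toward the top token, and a lower `top_p` shrinks the set of tokens that can be sampled at all.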

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Gender] [Occupation]. I'm a [Skill] who has been [Number of Years] years in the industry. I'm passionate about [What I Love to Do], and I'm always looking for ways to [What I Want to Improve]. I'm [What I Do Best], and I'm always eager to learn and grow. I'm [What I'm Looking for in a Job], and I'm always ready to jump in and help others. I'm [What I'm Looking for in a Partner], and I'm always looking for someone who can support

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also the birthplace of French literature, art, and cuisine. Paris is a bustling metropolis with a rich cultural heritage and is a major tourist destination. The city is home to many famous landmarks and attractions, including the Notre-Dame Cathedral, the Louvre Museum, and the Eiffel Tower. It is also a major center for business and finance, with many international companies and institutions headquartered in the city. Paris is a vibrant and dynamic city with a rich cultural and historical heritage.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies will continue to improve and become more integrated into our daily lives, from self-driving cars to personalized medicine to virtual assistants. Additionally, AI will continue to be used for ethical and social reasons, such as in the development of more equitable and inclusive technologies. As technology continues to evolve, it is likely that we will see more complex and nuanced AI systems that can handle a wide range of tasks and situations. Overall, the future of AI is likely to be one of continued innovation and growth, with a focus on ethical and social
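The "overlap removal" mentioned in the example can be pictured as follows: when consecutive streamed chunks repeat the tail of the text received so far, the repeated prefix is dropped before appending. The sketch below is an illustrative reimplementation of that idea with hypothetical helper names (`strip_repeated_prefix`, `merge_chunks`); it is not taken from `sglang.utils` and may differ from what `stream_and_merge` actually does.

```python
def strip_repeated_prefix(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that repeats the end of existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, removing overlap between neighbours."""
    merged = ""
    for chunk in chunks:
        merged += strip_repeated_prefix(merged, chunk)
    return merged


# Made-up chunks that overlap with the text streamed so far.
demo_chunks = ["The capital", " capital of France", " France is Paris."]
merged_text = merge_chunks(demo_chunks)
print(merged_text)
```

One trade-off of this approach is that a chunk which legitimately begins by repeating the preceding text would also be trimmed, so it only suits streams that are known to resend overlapping text.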



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name], and I am an [age] year old [gender] person. I am a [occupation] who has always been [positive adjective] towards the world. What's your name, and what are you currently doing? [Name]: A [characteristic of your occupation] professional, I am excited to meet you. How can I assist you today? [Name]: It's nice to meet you, [name]. I'm just a regular person, but I'm here to help you. [Name]: [Your name], what brings you to my world today? [Name]: I'm here to assist you with your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is a historic and cultural center with a rich history dating back to the Roman period. It is also home to many world-renowned landmarks, including the Eiffel Tower and the Louvre Museum. Paris is known for its vibrant nightlife, delicious cuisine, and annual cultural festiva
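An awaitable generation call is useful mainly because other coroutines can make progress while it is pending. The sketch below shows that pattern with `asyncio.gather`, using a hypothetical `fake_async_generate` coroutine in place of the real engine so that it runs without a GPU or a model. The stub and its return shape (a dict with a `text` key, mirroring the output above) are assumptions made for illustration.

```python
import asyncio


async def fake_async_generate(prompt, delay=0.01):
    """Hypothetical stand-in for an awaitable generation call."""
    await asyncio.sleep(delay)  # simulates time spent generating
    return {"text": f" <completion for: {prompt}>"}


async def run_concurrently(prompts):
    # Start all requests, then wait for them together.
    # gather() returns results in the same order as its arguments.
    pending = [fake_async_generate(p) for p in prompts]
    return await asyncio.gather(*pending)


demo_prompts = ["first", "second", "third"]
results = asyncio.run(run_concurrently(demo_prompts))
for prompt, result in zip(demo_prompts, results):
    print(prompt, "->", result["text"])
```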

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [First Name] and I'm a [Last Name]. I'm a creative writer, former professional soccer player, and a social media influencer. I love to travel, read, and learn new things. What's one thing you enjoy doing in your free time? I love to write! My work is always coming out of the blue, but it always feels like it was written by me. How do you go about your writing? I get ideas from the world around me, and then I go through a process of reorganizing them into a story. It can be fun to be able to rewrite something over and over again,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the 15th most populous city in the world and the largest city in the European Union. It is known as the “City of Love” due to its beautiful architecture and lively streets. Paris has many world-renowned landmarks such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. The city also has a rich cultural heritage and is home to many famous French artists, composers, and writers. It is a major center for business, finance, and politics, and has been a UNESCO World Heritage site since 1992. The city is famous for its world-famous fashion

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain, but there are several trends that could shape its development and applications in the coming years. Here are some of the most likely trends to watch:

1. Increased automation: With the rise of automation, we can expect AI to become more integrated into everyday life, from home automation systems to autonomous vehicles. This could lead to a more efficient and reliable system of transportation, as well as new industries that require human expertise.

2. AI in healthcare: AI is already being used in medical diagnosis and treatment, but there is no doubt that it has the potential to revolutionize the field. AI could help doctors make more accurate diagnoses, detect
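When the streamed text is needed as a single string rather than printed chunk by chunk, the same `async for` pattern can accumulate the chunks. The sketch below uses a hypothetical `fake_token_stream` async generator in place of the engine, so the chunk values are made up; only the consumption pattern is the point.

```python
import asyncio


async def fake_token_stream(chunks, delay=0.0):
    """Hypothetical async generator that yields text chunks one at a time."""
    for chunk in chunks:
        await asyncio.sleep(delay)  # hand control back to the event loop
        yield chunk


async def collect(stream):
    """Consume an async stream and return the concatenated text."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


demo_stream = fake_token_stream([" Paris", ",", " the", " capital"])
streamed = asyncio.run(collect(demo_stream))
print(repr(streamed))
```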




```python
llm.shutdown()
```
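In a script, it can be convenient to guarantee that shutdown runs even when generation raises. One possible way is a small context manager around the engine. The sketch below demonstrates the pattern with a hypothetical `StubEngine` class so that it runs without a model; the `managed` helper is not part of SGLang, and the only assumption about the real engine is the `shutdown()` method shown above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing generate() and shutdown()."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": p.upper()} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


stub = StubEngine()
try:
    with managed(stub) as engine:
        engine.generate([])  # raises, but shutdown() still runs
except ValueError:
    pass
print(stub.is_shut_down)
```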