# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, for use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
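
To see why the patch is needed, the standard-library sketch below reproduces the error raised when `asyncio.run()` is called from inside a loop that is already running, which is the situation in IPython or Jupyter. It does not use SGLang; `inner` and `outer` are illustrative names only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: not allowed without a patch
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio.apply()` in place, nested calls of this kind are permitted instead of raising.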

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Sherry from The Giftshop. Today, we're going to be talking about a very special topic: Time Travel. As a professional in the field of design, I can tell you that this concept is considered extremely complicated, but I would be happy to share some of the fun and exciting possibilities when it comes to time travel.

What can time travel do for us? Well, there are a few options that come to mind. The most obvious is the ability to take a trip in time, where you can live in the past or the future. There are also advanced time travel technology that allows for the creation of time machines or time dilation
Prompt: The president of the United States is
Generated text:  in a room with 1000 people, including the president himself. The president tells everyone, "I am the only one who is not the president." How many people are left in the room?
To determine how many people are left in the room after the president's statement, we can follow these steps:
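
For offline batch jobs it is often useful to persist the results. The sketch below writes prompt/completion pairs to a JSONL file. It is only an illustration: the `outputs` list is a hand-written stand-in for the value returned by `llm.generate`, assuming only the `text` key used above, and `write_jsonl` is a hypothetical helper, not part of SGLang.

```python
import json
import os
import tempfile

prompts = ["Hello, my name is", "The capital of France is"]
# Stand-in for the list returned by llm.generate(prompts, sampling_params);
# only the "text" key is assumed.
outputs = [{"text": " Alice."}, {"text": " Paris."}]


def write_jsonl(path, prompts, outputs):
    """Write one JSON record per prompt/completion pair."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    with open(path, "w", encoding="utf-8") as f:
        for prompt, output in zip(prompts, outputs):
            record = {"prompt": prompt, "completion": output["text"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


path = os.path.join(tempfile.mkdtemp(), "generations.jsonl")
write_jsonl(path, prompts, outputs)

# Read the file back to check the round trip.
with open(path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
print(records)
```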


### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you. What can you tell me about yourself? I'm a [job title] at

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French National Museum, and the French Academy of Sciences. Paris is a bustling city with a rich cultural heritage and is a popular tourist destination. The city is also known for its cuisine, including French cuisine, and is home to many famous French restaurants and cafes. Paris is a city that is constantly evolving and is a must-visit destination for anyone interested in French culture and history. 

Paris is also home to many other notable landmarks, including the Lou

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with other technologies: AI is already being integrated into a wide range of devices and systems, from smartphones and wearables to autonomous vehicles and smart homes. As these technologies continue to evolve, we can expect to see even more seamless integration between AI and other technologies, such as blockchain and quantum computing.

2. Enhanced capabilities: AI is likely to continue to evolve and become more capable, with new algorithms and models being developed to solve increasingly complex problems. This could include tasks such as image and speech recognition, natural language processing, and predictive analytics.

3.
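
The "overlap removal" mentioned above can be pictured as follows: when a new chunk begins with text that already ends the accumulated output, only the unseen remainder is appended. This is a minimal sketch of that idea under stated assumptions, not necessarily the logic inside `stream_and_merge`; `drop_overlap` and `merge_chunks` are hypothetical helper names.

```python
def drop_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


chunks = ["The capital", " capital of France", " France is Paris."]
print(merge_chunks(chunks))
```

One caveat of this approach is that it cannot tell an accidental overlap from a genuine repetition: a chunk that legitimately repeats the preceding text would also be trimmed.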



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert character's name here] and I'm a [insert character's profession or role here]. I'm confident, determined, and always ready to help others. I'm passionate about learning and always looking for ways to improve myself. I'm honest, trustworthy, and I value honesty above all else. I strive to be a role model for others and inspire them to succeed. I'm confident in my abilities and know that I can achieve anything I set my mind to. What's your name, and what do you do? [insert name here] [insert profession or role here] I'm confident, determined, and always ready to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic architecture, rich history, and vibrant arts scene. The city is home to the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral, among many other landmarks. Paris is a
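
An awaitable generation call lets other coroutines make progress while the result is pending. The sketch below illustrates the pattern with `asyncio.gather`. It runs without SGLang: `fake_async_generate` is a hypothetical stand-in that only mimics the shape of the result (a dict with a `text` key).

```python
import asyncio


async def fake_async_generate(prompt):
    """Stand-in for an awaitable generation call; returns a dict with 'text'."""
    await asyncio.sleep(0.01)  # simulate waiting on the engine
    return {"text": f" <completion for: {prompt}>"}


async def run_all(prompts):
    # gather() runs the coroutines concurrently and preserves input order.
    return await asyncio.gather(*(fake_async_generate(p) for p in prompts))


prompts = ["Hello, my name is", "The capital of France is"]
results = asyncio.run(run_all(prompts))
for prompt, result in zip(prompts, results):
    print(prompt, "->", result["text"])
```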

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [Current Age] year old [Occupation]. I'm an English teacher at [Your School Name] in [Your Location]. I have been teaching English for [Number of Years] years, and I am passionate about using technology and modern educational methods to engage students. I enjoy collaborating with students and students with similar backgrounds, and I believe that education should be accessible to all. [Name] believes in the power of language to bridge cultures and promote understanding. My approach to teaching is consistent and results-driven, with a focus on understanding each student's unique strengths and weaknesses. I strive to make learning fun

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris, also known as "La Chapelle" or "L’Île de la Cité" and officially the "City of Paris", is the largest city in France by area, the seat of the French Government, and the preeminent cultural, artistic and commercial center in the world.

The city is located on the eastern bank of the Seine river, just across from the Louvre Museum. Its history dates back over 5,000 years, and Paris has been the capital of France since the 12th century. The city is home to one of the largest and most populous urban agglomer

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  constantly evolving and there are many potential areas of development that could lead to significant changes and innovations in the field. Here are some possible future trends in AI:

1. Increased Intelligence: AI is expected to become even more intelligent and capable in the future. Machine learning algorithms could learn and improve on their own, leading to more complex and nuanced solutions to complex problems.

2. Personalization: AI will continue to personalize and automate many aspects of our lives, including communication, entertainment, and business. This could lead to more efficient and effective use of resources.

3. Autonomous Vehicles: Autonomous vehicles will become more common and widely used, with AI playing
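
When consuming a stream, it helps to know whether each item is only the new text or the full text so far. The sketch below assumes the second case and converts cumulative snapshots into deltas with an async generator. Whether a given SGLang streaming call behaves this way is an assumption to verify for your version; `fake_cumulative_stream` and `as_deltas` are hypothetical names and the block does not call SGLang.

```python
import asyncio


async def fake_cumulative_stream():
    """Stand-in stream that yields the full text generated so far."""
    for text in ["The", "The future", "The future of AI"]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield text


async def as_deltas(stream):
    """Convert cumulative snapshots into newly added text only."""
    seen = 0
    async for text in stream:
        delta = text[seen:]
        seen = len(text)
        if delta:
            yield delta


async def collect():
    return [delta async for delta in as_deltas(fake_cumulative_stream())]


deltas = asyncio.run(collect())
print(deltas)
```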




```python
llm.shutdown()
```
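
In scripts, it can be worth making sure `shutdown()` runs even when generation raises. One possible pattern is a small context manager, sketched below. `FakeEngine` is a hypothetical stand-in exposing only `generate` and `shutdown`, and `managed` is an illustrative helper, not an SGLang API.

```python
from contextlib import contextmanager


class FakeEngine:
    """Stand-in exposing only the two methods this sketch needs."""

    def __init__(self):
        self.closed = False

    def generate(self, prompt):
        if not prompt:
            raise ValueError("empty prompt")
        return {"text": " ..."}

    def shutdown(self):
        self.closed = True


@contextmanager
def managed(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed(engine) as llm:
        llm.generate("")  # raises, but shutdown() still runs
except ValueError:
    pass
print(engine.closed)
```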