# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
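
As a rough illustration of the custom-server idea, the sketch below wraps an engine-like object in a plain function that turns a JSON request body into a JSON response. `handle_generate` and `EchoEngine` are hypothetical names introduced here. `EchoEngine` is a stand-in so the snippet runs without a GPU, and it only assumes the `generate(prompt, sampling_params)` call and the `output["text"]` field used later in this document. The linked `custom_server.py` remains the reference example.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs anywhere."""

    def generate(self, prompt, sampling_params=None):
        # A real engine would return model output; this just echoes the prompt.
        return {"text": f" (echo) {prompt}"}


def handle_generate(engine, request_body: str) -> str:
    """Turn a JSON request body into a JSON response using the engine."""
    try:
        payload = json.loads(request_body)
        prompt = payload["prompt"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return json.dumps({"error": "expected a JSON object with a 'prompt' field"})

    sampling_params = payload.get("sampling_params", {})
    output = engine.generate(prompt, sampling_params)
    return json.dumps({"text": output["text"]})


if __name__ == "__main__":
    print(handle_generate(EchoEngine(), '{"prompt": "Hello"}'))
```

Keeping the handler independent of any web framework means the same function could sit behind whichever HTTP layer you choose.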



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
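
For context, `asyncio.run` refuses to start when an event loop is already running in the current thread, which is the situation in environments such as Jupyter. The standard-library sketch below checks for that condition. `has_running_loop` and `check_inside_loop` are hypothetical helper names, not part of SGLang or `nest_asyncio`.

```python
import asyncio


def has_running_loop() -> bool:
    """Return True when called from inside an already running event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False
    return True


async def check_inside_loop() -> bool:
    # Inside a coroutine the loop is running, as in a notebook cell.
    return has_running_loop()


if __name__ == "__main__":
    print("outside a loop:", has_running_loop())
    print("inside a loop:", asyncio.run(check_inside_loop()))
```

If the check returns `True` at the point where you create or call the engine, applying `nest_asyncio` as shown above is the likely fix.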

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```





### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Mike and I'm the CEO of an investment firm that focuses on global bonds. I am a very successful investor in the bond market, but I do not understand why the value of bonds is decreasing.

Why is the value of bonds declining?

It is important to be aware that I am not an expert in the field of bond valuation. I know that changes in the market can lead to fluctuations in bond prices, but I am not able to provide a definitive answer on the specific cause of the declining bond value. 

One possible reason for the declining value of bonds could be an increase in inflation, as inflation can erode the purchasing power of
Prompt: The president of the United States is
Generated text:  a man. His name is Donald Trump. He's not a Republican or a Democrat. He's just a regular guy. He is responsible for the country. He is a very important job. 

But Donald Trump is not a nice man. In fact, he's not even a nice person. He's not even a good person. He's not 
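
In offline batch jobs the generated text is often saved rather than printed. One possible way to do that is sketched below: each prompt is paired with its output and written as one JSON record per line. `write_jsonl` is a hypothetical helper, and the demo data is a placeholder shaped like the `outputs` list above (a list of dicts with a `"text"` field).

```python
import io
import json


def write_jsonl(prompts, outputs, fp) -> int:
    """Write one JSON record per prompt/output pair; return the record count."""
    count = 0
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


if __name__ == "__main__":
    # Placeholder data shaped like the result of llm.generate(prompts, ...).
    demo_prompts = ["The capital of France is"]
    demo_outputs = [{"text": " Paris."}]
    buffer = io.StringIO()
    write_jsonl(demo_prompts, demo_outputs, buffer)
    print(buffer.getvalue(), end="")
```

Passing an open file object instead of the `io.StringIO` buffer would write the records to disk.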

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a bustling metropolis with a rich history and culture, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also a major center for fashion, art, and music, and is home to many world-renowned museums, theaters, and restaurants. The city is known for its romantic atmosphere and is a popular tourist destination, attracting millions of visitors each year. Paris is a vibrant and dynamic city that continues to evolve and grow, with a rich history and a strong sense of community. The city is a symbol

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies are expected to continue to improve and become more integrated into our daily lives, from self-driving cars to personalized healthcare and financial services. Additionally, there is a growing focus on ethical considerations and the development of AI that is designed to be fair, transparent, and responsible. As AI becomes more integrated into our daily lives, we can expect to see a greater emphasis on privacy, security, and data protection. Overall, the future of AI is likely to be one of continued innovation, integration, and ethical considerations. 

Can you
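
The "overlap removal" mentioned in the cell above can be pictured with the following sketch. It assumes that a streamed chunk may begin with text that was already received, and it drops the longest such repeated prefix before appending. `trim_overlap` and `merge_chunks` are illustrative names used here; the actual implementation of `stream_and_merge` in `sglang.utils` may differ.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks while skipping repeated overlap."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


if __name__ == "__main__":
    print(merge_chunks(["Hello", "lo wor", "world"]))
```

A caveat of this heuristic is that a genuinely repeated piece of text (for example two identical consecutive chunks) cannot be told apart from overlap and would be dropped.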



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [age] year old [gender] [race] who moved to [city or country] from [previous location] after [short explanation of how the move occurred]. I've always had a passion for [specific hobby or activity], [explain the hobby or activity]. I'm [age] years old, and I've always been fascinated by [what interests you]. I'm an [education level] student and I enjoy [what I enjoy doing in my free time]. I value [value] in life and I'm [age] years old. What are some common mistakes to avoid when introducing yourself

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a historic city known for its rich history, beautiful architecture, and vibrant culture. It is the capital of France and the largest city in Europe by population, with over 10 million inhabitants. The city is famous for its Notre-Dame Cathed
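
Passing the whole prompt list in one call, as above, is the simplest option. When prompts instead arrive from separate coroutines (for example in a custom server), one possible pattern is to await one request per prompt concurrently and cap the number in flight. The sketch below shows that pattern with `asyncio.gather` and a semaphore. `FakeAsyncEngine` and `generate_all` are hypothetical names, and the stand-in engine only assumes an `async_generate(prompt, sampling_params)` coroutine that returns a dict with a `"text"` field.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in exposing an async_generate coroutine."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0)  # Yield control, as real awaited work would.
        return {"text": f" (generated for: {prompt})"}


async def generate_all(engine, prompts, sampling_params, max_concurrency=2):
    """Run one request per prompt, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(prompt):
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather returns results in the same order as the input prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


if __name__ == "__main__":
    results = asyncio.run(
        generate_all(FakeAsyncEngine(), ["a", "b", "c"], {"temperature": 0.8})
    )
    for result in results:
        print(result["text"])
```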

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Age] year-old [Career] [Occupation]. I'm an extroverted and outgoing person who enjoys spending time with friends and family. I'm passionate about [Relevant Interest], and I believe that it's important to help others and contribute to the world in some way. I'm always looking for new challenges and opportunities to learn and grow, and I'm eager to share my experiences and insights with others. I'm confident in my abilities and am always eager to improve. I'm a [Relational] person who values honesty, integrity, and respect. I'm available 24



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, commonly known as "La Reine", or simply Paris, and is the second-largest city in the European Union and the third-largest city in the world. It is a historic and modern metropolis with an important cultural, artistic, and intellectual center, and home to many important museums and museums. It is known for its rich culture, food, fashion, and cuisine. Paris is also a popular tourist destination, known for its landmarks, such as the Eiffel Tower, Louvre Museum, and Notre Dame Cathedral. Its climate is subtropical, and the city is known for its climate, with mild winters and warm summers.



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain and complex, but some trends that are likely to shape its direction include:

1. Increased use of AI in healthcare: As more people become more reliant on AI in healthcare, we can expect to see more AI-driven diagnostics, personalized treatment plans, and drug discovery. AI can help doctors make more accurate diagnoses, identify potential diseases early, and optimize treatment plans.

2. Increased use of AI in manufacturing: AI can help manufacturers optimize production processes, predict equipment failures, and improve quality control. AI can also help manufacturers create more efficient supply chains, reduce waste, and improve productivity.

3. Increased use of AI in transportation: AI can
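
The streamed output above is consumed chunk by chunk with `async for`. As a small illustration of that consumption pattern, the sketch below turns a stream of cumulative chunks into deltas by slicing off the text already seen. Whether a given engine version emits cumulative or incremental chunks is an assumption here, and `fake_cumulative_stream`, `stream_deltas`, and `collect` are hypothetical names rather than SGLang functions.

```python
import asyncio


async def fake_cumulative_stream():
    """Hypothetical stream whose chunks repeat all text produced so far."""
    for text in [" Paris", " Paris is", " Paris is the capital"]:
        await asyncio.sleep(0)
        yield {"text": text}


async def stream_deltas(chunks):
    """Yield only the text that is new in each cumulative chunk."""
    seen = 0
    async for chunk in chunks:
        text = chunk["text"]
        yield text[seen:]
        seen = len(text)


async def collect(chunks):
    """Gather all deltas into a list, mainly for inspection or testing."""
    return [delta async for delta in stream_deltas(chunks)]


if __name__ == "__main__":
    for delta in asyncio.run(collect(fake_cumulative_stream())):
        print(repr(delta))
```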




```python
llm.shutdown()
```
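
In a script, it can be useful to guarantee that `shutdown()` runs even when generation raises an exception. One way to do that, sketched below with a hypothetical `engine_session` helper, is a context manager built on `try`/`finally`. `FakeEngine` is a stand-in that only records the call, so the snippet runs without a GPU.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(make_engine):
    """Create an engine and always call shutdown(), even if the body raises."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


if __name__ == "__main__":
    with engine_session(FakeEngine) as engine:
        print("inside the block:", engine.is_shut_down)
    print("after the block:", engine.is_shut_down)
```

With a real engine, the factory argument could be something like `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`, reusing the model path from the launch cell.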