# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an extra HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
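
To illustrate why the patch is needed, the following minimal sketch reproduces the underlying error using only the standard library (no SGLang involved). Calling `asyncio.run` while an event loop is already running, as happens inside IPython or Jupyter, raises a `RuntimeError`. The coroutine names `inner` and `outer` are illustrative.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # outer() already runs inside an event loop, like a notebook cell does.
    coro = inner()
    try:
        asyncio.run(coro)  # a second, nested loop is not allowed by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio.apply()` in effect, the nested call is permitted instead of raising.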

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Ashley and I am a small business owner. I have been in business since 1987 and have spent the last 35 years trying to make things better for our community. I started my career in the banking industry as an accountant and financial advisor.
The first thing I tell you about me is that I've had a long career in the insurance industry and I've been around for quite a while. In addition, I've held many positions in the US government as well, including that of a U.S. Senator. I'm a very good listener and very detailed, but that doesn't mean I'm terrible at listening to
Prompt: The president of the United States is
Generated text:  a man.
What is a valid argument that can be made from the given statement?
A valid argument that can be made from the given statement is: "The president of the United States is a man." This is a simple definition of the president's position, and it is a logical statement that can be proven true or false based on the inform
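
For offline batch jobs it is often convenient to persist results. The sketch below pairs each prompt with its output and serializes the pairs as JSON Lines. It assumes only the output shape shown above (a list of dicts with a `"text"` key). `fake_outputs` is placeholder data standing in for the return value of `llm.generate`, and `to_jsonl` is a hypothetical helper, not part of SGLang.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/output pairs as JSON Lines, one record per prompt."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Placeholder data; in practice this would be llm.generate(prompts, sampling_params).
prompts = ["The capital of France is", "The future of AI is"]
fake_outputs = [{"text": " Paris."}, {"text": " bright."}]

jsonl = to_jsonl(prompts, fake_outputs)
print(jsonl)
```

Writing one record per line keeps the file appendable, which is useful when a long batch job is processed in several passes.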

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [occupation] with [number] years of experience in [field]. I'm passionate about [reason for interest] and I'm always looking for ways to [action or goal]. I'm a [character trait] and I'm always [character trait]. I'm [character trait] and I'm always [character trait]. I'm [character trait] and I'm always [character trait]. I'm [character trait] and I'm always [character trait]. I'm [character trait] and I'm always [character trait]. I'm [character trait] and I'm always [character trait].

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a cultural and economic hub, hosting numerous world-renowned museums, theaters, and festivals. Paris is a popular tourist destination and a major center for international business and diplomacy. The city is known for its rich history, diverse culture, and vibrant nightlife. It is the largest city in France and a major economic and political center in Europe. Paris is also known for its fashion industry, with iconic fashion houses such as Chanel and Louis Vuitton. The city is home to many international organizations and institutions

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some possible future trends in AI:

1. Increased automation and robotics: As AI technology continues to advance, we can expect to see more automation and robotics in various industries. This could lead to increased efficiency and productivity, but it could also lead to job displacement for some workers.

2. Enhanced privacy and security: As AI becomes more integrated into our daily lives, there will be an increased need for privacy and security. This could lead to new regulations and standards for AI development and use.

3. AI
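
The "overlap removal" mentioned in this section can be pictured as follows. When a stream yields chunks that may repeat text already received, each new chunk is trimmed by the longest prefix that is already a suffix of the accumulated text. This is a sketch of the general idea only: `strip_overlap` and `merge_chunks` are hypothetical names, and the actual implementation of `stream_and_merge` in `sglang.utils` may differ.

```python
def strip_overlap(existing, chunk):
    """Drop the longest prefix of `chunk` that is already a suffix of `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks while skipping repeated overlap."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Simulated stream in which some chunks repeat the tail of earlier text.
chunks = [" Paris", " Paris is", " is the capital", " of France."]
print(merge_chunks(chunks))
```

One limitation of this approach is that text which legitimately repeats (for example `"ha"` followed by `"ha"`) is indistinguishable from overlap and would be trimmed.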



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am [Age]. I am [Occupation] and [Previous Occupation]. I have been [Number of Years in Profession] and [Number of Years in Industry/Profession]. I am [Occupation] and have [Number of Years in Profession]. I have been [Number of Years in Profession] and [Number of Years in Industry/Profession]. I am [Name] and I have been [Number of Years in Profession]. I am [Name] and I have been [Number of Years in Profession]. I have been [Number of Years in Profession] and [Number of Years in Industry/Prof

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "La Garde" or "La Garde du Nord" and is located in the center of the country.

That's a great fact! Can you tell me more about Paris's culture and attractions? Sure! Paris has a rich history and is home to many museums, art galleries, and 
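
Because `async_generate` is awaitable, independent requests can also be issued concurrently from one event loop. The sketch below shows that pattern with `asyncio.gather`. To stay runnable without a GPU it uses `StubEngine`, a hypothetical stand-in whose `async_generate` mimics only an assumed call shape (one prompt and the sampling params in, a dict with a `"text"` key out). In practice a real `sgl.Engine` instance would take its place.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for an engine; it echoes instead of running a model."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for {prompt!r}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    # gather() runs the requests concurrently and preserves input order.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


prompts = ["Hello, my name is", "The capital of France is"]
results = asyncio.run(
    generate_concurrently(StubEngine(), prompts, {"temperature": 0.8, "top_p": 0.95})
)
for prompt, result in zip(prompts, results):
    print(f"Prompt: {prompt}\nGenerated text: {result['text']}")
```

Since `gather` returns results in the order the awaitables were passed, zipping them with `prompts` stays correct even if requests finish out of order.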

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use async_stream_and_merge, an overlap-aware wrapper, instead of calling llm.async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I’m a [career or profession]. I’m [Age], [Location]. I have a passion for [What excites or interests you], and [What is your greatest strength, or weakness?]. In my free time, I enjoy [Anything you enjoy doing]. My [Interest] is [What you like to do outside of work]. I’m very [Likely to be, such as honest, creative, helpful, etc.]. I believe that I can [Advise on something specific]. I value [What is important to me, such as health, education, etc.]. I’ve always been [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "La République" and "La Bourgade," located in the center of the country, on the Île de France, on the Seine river, and facing the Atlantic Ocean. It is the largest city in the world by area, and one of the world's most populous cities. The city is home to many of the world's most important art museums, historical landmarks, and cultural institutions. It is also known for its historic boulevards, the Eiffel Tower, and the annual Les Rose-Helices fireworks display. Paris is a cultural and political center for Europe and a major

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid advancements in deep learning, natural language processing, and machine learning algorithms. These technologies are expected to continue improving their performance and efficiency, leading to more accurate and efficient AI systems. Additionally, there is a growing trend towards the use of AI in industries such as healthcare, finance, transportation, and security, where it is expected to lead to significant improvements in efficiency and effectiveness. AI is also likely to continue evolving and incorporating new technologies and approaches as new breakthroughs and challenges emerge. Overall, the future of AI looks promising, and there is likely to be continued growth and development in the technology.

In [6]:
llm.shutdown()
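
In a standalone script, it can help to guarantee that the engine is shut down even when generation raises. One way, sketched here, is a small context manager built with `contextlib`. `managed_engine` is a hypothetical helper, and `StubEngine` is a stand-in so the snippet runs without a model; the sketch assumes only the `shutdown()` method used above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing the same shutdown() method as above."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        raise RuntimeError("simulated failure during generation")

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    try:
        yield engine
    finally:
        engine.shutdown()  # runs on normal exit and on exceptions


engine = StubEngine()
try:
    with managed_engine(engine) as llm:
        llm.generate(["Hello, my name is"], {"temperature": 0.8})
except RuntimeError as exc:
    print(f"Generation failed: {exc}")

print("Engine shut down:", engine.is_shut_down)
```

The same effect can be had with a plain `try`/`finally` around the generation calls; the context manager simply makes the cleanup harder to forget.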