# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other code that already runs an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
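
The following standard-library sketch illustrates the underlying problem and does not call SGLang at all. It assumes the usual `asyncio` behavior that `asyncio.run` refuses to start while another event loop is already running, which is the situation inside IPython or Jupyter. The coroutine names `inner` and `outer` are hypothetical and exist only for this illustration.

```python
import asyncio


async def inner():
    # Hypothetical coroutine standing in for any async work.
    return "done"


async def outer():
    # Inside this coroutine an event loop is already running,
    # similar to code executed in an IPython/Jupyter cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Applying `nest_asyncio.apply()` before such a call is the workaround this section describes for environments that already own an event loop.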

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Kylie from The Future. I’m a singer and songwriter from Manchester, England. I’m currently working on a solo album that I’m trying to release on my own label, however I haven’t signed yet. I’m trying to capture my current feeling of life, which seems to be that of being a fly on the wall, watching the world go by. I am very open to feedback and feedback is always welcome. As an AI language model, I don't have personal feelings or experiences, but I can provide you with general information on the topic you're interested in. How can I assist you with your songwriting or recording needs?
Prompt: The president of the United States is
Generated text:  the head of the executive branch of the federal government, and as such, all executive branch departments and agencies are overseen by the president. The President of the United States has the authority to determine the direction and policies of the executive branch, and has the power to issue executi
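
In an offline batch job the results are usually persisted rather than printed. The sketch below shows one possible way to pair each prompt with its output and write JSON Lines. It relies only on the fact, visible in the cell above, that each output is a dictionary with a `"text"` key. The helper name `write_jsonl`, the record field names, and the hand-written `outputs` list are assumptions for illustration; no engine is started here.

```python
import io
import json


def write_jsonl(prompts, outputs, handle):
    """Write one JSON record per (prompt, output) pair to a text handle."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(prompts)


# Hand-written stand-in for the list returned by a batch generate call.
prompts = ["The capital of France is"]
outputs = [{"text": " Paris."}]

buffer = io.StringIO()  # replace with open("results.jsonl", "w") for a real file
written = write_jsonl(prompts, outputs, buffer)
print(buffer.getvalue(), end="")
```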

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French Quarter. Paris is a bustling metropolis with a rich cultural heritage and is a popular tourist destination. The city is known for its cuisine, fashion, and art scene. It is also home to the world's largest library, the Bibliothèque nationale de France. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly. It is a city that has been a hub of

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes, reduce costs, and improve efficiency. As AI technology continues to advance, we can expect to see even more widespread use of AI in healthcare, particularly in areas such as diagnosis, treatment planning, and patient monitoring.

2. Greater integration of AI into everyday life: AI is already being integrated into everyday life through applications such as voice assistants, smart home devices, and self-driving cars
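
The cell above mentions "overlap removal". As a rough illustration of the idea, and not the actual implementation of `stream_and_merge`, the sketch below merges streamed chunks while dropping any prefix of a new chunk that repeats the end of the text collected so far. The function names `trim_overlap` and `merge_chunks` and the sample chunks are assumptions made for this example.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that repeats the end of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, removing overlap between consecutive pieces."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hand-written chunks whose boundaries overlap on purpose.
chunks = ["Hello wor", "world, this", "this is fine"]
merged = merge_chunks(chunks)
print(merged)
```

A suffix/prefix comparison like this is quadratic in the worst case, which is acceptable for short chunks but may need a smarter approach for very long outputs.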



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [First Name], and I am a [Role in a Profession]. I am the [Role in a Profession]. 

I am [First Name] and I have been [Number of Years in Profession] years of experience in [Role in a Profession]. I have always loved the [Role in a Profession] and have always wanted to be a [Role in a Profession] like it. 

So far, I have been [Number of Successes] and have always kept [Number of Challenges] in my profession. I am always looking for ways to [What I Hope to Achieve], but also keep [What I Hope to Avoid

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

[Answer based on the text] In the given text, it is mentioned that the capital of France is Paris. Therefore, the answer is Paris. 

To arrive at the answer, I examined the provided text to identify the location of Paris. The text states "The capital of Fr
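
One reason to prefer an asynchronous call is that other coroutines can make progress while a request is pending. The sketch below shows that pattern with `asyncio.gather`, using a stub in place of the engine so it runs anywhere. `fake_async_generate` and its `delay` parameter are hypothetical stand-ins and are not part of SGLang.

```python
import asyncio


async def fake_async_generate(prompt, delay):
    # Hypothetical stand-in for an awaitable generation call.
    await asyncio.sleep(delay)
    return {"text": f" <completion for {prompt!r}>"}


async def run_concurrently(jobs):
    # Start all requests, then wait for every one of them.
    # gather returns results in the order the awaitables were passed in.
    pending = [fake_async_generate(prompt, delay) for prompt, delay in jobs]
    return await asyncio.gather(*pending)


jobs = [("slow prompt", 0.02), ("fast prompt", 0.0)]
results = asyncio.run(run_concurrently(jobs))
texts = [result["text"] for result in results]
print(texts)
```

Even though the second job finishes first, the results come back in submission order, so they can still be zipped with the original prompts.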

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [career] at [location]. I'm a [soft degree] graduate with [number] years of experience in [industry]. I'm passionate about [description of your profession], I enjoy [description of your profession], and I strive to [description of your profession]. How would you describe your character?

[Name]: I am a [career] at [location], a [soft degree] graduate with [number] years of experience in [industry]. I am passionate about [description of your profession], I enjoy [description of your profession], and I strive to [description of your profession]. I am

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the largest city in France and the capital of the country. It is known as the "City of Love" due to its romantic atmosphere and historical landmarks. The city is home to the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is a major cultural hub and one of the world's most populous cities, with an estimated population of around 27 million. The city is also known for its fashion industry, gastronomy, and world-class sports teams like Paris Saint-Germain and the Paris Marathon. The Paris metro system serves as a major transportation network for the city. Paris

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  full of possibilities and exciting developments. Here are some of the possible future trends in artificial intelligence:

1. Increased Personalization: As AI becomes more advanced, it will be possible to tailor personalized experiences to individuals. This could lead to more efficient and effective communication, personalized marketing, and targeted advertising.

2. Autonomous Vehicles: Autonomous vehicles will continue to evolve and become more advanced. They will be able to drive safely and efficiently, and will also be able to communicate with pedestrians, cyclists, and other vehicles in real-time.

3. AI in Healthcare: AI will be used in healthcare to improve the accuracy of diagnoses, personalize treatment plans, and

In [6]:
llm.shutdown()
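
In a script, as opposed to a notebook, an exception raised before the last line would skip the `shutdown()` call. One possible way to guard against that is a small context manager, sketched below with a stub engine so the example runs without a GPU. `FakeEngine` and `managed_engine` are hypothetical names, and the stub only mimics the `shutdown()` method used above.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in that only records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    # Yield the engine and always call shutdown(), even if the body raises.
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed_engine(engine):
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.is_shut_down)
```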