# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, for use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that runs inside a nested event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
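
To see why this patch matters, the sketch below uses only the standard library to reproduce the underlying error: calling `asyncio.run` from code that is already inside a running event loop, which is the situation in IPython and Jupyter. The helper names `inner` and `outer` are illustrative and are not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # a second, nested asyncio.run() is rejected
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio.apply()` in effect, the nested call is permitted instead of raising.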

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

[2026-02-01 01:59:09] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.


[2026-02-01 01:59:09] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.


[2026-02-01 01:59:09] INFO utils.py:164: NumExpr defaulting to 16 threads.




[2026-02-01 01:59:12] INFO server_args.py:1775: Attention backend not specified. Use fa3 backend by default.


[2026-02-01 01:59:12] INFO server_args.py:2762: Set soft_watchdog_timeout since in CI






[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.72it/s]



Capturing batches (bs=1 avail_mem=76.08 GB): 100%|██████████| 20/20 [00:01<00:00, 15.02it/s]


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Huaqing. I am a 13-year-old high school student. I have been studying and practicing Chinese since my childhood. I also have a pet dog named Xingguo. He is my pet, but I don't like to play with him because he is often my "friend".

Q1: Why do you like dogs?

Q2: Why do you dislike playing with Xingguo?

A: (1) The Chinese people love dogs very much. In many people's hearts, dogs have a special place in their lives, and it's very rare to see a person without dogs in their life. Dogs
Prompt: The president of the United States is
Generated text:  running for a second term. To ensure that the second term will not be interrupted, the president will ask for and receive an additional term from Congress. By law, Congress is required to approve the president's request for a second term, but there is a risk that Congress will approve the request and the president will be considered to have won a second term. Furthermore, there is a 50% chance that Congr

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your interests and experiences. Let's chat! [Name] [Job Title] [Company Name] [Company Address] [City, State, Zip Code] [Phone Number] [Email Address] [LinkedIn Profile] [Twitter Profile] [Facebook Profile] [Instagram Profile] [GitHub Profile] [LinkedIn Profile] [Twitter Profile] [Facebook Profile] [Instagram Profile] [GitHub Profile] [LinkedIn Profile] [Twitter Profile] [Facebook Profile] [Instagram Profile] [GitHub Profile] [LinkedIn

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. It is also a major cultural and economic center, hosting numerous museums, theaters, and other attractions. Paris is a popular tourist destination and a major hub for international business and diplomacy. It is also known for its rich history and diverse cultural scene. The city is home to many famous French artists, writers, and musicians, and is a major center for the arts and entertainment industry. Paris is a vibrant and dynamic city with a rich cultural heritage that continues to attract visitors from around the world. The city is also known

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing for more complex and nuanced decision-making. This could lead to more sophisticated and adaptive AI systems that can learn from human behavior and adapt to new situations.

2. Greater emphasis on ethical considerations: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical considerations. This could lead to more rigorous testing and evaluation of AI systems, as well as greater transparency and accountability in their development and deployment.

3. Increased use of AI in healthcare: AI is already being used in healthcare
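
The "overlap removal" mentioned in the banner above can be pictured with a small, self-contained sketch. This is an illustration of the general idea only and is not presented as the implementation behind `stream_and_merge`: the function names `trim_overlap` and `merge_chunks`, and the sample chunks, are assumptions made for this example.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, skipping text that repeats the current tail."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks whose boundaries overlap.
chunks = ["The capital", " capital of France", " France is Paris."]
result = merge_chunks(chunks)
print(result)
```

One caveat of this approach: a chunk that legitimately repeats the end of the previous text is trimmed as well, so it suits streams where overlap at chunk boundaries is known to be an artifact.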



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a [Age] year old [Gender] [Occupation]. I enjoy [list all about your hobbies, interests, and passions]. [You are passionate about] [describe a specific hobby or activity you enjoy]. I'm a team player who [describe a specific personality trait or quality]. I value [mention the qualities that matter most to you]. I have a good knowledge of [specific subject or area of interest]. I'm a [describe any other characteristics or qualities]. [You are someone who] [state a positive attribute that you believe in].
Hello, my name is [Name]. I'm a [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is known for its charming canals, art museums, and iconic Notre-Dame Cathedral, among other attractions. The city is renowned for its rich history, particularly in terms of its role in the French Revolution 
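
One reason to prefer the asynchronous API is that generation can be awaited alongside other coroutines. The following is a minimal sketch of that pattern and is not SGLang code: `StubEngine` is a hypothetical stand-in whose `async_generate` is assumed to accept a single prompt and return a dict with a `text` key, mirroring the output shape used above.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for an engine; it only echoes the prompt."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params):
    # Start one request per prompt and wait for all of them together.
    # asyncio.gather returns results in the same order as its arguments.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


prompts = ["The capital of France is", "The future of AI is"]
outputs = asyncio.run(generate_all(StubEngine(), prompts, {"temperature": 0.8}))
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Because the results come back in argument order, pairing them with the prompts through `zip` stays correct even when individual requests finish at different times.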

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a [age], [gender] (male or female) with [skill or quality]. I'm [occupation] in [field or area of study]. What can you tell me about yourself? I am a [insert your personality trait or characteristic, if any]. And what brings you here today? I believe it's important to be honest and to share who you are with someone. What are your goals for the future and how will you pursue them? I'm excited to meet you. Remember to keep your intro short and to the point, focusing on your unique qualities and achievements. Use a neutral and unbiased



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in France and is a UNESCO World Heritage Site. France's national capital, Paris, is the seat of the government, the heart of the economy, and the center of culture and entertainment, boasting the world's most famous museums, restaurants, and architecture. Paris is also a hub of the cultural industry, hosting many of France's major cultural festivals, such as the Festival de Cannes, the Musée de l'Orangerie, and the Opéra. Paris is a cultural and economic center, and it has the largest metropolitan area in the world, with a population of approximately 7 million people.



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  full of exciting possibilities, and here are some potential trends to look out for:

1. Advancements in machine learning and neural networks: With the help of powerful computing power and massive amounts of data, AI models will become increasingly sophisticated. This will lead to more accurate predictions and better solutions to complex problems.

2. Increased integration of AI with human intelligence: AI will continue to merge with human intelligence, making them more effective and efficient. This will lead to more personalized and adaptive solutions that are tailored to individual needs.

3. Rise of new forms of AI: With the development of quantum computers, AI will become more powerful and capable of solving
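
The `async for` consumption pattern used in the cell above can be tried without a GPU. The sketch below is an assumption-laden illustration, not SGLang's implementation: `stub_stream` is a hypothetical async generator that reports the cumulative text at each step, and `stream_deltas` turns that into the newly added fragments so they can be printed incrementally.

```python
import asyncio


async def stub_stream(text, step=8):
    """Hypothetical stream that yields the cumulative text generated so far."""
    for end in range(step, len(text) + step, step):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield {"text": text[:end]}


async def stream_deltas(stream):
    """Yield only the text that is new since the previous chunk."""
    seen = 0
    async for chunk in stream:
        text = chunk["text"]
        yield text[seen:]
        seen = len(text)


async def collect(text):
    pieces = []
    async for delta in stream_deltas(stub_stream(text)):
        print(delta, end="", flush=True)  # print fragments as they arrive
        pieces.append(delta)
    print()
    return pieces


pieces = asyncio.run(collect(" Paris, the capital of France."))
```

Length-based slicing like this only works if every chunk really contains the full text so far; for streams that emit fresh fragments with occasional overlap, a suffix/prefix comparison is the safer choice.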




```python
llm.shutdown()
```