# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (nested-loop code), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
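To illustrate why nested-loop environments need special handling, the following minimal sketch uses only the standard library (no SGLang and no `nest_asyncio`). The coroutine names `inner` and `outer` are made up for this example. It shows that calling `asyncio.run` while an event loop is already running raises a `RuntimeError`, which is the situation `nest_asyncio.apply()` is meant to work around.

```python
import asyncio


async def inner():
    # Hypothetical coroutine standing in for any async work.
    return "done"


async def outer():
    # An event loop is already running here, as it would be in IPython.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by stock asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

The printed message should mention that `asyncio.run()` cannot be called from a running event loop.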

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")





### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Isabella and I'm a second-year student at a prestigious university. I'm really passionate about programming and have been learning to code for about a year now. I'm always looking for ways to improve my coding skills and I'm eager to learn new things. 

What are some of the best resources for learning to code online?

1. Coding forums
2. Codecademy
3. Udacity
4. Code.org
5. Python.org
6. GotoHome
7. Khan Academy
8. YouTube
9. Coursera
10. MDN Web Docs

Which resources would you recommend me for
Prompt: The president of the United States is
Generated text:  a person who is in power in a country. The president is the head of the executive branch of the government. He is the commander-in-chief of the armed forces. He also has the authority to make laws and to appoint judges.
What are the duties of a president?
Duties of the President:
The President serves as the head of the executive branch of government. The president is the commander-in-chief o

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Age] year old [Occupation]. I'm a [Type of Character] who has always been [Positive Traits]. I'm [Positive Traits] and I'm [Positive Traits]. I'm a [Positive Traits] who has always been [Positive Traits]. I'm a [Positive Traits] who has always been [Positive Traits]. I'm a [Positive Traits] who has always been [Positive Traits]. I'm a [Positive Traits] who has always been [Positive Traits]. I'm a [Positive Traits] who has always been [Positive Traits]. I'm a [Positive Traits] who

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also famous for its rich history, including the French Revolution and the French Revolution Museum. Paris is a bustling metropolis with a diverse population and a vibrant cultural scene, making it a popular tourist destination. The city is also home to many famous French artists and writers, including Pablo Picasso and André Breton. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. Its status as the world's most populous city has made it a major economic and political

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some possible future trends include:

1. Increased use of AI in healthcare: AI is already being used in healthcare to diagnose diseases, predict patient outcomes, and personalize treatment plans. As AI technology continues to improve, we can expect to see even more sophisticated applications in this field.

2. AI in manufacturing: AI is already being used in manufacturing to optimize production processes, reduce costs, and improve quality. As AI technology continues to evolve, we can expect to see even more advanced applications in this field.

3. AI in finance:
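The cell above mentions "overlap removal" when merging streamed chunks. As a rough illustration of what that can mean, here is a minimal pure-Python sketch. The helper names `trim_overlap` and `merge_chunks` are chosen for this example, and this is one plausible approach under the assumption that a new chunk may repeat the tail of the text received so far; it is not necessarily how `stream_and_merge` is implemented.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the tail of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found


def merge_chunks(chunks):
    """Concatenate chunks, dropping any prefix that duplicates earlier text."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Illustrative chunks (made up for this example) with overlapping boundaries.
chunks = ["The capital", "capital of France", " of France is Paris"]
merged = merge_chunks(chunks)
print(merged)
```

Checking the longest candidate overlap first keeps the merge greedy: the largest repeated prefix is removed, so the merged text does not repeat words at chunk boundaries.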



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am [Age]. I'm currently studying [Field of Study] at [University Name]. I am passionate about [Your Field of Study] and have been an active member of [Gym, Team, Club, etc.] since [Start Date]. I love [Sport or Activity] and it's the way I make the world a better place. I have a deep love for [Life Deed/Responsibility/Service], and I'm always ready to help and support others. I am [your answer] and I'm here to make a positive difference in the world. [Add your personal experience or achievements

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a city renowned for its rich cultural heritage, vibrant arts scene, and iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Additionally, Paris is known for its romantic atmosphere, vibrant nightlife, and world-clas
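One reason to prefer an awaitable generation call is that other coroutines can make progress while the request is pending. The sketch below illustrates this with the standard library only. `fake_async_generate` is a hypothetical stand-in written for this example (it is not an SGLang function); it only mimics the shape of the output used above, a list of dictionaries with a `"text"` key.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    """Hypothetical stand-in for an awaitable batch-generation call."""
    await asyncio.sleep(0.01)  # simulate waiting on the engine
    return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks):
    """Unrelated coroutine that keeps running while generation is pending."""
    count = 0
    for _ in range(ticks):
        await asyncio.sleep(0.001)
        count += 1
    return count


async def run_both():
    # gather schedules both coroutines on the same event loop.
    return await asyncio.gather(
        fake_async_generate(["a", "b"], {"temperature": 0.8}),
        heartbeat(3),
    )


outputs, beats = asyncio.run(run_both())
for output in outputs:
    print(output["text"])
```

In real code, the stand-in would be replaced by the engine's asynchronous call; the `gather` pattern itself is plain `asyncio`.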

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a/an [occupation/role] who specializes in [occupation/role]. I have a passion for [career goal]. I am always seeking to learn and grow, and I am always ready to help those who need it. I am a great communicator and always look to connect with others. I have a natural ability to adapt to new situations and to work under pressure. I am a bit of a perfectionist, but I am not afraid to take risks. I value integrity and honesty and I am committed to maintaining those values in all that I do. What is your favorite hobby or activity? I am a



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is the largest city in the European Union. It is also known as the “City of Light” due to its long, elegant architecture and its status as the cultural and artistic capital of the world. The city was founded by the Romans, and has been the capital of France since the time of King Charles V. It is located in the Île-de-France region and is the fourth largest metropolitan area in the world. The city is famous for its many museums, theaters, and landmarks, including Notre-Dame Cathedral, the Eiffel Tower, and the Louvre Museum. Paris is a vibrant and diverse city,



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to involve a number of emerging trends that could have significant impacts on our society and economy. Here are some potential trends that could shape the future of AI:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient care, diagnose diseases, and assist in treatment planning. As AI technology becomes more advanced, we may see even more widespread use of AI in healthcare in the coming years, with more personalized treatments and earlier detection of health issues.

2. Greater use of AI in finance: AI is already being used in finance to analyze large datasets and predict market trends. As AI technology improves, we may




In [6]:
llm.shutdown()
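
In a script, it can be useful to guarantee that `shutdown()` runs even if generation raises an exception. The following is a minimal sketch of that pattern using `contextlib`. The names `engine_session` and `DummyEngine` are hypothetical and exist only so the example runs without a GPU; the only assumption about the real engine is that it exposes a `shutdown()` method, as shown above.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(make_engine):
    """Create an engine and always call its shutdown() on exit."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


class DummyEngine:
    """Hypothetical stand-in for an inference engine, used for illustration."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params=None):
        # Echo the prompts in upper case instead of running a model.
        return [{"text": p.upper()} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


with engine_session(DummyEngine) as engine:
    results = engine.generate(["hello"])

print(results, engine.is_shut_down)
```

With a real engine, the factory argument could be a small function or `lambda` that constructs it; the `finally` clause is what ensures cleanup.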