# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed, working example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
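
Environments such as Jupyter already run an event loop, and `asyncio.run()` refuses to start a second loop inside a running one; `nest_asyncio.apply()` patches asyncio so that the nested call is allowed. The following is a minimal standard-library sketch of the failure that occurs without the patch. It does not involve SGLang, and `inner` and `outer` are illustrative names only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # At this point we are inside a running event loop, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second, nested event loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Without the patch, the nested `asyncio.run()` call raises a `RuntimeError`; the exact message depends on the Python version.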

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Chris and I'm a full stack developer in Seattle, WA. My focus is on building web applications and web-based services for developers. I like to solve problems with clean and efficient code.
Here are some key points from my experience:
- I've worked on the following projects:
  1. I was part of a team at a startup called Bootcamp that made a board game software for developers. The team focused on building a high-level game engine and a plugin manager. I created a web interface for the engine that would allow for developers to create custom game objects and create and share game assets.
  2. I was also
Prompt: The president of the United States is
Generated text:  attempting to get a new campaign fund. The fund includes the cost of advertising, which is $500,000, and the cost of the campaign itself, which is $800,000. If the president's net contribution to the campaign is $150,000, what is the amount of campaign expenses?

The president's net con
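
Both completions above end mid-sentence, which is what one would expect when generation stops at a token limit rather than at a natural end. Below is a minimal sketch of raising that limit, assuming the `max_new_tokens` key in `sampling_params`. `FakeEngine` is a hypothetical stand-in, not part of SGLang: it counts whitespace-separated words instead of model tokens so that the snippet runs without loading a model. With a real engine, pass `llm` instead.

```python
class FakeEngine:
    """Hypothetical stand-in for sgl.Engine; NOT part of SGLang."""

    CANNED = "Paris is the capital and largest city of France"

    def generate(self, prompts, sampling_params):
        # Treat each whitespace-separated word as one "token".
        limit = sampling_params.get("max_new_tokens", 4)
        words = self.CANNED.split()
        return [{"text": " " + " ".join(words[:limit])} for _ in prompts]


def generate_with_limit(engine, prompts, max_new_tokens):
    # Same sampling settings as the cell above, plus an explicit length cap.
    params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": max_new_tokens}
    outputs = engine.generate(prompts, params)
    return [out["text"] for out in outputs]


short = generate_with_limit(FakeEngine(), ["The capital of France is"], 2)
longer = generate_with_limit(FakeEngine(), ["The capital of France is"], 6)
print(short)
print(longer)
```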

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Age] year old [Occupation]. I'm currently [Current Location]. I'm a [Favorite Hobby] enthusiast. I'm a [Favorite Book] lover. I'm a [Favorite Movie] fan. I'm a [Favorite Music] lover. I'm a [Favorite Sport] enthusiast. I'm a [Favorite Food] lover. I'm a [Favorite Animal] lover. I'm a [Favorite Movie] fan. I'm a [Favorite Book] lover. I'm a [Favorite Movie] fan. I'm a [Favorite Book] lover. I'm a [Favorite Movie

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French Quarter, a historic district known for its French colonial architecture. Paris is a bustling city with a rich cultural heritage and is a popular tourist destination. The city is also home to many international organizations and institutions, including UNESCO and the International Olympic Committee. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into the city's vibrant culture. The French capital is a city of art, culture

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn and adapt to human behavior and preferences. This could lead to more personalized and context-aware AI systems that can better understand and respond to human needs.

2. Enhanced capabilities in natural language processing: AI is likely to become even more capable in natural language processing, allowing machines to understand and respond to human language in ways that are more intuitive and natural. This could lead to more efficient and effective communication systems, as well as more accurate and reliable language processing.

3. Greater emphasis on ethical
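
The "overlap removal" mentioned in the cell above can be pictured as follows: if a streamed chunk begins with text that the accumulated output already ends with, only the new suffix is appended. The sketch below illustrates that idea in plain Python. It is not necessarily how `stream_and_merge` is implemented, and the chunk list is made up for illustration.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that existing_text already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Accumulate chunks into one string, skipping repeated overlaps."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical stream in which each chunk repeats the tail of the previous one.
chunks = [" Paris is", " is the capital", " capital of France."]
merged_text = merge_chunks(chunks)
print(merged_text)
```

One caveat of this naive approach is that a legitimate repetition (a chunk that happens to start with the text the output ends with) is also trimmed, so it is only a reasonable fit when overlaps are known to be duplicates.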



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am [Age]. I have [Number] years of experience in the [Industry] field. I am a [Type of Person] with a passion for [What I like to do]. I am a [Subject] who is always [What I do best]. And I am excited to have the opportunity to [What I hope to do]. Let me know if you would like to introduce me to anyone else. [Age] years of experience in the [Industry] field. I am a [Type of Person] with a passion for [What I like to do]. I am a [Subject] who is

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the most populous city in Europe and the third largest city in the world by population. The city is located on the Seine River and is known for its historic architecture, vibrant culture, and stunning natural scenery. Paris is the birthplace of French literature, art, and cinema, and its cultur
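
The asynchronous entry point is mainly useful when generation has to share an event loop with other work. The sketch below issues one request per prompt and awaits them together with `asyncio.gather`. `FakeAsyncEngine` is a hypothetical stand-in, not part of SGLang, so that the snippet runs without a model; it assumes that `async_generate` called with a single prompt returns a single dict with a `text` key.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; NOT part of SGLang."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    # One coroutine per prompt; gather returns results in input order.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


prompts = ["Hello, my name is", "The capital of France is"]
outputs = asyncio.run(
    generate_concurrently(FakeAsyncEngine(), prompts, {"temperature": 0.8})
)
texts = [out["text"] for out in outputs]
print(texts)
```

Passing the whole prompt list in a single `async_generate` call, as in the cell above, is the simpler choice when all prompts are known up front.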

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a [Age] year old, [Occupation/Role], and I've been working for [Company] for [Number of Years] years. Currently, I'm a [Status] employee, [Company], and I enjoy [I like about [Company]]. I love [Company's Work Culture] and look forward to [Company's Next Big Move]. What's your name, and what's your occupation? I'm [Name], I'm a [Occupation/Role], and I've been working for [Company] for [Number of Years] years. Currently, I'm a [Status] employee



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a modern city with a rich cultural heritage and a diverse population. It is located in the centre of the country, with the Seine River running through its middle and the Eiffel Tower rising prominently in the skyline. Paris is a cultural and business hub for Europe, known for its vibrant street food, opulent museums, and beautiful gardens. The city is also home to the Louvre Museum, the Eiffel Tower, and the Paris Opera, among other notable attractions. Paris has a long and storied history dating back to the Roman Empire and is known for its 19th-century architecture and its role in the



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to involve several trends that will shape the technology and its impact on society:

1. Increased autonomy: AI systems will become more capable of making autonomous decisions, including driving cars, building homes, and even deciding what to wear. This will lead to new job losses in areas like human healthcare and transportation.

2. Enhanced creativity: AI will be able to produce innovative and creative work, such as music, art, and science fiction. This will lead to new forms of entertainment and storytelling.

3. Improved communication: AI will become more intelligent and able to communicate effectively with people, leading to more efficient and effective communication in various fields.

4




In [6]:
llm.shutdown()
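
In a standalone script, as opposed to a notebook, it can be worth making sure that `shutdown()` runs even when generation raises. Below is a minimal sketch using `try`/`finally`. `FakeEngine` is a hypothetical stand-in, not part of SGLang, whose `generate` always fails so that the cleanup path is exercised.

```python
class FakeEngine:
    """Hypothetical stand-in for sgl.Engine; NOT part of SGLang."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        raise RuntimeError("simulated failure during generation")

    def shutdown(self):
        self.is_shut_down = True


def run_batch(engine, prompts, sampling_params):
    try:
        return engine.generate(prompts, sampling_params)
    finally:
        # Runs whether generate() returned normally or raised.
        engine.shutdown()


engine = FakeEngine()
try:
    run_batch(engine, ["Hello, my name is"], {"temperature": 0.8})
except RuntimeError as exc:
    print(f"generation failed: {exc}")
print(engine.is_shut_down)
```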