# Offline Engine API

SGLang provides a direct inference engine that does not require an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you use the **Offline Engine** in IPython or in any other environment that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
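As a rough sketch of when the patch matters, the helper below checks whether the current thread already has a running event loop and only then applies `nest_asyncio`. The function names `loop_is_running` and `apply_nest_asyncio_if_needed` are hypothetical and not part of SGLang; the only library calls used are `asyncio.get_running_loop()` and `nest_asyncio.apply()`.

```python
import asyncio


def loop_is_running() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread (for example, a plain script).
        return False
    return True


def apply_nest_asyncio_if_needed() -> bool:
    """Patch asyncio only when a loop is already running; return whether it was patched."""
    if not loop_is_running():
        return False
    import nest_asyncio  # imported lazily so plain scripts do not need it

    nest_asyncio.apply()
    return True
```

In a plain script the second helper returns `False` and changes nothing; inside a notebook cell it would apply the patch.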

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Yan Wu. I'm a self-taught programmer from a university with more than 30 years of experience in the field. My main interests lie in the development of AI and the enhancement of machine learning algorithms. I'm passionate about exploring the possibilities and limitations of the development of AI. I'm also an advocate of the principles of global cooperation and collaboration between people from different countries and cultures. I believe that AI should be a tool for humanity's benefit and not a tool for profit.
I'm always looking for new challenges to tackle and learning from new developments in the field. I'm eager to contribute to the development of AI and
Prompt: The president of the United States is
Generated text:  running for re-election. To evaluate her chances of winning, a poll is conducted in a large city. The pollsters find that 45% of the population are in favor of the president, and 60% of those polled support her. If 10% of the pop
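
For batch jobs it is often convenient to persist results rather than print them. The following is a minimal sketch, assuming only what the cell above shows: `outputs` is a list of dicts with a `"text"` key, aligned with `prompts`. The helper name `to_jsonl` and the demo values are hypothetical.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text and serialize one JSON object per line."""
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Stand-in data shaped like the `outputs` list above (hypothetical values).
demo_prompts = ["The capital of France is"]
demo_outputs = [{"text": " Paris."}]
print(to_jsonl(demo_prompts, demo_outputs))
```

The returned string can be written to a file as-is; each line is an independent JSON record.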

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Occupation]. I'm a [Skill] who has been [Number of Years] years in the industry. I'm passionate about [What I Love to Do]. I'm always looking for new challenges and opportunities to grow and learn. I'm a [Favorite Hobby] and I enjoy [What I Do for Fun]. I'm always looking for ways to improve my skills and knowledge. I'm a [Personality Trait] who is [What You Do Best]. I'm always ready to learn and grow, and I'm excited to meet new people and make new friends.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament and the French Quarter, where many famous French artists and writers live and work. Paris is a bustling metropolis with a rich cultural heritage and is a popular tourist destination. The city is known for its cuisine, including French cuisine, and is home to many museums, theaters, and other cultural institutions. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. The city is also known for its fashion industry, with many

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more advanced, it is likely to become more integrated with human intelligence, allowing it to learn and adapt in ways that are difficult for humans to do. This could lead to more sophisticated forms of AI that can learn from human behavior and adapt to new situations.

2. Greater reliance on data: AI will become more data-driven, with more and more data being used to train and improve AI systems. This will require more sophisticated data analysis and processing techniques to extract meaningful insights from large datasets.

3. Increased use of AI in healthcare: AI is
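
The "overlap removal" mentioned in the cell above can be illustrated with a small, self-contained sketch. This is not the implementation used by `stream_and_merge`; it is one plausible way to merge streamed chunks whose beginnings may repeat the end of the text received so far. The names `trim_overlap` and `merge_chunks` are hypothetical here.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends `existing`."""
    max_len = min(len(existing), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    # No overlap found: keep the chunk unchanged.
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, removing text repeated at each chunk boundary."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(merge_chunks(["The cap", "capital", " of France"]))
```

One limitation of this approach: a chunk that legitimately repeats the preceding text (for example `"ha"` followed by `"ha"`) is treated as overlap and dropped, so it only suits streams where repeats at chunk boundaries are known to be duplicates.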



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert character's name], and I am a [insert character's profession or background]. I am passionate about [insert one or two words that describe what I enjoy or love to do], and I am always looking for new adventures and opportunities to grow and learn. Whether I am hiking in the mountains, solving complex mathematical problems, or playing the piano, I am always ready to step into a new challenge. I am a natural teacher and love sharing my knowledge and expertise with others. I am always ready to help people reach their full potential and make the most of their lives. I am a [insert one or two words that describe my

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is a historical and cultural center with a long history dating back to ancient times. The city is known for its rich cultural heritage, i
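
Because `async_generate` is awaitable, several independent requests can be awaited together with `asyncio.gather`. The sketch below shows the pattern with a hypothetical `FakeEngine` stand-in that mimics only the call shape used above (a list of prompts in, a list of `{"text": ...}` dicts out), so it runs without a GPU or SGLang installed; `run_groups` is likewise a hypothetical helper.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in that mimics only the `async_generate` call shape."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real engine call would
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_groups(engine, prompt_groups, sampling_params):
    """Submit each prompt group as its own request and await them together."""
    tasks = [engine.async_generate(group, sampling_params) for group in prompt_groups]
    # gather returns results in the same order as `tasks`.
    return await asyncio.gather(*tasks)


groups = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(
    run_groups(FakeEngine(), groups, {"temperature": 0.8, "top_p": 0.95})
)
for group, outputs in zip(groups, results):
    for prompt, output in zip(group, outputs):
        print(prompt, "->", output["text"])
```

With a real engine, `FakeEngine()` would be replaced by the `llm` object created earlier; whether concurrent submission helps throughput depends on the engine's scheduler and is not measured here.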

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [insert occupation or profession here]. My background and interests are in [insert a short summary of your background here]. I'm a [insert a short summary of your interests here]. I enjoy [insert a short summary of your hobbies here]. I'm always looking for new experiences and learning opportunities, so I'm always open to trying new things and learning from others. I'm a [insert a short description of your personality or character trait here]. I'm a [insert a short description of your personality trait here]. I'm confident and can be a great mentor or friend. I'm always eager to learn



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "La République Populaire" or simply "Paris."

Based on that summary, can you provide a 3-word sentence that includes the city's name and a fact about it? Yes, "Paris is the largest city in France and home to the Eiffel Tower."

In Spanish, the statement could be rephrased as "La capital de Francia es París, también conocida como 'La République Populaire' o simplemente 'Paris'. Esta ciudad tiene el Eiffel Tower como símbolo de su antigua historia y importante ciudad."

This sentence incorporates



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid advancements in the areas of machine learning, robotics, natural language processing, and computer vision. With the help of new technologies such as deep learning, quantum computing, and autonomous systems, AI is likely to become more intelligent and capable of solving complex problems in a faster and more efficient manner.

One of the key trends in AI is the increasing importance of ethical considerations. As AI is becoming more integrated into our daily lives, there is a growing need for its development to be guided by ethical principles. This includes considerations of privacy, safety, fairness, and transparency, among others.

Another area of AI development is the
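
When the full text is needed after streaming finishes, the chunks can be accumulated while they are displayed. A minimal sketch, assuming only that the stream is an async iterator of string chunks as in the loop above; `fake_stream` is a hypothetical stand-in for the real stream, and `collect_stream` is a hypothetical helper.

```python
import asyncio


async def fake_stream(pieces):
    """Hypothetical stand-in for an async stream of text chunks."""
    for piece in pieces:
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield piece


async def collect_stream(stream, on_chunk=None):
    """Consume an async chunk stream, optionally reporting each chunk, and return the full text."""
    parts = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)


full_text = asyncio.run(collect_stream(fake_stream([" Paris", ",", " also", " known"])))
print(repr(full_text))
```

Passing `on_chunk=lambda c: print(c, end="", flush=True)` would reproduce the incremental printing shown above while still returning the complete string.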




```python
llm.shutdown()
```
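
In a script, an exception raised during generation would skip a trailing `shutdown()` call. One way to guard against that is a small context manager built on `try`/`finally`. This is a sketch under stated assumptions: `managed_engine` is a hypothetical helper, and `FakeEngine` is a stand-in exposing only the `shutdown()` method used above.

```python
from contextlib import contextmanager


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call its shutdown() on exit, even after an error."""
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Hypothetical stand-in that only records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


engine = FakeEngine()
try:
    with managed_engine(engine) as handle:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.closed)
```

With a real engine, the `with` block would wrap the `generate` calls, and the engine would be released whether or not they succeed.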