# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
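
As a rough illustration of the custom-server use case, the sketch below wraps an engine-like object in a plain request handler. `EchoEngine` and `handle_generate` are hypothetical names invented for this example. The stand-in only mimics a `generate(prompt, sampling_params)` call and the `output["text"]` field used later in this document, and it assumes that a single prompt yields a single dict. The linked `custom_server.py` remains the authoritative example.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompt, sampling_params=None):
        # Assumed return shape: a dict that carries the generated text under "text".
        return {"text": f" (echo of: {prompt})"}


def handle_generate(engine, raw_body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response body."""
    request = json.loads(raw_body)
    output = engine.generate(request["prompt"], request.get("sampling_params", {}))
    return json.dumps({"text": output["text"]}).encode("utf-8")


response = handle_generate(
    EchoEngine(),
    b'{"prompt": "Hello", "sampling_params": {"temperature": 0.2}}',
)
print(response.decode("utf-8"))
```

Because the handler only depends on the `generate` method, it could be called from any web framework's route function, with the real engine passed in place of the stand-in.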



## Nest Asyncio
To use the **Offline Engine** in IPython, or in any other code that already runs inside an event loop, add the following first:
```python
import nest_asyncio

nest_asyncio.apply()
```
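
To see why this is needed, the following standard-library sketch reproduces the underlying problem: calling `asyncio.run` while another event loop is already running raises a `RuntimeError`. The function names `inner` and `outer` are made up for this illustration, and `outer` merely plays the role of a notebook cell that is already executing inside a loop. `nest_asyncio.apply()` patches `asyncio` so that such nested calls are permitted.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```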

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Kelly, and I'm a Senior at the University of Michigan. I'm a research scientist at the Stanford Cognitive Neuroscience Laboratory, where I investigate the neural underpinnings of decision-making. I'm particularly interested in the neural organization of the language system. I'm a first-year student in the Class of 2021, and I'm at the University of Michigan, a university located in Ann Arbor, Michigan, United States.
Education:
- Bachelor of Science in psychology and mathematics from the University of Michigan, Ann Arbor, Michigan, United States; May 2020 – May 2022
- Master
Prompt: The president of the United States is
Generated text:  a person. This statement is true for:

A) All U. S. Presidents
B) Some U. S. Presidents
C) No U. S. Presidents
D) None of the above
To determine the true statement, let's analyze the information given in the statement and our options.

The statement says: "The president of the United States is a person."

This 
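
When the prompt list is large, it can be convenient to submit it in slices and handle results as they come back, for example to write them to disk incrementally. The sketch below shows one way to do that on the caller side. `EchoEngine` and `generate_in_chunks` are hypothetical names, and the stand-in only mimics the `generate(prompts, sampling_params)` call and the `output["text"]` field used in the cell above. The engine does its own scheduling, so this is bookkeeping rather than a performance recommendation.

```python
from typing import Dict, Iterator, List, Tuple


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine; returns one dict per prompt."""

    def generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict]:
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


def generate_in_chunks(
    engine, prompts: List[str], sampling_params: Dict, chunk_size: int
) -> Iterator[Tuple[str, str]]:
    """Yield (prompt, generated_text) pairs, submitting chunk_size prompts at a time."""
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = engine.generate(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            yield prompt, output["text"]


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
pairs = list(
    generate_in_chunks(EchoEngine(), prompts, {"temperature": 0.8}, chunk_size=2)
)
for prompt, text in pairs:
    print(f"Prompt: {prompt}\nGenerated text: {text}")
```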

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Age] year old [Occupation]. I'm a [Type of Vehicle] with [Number of Wheels] wheels. I'm [Favorite Color] and I love [Favorite Activity]. I'm [Favorite Book] and I enjoy [Favorite Food]. I'm [Favorite Movie] and I love [Favorite Music]. I'm [Favorite Sport]. I'm [Favorite Place]. I'm [Favorite Animal]. I'm [Favorite Movie]. I'm [Favorite Book]. I'm [Favorite Food]. I'm [Favorite Movie]. I'm [Favorite Book]. I'm [Favorite Food]. I'm [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history dating back to the Roman Empire and the Middle Ages. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. The city is also famous for its fashion industry, art scene, and its role in the French Revolution. Paris is a vibrant and diverse city with a population of over 2 million people. It is a popular tourist destination and a major economic center in France. The city is home to many world-renowned museums, theaters, and restaurants. Paris is a city of

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies are already being used in a wide range of applications, from self-driving cars to personalized medicine. As these technologies continue to improve, we can expect to see even more innovative applications emerge. Additionally, AI is likely to continue to be integrated into various industries, from healthcare to finance to manufacturing, as companies seek to optimize their operations and improve their efficiency. Finally, AI is likely to continue to evolve and improve, driven by new research and developments in the field. Overall, the future of AI looks bright, with potential for
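
The "overlap removal" mentioned in the cell above can be pictured with the following sketch. It assumes that a streamed chunk may repeat the tail of the text received so far, and it trims that repeated prefix before appending. `trim_overlap` and `merge_chunks` are illustrative names for this document, not a statement of how `stream_and_merge` is implemented internally.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks while skipping repeated overlap."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Simulated stream in which some chunks repeat the tail of the previous text.
simulated = [" Paris", " Paris, also", " also known as", " the City of Light."]
print(merge_chunks(simulated))
```

One limitation of this heuristic is that a chunk which legitimately repeats the preceding text (for example `"ha"` followed by `"ha"`) is treated as overlap and dropped.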



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Jane Smith. I'm a quiet, empathetic writer who enjoys exploring the depths of my own subconscious. I'm currently working on a novel about a character I created, and I'm excited to share it with you. What can you tell me about yourself? Jane Smith, a quiet writer with a deep emotional connection to the subconscious, is passionate about exploring the depths of one's own psyche through writing. Her current project, a novel about a character she created, is a joyful endeavor for her. She enjoys sharing her thoughts and insights with others, and is always eager to learn more about the fascinating world of writing. What are your interests

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its rich history, arts and culture, and iconic landmarks such as the Eiffel Tower and Notre-Dame Cathedra
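
An asynchronous interface also makes it possible to issue several independent requests and wait for all of them together. The sketch below shows that pattern with `asyncio.gather`. `EchoAsyncEngine` and `generate_concurrently` are hypothetical names, and the stand-in only mimics an awaitable `async_generate` that returns a dict with a `"text"` field. Whether per-prompt requests perform differently from passing the whole list in one call is not claimed here.

```python
import asyncio
from typing import Dict, List


class EchoAsyncEngine:
    """Hypothetical stand-in exposing an async_generate coroutine."""

    async def async_generate(self, prompt: str, sampling_params: Dict) -> Dict:
        await asyncio.sleep(0.01)  # pretend to do some work
        return {"text": f" <completion for {prompt!r}>"}


async def generate_concurrently(
    engine, prompts: List[str], sampling_params: Dict
) -> List[Dict]:
    """Issue one request per prompt and wait for all of them together."""
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather preserves input order, so results line up with prompts.
    return await asyncio.gather(*tasks)


prompts = ["Hello, my name is", "The capital of France is"]
results = asyncio.run(
    generate_concurrently(EchoAsyncEngine(), prompts, {"temperature": 0.8})
)
for prompt, result in zip(prompts, results):
    print(f"Prompt: {prompt}\nGenerated text: {result['text']}")
```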

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I’m a [age] year old [gender] person. I come from [location] and I have a passion for [interest, hobby, or hobby that interests me]. I'm a [job] and I'm always looking for ways to [future goal, goal that interests me]. I'm excited to meet you. What's your name? What's your age? What's your gender? What's your location? What's your job? What's your passion? What's your future goal? What's your future goal? What's your job? What's your hobby or hobby that interests you? What

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is renowned for its medieval architecture, beautiful parks, and world-renowned festivals such as the Eiffel Tower and the Notre-Dame Cathedral.

To see the latest news and events in Paris, consider visiting the official website of the city or checking out its social media channels for updates and insider tips.

While the city is known for its impressive architecture, its cuisine is also a notable feature. The city has a rich culinary tradition dating back to the Middle Ages, with iconic dishes such as the Boudin and the Ecalle.

Another attraction in Paris is the Louvre Museum, which houses the world's largest

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly unpredictable, but here are some potential trends that could shape its trajectory:

1. Increased integration with human intelligence: AI is already becoming more closely integrated with human intelligence, with machines able to learn and adapt from the same data as humans. This could lead to more sophisticated and personalized AI systems that can learn from multiple sources and incorporate human expertise and knowledge.

2. Autonomous and semi-autonomous machines: As AI becomes more sophisticated, we can expect to see more autonomous and semi-autonomous machines that can perform a wide range of tasks without human intervention. These machines could be used in a variety of applications, from healthcare to manufacturing to transportation.
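
If the streamed text is needed as a single string afterwards, the chunks can be collected while they are printed. The sketch below shows that pattern with a plain async generator. `fake_stream` and `collect` are hypothetical names, and `fake_stream` only stands in for a chunk stream such as the one consumed in the cell above.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(text: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for a chunk stream; yields one word at a time."""
    for i, word in enumerate(text.split(" ")):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word if i == 0 else " " + word


async def collect(stream: AsyncIterator[str]) -> str:
    """Print chunks as they arrive and also return the assembled text."""
    parts: List[str] = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream("Paris is the capital of France.")))
```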






In [6]:
llm.shutdown()
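
In a script, it may be worth making sure that `shutdown()` runs even when inference raises an exception. One possible way is a small context manager, sketched below. `EchoEngine` and `managed_engine` are hypothetical names for this illustration, and the stand-in only mimics the `generate` and `shutdown` methods used in this document.

```python
from contextlib import contextmanager


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompt, sampling_params=None):
        return {"text": f" <completion for {prompt!r}>"}

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown(), even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = EchoEngine()
try:
    with managed_engine(engine) as managed:
        managed.generate("Hello")
        raise ValueError("simulated failure during inference")
except ValueError:
    pass

print(engine.is_shut_down)
```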