# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop (a nested event loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
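
To illustrate why nested loops are a problem, here is a minimal, standard-library-only sketch. It does not use SGLang and does not show the engine's internals. It only reproduces the general `asyncio` restriction: starting a new event loop from code that is already running inside one raises an error. The function names `inner` and `outer` are made up for this illustration.

```python
import asyncio


async def inner():
    # Stand-in for any coroutine we would like to run to completion.
    return "done"


async def outer():
    # We are already inside an event loop here (started by asyncio.run below),
    # which is similar to the situation in IPython or Jupyter.
    coro = inner()
    try:
        asyncio.run(coro)  # attempt to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` patches `asyncio` so that this kind of re-entrant call is allowed, which is why it is recommended for notebook environments.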

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")





### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Wilson Mendoza. I have been exploring the Sea & its rich history for many years. I am a student at the University of the People, a university on the Moon. I have a passion for climate change, and I am working on an independent project that aims to make a positive impact on the planet and the people living on it.

I would like to highlight some of the things I've learned from the various aspects of the Sea in my journey so far. We are all connected, and it is important that we work together towards a better future. The Sea is incredibly important and a very diverse ecosystem with
Prompt: The president of the United States is
Generated text:  a very important person. Everyone likes him very much. He is very important to the whole country. Many people like to say that President Obama is the president of the United States. President Obama has been in office for more than 8 years. He has won many important jobs. He has helped a lot to the country. 

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [Age] year old [Occupation]. I am a [Skill] who has always been [Positive Trait]. I am [Positive Trait] and I am [Positive Trait]. I am [Positive Trait] and I am [Positive Trait]. I am [Positive Trait] and I am [Positive Trait]. I am [Positive Trait] and I am [Positive Trait]. I am [Positive Trait] and I am [Positive Trait]. I am [Positive Trait] and I am [Positive Trait]. I am [Positive Trait] and I am [Positive Trait]. I am [Positive Trait] and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history and a vibrant culture. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also famous for its fashion industry, art scene, and its role in the French Revolution and French Revolution-era architecture. Paris is a popular tourist destination and a major economic and cultural center in France. It is home to many famous museums, theaters, and restaurants. The city is also known for its cuisine, including French cuisine and international cuisine. Paris is a city of contrasts,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some potential future trends include:

1. Increased use of AI in healthcare: AI is already being used to improve patient outcomes in fields such as diagnosis, treatment planning, and patient monitoring. As AI technology continues to improve, we may see even more widespread use of AI in healthcare, with the potential to revolutionize the way we treat and diagnose diseases.

2. Increased use of AI in manufacturing: AI is already being used to optimize production processes and improve quality control in industries such as automotive, aerospace, and electronics. As AI technology
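
The "overlap removal" mentioned in the cell above can be pictured with a small sketch. It assumes that consecutive streamed chunks may repeat text that has already been received, and it keeps only the new suffix of each chunk. The helper names `strip_overlap` and `merge_stream` are hypothetical and this is not SGLang's actual implementation of `stream_and_merge`. It is a plain-Python illustration of the idea, and it needs no engine.

```python
def strip_overlap(existing: str, new_chunk: str) -> str:
    """Return the part of new_chunk that is not already at the end of existing."""
    max_len = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found, keep the whole chunk


def merge_stream(chunks):
    """Concatenate chunks while dropping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hypothetical chunks whose boundaries overlap.
chunks = ["The capital", " capital of France", " France is Paris."]
print(merge_stream(chunks))
```

One design caveat of this approach: text that legitimately repeats across a chunk boundary (for example `"ha"` followed by `"ha"`) would also be treated as overlap, so the sketch fits streams that resend earlier text better than streams that send pure increments.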



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm a [insert your age, profession, or title here] with a passion for [insert your area of expertise or personal interest here]. I thrive on creating and delivering engaging content, whether it's through writing, photography, or video. I'm always looking for opportunities to make a positive impact, and I'm eager to learn new skills and grow in my field. Thank you for considering me for a job! Let's do this! Hey there, [The name of the reader]. I'm [insert your name here] and I'm a [insert your age, profession, or title here]. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a bustling metropolis with a rich history, renowned for its architecture, cuisine, and cultural attractions. It serves as the capital of France, the largest metropolitan area in Europe, and a symbol of the French R

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Occupation] who is passionate about [Why you are passionate]. I'm a [How you got started] who now own [What your company does] and [What your company is about]. I love [What makes you unique], and I'm always looking for new ways to [What you are trying to achieve]. If you want to know more, I can talk about my [What you would like to know] about my [What you are passionate about].

If you're looking for a new career or job, I'd be happy to help you find the perfect opportunity for you. Here's



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

This statement encapsulates the core fact about France's capital city, providing a clear and concise overview of its position and importance within the broader context of the nation.

Additional context about Paris could include its iconic landmarks, cultural heritage, economic importance, or any other relevant points that would further elaborate on the capital's significance.

However, since the core statement is already provided, no additional information or elaboration is necessary for the task at hand. The statement directly addresses the factual core of Paris' position in France.

If you have any specific questions about Paris or need to expand on this statement, feel free to ask!



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by several key trends, each influencing its development and evolution. Here are some potential future trends:

1. Increased use of AI for autonomous vehicles: As autonomous driving technologies become more sophisticated, AI could be used in self-driving cars, trucks, and delivery vehicles. This could lead to a reduction in human errors, increased safety, and improved traffic flow.

2. AI for healthcare: AI could be used to improve the accuracy and speed of medical diagnosis and treatment. This could lead to more effective treatments for diseases like cancer and neurological disorders, and a reduction in the cost of healthcare.

3. AI for personalized medicine: With
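
The streaming cell above consumes one prompt at a time. As a rough sketch of another consumption pattern, several async streams can be collected concurrently with `asyncio.gather`. The example below uses a hypothetical stand-in generator, `fake_token_stream`, rather than the real engine, so `fake_token_stream`, `collect`, and `run_all` are illustrative names only. Whether this pattern improves throughput with the actual engine is not shown here.

```python
import asyncio


async def fake_token_stream(prompt):
    """Hypothetical stand-in for a streaming engine call: yields text pieces."""
    for piece in f"echo: {prompt}".split(" "):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield piece + " "


async def collect(prompt):
    """Consume one stream and return its full text."""
    parts = []
    async for piece in fake_token_stream(prompt):
        parts.append(piece)
    return "".join(parts).rstrip()


async def run_all(prompts):
    # gather() runs the collectors concurrently and preserves input order.
    return await asyncio.gather(*(collect(p) for p in prompts))


results = asyncio.run(run_all(["first prompt", "second prompt"]))
for text in results:
    print(text)
```

Because `asyncio.gather` returns results in the order of its arguments, each result can still be paired with its prompt by position, as in the non-streaming examples.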




In [6]:
llm.shutdown()