# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a separate server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation
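
The four modes differ along two axes: blocking versus awaitable, and whole result versus incremental chunks. The sketch below illustrates those four call shapes in plain Python. All four `fake_*` functions are hypothetical stand-ins written for this illustration and are not part of SGLang; the actual engine calls appear in the sections that follow.

```python
import asyncio


def fake_generate(prompts):
    """Non-streaming, synchronous: returns every result at once."""
    return [{"text": f"<completion for {p!r}>"} for p in prompts]


def fake_generate_stream(prompt):
    """Streaming, synchronous: a regular generator that yields chunks."""
    for piece in ("chunk-1", "chunk-2"):
        yield {"text": piece}


async def fake_async_generate(prompts):
    """Non-streaming, asynchronous: a coroutine that must be awaited."""
    await asyncio.sleep(0)  # stand-in for waiting on the engine
    return fake_generate(prompts)


async def fake_async_generate_stream(prompt):
    """Streaming, asynchronous: an async generator used with `async for`."""
    for chunk in fake_generate_stream(prompt):
        await asyncio.sleep(0)
        yield chunk


async def demo_async():
    batch = await fake_async_generate(["a", "b"])
    streamed = [c["text"] async for c in fake_async_generate_stream("a")]
    return batch, streamed


sync_batch = fake_generate(["a", "b"])
sync_streamed = [c["text"] for c in fake_generate_stream("a")]
async_batch, async_streamed = asyncio.run(demo_async())
print(len(sync_batch), sync_streamed, len(async_batch), async_streamed)
```

The synchronous shapes suit scripts and batch jobs, while the asynchronous ones fit code that already runs inside an event loop.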

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
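
The reason for this patch is that standard `asyncio` refuses to start a second event loop from inside one that is already running, which is the situation in IPython and Jupyter. The following standard-library sketch reproduces that error without SGLang; `inner` and `outer` are illustrative names only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` patches `asyncio` so that this kind of nested call is permitted.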

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Stacey. My name is Stacey. I was born on the 13th of December 1990 in Londonderry. My name Stacey is my family name and I am a long-haired brown.
I am an E3 and a UK citizen. My school is Tenerife College in Limerick, I am currently studying a Bachelor of Education degree at Limerick University.
I have lived in Limerick for the last 2 years and have decided to study further in Spain and I am staying at a hotel in Madrid.
I am interested in learning Spanish and would like to make friends with people
Prompt: The president of the United States is
Generated text:  visiting a country where the people use the British currency, the pound. During a visit, the president exchanged $2000$ British pounds for $3000$ US dollars. If the president's exchange rate for pounds to US dollars is $2$ pounds to $3$ US dollars, how many pounds does the president need to exchange to get $4000$ US dollars?

To determine how many pounds the president needs to exchange t

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [occupation] with [number of years] years of experience in [field]. I am a [type of person] and I am always [positive trait]. I am [type of person] and I am always [positive trait]. I am [type of person] and I am always [positive trait]. I am [type of person] and I am always [positive trait]. I am [type of person] and I am always [positive trait]. I am [type of person] and I am always [positive trait]. I am [type of person] and I am always [positive trait

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic Eiffel Tower, Notre-Dame Cathedral, and diverse cultural scene. It is also the birthplace of the French Revolution and the current capital of France. Paris is a bustling metropolis with a rich history and a vibrant cultural scene. It is the largest city in France and a major economic and political center. The city is home to many famous landmarks and attractions, including the Louvre Museum, the Notre-Dame Cathedral, and the Champs-Élysées. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly. The city is also known for its cuisine, including

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies are expected to continue to improve and become more integrated into our daily lives, from self-driving cars to personalized healthcare and financial services. Additionally, there is a growing interest in developing AI that can learn and adapt to new situations, rather than simply following pre-programmed instructions. This could lead to more complex and sophisticated AI systems that can solve complex problems and make decisions that are difficult for humans to solve. Finally, there is also a growing concern about the ethical and social implications of AI, and how it will be used and
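
"Overlap removal" here means that when consecutive streamed chunks repeat some of the text already received, the repeated part is dropped before the chunks are concatenated. The sketch below shows one way such merging could work. `trim_overlap` and `merge_chunks` are illustrative names, and this is an assumption about the general technique rather than a description of how `stream_and_merge` is implemented.

```python
def trim_overlap(existing: str, chunk: str) -> str:
    """Return the part of `chunk` that does not repeat the tail of `existing`."""
    max_len = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that overlaps the merged tail."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# The second chunk repeats "capital", which is already at the end of the first.
chunks = ["The capital", "capital of France", " is Paris"]
merged = merge_chunks(chunks)
print(merged)
```

A purely textual overlap check like this one can also drop text that is legitimately repeated (for example `"ha"` followed by `"ha"`), so it only suits streams whose chunks are known to overlap.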



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [Job Title] with [Number of Years Experience]. I enjoy working with people and can communicate in various languages fluently. I am a [Favorite) Activity) and [Pet]) person. My strong suits are my problem-solving skills, creativity, and adaptability. I am always looking for ways to improve and enhance my skills, and I am constantly learning new things.
My journey to becoming a [Job Title] has been filled with challenges, but I am overcoming those obstacles every day. I am confident that I am equipped to make a positive impact in the world, and I am eager to share

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, often referred to as the "City of Light" due to its vibrant culture and modern architecture. It is located in the northwestern region of France and has a population of over 14 m
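
One reason to prefer the asynchronous call is that other coroutines can keep running while generation is pending. The sketch below illustrates this with the standard library only. `fake_async_generate` is a hypothetical stand-in that sleeps instead of running a model, and `heartbeat` represents any unrelated work sharing the event loop.

```python
import asyncio


async def fake_async_generate(prompts):
    # Hypothetical stand-in for an awaitable batch call; no model is involved.
    await asyncio.sleep(0.05)
    return [{"text": f"<completion for {p!r}>"} for p in prompts]


async def heartbeat(events):
    # Unrelated work that makes progress while generation is awaited.
    for i in range(3):
        events.append(f"tick {i}")
        await asyncio.sleep(0.01)


async def main():
    events = []
    # gather() runs both coroutines concurrently on the same event loop.
    outputs, _ = await asyncio.gather(
        fake_async_generate(["Hello", "Bonjour"]),
        heartbeat(events),
    )
    return outputs, events


outputs, events = asyncio.run(main())
print(len(outputs), events)
```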

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am a [insert job title] with over [insert number of years of experience] years of experience in [insert relevant field]. I am a dedicated [insert relevant skill or experience] who loves [insert one or two hobbies or passions]. I am a [insert age range] year old person with [insert a personality trait] personality type. I am a [insert a superpower or ability] who enjoy [insert one or two activities]. I am an [insert a personal characteristic] who is always [insert a trait or quality]. My background is [insert a reason for your background]. I am a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris, also known as the City of Light, is the largest city in Europe by population, and the 12th largest by area. The city is located on the banks of the Seine River, on the north bank of the Bourse de Paris, the oldest river in Europe. It is home to the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and other significant landmarks in the city. The climate of Paris is warm and humid throughout the year, with a moderate temperature and plenty of sunshine. The city is known for its rich history, art, and culture, and its many museums

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and diverse, with many possible trends that could shape the field in the coming years. Here are some of the most promising and likely to impact the technology in the coming years:

1. Increased transparency: With the increasing demand for AI systems to be explainable and transparent, there's a potential for AI to become more human-like and less machine-like. As AI becomes more sophisticated, it will need to be more accountable for its actions and more transparent about its decisions.

2. AI in healthcare: AI can be used to analyze medical images, predict patient outcomes, and even assist in the diagnosis of diseases. The potential impact of AI in


In [6]:
llm.shutdown()
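
In a standalone script, as opposed to a notebook, it can be worth making sure that `shutdown()` runs even when generation raises an exception. The sketch below shows the `try`/`finally` pattern with a stand-in class. `StubEngine` is hypothetical and only mimics the two methods used in this document; it is not SGLang's `Engine`.

```python
class StubEngine:
    """Hypothetical stand-in exposing only generate() and shutdown()."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params=None):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": "<completion>"} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


engine = StubEngine()
error = None
try:
    engine.generate([])  # fails on purpose to exercise the cleanup path
except ValueError as exc:
    error = str(exc)
finally:
    # Runs whether or not generate() raised.
    engine.shutdown()

print(engine.is_shut_down, error)
```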