# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example in the form of a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Mako and I’m a 25-year-old data scientist, self-taught. I am the owner and the sole author of my own data science and business services company, Mako LLC. I’ve been helping people in my free time through my 10th book, which was written in 2019, called “Maximizing My Target Audience’s Potential.” I’ve been learning and improving my skills as a data scientist by following the advice of Dr. Lawrence Taylor, who is an academic, author, and professor in the human sciences. My goal is to help people in their personal and professional lives,
Prompt: The president of the United States is
Generated text:  invited to a dinner at the White House. As he leaves, he sees a sign that says: "If you are invited to dinner, you are welcome at the White House, but please do not use any words that would insult anyone or demean any group of people." Upon leaving, he sees a sign that says: "Any person invited to dinner is welcome to use whatever language they choose
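
For offline batch jobs it is often convenient to keep each prompt next to its completion. The sketch below assumes only what the cell above shows: `generate` takes a list of prompts plus a sampling-parameter dict and returns one dict per prompt with a `"text"` key. `collect_results`, `to_jsonl`, and `FakeEngine` are hypothetical names; `FakeEngine` is a stand-in so the example runs without a GPU, and in practice you would pass the `llm` object instead.

```python
import json
from typing import Any, Dict, List


class FakeEngine:
    """Hypothetical stand-in that mimics the call shape of llm.generate."""

    def generate(self, prompts: List[str], sampling_params: Dict[str, Any]):
        # Return one dict per prompt, each with a "text" key.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def collect_results(engine, prompts: List[str], sampling_params: Dict[str, Any]):
    """Run one batch and pair each prompt with its generated text."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        raise ValueError("expected one output per prompt")
    return [
        {"prompt": prompt, "text": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


def to_jsonl(records: List[Dict[str, str]]) -> str:
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


records = collect_results(
    FakeEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
print(to_jsonl(records))
```

Keeping the records as plain dicts makes it easy to write them to a file or load them into a dataframe later.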

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm passionate about [job title] and [job title]. I enjoy [job title] because [reason why you enjoy it]. I'm always looking for new challenges and opportunities to grow and learn. What's your favorite hobby or activity? I'm always looking for new challenges and opportunities to grow and learn. What's your favorite book or movie? I'm always looking for new challenges and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is the largest city in France and the third-largest city in the European Union. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also famous for its rich history, including the French Revolution and the French Revolution Museum. Paris is a cultural and artistic center, with many museums, theaters, and art galleries. It is also a major transportation hub, with the Eiffel Tower serving as a symbol of the city. The city is home to many international organizations and events, including the World Cup

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn and adapt to human behavior and preferences. This could lead to more personalized and context-aware AI systems that can better understand and respond to human needs.

2. Enhanced ethical considerations: As AI becomes more integrated with human intelligence, there will be increased scrutiny of its ethical implications. This could lead to more stringent regulations and guidelines to ensure that AI systems are developed and used in a responsible and ethical manner.

3. Greater reliance on AI for decision-making: AI is likely to become more integrated
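
The "overlap removal" mentioned in the cell above matters when successive streamed chunks can repeat text that was already emitted. The following is a minimal sketch of one way to do this, not necessarily how `stream_and_merge` is implemented: it drops the longest prefix of the new chunk that duplicates the end of the text so far. `strip_overlap` and `merge_chunks` are hypothetical names.

```python
from typing import Iterable


def strip_overlap(existing: str, chunk: str) -> str:
    """Remove the longest prefix of `chunk` that is also a suffix of `existing`."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    # No overlap found: keep the chunk unchanged.
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, skipping text that the previous chunks already cover."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


print(merge_chunks(["The capital", " capital of France", " is Paris."]))
```

A purely textual check like this cannot tell a genuine repetition from an accidental overlap, so `"ha"` followed by `"ha"` collapses to `"ha"`. That trade-off is acceptable only when chunks are known to overlap.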



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a friendly, kind-hearted person who loves to share my knowledge and passion for nature with anyone who wants to learn. I'm passionate about hiking, camping, and bird watching, and I have a natural talent for deciphering the clues in nature, which I use to help people solve puzzles and puzzles solve me. I'm also very good at problem-solving, and I'm always ready to help people make sense of their problems. I enjoy sharing my love for nature with people who are interested in it, and I'm always eager to learn and learn more. I believe in the power of nature to bring people together

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

A. True
B. False
B. False
The capital of France is Paris. While it is an important city in France, it is not its capital. The capital of France is indeed Paris, t
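
An asynchronous call becomes useful when several independent requests should be in flight at once, for example inside a custom server. The sketch below assumes that `async_generate` also accepts a single prompt string and then returns a single dict with a `"text"` key; treat that as an assumption and check it against your SGLang version. `FakeAsyncEngine` and `generate_concurrently` are hypothetical names, and the fake engine exists only so the example runs without a GPU.

```python
import asyncio
from typing import Any, Dict, List


class FakeAsyncEngine:
    """Hypothetical stand-in that mimics awaiting llm.async_generate."""

    async def async_generate(self, prompt: str, sampling_params: Dict[str, Any]):
        await asyncio.sleep(0)  # yield control, as a real engine call would
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(
    engine, prompts: List[str], sampling_params: Dict[str, Any], max_in_flight: int = 2
) -> List[str]:
    """Issue one request per prompt, keeping at most `max_in_flight` active."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt: str) -> str:
        async with semaphore:
            output = await engine.async_generate(prompt, sampling_params)
            return output["text"]

    # gather returns results in the same order as the prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


texts = asyncio.run(
    generate_concurrently(
        FakeAsyncEngine(),
        ["Hello, my name is", "The capital of France is", "The future of AI is"],
        {"temperature": 0.8, "top_p": 0.95},
    )
)
print(texts)
```

The semaphore is optional. It is included here to show how a caller can bound the number of outstanding requests independently of the engine's own scheduling.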

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [profession or field] with [number of years] years of experience in the field. I am a [generational] generation, and [name the generational group]. I have a [number of] degrees, [number of] certifications, and [number of] professional memberships. What do you do? Let me know what you think! [Name] [self-introduction]
Hello, my name is [Name] and I am a [profession or field] with [number of years] years of experience in the field. I am a [generational] generation, and [name the



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the northeastern region of the country.
- Paris, known as "la Ville Fluviale," is the largest and most populous city in France and one of the largest in the world.
- The city has a population of approximately 2.1 million people, making it the third most populous city in the European Union and the second most populous city in the world.
- Paris is also the seat of the French government, the most populous city in the world, and the largest city in the world by area.
- The city is home to numerous cultural and artistic institutions, including the Louvre Museum, the Centre Pompid



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and rapidly evolving, with many potential trends shaping its direction. Here are some possible trends:

1. Increased automation and artificial general intelligence: As AI continues to improve and become more capable, we may see more automation and AI that can perform tasks that were previously done by humans, such as driving cars, making decisions, and even performing some jobs. This will likely lead to a shift away from humans to machines and systems that can perform these tasks more efficiently and accurately than humans.

2. AI ethics and privacy concerns: As AI becomes more integrated into our lives, there will likely be increased scrutiny of how it is developed and used. There



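If the full text is needed in addition to live printing, the chunks can be accumulated while they are printed. This sketch works with any async iterator of strings; `fake_stream` is a hypothetical stand-in for `async_stream_and_merge(llm, prompt, sampling_params)`, which the cell above iterates in the same way, and `print_and_collect` is likewise a hypothetical helper.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream() -> AsyncIterator[str]:
    """Hypothetical stand-in yielding already de-duplicated text chunks."""
    for chunk in [" Paris", ",", " the", " capital", "."]:
        await asyncio.sleep(0)
        yield chunk


async def print_and_collect(chunks: AsyncIterator[str]) -> str:
    """Print each chunk as it arrives and return the concatenated text."""
    parts: List[str] = []
    async for chunk in chunks:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = asyncio.run(print_and_collect(fake_stream()))
```

Collecting into a list and joining once at the end avoids repeated string concatenation for long generations.
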

In [6]:
llm.shutdown()
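
In a script, it can help to guarantee that `shutdown()` runs even when generation raises an exception. Below is a minimal sketch using `contextlib`; `managed_engine` and `FakeEngine` are hypothetical names, and with SGLang you would pass a factory such as `lambda: sgl.Engine(model_path=...)` instead of the stand-in class.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


class FakeEngine:
    """Hypothetical stand-in exposing only the shutdown() method used above."""

    def __init__(self) -> None:
        self.closed = False

    def shutdown(self) -> None:
        self.closed = True


@contextmanager
def managed_engine(factory: Callable[[], FakeEngine]) -> Iterator[FakeEngine]:
    """Create an engine and always call shutdown() when the block exits."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


try:
    with managed_engine(FakeEngine) as engine:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.closed)
```

The `finally` clause runs on both normal exit and error, so the engine is released in either case.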