# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you use the **Offline Engine** in IPython, Jupyter, or any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
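To illustrate why the patch is needed, here is a minimal sketch that uses only the standard library (no SGLang involved). It reproduces the error raised when `asyncio.run` is called while an event loop is already running, which is the situation inside a notebook cell. The coroutine names `inner` and `outer` are illustrative.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # A loop is already running here, much like inside a notebook cell.
    coro = inner()
    try:
        # Starting a second loop from inside a running one is rejected.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio.apply()` in place, the nested call is allowed, which is why the notebook cells in this document can call `asyncio.run(main())` directly.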

## Advanced Usage

The engine also supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) and [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Ria. I am a 17-year-old girl who is really good at swimming. As a student in middle school, I got a swimming team. I was really good at swimming. I learned swimming very well. Now, I want to write a report on the topic "The Importance of Swimming in Swimming Teams". What should I include in the report?
I would be glad to help. Please provide more information about the topic and the writing style. How can I begin my report and what should I include in the body of the report? What are the suggested paragraphs and the possible headings for the report? What about the conclusion section
Prompt: The president of the United States is
Generated text:  a man. The president of the United States is an elected official. The president of the United States serves a term of four years. The president of the United States must be at least 35 years old. The president of the United States must be a citizen of the United States.
The president of the United States
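In a batch job the results are usually stored rather than printed. The following is a minimal sketch of pairing prompts with outputs and serializing them as JSON Lines. It relies only on what the cell above shows, namely that each output is a dictionary with a `"text"` key. `StubEngine` and `to_records` are hypothetical names introduced for illustration; the stub stands in for `sgl.Engine` so the snippet runs without a GPU.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that returns canned text."""

    def generate(self, prompts, sampling_params):
        # Same output shape as in the cell above: one dict with "text" per prompt.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def to_records(prompts, outputs):
    """Pair each prompt with its generated text."""
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


engine = StubEngine()
prompts = ["Hello, my name is", "The capital of France is"]
outputs = engine.generate(prompts, {"temperature": 0.8, "top_p": 0.95})

records = to_records(prompts, outputs)
jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```

Replacing `StubEngine()` with the real `llm` object should work as long as the outputs keep the `"text"` key shown above.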

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your interests and what you're looking for in a job. Let's chat! [Name] [Job Title] [Company Name] [Company Address] [City, State, ZIP Code] [Phone Number] [Email Address] [LinkedIn Profile] [Twitter Profile] [Facebook Profile] [Instagram Profile] [GitHub Profile] [LinkedIn Profile] [Twitter Profile] [Facebook Profile] [Instagram Profile] [GitHub Profile] [LinkedIn Profile] [Twitter Profile] [Facebook Profile] [Instagram Profile

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light, and is located in the south of the country. It is the largest city in France and the third-largest city in the world by population. Paris is known for its rich history, art, and culture, and is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The 

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more sophisticated, it is likely to become more integrated with human intelligence, allowing it to learn and adapt to new situations. This could lead to more complex and nuanced AI systems that can better understand and respond to human emotions and behaviors.

2. Enhanced privacy and security: As AI becomes more prevalent in our daily lives, there will be a growing need for privacy and security measures to protect the data and personal information that is generated and stored by AI systems. This could lead to more stringent privacy regulations and increased investment in security technologies.

3. Greater
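The cell above mentions "overlap removal". As a rough illustration of the idea, and not necessarily how `stream_and_merge` is implemented in SGLang, the sketch below merges streamed chunks by dropping any prefix of a new chunk that repeats the end of the text collected so far. `trim_overlap` and `merge_chunks` are hypothetical helper names.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that repeats the end of existing."""
    max_len = min(len(existing), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, removing overlapping text between neighbours."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged_text = merge_chunks(["Hello, my", " my name", "name is"])
print(merged_text)
```

One limitation of this simple approach is that a legitimate repetition at a chunk boundary (for example the word "very" followed by a chunk starting with "very") would also be trimmed.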



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [occupation] with a passion for [reason for passion], [for example: writing, music, history, etc.]. I enjoy learning new things, exploring the world, and having fun. My love for learning and discovery fuels my creativity and drives me to create content that is both educational and engaging. I am always willing to learn and grow, and I am always looking for new ways to connect with my audience and inspire them with my unique voice and style. I am looking forward to meeting you. 🌟✨ #SelfIntro #Interests #CreativePerson #FictionalCharacter

Hey

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral.
Paris, the capital of France, is renowned for its iconic Eiffel Tower, the Louvre Museum, and the stunning 
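Because `async_generate` is awaited, it can be combined with other coroutines. The sketch below shows the general `asyncio.gather` pattern for awaiting several batches at once. `StubAsyncEngine` and `run_batches` are hypothetical names; the stub imitates only the call shape used above and simulates latency with `asyncio.sleep`. Whether issuing concurrent calls improves throughput with the real engine is an assumption this document does not confirm.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in that mimics the async_generate call shape."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulated generation latency
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Start one coroutine per batch and wait for all of them;
    # gather returns the results in the same order as the inputs.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(
    run_batches(StubAsyncEngine(), batches, {"temperature": 0.8, "top_p": 0.95})
)

for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```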

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [Age] year old aspiring [Professional] who is currently [Your current role]. I'm passionate about [what I love doing]. I have a sense of humor and enjoy socializing, especially with people who are like-minded. I'm a [favorite hobby] that I've been into since I was a child. I'm always ready to learn something new and interested in what makes people happy. What's your name, and how do you usually get your inspiration for your work? My inspiration comes from [what inspires me the most]. Do you have a particular style or tone in your writing, and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is home to the Eiffel Tower and the Louvre Museum. It is also known as "The City of Light" due to its historic and vibrant skyline and vibrant nightlife. France's largest city is also home to the Notre-Dame Cathedral and the Champs-Élysées. It's the seat of government, industry, and culture in France and is a major tourist destination. It is the world's fifth-largest city by population and the most populous city in Europe. Located on the Atlantic coast, it's the third most populous city in the world, with over a billion people. The city's rich history dates


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and involves many different trends and technologies. Here are some possible future trends in AI:

1. Increased Integration: AI will become even more integrated into our lives. We will be able to use AI-powered assistants like Siri or Alexa to help us with tasks like scheduling appointments, setting reminders, and managing our finances. In the future, we will also be able to use AI to assist with decision-making, such as identifying risks or making informed choices.

2. AI will become more personal: AI will become more personalized, and we will be able to build AI assistants that are tailored to our specific needs and preferences. This will involve training AI
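The streaming cell prints each chunk as soon as it arrives. If the full text is also needed afterwards, the chunks can be collected while they are printed. The sketch below shows that pattern with a plain async generator. `fake_stream` and `collect` are hypothetical names; `fake_stream` only stands in for `async_stream_and_merge(llm, prompt, sampling_params)` so the snippet runs without a model.

```python
import asyncio


async def fake_stream(chunks):
    """Hypothetical stand-in for a streaming call: yields text chunks one by one."""
    for chunk in chunks:
        await asyncio.sleep(0)  # give control back to the event loop
        yield chunk


async def collect(stream):
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # newline after the stream ends
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream([" Paris", ",", " which", " is"])))
```

Collecting into a list and joining once at the end avoids repeated string concatenation for long generations.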




```python
llm.shutdown()
```
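In a script, an exception raised before `shutdown()` would skip the cleanup call. One way to guard against that, sketched here under the assumption that the engine object only needs its `shutdown()` method called, is a small context manager. `managed_engine` and `StubEngine` are hypothetical names; the stub replaces `sgl.Engine` so the snippet runs anywhere.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(factory):
    """Create an engine and call shutdown() even if the body raises."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


with managed_engine(StubEngine) as engine:
    outputs = engine.generate(["Hello, my name is"], {"temperature": 0.8})

print(engine.is_shut_down)
```

With the real engine, the factory could be a lambda such as `lambda: sgl.Engine(model_path=...)`, using the same constructor call as at the top of this document.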