# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
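
To illustrate the underlying problem, here is a minimal standard-library sketch that does not use SGLang at all. It shows that plain `asyncio` refuses to start a second event loop while one is already running, which is the situation inside IPython. The function names `inner` and `outer` are hypothetical and exist only for this illustration.

```python
import asyncio


async def inner():
    # Placeholder coroutine; stands in for any work that needs an event loop.
    return "done"


async def outer():
    # We are already inside a running loop here, similar to an IPython cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches `asyncio` so that this kind of nested call is permitted instead of raising.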

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Daniel. I'm an avid blogger and a bit of a social media guru. I get a lot of requests for how to grow my own plants. This week, I thought I'd do a little research on how to do it!

The first step on how to start your own garden is to decide where you want to plant your garden. Think of a natural or semi-natural area in your backyard or around your house. Look for a spot where your plants have access to plenty of sunlight, water, and soil. If you have a small yard, you can make a small garden in your backyard. If you're fortunate enough to have
Prompt: The president of the United States is
Generated text:  trying to decide how many military personnel to raise and train. He knows that the number of people who want to join the military increases by 5% each year. If he raised a certain number of people in the first year, how many people will he have to raise and train in the third year to have enough people for the military?
To determine the numbe
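
The `sampling_params` dictionary above sets `temperature` and `top_p` without explaining them. The sketch below shows the general textbook technique on a toy distribution: temperature rescales the logits before the softmax, and top-p (nucleus) filtering keeps only the most likely tokens whose cumulative probability reaches `top_p`. This is an illustration of the general idea, not SGLang's sampler; the helper names and the toy logits are assumptions made for this example.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest high-probability set with mass >= top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens
probs = apply_temperature(toy_logits, temperature=0.8)
nucleus = top_p_filter(probs, top_p=0.95)
print(probs)
print(nucleus)
```

Lower temperatures sharpen the distribution toward the top token, and lower `top_p` values cut off more of the unlikely tail.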

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Age] year old [Gender] [Occupation]. I'm a [Skill] [Ability] who has always been [Positive Traits] in my heart. I'm always ready to help others and always strive to improve myself. I'm a [Positive Traits] person who always puts others before myself. I'm a [Positive Traits] person who always puts others before myself. I'm a [Positive Traits] person who always puts others before myself. I'm a [Positive Traits] person who always puts others before myself. I'm a [Positive Traits] person who always puts others before myself

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. 

A. True
B. False
A. True

Paris is the capital of France and is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. It is also a major cultural and economic center, hosting numerous world-renowned museums, theaters, and art galleries. Paris is a popular tourist destination and is known for its rich history, art, and cuisine. The city is also home to the French Parliament and the French Parliament building. 

B. False is incorrect because Paris is indeed the capital of France, and it is a major cultural and economic center in

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some possible future trends include:

1. Increased use of AI in healthcare: AI is already being used to improve patient outcomes, reduce costs, and increase efficiency in healthcare. As AI technology continues to improve, we can expect to see even more applications in healthcare, such as personalized medicine, disease diagnosis, and drug discovery.

2. AI in manufacturing: AI is already being used to optimize production processes, reduce costs, and improve quality. As AI technology continues to improve, we can expect to see even more applications in manufacturing, such
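
The cell above relies on `stream_and_merge` to merge streamed chunks "with overlap removal". As a rough sketch of what overlap-aware merging can look like, the code below drops from each new chunk the longest prefix that repeats the end of the text collected so far. The helpers `strip_overlap` and `merge_chunks` and the sample chunks are hypothetical; this is not a copy of the SGLang utility, and the real function may behave differently.

```python
def strip_overlap(existing_text, new_chunk):
    """Remove the longest prefix of new_chunk that is also a suffix of existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that repeats the end of the merged result."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hypothetical chunks whose boundaries overlap.
chunks = [" Paris is", " is the capital", " capital of France."]
merged = merge_chunks(chunks)
print(merged)
```

One trade-off of this approach is that a chunk which legitimately begins with the same characters that end the previous text would also be trimmed, so it only fits streams where such overlap is known to be duplication.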



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [Type of Person, e.g., [Male, Female, etc.], [Age, etc.]]. My interests are [List your hobbies, interests, or skills]. I am a [Type of Person, e.g., [Sophisticated, Fresh, etc.]]. I have [Number of experience levels, e.g., [Beginner, Intermediate, Advanced, etc.]]. My [Type of Character, e.g., [Lonely, Happy, etc.]] is [Type of Character, e.g., [Lonely, Happy, etc.]]. I enjoy [List activities or hobbies

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in France and the seat of the government, administration, and culture in France. The city has a rich cultural heritage and is famous for its architecture, cuisine, and fashion. It is also known for its annual shopping and tourism festivals. Paris has a population of over 2.1 million people and is the capital of Fra
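
Because `async_generate` is awaited, it can in principle be combined with other coroutines. The sketch below shows the pattern with `asyncio.gather` and a stand-in class, so that it runs without a GPU or model. `StubEngine` is hypothetical and mimics only the call shape used above (a list of prompts in, a list of dicts with a `text` key out); how a real engine behaves under concurrent calls is not covered in this document.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; returns canned text."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # yield to the event loop, as real I/O would
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def generate_groups(engine, groups, sampling_params):
    # Start one request per group and wait for all of them.
    # asyncio.gather returns results in the order the awaitables were passed.
    tasks = [engine.async_generate(group, sampling_params) for group in groups]
    return await asyncio.gather(*tasks)


groups = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(
    generate_groups(StubEngine(), groups, {"temperature": 0.8, "top_p": 0.95})
)
for group, outputs in zip(groups, results):
    for prompt, output in zip(group, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```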

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name] and I am a [Job Title] in [Company Name]. I am here to [Objective of Your Job], which is [Your Objective]. I enjoy [Your Hobby/Interest]. I have [Number of Years of Experience], and I am always looking for opportunities to [Achievement/Improvement]. I value [Favourite Hobby/Interest], and I believe that [Your Character Trait] can help me achieve my goals. Thank you for considering me for a job. How can I get to know you better? It would be great to have more information on your background and experience. What kind of challenges or opportunities



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris, the heart of France, is a historic and vibrant city known for its stunning architecture, rich culture, and diverse food scene. It is home to iconic landmarks such as Notre-Dame Cathedral, Eiffel Tower, and the Louvre Museum, as well as a thriving arts and entertainment scene. The city is also the seat of the French government, representing France's influence in global affairs. Paris is a popular tourist destination, with millions of visitors annually, making it a must-visit destination for anyone interested in France. With its charming streets, chic cafes, and lively nightlife, Paris is a city that truly reflects the



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  rapidly evolving and there is no clear direction in which it will proceed. However, several trends are emerging that could impact the field in significant ways:

1. Increased emphasis on ethical considerations: As AI becomes more prevalent in our daily lives, there will be a greater emphasis on ensuring that its development is ethical and aligned with human values.

2. Integration of AI with human emotions and emotions: AI is already capable of processing and understanding emotions, and it is becoming increasingly important to integrate AI with human emotions in order to provide more accurate and empathetic responses.

3. Increase in the use of AI for healthcare: With the growing availability of data on




In [6]:
llm.shutdown()
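
In a script, an exception raised during generation would skip a trailing `shutdown()` call. One way to guard against that is a small context manager that always calls `shutdown()` on exit. The sketch below uses a hypothetical `StubEngine` in place of the real engine so it can run anywhere; the `managed` helper is likewise an assumption for illustration, not part of SGLang.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing only the shutdown() method used above."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def managed(engine):
    """Yield the engine and call shutdown() even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed(engine) as llm:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.closed)
```

With a real engine, the body of the `with` block would hold the `generate` calls shown earlier.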