# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two common use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
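
As a rough illustration of that pattern, the sketch below wraps an engine-like object in a request handler built on Python's standard `http.server`. `EchoEngine`, `handle_generate`, `make_handler`, `serve`, and the JSON field names `prompt` and `sampling_params` are hypothetical names chosen for this sketch; the linked `custom_server` script may be structured quite differently. `EchoEngine` only mimics the `generate(prompts, sampling_params)` call shape used later in this document, so the example runs without a model.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoEngine:
    """Hypothetical stand-in for an engine: one dict with a "text" key per prompt."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" (echo of: {p})"} for p in prompts]


def handle_generate(engine, body: bytes):
    """Turn a JSON request body into an (HTTP status, JSON response body) pair."""
    try:
        payload = json.loads(body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected JSON with a 'prompt' field"}
        return 400, json.dumps(error).encode()
    sampling_params = payload.get("sampling_params", {})
    outputs = engine.generate([prompt], sampling_params)
    return 200, json.dumps({"text": outputs[0]["text"]}).encode()


def make_handler(engine):
    """Build a request handler class bound to one shared engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            status, response = handle_generate(engine, self.rfile.read(length))
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(response)))
            self.end_headers()
            self.wfile.write(response)

    return GenerateHandler


def serve(engine, port=8000):
    """Serve until interrupted; pass a real engine instead of EchoEngine in practice."""
    HTTPServer(("127.0.0.1", port), make_handler(engine)).serve_forever()
```

Keeping the request logic in a plain function such as `handle_generate` makes it easy to test without opening a socket, and the same function could be reused behind a different web framework.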



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
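
The reason is that IPython and Jupyter already run an event loop, and standard `asyncio` refuses to start a second one inside it. The following sketch reproduces that failure using only the standard library when run as a plain script; `noop` and `nested_call` are names made up for this illustration. `nest_asyncio.apply()` patches `asyncio` so that such nested calls are permitted.

```python
import asyncio


async def noop():
    return "done"


async def nested_call():
    """Try to start a second event loop while one is already running."""
    coro = noop()
    try:
        # Without nest_asyncio, this raises because a loop is already running.
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return exc


result = asyncio.run(nested_call())
print(type(result).__name__, "-", result)
```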

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Alvin. I'm a bit of a mystery solver. I love solving tricky puzzles that I call "world problems" and I'm always looking for ways to explain the world to my friends.
I love the way that puzzles can teach us something important about the world. I love the idea that puzzles can help us learn about people, places and cultures all around the world.
Puzzles, by using language and logic, can teach us many things about the world. For example, to solve the puzzle of the "Shape of the Moon" I learned that the Moon's shape is called an oblate spheroid, meaning that it
Prompt: The president of the United States is
Generated text:  paid  Pls. 4,200,000 annually. The president receives 5% of the money as a salary and has to pay 20% of the amount as taxes. How much money will the president have in her pocket after 6 years of receiving the salary? To determine how much money the president will have in her pocket after 6 years of receiving the salary, we need 
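
For offline batch jobs, the results are usually written to disk rather than printed. The sketch below pairs each prompt with its generated text and serializes the records as JSON Lines. It relies only on the `text` key shown above; `to_jsonl` is a hypothetical helper, the `completion` field name is an arbitrary choice, and `fake_outputs` stands in for the list returned by `llm.generate` so the example runs without a model.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize (prompt, generated text) pairs, one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    return "".join(
        json.dumps({"prompt": p, "completion": o["text"]}, ensure_ascii=False) + "\n"
        for p, o in zip(prompts, outputs)
    )


# Stand-in for `llm.generate(prompts, sampling_params)`: one dict per prompt.
fake_outputs = [{"text": " Paris."}, {"text": " bright."}]
jsonl = to_jsonl(["The capital of France is", "The future of AI is"], fake_outputs)
print(jsonl, end="")
```

The length check guards against silently misaligned records, which `zip` alone would hide by truncating to the shorter list.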

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a brief description of your character or personality]. I enjoy [insert a short description of your hobbies or interests]. I'm always looking for new experiences and learning new things. What's your favorite hobby or activity? I love [insert a short description of your favorite activity or hobby]. I'm always looking for new challenges and opportunities to grow and learn. What's your favorite book or movie? I love [insert a short description of your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history dating back to the Roman Empire and the Middle Ages. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also famous for its fashion industry, art, and cuisine. Paris is a popular tourist destination and a cultural hub for France. It is home to many famous landmarks and museums, including the Louvre, the Musée d'Orsay, and the Musée Rodin. The city is also known for its annual festivals and events, including the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a growing emphasis on ethical considerations. This will include issues such as bias, transparency, and accountability.

2. Integration with other technologies: AI will continue to be integrated with other technologies such as blockchain, IoT, and quantum computing. This will create new opportunities for AI to be used in new and innovative ways.

3. Development of new AI models: AI models will continue to evolve
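
The "overlap removal" in the banner refers to how streamed chunks are merged: if a chunk repeats text that has already been received, the repeated prefix is dropped before appending. The sketch below shows one way to do this in plain Python. It illustrates the idea and is not necessarily how `stream_and_merge` is implemented; `trim_overlap`, `merge_chunks`, and the sample chunks are assumptions made for this example.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Each chunk here repeats the tail of the previous one.
chunks = [" Paris is", " is the capital", " capital of France."]
print(merge_chunks(chunks))
```

Note that this heuristic cannot tell an accidental overlap from a legitimate repetition: merging `"ha"` and `"ha"` yields `"ha"`, not `"haha"`. It is only appropriate when chunks are known to overlap.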



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm a [Your Profession] who has been in the [Your Role] field for [X] years. I'm currently a [Your Profession] who has been in the [Your Role] field for [X] years. I bring a unique blend of [Your Unique Skillset] to the table and enjoy [Your Passion or Hobby]. I'm passionate about [Your Passion] and strive to make a positive impact on the world. I'm a [Your Type] who is always ready to learn and improve, and I'm always looking for new opportunities to grow and evolve as a [Your Characteristic

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the most populous city in the country, with over 2.5 million inhabitants according to the 2020 census. Paris is known for its iconic landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city is also known for its
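
Asynchronous generation is most useful when several requests are in flight at once. As a sketch of that pattern, the code below issues one request per prompt with `asyncio.gather` and caps concurrency with a semaphore. `fake_async_generate` is a hypothetical stand-in for an awaitable call such as `llm.async_generate`, and `generate_all` and `max_concurrency` are names invented here. Whether client-side limiting helps at all depends on the workload, since the engine does its own scheduling.

```python
import asyncio


async def fake_async_generate(prompt, sampling_params):
    """Stand-in for an engine call: waits briefly, then returns a dict with "text"."""
    await asyncio.sleep(0.01)
    return {"text": f" <completion for {prompt!r}>"}


async def generate_all(prompts, sampling_params, max_concurrency=2):
    """Run one request per prompt, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(prompt):
        async with semaphore:
            return await fake_async_generate(prompt, sampling_params)

    # gather preserves input order, so outputs line up with prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
outputs = asyncio.run(generate_all(prompts, {"temperature": 0.8, "top_p": 0.95}))
for prompt, output in zip(prompts, outputs):
    print(f"{prompt} ->{output['text']}")
```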

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am a [Your Profession] with [Your Education] degree. I am passionate about [Your Area of Expertise]. I believe in [Your Philosophy or Values]. What is your role in the team?

[Your Name]

[Your Profession] with [Your Education] degree.

I am passionate about [Your Area of Expertise]. I believe in [Your Philosophy or Values]. What is your role in the team? Let me know if you would like me to help you find the right words to write an introduction for a fictional character! #self-introduction #fictional-character #innovation #teamwork



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is a famous city known for its historic architecture, vibrant culture, and annual festivals such as the Carnival. It is the largest and most populous city in France and plays an important role in French culture and politics. Paris is home to many renowned museums, art galleries, and theaters, and is also known for its cuisine and dining scene. The city is also home to the Louvre Museum, a UNESCO World Heritage site, and the Notre-Dame Cathedral, a significant part of French architecture. Paris is a cultural and historical capital of France, and its importance has continued to grow over time. Paris is often referred to as the



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  set to be a vast and unpredictable one, with several potential trends shaping the technology's trajectory. Some of the potential trends include:

1. Increased integration with human intelligence: AI is already becoming more integrated with human intelligence, with machines able to learn from and interact with humans in new ways. This integration is likely to continue, with machines becoming more capable of empathetic and adaptive interactions with humans.

2. Emergence of ethical considerations: As AI becomes more integrated with human intelligence, there will be a growing need for ethical considerations. This will include questions about the nature of human intelligence, the potential impact of AI on society, and the role
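
When consuming a stream directly, it helps to know whether each chunk carries only new text or the full text generated so far. The sketch below assumes the cumulative case and converts it to incremental deltas inside an async generator. `fake_stream` is a hypothetical stand-in for a streaming call (it is not the SGLang API), and `stream_deltas` and `collect` are names invented for this example; check the actual chunk format of your SGLang version before relying on this.

```python
import asyncio


async def fake_stream(pieces):
    """Stand-in stream: yields dicts whose "text" holds the full text so far."""
    text = ""
    for piece in pieces:
        text += piece
        await asyncio.sleep(0)  # hand control back to the event loop
        yield {"text": text}


async def stream_deltas(stream):
    """Yield only the newly added suffix of each cumulative chunk."""
    seen = 0
    async for chunk in stream:
        full_text = chunk["text"]
        delta = full_text[seen:]
        seen = len(full_text)
        if delta:
            yield delta


async def collect():
    """Print deltas as they arrive and return them for inspection."""
    deltas = []
    async for delta in stream_deltas(fake_stream([" Paris", ".", " It", " is"])):
        print(delta, end="", flush=True)
        deltas.append(delta)
    print()
    return deltas


deltas = asyncio.run(collect())
```

Tracking only the length already seen is cheaper than overlap matching, but it is correct only if every chunk really extends the previous one.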




In [6]:
llm.shutdown()