# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example as a Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
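To make the idea concrete, here is a minimal sketch of a custom server loop built on asyncio streams. It assumes only that the engine exposes `async_generate(prompt, sampling_params)` returning a dict with a `"text"` key, as the examples below use it. `StubEngine`, `handle_connection`, `demo`, and the one-JSON-object-per-line protocol are hypothetical and made up for this illustration; the linked `custom_server.py` remains the reference. Replacing `StubEngine()` with an `sgl.Engine(...)` instance would, under that assumption, serve a real model.

```python
import asyncio
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine: only async_generate() is mimicked."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0)  # yield control, as real generation would
        return {"text": f" <completion for {prompt!r}>"}


async def handle_connection(engine, reader, writer):
    # Made-up protocol: one JSON object per line,
    # {"prompt": str, "sampling_params": dict}, answered with {"text": str}.
    request = json.loads(await reader.readline())
    output = await engine.async_generate(
        request["prompt"], request.get("sampling_params", {})
    )
    writer.write((json.dumps({"text": output["text"]}) + "\n").encode())
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def demo(prompt):
    engine = StubEngine()
    server = await asyncio.start_server(
        lambda r, w: handle_connection(engine, r, w), host="127.0.0.1", port=0
    )
    port = server.sockets[0].getsockname()[1]  # OS-assigned free port

    # Act as a client: send one request and read the JSON reply.
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write((json.dumps({"prompt": prompt}) + "\n").encode())
    await writer.drain()
    reply = json.loads(await reader.readline())
    writer.close()
    await writer.wait_closed()

    server.close()
    await server.wait_closed()
    return reply


print(asyncio.run(demo("The capital of France is")))
```

Because each connection awaits `async_generate` instead of blocking, one event loop can keep many client requests in flight while the engine does the scheduling.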



## Nest Asyncio
If you use the **Offline Engine** in IPython, Jupyter, or any other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
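Why this is needed: the asynchronous examples below call `asyncio.run(main())`, and a Jupyter kernel already runs an event loop. `asyncio.run` refuses to start inside a running loop and raises a `RuntimeError`; `nest_asyncio.apply()` patches asyncio so that nested call is allowed. The stdlib-only snippet below reproduces the error; `notebook_cell` is a hypothetical name for the coroutine that plays the role of the already-running notebook loop.

```python
import asyncio


async def main():
    return "done"


async def notebook_cell():
    # Jupyter runs cells inside an already-running event loop; this coroutine plays that role.
    coro = main()
    try:
        asyncio.run(coro)  # nested call, like asyncio.run(main()) in the examples below
    except RuntimeError as exc:
        coro.close()  # the coroutine never started; close it to avoid a "never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(notebook_cell())
print(message)
```

In IPython you can also avoid the nested call by writing `await main()` directly at the top level of a cell instead of `asyncio.run(main())`.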

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```text
Prompt: Hello, my name is
Generated text:  Anna. I live in a small village by the lake and I love to do things that I find interesting. One of the most exciting things I like to do is to make my own ice cream, or ice cream sandwich. I love to experiment with different types of ingredients and to enjoy it with friends and family.
Anna, you seem like a very nice person, but I'm wondering if you have a favorite ice cream flavor that you don't like. Can you tell me a bit about your favorite flavor? To make it more interesting, could you also describe the history of that flavor? And lastly, could you share a bit about
Prompt: The president of the United States is
Generated text:  a member of the executive branch of the federal government. The US president is the head of the executive branch and has the authority to issue executive orders, propose legislation, and create regulations.
The president's role is limited to only what?
a. drafting laws and negotiating treaties
b. appointing judges 
```
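The examples set `temperature` and `top_p` in `sampling_params`. To illustrate what these two knobs do conceptually, here is a textbook sketch of temperature scaling followed by nucleus (top-p) filtering over a toy next-token distribution. It is not SGLang's sampling implementation; the function name and the logits are made up for illustration.

```python
import math


def temperature_top_p(logits, temperature, top_p):
    """Turn {token: logit} into a renormalized {token: prob} after temperature and top-p.

    Textbook illustration only, not SGLang's sampler. Assumes temperature > 0.
    """
    if temperature <= 0:
        raise ValueError("this sketch assumes temperature > 0")

    # 1. Temperature: divide logits by T, then softmax (subtract the max for stability).
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}

    # 2. Top-p: keep the most likely tokens until their cumulative mass reaches top_p.
    kept, mass = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        mass += p
        if mass >= top_p:
            break

    # 3. Renormalize the surviving tokens so their probabilities sum to 1.
    norm = sum(kept.values())
    return {tok: p / norm for tok, p in kept.items()}


# Made-up next-token logits for "The capital of France is"
logits = {"Paris": 5.0, "Lyon": 2.0, "Rome": 1.0, "Berlin": 0.5}
print(temperature_top_p(logits, temperature=0.8, top_p=0.95))
print(temperature_top_p(logits, temperature=2.0, top_p=0.9))
```

With these toy logits, `temperature=0.8, top_p=0.95` leaves only `"Paris"` (it already holds about 97% of the mass), while `temperature=2.0` flattens the distribution enough that three tokens survive `top_p=0.9`. Lower temperature and lower `top_p` both make sampling more deterministic; the `temperature` of 0.2 in the streaming example below pushes in that direction.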

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```text
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your character, such as "funny, witty, and always up for a good laugh"]. I enjoy [insert a short description of your character's interests, such as "reading, cooking, and playing sports"]. I'm always looking for new experiences and challenges, and I'm always eager to learn and grow. What's your favorite hobby or activity? I love [insert a short description of your favorite hobby or

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light. It is the largest city in France and the second-largest city in the European Union. Paris is known for its rich history, beautiful architecture, and vibrant culture. It is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. Paris is also a popular tourist destination, with millions of visitors each year. The city is known for its fashion industry, art scene, and food culture. It is a major hub for international business and trade. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by several key trends:

1. Increased automation: AI is expected to become more prevalent in manufacturing, transportation, and other industries, where it can perform tasks that are currently done by humans. This could lead to the widespread adoption of automation in various sectors.

2. Enhanced human-computer interaction: AI is likely to become more integrated into our daily lives, allowing humans to interact with machines in a more natural and intuitive way. This could lead to the development of new forms of AI that are more human-like, such as "chatbots" that can understand and respond to human questions.

3. Improved privacy and security:
```
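The example relies on `stream_and_merge` from `sglang.utils` to remove text that overlaps between streamed chunks. As an illustration of how such overlap removal can work, the sketch below drops the longest prefix of each new chunk that already appears at the end of the accumulated text. This is an assumption about the general approach, not a copy of the helper's source; `trim_overlap` and `merge_stream` are names chosen for this sketch.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return the part of new_chunk that is not already at the end of existing_text."""
    # Try the longest possible overlap first and shrink until a match is found.
    for size in range(min(len(existing_text), len(new_chunk)), 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap: keep the whole chunk


def merge_stream(chunks):
    """Concatenate streamed chunks, dropping text repeated across chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(merge_stream(["Hello", "lo, wor", "world!"]))
print(merge_stream([" Paris", ".", " Paris", " is"]))
```

One design caveat: this heuristic cannot tell an overlap from a genuine repetition, so a chunk that legitimately repeats the tail of the text (for example `"ha"` after `"ha"`) is swallowed.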



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```text
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [First name] and I'm an [Briefly describe your role or profession in a clear and concise way]. I am [First name] because [Short answer to your role or profession]. Any questions on my skills or experiences? Is there anything specific that would interest me about me? [Last name]. My name is [Last name] and I am a [Briefly describe your role or profession in a clear and concise way]. I am [Last name] because [Short answer to your role or profession]. Any questions on my skills or experiences? Is there anything specific that would interest me about me? [First name] and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the Parisian region of the Île de France.

What is the capital city of France, and where is it located? Paris is the capital city of France. It is located in the Parisian region of 
```
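Passing the whole prompt list to `async_generate`, as above, hands the engine one batch. When requests instead arrive one at a time (for example, from separate clients of a custom server), a common asyncio pattern is to issue one `async_generate` call per request and await them together with `asyncio.gather`. The sketch uses a hypothetical `StubEngine` in place of `sgl.Engine` so it runs without a GPU, and it assumes a single-prompt call returns one dict with a `"text"` key.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; only mimics async_generate(prompt, sampling_params)."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate generation latency
        return {"text": f" <completion for {prompt!r}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    # One request per prompt, all awaited together; gather keeps results in input order.
    requests = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*requests)


prompts = ["The capital of France is", "The future of AI is"]
outputs = asyncio.run(generate_concurrently(StubEngine(), prompts, {"temperature": 0.8}))
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```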

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```text
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a creative writer. I am currently working on my first novel and have always been fascinated by the world of fantasy and the stories it inspires. I am always looking for ways to expand my knowledge and stay up-to-date with the latest trends in the industry. I also enjoy sharing my writing process with others and am always eager to learn and improve. Overall, I am a creative, ambitious writer who is always seeking new ways to express myself and explore new ideas. What is your favorite genre or type of writing? As a creative writer, I have a particular love for horror and suspense, as these genres are both

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is the largest city and the second most populous city in the European Union, and one of the largest cities in the world. The city is located on the right bank of the Seine River and covers an area of 323 square kilometers, the largest in the world. It is the political, economic, cultural, and historical center of France. It is also the center of the French public life, and is known as the “city of love” due to its famous "sawtooth" shape, which symbolizes the love and vitality of Paris. Paris is a popular destination for tourists and locals alike and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by a combination of technological advancements, social and political changes, and individual innovations. Here are some potential trends that may emerge in the near and long-term:

1. Deep learning and artificial general intelligence: As computing power continues to increase, it is becoming possible to build artificial neural networks that can perform complex tasks like speech recognition, image processing, and natural language understanding. This is likely to lead to the development of algorithms that can understand and generate human-like intelligence in areas such as creativity, decision-making, and social interaction.

2. Ethics and governance of AI: As AI technology becomes more advanced, it is likely to require
```



```python
llm.shutdown()
```
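In a standalone script, it helps to guarantee that `shutdown()` runs even when generation raises. A minimal sketch, assuming only that the engine has a `shutdown()` method as used above: the hypothetical `engine_session` context manager wraps creation and shutdown in `try`/`finally`, and `StubEngine` stands in for `sgl.Engine` so the snippet runs anywhere.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(factory):
    """Create an engine with factory() and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()  # runs on normal exit and when the body raises


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records shutdown calls."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


stub = StubEngine()
try:
    with engine_session(lambda: stub) as llm:
        raise RuntimeError("generation failed")  # simulate an error mid-run
except RuntimeError:
    pass
print(stub.closed)
```

With the real engine, the same helper would be used as `with engine_session(lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")) as llm:` followed by the generation calls.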