# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
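
If the same script has to run both inside and outside a notebook, the patch can be applied only when a loop is already running. The following is a minimal sketch of that idea, not part of SGLang: the helper names `in_running_event_loop` and `apply_nest_asyncio_if_needed` are hypothetical, and it assumes the `nest_asyncio` package is installed whenever the patch is actually needed. It relies on `asyncio.get_running_loop()`, which raises `RuntimeError` when no loop is running in the current thread.

```python
import asyncio


def in_running_event_loop() -> bool:
    """Return True if the current thread is already executing inside an event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running, so asyncio.run() can be used directly.
        return False
    return True


def apply_nest_asyncio_if_needed() -> bool:
    """Patch asyncio for nested use only when a loop is already running.

    Hypothetical helper for illustration. Returns True if the patch was applied.
    """
    if not in_running_event_loop():
        return False
    import nest_asyncio  # third-party; imported lazily so plain scripts do not need it

    nest_asyncio.apply()
    return True
```

In a plain script, `apply_nest_asyncio_if_needed()` is expected to do nothing and return `False`; in environments that keep an event loop running, such as a Jupyter kernel, it would apply the patch.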

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

````text
Prompt: Hello, my name is
Generated text:  Lenny and I'm a student of programming at the university. I'm studying Data Structures and algorithms, and I'm just learning about stacks and queues. The real time problem that I'm trying to solve is to find the maximum value in a given array using recursion. Can you help me with the solution for this problem? Sure, I can help you with that! Here's a Python solution using recursion to find the maximum value in an array:

```
def max_in_array(arr):
    # Base case: if the array has only one element, that element is the maximum
    if len(arr) == 1:
        return
Prompt: The president of the United States is
Generated text:  selected randomly from 10 vice presidents. If the probability of selecting a vice president from each state is the same and the president of the United States is selected randomly, what is the probability of selecting a vice president from Alabama or Florida?
To determine the probability of selecting a vice president from A
````
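
For downstream processing it is often handier to keep each prompt together with its completion instead of printing them. The sketch below assumes only what the cell above shows: `generate` takes a list of prompts and returns one dictionary per prompt, in order, with a `"text"` key. The helper `collect_results` and the stand-in class `EchoEngine` are hypothetical and exist only so the example runs without a GPU.

```python
from typing import Any, Dict, List


def collect_results(
    engine: Any, prompts: List[str], sampling_params: Dict[str, Any]
) -> List[Dict[str, str]]:
    """Run one batched generate call and pair each prompt with its completion."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        raise ValueError("expected exactly one output per prompt")
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


class EchoEngine:
    """Stand-in used for illustration only; NOT an SGLang class."""

    def generate(self, prompts, sampling_params):
        # Mimic the assumed output shape: one dict with a "text" key per prompt.
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


records = collect_results(
    EchoEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
for record in records:
    print(record["prompt"], "->", record["completion"])
```

With the real engine, passing `llm` in place of `EchoEngine()` should work as long as the output shape matches the assumption above.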

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history dating back to the Roman Empire and the Middle Ages. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. The city is also famous for its fashion industry, with many famous fashion designers and boutiques located in the city. Paris is a bustling metropolis with a diverse population and a vibrant culture that attracts tourists from all over the world. It is a city of contrasts, with its modern architecture and historical landmarks blending together to create a unique and fascinating city. Paris is a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some potential future trends include:

1. Increased integration of AI into various industries: AI is already being used in a wide range of industries, from healthcare and finance to transportation and manufacturing. As AI becomes more integrated into these industries, we can expect to see even more applications of AI in various sectors.

2. Enhanced privacy and security concerns: As AI becomes more integrated into our daily lives, there will be increased concerns about privacy and security. This will likely lead to more regulations and standards being put in place to protect people's data
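
The "overlap removal" mentioned in the cell above can be pictured as follows. This is a sketch of the general idea only, under the assumption that a streamed chunk may repeat the tail of the text already received; it is not a statement of how `stream_and_merge` is implemented, and the names `strip_overlap` and `merge_chunks` are hypothetical.

```python
from typing import Iterable


def strip_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is also a suffix of existing."""
    max_len = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first and shrink until one matches.
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing text that repeats the end of what is already merged."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


merged = merge_chunks([" Paris", " Paris, also", " also known"])
print(repr(merged))  # ' Paris, also known'
```

Note that this heuristic cannot tell an accidental overlap from text the model genuinely repeats, so it may drop intended repetitions.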



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Career/Position] with [Number of Years in Industry]. I've been a [Number of Years in Industry] in [Position] for [Number of Years in Industry]. My goal is to stay current with [Industry/Market Trends] and [Job Title], and to seek opportunities for growth and advancement. I am a [Professional Value] in my field and a [Motivational Attribute] who is always looking for new ways to reach my full potential. I'm a [Positive Attitude] and I value relationships with people and a sense of humor. I am always open to new experiences

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the "City of Love". The city is located in the center of the country and is home to many of the country's most iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is als
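
When requests arrive independently rather than as one list, several `async_generate` calls can be awaited together. The sketch below is an illustration under assumptions: it presumes the engine tolerates concurrent `async_generate` calls and returns a list of dictionaries with a `"text"` key, as in the cell above. The helper `generate_concurrently`, its `max_in_flight` parameter, and the stand-in `FakeAsyncEngine` are hypothetical.

```python
import asyncio
from typing import Any, Dict, List


async def generate_concurrently(
    engine: Any,
    prompts: List[str],
    sampling_params: Dict[str, Any],
    max_in_flight: int = 2,
) -> List[str]:
    """Issue one async_generate call per prompt, capping how many run at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt: str) -> str:
        async with semaphore:
            # A one-element list keeps the call shape identical to the batch example.
            outputs = await engine.async_generate([prompt], sampling_params)
            return outputs[0]["text"]

    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(one(p) for p in prompts))


class FakeAsyncEngine:
    """Stand-in used for illustration only; NOT an SGLang class."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real engine call would
        return [{"text": p.upper()} for p in prompts]


texts = asyncio.run(
    generate_concurrently(FakeAsyncEngine(), ["a", "b", "c"], {"temperature": 0.8})
)
print(texts)  # ['A', 'B', 'C']
```

For a fixed list of prompts known up front, the single batched call shown above is simpler and leaves scheduling to the engine.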

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a professional investor with over [number] years of experience in [industry]. I'm passionate about investing and creating a sustainable financial future. My financial goals are clear: to reach $1 million by age 30 and to achieve a high degree of retirement income. What better way to achieve this than by investing in the stocks and bonds of your favorite companies? How do you spend your free time? I enjoy reading, going hiking, and spending time with my family. What hobbies do you have? I love to cook, bake, and travel. How does your work differ from other investors? I believe in



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light and the City of Fine Arts.
Paris is the most populous city in France and the third largest in the European Union. The city is located on the left bank of the Seine river and is known for its medieval architecture, classical art, and French cuisine. It is also home to numerous museums, theaters, and monuments, making it a popular tourist destination and a symbol of France's cultural heritage. Paris is known for its lively nightlife, stunning views of the city, and its role as a hub for international affairs and commerce. The city is home to over 10 million people and is



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to see a significant increase in the integration of AI into our daily lives, with the technology becoming more prevalent and widespread. One possible future trend is the emergence of AI-powered autonomous vehicles that can navigate and navigate safely in a variety of terrains and situations. This could lead to a reduction in accidents and a decrease in the use of fossil fuels, as vehicles powered by AI could be designed to be more efficient and reduce the amount of fuel needed for transportation. This could also lead to increased efficiency in manufacturing, logistics, and other industries as AI is increasingly being used to automate repetitive tasks and improve process efficiency.

Another trend is the development
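
The streamed text arrives as a sequence of small chunks. To keep the full string as well as print it, the chunks can be accumulated while iterating. The following sketch uses a hypothetical helper `collect_stream` and a stand-in async generator `fake_stream`; it assumes only that the stream yields plain strings, as the loop over `async_stream_and_merge` above suggests.

```python
import asyncio
from typing import AsyncIterator, List


async def collect_stream(chunks: AsyncIterator[str]) -> str:
    """Consume an async stream of text chunks and return the joined text."""
    parts: List[str] = []
    async for chunk in chunks:
        parts.append(chunk)  # a real loop could also print(chunk, end="", flush=True)
    return "".join(parts)


async def fake_stream() -> AsyncIterator[str]:
    """Stand-in for a streaming generation call; yields a few chunks."""
    for piece in [" Paris", ",", " also", " known"]:
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield piece


text = asyncio.run(collect_stream(fake_stream()))
print(repr(text))  # ' Paris, also known'
```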




```python
llm.shutdown()
```
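
In a script, an exception raised between engine creation and `shutdown()` would skip the cleanup. One way to guard against that is a small context manager. This is a sketch, not an SGLang feature: `managed_engine` and `FakeEngine` are hypothetical names, and the only assumption about the engine is that it exposes a `shutdown()` method as used above.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def managed_engine(factory: Callable[[], Any]) -> Iterator[Any]:
    """Create an engine via factory and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        # Runs on normal exit and when the body raises.
        engine.shutdown()


class FakeEngine:
    """Stand-in used for illustration only; NOT an SGLang class."""

    def __init__(self) -> None:
        self.closed = False

    def shutdown(self) -> None:
        self.closed = True


with managed_engine(FakeEngine) as engine:
    pass  # run generation calls here

print(engine.closed)  # True
```

With the real engine, the factory could be something like `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`.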