# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
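
The reason for this patch can be illustrated with the standard library alone. The sketch below is not SGLang code: `inner` and `outer` are made-up names, and it only shows that `asyncio.run` refuses to start while another event loop is already running in the same thread, which is the situation `nest_asyncio.apply()` is meant to work around.

```python
import asyncio


async def inner():
    # Hypothetical coroutine standing in for any async call.
    return "done"


async def outer():
    # We are already inside a running event loop here, so a nested
    # asyncio.run() is rejected by the standard library.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return None


message = asyncio.run(outer())
print(message)
```

In a plain Python script, where no loop is running yet, the patch is normally not needed.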

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Michael. I am a student at the University of Michigan. I have a degree in Psychology from the University of Michigan. After graduation, I have worked as a professional counselor, first in the United States and then in the United Kingdom. During my time as a professional counselor, I have helped many people. I have worked with people who were troubled with problems such as depression, anxiety, and other mental illnesses. One of my most memorable clients was a man who had been struggling with the negative impact of his severe depression. As a counselor, I began to work with him to help him to cope with his depression. I gave him a good
Prompt: The president of the United States is
Generated text:  a post continuously vacated. It is essential to have a permanent president who can continue to hold office. The appointment of the person to be president is an important point of contention. Currently, the president is appointed by the president of the
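
The cells in this document pass `temperature` and `top_p` in `sampling_params`. As a rough illustration of what these two settings conventionally mean (temperature scaling followed by nucleus, or top-p, filtering), here is a small pure-Python sketch. The function name `top_p_filter` and the toy logits are invented for this example; it is not SGLang's sampler and makes no claim about how the engine implements sampling internally.

```python
import math


def top_p_filter(logits, temperature, top_p):
    """Return {token_index: probability} after temperature scaling and top-p filtering."""
    # Temperature rescales logits: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
wide = top_p_filter(toy_logits, temperature=1.0, top_p=0.8)
narrow = top_p_filter(toy_logits, temperature=1.0, top_p=0.5)
print(sorted(wide), sorted(narrow))
```

With these toy numbers, the larger `top_p` keeps two candidate tokens and the smaller one keeps only the most likely token.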

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Age] year old [Occupation]. I'm a [Type of Character] who has always been [Positive Traits]. I'm [Positive Traits] and I'm [Positive Traits]. I'm [Positive Traits] and I'm [Positive Traits]. I'm [Positive Traits] and I'm [Positive Traits]. I'm [Positive Traits] and I'm [Positive Traits]. I'm [Positive Traits] and I'm [Positive Traits]. I'm [Positive Traits] and I'm [Positive Traits]. I'm [Positive Traits] and I'm [Positive Traits]. I'm [Positive Traits

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French National Library. Paris is a bustling city with a rich cultural heritage and is a popular tourist destination. The city is also known for its fashion industry, with many famous fashion houses and boutiques located in the city. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly. It is a city that is both beautiful and exciting, and is a must-visit destination for anyone interested in

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI systems will become more integrated with human intelligence, allowing them to learn and adapt to new situations. This will enable AI to perform tasks that are currently beyond the capabilities of humans, such as playing chess or driving a car.

2. Enhanced machine learning: AI will become more capable of learning from data and making more accurate predictions and decisions. This will enable AI to perform tasks that were previously impossible, such as diagnosing diseases or predicting weather patterns.

3. Increased use of AI in healthcare: AI will be used to improve the
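
The cell above describes its streaming as using "overlap removal". One way such merging could work is sketched below in plain Python: each new chunk is trimmed by the longest prefix that repeats the end of the text collected so far. The names `trim_overlap_sketch` and `merge_chunks` and the sample chunks are hypothetical; the actual `stream_and_merge` helper in `sglang.utils` may differ.

```python
def trim_overlap_sketch(existing, new_chunk):
    """Drop the longest prefix of new_chunk that duplicates a suffix of existing."""
    max_len = min(len(existing), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found


def merge_chunks(chunks):
    """Concatenate streamed chunks while skipping repeated text."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


# Made-up chunks whose beginnings repeat part of the previous text.
sample_chunks = [" Paris", " Paris, which", " which is known"]
merged_text = merge_chunks(sample_chunks)
print(merged_text)
```

A caveat of this approach is that a chunk that legitimately starts by repeating the previous text would also be trimmed.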



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Career/Position] at [Company Name]. I enjoy [Occupation/Interest] and have a passion for [Professional Goal/Interest]. In my free time, I enjoy [Physical Activity/Social Interaction/Relaxation]. I am always looking for new experiences to broaden my horizons and learn new things. What would you like me to know about you? [Name] is friendly, adventurous, and always eager to learn and try new things. I enjoy working with others and have a strong sense of teamwork. I am a great communicator and always make sure to listen actively when others are speaking

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

Please note that the capital of France is often referred to as "Paris" even though it is also the capital of the Department of Paris and the French Region of the same name. 
Choose your a
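
An asynchronous call is mainly useful when several requests are in flight at once. The sketch below shows that pattern with `asyncio.gather`, using a stand-in class so that it runs without a GPU or a model. `StubEngine` and `run_batches` are hypothetical; only the method name `async_generate` and the `output["text"]` field are taken from the cell above.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for an engine; NOT the SGLang implementation."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do inference
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Start all batches first, then wait for them together.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    results = await asyncio.gather(*tasks)  # results keep the order of `batches`
    return [output["text"] for batch in results for output in batch]


texts = asyncio.run(
    run_batches(StubEngine(), [["a", "b"], ["c"]], {"temperature": 0.8})
)
print(texts)
```

Because `asyncio.gather` returns results in submission order, the outputs can still be zipped with the prompts that produced them.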

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], a [Your Profession] at [Your Company]. I am excited to meet you and discuss the importance of [Your Profession]. I'm a [Your Skills/Experience] with [Your Profession] and I'm always looking to learn and grow. What brings you here today? How can I help you today? What brings you here today? My name is [Your Name], a [Your Profession] at [Your Company]. I am excited to meet you and discuss the importance of [Your Profession]. I'm a [Your Skills/Experience] with [Your Profession] and I'm always looking to learn and grow.



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in Europe, the 14th largest city in the world, and the largest city in Western Europe. Paris has a population of over 2 million people. It is home to the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and numerous other attractions. Paris is known for its rich history, art, culture, and cuisine. It has been a major hub of European affairs and diplomacy since its founding in 792 AD. The city is a popular tourist destination and is home to many world-renowned museums, landmarks, and festivals. Its reputation as a cosm



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain and can be influenced by a variety of factors such as technological advancements, societal changes, and public perception. Some potential future trends in AI include:

1. Increased focus on ethics and AI responsible design: As AI becomes more sophisticated, it is essential to consider the ethical implications of its development. This includes issues such as bias, transparency, and accountability. There is a growing recognition of the need for responsible design of AI, with more focus on creating systems that can be understood and trusted by humans.

2. Development of new forms of AI: As AI continues to evolve, we may see new forms of AI that are more advanced,
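
The streaming cell above prints only "cleaned" chunks so that no text is repeated. A minimal sketch of that consumption pattern follows, under the assumption that the underlying stream yields the cumulative text so far in a `"text"` field. `fake_stream` and `stream_deltas` are hypothetical names for this example and are not part of SGLang.

```python
import asyncio


async def fake_stream(pieces):
    """Hypothetical stream that yields the cumulative text after each piece."""
    text = ""
    for piece in pieces:
        text += piece
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield {"text": text}


async def stream_deltas(stream):
    """Yield only the part of each chunk that has not been seen yet."""
    seen = ""
    async for chunk in stream:
        full = chunk["text"]
        # If the chunk extends what we have, emit just the new tail.
        delta = full[len(seen):] if full.startswith(seen) else full
        seen = full
        if delta:
            yield delta


async def collect():
    return [d async for d in stream_deltas(fake_stream([" Paris", ".", " It"]))]


deltas = asyncio.run(collect())
print("".join(deltas))
```

Printing each delta with `end=""` and `flush=True`, as the cell above does, then shows the text growing without repeats.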




In [6]:
llm.shutdown()