# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
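
The patch is needed because notebook kernels such as IPython already run an asyncio event loop, and the standard library refuses to start a second loop from inside a running one. The stdlib-only sketch below reproduces that failure without SGLang. It assumes the engine hits the same restriction when it drives its own loop; `inner` and `outer` are illustrative names, not part of any SGLang API.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # A loop is already running here, much like inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second, nested event loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` patches asyncio so that this kind of nested call is permitted.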

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Radouel, I have a professional background in military software and a love for all things travel.
I'm also a proud fan of the Telemark Rider and the Telemark Motorcycle brand. I'm eager to bring my experience and passion for motorcycle culture to the world. Let's make something great happen! How do I get started on the Telemark Rider journey?
Starting the Telemark Rider journey can be a fascinating and rewarding experience! Here are some steps you can take to get started:

1. Get familiar with the Telemark Rider brand: The Telemark Rider is a brand that has been around for over
Prompt: The president of the United States is
Generated text:  trying to become more environmentally conscious. He decides to lead a campaign for a new car that uses a low-emission engine. He sets a goal to reduce the carbon footprint by 20% by 2025.

The campaign involves the president selling cars at a price point to his target audience, which is expected to be around 

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. 

This statement is factually correct and provides a clear and concise overview of the capital city's location and significance in French culture and politics. However, it could be expanded to include additional information about Paris's historical and cultural importance, such as its status as the world's most populous city and its role as a major cultural and economic center. For example, the statement could be expanded to: "Paris, the capital of France, is the world's most populous city, with a population of over 20 million people, making it the largest city in the world by both population and area. The city is also home to numerous

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that could be expected in the future:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a growing emphasis on ethical considerations. This could include issues such as bias, transparency, accountability, and privacy.

2. Advancements in machine learning and deep learning: As AI technology continues to advance, we are likely to see more sophisticated models that can learn from large amounts of data and make more accurate predictions.

3. Integration with other technologies
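
The "overlap removal" mentioned in the streaming cell can be pictured as follows. This is a minimal sketch, not necessarily how `stream_and_merge` is implemented: it assumes each streamed chunk may repeat some text that has already been received, and `drop_overlap` and `merge_chunks` are hypothetical helper names introduced only for illustration.

```python
def drop_overlap(existing: str, chunk: str) -> str:
    """Return `chunk` without the longest prefix that `existing` already ends with."""
    longest = min(len(existing), len(chunk))
    for size in range(longest, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, skipping text repeated at each chunk boundary."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats the tail of the previous one.
chunks = ["The capital", "capital of France", "France is Paris."]
print(merge_chunks(chunks))
```

One trade-off of this approach is that a chunk which legitimately begins with text matching the end of the output so far is trimmed as well.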



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [major/majority] student. I am passionate about [interest], and I love [reason for liking the subject]. I have an [occupation/graduate degree], and I enjoy [excuse for being here] through [excuse for being here]. I am [age] years old, and I currently live in [city/region]. If you could give me any advice on how to overcome [strength], I would greatly appreciate it. [Name] [Age] [Interests] [Experiences] [Future Goals] [Personalities] [Unique Traits] [Adaptable Skills

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located on the Île de la Cite in the southern part of the country, and is the most populous city in the country with an estimated population of over 2.7 million people.

The city is known for its rich history, beautiful architecture, and vibrant culture. It is a popular tour
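
One reason to prefer the asynchronous call is that, in principle, the caller's event loop stays free for other coroutines while generation is in flight. The sketch below illustrates this with a stand-in class. `StubEngine` is hypothetical and only mimics the shape of the call used above (a list of prompts in, a list of dicts with a `text` key out), so it runs without a GPU or SGLang installed. `heartbeat` and `run_demo` are likewise illustrative names.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; it only echoes the prompts."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend the model is busy
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def heartbeat(log):
    # Unrelated work that shares the event loop with generation.
    for i in range(3):
        log.append(f"tick {i}")
        await asyncio.sleep(0)


async def run_demo():
    engine = StubEngine()
    log = []
    prompts = ["Hello, my name is", "The capital of France is"]
    # gather() runs both coroutines concurrently on the same loop.
    outputs, _ = await asyncio.gather(
        engine.async_generate(prompts, {"temperature": 0.8, "top_p": 0.95}),
        heartbeat(log),
    )
    return outputs, log


outputs, log = asyncio.run(run_demo())
for out in outputs:
    print(out["text"])
print(log)
```

With a real engine, the same pattern would let a custom server keep accepting requests while earlier ones are still being generated.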

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Sarah and I am a 30-year-old marketing professional. I have a passion for data analysis and have been using various tools to help me make informed decisions. I love being able to strategize and plan, and I have a natural ability to connect with people and build relationships. I am always looking for new challenges and opportunities to grow and learn. Thanks for considering my application! I'm excited to meet you! Sarah. I'll make sure to explain my background and experience in the response. Hi there! Thank you for taking the time to meet me. I'm Sarah, a 30-year-old marketing professional with a passion



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Facts about Paris:
- Located on the Seine river
- Famous for its historic center
- Known for its museums, art galleries, and iconic landmarks such as Notre-Dame Cathedral
- Home to the Eiffel Tower and many other notable structures
- Known for its sophisticated and refined lifestyle
- Often referred to as "The City of Light" due to its bohemian atmosphere and vibrant nightlife

Today's Paris:
Paris is one of the world's most popular tourist destinations and continues to attract millions of visitors every year. The city is known for its romantic ambiance, authentic French culture, and diverse



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and evolving at an incredible pace. Here are some of the possible future trends in artificial intelligence:

1. Increased precision and accuracy: As AI continues to learn and improve, it is becoming more precise and accurate. This means that AI systems can make more accurate predictions, diagnoses, and decisions in various industries, such as healthcare, finance, and transportation.

2. Integration with human AI: AI is already being integrated with human AI systems, such as Siri, Alexa, and Google Assistant. In the future, we may see even more integration, with AI systems becoming more integrated with human AI systems, leading to more seamless interactions.

3.




In [6]:
llm.shutdown()