# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
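
As a rough illustration of the custom-server use case, the sketch below wraps an engine-like object in a small request handler. `FakeEngine`, `handle_request`, and the payload keys are hypothetical names introduced here. `FakeEngine` only stands in for `sgl.Engine` so that the snippet runs without a GPU, and it assumes nothing more than an `async_generate(prompt, sampling_params)` coroutine returning a dict with a `"text"` key, as in the cells of this document. The linked `custom_server.py` remains the reference example.

```python
import asyncio
import json


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine (assumption: same async_generate shape)."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0)  # pretend to do some work
        return {"text": f" <completion for {prompt!r}>"}


async def handle_request(engine, raw_body):
    """Parse a JSON request body, call the engine, and return a JSON response."""
    try:
        payload = json.loads(raw_body)
        prompt = payload["prompt"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return json.dumps({"error": "body must be JSON with a 'prompt' field"})

    # Fall back to the sampling parameters used elsewhere in this document.
    sampling_params = payload.get(
        "sampling_params", {"temperature": 0.8, "top_p": 0.95}
    )
    output = await engine.async_generate(prompt, sampling_params)
    return json.dumps({"prompt": prompt, "text": output["text"]})


engine = FakeEngine()
ok = asyncio.run(handle_request(engine, '{"prompt": "The capital of France is"}'))
bad = asyncio.run(handle_request(engine, "not json"))
print(ok)
print(bad)
```

In a real server the handler would be registered with a web framework and the engine would be created once at startup, so that every request reuses the same loaded model.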



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
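
The underlying restriction comes from asyncio itself: `asyncio.run()` refuses to start while another event loop is already running in the same thread, which is the situation inside IPython and Jupyter. The standard-library-only sketch below reproduces that error without SGLang; `inner` and `outer` are illustrative names. `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, like code in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: asyncio rejects this by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```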

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Dina and I am from Australia. I live in the UK now. I like travelling. I always ask my friends for suggestions on where to go and what to see. That is what I like to do. Now I want to know about the climate of different places in the UK. I want to find out the climate and the temperature of different places in the UK so that I can plan where I should go. Would you please tell me some information about the UK? 1. What's the weather like in different places in the UK? 2. How do people like to live in different places in the UK? 3
Prompt: The president of the United States is
Generated text:  trying to decide how many military tanks to buy. He has two choices, tanks with 4 tanks and 2 tanks per aircraft carrier. The cost of a new aircraft carrier is $20 million. Each tank has an initial cost of $100,000. The president wants to maximize the total value of tanks he can purchase. How many tanks should he buy? To determine how many tanks the presiden
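
To make the two `sampling_params` entries used above more concrete, here is a conceptual, standard-library-only sketch of temperature scaling followed by top-p (nucleus) filtering. It is not SGLang's sampler: the function names and the toy logits are made up for illustration, and a real implementation operates on tensors over the full vocabulary.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a four-token vocabulary
probs = softmax_with_temperature(logits, temperature=0.8)
nucleus = top_p_filter(probs, top_p=0.95)
print(sorted(nucleus))
```

A lower temperature sharpens the distribution toward the highest-scoring token, and a lower `top_p` shrinks the candidate set, which is why the streaming example that follows uses smaller values to get more focused output.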

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a few key points about yourself, such as your age, gender, occupation, etc.]. And what's your favorite hobby or activity? I love [insert a few hobbies or activities you enjoy, such as reading, playing sports, or spending time with friends]. And what's your favorite book or movie? I love [insert a few favorite books or movies you've seen, such as [insert a few titles] or [insert

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic Eiffel Tower, Notre-Dame Cathedral, and diverse cultural scene. 

(Note: The statement should be a single, clear sentence that captures the essence of Paris's importance and cultural significance.) Paris is the capital city of France, renowned for its iconic Eiffel Tower, Notre-Dame Cathedral, and vibrant cultural scene. 

(Note: The statement should be a single, clear sentence that captures the essence of Paris's importance and cultural significance.) Paris is the capital city of France, known for its iconic Eiffel Tower, Notre-Dame Cathedral, and diverse cultural scene. 

(Note: The statement

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some possible future trends in AI:

1. Increased automation: AI is already being used in a wide range of industries, from manufacturing to healthcare to customer service. As automation continues to advance, we can expect to see even more widespread use of AI in various sectors.

2. Improved privacy and security: As AI systems become more sophisticated, there will be an increased risk of data breaches and other security issues. As a result, there will be a push for more robust privacy and security measures to protect against
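
The "overlap removal" mentioned in the cell above can be pictured as follows: when a new chunk begins with text that already ends the accumulated output, that repeated part is dropped before appending. The sketch below illustrates the idea under that assumption. `trim_overlap` and `merge_chunks` are hypothetical helpers written for this explanation, not the source of `stream_and_merge`, whose actual implementation may differ.

```python
def trim_overlap(existing, new_chunk):
    """Drop the longest prefix of new_chunk that already ends `existing`."""
    max_overlap = min(len(existing), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found: keep the chunk unchanged


def merge_chunks(chunks):
    """Concatenate streamed chunks while skipping repeated boundary text."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Toy chunks whose boundaries repeat a few characters.
chunks = [" Paris is", " is the capital", " capital of France."]
merged = merge_chunks(chunks)
print(repr(merged))
```

One caveat of this approach is that it cannot tell an accidental overlap from a genuine repetition in the generated text, so it is best suited to streams that are known to resend boundary text.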



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name] and I'm a [Your Profession]. I'm always up for a good laugh and always eager to learn new things. My background is in [Your Field], and I have experience [Examples of relevant experience and achievements]. I'm also a [Your personal trait or skill] and am always looking for ways to make my life more interesting and fulfilling. Thank you for considering me for a potential friend or colleague. Happy to meet you! How can I help you in your professional life? Please let me know if there are any specific topics or interests you would like to discuss. I'm here to make you feel comfortable and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known as the City of Light. 

The statement is accurate and succinctly describes the official title and historical significance of Paris. 

The answer is: **Par
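
The cell above passes the whole prompt list to a single `async_generate` call. Another common asyncio pattern is to issue one call per prompt and wait for all of them with `asyncio.gather`, for example when requests arrive independently. The sketch below shows that pattern with a hypothetical `FakeEngine` that stands in for `sgl.Engine` (assumption: an `async_generate(prompt, sampling_params)` coroutine returning a dict with a `"text"` key), so it runs without a model.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine that records request concurrency."""

    def __init__(self):
        self.in_flight = 0
        self.peak = 0

    async def async_generate(self, prompt, sampling_params):
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)
        await asyncio.sleep(0.01)  # simulated generation latency
        self.in_flight -= 1
        return {"text": f" <completion for {prompt!r}>"}


async def run_all(engine, prompts, sampling_params):
    # One coroutine per prompt; gather returns results in the input order.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
engine = FakeEngine()
outputs = asyncio.run(run_all(engine, prompts, {"temperature": 0.8, "top_p": 0.95}))

for prompt, output in zip(prompts, outputs):
    print(f"{prompt!r} -> {output['text']!r}")
print("peak concurrent requests:", engine.peak)
```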

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a [Age] year old [Gender] [Gender Identity] person. I'm from [City/Region] and I've always been an introvert who prefers to be alone, but I'm excited to meet you and let you know who I am. What makes you stand out to you?

Please provide me with more information on how you come to be an introvert, and any insights or advice you would give to someone who is trying to become more comfortable with their introverted nature. Additionally, please share any personal anecdotes or experiences that demonstrate your unique personality traits, as well as any strategies or tools

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

The complete answer would be: The capital of France is Paris, an ancient city located in the heart of the Paris Basin, situated on the banks of the River Seine, and served as the political and cultural center of France since the 12th century. Paris is home to the Notre Dame Cathedral, the Louvre Museum, the Eiffel Tower, and the Place de la Concorde, among other landmarks. The city has a rich history, including the Roman, Gothic, Baroque, and Modernist eras, and remains a vital hub of French culture, politics, and economy. Paris is known for

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to continue to evolve and transform in several key areas, including:

1. Autonomous and connected vehicles: AI is expected to have a significant impact on the development of autonomous vehicles, which are designed to operate on their own without the need for human drivers. This includes developing algorithms that can recognize traffic signs, navigate routes, and handle complex driving situations, potentially reducing the likelihood of accidents and improving overall traffic efficiency.

2. Smart homes: AI is expected to become more integrated into our homes, with technology such as smart thermostats, smart lighting, and smart security systems becoming increasingly prevalent. These systems can use AI to optimize energy consumption,


In [6]:
llm.shutdown()
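
In a script, an exception raised during inference would skip the `shutdown()` call above. One way to guard against that is a small context manager that always shuts the engine down. The sketch below is an assumption-laden illustration: `managed_engine` is a hypothetical helper, and `FakeEngine` stands in for `sgl.Engine` so that the snippet runs without a GPU.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine with the same shutdown() method name."""

    def __init__(self):
        self.closed = False

    def generate(self, prompt, sampling_params):
        return {"text": " <completion>"}

    def shutdown(self):
        self.closed = True


@contextmanager
def managed_engine(factory):
    """Create an engine and guarantee shutdown() runs, even if the body raises."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


holder = {}
try:
    with managed_engine(FakeEngine) as engine:
        holder["engine"] = engine
        engine.generate("Hello, my name is", {"temperature": 0.8, "top_p": 0.95})
        raise ValueError("simulated failure during inference")
except ValueError:
    pass

print("engine shut down:", holder["engine"].closed)
```

With a real engine, the factory could be something like `lambda: sgl.Engine(model_path=...)`, using whatever model path the script already relies on.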