# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
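
To illustrate why a nested event loop is a problem, here is a minimal sketch that uses only the standard library and does not touch SGLang. The names `inner` and `outer` are hypothetical; `outer` stands in for code that already runs inside a loop, such as a notebook cell.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running loop here, like code in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # starting a second loop from inside a loop is rejected
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` patches asyncio so that this kind of re-entrant call is permitted.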

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  James and I am a 13 year old boy. I live in Boston, Massachusetts and I am also the owner of a pet rescue shelter. My best friend, Daniel, is 26 years old and has been my best friend for almost 50 years. He has always been there for me and helped me when I needed it. When Daniel was 23 years old, he got married and he took me with him to their wedding. Daniel and I are still inseparable and I look up to him as my friend. I recently heard that Daniel has been diagnosed with a type of cancer, he has been told that
Prompt: The president of the United States is
Generated text:  a very important person. They are the leader of the country. They are the most powerful person in the country. But how do you know who the president is? Sometimes people name the president someone else. Sometimes the president is not born when you were born. Sometimes the president is not even born! That's not nice. The president is a real person. That's what makes him or h
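
For offline batch jobs it is often convenient to persist prompt/output pairs rather than print them. The following is a minimal sketch, not part of SGLang: `StandInEngine` and `write_jsonl` are hypothetical names, and the stand-in only mimics the `output['text']` shape used above so the example runs without a GPU. In practice you would pass the real `llm` engine instead.

```python
import io
import json


class StandInEngine:
    """Hypothetical stand-in so the sketch runs without a GPU or SGLang."""

    def generate(self, prompts, sampling_params):
        # Mimic the shape used above: one dict with a "text" key per prompt.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def write_jsonl(prompts, outputs, stream):
    """Write one JSON object per prompt/output pair."""
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")


engine = StandInEngine()
batch = ["Hello, my name is", "The capital of France is"]
results = engine.generate(batch, {"temperature": 0.8, "top_p": 0.95})

buffer = io.StringIO()  # swap for open("results.jsonl", "w") to write a file
write_jsonl(batch, results, buffer)
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
```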

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [Age] year old [Occupation]. I am a [Skill] who has [Number of Years] years of experience in [Field]. I am [Gender] and I am [Race]. I am [Height] inches tall and [Weight] pounds. I have [Number of Children] children and I am [Gender] and [Race]. I am [Height] inches tall and [Weight] pounds. I have [Number of Children] children and I am [Gender] and [Race]. I am [Height] inches tall and [Weight] pounds. I have [Number of Children

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city that serves as the political, cultural, and economic center of the country. It is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also famous for its cuisine, fashion, and music, making it a popular tourist destination. The city is home to many world-renowned museums, including the Louvre, the Musée d'Orsay, and the Musée d'Art Moder

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies are expected to continue to improve and become more integrated into our daily lives, from self-driving cars and robots in factories to personalized medicine and virtual assistants.

One of the most exciting trends in AI is the increasing integration of AI into everyday life. This includes the use of AI in healthcare, where AI-powered diagnostic tools and treatment plans are becoming more common. AI is also being used to improve the efficiency and accuracy of transportation systems, such as autonomous vehicles and drones.

Another area where AI is likely to have a significant impact is
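
The "overlap removal" mentioned above can be pictured as follows. This document does not show how `stream_and_merge` works internally, so the sketch below is an assumption about the general technique, not the library's implementation: when a new chunk starts with text that already ends the accumulated output, only the unseen suffix is appended. The names `trim_overlap` and `merge_chunks` are used here for illustration only.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the end of existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while dropping overlapping prefixes."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged = merge_chunks([" Paris", " Paris, the", " the city"])
print(repr(merged))
```

One caveat of this purely textual approach: a chunk that legitimately repeats the preceding text (for example `"ha"` followed by `"ha"`) would also be trimmed.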



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  ____. I'm an ____. I come from ____. I'm a ____. I've always been ____. And my ____. I'm ____. I'm ____. 

Please don't use any profanity or inappropriate language. Also, please make sure to keep the introduction neutral and informative, avoiding any personal attacks or attacks on anyone or any group. Good luck with your self-introduction! Let's get started! 

[Your Name]  
[Your Profession]  
[Your Nationality]  
[Your Age]  
[Your Current Location]  
[Your Personal Trait or Skill]  
[Your Origin]  
[Your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the center of the country and known for its rich history, art, and culture. It serves as the political, economic, and cultural capital of the country, attracting visitors from around the world with its stunning architecture, vibrant nightlife
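
An asynchronous interface makes it possible to submit several batches and await them together. The sketch below shows that pattern with `asyncio.gather`. `StandInAsyncEngine` and `generate_batches` are hypothetical names, and the stand-in only imitates the `async_generate` call shape used above so the example runs without SGLang. Whether concurrent submissions help with the real engine is an assumption to verify for your workload.

```python
import asyncio


class StandInAsyncEngine:
    """Hypothetical stand-in so the sketch runs without a GPU or SGLang."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real awaitable call would
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def generate_batches(engine, batches, sampling_params):
    # Submit every batch at once and wait for all of them to finish.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)  # results keep the order of `batches`


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
all_outputs = asyncio.run(
    generate_batches(StandInAsyncEngine(), batches, {"temperature": 0.8})
)
```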

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [Occupation] expert. I've been learning about [Industry] for [Number] years, but I've always been fascinated by [Industry] because [Why]. I'm always ready to share my knowledge and help anyone who needs it. I look forward to meeting new people who share my interest. Thanks for asking! [Name] Self-introduction. Can you please provide me with a list of potential topics that could be discussed in my introduction to my future clients? Certainly! Here are some potential topics that could be discussed in your introduction to your future clients:

1. Industry-related information: This



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

The statement about Paris's capital city is factual and unambiguous. It provides a clear and concise description of Paris, which is the capital city of the French Republic. The statement is factually correct, and it does not contain any unclear or ambiguous elements that could cause confusion. The full statement about Paris's capital city is: "The capital of France is Paris."

For the sake of completeness, here is the full statement: "The capital of France is Paris. " This statement is a factual and unambiguous representation of Paris's position in the French Republic.

To reiterate, the statement about Paris's capital city



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by a number of trends and developments. Some of the most promising and potential future trends include:

1. Autonomous vehicles: Autonomous vehicles are likely to become increasingly common, with a range of companies working on developing fully autonomous vehicles that can operate on roads and in factories.

2. Virtual assistants: Virtual assistants like Siri, Alexa, and Google Assistant are likely to become even more integrated into our daily lives, with more advanced features and capabilities.

3. Blockchain: Blockchain technology is likely to play a growing role in AI, with companies like IBM and Google working on developing applications and services that leverage the technology to create more secure and
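
When the streamed chunks are needed as a single string rather than printed as they arrive, they can be accumulated with `async for`. The sketch below is a minimal illustration under stated assumptions: `stand_in_stream` is a hypothetical async generator that plays the role of a stream of already-cleaned chunks, and `collect` is a hypothetical helper. Neither is part of SGLang.

```python
import asyncio


async def stand_in_stream(chunks):
    """Hypothetical stand-in for a stream of already-cleaned text chunks."""
    for chunk in chunks:
        await asyncio.sleep(0)  # yield control between chunks
        yield chunk


async def collect(stream):
    """Consume an async stream and return the concatenated text."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


full_text = asyncio.run(
    collect(stand_in_stream([" Paris", ".", " The", " statement"]))
)
print(repr(full_text))
```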




```python
llm.shutdown()
```
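
If generation can raise an exception, it may be worth making sure that `shutdown()` still runs. One possible pattern is a small context manager. This is a sketch, not an SGLang feature: `managed` and `StandInEngine` are hypothetical names, and the stand-in simulates a failure so the cleanup path can be observed without a GPU.

```python
from contextlib import contextmanager


class StandInEngine:
    """Hypothetical stand-in that fails during generation."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        raise RuntimeError("simulated failure during generation")

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StandInEngine()
try:
    with managed(engine) as llm_like:
        llm_like.generate(["Hello"], {})
except RuntimeError:
    pass  # the error still propagates; shutdown() has already run

print(engine.is_shut_down)
```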