# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
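As a rough illustration of the custom-server idea, the sketch below shows a request handler that does not depend on any web framework. It is not the linked `custom_server.py`: the function `handle_generate`, the request shape, and the `EchoEngine` stand-in are hypothetical, and it assumes that `llm.generate` returns a dict with a `text` field when given a single prompt string.

```python
import json


def handle_generate(llm, body: bytes) -> bytes:
    """Turn one JSON request body into one JSON response body.

    Hypothetical request shape: {"prompt": str, "sampling_params": dict}.
    """
    request = json.loads(body)
    prompt = request["prompt"]
    sampling_params = request.get("sampling_params", {})
    # Assumption: a single prompt string yields one dict with a "text" field.
    output = llm.generate(prompt, sampling_params)
    return json.dumps({"text": output["text"]}).encode("utf-8")


class EchoEngine:
    """Stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompt, sampling_params):
        return {"text": prompt.upper()}


response = handle_generate(EchoEngine(), b'{"prompt": "hello"}')
print(response)
```

A real server would call a function like this from its route handler, passing in one long-lived engine instance instead of creating an engine per request.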



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
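If the same script has to run both as a plain Python program and inside a notebook, one option is to apply the patch only when an event loop is already running. The following is a minimal sketch under that assumption; `loop_is_running` is a hypothetical helper name, not part of SGLang or `nest_asyncio`.

```python
import asyncio


def loop_is_running() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


if loop_is_running():
    # Only reached in environments that already run a loop, such as Jupyter.
    import nest_asyncio

    nest_asyncio.apply()
```

In a plain script no loop is running at import time, so the patch is skipped and `nest_asyncio` does not even need to be installed.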

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Patrick. I'm a student of Computer Science at UCL. I'm passionate about the intersection of computer science and art. I have a degree in Computer Science, and have worked as a software engineer for a number of years. My interest in the intersection of computer science and art stems from my love of making art and the way in which art can be transformed into something new.
I find it incredibly interesting to see how art can be represented in code, and vice versa. I see this as a way of extending the boundaries of art, and exploring how art can be created using programming languages. I have been working on a project that explores
Prompt: The president of the United States is
Generated text:  a 44-year-old man of noble birth and impeccable education. He is a man of great achievements in the field of science. In this respect, he is unique. He is the only American president who has not been born in the United States, and he has also been elected pre
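In an offline batch job the results are usually written to disk, not printed. The sketch below shows one way to pair each prompt with its completion and serialize the pairs as JSON Lines. It relies only on the output shape used above (a list of dicts with a `text` field); `to_jsonl` and the sample data are hypothetical.

```python
import json


def to_jsonl(prompts, outputs) -> str:
    """Serialize prompt/completion pairs, one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": p, "completion": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Stand-in for the list returned by llm.generate(prompts, sampling_params).
fake_outputs = [{"text": " Paris."}, {"text": " bright."}]
jsonl = to_jsonl(["The capital of France is", "The future of AI is"], fake_outputs)
print(jsonl)
```

The length check guards against silently dropping results, since `zip` stops at the shorter of its two inputs.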

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Occupation]. I'm a [Type of Character] who is [Describe your character's personality]. I'm [Describe your character's hobbies and interests]. I'm [Describe your character's strengths and weaknesses]. I'm [Describe your character's goals and aspirations]. I'm [Describe your character's personality traits]. I'm [Describe your character's unique selling point]. I'm [Describe your character's personality type]. I'm [Describe your character's personality traits]. I'm [Describe your character's unique selling point]. I'm [Describe your character's personality type].

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, and is located in the south of the country. It is the largest city in Europe and the second-largest city in the world by population. Paris is known for its rich history, art, and culture, and is a major tourist destination. The city is also home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is a vibrant and dynamic city with a rich cultural and artistic heritage. It is a popular tourist destination for people from all over the world. The city is also home to many important institutions such as the French

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies are expected to continue to improve and become more integrated into our daily lives, from self-driving cars to personalized medicine. Additionally, AI is likely to continue to be used for a wide range of applications, from improving healthcare outcomes to enhancing customer service and fraud detection. As AI continues to evolve, it is likely to have a significant impact on the way we work, live, and interact with each other. However, it is also important to consider the potential risks and ethical considerations associated with AI, and to ensure that it is
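The "overlap removal" mentioned in the cell above can be pictured as follows: when a new chunk begins with text that already ends the accumulated output, only the unseen remainder is appended. This is a minimal sketch of that idea and not necessarily how `stream_and_merge` is implemented; `trim_overlap`, `merge_stream`, and the sample chunks are illustrative names and data.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks) -> str:
    """Concatenate streamed text chunks, skipping any repeated overlap."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks whose boundaries overlap.
merged = merge_stream(["The capital", "capital of France", "France is Paris."])
print(merged)
```

One caveat of this approach is that it cannot tell a repeated overlap from a genuine repetition: if the model really emits "ha" twice in a row, the second chunk is trimmed away.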



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I am a person who loves to travel and explore new places. I have always been fascinated by the concept of adventure and have always dreamed of experiencing a new culture, cuisine, or art form. I have always been interested in science and technology, and have a passion for learning about the world around me. Whether I am in a foreign city or in my own backyard, I always have a good time. I am a self-employed blogger, and I love sharing my thoughts and ideas on the latest trends and fashion. I am passionate about sustainability and am always looking for ways to reduce my carbon footprint. My love for photography and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

A. Correct
B. Incorrect
C. Insufficient information

**A. Correct** 

Here's a brief and accurate fact about Paris: it is the capital
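When prompts arrive one at a time from different parts of an application, instead of as one list, a common asyncio pattern is to submit them individually and cap the number of pending requests. The sketch below illustrates that pattern with a semaphore. It assumes that `async_generate` accepts a single prompt string and returns one dict with a `text` field; `generate_bounded` and `FakeEngine` are hypothetical names.

```python
import asyncio


async def generate_bounded(llm, prompts, sampling_params, max_in_flight=2):
    """Submit prompts one at a time, with at most max_in_flight pending."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def run_one(prompt):
        async with semaphore:
            return await llm.async_generate(prompt, sampling_params)

    # gather() returns results in the same order as the prompts.
    return await asyncio.gather(*(run_one(p) for p in prompts))


class FakeEngine:
    """Stand-in for sgl.Engine so the sketch runs without a GPU."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0)
        return {"text": f" <completion for: {prompt}>"}


outputs = asyncio.run(
    generate_bounded(FakeEngine(), ["a", "b", "c"], {"temperature": 0.8})
)
print([o["text"] for o in outputs])
```

If all prompts are known up front, passing the whole list in a single call, as in the cell above, is simpler and leaves batching to the engine's scheduler.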

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am [age]. I am a [occupation] who have a passion for [interest or hobby]. I have always been passionate about [reason], and I believe that [reason] will lead me to success and fulfillment. I am always looking for new opportunities to learn and grow, and I am always eager to share my knowledge and expertise with those who can benefit from it. My goal is to [career goal or purpose], and I am confident that I can achieve it through [your talents, skills, or qualities]. Thank you for asking, and I look forward to the opportunity to meet you. [Name]



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

This statement is true because:

1. It is widely known and recognized as the capital of France.
2. It is the main city and largest city in France.
3. It has been the capital city of France for more than 300 years.
4. It is a city with a rich history, including its influence on French literature, art, and cuisine.

Furthermore, Paris has a significant impact on France's culture, politics, and economy, making it a prominent and recognizable city in French society. The city's status as the capital has earned it a highly respected and prestigious title, thus allowing it to



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by several key trends that will shape how we interact with technology and its applications in various sectors, including education, healthcare, and transportation. Here are some of the potential future trends in AI:

1. Increased Integration with Human Wisdom: As AI becomes more sophisticated, it will be able to learn and adapt more effectively to human behavior and context. This means that AI systems will be able to better understand and respond to the needs and preferences of humans, potentially leading to more personalized and efficient solutions.

2. Autonomous and Intelligent Agents: As AI becomes more advanced, it will be able to operate independently and make decisions based on complex
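Streaming code often needs the complete text as well as the live output, for example to log or post-process it afterwards. The sketch below prints chunks as they arrive and returns the concatenated result. `collect_stream` is a hypothetical helper, and `fake_stream` is a stand-in for an async iterator of cleaned chunks such as the one consumed in the cell above.

```python
import asyncio
from typing import AsyncIterator


async def collect_stream(chunks: AsyncIterator[str]) -> str:
    """Print chunks as they arrive and return the full text."""
    parts = []
    async for chunk in chunks:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # Finish the line once the stream ends.
    return "".join(parts)


async def fake_stream() -> AsyncIterator[str]:
    """Stand-in for a stream of already de-duplicated text chunks."""
    for piece in [" Paris", ".", " It", " is", " the", " capital", "."]:
        await asyncio.sleep(0)
        yield piece


full_text = asyncio.run(collect_stream(fake_stream()))
```

Collecting the pieces in a list and joining once at the end avoids repeated string concatenation on long generations.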




In [6]:
llm.shutdown()
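If an exception is raised before the final cell runs, `shutdown()` is never called. One way to make cleanup unconditional is a small context manager, sketched below. `engine_session` and `FakeEngine` are hypothetical names; the only engine method this relies on is the `shutdown()` call shown above.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(llm):
    """Yield the engine and always call its shutdown() on exit."""
    try:
        yield llm
    finally:
        llm.shutdown()


class FakeEngine:
    """Stand-in for sgl.Engine so the sketch runs without a GPU."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.closed = True


engine = FakeEngine()
with engine_session(engine) as llm:
    outputs = llm.generate(["Hello, my name is"], {"temperature": 0.8})
print(engine.closed)
```

A plain `try`/`finally` around the generation code achieves the same effect when a helper like this is not worth adding.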