# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
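
The reason this is needed comes from how `asyncio` treats nested loops: `asyncio.run()` refuses to start when an event loop is already running in the same thread, which is the situation inside IPython or Jupyter. The sketch below reproduces that error using only the standard library. The helper names `inner` and `outer` are illustrative assumptions and are not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # A loop is already running here, much like inside a notebook cell,
    # so a nested asyncio.run() call is rejected.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio.apply()` in place, the running loop is patched to allow this kind of re-entrant call, which is why the snippet above is recommended for notebook use.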

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Madeline and I am 18 years old. My family is also my best friend. I love reading, making my own decisions, and learning. I want to be a doctor because I want to help sick people. I'm going to the University of Miami for college. I'm really excited to learn about medicine and work with patients. I'm also a fan of sports and playing basketball. I have a friend who has a great tennis player. I am sure he will be the best player in the team. I want to be famous for being the best tennis player in the world. I have no idea how to do that though
Prompt: The president of the United States is
Generated text:  running for a second term, and is planning a campaign. He has 5 different campaigns in mind, and each campaign has a certain number of different ways to be presented. He decides to evaluate the number of ways to present the campaigns and decides to compare the total number of presentations he can make with the total number of ways he can distribu

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament and the French National Museum of Modern Art. Paris is a bustling city with a rich history and culture, and it is a popular tourist destination. The city is known for its fashion, art, and cuisine, and it is a major hub for business and commerce in Europe. Paris is also home to many famous museums and attractions, including the Louvre, the Musée d'Orsay, and the Musée d'Orsay. Overall, Paris is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some potential trends include:

1. Increased integration of AI into everyday life: AI is already being integrated into many aspects of our lives, from smart home devices to self-driving cars. As AI becomes more integrated into our daily lives, we may see even more widespread adoption of AI technologies.

2. Greater emphasis on ethical considerations: As AI becomes more integrated into our daily lives, there will be a greater emphasis on ethical considerations. This could lead to more stringent regulations and guidelines for the development and use of AI technologies.

3. Increased
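
The "overlap removal" named in the cell above can be pictured as follows. This is a minimal sketch of one way to merge streamed chunks whose beginnings repeat the end of the text received so far. `trim_overlap` and `merge_chunks` are hypothetical helpers written for illustration; they are not the implementation of `stream_and_merge` in `sglang.utils`, which may work differently.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the tail of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while dropping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Illustrative chunks in which each one repeats the end of the previous one.
chunks = ["The capital", " capital of", " of France", " is Paris"]
merged = merge_chunks(chunks)
print(merged)
```

Checking the longest candidate overlap first keeps the merge from stopping at a short accidental match, such as a single repeated space.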



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [Age] year old, [Gender] [Name]. I have a [job] at [company] and I'm passionate about [something]. I enjoy [why]. I'm currently [status] and I'm open to [advice].
**[Name]**, a **[Age]** year-old, **[Gender]**, **[Name]** at **[Company]**, is passionate about **[something]**. I am currently **[status]**, and I'm ready to share my **[advice]** with you. What’s something you’d like to know about me? I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light. It is the largest and most populous city in France, located on the banks of the Seine River, within the Haute-Marne region. Paris is the cultural, economic, and political center of France, and is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. It is also known f
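
One reason to prefer the asynchronous call is that other coroutines can make progress while the batch is being generated. The sketch below illustrates that pattern with the standard library only. `fake_async_generate` is a hypothetical stand-in that mimics the shape of the outputs used above (a list of dicts with a `"text"` key); it is not the SGLang API and performs no inference.

```python
import asyncio


async def fake_async_generate(prompts, delay=0.01):
    # Hypothetical stand-in for an awaitable batch call; not the SGLang API.
    await asyncio.sleep(delay)
    return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks):
    # Other work that keeps running while generation is awaited.
    for i in range(3):
        ticks.append(i)
        await asyncio.sleep(0)


async def main():
    ticks = []
    outputs, _ = await asyncio.gather(
        fake_async_generate(["Hello, my name is", "The capital of France is"]),
        heartbeat(ticks),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(main())
for output in outputs:
    print(output["text"])
```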

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  ___________, and I am a ____________. I am currently living in ___________ and I enjoy ___________.

I hope you enjoy learning more about me and my character. Let me know if you have any questions or if there is anything I can do to make you feel more comfortable with me. Your time is important to me, and I value your input.

Do you have any questions or topics you would like to discuss? I'm here to answer any questions you might have or to share information about my character.

Feel free to reach out to me through my social media channels, my website, or any other communication



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its rich history, stunning architecture, and vibrant culture. It is located in the northwestern region of France and is the seat of government, diplomacy, and politics for France.

Paris is a major cultural and economic center, with attractions like the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral, as well as its famous landmarks like the Champs-Élysées and Louvre. It has a population of around 2.3 million people, making it the largest city in the European Union.

The French capital is also home to a large international community, with significant European influence and contributions



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be driven by several trends that are shaping the direction of development in this rapidly evolving field. Here are some key trends that are expected to shape the future of artificial intelligence:

1. Increased focus on ethical considerations: As more and more AI systems become integrated into our daily lives, there will be a growing emphasis on ethical considerations. This will include issues such as bias, transparency, accountability, and privacy. As AI systems become more complex and capable, it will become essential to ensure that they are developed and deployed in ways that respect these ethical concerns.

2. Expansion of AI applications: One of the most significant trends in the future of
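
The consumption pattern used in the cell above, `async for` over a stream of chunks printed with `end=""` and `flush=True`, can be tried without a GPU using a stub. In the sketch below, `fake_token_stream` is a hypothetical async generator that stands in for a streaming source; it is not `async_stream_and_merge` and makes no claim about how SGLang splits text into chunks.

```python
import asyncio


async def fake_token_stream(text):
    # Hypothetical stand-in for a streaming generator; not the SGLang API.
    for token in text.split(" "):
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield token + " "


async def collect(text):
    pieces = []
    async for chunk in fake_token_stream(text):
        # Printing without a newline renders the chunks as one line.
        print(chunk, end="", flush=True)
        pieces.append(chunk)
    print()
    return pieces


pieces = asyncio.run(collect("Paris is the capital"))
```

Because each chunk is printed as soon as it arrives, the text appears incrementally on a single line rather than after the whole completion is ready.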




In [6]:
llm.shutdown()