# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
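
As a rough illustration of what "a custom server on top of the engine" can mean, the sketch below shows a framework-agnostic request handler that forwards a JSON body to an engine object. It is not taken from the linked `custom_server` script. `EchoEngine` and `handle_generate` are hypothetical names, and the assumption that `generate` returns a dict with a `text` field for a single prompt is made only so the sketch runs without a GPU.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a model."""

    def generate(self, prompt, sampling_params=None):
        # Assumption: a single prompt yields a dict with a "text" field.
        return {"text": f" (echo of: {prompt})"}


def handle_generate(engine, raw_body):
    """Map a JSON request body to a (status_code, JSON response body) pair."""
    try:
        payload = json.loads(raw_body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a non-object body, or a missing "prompt" field.
        error = {"error": "expected a JSON object with a 'prompt' field"}
        return 400, json.dumps(error).encode("utf-8")

    result = engine.generate(prompt, payload.get("sampling_params", {}))
    return 200, json.dumps({"text": result["text"]}).encode("utf-8")


status, body = handle_generate(EchoEngine(), b'{"prompt": "Hello"}')
print(status, body.decode("utf-8"))
```

A real server would call a function like this from whatever web framework it uses and pass in the actual engine instead of the stand-in.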



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
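
The reason for this patch can be reproduced with the standard library alone. The sketch below does not use SGLang or `nest_asyncio`. It only shows that plain `asyncio` refuses to start a second event loop from inside a running one, which is the situation in IPython and Jupyter. The function names are illustrative.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    coro = inner()
    try:
        # Starting a new event loop from inside a running one is rejected.
        asyncio.run(coro)
    except RuntimeError:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return True
    return False


nested_call_failed = asyncio.run(outer())
print(nested_call_failed)
```

This prints `True`: the nested `asyncio.run` call raises `RuntimeError`. Calling `nest_asyncio.apply()` first is what allows such nested calls to proceed.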

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Emily. I live in New York City and I'm an intern at a nonprofit called Frequent Foundation. I'm here to share what I've learned about social change and volunteering. I'll be providing updates on what's on the horizon for Frequent Foundation, what I've been working on at Frequent Foundation, and sharing updates on a variety of interesting social issues, such as the opioid crisis and mental health. What's interesting is, I'm not sure where to start. I'm actually really excited to be a part of Frequent Foundation and I have a lot to learn.
It's something that's really interesting for me. I
Prompt: The president of the United States is
Generated text:  3/5 times taller than the president of Brazil, and the president of Brazil is 3 times taller than the president of China. If the president of China is 180 feet tall, how tall is the president of Brazil?
To determine the height of the president of Brazil, we need to follow the given relationships ste
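
In an offline batch job the generated text is usually saved rather than printed. The sketch below pairs each prompt with its output and writes one JSON object per line. It assumes only what the loop above shows, namely that each output is a dict with a `text` field. `save_results`, the sample data, and the output path are illustrative and not part of SGLang.

```python
import json
import tempfile
from pathlib import Path


def save_results(prompts, outputs, path):
    """Write one JSON object per line pairing each prompt with its generated text."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    with open(path, "w", encoding="utf-8") as f:
        for prompt, output in zip(prompts, outputs):
            record = {"prompt": prompt, "text": output["text"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hand-written stand-in for the list returned by llm.generate(prompts, sampling_params).
sample_prompts = ["The capital of France is"]
sample_outputs = [{"text": " Paris."}]

out_path = Path(tempfile.gettempdir()) / "sglang_batch_results.jsonl"
save_results(sample_prompts, sample_outputs, out_path)
print(out_path.read_text(encoding="utf-8"))
```

The explicit length check is there because `zip` would otherwise silently drop unmatched items.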

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Occupation]. I'm a [Skill] who has always been [Attraction/Interest/Challenge] to me. I'm always looking for [Challenge/Interest/Attraction] and I'm always eager to learn more about [Subject]. I'm always ready to help others and I'm always willing to share my knowledge with anyone who asks. I'm a [Skill] who is always [Challenge/Interest/Attraction] to me. I'm always looking for [Challenge/Interest/Attraction] and I'm always eager to learn more about [Subject].

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city that is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament and the French Parliament building. Paris is a bustling metropolis with a rich cultural heritage and is a popular tourist destination. It is the capital of France and the largest city in the European Union. The city is known for its cuisine, fashion, and art, and is a major center for business, politics, and culture. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly. It is a city that is both a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way that AI is used and developed. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be an increased focus on ethical considerations. This will include issues such as bias, transparency, accountability, and the potential for AI to be used for harmful purposes.

2. Greater use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI becomes more advanced, it is likely to be
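
The streaming example above mentions "overlap removal". One way to read this is that successive streamed chunks may repeat text that was already emitted, so the repeated part has to be trimmed before concatenation. The sketch below illustrates that idea on hand-written chunks. `strip_overlap` and `merge_chunks` are illustrative names, and this is not claimed to be how `stream_and_merge` is implemented.

```python
def strip_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hand-written chunks for illustration, not real engine output.
chunks = ["The cap", "capital of", " of France"]
print(merge_chunks(chunks))
```

This prints `The capital of France`. A purely textual check like this cannot tell a repeated chunk from a genuine repetition in the output (for example `"ha"` followed by `"ha"`), so it is only appropriate when chunks are known to overlap.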



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am [Title] at [Company Name]. I am currently working remotely and am excited to bring my innovative approach to this project. I have a passion for [Area of Expertise] and am always looking to learn new things. Thank you! [Name] [Title] / [Company Name] / [Position]
As an [Area of Expertise], I bring a unique approach to the project, ensuring a successful and innovative outcome. I am currently working remotely and excited to bring my innovative ideas to the table. I have a passion for [Area of Expertise] and always aim to learn new things.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is the largest city in France by population and is located on the Seine River. It has a rich history and is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral,
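
When prompts arrive from independent tasks instead of one list, an application may want to issue several asynchronous requests at once and cap how many are in flight. The sketch below shows that pattern with `asyncio.gather` and a semaphore. `FakeAsyncEngine` is a hypothetical stand-in that only mimics an awaitable `async_generate`. The single-prompt call shape, its return value, and the `max_in_flight` parameter are assumptions made for illustration.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; runs without a model or GPU."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate generation latency
        return {"text": f" <completion for {prompt!r}>"}


async def generate_bounded(engine, prompts, sampling_params, max_in_flight=2):
    """Run one request per prompt, with at most max_in_flight running at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def generate_one(prompt):
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather returns results in the same order as the input prompts.
    return await asyncio.gather(*(generate_one(p) for p in prompts))


results = asyncio.run(
    generate_bounded(FakeAsyncEngine(), ["a", "b", "c"], {"temperature": 0.8})
)
for result in results:
    print(result["text"])
```

Since the engine already schedules batches itself, passing the whole list to a single call, as in the cell above, remains the simpler option when all prompts are known up front.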

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am [Age] years old. I currently work as a [Job Title] at [Company], and I have been in this profession for [Number of Years] years. Throughout my career, I have always been [Positive Trait], and I am always [Positive Attribute]. I am an [Awards Won], [Achievements Achieved], and [Facts Not Known] person. Thank you. Good luck! Your self-introduction should be neutral, without bias or connotations, and should convey a straightforward and professional tone. I'm glad you decided to share your story with me. As a [Name

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the world’s 21st most populated city and is the largest city in the European Union. It is home to the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Louvre Palace. It was founded in 789 AD and is the seat of the French government. The city is known for its architecture, culinary traditions, and annual festivals. As of 2021, Paris is the 15th-largest city in the world by population and is considered one of the world's most famous cities. Paris is also home to many of the world's most famous landmarks,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  one of rapidly evolving technologies with an increasing number of applications and applications of AI in various industries and fields.

One of the key trends that are expected to shape the future of AI is the increasing integration of AI into everyday life. With the development of deep learning models and the improvement of computer hardware, we can expect to see more intelligent assistants, virtual assistants, and even personal assistants that can understand and respond to our queries and actions.

Another trend that is expected to shape the future of AI is the development of more advanced and complex AI models, which are able to handle more complex and nuanced tasks. This will require more resources, such as more

```python
llm.shutdown()
```