# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two common use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
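
The patch matters because a notebook kernel already runs an event loop, and the standard library refuses to start a second one inside it. The following minimal sketch reproduces that failure with plain `asyncio` and no SGLang at all; `inner` and `outer` are illustrative names, and the exact error raised by the engine in a notebook may be worded differently.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running loop here, much like code in a notebook cell.
    coro = inner()
    try:
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return f"RuntimeError: {exc}"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches `asyncio` so that nested calls like the one in `outer` are allowed, which is why calls such as `asyncio.run(main())` work from a notebook cell once the patch is applied.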

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Yolanda Flores. I am a Junior at A. R. M. High School and I am currently in the math program. I have taken many math classes and I love math. I have been very good at math and I am very good at math. I have also received a math scholarship. I am looking forward to continuing my education and obtaining a degree in mathematics. Thank you. What is your name? I am Yolanda Flores and I am a Junior at A. R. M. High School. I am currently in the math program. I have taken many math classes and I love math. I have been very
Prompt: The president of the United States is
Generated text:  20 years older than the president of Central America, and the president of Central America is 30 years younger than the president of Asia. If the president of Asia is currently 30 years old, how old will the president of Asia be in 10 years? Let's break down the problem step by step.

1. Let the current age of the president of Asia be \( A \).
2. According to the proble
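
In an offline batch job the results are usually written to disk rather than printed. The sketch below pairs each prompt with its `output["text"]` and serializes the pairs as JSON Lines. The helper `to_jsonl` and the record field names are hypothetical, and the sample data is a hand-written stand-in shaped like the result of `llm.generate(prompts, sampling_params)`, so the example runs without a model.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as JSON Lines, one record per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    records = (
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    )
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
sample_prompts = ["The capital of France is", "Hello, my name is"]
sample_outputs = [{"text": " Paris."}, {"text": " Ada."}]

jsonl = to_jsonl(sample_prompts, sample_outputs)
print(jsonl)
```

The length check guards against silently dropping records, since `zip` stops at the shorter of its inputs.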

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [occupation] who has been [number of years] in the industry. I'm passionate about [reason for passion], and I'm always looking for ways to [action or achievement]. I'm always eager to learn and grow, and I'm always open to new experiences. I'm a [type of person] who is [character trait or quality] and I'm always ready to help others. I'm [character trait or quality] and I'm always ready to help others. I'm [character trait or quality] and I'm always ready to help others. I'm [character trait or quality

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French National Museum, and the French National Radio and Television Network. Paris is a bustling metropolis with a rich history and a diverse population, making it a popular tourist destination. The city is known for its cuisine, fashion, and art, and is a major center for business and finance in Europe. Paris is also home to many international organizations and institutions, including UNESCO and the European Union. The city is known for its cultural and artistic heritage, and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased automation and robotics: As AI technology continues to advance, we are likely to see an increase in automation and robotics in various industries. This will lead to the creation of more efficient and productive machines that can perform tasks that were previously done by humans.

2. Enhanced privacy and security: As AI technology becomes more advanced, we are likely to see an increase in the use of AI in areas that involve sensitive data
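
The "overlap removal" mentioned in the cell above refers to merging streamed chunks whose text may partly repeat what has already been received. The following is an illustrative sketch of that idea only, not necessarily how `stream_and_merge` is implemented; `merge_with_overlap` and `merge_stream` are hypothetical names, and the chunks are hand-written.

```python
def merge_with_overlap(existing: str, chunk: str) -> str:
    """Append chunk to existing, dropping the longest prefix of chunk
    that already appears at the end of existing."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return existing + chunk[size:]
    return existing + chunk


def merge_stream(chunks):
    """Fold an iterable of text chunks into one string without repeats."""
    merged = ""
    for chunk in chunks:
        merged = merge_with_overlap(merged, chunk)
    return merged


# Hand-written chunks that partially repeat each other.
chunks = [" Paris, the", " the city", " city known for", " its landmarks"]
merged = merge_stream(chunks)
print(merged)
```

One limitation of this approach is that it cannot tell a repeated chunk from text that legitimately repeats: merging `"ha"` with `"ha"` yields `"ha"`, not `"haha"`.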



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [Age] year old [Occupation]. I am currently [Current Location], and I have a [Favorite Activity], and [Favorite Color]. I enjoy spending time with [Friend or Family], and I have a [Favorite Book or Music] that has always been a big part of my life. What else could I possibly say about myself? Let's make the conversation more interesting and engaging! And if you want to start, I am ready to share my story. [Name] [Age] [Occupation] [Current Location] [Favorite Activity] [Favorite Color] [Friend or Family]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known as "La Ville de Paris." It is the largest city in France and one of the oldest continuously inhabited cities in the world. 

Aristotle once said: "To understand the city, understand its people." The city is a melting pot of diverse
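
The asynchronous variant is most useful when generation should overlap with other coroutines instead of blocking the caller. The sketch below shows the pattern with `asyncio.gather`. To stay runnable without a model it uses `fake_async_generate`, a hypothetical stand-in that only mimics the call shape of `llm.async_generate(prompts, sampling_params)`; `load_more_prompts` is likewise invented for illustration.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    """Hypothetical stand-in for llm.async_generate; returns one dict per prompt."""
    await asyncio.sleep(0.01)  # simulate time spent in the engine
    return [{"text": f" <completion for: {p}>"} for p in prompts]


async def load_more_prompts():
    """Unrelated I/O-bound work that can proceed while generation is pending."""
    await asyncio.sleep(0.01)
    return ["The future of AI is"]


async def run_pipeline():
    first_batch = ["The capital of France is"]
    # Both coroutines make progress concurrently on the same event loop.
    outputs, next_batch = await asyncio.gather(
        fake_async_generate(first_batch, {"temperature": 0.8}),
        load_more_prompts(),
    )
    return outputs, next_batch


outputs, next_batch = asyncio.run(run_pipeline())
print(outputs[0]["text"], next_batch)
```

With the real engine, the first argument to `asyncio.gather` would be the `llm.async_generate(...)` call used in the cell above.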

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream chunks through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name], I am an [age] year old student. I am [gender] and [address] is my permanent home. I love [something] and I enjoy [reason why I love [something]].

I am passionate about [interest/interest in life] and I am always learning. I am eager to grow and develop into someone who is [vital life skill]. I am always looking for new experiences, challenges, and growth opportunities. I am also very interested in [event or issue that interests you] and I am always ready to help others understand and discover what it means to be [interest/interest in life

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is the largest city in France and the 15th-largest city in the world by population.
You are an AI assistant that helps you understand the reasons behind the answers. No purpose other than to provide the answers to the questions presented. There is no talk about nothing. No questions answered that have not been asked. The questions and answers adhere to the limit sequence 120. Learn that the sum of the rows and columns in a grid is called its area. Create the grid that represents the area of the desk shown. Label the grid cells with the coordinates of the grid cells. Use them to answer the question

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be incredibly diverse and rapidly evolving, driven by new technologies, advances in neuroscience, and a growing awareness of the ethical implications of AI. Some potential future trends in AI include:

1. Increased machine learning and automation: As AI continues to become more sophisticated, it is likely to become more prevalent in areas such as manufacturing, healthcare, and transportation.

2. AI in healthcare: With the increasing use of AI in medicine, we may see more personalized treatments, more efficient patient care, and a greater understanding of the human body.

3. AI in finance: AI is already used in financial services to automate trading and reduce fraud. As

In [6]:
llm.shutdown()
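
Calling `shutdown()` explicitly works when every cell runs to completion, but in a script an exception raised during generation would skip it. One way to make the cleanup unconditional is a small context manager, sketched below. `engine_session` is a hypothetical helper and not part of SGLang, and `StubEngine` is a stand-in used only so the example runs without a GPU.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(make_engine):
    """Create an engine, yield it, and always call shutdown() afterwards."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


class StubEngine:
    """Hypothetical stand-in for sgl.Engine, used only to make this runnable."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


stub = StubEngine()
try:
    with engine_session(lambda: stub) as engine:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(stub.is_shut_down)
```

With the real engine, the factory would be something like `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`, and the body of the `with` block would hold the `generate` calls.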