# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
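
The reason for this step is that interactive environments such as IPython already run an event loop, and by default `asyncio.run()` refuses to start a second loop inside it. The sketch below reproduces that failure with the standard library only. It does not call SGLang, and `inner` and `outer` are illustrative names rather than anything from the engine.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # At this point an event loop is already running, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call, rejected by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

With `nest_asyncio.apply()` in place, nested calls of this kind are allowed, which is why the patch is needed before using the engine from a notebook.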

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# generate() blocks until all prompts are finished and returns one result per prompt
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Sam. I'm a 17-year-old middle school student. I go to a Chinese boarding school. I want to tell you about my school life. In my school, we have classes in the morning and afternoons. There are 18 classes in the morning, and 15 classes in the afternoons. I have to be in class at 8:15 every morning. After that, I have to do my homework in the afternoons. We have a lot of homework, so I have to do it every day. I have to eat meals every day too. I have to drink milk
Prompt: The president of the United States is
Generated text:  3/4 times the age of the president of the faculty of arts and sciences. The president of the faculty of arts and sciences is 30 years younger than the president of the middle school. If the president of the middle school is 50 years old, how old is the president of the faculty of arts and sciences? To determine the age of the president of the faculty of arts and sciences, we start by identifying the age of the president of
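
In an offline batch job the results are usually saved rather than printed. The sketch below writes prompt and completion pairs as JSON Lines. It assumes only what the loop above shows, namely that each result is a dict with a `"text"` key. The helper `write_jsonl` and the demo data are hypothetical, and an in-memory buffer stands in for a real file so that the example runs without the engine.

```python
import io
import json


def write_jsonl(prompts, outputs, fp):
    """Write one JSON object per prompt/output pair to a text file object."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")


# Stand-in data shaped like the results above: one dict with a "text" key per prompt.
demo_prompts = ["The capital of France is"]
demo_outputs = [{"text": " Paris."}]

buffer = io.StringIO()  # replace with open("results.jsonl", "w") in a real job
write_jsonl(demo_prompts, demo_outputs, buffer)
print(buffer.getvalue(), end="")
```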

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

# Stream each prompt and merge the chunks into a single string
for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can I help you with today? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can I help you with today? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can I help you with today? [Name] is a [job title] at [company name]. I'm excited to meet you and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French National Library. Paris is a cultural and economic hub, with a rich history dating back to the Roman Empire and a modern city that has undergone significant development over the centuries. It is a popular tourist destination and a major center for business and commerce. Paris is known for its diverse cuisine, including French cuisine, and its vibrant nightlife. It is also home to many international organizations and institutions, including the European Parliament and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some possible future trends include:

1. Increased integration of AI into everyday life: AI is already being integrated into many aspects of our lives, from self-driving cars to personalized medicine. As AI technology continues to advance, we can expect to see even more integration into our daily routines.

2. AI will become more autonomous: As AI technology continues to improve, we can expect to see more autonomous vehicles on the road. This will require more advanced AI algorithms and machine learning techniques to make autonomous vehicles safe and reliable.

3. AI will
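
The idea behind "overlap removal" can be illustrated without the engine. If a streamed chunk repeats the tail of the text received so far, the repeated prefix is dropped before appending. The following is a minimal sketch of that idea, not SGLang's implementation of `stream_and_merge`. The names `strip_overlap` and `merge_chunks` and the demo chunks are made up for illustration.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that `existing` already ends with."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, removing any overlap with the text merged so far."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats the end of the previous one.
demo_chunks = ["The capital", " capital of", " of France"]
merged = merge_chunks(demo_chunks)
print(merged)
```

One trade-off of this approach is that text the model genuinely repeats at a chunk boundary is indistinguishable from an overlap and would be trimmed as well.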



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the whole batch; one result is returned per prompt
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am an AI language model created by Alibaba Cloud. I am here to assist you with your questions and provide you with relevant information about a wide range of topics. Feel free to ask me anything you'd like to know, and I'll do my best to help you with it. Let me know if there's anything I can do for you! [Name] [Phone number] [Email address] [Social media handle] [Location] I'm currently available 24/7 and can be reached at any time. [Name] [Location] Hello, my name is [Name] and I am an

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Additionally, Paris is renowned for its cuisine, including its famous dishes like croissants, poutine, and baguette. The French capital offers a div
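
When prompts arrive independently, for example in a custom server, one option is to issue one awaitable call per prompt and cap how many are in flight. The sketch below shows that pattern with `asyncio.gather` and a semaphore. Because the real engine needs a GPU, a stand-in coroutine `fake_generate` replaces the engine call. Both it and `generate_all` are hypothetical names. Whether such a cap helps depends on the workload, since the engine already schedules batches itself.

```python
import asyncio


async def fake_generate(prompt: str) -> dict:
    """Stand-in for an awaitable engine call; returns a dict with a "text" key."""
    await asyncio.sleep(0.01)  # simulate inference latency
    return {"text": f" <completion for: {prompt}>"}


async def generate_all(prompts, max_concurrency: int = 2):
    """Run one request per prompt, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(prompt: str) -> dict:
        async with semaphore:
            return await fake_generate(prompt)

    # gather() returns results in the same order as the input prompts
    return await asyncio.gather(*(worker(p) for p in prompts))


demo_prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
results = asyncio.run(generate_all(demo_prompts))
for prompt, result in zip(demo_prompts, results):
    print(prompt, "->", result["text"])
```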

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [job title] at [company name]. I'm passionate about [mention something specific about your role]. My background is in [mention your previous education or professional experience], and I have a strong [mention your skill or expertise]. I'm always looking for opportunities to learn and grow, and I'm always eager to contribute to [mention your company's mission or objectives]. I'm a team player and enjoy working with others, and I have a friendly demeanor and are always available to help. I'm committed to [mention something specific about your company's values or beliefs], and I'm always looking for ways

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

This statement is factually correct and concise. However, it can be made more precise by specifying that Paris is the capital of France. Here is a more detailed version:

The capital city of France is Paris, located in the heart of the country and known as "la Ville de la Rose" (the City of Roses) due to its iconic rose garden. It is also known as "La Ville de la Belle Epoque" (the City of Beaux-Arts) and as "La Ville d'États" (the Capital of the Empire). Paris is the seat of government and the largest city

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and is likely to continue to evolve rapidly. Here are some possible trends in AI:

1. Increased automation: AI is likely to become more integrated into the workforce, increasing automation and replacing humans in tasks that involve repetitive and mundane work.

2. Enhanced privacy and security: As AI systems become more sophisticated, they will need to be programmed to respect human privacy and security. This means that AI will need to be designed to be transparent, secure, and accountable to users.

3. AI will become more aware and responsive: As AI systems become more sophisticated, they will become more aware of their surroundings and able to respond to human emotions and
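
The streaming loop in this section prints each chunk as it arrives. If the complete text is also needed afterwards, the chunks can be accumulated while iterating. The sketch below shows that pattern with a stand-in async generator, `fake_stream`, in place of the engine. Both `fake_stream` and `collect` are hypothetical names, and the word-by-word chunking is only an assumption for the demo.

```python
import asyncio


async def fake_stream(text: str):
    """Stand-in async generator that yields a completion word by word."""
    for i, word in enumerate(text.split(" ")):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word if i == 0 else " " + word


async def collect(stream) -> str:
    """Consume an async stream of text chunks and return the joined text."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream("Paris is the capital of France.")))
print(full_text)
```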




```python
# Shut down the engine and release its resources
llm.shutdown()
```