# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed, working Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython or in any other code that already runs inside an event loop (a nested loop), apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
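The patch is needed because `asyncio` refuses to start a second loop from inside a running one. The following standard-library-only sketch reproduces that refusal without SGLang; `inner` and `outer` are hypothetical names used purely for illustration.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here (started below),
    # so a nested asyncio.run() call is rejected with RuntimeError.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

IPython and Jupyter already run an event loop, which is the same situation as `outer` above; `nest_asyncio.apply()` patches `asyncio` so that such nested calls are allowed.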

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Anne. I’m from England, and I’m living in the United States now. I used to be very lazy and didn’t like to exercise. I couldn’t stand the idea of spending time in the sun. That was why I decided to go to the United States a long time ago. Now I am 17 years old and I have a lot of friends. I like to play sports and go swimming. I am very healthy. I like to eat healthy foods. I eat junk food a little, but I eat vegetables and fruits as often as possible. I like to have a walk in the park when I feel like it
Prompt: The president of the United States is
Generated text:  a member of the highest governing body of the United States. Which of the following is true of the president of the United States?
A: President is a member of Congress
B: President has no official duties
C: President can sign executive orders
D: President has no role in the legislative branch
Answer:

A

$\angle ABC$ is an interior angle of a triangle, $\angle A = 45^\circ$, and t
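The `temperature` and `top_p` entries in `sampling_params` control how the next token is drawn. As a rough illustration of what these two knobs usually mean (temperature scaling followed by nucleus filtering), here is a self-contained sketch. It is not SGLang's sampler; `top_p_filter` and `toy_logits` are hypothetical names, and the logit values are made up.

```python
import math


def top_p_filter(logits, temperature, top_p):
    """Return renormalized probabilities of the smallest token set
    whose cumulative probability reaches top_p (nucleus filtering)."""
    # Temperature scaling: lower values sharpen the distribution.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    # Numerically stable softmax.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    # Keep the most likely tokens until their mass reaches top_p.
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(kept.values())
    return {tok: p / norm for tok, p in kept.items()}


# Made-up logits for illustration only.
toy_logits = {"Paris": 5.0, "Lyon": 2.0, "Nice": 1.0, "Rome": -1.0}
sharp = top_p_filter(toy_logits, temperature=0.8, top_p=0.95)
flat = top_p_filter(toy_logits, temperature=2.0, top_p=0.95)
print(sorted(sharp), sorted(flat))
```

With these toy numbers, the lower temperature concentrates nearly all probability on one token, while the higher temperature leaves several candidates in the nucleus, which is why outputs at `temperature=0.8` vary more than those at `0.2`.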

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, a historic and cultural center with a rich history dating back to the Middle Ages. It is the largest city in France and the second-largest city in the European Union, with a population of over 2.7 million people. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe. The city is also famous for its cuisine, fashion, and art, and is a major tourist destination. Paris is a vibrant and dynamic city with a rich cultural heritage and a strong sense of identity. It

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some possible future trends in AI include:

1. Increased use of AI in healthcare: AI is already being used to improve patient care, from diagnosing diseases to predicting patient outcomes. As AI technology continues to improve, we can expect to see even more sophisticated applications in healthcare, such as personalized medicine and predictive analytics.

2. AI in finance: AI is already being used to improve financial services, from fraud detection to risk assessment. As AI technology continues to evolve, we can expect to see even more sophisticated applications in finance, such
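The "overlap removal" mentioned in the cell above can be pictured as follows: if a streamed chunk starts with text that already ends the accumulated output, only the new part is appended. The sketch below shows that general idea in plain Python. It is an assumption about the technique, not necessarily how `stream_and_merge` is implemented, and `drop_overlap` and `merge_chunks` are hypothetical helper names.

```python
def drop_overlap(existing_text: str, new_chunk: str) -> str:
    """Remove the longest prefix of new_chunk that is already
    a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


# Hypothetical chunks whose boundaries repeat some text.
chunks = ["The capital", " capital of France", " France is Paris."]
merged = merge_chunks(chunks)
print(merged)
```

One caveat of this approach: a chunk that legitimately repeats the end of the previous text (for example `"ha"` followed by `"ha"`) is also trimmed, so it only suits streams where boundary repeats are known to be artifacts.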



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to start this new chapter in my professional life and hope to bring my unique perspective and leadership skills to the team. Please let me know what kind of career trajectory you have in mind. [Name] Hello! I'm [Name] and I'm a [job title] at [company name]. I'm really looking forward to starting my new chapter in my career and I'm excited to bring my unique perspective and leadership skills to the team. Any advice on what kind of career trajectory you have in mind? [Name] Hello! I'm

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the north of the country and is known for its rich history, iconic landmarks, and vibrant culture. It is home to the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral, among other tourist attra
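The cell above passes the whole prompt list to a single `async_generate` call. In custom async code, another common pattern is to start one coroutine per prompt and await them together with `asyncio.gather`. The sketch below shows that pattern with a hypothetical stand-in class, `FakeAsyncEngine`, so that it runs without a GPU or SGLang installed; the stand-in only mimics the call shape and returns placeholder text.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in for illustration; NOT the SGLang Engine."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do some work
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params):
    # Start one coroutine per prompt and wait for all of them.
    # asyncio.gather returns results in the order the awaitables were passed.
    pending = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*pending)


demo_prompts = ["Hello, my name is", "The capital of France is"]
results = asyncio.run(
    generate_all(FakeAsyncEngine(), demo_prompts, {"temperature": 0.8})
)
for prompt, output in zip(demo_prompts, results):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Because `gather` preserves input order, the results can be zipped with the prompts exactly as in the batch example.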

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert character's name]. I am [insert character's age], and I am a [insert character's occupation]. I am passionate about [insert something that interests you], and I believe that [insert something that you believe in], and I am a [insert character's personality trait or trait, such as a good listener, an organizer, a leader, etc.]. I am [insert character's character traits or traits, such as a good listener, a person who is approachable, a leader, etc.]. I have a team of [insert how many people in your team], and I love to [insert something that you



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

A 5-star hotel located near the Eiffel Tower is the centerpiece of the city's skyline.

Choose your answer from: - no.
- yes.
Is the question asking about the capital city of France? yes.



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by a number of trends, including:

1. Increased automation: AI is becoming increasingly integrated into various industries, including manufacturing, healthcare, and transportation. As automation becomes more prevalent, we may see further reduction in manual labor and increase in AI-driven tasks.

2. More nuanced AI: As AI evolves, we may see the development of more nuanced and sophisticated AI systems that can better understand and respond to human emotions and situations.

3. Greater emphasis on ethics and fairness: As AI becomes more integrated into various sectors, there is a growing emphasis on ensuring that AI systems are designed with ethical considerations in mind. This may include




In [6]:
llm.shutdown()
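
In a script, as opposed to a notebook, an exception raised during generation could skip the final `shutdown()` call. One way to guard against that is a small context manager that always calls `shutdown()` on exit. The sketch below is an illustration only: `managed_engine` is a hypothetical helper, and `FakeEngine` is a stand-in that exposes nothing but `shutdown()` so the example runs without SGLang.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in exposing only shutdown(); NOT the SGLang Engine."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and call shutdown() even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed_engine(engine) as llm_like:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.is_shut_down)
```

The same wrapper could be applied to any object with a `shutdown()` method, since it relies on nothing else.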