# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
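
To see why the patch is needed, the following standard-library-only sketch reproduces a similar failure: starting a second event loop while one is already running (as in IPython or Jupyter) raises a `RuntimeError`. The helper name `try_nested_run` is hypothetical and nothing here calls SGLang; it only illustrates the situation that `nest_asyncio.apply()` is meant to work around.

```python
import asyncio


async def inner() -> str:
    return "done"


def try_nested_run() -> str:
    """Try to run a coroutine in a new event loop; return the error text on failure."""
    coro = inner()
    try:
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


async def outer() -> str:
    # A loop is already running here, much like inside a notebook cell.
    return try_nested_run()


result = asyncio.run(outer())
print(result)
```

Called outside a running loop, `try_nested_run()` simply returns `"done"`; inside one, it reports the error instead.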

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Michelle, a 41-year-old mother of two kids, my husband is a college professor, and I am in my early 20s. I have a lot of low self-esteem and I struggle with anxiety. I want to get a coaching job at a startup with my husband and I are learning to code. Can you provide any guidance or advice on how to get started with a coaching job at a startup? Setting up a coaching job at a startup can be a unique and exciting journey. Here are some steps to help you get started:

1. Research the industry and the company: Before starting your coaching job at a startup
Prompt: The president of the United States is
Generated text:  trying to decide what issue to address first in his first 100 days in office. He has 100 choices, and each choice has a 50% chance of being a high-priority issue. Let \(X\) be the number of high-priority issues the president chooses in his first 100 days. 

(a) Find the probability that \(X = 100\).

(b) Find the probability that \(X
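
The cell above relies on `generate` returning one result per prompt, each a dict with a `"text"` field. A minimal sketch of collecting such results into JSON Lines records follows. `StubEngine` and `run_batch` are hypothetical names introduced for illustration; the stub only mimics the return shape seen above so the snippet runs without a GPU, and a real `sgl.Engine` would be passed in its place.

```python
import json
from typing import Any


class StubEngine:
    """Hypothetical stand-in that mimics the return shape of `llm.generate`."""

    def generate(
        self, prompts: list[str], sampling_params: dict[str, Any]
    ) -> list[dict[str, str]]:
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def run_batch(
    engine: Any, prompts: list[str], sampling_params: dict[str, Any]
) -> list[dict[str, str]]:
    """Generate once for the whole batch and pair each prompt with its text."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        raise ValueError("expected one output per prompt")
    return [
        {"prompt": p, "completion": o["text"]} for p, o in zip(prompts, outputs)
    ]


records = run_batch(
    StubEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
jsonl = "\n".join(json.dumps(r) for r in records)
print(jsonl)
```

Passing the whole prompt list in one call, rather than looping over prompts, leaves batching decisions to the engine.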

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short, positive description of your personality or skills]. I'm always looking for new challenges and opportunities to grow and learn. What do you think makes you a good fit for this role? I'm confident in my abilities and have a strong work ethic. I'm always eager to learn and improve, and I'm always willing to take on new challenges. What do you think makes you a good fit for this role? I'm a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. It is also famous for its rich history, including the French Revolution and the French Revolution Museum. Paris is a cultural and economic hub, with a diverse population and a vibrant nightlife. It is a popular tourist destination, with millions of visitors annually. The city is also home to many museums, including the Louvre, the Musée d'Orsay, and the Musée d'Art Moderne. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly. Its reputation as a city

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased automation: AI is likely to become more prevalent in various industries, from manufacturing to healthcare, and will continue to automate tasks that are currently performed by humans. This will lead to increased efficiency and productivity, but it will also create new job opportunities.

2. AI will become more integrated with other technologies: AI will continue to be integrated with other technologies, such as machine learning, natural language processing, and computer vision, which will lead to even more sophisticated and complex AI systems.

3. AI will become more ethical: As AI becomes more integrated into our daily lives, there will be
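
"Overlap removal" matters when a streamed chunk repeats text that has already been received. The sketch below shows one way such merging could work; `strip_overlap` and `merge_chunks` are hypothetical names, and this is not necessarily how `stream_and_merge` is implemented.

```python
from typing import Iterable


def strip_overlap(existing: str, chunk: str) -> str:
    """Return the part of `chunk` that does not repeat the tail of `existing`."""
    max_len = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, dropping any prefix that duplicates prior text."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


print(merge_chunks(["The capital", "capital of", " of France"]))
```

One caveat of this purely textual approach: a chunk that legitimately repeats the preceding text (for example `"ha"` followed by `"ha"`) would also be trimmed.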



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  __________ and I'm a/an [insert occupation or profession]. I'm a/an [insert occupation or profession] and I'm an employee at [insert company name]. I've been at this job for [insert number] years and I've always [insert personal trait or achievement] here. I'm always here to help, and I'm always [insert value, like "dedicated", "brilliant", "outgoing", etc.]. I'm the best at [insert a job-related skill or hobby], and I'm always [insert a statement like "dedicated", "brilliant", "outgoing", "

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its historic landmarks, vibrant culture, and French cuisine. 

Reason for choosing Paris: It is the cultural, economic, and political center of France, hosting many of the world's most important monuments and events. It is also a major tourist destination an
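
Because `async_generate` is awaited, several independent requests can be in flight on one event loop. The sketch below submits multiple prompt groups with `asyncio.gather`, assuming the engine accepts concurrent calls. `StubAsyncEngine` and `generate_groups` are hypothetical; the stub sleeps briefly to simulate latency and stands in for the real engine.

```python
import asyncio
from typing import Any


class StubAsyncEngine:
    """Hypothetical stand-in that mimics awaiting `llm.async_generate`."""

    async def async_generate(
        self, prompts: list[str], sampling_params: dict[str, Any]
    ) -> list[dict[str, str]]:
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def generate_groups(
    engine: Any, groups: list[list[str]], sampling_params: dict[str, Any]
) -> list[list[dict[str, str]]]:
    """Submit every prompt group at once and wait for all of them."""
    pending = [engine.async_generate(group, sampling_params) for group in groups]
    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*pending)


groups = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(
    generate_groups(StubAsyncEngine(), groups, {"temperature": 0.8})
)
for group, outputs in zip(groups, results):
    for prompt, output in zip(group, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```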

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name] and I'm a [Your Profession]. I've always been fascinated by the idea of solving problems and making things work better. I've taken a passion for computer science and have honed my skills to a high level. I'm always looking for new ways to use technology to make the world a better place. I'm confident in my abilities and look forward to helping others. How would you describe your character? Yes, I could describe myself as a driven and determined individual who is always looking for new ways to make a difference. I'm a problem solver and enjoy helping others find solutions to their challenges. I'm confident

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its beautiful architecture, vibrant culture, and historical significance. It is home to the Eiffel Tower, the Louvre Museum, and the Palace of Versailles. Paris also hosts various annual festivals and events, such as the Mardi Gras celebrations. The city is known for its diverse cuisine, including traditional French dishes like croissants and pastries, as well as international cuisine. Paris is a bustling and dynamic city with a rich history and a wide range of attractions for tourists and locals alike. Its status as the world's most important capital city is reflected in its elegant architecture, vibrant culture, and rich history

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain and there are many different trends that could potentially shape the direction of the technology. Here are some potential future trends in AI that have the potential to significantly impact the field:

1. Increased automation and artificial general intelligence: As AI continues to get more sophisticated and powerful, it's likely that we'll see even more automation and the development of artificial general intelligence (AGI). This could lead to a shift in job responsibilities and the creation of entirely new industries that rely on AI-driven automation.

2. Enhanced human-AI collaboration: AI is becoming more sophisticated, and it's likely that we'll see more human-AI collaboration in the
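
The streaming loop above prints each chunk as it arrives. If the complete text is also needed afterwards, the chunks can be accumulated while they are printed. In the sketch below, `fake_stream` is a hypothetical async generator standing in for `async_stream_and_merge(llm, prompt, sampling_params)`, and `stream_and_collect` is an illustrative helper, not part of SGLang.

```python
import asyncio
from typing import AsyncIterator


async def fake_stream(text: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for an async chunk stream; yields word-sized chunks."""
    for i, word in enumerate(text.split(" ")):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word if i == 0 else " " + word


async def stream_and_collect(stream: AsyncIterator[str]) -> str:
    """Print chunks as they arrive and return the concatenated text."""
    parts: list[str] = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # newline after the last chunk
    return "".join(parts)


full_text = asyncio.run(
    stream_and_collect(fake_stream("Paris is the capital of France"))
)
```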




In [6]:
llm.shutdown()
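
In a script, it can help to make sure `shutdown()` runs even when generation raises. A minimal sketch using `try`/`finally` follows; `StubEngine` and `generate_then_shutdown` are hypothetical and only mirror the `generate` and `shutdown` calls used in this document.

```python
from typing import Any


class StubEngine:
    """Hypothetical stand-in exposing `generate` and `shutdown` like the cells above."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def generate(
        self, prompts: list[str], sampling_params: dict[str, Any]
    ) -> list[dict[str, str]]:
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": " <completion>"} for _ in prompts]

    def shutdown(self) -> None:
        self.is_shut_down = True


def generate_then_shutdown(engine: Any, prompts: list[str]) -> list[dict[str, str]]:
    try:
        return engine.generate(prompts, {"temperature": 0.8})
    finally:
        # Runs on success and on error, so the engine is always released.
        engine.shutdown()


engine = StubEngine()
try:
    generate_then_shutdown(engine, [])
except ValueError as exc:
    print(f"generation failed: {exc}")
print("shut down:", engine.is_shut_down)
```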