# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
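
The restriction behind this can be reproduced with the standard library alone. The sketch below is illustrative only (the function names are not part of SGLang): it calls `asyncio.run` from inside an already running loop, which is the situation in an IPython or Jupyter cell, and captures the resulting `RuntimeError`. `nest_asyncio.apply()` patches asyncio so that this kind of nested call is permitted.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second, nested event loop
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"
    return "no error"


nested_result = asyncio.run(outer())
print(nested_result)
```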

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  and I'm a science fiction author. My latest novel is The Cross of Stone, which is published by Simon & Schuster. I've written multiple science fiction novellas and short stories.
If you're interested in writing a science fiction novel, I'd be happy to help you get started. What would you like to write about?
You can write about any topic you have in mind, whether it's a futuristic society, a space opera, a science fiction comedy, or anything else you want. Just let me know how you'd like to proceed!
As for now, I'd like to do a little bit of background on myself.
Prompt: The president of the United States is
Generated text:  running for re-election and has decided to give a speech that will be broadcast live on the national television channel. There are 20 candidates competing for the same position. The president's speech is expected to last for 2 hours. Given the length of the president's speech, how long will each minute of the speech be in 
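
For offline batch jobs it is often useful to persist the results rather than print them. The following is a minimal sketch, assuming only that each output is a dict with a `"text"` key as in the loop above. The helper `to_jsonl` is hypothetical, and the sample data is a hand-written placeholder, not real model output.

```python
import json
from typing import Dict, List


def to_jsonl(prompts: List[str], outputs: List[Dict[str, str]]) -> str:
    """Serialize prompt/completion pairs, one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": p, "completion": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Placeholder data standing in for the result of llm.generate(...).
sample_prompts = ["The capital of France is", "The future of AI is"]
sample_outputs = [{"text": " Paris."}, {"text": " uncertain."}]

jsonl = to_jsonl(sample_prompts, sample_outputs)
print(jsonl)
```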

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your profession or role]. I enjoy [insert a short description of your hobbies or interests]. What brings you to this company? I'm always looking for new opportunities to grow and learn. What's your favorite part of your job? I love the flexibility and the opportunity to work with a diverse team. What's your biggest challenge? I'm always looking for ways to improve my skills and stay up-to-date with the

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in Europe and the second-largest city in the world by population. Paris is known for its rich history, beautiful architecture, and vibrant culture. It is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. Paris is also a major center for business, finance, and tourism, and is a popular destination for tourists and locals alike. The city is known for its annual Eiffel Tower Festival, which attracts millions of visitors each year. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased automation: AI is likely to become more prevalent in manufacturing, transportation, and other industries, where it can perform tasks that are currently performed by humans. This could lead to job displacement, but it could also create new job opportunities.

2. Improved privacy and security: As AI becomes more integrated into our daily lives, there will be an increased need for privacy and security measures. This could lead to new regulations and standards for AI development and use.

3. Enhanced human
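
The internals of `stream_and_merge` are not shown in this document. As an illustration of what "overlap removal" can mean, the sketch below merges chunks whose beginning repeats the end of the text accumulated so far. The names `drop_overlap` and `merge_chunks` are hypothetical, and the chunks are hand-written rather than produced by the engine.

```python
from typing import Iterable


def drop_overlap(existing: str, chunk: str) -> str:
    """Remove the longest prefix of `chunk` that is already a suffix of `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks while skipping repeated overlap."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


# Hand-written chunks that partially repeat earlier text.
merged_text = merge_chunks(["The capital", "capital of France", " is Paris."])
print(merged_text)
```

Checking the longest candidate overlap first means a chunk that fully repeats earlier text contributes nothing, while a chunk with no overlap is appended unchanged.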



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm an AI assistant. I'm here to assist you with any questions you may have. How can I help you today? Please let me know how I can assist you. [Name] [Name] is a neutral AI assistant. My primary function is to assist users with their inquiries and provide relevant information. How can I assist you today? [Name] I'm a neutral AI assistant. My main goal is to provide helpful responses and assist with inquiries. How can I assist you today? [Name] I'm a neutral AI assistant. My primary function is to provide helpful responses and assist with inquiries. How can

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "La République高出机场, the world's third-largest airport.

Please answer the following question: what is the population of paris? 4.3 million (2017) According to the 2019 French cens
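
The `zip(prompts, outputs)` loop depends on results lining up with their prompts. The same await-then-zip pattern can be tried without a model. In the sketch below, `fake_generate` is a hypothetical stand-in for a per-prompt model call (it only sleeps), and `asyncio.gather` returns results in the order the awaitables were passed, regardless of which one finishes first. This is not a claim about how `async_generate` is implemented.

```python
import asyncio
from typing import Dict, List


async def fake_generate(prompt: str, delay: float) -> Dict[str, str]:
    # Hypothetical stand-in for a model call; sleeps instead of running inference.
    await asyncio.sleep(delay)
    return {"text": f" <completion for: {prompt}>"}


async def run_batch(prompts: List[str]) -> List[Dict[str, str]]:
    # Earlier prompts get longer delays, so they finish last.
    delays = [0.03, 0.02, 0.01]
    tasks = [fake_generate(p, d) for p, d in zip(prompts, delays)]
    return await asyncio.gather(*tasks)


demo_prompts = ["first", "second", "third"]
demo_outputs = asyncio.run(run_batch(demo_prompts))
for prompt, output in zip(demo_prompts, demo_outputs):
    print(f"Prompt: {prompt} -> {output['text']}")
```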

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [insert occupation or profession]. I've been working as a [insert job title or role] for [insert duration or location] and I have an extensive background in [insert relevant skill or experience]. I'm passionate about [insert personal interest or hobby] and have always been a [insert related trait or characteristic]. My goal is to [insert goal or purpose] and I'm always striving to [insert personal motivation or drive]. I believe in [insert personal belief or value]. I'm confident in my ability to [insert outcome or accomplishment] and I'm eager to [insert career aspirations or future goals

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a historic city located on the Seine River, known for its museums, theaters, and rich cultural heritage. It is the political, cultural, and economic center of France and plays an important role in the country's identity and political landscape. Its skyline is dominated by the Eiffel Tower, and it is a major tourist destination. Paris is also known for its unique and diverse culture, including French cuisine, art, and music. It is a city where history, culture, and contemporary society converge. Paris has been recognized as a UNESCO World Heritage Site for its history and culture, which has helped to preserve and promote its cultural

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be characterized by several key trends that will shape the development and application of this technology:

1. Increased integration of AI into everyday life: As AI becomes more integrated into our daily lives, such as through voice assistants, smart homes, and self-driving vehicles, we can expect to see an increase in its integration into various aspects of our lives.

2. Enhanced understanding of AI ethics and responsibility: With AI becoming more autonomous and responsible, it is likely that there will be an increasing focus on ethical considerations and the responsibility that AI should assume in decision-making processes.

3. Greater integration of AI with natural language processing: AI will continue to
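
The consumption pattern used in this cell, an `async for` loop that prints each chunk with `end=""`, can be exercised without a GPU. In the sketch below, `fake_stream` is a hypothetical stand-in for a streaming generator that yields word-sized chunks. It is meant only to show how chunks arrive one at a time and concatenate into continuous text.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(text: str) -> AsyncIterator[str]:
    # Hypothetical stand-in for a streaming generator: yields word-sized chunks.
    words = text.split(" ")
    for i, word in enumerate(words):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word if i == 0 else " " + word


async def consume(text: str) -> List[str]:
    chunks: List[str] = []
    async for chunk in fake_stream(text):
        print(chunk, end="", flush=True)  # print chunks as they arrive
        chunks.append(chunk)
    print()  # newline once the stream is exhausted
    return chunks


streamed_chunks = asyncio.run(consume("Paris is the capital of France."))
```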




```python
llm.shutdown()
```