# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
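
To make the "custom server" idea concrete, here is a minimal sketch of the request-handling core such a server might have. It is not the linked `custom_server.py`: `StubEngine`, `handle_generate`, and the JSON field names (`prompts`, `sampling_params`, `completions`) are all hypothetical, and the stub only mimics the `generate(prompts, sampling_params)` call shape used later in this document so the sketch runs without a GPU.

```python
import json


class StubEngine:
    """Hypothetical stand-in for an engine; it only mimics the call shape."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def handle_generate(engine, raw_body: bytes):
    """Turn a JSON request body into (status_code, JSON response bytes)."""
    try:
        payload = json.loads(raw_body)
        prompts = payload["prompts"]
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected a JSON object with a 'prompts' list"}
        return 400, json.dumps(error).encode()

    if not isinstance(prompts, list) or not all(isinstance(p, str) for p in prompts):
        error = {"error": "'prompts' must be a list of strings"}
        return 400, json.dumps(error).encode()

    # Assumed request field; defaults to an empty dict when absent.
    sampling_params = payload.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    body = {"completions": [output["text"] for output in outputs]}
    return 200, json.dumps(body).encode()


status, body = handle_generate(StubEngine(), b'{"prompts": ["Hello, my name is"]}')
print(status, body.decode())
```

Keeping the handler a plain function of `(engine, bytes)` means any HTTP framework could call it, and it can be tested without opening a socket.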



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
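
The patch is needed because `asyncio.run` refuses to start when an event loop is already running in the current thread, as is typically the case inside a Jupyter notebook cell. The following stdlib-only sketch reproduces that error without SGLang; the function names `inner` and `outer` are illustrative.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # This coroutine runs inside an event loop, much like code in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a second, nested event loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` beforehand patches `asyncio` so that such nested calls are permitted.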

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Alex and I am studying at a university. What is your name and what subjects are you currently taking? Hello there! My name is Alex, and I am currently in my third year of studying at a university. As for my subjects, I am currently taking courses in computer science, mathematics, and a few other humanities and social sciences. How can I assist you further? Do you have any specific questions about my studies or information about universities in general? It would be great to have a more personalized conversation if you'd like. Best of luck with your studies! How are you doing on your self-care routine? I understand you might be
Prompt: The president of the United States is
Generated text:  in New York for an important meeting. His telephone number is 212-555-1111. Two of the numbers in the sequence are: 212-555-0000 and 212-555-1011. The president's phone number is not 212-555-1111. What is the smallest possible value of the president's phone nu
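
The `sampling_params` dictionary above sets `temperature` and `top_p`. As a rough illustration of what these two settings conventionally mean, the toy sketch below applies temperature scaling and nucleus (top-p) filtering to a hand-written list of scores. It is not SGLang's sampler; the function names and the `toy_logits` values are made up for this example.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits / temperature; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four candidate tokens
probs = apply_temperature(toy_logits, temperature=0.8)
nucleus = top_p_filter(probs, top_p=0.95)
print([round(p, 3) for p in nucleus])
```

With these toy numbers the least likely token falls outside the nucleus and receives probability zero, so it can never be sampled.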

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history dating back to the Roman Empire and the Middle Ages. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also famous for its cuisine, fashion, and art scene. Paris is a vibrant and dynamic city with a diverse population and a rich cultural heritage. It is a popular tourist destination and a major economic center in Europe. Paris is a city that has been a center of politics, culture, and art for centuries and continues to be a major hub for

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and decision-making processes.

2. Enhanced cognitive abilities: AI is likely to become more capable of processing and understanding complex information, allowing it to perform tasks that were previously impossible for humans.

3. Autonomous and semi-autonomous systems: AI is likely to become more autonomous and semi-autonomous, allowing machines to make decisions and take actions without human intervention.

4. Improved privacy and security: AI is likely to become more transparent and secure, with better privacy
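
The cell above relies on `stream_and_merge` to combine streamed chunks "with overlap removal". A minimal sketch of that idea, assuming each chunk is a dict with a `"text"` field that may repeat the tail of what has already been received, could look as follows. The helper names `trim_overlap` and `merge_stream` and the `fake_stream` data are illustrative, and the real helper in `sglang.utils` may differ.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks) -> str:
    """Concatenate chunk texts, removing overlap with the text merged so far."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk["text"])
    return merged


# Hypothetical stream in which later chunks repeat part of the earlier text.
fake_stream = [
    {"text": " Paris"},
    {"text": " Paris, also"},
    {"text": " also known as"},
]
print(repr(merge_stream(fake_stream)))
```

One trade-off of suffix/prefix matching is that a genuinely repeated word at a chunk boundary would also be trimmed, so this approach only makes sense when chunks are expected to overlap.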



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm a [character type] who has been following you for [number of years] years. I'm [number of years] years old and love to [describe your hobbies or interests]. I have a knack for [describe your ability or skill] and enjoy [describe your hobbies or interests]. I'm a [number of years] year old person who has been following you for [number of years] years. I'm [number of years] years old and love to [describe your hobbies or interests]. I have a knack for [describe your ability or skill] and enjoy [describe your hobbies or interests

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light and the City of Gardens.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  bound to be diverse and complex, but there are ma
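
The point of the asynchronous form is that the caller can `await` the engine alongside other coroutines. The sketch below shows that calling pattern with a stand-in object so it runs without a GPU: `StubEngine` and `run_batch` are hypothetical, and only the `async_generate(prompts, sampling_params)` call shape is taken from the cell above. Whether issuing concurrent calls suits a real engine is an assumption, not something this document states.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for an engine; it only mimics the call shape."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real awaitable would
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batch(engine, prompts, sampling_params):
    outputs = await engine.async_generate(prompts, sampling_params)
    return list(zip(prompts, outputs))


async def main():
    engine = StubEngine()
    params = {"temperature": 0.8, "top_p": 0.95}
    # Two independent batches are awaited concurrently on one event loop.
    batch_a, batch_b = await asyncio.gather(
        run_batch(engine, ["The capital of France is"], params),
        run_batch(engine, ["The future of AI is", "Hello, my name is"], params),
    )
    return batch_a + batch_b


results = asyncio.run(main())
for prompt, output in results:
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```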

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Occupation] with [Number of Years] years of experience. I enjoy [Something about my profession or interests]. I'm [A question about my personality or personal qualities] and I strive to [A goal or a project I'm currently working on]. What can you tell me about yourself? [Include your answers to these questions]. Hey [Name], I'm excited to meet you and chat with you about [What topic you'd like to talk about]. [Name], what's up? [Name]! So, I'm [Name], an [Occupation], [Number of Years]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its beautiful architecture, iconic landmarks, and rich history. It's the largest city in the world, with a population of over 2 million people, and is home to some of the world's most famous museums, fashion shows, and other cultural institutions. The city is also famous for its annual Eiffel Tower and its annual Les Caffé du Midi festival, which features local cuisine, music, and wine. Paris is a bustling and dynamic city, with a rich and diverse culture and a cosmopolitan atmosphere that draws people from around the world to visit and explore. It's also a major international financial center,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  full of potential and exciting developments. Here are some possible future trends in AI:

1. Increased availability of AI-powered tools: As the cost and availability of AI technology continue to decrease, we can expect more AI-powered tools and applications to become more widely available. For example, more people may be able to access AI-powered tools for analysis and decision-making, such as AI-powered finance tools and chatbots.

2. Greater integration of AI into daily life: As AI becomes more widely integrated into our daily lives, we can expect to see more applications of AI being developed for various industries. For example, AI-powered self-driving cars and drones may
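
The cell above consumes the stream with `async for`, printing each piece as it arrives. The same consumption pattern can be tried without an engine by using a stand-in async generator. In this sketch `fake_token_stream` and `collect` are hypothetical names, and the chunks are assumed to be dicts with a `"text"` field that carry only the new text.

```python
import asyncio


async def fake_token_stream(tokens):
    """Hypothetical stand-in for a streaming engine call: yields one chunk at a time."""
    for token in tokens:
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield {"text": token}


async def collect(stream):
    """Consume an async stream chunk by chunk and return the concatenated text."""
    pieces = []
    async for chunk in stream:
        # Printing chunk["text"] with end="" and flush=True here would
        # render the text incrementally, as in the cell above.
        pieces.append(chunk["text"])
    return "".join(pieces)


text = asyncio.run(collect(fake_token_stream([" Paris", ",", " known", " for"])))
print(repr(text))
```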




In [6]:
llm.shutdown()