# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed, working example is available as a Python script in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
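
Whatever web framework is used, a custom server mostly reduces to a function that turns a request body into an engine call. The sketch below is a framework-agnostic illustration under stated assumptions, not the linked example: `handle_generate`, the `prompt` and `sampling_params` payload fields, and `StubEngine` are hypothetical names, and the stub stands in for `sgl.Engine` so the snippet runs without a GPU.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompt, sampling_params=None):
        # Mirrors the output shape used in this document: a dict with a "text" key.
        return {"text": f" <completion for {prompt!r}>"}


def handle_generate(engine, raw_body: str) -> tuple[int, str]:
    """Turn a JSON request body into an engine call; return (status, JSON body)."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        return 400, json.dumps({"error": "request body must be valid JSON"})

    prompt = payload.get("prompt") if isinstance(payload, dict) else None
    if not isinstance(prompt, str) or not prompt:
        return 400, json.dumps({"error": "'prompt' must be a non-empty string"})

    sampling_params = payload.get("sampling_params") or {}
    output = engine.generate(prompt, sampling_params)
    return 200, json.dumps({"text": output["text"]})


request_body = json.dumps(
    {"prompt": "The capital of France is", "sampling_params": {"temperature": 0.8}}
)
status, body = handle_generate(StubEngine(), request_body)
print(status, body)
```

Keeping validation and the engine call in one plain function makes the logic easy to test before it is wired into a real HTTP route.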



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
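
The restriction behind this requirement can be reproduced with the standard library alone. The sketch below (the names `inner` and `outer` are illustrative) calls `asyncio.run()` from inside a loop that is already running, which is the situation in IPython, and captures the resulting `RuntimeError`. `nest_asyncio.apply()` patches asyncio so that this kind of nested call is allowed.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in an IPython cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: plain asyncio rejects this
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```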

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Ryan White, and I am 17 years old. I am from the state of New Jersey. I graduated from the University of New Brunswick with a Bachelor of Science in Business Administration. I have been involved in music and music education for the past year and a half. I am currently the captain of the University of New Brunswick drum and percussion team. I am also the drum major and the captain of the University of New Brunswick jazz band.
As of now, I am at the University of New Brunswick, one of the universities in New Brunswick, New Brunswick, Canada. I was the captain of the university’s marching band. I have
Prompt: The president of the United States is
Generated text:  a popular candidate for the Democratic nomination for the 2012 presidential election. He has a favorable opinion of the electoral college, and he wants to maximize the probability of winning the election. In the year 2010, the presidential winner was John Kerry, who received 63% of the p
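
For very large prompt lists it can be convenient to submit prompts in chunks and consume results incrementally. The following is a minimal sketch under stated assumptions: `generate_in_chunks` and `StubEngine` are hypothetical names, and the stub only mimics the output shape seen above (one dict with a `text` key per prompt) so the snippet runs without a GPU. Whether chunking helps in practice depends on the workload, since the engine already schedules batches itself.

```python
from typing import Iterator


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; it only mimics the output shape."""

    def generate(self, prompts, sampling_params=None):
        # One dict per prompt, each with a "text" key, as in the loop above.
        return [{"text": f" <completion {i}>"} for i, _ in enumerate(prompts)]


def generate_in_chunks(
    engine, prompts, sampling_params, chunk_size: int
) -> Iterator[tuple[str, str]]:
    """Yield (prompt, generated_text) pairs, submitting chunk_size prompts per call."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = engine.generate(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            yield prompt, output["text"]


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
pairs = list(generate_in_chunks(StubEngine(), prompts, sampling_params, chunk_size=2))
for prompt, text in pairs:
    print(f"Prompt: {prompt}\nGenerated text: {text}")
```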

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Occupation]. I'm a [Skill] who has been [Number of Years] years in the field of [Field of Interest]. I'm a [Skill] who has been [Number of Years] years in the field of [Field of Interest]. I'm a [Skill] who has been [Number of Years] years in the field of [Field of Interest]. I'm a [Skill] who has been [Number of Years] years in the field of [Field of Interest]. I'm a [Skill] who has been [Number of Years] years in

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic Eiffel Tower, Notre-Dame Cathedral, and vibrant cultural scene. 

(Note: The statement is factually correct, but it's important to note that the Eiffel Tower is actually located in Paris, not in the city itself.) 

The statement is accurate, but it could be improved by including the fact that Paris is the capital of France, which is the largest country in Europe by area. Additionally, it's worth noting that Paris is known for its rich history, including the French Revolution and the Opéra Garnier, a famous opera house. Finally, it's important to note

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some potential future trends include:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and preferences.

2. Enhanced privacy and security: As AI becomes more integrated with human intelligence, there will be increased concerns about privacy and security. There will be a need for more robust privacy and security measures to protect the data and information that is generated and processed by AI.

3. Greater use of AI in healthcare: AI is already being used in healthcare
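
The cell above describes its streaming as using overlap removal. One way such overlap removal could work is sketched below: before appending a new chunk, drop the longest prefix of the chunk that already appears at the end of the accumulated text. This is an illustration only, not necessarily how `stream_and_merge` in `sglang.utils` is implemented, and the example chunks are made up.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks, skipping text that repeats the current tail."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each piece repeats the end of the previous one.
chunks = [" Paris is", " is the capital", " capital of France."]
merged = merge_chunks(chunks)
print(merged)
```

A purely textual check like this cannot tell an accidental overlap from a legitimate repetition: `trim_overlap("ha", "ha")` returns an empty string, so the approach only suits streams whose chunks are known to overlap.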



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a self-taught programming enthusiast who has been programming since childhood. I have a passion for creating games and am always looking for ways to push the limits of what's possible with programming. I'm always learning new programming languages and techniques, and I enjoy collaborating with other programmers to solve problems and create innovative solutions. I'm eager to contribute to the world of programming and keep improving my skills and knowledge. Is there anything I should know about me or my background before I speak to you? As an artificial intelligence, I don't have personal experiences or background, but I'm here to assist you with any questions

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

In 1995, the population of Paris was approximately 1.5 million. The city is locat
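
When prompts arrive from independent tasks rather than as one list, a common asyncio pattern is to issue one request per prompt and cap how many are in flight. The sketch below shows that pattern under stated assumptions: `generate_bounded` and `StubAsyncEngine` are hypothetical names, and the stub assumes that awaiting `async_generate` with a single prompt yields a dict with a `text` key. It stands in for `sgl.Engine` so the snippet runs without a GPU.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; only the awaitable call shape is mimicked."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate generation latency
        return {"text": f" <completion for {prompt!r}>"}


async def generate_bounded(engine, prompts, sampling_params, max_in_flight: int):
    """Run one request per prompt with at most max_in_flight running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        async with semaphore:
            output = await engine.async_generate(prompt, sampling_params)
            return prompt, output["text"]

    # gather returns results in the same order as the input prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
results = asyncio.run(
    generate_bounded(StubAsyncEngine(), prompts, {"temperature": 0.8}, max_in_flight=2)
)
for prompt, text in results:
    print(f"Prompt: {prompt}\nGenerated text: {text}")
```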

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name] and I'm a [insert profession or occupation]. I'm an [insert nationality] [insert religion, political affiliation, or any other characteristic]. I have a [insert favorite hobby or interest]. My [insert favorite word or phrase] is [insert favorite word or phrase]. I enjoy [insert why I enjoy what I do]. I'm a [insert age] [insert gender] [insert nationality] [insert profession or occupation] who is constantly learning and growing. I strive to be a [insert goal or aspiration]. Thank you. How about you? How would you like to introduce yourself? I'm [insert

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Explain how Paris contributes to its status as a major cultural and economic hub in Europe. The city is home to the Louvre Museum, the Eiffel Tower, the Palace of Versailles, and many other famous landmarks. It is also a major center of science, culture, and education, and has a long history of artistic and intellectual pursuits. Its status as a major cultural and economic hub is further supported by its status as a major international trade center, its status as a symbol of France's cultural and political importance, and its contributions to European and world history.

Which of the following pairs of facts about Paris

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain, but it is likely to continue to evolve and transform our lives in profound ways. Here are some potential future trends in AI that could shape the future of the industry:

1. Increased privacy and security concerns: As AI becomes more prevalent in our lives, there will be an increasing need to protect our privacy and security. AI systems may be designed with more advanced privacy and security features, and there may be an increased focus on using AI to protect our personal information.

2. AI becoming more integrated into our everyday lives: As AI becomes more integrated into our daily lives, it is likely that we will see more use of AI in areas


In [6]:
llm.shutdown()
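
In a script, as opposed to a notebook, it can be useful to guarantee that `shutdown()` runs even when generation raises an exception. The following is a minimal sketch of that idea using a context manager; `managed_engine` and `StubEngine` are hypothetical names, and the stub stands in for `sgl.Engine` so the snippet runs without a GPU.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether shutdown ran."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompt, sampling_params=None):
        return {"text": " <completion>"}

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(factory):
    """Create an engine with factory() and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


with managed_engine(StubEngine) as engine:
    text = engine.generate("The capital of France is")["text"]

print(text, engine.is_shut_down)
```

With the real engine, the factory could be a lambda that wraps the constructor call from the first cell, for example `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`.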