# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
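
To illustrate why the patch is needed, here is a minimal standard-library sketch of the kind of error that appears when `asyncio.run` is called while a loop is already running, as happens inside IPython. It does not use SGLang at all; `inner` and `outer` are illustrative names, and the exact error raised by the engine in your environment may differ.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # Inside this coroutine an event loop is already running,
    # which mirrors the situation in an IPython session.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: not allowed by default
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio.apply()` in place, nested calls of this kind are allowed, which is why the patch is applied before the engine is used.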

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Diandak. I work as a researcher in the field of urban planning and architecture. My research interests include landscape architecture, urban and public health, urban sustainability, and architectural sustainability. I am passionate about making a positive impact in the world and using my research to contribute to that impact. I am a part of the 2019-2020 Research Fellow Program at the University of Southampton's School of Architecture. My research has been supported by the European Union's Horizon Europe framework programme and was awarded the 'Dorothy K. Forsyth Prize' by the Urban Land Institute for the best poster presentation on urban
Prompt: The president of the United States is
Generated text:  trying to decide whether to go on the airplane or stay at home. He will choose to go on the airplane if it is a weekend. He will stay home if it is not a weekend. If the probability of it being a weekend is 0.4, and the probability of staying at h

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [age] year old, [gender] and I have [number] years of experience in [field of work]. I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [age] year old, [gender] and I have [number] years of experience in [field of work]. I'm a [job title

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a bustling metropolis with a rich history and a diverse population. The city is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also a major center for art, music, and literature, and is home to many famous museums, theaters, and restaurants. The city is known for its cuisine, including French cuisine, and is a popular tourist destination. Paris is a vibrant and dynamic city that continues to thrive as a major global city. The city is home to many cultural institutions, including the Louvre

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn and adapt to human behavior and preferences. This could lead to more personalized and adaptive AI systems.

2. Greater emphasis on ethical considerations: As AI becomes more integrated with human intelligence, there will be greater emphasis on ethical considerations, such as privacy, bias, and fairness. This could lead to more robust AI systems that are designed to be transparent, accountable, and responsible.

3. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce
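
The cell above prints "with overlap removal". As a rough sketch of what such a merge could look like, assume that successive streamed chunks may repeat the tail of the text already received. The helpers `trim_overlap` and `merge_chunks` below are illustrative and are not necessarily how `stream_and_merge` in `sglang.utils` is implemented; the sample chunks are made up.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found, keep the chunk as is


def merge_chunks(chunks):
    """Concatenate chunks while removing repeated overlap between neighbours."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Made-up chunks in which each one repeats part of the previous text.
chunks = [" Paris", " Paris, also", ", also known"]
print(repr(merge_chunks(chunks)))
```

Note that a heuristic like this cannot tell a repeated chunk from text that legitimately repeats, so it is only appropriate when the stream is known to contain overlaps.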



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a friendly, outgoing person who loves to share my thoughts and ideas. I love to read and write, and I'm always looking for new ways to express myself creatively. I'm also a bit of a puzzle master, and I love to solve puzzles with my thoughts and ideas. So if you ever come across a good one, I'd love to hear about it! What brings you to this world, and how would you like to spend your days? [Name] [Introduce yourself, including your role and any unique skills or passions you possess. Use a friendly, conversational tone and engage with the

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is a historical, cultural, and economic center of the country. The city is known for its rich history, vibrant culture, and famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Mu
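
One reason to prefer the asynchronous API is that several requests can be awaited together. The sketch below shows that pattern with `asyncio.gather`. It runs without a GPU because `FakeEngine` is a hypothetical stand-in that only imitates the `async_generate(prompts, sampling_params)` call shape used above; whether concurrent calls behave well on a real engine should be checked against the SGLang examples.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine, for illustration only."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to run inference
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Start one async_generate call per batch and wait for all of them.
    calls = [engine.async_generate(batch, sampling_params) for batch in batches]
    results = await asyncio.gather(*calls)
    # gather preserves the order of the calls, so flattening keeps prompt order.
    return [out["text"] for batch_out in results for out in batch_out]


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
texts = asyncio.run(run_batches(FakeEngine(), batches, {"temperature": 0.8}))
for text in texts:
    print(text)
```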

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I am a [Your Profession/Role] with over [X] years of experience in [Your Field/Industry]. I enjoy [X] and I love [X]. I am a [Your Personality/Character]. I am always [X] and always [X]. I am a [Your Values/Character Traits]. I am [Your Goal/Goal]. I am always [X] and always [X]. What is your name? [Your Name]. What is your profession? [Your Profession]. What is your role? [Your Role]. What is your field/industry? [Your Field/Industry].



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its iconic Eiffel Tower, romantic ambiance, and vibrant culture. The city is also home to the French Parliament and is considered the cultural and economic center of the country. French cuisine, particularly regional dishes like escargot and foie gras, is also famous in Paris. Paris is known for its fashion, film, and opera scenes, and the city is a UNESCO World Heritage site. The French language is widely spoken, and the city is home to many famous landmarks and historical sites, such as the Louvre and Notre-Dame Cathedral. Paris is a bustling and exciting city, known for its art



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to see a proliferation of small, complex, and autonomous systems that can operate in a variety of domains, from healthcare and transportation to security and manufacturing. Some possible future trends include:

1. Increased integration of AI into human decision-making processes: AI is already beginning to make its way into healthcare, where it can help predict and prevent disease, improve patient outcomes, and personalize treatment. In the future, we may see even more integration of AI into human decision-making processes, such as in finance, transportation, and manufacturing.

2. AI-powered automation of routine tasks: As AI continues to improve, we may see a growing number of routine
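
If you need the complete string rather than piecewise printing, the streamed chunks can be collected as they arrive. The following is a minimal sketch of that pattern; `fake_stream` is a hypothetical async generator standing in for a streaming engine call, and it assumes each chunk is a dict whose `"text"` field holds only the new text.

```python
import asyncio


async def fake_stream(pieces):
    """Hypothetical stand-in that yields chunks shaped like {"text": ...}."""
    for piece in pieces:
        await asyncio.sleep(0)  # give control back to the event loop
        yield {"text": piece}


async def collect(stream):
    # Accumulate every chunk and join once at the end.
    parts = []
    async for chunk in stream:
        parts.append(chunk["text"])
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream([" Paris", ",", " which", " is"])))
print(repr(full_text))
```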




In [6]:
llm.shutdown()