# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
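
To give a rough idea of what sits between a web framework and the engine, here is a minimal, framework-agnostic sketch of a request handler. It is not the linked `custom_server.py`. The function name `handle_generate_request`, the JSON field names, and the `EchoEngine` stand-in are all assumptions made for illustration; the only thing taken from this page is that `generate(prompts, sampling_params)` returns a list of dicts with a `"text"` key.

```python
import json


def handle_generate_request(engine, body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response using an engine.

    `engine` is any object with a `generate(prompts, sampling_params)` method
    that returns a list of dicts with a "text" key.
    """
    payload = json.loads(body)
    prompts = payload["prompts"]
    sampling_params = payload.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    results = [
        {"prompt": prompt, "text": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]
    return json.dumps({"results": results}).encode("utf-8")


class EchoEngine:
    """Hypothetical stand-in so the sketch runs without a GPU or a model."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" (echo) {p}"} for p in prompts]


request = json.dumps({"prompts": ["Hello"], "sampling_params": {"temperature": 0.0}})
response = handle_generate_request(EchoEngine(), request.encode("utf-8"))
print(response.decode("utf-8"))
```

In a real server, the web framework would call a function like this from its route handler, passing a single long-lived engine instance in place of `EchoEngine`.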



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()
```
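
If you are unsure whether your code is already running inside an event loop, the standard library can tell you. The sketch below uses only `asyncio`. The helper name `in_running_loop` is made up for this example and is not part of SGLang or `nest_asyncio`.

```python
import asyncio


def in_running_loop() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False
    return True


async def check_inside() -> bool:
    return in_running_loop()


print(in_running_loop())            # called from plain synchronous code
print(asyncio.run(check_inside()))  # called from within a coroutine
```

When the helper returns `True` (as it typically does in a notebook cell), a nested `asyncio.run(...)` call would fail, which is the situation `nest_asyncio.apply()` is meant to work around.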

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Rong and I'm a Korean-American. I'm a college graduate and have been working in finance and consulting for 2 years. My expertise is in the financial and financial markets, including stocks and bonds. I'm an experienced trader and have used technical analysis to make decisions. My main goal is to improve my skills and knowledge, and to learn new things. My aim is to become a successful financial professional.

I hope to find a job that allows me to provide financial services to my clients. I'm interested in a job that involves market analysis, risk management, and investment strategy. I'm looking for a job that can provide me
Prompt: The president of the United States is
Generated text:  trying to decide whether to use the military to restore order in a country that has been in civil war.  Given a choice of hormones, the gland that secretes insulin in the pancreas and regulates blood sugar levels is most likely to help the president restore ord
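
In offline batch jobs the results usually need to be persisted instead of printed. The following sketch writes each prompt/output pair as one JSON line. The helper `write_results_jsonl` and the hard-coded `outputs` list are illustrative assumptions; the list only mimics the shape returned by `llm.generate` above (a list of dicts with a `"text"` key).

```python
import io
import json


def write_results_jsonl(prompts, outputs, fp) -> int:
    """Write one JSON object per prompt/output pair and return the count."""
    count = 0
    for prompt, output in zip(prompts, outputs):
        fp.write(json.dumps({"prompt": prompt, "text": output["text"]}) + "\n")
        count += 1
    return count


# Stand-in for the list returned by `llm.generate(prompts, sampling_params)`.
prompts = ["Hello, my name is", "The capital of France is"]
outputs = [{"text": " Ada."}, {"text": " Paris."}]

buffer = io.StringIO()  # swap for open("results.jsonl", "w") to write a file
written = write_results_jsonl(prompts, outputs, buffer)
print(written)
print(buffer.getvalue(), end="")
```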

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your job or profession]. I enjoy [insert a short description of your hobbies or interests]. I'm always looking for new challenges and opportunities to grow and learn. What's your favorite hobby or activity? I love [insert a short description of your favorite activity]. I'm always looking for new ways to challenge myself and expand my horizons. What's your favorite book or movie? I love [insert a short description

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French National Library, and the French National Opera. Paris is a cultural and historical center with a rich history dating back to the Roman Empire and the French Revolution. The city is known for its fashion, cuisine, and art, and is a major tourist destination. It is also home to the French Parliament, the French National Library, and the French National Opera. Paris is a cultural and historical center with a rich history dating back to the Roman Empire and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that could be expected in the future:

1. Increased automation and robotics: As AI technology continues to advance, we can expect to see more automation and robotics in various industries, from manufacturing to healthcare. This could lead to increased efficiency and productivity, but it could also lead to job displacement for some workers.

2. AI-powered healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI technology continues to advance, we can expect to see even more personalized
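
The "overlap removal" mentioned in the example refers to merging streamed chunks so that text repeated at a chunk boundary appears only once. The sketch below illustrates that idea in plain Python. It is an assumption about the general technique, not a copy of `sglang.utils.stream_and_merge`, and the names `trim_overlap` and `merge_chunks` are chosen for this example.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of `new_chunk` that `existing_text` already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, removing text duplicated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(merge_chunks(["The capital", " capital of", " of France"]))
```

One trade-off of this approach: a chunk that legitimately repeats the end of the previous text (for example `"ha"` followed by `"ha"`) is indistinguishable from an overlap and would be trimmed.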



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a computer programmer. I love coding and problem-solving. My best skill is being able to solve problems quickly and efficiently. I'm passionate about technology and always want to learn new things. I'm always looking for new challenges and ideas to try. Thank you for asking about me! If you have any questions or need any information, feel free to ask me. What's your name? How can I help you today? What's your name? Good morning, and welcome to [Company Name]. I'm [Name] from [Company Name], and I'm excited to meet you and help you in any way

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its architecture, art, and cuisine. It is a bustling and cosmopolitan city with a rich cultural heritage and world-class museums, monuments, and restaurants. Paris is the political, economic, 
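
An asynchronous interface is mostly useful when requests arrive independently, for example from separate handlers in a custom server. The sketch below shows one way to issue several single-prompt requests concurrently and keep the results aligned with the prompts. `FakeAsyncEngine` is a hypothetical stand-in so the code runs without a model; it assumes an `async_generate(prompt, sampling_params)` coroutine that accepts a single prompt string and returns a dict with a `"text"` key.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in exposing an `async_generate` coroutine."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do work
        return {"text": f" <completion for {prompt!r}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    """Issue one request per prompt and wait for all of them together."""
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather preserves input order, so results line up with prompts.
    return await asyncio.gather(*tasks)


prompts = ["first", "second", "third"]
results = asyncio.run(
    generate_concurrently(FakeAsyncEngine(), prompts, {"temperature": 0.8})
)
for prompt, result in zip(prompts, results):
    print(prompt, "->", result["text"])
```

When all prompts are known up front, passing the whole list in one call, as in the example above, is the simpler option.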

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [Age] year old. I am a computer scientist with a passion for [Your Area of Expertise]. I have a knack for solving problems and coming up with creative solutions. I'm excited to help you navigate the challenges you face and achieve your goals. Let's get started, [Name].

Remember to keep your interactions friendly and constructive, and to show your enthusiasm for solving problems and coming up with creative solutions. I'm looking forward to helping you with whatever you need. Welcome to [Your Name], your unique problem solver, here to make your life easier and your projects more successful. I



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also a cultural hub with numerous museums, theaters, and cafes, making it a popular tourist destination. The city is known for its bustling street life, diverse food options, and its role as a major economic and political center in Europe. The French government and cultural institutions play a significant role in preserving and promoting Paris's heritage and identity. Paris is often referred to as "la ville blanche" (white city) due to its snowy white buildings and architecture. The city is also famous for



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  vast and diverse, with many different trends developing in the next few decades. Here are some of the most notable trends that are likely to shape the field in the coming years:

1. Faster and更低的成本 of AI: As we get more data and improve the quality of algorithms, the costs of AI will decrease. This will make it more accessible to businesses and governments, allowing them to create and use AI more easily.

2. More advanced models and techniques: The AI field is rapidly evolving, with new models and techniques emerging all the time. As a result, it is likely that we will see even more powerful algorithms in the future.
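
The streaming loop in this section handles one prompt at a time. If several streams should make progress together, one option is to give each stream its own consumer coroutine and run the consumers with `asyncio.gather`. The sketch below demonstrates the pattern with `fake_stream`, a hypothetical async generator that stands in for a streaming response; `collect` and `consume_all` are likewise names chosen for this example.

```python
import asyncio


async def fake_stream(text, delay):
    """Hypothetical stand-in for a streaming response: yields words with a delay."""
    for word in text.split():
        await asyncio.sleep(delay)
        yield word + " "


async def collect(index, stream, sink):
    """Append (index, chunk) pairs to `sink` as chunks arrive; return the full text."""
    parts = []
    async for chunk in stream:
        sink.append((index, chunk))
        parts.append(chunk)
    return "".join(parts)


async def consume_all():
    sink = []  # chunks from all streams, in arrival order
    streams = [fake_stream("a b c", 0.01), fake_stream("x y", 0.015)]
    texts = await asyncio.gather(
        *(collect(i, s, sink) for i, s in enumerate(streams))
    )
    return texts, sink


texts, sink = asyncio.run(consume_all())
print(texts)
print(sink)
```

Tagging each chunk with its stream index keeps interleaved output attributable to the right prompt, which matters once chunks from different prompts arrive in arbitrary order.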






```python
llm.shutdown()
```