# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop (a nested loop), you need to add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
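
The restriction behind this requirement can be reproduced with the standard library alone: `asyncio.run()` refuses to start when an event loop is already running in the current thread, which is the situation inside IPython or Jupyter. The sketch below only demonstrates that error. It does not use SGLang, and the coroutine names `inner` and `outer` are illustrative.

```python
import asyncio


async def inner():
    # Stands in for any coroutine you would like to run.
    return "done"


async def outer():
    # We are already inside a running event loop here, like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: not allowed by plain asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Patching the loop with `nest_asyncio.apply()` is what allows such nested calls to succeed.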

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Gennady and I am a retired professor from Saint Petersburg University. I teach university and college students in the field of statistics. I have a lot of experience in academic and business statistics. I have taught statistics courses both online and in person, and I have been a teaching assistant for several classes. I have worked with students of all ages and all levels of ability. My goal is to provide students with clear, concise, and helpful information. I have a lot of experience in statistics, and I am always eager to learn about new statistics topics. I am a member of the Statistics Society of the Federal Republic of Russia and the Russian Statistical
Prompt: The president of the United States is
Generated text:  very busy preparing for the annual Christmas and New Year's holidays. The United States consists of 50 states and the District of Columbia. The president has to travel to 50 states and 10000 miles each day from January 1st to
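
In an offline batch job, the results are usually stored rather than printed. The following is a minimal sketch of pairing each prompt with its completion and serializing the pairs as JSON Lines. It assumes only what the cell above shows, namely that each output is a dict with a `"text"` key. The `outputs` list here is a hand-written placeholder, not real model output, and the record field names are illustrative.

```python
import json

prompts = ["Hello, my name is", "The capital of France is"]

# Placeholder outputs with the same {"text": ...} shape as in the cell above.
# In real use this list would come from llm.generate(prompts, sampling_params).
outputs = [
    {"text": " <placeholder completion 1>"},
    {"text": " <placeholder completion 2>"},
]

# One record per prompt; zip keeps prompts and outputs aligned by position.
records = [
    {"prompt": prompt, "completion": output["text"]}
    for prompt, output in zip(prompts, outputs)
]

# JSON Lines: one JSON object per line, convenient for later processing.
jsonl = "\n".join(json.dumps(record, ensure_ascii=False) for record in records)
print(jsonl)
```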

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also home to the French Parliament and the French National Library. Paris is a bustling city with a rich cultural heritage and is a major tourist destination. It is the capital of France and the largest city in the European Union. It is also the birthplace of the French Revolution and the French Revolution is considered one of the most significant events in French history. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly. It is a city of art, culture, and history that is a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some possible future trends include:

1. Increased integration of AI into everyday life: AI is already being integrated into our daily lives, from voice assistants like Siri and Alexa to self-driving cars. As AI technology continues to advance, we can expect to see even more integration into our daily routines.

2. AI becoming more autonomous: As AI technology continues to improve, we can expect to see more autonomous vehicles on the road. This will likely lead to a reduction in accidents and a decrease in the use of human drivers.

3. AI
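
The cell above mentions "overlap removal". One way to picture it: if a streamed chunk starts with text that the merged output already ends with, that shared part is dropped before appending. The sketch below illustrates the idea with plain strings. The helper names `drop_overlap` and `merge_chunks` are hypothetical, and this is not necessarily how `stream_and_merge` is implemented.

```python
def drop_overlap(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that existing_text already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


# Illustrative chunks whose boundaries repeat " known" and " its".
merged = merge_chunks([" Paris, known", " known for its", " its iconic landmarks"])
print(repr(merged))
```

A trade-off of this approach is that it cannot tell an accidental overlap from a legitimate repetition: merging the chunks `"ha"` and `"ha"` yields `"ha"`, not `"haha"`.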



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I’m a [Job Title] with [Job Title] from [Company]. I am a [Job Title] with [Job Title] from [Company] and I’ve been working here for [Number of Years]. I’ve always loved being [Job Title] and [Job Title] from [Company] and I’ve always wanted to be [Job Title] from [Company]. I’m a [Job Title] with [Job Title] from [Company] and I’ve always loved being [Job Title] and [Job Title] from [Company]. I have a love for [Job Title] and [Job

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is also known as the "City of Light" for its unique artistic and cultural scene, and is home to the Eiffel Tower. The city is also famous for its vibrant nightlife, numerous museums and theaters, and its prestigious universities. Paris is a cultural, economic, and political center of France. Its history dates bac
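
A benefit of an awaitable generation call is that other coroutines can make progress while it is pending. The sketch below shows that pattern with `asyncio.gather`. `StubEngine` is a hypothetical stand-in that mimics only the call shape used above (`async_generate(prompts, sampling_params)` returning a list of dicts with a `"text"` key); it runs no model, and `heartbeat` is an illustrative placeholder for unrelated work.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; echoes prompts instead of running a model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.05)  # simulate time spent generating
        return [{"text": f"<completion for: {p}>"} for p in prompts]


async def heartbeat(ticks, interval):
    """Unrelated async work that keeps running while generation is pending."""
    count = 0
    for _ in range(ticks):
        await asyncio.sleep(interval)
        count += 1
    return count


async def main():
    engine = StubEngine()
    # gather runs both awaitables concurrently and returns results in argument order.
    outputs, beats = await asyncio.gather(
        engine.async_generate(["a", "b"], {"temperature": 0.8, "top_p": 0.95}),
        heartbeat(ticks=3, interval=0.01),
    )
    return outputs, beats


outputs, beats = asyncio.run(main())
print([o["text"] for o in outputs], beats)
```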

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [Occupation] who enjoys [Job Description]. My passion for my job is [Job Description], and I am always looking for ways to [Favorite Skill/Action]. Please let me know if you'd like to get to know me better by asking me questions about my life and experiences, or if you'd rather just meet me as a neutral character. [Name]: Hello, my name is [Name] and I am a [Occupation] who enjoys [Job Description]. My passion for my job is [Job Description], and I am always looking for ways to [Favorite Skill/Action]. Please



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Its official language is French, though English is widely spoken. French is the country's most spoken language, and it's also the official language of the European Union. Paris is a cosmopolitan metropolis with a rich cultural history dating back centuries. It's known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. The city also has a vibrant nightlife, and many visitors travel from around the world for its grandiose architecture and exquisite cuisine. Paris is an important center for finance, fashion, and the arts. It's a major transportation hub with several major airports. The



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by a number of different factors, including advancements in computing power, deep learning, machine learning, and other emerging technologies. Here are some potential future trends in AI that could potentially shape the future:

1. More sophisticated language models: As AI continues to learn and improve, it is likely to become even more capable of understanding and generating human language. This could lead to more sophisticated language models that can produce more coherent, nuanced, and contextually rich outputs.

2. Greater use of AI in healthcare: AI can be used to improve the accuracy and efficiency of diagnoses, to personalize treatment plans, and to monitor and manage chronic
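
When the streamed text is needed as a single string rather than printed chunk by chunk, the chunks can be accumulated while iterating. The following is a minimal sketch of that pattern. `stub_stream` is a hypothetical async generator standing in for the real stream, and it assumes the chunks are already free of overlaps, as the cleaned chunks in the cell above are meant to be.

```python
import asyncio


async def stub_stream(chunks):
    """Hypothetical stand-in for a streaming call; yields prepared chunks."""
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield chunk


async def collect(stream):
    """Consume an async stream and return the concatenated text."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


text = asyncio.run(collect(stub_stream([" Paris", ". Its", " official language"])))
print(repr(text))
```

Collecting into a list and joining once at the end avoids repeated string concatenation for long generations.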




```python
llm.shutdown()
```