# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example in the form of a Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
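
To illustrate the idea of a custom server layer, here is a minimal, framework-agnostic sketch of a request handler. `StubEngine` and `handle_request` are hypothetical names introduced only for this example; the stub mimics the `async_generate` call used later in this document so the snippet runs without a GPU. The JSON request and response shapes are also assumptions, not an SGLang protocol.

```python
import asyncio
import json


class StubEngine:
    """Hypothetical stand-in for the engine; it only mimics async_generate."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # pretend that generation takes some time
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def handle_request(engine, body: bytes) -> bytes:
    """Decode a JSON request, run generation, and encode a JSON response."""
    payload = json.loads(body)
    prompts = payload["prompts"]
    sampling_params = payload.get("sampling_params", {})
    outputs = await engine.async_generate(prompts, sampling_params)
    texts = [out["text"] for out in outputs]
    return json.dumps({"texts": texts}).encode("utf-8")


# Simulate one incoming request body and print the encoded response.
request = json.dumps({"prompts": ["The capital of France is"]}).encode("utf-8")
response = asyncio.run(handle_request(StubEngine(), request))
print(response.decode("utf-8"))
```

In a real server, a handler of this kind would be called from whatever web framework you choose, with the stub replaced by an actual engine instance.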



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
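
The patch is needed because `asyncio` refuses to start a new event loop from inside one that is already running, which is the situation in a notebook. The following standard-library sketch reproduces that error without SGLang; `inner` and `outer` are illustrative names.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # Inside this coroutine an event loop is already running,
    # which is similar to executing code in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempt to start a nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```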

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation
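
The cell below shows that `generate` returns one result per prompt, each exposing the generated string under the `"text"` key. As a small sketch of what to do with such a batch, the snippet here pairs prompts with results and serializes them as JSON Lines. The `outputs` list is hand-written placeholder data with the same shape, not real model output, and the record field names are assumptions.

```python
import io
import json

prompts = ["The capital of France is", "The future of AI is"]

# Placeholder results shaped like the engine's return value (list of dicts).
outputs = [{"text": " <completion 0>"}, {"text": " <completion 1>"}]

# Results are paired with prompts by position, so the lengths must match.
assert len(prompts) == len(outputs)

buffer = io.StringIO()  # stands in for a file opened for writing
for prompt, output in zip(prompts, outputs):
    record = {"prompt": prompt, "completion": output["text"]}
    buffer.write(json.dumps(record) + "\n")

records = buffer.getvalue().splitlines()
print(records[0])
```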

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  [insert name], and I am a Senior at [insert school name] in [insert city, country]. I am currently a senior at [insert school name] in [insert city, country]. My focus is on [insert relevant field of study or area of interest], and I am passionate about [insert your favorite hobby or extracurricular activity].

My hobbies include [insert hobbies, such as playing sports, learning languages, or exploring new places]. I am also a member of [insert club or organization], and I love [insert club or organization's club name, such as the chess club or the art club].

My extr
Prompt: The president of the United States is
Generated text:  a person; an academician is an academician; a scientist is an academician. Is the relationship between 'president of the United States, academician, scientist' a one-to-many relationship?
A. Yes
B. No
C. Cannot be determined
D. Sometimes Yes, Sometimes No

To determine the relationship between the roles of president o

### Streaming Synchronous Generation
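
The cell below relies on `stream_and_merge` to combine streamed chunks while removing overlapping text. The following is a minimal sketch of that idea under the assumption that a chunk may repeat text that was already emitted. `trim_overlap` and `merge_chunks` are hypothetical helpers written for this example and are not presented as the library's implementation; the chunks are hand-written.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the tail of existing."""
    max_len = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunk texts, dropping any overlap with the text so far."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk["text"])
    return merged


# Hand-written chunks; each one repeats part of the text before it.
fake_stream = [
    {"text": " Paris"},
    {"text": " Paris is"},
    {"text": " is the capital"},
]
print(repr(merge_chunks(fake_stream)))
```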

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Job Title] at [Company Name]. I'm a [Number] year old [Gender] [Race] [Nationality] from [City, State]. I'm a [Number] year old [Gender] [Race] [Nationality] from [City, State]. I'm a [Number] year old [Gender] [Race] [Nationality] from [City, State]. I'm a [Number] year old [Gender] [Race] [Nationality] from [City, State]. I'm a [Number] year old [Gender] [Race] [National

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic Eiffel Tower and vibrant cultural scene. It is also the birthplace of the French Revolution and the seat of the French government. Paris is a major cultural and economic center, with a rich history dating back to the Roman Empire and the French Renaissance. The city is home to many famous landmarks and attractions, including the Louvre Museum, Notre-Dame Cathedral, and the Champs-Élysées. Paris is a popular tourist destination, with millions of visitors annually. The city is also known for its cuisine, including French cuisine and international dishes. Paris is a city of contrasts, with its modern architecture

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that could be expected in the future:

1. Increased automation: One of the most significant trends in AI is the increasing automation of tasks that are currently performed by humans. This could include tasks such as data analysis, decision-making, and problem-solving. As AI becomes more advanced, it is likely to be able to perform these tasks more efficiently and accurately than humans.

2. Improved privacy and security: As AI becomes more integrated into our daily lives, there will be an increasing need for



### Non-streaming Asynchronous Generation
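
One reason to use the asynchronous API is that other coroutines can make progress while a generation call is pending. The sketch below illustrates this with a hypothetical `StubEngine` that only mimics the `async_generate` call used in the cell below; `heartbeat` is an illustrative background task, and the sleep durations are arbitrary.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; it only mimics async_generate."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend that generation takes some time
        return [{"text": f" <completion {i}>"} for i, _ in enumerate(prompts)]


async def heartbeat(log):
    """Unrelated work that runs while generation is pending."""
    for tick in range(3):
        log.append(f"tick {tick}")
        await asyncio.sleep(0)


async def main():
    engine = StubEngine()
    log = []
    # gather runs both coroutines concurrently and keeps the result order.
    outputs, _ = await asyncio.gather(
        engine.async_generate(["a", "b"], {"temperature": 0.8}),
        heartbeat(log),
    )
    return outputs, log


outputs, log = asyncio.run(main())
print(len(outputs), log)
```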

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [职业/职位] at [Company]. I'm excited to meet you and learn more about our company and our products. How can I help you today? I look forward to meeting you. 

(Note: Feel free to replace "Name" and "Company" with your own names and organizations, and replace "I'm a [职业/职位] at [Company]" with the correct format for your character's profession.)

Overall, your self-introduction should be polite, professional, and show that you have a positive attitude towards your character's profession and company. It's also important to be concise and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. The city is also renowned for its rich culture and culinary traditions, with a strong emphasis on French cuisin

### Streaming Asynchronous Generation
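
The cell below prints cleaned chunks as they arrive from `async_stream_and_merge`. To show the consumption pattern on its own, here is a sketch that turns a stream of cumulative text into incremental pieces. It assumes each chunk carries the full text generated so far; `stub_stream` and `stream_deltas` are hypothetical helpers, and the pieces are hand-written.

```python
import asyncio


async def stub_stream(pieces):
    """Hypothetical async stream yielding the cumulative text after each piece."""
    text = ""
    for piece in pieces:
        text += piece
        await asyncio.sleep(0)  # give other tasks a chance to run
        yield {"text": text}


async def stream_deltas(chunks):
    """Yield only the newly added suffix of each cumulative chunk."""
    seen = 0
    async for chunk in chunks:
        text = chunk["text"]
        yield text[seen:]
        seen = len(text)


async def main():
    deltas = []
    source = stub_stream([" Paris", ",", " the", " capital"])
    async for delta in stream_deltas(source):
        deltas.append(delta)
    return deltas


deltas = asyncio.run(main())
print("".join(deltas))
```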

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert your name] and I am a [insert a profession or role you play] at [insert a company name or industry] in [insert your hometown, city or country]. I am a [insert a brief description of your character's personality traits or background]. I am [insert a joke about your character or about the industry you play]. Let's be friends! [insert a friendly "Hi there!"] Hi there! I'm [insert your name], a [insert a profession or role you play] at [insert a company name or industry], and I'm [insert a brief description of your character's personality traits or


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the largest city and the seat of the French government and its leading cultural, financial, and political center. It is located on the banks of the River Seine and was founded in the 12th century. It is known for its iconic landmarks, including the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe. The city is also home to many famous artists, including Vincent Van Gogh, Michelangelo, and Pablo Picasso. Paris is a vibrant and diverse city with a rich history and a thriving cultural scene, and it is a popular tourist destination. The city has


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  an exciting and rapidly evolving field, with new developments constantly emerging. Here are some potential future trends in AI:

1. Increased Integration: As AI becomes more sophisticated, we can expect to see even more seamless integration between different AI technologies and systems. For example, we may see more seamless integration between chatbots, voice assistants, and wearable devices.

2. Autonomous Learning: Future AI systems will likely be more able to learn and adapt on their own, without human intervention. This will involve a range of AI techniques such as deep learning, reinforcement learning, and probabilistic modeling.

3. Personalized Learning: As AI becomes more accessible, we




In [6]:
llm.shutdown()