# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
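
To see why this is needed, the following minimal sketch uses only the standard library to show the error raised by a nested `asyncio.run` call inside a loop that is already running. The coroutine names `inner` and `outer` are illustrative and not part of SGLang; `nest_asyncio.apply()` patches asyncio so that such nested calls are permitted.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # outer() itself runs inside an event loop, so a second asyncio.run()
    # call here is a nested call and raises RuntimeError.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Notebook kernels such as IPython already run an event loop, which is the situation that `outer` imitates here.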

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Kim, I’m 21 years old and I’m from the Philippines. I plan to study in the United States and I’m now stuck at my university. How can I get the best college experience?

I want to maximize my time in the US, but I also want to avoid boring me with classes and professors.

I want to do something that is different from the typical education and learn a lot about the culture of America. I know that I need to find a university that is the best for me and that's the main part of the question.

It's difficult for me to choose because I want to find a university that is
Prompt: The president of the United States is
Generated text:  trying to improve the quality of the nation's education system. He plans to implement a new program that will provide free tutoring to students in math and science subjects. The program will last for one year, starting with the 10th grade students in the state. The president wants to ensure that the tutoring program is effe
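
For very large prompt lists, one option is to submit the prompts in fixed-size chunks so that results can be processed or saved incrementally. The sketch below is only an illustration: `generate_in_chunks` and `fake_generate` are hypothetical names, and `fake_generate` merely stands in for `llm.generate` so that the snippet runs without a GPU. It assumes the generate call returns one dict with a `"text"` key per prompt, in prompt order, as in the example above.

```python
from typing import Callable, Dict, Iterator, List, Tuple


def generate_in_chunks(
    generate_fn: Callable[[List[str], dict], List[Dict[str, str]]],
    prompts: List[str],
    sampling_params: dict,
    chunk_size: int = 2,
) -> Iterator[Tuple[str, str]]:
    """Yield (prompt, generated_text) pairs, submitting chunk_size prompts at a time."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = generate_fn(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            yield prompt, output["text"]


def fake_generate(batch: List[str], sampling_params: dict) -> List[Dict[str, str]]:
    # Hypothetical stand-in for llm.generate: one dict per prompt, in order.
    return [{"text": f" <completion for: {p}>"} for p in batch]


prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

results = list(generate_in_chunks(fake_generate, prompts, sampling_params, chunk_size=2))
for prompt, text in results:
    print(f"Prompt: {prompt}\nGenerated text: {text}")
```

With a running engine, `llm.generate` would take the place of `fake_generate`. The engine already schedules requests within a batch, so chunking here is only a way to bound how many outputs are held at once.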

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Age] year old [Occupation]. I'm a [Skill or Hobby] enthusiast who loves to explore new places and learn new things. I'm always looking for new adventures and trying to make the most of every moment. I'm a [Favorite Activity] lover who enjoys hiking, skiing, and camping. I'm also a [Favorite Book] collector who loves to read and write. I'm a [Favorite Music] lover who listens to [Artist's Name] and [Artist's Name's Band Name] on my headphones. I'm a [Favorite Movie] fan who loves to watch [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French National Museum, and the French Quarter. Paris is a bustling metropolis with a rich history and a diverse population. The city is known for its fashion, art, and cuisine, and is a popular tourist destination. It is also home to many famous landmarks and attractions, including the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and decision-making processes.

2. Enhanced machine learning capabilities: AI is likely to become more capable of learning from large amounts of data, which will enable machines to make more accurate predictions and decisions.

3. Greater emphasis on ethical considerations: As AI becomes more integrated with human intelligence, there will be increased emphasis on ethical considerations, such as privacy, fairness, and accountability.

4. Increased use of AI in healthcare: AI is likely to be used more
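
The "overlap removal" mentioned in the cell above can be pictured as follows. This is a minimal sketch, not necessarily how `stream_and_merge` is implemented: it assumes each streamed chunk may begin with text that was already emitted, and the helper names `trim_overlap` and `merge_chunks` are illustrative.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that existing_text already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, skipping any text repeated at a chunk boundary."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Made-up chunks that partially repeat the text emitted so far.
chunks = [" Paris", " Paris, the city", " city known for", " its landmarks"]
merged = merge_chunks(chunks)
print(merged)
```

A suffix match like this cannot tell a repeated chunk from a genuine repetition in the generated text (for example `"ha"` followed by `"ha"`), so it is a heuristic rather than an exact reconstruction.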



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [occupation] at [organization]. I am passionate about [value proposition], and I'm determined to make a positive impact on the world through my work. I enjoy [social responsibility activities] and am always looking for opportunities to contribute to a positive change. My goal is to inspire others to take action and make a difference in their lives. I am always learning and growing, and I am dedicated to staying up-to-date on the latest trends and technologies in the field. What's your name and what's your occupation? I'm [Name] at [organization]. What can you tell me about yourself?

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its vibrant art scene, historic Notre Dame Cathedral, and vibrant street life. Paris is a popular tourist destination and is also home to 
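
Because `async_generate` is awaited, several requests can in principle be awaited concurrently, for example with different sampling parameters per request. The sketch below only illustrates the `asyncio.gather` pattern: `fake_async_generate` is a hypothetical stand-in for `llm.async_generate`, so the snippet runs without an engine, and it assumes that a single-prompt call returns a dict with a `"text"` key.

```python
import asyncio


async def fake_async_generate(prompt: str, sampling_params: dict) -> dict:
    # Hypothetical stand-in for llm.async_generate with a single prompt.
    await asyncio.sleep(0.01)  # simulate waiting for the engine
    return {"text": f" <completion at T={sampling_params['temperature']}>"}


async def generate_concurrently(requests):
    # Create one coroutine per request and await them all together;
    # asyncio.gather returns results in the order of its arguments.
    tasks = [fake_async_generate(prompt, params) for prompt, params in requests]
    return await asyncio.gather(*tasks)


requests = [
    ("The capital of France is", {"temperature": 0.0}),
    ("The future of AI is", {"temperature": 0.8}),
]
outputs = asyncio.run(generate_concurrently(requests))
for (prompt, _), output in zip(requests, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```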

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm an [Age] year old [Occupation]. I love [big thing about me]. In my free time, I enjoy [activities or hobbies]. What kind of person would you like to be? I would like to be [Personality trait or character trait], [Charity or social responsibility], or [Mentality or attitude].

If you could create me as a character, what would you like to be? As an AI language model, I am capable of creating and manipulating fictional characters based on the prompts given to me. However, I do not have a physical form or personality, so I



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is located on the northern bank of the Seine River, near the Louvre Museum and the Eiffel Tower. The city is also known for its famous Bastille prison, the Champs-Élysées, and the Notre-Dame Cathedral.

The history of Paris goes back to the time of the Romans, when the city was founded as a major Roman settlement on the banks of the Seine River. The city has been a center of European culture, politics, and art for centuries, and continues to be a major hub for international trade and diplomacy today.

Despite its long history, Paris has undergone several



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be marked by rapid advancements and significant breakthroughs in areas such as machine learning, natural language processing, and computer vision. The following are some of the key trends that are likely to shape the future of AI:

1. Increased use of AI in everyday life: One of the most significant trends in AI is the increased use of AI in everyday life. This includes the development of AI-powered devices such as self-driving cars, smart home assistants, and intelligent transportation systems. These devices are expected to reduce the need for humans to engage in routine tasks, freeing up time and energy for more meaningful work.

2. AI will continue to become
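
The asynchronous streaming loop in this section can be pictured with a small async generator. This is a sketch under stated assumptions, not the implementation of `async_stream_and_merge`: `fake_stream` is a hypothetical stand-in for a streaming call, and it assumes each chunk carries the cumulative text generated so far, so only the unseen suffix is yielded.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(prompt: str) -> AsyncIterator[dict]:
    # Hypothetical stand-in for a streaming engine call; each chunk holds
    # the cumulative text generated so far (an assumption of this sketch).
    for text in [" Paris", " Paris.", " Paris. It is"]:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield {"text": text}


async def stream_new_text(stream: AsyncIterator[dict]) -> AsyncIterator[str]:
    """Yield only the part of each cumulative chunk that has not been seen yet."""
    seen = 0
    async for chunk in stream:
        text = chunk["text"]
        new_part = text[seen:]
        seen = len(text)
        if new_part:
            yield new_part


async def collect(prompt: str) -> List[str]:
    pieces = []
    async for piece in stream_new_text(fake_stream(prompt)):
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()
    return pieces


pieces = asyncio.run(collect("The capital of France is"))
```

Printing each piece with `end=""` and `flush=True`, as in the cell above, keeps the streamed text on one line instead of one token per line.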




In [6]:
llm.shutdown()