# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
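
To see why this is needed: notebook kernels such as IPython already run an asyncio event loop, and plain asyncio refuses to start a second one from inside it. The stdlib-only sketch below reproduces that failure without SGLang; `inner` and `outer` are illustrative names, not part of any library.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        return asyncio.run(coro)  # a nested run is rejected by plain asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` first patches asyncio so that such nested calls are permitted.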

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Chad and I'm a fantasy novelist and rock star who loves watching Harry Potter. I'm a big fan of fantasy and story structure. My favorite book is "Harry Potter and the Sorcerer's Stone" by J.K. Rowling. I have a few stories in my mind about Harry, and I'm about to write an adventure about him. Can you help me get the plot started?

I have a bit of a long story idea but I want it to be more action-packed and show how the characters will develop in the coming adventures. What could I include in my first paragraph?

Certainly! Here’s a suggested first paragraph for your first
Prompt: The president of the United States is
Generated text:  a very important person. But what does a president do every day? The president is like a king, but he has a lot of power. To be the president, he has to get the vote of the people. But sometimes, the president can't get the vote of the people. Then, the president has to use the power of the people to get the vote 
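
As the loop above suggests, `llm.generate` returns one result per prompt, and each result exposes the generated string under the `"text"` key. A minimal sketch of turning such results into JSON Lines records for later analysis follows. Here `fake_outputs` stands in for a real `llm.generate` return value and `to_jsonl` is a hypothetical helper, so the snippet runs without a GPU; only the `"text"` key is assumed.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text, one JSON object per line."""
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


prompts = ["Hello, my name is", "The capital of France is"]
# Stand-in for `llm.generate(prompts, sampling_params)`; only "text" is assumed.
fake_outputs = [{"text": " Chad."}, {"text": " Paris."}]

jsonl = to_jsonl(prompts, fake_outputs)
print(jsonl)
```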

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Occupation]. I'm a [Type of Character] who has [Number of Years in the Profession] years of experience in [Field of Work]. I'm [Favorite Hobby or Activity] and I enjoy [Reason for Hobby or Activity]. I'm [Favorite Color] and I love [Favorite Food]. I'm [Favorite Book or Movie] and I read [Number of Books or Movies] a year. I'm [Favorite Sport or Activity] and I play [Number of Sports or Activities] a year. I'm [Favorite Music or Artist] and I listen to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history dating back to the Roman Empire and the Middle Ages. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe. The city is also famous for its fashion industry, art scene, and its role in the French Revolution. Paris is a vibrant and diverse city with a population of over 2 million people and a rich cultural and historical heritage. It is a popular tourist destination and a major economic center in France. The city is also home to many

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more sophisticated, it is likely to become more integrated with human intelligence. This could lead to more sophisticated forms of AI that can learn from and adapt to human behavior and decision-making.

2. Greater emphasis on ethical considerations: As AI becomes more advanced, there will be a greater emphasis on ethical considerations. This could lead to more rigorous testing and validation of AI systems, as well as greater transparency and accountability in their use.

3. Increased reliance on AI for decision-making: As AI becomes more integrated with human intelligence, it is likely to become
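
The banner printed by this cell mentions overlap removal. The sketch below shows one way overlap-aware merging of streamed chunks could work. It assumes each streamed chunk is a dict with a `"text"` key, and it is not claimed to mirror the actual implementation of `stream_and_merge`; `trim_overlap`, `merge_stream`, and `fake_chunks` are names chosen for this illustration.

```python
def trim_overlap(existing_text, new_chunk):
    """Return new_chunk minus any prefix that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks):
    """Accumulate streamed chunks, dropping overlapping prefixes."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk["text"])
    return merged


# Hypothetical chunks: the second and third repeat the tail of the text so far.
fake_chunks = [
    {"text": " Paris is"},
    {"text": " is the capital"},
    {"text": "capital of France."},
]
merged = merge_stream(fake_chunks)
print(repr(merged))
```

A suffix/prefix match like this cannot tell a repeated chunk from text that legitimately repeats (for example, "ha" followed by "ha"), so it is a heuristic rather than an exact reconstruction.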



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a/an [Age] year old [Occupation]. I am a/an [Gender] [Gender Identity/Preference] who is [Describe your physical appearance, including height, weight, hair, eyes, and any unique features]. My [Favorite Food] is [Favorite Meal], and my [Favorite Book/Artist/Album/TV Show] is [Favorite Source of Inspiration]. I am a/an [Age] year old [Occupation] who is passionate about [Project/Challenge]. In my free time, I enjoy [Gaming, Watching Movies, Reading, Cooking, etc.] and often spend time

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known as "La République" or "Le Palais Royal" (Royal Palace) and a major cultural and political center, famous for its Notre-Dame Cathedral, Eiffel Tower, Louvre Museum, and other landmarks. Paris is also known for its unique cuisine, including croissants, bag
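
Because `async_generate` is awaitable, several independent batches can be in flight at once. The sketch below illustrates the pattern with `asyncio.gather`. `StubEngine` is a hypothetical stand-in that mimics only the call shape used above (a list of prompts in, a list of `{"text": ...}` dicts out), so it runs without a model. Whether concurrent submission helps in practice depends on the engine's scheduler, which this sketch does not measure.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; echoes prompts instead of generating."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate waiting on the backend
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Submit every batch at once and wait for all of them to finish.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
results = asyncio.run(run_batches(StubEngine(), batches, sampling_params))

for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```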

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [age] year-old girl from [location]. I'm an [occupation/interest]. I like to [activity/interest], and I love [thing/thing to do]. I'm very [characteristic] and [would you like to share anything else about yourself?]. So, how can I best describe you to others? [Name], you're an [occupation/interest] with [activity/interest], and you love [thing/thing to do]. I love you, [Name]. You have [characteristic], and you're very [would you like to share anything else about yourself

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is known as the "City of Love" and the "City of Light." It is the country's cultural and political center and is home to many world-renowned museums, galleries, and landmarks. Paris is also the birthplace of numerous notable figures, including the composer Ludwig van Beethoven and the novelist Charles Dickens. The city is home to the Eiffel Tower and the Louvre Museum, which are both UNESCO World Heritage sites. Additionally, Paris is a popular tourist destination, known for its historic architecture, stunning views of the city, and vibrant nightlife. The city is often referred to as "la Ville Blanche


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain and depends on a wide range of factors, including technological advancements, changes in societal values, and evolving global dynamics. However, here are some possible trends that could impact AI in the next few years:

1. More emphasis on ethical considerations: As AI becomes more integrated into our lives, there will be a greater focus on ethical considerations. This could involve debates about AI's role in society, its impact on employment, and its responsibility to users.

2. Greater adoption of AI in healthcare: AI is already being used in medical applications, such as predicting patient outcomes and recommending treatments. As AI continues to improve, we may see more

In [6]:
llm.shutdown()
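
In a standalone script, an exception raised during generation would skip a trailing `shutdown()` call. One defensive pattern is to wrap the work in a context manager so that cleanup always runs. The sketch below assumes only that the engine object has a `shutdown()` method; `StubEngine` and `managed_engine` are hypothetical names used so the example runs without a model.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in whose generate() always fails."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        raise RuntimeError("simulated failure during generation")

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and call shutdown() even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed_engine(engine) as llm:
        llm.generate(["Hello, my name is"], {"temperature": 0.8})
except RuntimeError as exc:
    print(f"Generation failed: {exc}")

print("shut down:", engine.is_shut_down)
```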