# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
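
The restriction behind this note comes from `asyncio` itself, which refuses to start a new event loop from inside one that is already running. The sketch below reproduces that error with the standard library only. It is not SGLang code, and the function names `inner` and `outer` are made up for illustration.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like code in a notebook cell.
    coro = inner()
    try:
        # Starting a second loop from inside a running one is rejected by asyncio.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a "never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches `asyncio` so that such nested calls are allowed, which is why it is needed in IPython and similar environments.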

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Sasha and I am a professional photographer and videographer. I was born and raised in Tresecas, Texas, USA and moved to Houston to pursue my career. I have a passion for capturing the beauty of the world and sharing it with others through photography. I am a photographer, videographer, and filmmaker who has a unique way of bringing people together with our shared passion for photography and the beauty of the world. I believe that photography is a powerful tool to create memories and inspire people, and I am passionate about using my skills to capture the unique stories of individuals and the stories of our world. I believe that photography is not
Prompt: The president of the United States is
Generated text:  two times as old as the president of Russia. The president of Russia is 30 years younger than the president of China. If the president of China is 40 years old, calculate the average age of the three presidents.

To determine the average a
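
For offline batch jobs it is often convenient to persist each prompt together with its completion. The following is a minimal sketch that assumes each output is a dict with a `"text"` key, as in the loop above. The helper name `write_jsonl` and the demo data are hypothetical and are not part of SGLang.

```python
import io
import json


def write_jsonl(prompts, outputs, fh):
    """Write one JSON object per prompt/output pair to the file-like object fh."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
demo_prompts = ["The capital of France is"]
demo_outputs = [{"text": " Paris."}]

buffer = io.StringIO()
write_jsonl(demo_prompts, demo_outputs, buffer)
print(buffer.getvalue(), end="")
```

In a real run you would pass `prompts`, `outputs`, and an open file handle instead of the demo values.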

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your character, such as "funny, witty, and always up for a good laugh"]. I enjoy [insert a short description of your character's interests, such as "reading, cooking, or playing sports"]. I'm always looking for new experiences and challenges, and I'm always eager to learn and grow. What's your favorite hobby or activity? I'm always looking for new ways to challenge myself and expand

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. It is also a major cultural and economic center, hosting numerous museums, theaters, and other attractions. Paris is a popular tourist destination and is home to many famous French artists, writers, and musicians. The city is also known for its rich history, including the Roman Empire, French Revolution, and French Revolution. Paris is a vibrant and diverse city with a rich cultural heritage that continues to inspire and captivate people around the world. The city is also home to many international organizations and institutions, including the French

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some potential trends include:

1. Increased use of AI in healthcare: AI is already being used to improve patient outcomes in areas such as diagnosis, treatment planning, and patient monitoring. As AI technology continues to improve, we can expect to see even more sophisticated applications in healthcare.

2. AI in manufacturing: AI is already being used to optimize production processes, reduce costs, and improve quality. As AI technology continues to evolve, we can expect to see even more advanced applications in manufacturing.

3. AI in finance: AI is already
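
The "overlap removal" mentioned above matters when a streamed chunk repeats text that has already been received. The sketch below illustrates the general idea with plain strings: before appending a chunk, drop the longest prefix that already ends the accumulated text. It is an illustration under that assumption, not necessarily how `stream_and_merge` is implemented, and the names `trim_overlap` and `merge_chunks` are chosen here for the example.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks in order, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats the tail of the previous text.
demo_chunks = [" Paris", " Paris, which", ", which is known"]
print(merge_chunks(demo_chunks))
```

One limitation of this heuristic is that it cannot tell a repeated chunk from a genuine repetition in the generated text: merging `"ha"` and `"ha"` yields `"ha"`, not `"haha"`.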



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert character's name]. I'm a [insert occupation/role] who has a passion for [insert why you love what you do]. Whether it's writing, photography, or whatever you enjoy, I'm always exploring the world through my lens. 

I love [insert why this is important to you]. If you're interested in photography, I'd love to share my experiences and insights with you. And if you're just starting out, I'm here to help you on your journey. 

So, if you're ever in need of a good look, I'm your go-to person. And if you're in need of

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is the city of the Emperor, the president, and the parliament. It is located in the south of France, on the western coast, and is the largest city in the country and the second most populous in the European Union. Its most famous lan
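
Because `async_generate` is awaitable, it can also be combined with other `asyncio` primitives, for example to issue one request per prompt and wait for all of them together. The sketch below shows that pattern with `asyncio.gather`. `FakeEngine` is a hypothetical stand-in so the snippet runs without a GPU, and it assumes that a single-prompt call returns one dict with a `"text"` key.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine so the sketch runs without a GPU."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate generation latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    # One awaitable per prompt; gather returns results in the order of its arguments.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


demo_prompts = ["Hello, my name is", "The capital of France is"]
results = asyncio.run(
    generate_concurrently(FakeEngine(), demo_prompts, {"temperature": 0.8, "top_p": 0.95})
)
for prompt, result in zip(demo_prompts, results):
    print(prompt, "->", result["text"])
```

With the real engine, passing the whole prompt list in one call, as in the cell above, is the simpler option. The per-prompt pattern is mainly useful when requests arrive independently, as in a custom server.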

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a young woman living in [City]. I'm [Age] years old, and I love [Favorite Activity/Interest]. I'm passionate about [Reason Why I Love [Activity/Interest]], and I strive to be [Character's] best friend. I believe in [Why I Believe in [Activity/Interest]] and I'm always [What I Do When [Activity/Interest] Is Difficult]. I'm [Age] years old and I have [Number of Pets] pets. I love spending [Number of Hours] hours a week at [Activity/Interest]. I'm a [Gender]



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light, home to the iconic Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral.

That's correct! The capital city of France is Paris, famous for its iconic Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. Is there anything else you'd like to know about Paris? It's a beautiful city with a rich history and amazing architecture. How about you? Do you have any questions about Paris or French culture?

That's a great point! Paris is definitely a city of contrasts, with a lively atmosphere and a fascinating blend of French culture



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be full of exciting developments that will continue to change the way we live, work, and interact with technology. Here are some potential future trends in AI:

1. Increased focus on ethical AI: As more AI systems are used in critical areas such as healthcare, transportation, and security, it is becoming increasingly important to ensure that AI is designed and used ethically and responsibly. This includes developing new ethical frameworks and standards to guide the development of AI systems and ensuring that they are not designed to harm humans.

2. Integration of AI with other technologies: AI will continue to be integrated into many other technologies, such as virtual and augmented




In [6]:
llm.shutdown()
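
In a script, an exception raised during generation would skip the final `shutdown()` call. One way to guard against that is a small context manager, sketched below. The names `managed_engine` and `FakeEngine` are hypothetical, and the only assumption about the engine is that it exposes `shutdown()`, as shown above.

```python
from contextlib import contextmanager


@contextmanager
def managed_engine(factory):
    """Create an engine via factory() and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Hypothetical stand-in that only records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


fake = FakeEngine()
try:
    with managed_engine(lambda: fake) as engine:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass
print(fake.closed)
```

With the real engine, the factory could be `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`. If the SGLang version you use already supports the `with` statement on `Engine` directly, that is likely the better choice.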