# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
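
The snippet below is a small, SGLang-free illustration of the problem this works around: calling `asyncio.run()` while an event loop is already running (as in IPython or Jupyter) raises a `RuntimeError`. The `inner` and `outer` coroutines are hypothetical names used only for this demonstration.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in IPython/Jupyter.
    coro = inner()
    try:
        asyncio.run(coro)  # starting a second loop from inside the first one
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Applying `nest_asyncio` patches asyncio so that such nested calls are allowed.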

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine.
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# Passing a list of prompts returns one output dict per prompt, in the same order.
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Peter Green, a keen gardener who specializes in planting and nurturing plants. I have a keen interest in preserving natural habitats and ecosystems. I believe that sustainable gardening practices can help preserve the health of our planet. How can I assist you in setting up a garden that benefits both the environment and my personal health?

As Peter Green, you are the gardener in my mind, and I believe in supporting sustainable gardening practices. How can I assist you in setting up a garden that not only benefits the environment and my personal health, but also preserves the biodiversity and ecological balance of the area?

Gardening is a wonderful hobby that can be enjoyed
Prompt: The president of the United States is
Generated text:  a political office. Who is the president of the United States?
The president of the United States is the President of the United States, also called the President of the United States. The president is the hea
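
To give a feel for what the `temperature` and `top_p` values in `sampling_params` control, here is a toy sketch of temperature scaling followed by nucleus (top-p) filtering over a handful of made-up logits. It is an illustration under simplifying assumptions, not SGLang's sampler; `sample_token` and `toy_logits` are hypothetical names.

```python
import math
import random


def sample_token(logits, temperature=0.8, top_p=0.95, rng=None):
    """Toy temperature + nucleus (top-p) sampling over a dict of token -> logit."""
    rng = rng or random.Random()
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # Softmax, subtracting the maximum for numerical stability.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    # Keep the smallest set of top tokens whose cumulative probability reaches top_p.
    nucleus, cumulative = [], 0.0
    for tok, p in probs:
        nucleus.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    tokens, weights = zip(*nucleus)
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up logits for the next token after "The capital of France is".
toy_logits = {" Paris": 5.0, " Lyon": 2.0, " Nice": 1.0, " Rome": -3.0}
print(sample_token(toy_logits, temperature=0.8, top_p=0.95, rng=random.Random(0)))
```

With these toy logits the top token alone already exceeds the `top_p` threshold, so the nucleus contains a single candidate. Raising `temperature` or `top_p` lets more candidates through.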

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    # stream_and_merge consumes the stream and returns the merged text.
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Occupation]. I'm currently [Current Location] and I enjoy [Favorite Activity/Interest]. I'm always looking for new experiences and challenges to try out, and I'm always eager to learn and grow. I'm a [Type of Person] who is always [Positive Traits]. I'm always ready to help others and I'm always willing to lend a hand. I'm a [Favorite Book/Artist/Artist/Book] and I love to [Favorite Activity/Interest]. I'm a [Favorite Food/Drink/Activity/Place] and I love to [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French National Library. Paris is a cultural and economic hub with a rich history dating back to the Roman Empire and the French Revolution. It is a popular tourist destination and a major center of politics, business, and art in the world. Paris is also known for its cuisine, including French cuisine, and its fashion industry. The city is home to many world-renowned museums, including the Louvre and the Musée d'

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased focus on ethical and social implications: As AI becomes more integrated into our daily lives, there will be a growing emphasis on its ethical and social implications. This could lead to more stringent regulations and guidelines for AI development and deployment, as well as increased scrutiny of AI systems in the public eye.

2. Greater integration with human decision-making: AI is likely to become more integrated with human decision-making, particularly in areas such as healthcare, finance, and transportation. This could
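
The "overlap removal" mentioned above can be pictured as follows: when a streamed chunk begins with text that has already been received, only the new part is appended. This is a minimal sketch of that idea with illustrative helper names and a made-up stream. It is not necessarily how `stream_and_merge` is implemented.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical stream in which each chunk repeats the tail of the previous one.
fake_stream = [" Paris is", " is the capital", " capital of France."]
print(merge_chunks(fake_stream))
```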



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the whole batch; outputs come back in the same order as the prompts.
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Type] [Role] at [Company/Establishment]. I'm an AI language model trained on [Tool/Technology]. How can I assist you today? What would you like to learn about? I'm here to help you with any questions or concerns you may have. Please let me know how I can assist you. Let's get started! [Name] [Description of role] [Company/Organization/Platform] [Tools/Technology] [Education] [Languages] [Skills] [Experience] [Certifications] [Other] [Background] [Interests] [Qualifications]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 
(A) Paris, the oldest capital city in Europe, was established in 789 AD and is the 8th largest city in the world. 
(B) Paris is the oldest capital city in Europe and was established in 789 AD. 
(C) Paris is the capital of France and is the 8th largest city in the wo
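
An asynchronous API also makes it possible to issue several requests concurrently from your own code. The sketch below shows that pattern with `asyncio.gather`. So that it runs without a GPU, it uses `FakeAsyncEngine`, a hypothetical stand-in that only mimics the `{"text": ...}` output shape seen above; it is not the SGLang engine, and `generate_all` is an illustrative helper.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in for the engine; it only mimics the output shape."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # pretend to wait on the model
        return {"text": f" <completion for: {prompt!r}>"}


async def generate_all(engine, prompts, sampling_params):
    # Issue one request per prompt and await them together.
    # asyncio.gather returns results in the order the awaitables were passed.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


demo_prompts = ["The capital of France is", "The future of AI is"]
results = asyncio.run(
    generate_all(FakeAsyncEngine(), demo_prompts, {"temperature": 0.8})
)
for prompt, output in zip(demo_prompts, results):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```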

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [Majority, Minor] of the [Name of the Company/Institution]. I'm excited to be here and contribute to the [Majority, Minor] of the [Name of the Company/Institution]. I have a lot of experience working in [Industry/Field], and I'm a [Language, Age, Education, Personality] who is always looking to learn and grow. I'm passionate about [Objective/Interest of the Company/Institution] and I'm looking forward to [Purpose of the Interview]. Thank you for considering my application! How about you? [Name] is



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the Loire Valley region of the French Alps.

Does this next sentence follow, given the above statement? "Paris is located in the French Alps." Yes, it does follow.

The next sentence does not follow the given statement. While it is true that Paris is in the Loire Valley region of the French Alps, the given statement specifies that the Loire Valley is in the French Alps. The French Alps consist of the Alps in France, including the Loire Valley, while the Loire Valley is specifically in the French Alps. Thus, the sentence incorrectly assumes that Paris is in the French Alps. A correct statement



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  bright and promising, and there are several trends that are likely to shape the technology's direction in the coming years. Here are some possible future trends in artificial intelligence:

1. Increased transparency and accountability: As AI systems become more complex and involve more interactions with humans, there will be a need for greater transparency and accountability. This means that developers will need to implement more robust systems for explaining AI decisions, such as code comments and explanations of the logic behind AI outputs.

2. AI will become more integrated into everyday life: AI is already present in our daily lives, but its integration into our daily routines will likely continue to grow. This will
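
If you need the full text as well as the live output, one option is to accumulate chunks while printing them. The sketch below shows that pattern with a hypothetical async generator, `fake_token_stream`, standing in for the real stream so that it runs without the engine. `collect` is likewise an illustrative helper, not part of SGLang.

```python
import asyncio


async def fake_token_stream(text):
    """Hypothetical async generator that yields a string word by word."""
    for index, word in enumerate(text.split(" ")):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word if index == 0 else " " + word


async def collect(stream):
    # Print chunks as they arrive and keep them to rebuild the full text.
    pieces = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        pieces.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(pieces)


full_text = asyncio.run(collect(fake_token_stream("Paris is the capital of France.")))
```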




```python
# Shut down the engine when you are done with it.
llm.shutdown()
```
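
In a standalone script, it can be convenient to wrap generation in `try`/`finally` so that `shutdown()` runs even if generation raises. The sketch below uses `FakeEngine`, a hypothetical stand-in that only records whether `shutdown()` was called; the same shape would apply to a real engine object.

```python
class FakeEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


engine = FakeEngine()
try:
    engine.generate([], {"temperature": 0.8})  # fails on purpose
except ValueError as exc:
    print(f"generation failed: {exc}")
finally:
    # Runs on success and on failure, so the engine is always released.
    engine.shutdown()

print(engine.is_shut_down)
```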