# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
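
At its core, a custom server is a thin request handler that forwards a payload to the engine and returns the generated text. The sketch below is framework-agnostic and is not the linked `custom_server.py`. `handle_generate` and `StubEngine` are hypothetical names, and the stub only mimics the `async_generate` call shape used later in this document (a list of dicts with a `text` key) so that the snippet runs without a GPU.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; echoes prompts instead of running a model."""

    async def async_generate(self, prompts, sampling_params):
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


async def handle_generate(engine, payload):
    """Validate a request payload and forward it to the engine."""
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        return {"status": 400, "error": "'prompt' must be a non-empty string"}
    sampling_params = payload.get("sampling_params", {})
    # Submit a one-element batch and unwrap the single result.
    outputs = await engine.async_generate([prompt], sampling_params)
    return {"status": 200, "text": outputs[0]["text"]}


response = asyncio.run(handle_generate(StubEngine(), {"prompt": "Hello"}))
print(response)
```

In a real server, the web framework's route function would call a handler like this with a shared engine instance in place of the stub.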



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
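
The standard-library sketch below illustrates the kind of failure this patch is meant to work around: by default, asyncio refuses to start a second event loop from inside one that is already running, which is the usual situation in a notebook kernel. It does not use SGLang at all, and `inner` and `outer` are purely illustrative names.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as is typical in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio.apply()` in place, such nested calls are allowed to run instead of raising.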

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Miep van der Rohe. I am an artist, writer, and professor in the fields of art and philosophy. My work is rooted in a tradition of art and philosophy that is deeply rooted in the East. In my art, I seek to explore the meaning of existence, and in my writing, I explore the significance of the body and the human condition.

My work often involves the exploration of the dualism of the body and the psyche, as well as the role of art in our cultural and philosophical traditions. My creative projects can include the creation of large-scale sculptures and installations, the development of digital art, the creation of
Prompt: The president of the United States is
Generated text:  in the White House, a certain distance from the President of the United Kingdom. If the president of the United States walks to the President of the United Kingdom, which is 20 miles away, the distance between them will be reduced to 15 miles. However, if the president of the 
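
For prompt lists that are too large to submit comfortably in one call, one option is to split them into fixed-size chunks and call `generate` once per chunk. This is a minimal sketch under stated assumptions: `generate_in_chunks` and `StubEngine` are hypothetical, the stub stands in for `llm` so the snippet runs without a model, and the chunk size is arbitrary.

```python
class StubEngine:
    """Hypothetical stand-in for sgl.Engine that returns placeholder text."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion {i}>"} for i, _ in enumerate(prompts)]


def generate_in_chunks(engine, prompts, sampling_params, chunk_size=2):
    """Call engine.generate on successive slices and pair each prompt with its text."""
    results = []
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = engine.generate(chunk, sampling_params)
        results.extend(zip(chunk, (output["text"] for output in outputs)))
    return results


prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
pairs = generate_in_chunks(StubEngine(), prompts, {"temperature": 0.8, "top_p": 0.95})
for prompt, text in pairs:
    print(f"Prompt: {prompt}\nGenerated text: {text}")
```

Since the engine already schedules a batch efficiently, chunking like this is mainly a way to checkpoint progress or bound how many results are held at once, not a way to speed up generation.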

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, a city with a rich history and culture. It is located on the Seine River and is the largest city in France by population. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. The city is also famous for its fashion industry, with many famous designers and fashion houses operating in the area. Paris is a popular tourist destination and is home to many museums, art galleries, and cultural institutions. It is a cultural and intellectual center of France and a major economic hub. The city is also known for its cuisine,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies will continue to improve and become more integrated into our daily lives, from self-driving cars and robots to personalized medicine and virtual assistants. Additionally, AI will continue to be used for ethical and social reasons, such as improving access to healthcare and reducing poverty. Finally, AI will continue to be used for military and security purposes, with the potential to revolutionize warfare and surveillance. Overall, the future of AI is likely to be one of continued innovation and progress, with a focus on ethical and social implications.
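
The "overlap removal" mentioned in the output above can be pictured as follows: when a new chunk begins with text that already ends the accumulated output, only the non-repeated remainder is appended. The sketch below is an illustration of that idea and is not necessarily how `stream_and_merge` is implemented; `strip_overlap` and `merge_chunks` are hypothetical names.

```python
def strip_overlap(existing, chunk):
    """Return the part of `chunk` that does not repeat the tail of `existing`."""
    # Try the longest possible overlap first, then shrink.
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks, dropping any prefix that duplicates the current tail."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


merged = merge_chunks(["The capital", "capital of", " of France", " is Paris"])
print(merged)
```

A heuristic like this can also drop text that is legitimately repeated (for example, two consecutive identical chunks), so it is only appropriate when chunks are known to overlap.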



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [occupation], [occupation] on [name]. I have always been fascinated by the idea of [occupation], so I decided to become a [occupation] to explore its depths. With a passion for [occupation], I strive to understand the different aspects of [occupation] and the challenges that come with it. I believe that it's important to be honest and transparent about the journey of becoming a [occupation], and I hope to share my experiences and insights with anyone interested in learning more. I am excited to have the opportunity to share my knowledge and experience with you. And, most importantly, I am

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

Calculate the probability of an individual finding an apple in an unopened fruit basket in Paris, given that the apple is randomly selected from the 
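
Because `async_generate` is awaited, several independent requests can be in flight at once. The sketch below shows one way to do that with `asyncio.gather` and a semaphore that caps concurrency. `StubAsyncEngine` and `generate_concurrently` are hypothetical, the stub simulates latency with a short sleep so the snippet runs without a model, and the concurrency limit is arbitrary.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine with an awaitable generate method."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


async def generate_concurrently(engine, prompt_groups, sampling_params, max_concurrency=2):
    """Run one async_generate call per group, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(group):
        async with semaphore:
            return await engine.async_generate(group, sampling_params)

    # gather returns results in the same order as the input groups.
    return await asyncio.gather(*(run_one(group) for group in prompt_groups))


groups = [["Hello, my name is"], ["The capital of France is"], ["The future of AI is"]]
results = asyncio.run(generate_concurrently(StubAsyncEngine(), groups, {"temperature": 0.8}))
for group, outputs in zip(groups, results):
    print(group[0], "->", outputs[0]["text"])
```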

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [Professional/Background/Role]. I specialize in [Job Title/Position]. My [Professional/Background/Role] has been instrumental in [Purpose of the Job]. Let's talk about [Objective/Challenge]. I'm a [Type of Person] and I enjoy [Joy/Philosophy/Value]. Here is an example of a short, neutral self-introduction for a fictional character:

"Hi, my name is [Name] and I am a [Professional/Background/Role]. I specialize in [Job Title/Position]. My [Professional/Background/Role] has been instrumental in

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Would you like to know more about Paris or its landmarks? Paris is known for its iconic Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum.

I will now focus on the importance of Paris in French history and culture. Paris has been the capital of France since 886 AD when Emperor Clovis was crowned as the first King of France. The city served as a major French port city and trading center until the 19th century when it was largely abandoned.

During the French Revolution of 1789, Paris was the center of political and social upheaval. The

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  full of exciting and potentially transformative possibilities, and it is difficult to predict with certainty what will shape it in the coming decades. However, here are some potential trends that AI is likely to continue to evolve and advance in the coming years:

1. Increased dependence on AI in work: With the increasing importance of automation in the workplace, it is likely that AI will continue to play a more significant role in many jobs. This could lead to a higher level of automation in certain areas and a more skilled workforce.

2. Greater focus on ethical AI: There is growing awareness of the potential risks and ethical concerns associated with AI, and there is likely
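
When the streamed text is needed afterwards, and not only on screen, the chunks can be collected while they are printed. The sketch below shows that pattern for any async iterator of text chunks. `collect_stream` and `fake_stream` are hypothetical names, and `fake_stream` stands in for a real chunk source such as the `async_stream_and_merge` call above so the snippet runs without a model.

```python
import asyncio


async def fake_stream(text):
    """Hypothetical chunk source that yields one word at a time."""
    for index, word in enumerate(text.split(" ")):
        await asyncio.sleep(0)  # give control back to the event loop
        yield word if index == 0 else " " + word


async def collect_stream(chunks):
    """Print each chunk as it arrives and return the full text at the end."""
    parts = []
    async for chunk in chunks:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # newline once the stream is exhausted
    return "".join(parts)


full_text = asyncio.run(collect_stream(fake_stream("Paris is the capital of France.")))
```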




In [6]:
llm.shutdown()
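
In a script, it can be convenient to guarantee that `shutdown()` runs even when inference raises an exception. The sketch below wraps an engine in a small context manager for that purpose. `managed_engine` and `StubEngine` are hypothetical, and the stub only records whether `shutdown()` was called so the snippet runs without a model.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that tracks its shutdown state."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed_engine(engine) as llm:
        llm.generate(["Hello"], {})
        raise ValueError("simulated failure during inference")
except ValueError:
    pass

print(engine.is_shut_down)
```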