# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython or in any other environment where an event loop is already running, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
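The patch matters because the examples in this document call `asyncio.run()`, which by default refuses to start while another event loop is running, as is the case inside a notebook cell. The standard-library sketch below reproduces that failure without SGLang; the names `inner` and `outer` are illustrative only and are not part of any library.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # At this point an event loop is already running, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected unless the loop is patched
    except RuntimeError as exc:
        coro.close()  # the coroutine never started, so close it explicitly
        return f"nested asyncio.run failed: {type(exc).__name__}"
    return "nested asyncio.run succeeded"


result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` beforehand patches the loop so that such nested calls are allowed.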

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Max Hall and I am a Vedic Astrologer. How can I help you today?
Hello Max Hall! It's great to meet you, and I'm here to help with any questions you may have. What can I assist you with today? Any special requests or concerns you have? I'd be happy to discuss them further. Let's get started! 
Is there anything in particular you'd like to know about Vedic astrology or time management? I can certainly share more details if you'd like. Let's move forward with the conversation! How can I assist you today? Is there anything in particular you'd like to know
Prompt: The president of the United States is
Generated text:  considering implementing a new policy requiring all federal employees to have at least one year of experience in their field of work. According to a study by the Government Accountability Office, 45% of federal employees have less than one year of experience in their field of work. The president plans to implement the policy and wants 
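In offline batch jobs the results are often written to disk rather than printed. The sketch below pairs each prompt with its generated text and serializes the pairs as JSON Lines. It relies only on what the cell above shows, namely that the result is a list of dictionaries with a `text` key; the helper name `to_jsonl` and the demo data are hypothetical and not part of SGLang.

```python
import json
from typing import Dict, List


def to_jsonl(prompts: List[str], outputs: List[Dict]) -> str:
    """Pair each prompt with its generated text, one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": p, "text": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Stand-in data shaped like the list returned by llm.generate(...) above.
demo_prompts = ["The capital of France is", "Hello, my name is"]
demo_outputs = [{"text": " Paris."}, {"text": " Max."}]
jsonl = to_jsonl(demo_prompts, demo_outputs)
print(jsonl)
```

With a real engine, `demo_outputs` would be replaced by the value returned from `llm.generate(prompts, sampling_params)`.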

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Age] year old [Occupation]. I'm a [Skill or Trait] who have been [Number of Years] years in the [Field of Study] field. I'm passionate about [What I Love About My Profession]. I'm always looking for new challenges and opportunities to grow and learn. I'm a [Favorite Hobby] that I enjoy spending time [How]. I'm [What I'm Known For]. I'm [What I'm Looking Forward To Doing Next]. I'm [What I'm Looking Forward To Doing Next]. I'm [What I'm Looking Forward To Doing Next].

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also famous for its rich history, including the French Revolution and the French Revolution Museum. Paris is a bustling city with a diverse population and is home to many famous French artists, writers, and musicians. It is a popular tourist destination and a cultural hub for France. The city is also known for its cuisine, with many famous French dishes and restaurants. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly together. It is a city that has been a center of French culture and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing for more sophisticated and nuanced interactions between machines and humans. This could lead to more personalized and adaptive AI systems that can better understand and respond to human needs and preferences.

2. Greater emphasis on ethical and responsible AI: As AI systems become more complex and sophisticated, there will be a greater emphasis on ethical and responsible design and development. This could include considerations of fairness, transparency, and accountability, as well as the need to address potential biases and unintended consequences of AI systems.

3. Increased use of
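The internals of `stream_and_merge` are not shown in this document. As an illustration of what "overlap removal" could mean, the sketch below merges a sequence of text chunks while dropping any prefix of a new chunk that repeats the end of the text received so far. It assumes that streamed chunks may overlap in this way; `strip_overlap` and `merge_chunks` are hypothetical names, and this is not presented as SGLang's implementation.

```python
from typing import Iterable


def strip_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that `existing` already ends with."""
    max_len = min(len(existing), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing repeated text at each boundary."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


demo = merge_chunks(["Paris is", " is the", " the capital"])
print(demo)
```

The search starts from the longest possible overlap so that the largest repeated span is removed first.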



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emily, and I am a seasoned career coach. I have over 15 years of experience in the field, and my goal is to help people achieve their career goals. I am passionate about helping people grow and succeed in their chosen field, and I love working with clients who are looking for success. I have a unique approach to coaching, using a combination of technical skills, soft skills, and a deep understanding of the industry to help my clients. I believe that everyone has the potential to succeed in their career, and that's why I am dedicated to helping others reach their full potential. Thank you for taking the time to meet me

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is also the largest city in France by area, and is the third largest in the world. It is a UNESCO World Heritage site and is known for i
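One reason to prefer an awaitable call is that several requests can be in flight at once. The sketch below shows that pattern with `asyncio.gather` and a stand-in class, so it runs without a GPU. `StandInEngine` is hypothetical and only mimics the call shape used in the cell above (a list of prompts in, a list of dictionaries with a `text` key out); it makes no claim about how the real engine schedules concurrent calls.

```python
import asyncio
from typing import Dict, List


class StandInEngine:
    """Hypothetical stand-in mimicking the call shape used above; not SGLang."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict
    ) -> List[Dict]:
        await asyncio.sleep(0)  # yield control, as a real awaitable would
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_two_batches(engine: StandInEngine) -> List[List[Dict]]:
    params = {"temperature": 0.8, "top_p": 0.95}
    # gather() returns results in the same order as its arguments.
    return await asyncio.gather(
        engine.async_generate(["A", "B"], params),
        engine.async_generate(["C"], params),
    )


results = asyncio.run(run_two_batches(StandInEngine()))
for batch in results:
    print([item["text"] for item in batch])
```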

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name]. I am [age] years old, and I've been traveling around the world for [number of years]. I'm a [occupation] with a passion for [reason for passion], and I've always been curious about the world around me. What kind of experiences do you enjoy, and how do you stay motivated to keep learning and growing? What's your favorite hobby or activity, and how do you find time for it? Can you tell me a bit about your background or any relevant experiences that you have gained? Thank you for taking the time to meet me, and I look forward to learning more about you!

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral.
The answer is: Paris, the capital city of France, is famous for its iconic landmarks, such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. The city is also home to the Louvre Museum, which houses a vast collection of art and artifacts from around the world. Additionally, Paris is known for its cultural attractions, including the Opera House, Arc de Triomphe, and the Musée d'Orsay. The city also has a rich history, dating

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a highly uncertain and rapidly evolving field, with a wide range of possible trends and developments. Some potential trends that are emerging include:

1. Increased use of AI in healthcare: AI is already being used in medical diagnostics, drug discovery, and personalized treatment plans. As AI technology continues to advance, we can expect to see further applications in healthcare, including better patient outcomes and more personalized treatment plans.

2. More autonomous vehicles: Self-driving cars are already being developed, and there is a lot of potential for further developments in this area. As autonomous vehicles become more common, we can expect to see more intelligent control systems and improved safety features.
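In the asynchronous streaming case the text arrives piece by piece, so a consumer typically yields only the new part of each chunk as soon as it is received. The sketch below shows one way to write such an overlap-aware async generator. It assumes that consecutive chunks may repeat text already seen; `fake_stream` and `stream_without_repeats` are hypothetical names used for illustration and do not describe how `async_stream_and_merge` is implemented.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream() -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming source; chunks overlap on purpose."""
    for chunk in ["The future", " future of AI", " AI is open"]:
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield chunk


async def stream_without_repeats(source: AsyncIterator[str]) -> AsyncIterator[str]:
    """Yield only the part of each chunk that has not been emitted yet."""
    seen = ""
    async for chunk in source:
        # Find the longest prefix of the chunk that `seen` already ends with.
        size = min(len(seen), len(chunk))
        while size > 0 and not seen.endswith(chunk[:size]):
            size -= 1
        fresh = chunk[size:]
        seen += fresh
        if fresh:
            yield fresh


async def collect() -> List[str]:
    return [piece async for piece in stream_without_repeats(fake_stream())]


pieces = asyncio.run(collect())
print("".join(pieces))
```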






```python
llm.shutdown()
```