# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an HTTP server would only add complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you run the **Offline Engine** in IPython or in any other code that already has an event loop running, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
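
To see why this is needed, the sketch below reproduces the underlying issue with the standard library alone: calling `asyncio.run()` while an event loop is already running raises a `RuntimeError`. Environments that already run a loop (a notebook kernel, for example) hit this case, and `nest_asyncio.apply()` patches asyncio so that the nested call is allowed. The coroutine names `inner` and `outer` are illustrative and are not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # Inside this coroutine an event loop is already running,
    # which mirrors the situation inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by default
    except RuntimeError as exc:
        coro.close()  # release the coroutine that never started
        return type(exc).__name__
    return "no error"


nested_result = asyncio.run(outer())
print(nested_result)
```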

## Advanced Usage

The engine also supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) and [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Lucille and I am an Amazon Associate and Affiliate. I earn commissions for qualifying purchases.
As an Amazon Associate, I earn from qualifying purchases. Please see my Privacy Policy for more details.
I would like to share with you a little bit about myself and what you can learn from reading this blog.
I am a professional business coach and I specialize in personal development, coaching and training. I have been teaching coaching for over 20 years and have coached over 10,000 students in 26 countries. I have worked with students of all ages and backgrounds, including high school students, college students, and adults.

Prompt: The president of the United States is
Generated text:  running for a second term. He needs to raise $500,000 to keep his campaign going. He has already raised $120,000. If he raises an additional $30,000, what percent of the way to his goal will he have to raise the remaining amount in order to reach his goal?

To dete
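
The `sampling_params` dictionary above sets `temperature` and `top_p`. As a rough illustration of what these two settings conventionally mean, the sketch below applies temperature scaling and nucleus (top-p) filtering to a made-up list of logits in plain Python. It is a generic textbook version and not SGLang's sampler; the function names and `toy_logits` are invented for this example.

```python
import math


def apply_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
sharp = apply_temperature(toy_logits, 0.2)  # low temperature: mass concentrates on the top token
soft = apply_temperature(toy_logits, 0.8)  # higher temperature: flatter distribution
nucleus = top_p_filter(soft, 0.9)  # drops the unlikely tail before sampling

print("temperature 0.2:", [round(p, 3) for p in sharp])
print("temperature 0.8:", [round(p, 3) for p in soft])
print("after top_p 0.9:", [round(p, 3) for p in nucleus])
```

Lower temperatures make the output more deterministic, while a smaller `top_p` removes more of the low-probability tail.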

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French National Library, and the French Academy of Sciences. Paris is a bustling city with a rich history and culture, and it is a popular tourist destination for visitors from around the world. The city is known for its fashion, art, and cuisine, and it is a major center for business and commerce. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. The city is also home to many cultural institutions

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way that AI is used and developed. Here are some possible future trends in AI:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a growing emphasis on ethical considerations. This includes issues such as bias, transparency, accountability, and privacy.

2. More advanced hardware: As AI technology continues to advance, we may see the development of more powerful hardware that can process and analyze large amounts of data more efficiently.

3. Greater integration with other technologies: AI is likely to become more integrated with other technologies, such
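
"Overlap removal" here means merging streamed chunks without duplicating text when a chunk repeats part of what was already received. A minimal sketch of that idea follows, assuming chunks are plain strings. `trim_overlap`, `merge_stream`, and the sample chunks are hypothetical names and data for illustration, and the logic may differ from what `stream_and_merge` does internally.

```python
def trim_overlap(existing_text, new_chunk):
    """Return new_chunk without the prefix that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks):
    """Accumulate chunks into one string, skipping repeated overlaps."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Made-up chunks in which each piece repeats the tail of the previous one.
sample_chunks = ["The capital", " capital of France", " France is Paris."]
merged_text = merge_stream(sample_chunks)
print(merged_text)
```

One caveat of this simple approach: it cannot tell an accidental overlap from text the model genuinely repeated, so a legitimate repetition at a chunk boundary would also be trimmed.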



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name], and I'm an AI language model. How can I assist you today? [name] is an AI language model trained by [company name] to assist users in generating natural language text. I can answer questions, provide information, and complete tasks related to language processing and generation. [name] has been designed to be impartial, fair, and transparent in its interactions, ensuring that all users receive accurate and helpful responses. What can I do for you today? [name] can be used for a variety of purposes, including language translation, summarization, and generating text for various applications. If you have any questions or need

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city renowned for its historical and cultural landmarks, including the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedra
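
One reason to use the asynchronous API is that awaiting a generation call lets other coroutines make progress in the meantime. The sketch below shows that pattern with `asyncio.gather`. It relies on `StubEngine`, a hypothetical stand-in whose `async_generate` only mimics the call shape used above (a list of prompts in, a list of dicts with a `text` key out), so it runs without a model. Whether overlapping calls help in practice depends on how the real engine schedules requests.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the real engine; it returns canned text."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to wait for the model
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Create one coroutine per batch and wait for all of them together.
    pending = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*pending)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
params = {"temperature": 0.8, "top_p": 0.95}
results = asyncio.run(run_batches(StubEngine(), batches, params))

for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt} ->{output['text']}")
```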

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert your name], and I'm an AI language model created by Anthropic. I'm here to assist you with any questions you might have, and to help you achieve your goals and objectives. How can I assist you today? Don't hesitate to ask me any questions or provide me with any information you need. Remember, my goal is to provide the best possible service to everyone who interacts with me. So please, let me know how I can be of assistance to you today! #Anthropic #AI #LanguageModeling #SelfIntroduction #OpenAndSupportive

Hello, my name is [insert your name], and I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light, and it is located on the Île de France and the Seine river, on the western bank of the Seine in the heart of the French Riviera. Paris is a historical and cultural center, known for its major art museums, literary, and scientific institutions, and for its iconic architecture, such as the Eiffel Tower, the Louvre, and the Champs-Élysées. It is also a major financial center, known for its world-class banks and fashion, and for its annual couture exhibition, the Guggenheim Show. Paris is a large and diverse

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be characterized by significant advances in areas such as deep learning, machine learning, and natural language processing. Here are some potential trends that could shape AI in the coming years:

1. Enhanced Predictive Analytics: AI will become even more sophisticated in its ability to predict future events and behaviors. Machine learning algorithms will be used to analyze data and patterns in real-time, allowing companies to anticipate potential challenges and opportunities.

2. Integration with Other Technologies: AI will continue to integrate with other technologies, including the Internet of Things (IoT) and sensors. This integration will enable AI-powered systems to learn and adapt to changing conditions, which could
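
When streamed text needs to be both displayed and kept, a common pattern is to print each piece as it arrives and join the pieces at the end. The sketch below assumes the stream yields only new text on each step, as the loop above does with `cleaned_chunk`. `fake_token_stream` and `collect` are hypothetical helpers written for this example, and the tokens are taken from the sample output.

```python
import asyncio


async def fake_token_stream(tokens, delay=0.0):
    """Hypothetical stand-in for a streaming call: yields one text piece at a time."""
    for token in tokens:
        await asyncio.sleep(delay)  # hand control back to the event loop
        yield token


async def collect(stream):
    """Print each piece as it arrives and return the joined text."""
    pieces = []
    async for piece in stream:
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()  # finish the line once the stream ends
    return "".join(pieces)


tokens = [" Paris", ",", " also", " known", " as", " the", " City", " of", " Light"]
full_text = asyncio.run(collect(fake_token_stream(tokens)))
```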




```python
llm.shutdown()
```