# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, a Jupyter notebook, or any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

# Allow asyncio.run() to be called from inside an already running event loop.
nest_asyncio.apply()
```
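To see why the patch is needed, here is a minimal sketch that uses only the standard library (no SGLang, no `nest_asyncio`). It reproduces the situation in a notebook, where an event loop is already running and a second `asyncio.run()` call is rejected. The function names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, as in IPython or Jupyter.
    coro = inner()
    try:
        # Without nest_asyncio, asyncio.run() refuses to start a nested loop.
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


result = asyncio.run(outer())
print(result)
```

With `nest_asyncio.apply()` in effect, the nested call is expected to succeed instead of raising, which is what the asynchronous cells in this document rely on when they run in a notebook.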

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation
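The cells in this document pass `temperature` and `top_p` in `sampling_params`. As a rough illustration of what these two knobs usually mean, the sketch below applies temperature scaling followed by top-p (nucleus) filtering to a toy set of logits. It is not SGLang's implementation; `top_p_filter` and the logit values are hypothetical and exist only for this example.

```python
import math


def top_p_filter(logits, temperature, top_p):
    """Return renormalized probabilities of the smallest token set with mass >= top_p."""
    # Temperature scaling: lower values sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())
    exps = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: v / total for tok, v in exps.items()}

    # Nucleus filtering: keep the most likely tokens until their mass reaches top_p.
    kept, mass = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        mass += p
        if mass >= top_p:
            break

    norm = sum(kept.values())
    return {tok: p / norm for tok, p in kept.items()}


# Toy logits for the next token after "The capital of France is".
logits = {"Paris": 4.0, "Lyon": 2.0, "Nice": 1.0, "banana": -3.0}

sharp = top_p_filter(logits, temperature=0.2, top_p=0.9)
flat = top_p_filter(logits, temperature=0.8, top_p=0.95)
print(sorted(sharp), sorted(flat))
```

In this toy case the lower temperature leaves a single candidate, while the higher temperature keeps two, which is consistent with the more varied completions seen at `temperature=0.8`.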

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Andrew, and I am a SQL Server expert.
I would like to ask you a question that is related to SQL Server and I will give you my answer.
Sure thing! Please go ahead and ask your question. I'm ready to answer. Andrew. 

What is the difference between a stored procedure and a trigger in SQL Server? Andrew. 
In SQL Server, a stored procedure is a SQL command that is stored in a database and can be called from other SQL statements. It is a user-defined function that returns a value or a set of values. A stored procedure is created using the CREATE PROCEDURE statement and can be executed by
Prompt: The president of the United States is
Generated text:  5 feet 10 inches tall. If the average person is 5 feet tall, how many times taller is the president than the average person?
To determine how many times taller the president is compared to the average person, we need to convert the heights from feet and inches to just inches and then calculate the ratio

### Streaming Synchronous Generation
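The cell below uses the `stream_and_merge` helper and describes its result as streaming "with overlap removal". The following sketch shows one way such overlap removal can work, assuming that a streamed chunk may repeat text that was already received. The functions `trim_overlap` and `merge_chunks` and the sample chunks are written for this illustration and should not be read as the helper's actual source.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that repeats a suffix of existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which some text is repeated across chunk boundaries.
chunks = [" Paris", " Paris is", " is the capital", " of France."]
merged = merge_chunks(chunks)
print(repr(merged))
```

A suffix/prefix match like this is a heuristic: text that legitimately repeats at a chunk boundary would also be trimmed, so it is best suited to streams that are known to resend earlier output.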

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm passionate about [job title] because [reason for passion]. I'm always looking for new challenges and opportunities to grow and learn. I'm a [job title] with [number of years] years of experience in [industry]. I'm passionate about [job title] because [reason for passion]. I'm always looking for new challenges and opportunities to grow and learn. I'm a [job title] with [number of years] years of experience

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in France and the second-largest city in the European Union. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. The city is also famous for its rich history, art, and culture, and is home to many famous museums, theaters, and restaurants. Paris is a major tourist destination and a major economic center in France. It is the seat of the French government and is home to many important institutions such as the French Academy of Sciences and the French Parliament. Paris is also known for its fashion industry, with many famous designers and bout

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a greater emphasis on ethical considerations. This includes issues such as bias, transparency, accountability, and privacy. AI developers will need to be more mindful of these concerns as they work to create more trustworthy and ethical systems.

2. Integration with human decision-making: AI is likely to become more integrated with human decision-making in the future. This could include the use of AI to assist with decision-making in



### Non-streaming Asynchronous Generation
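The practical benefit of an awaitable generation call is that other coroutines can make progress while the request is pending. The sketch below illustrates this with `asyncio.gather`. `FakeEngine` is a hypothetical stand-in that only mimics the call shape used in the cell below (a list of prompts in, a list of dicts with a `"text"` key out); it does not load a model, and `heartbeat` is likewise invented for the example.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for an engine; it echoes prompts instead of generating."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to wait for the model
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def heartbeat(ticks):
    # Unrelated work that keeps running while generation is pending.
    for _ in range(3):
        await asyncio.sleep(0.001)
        ticks.append("tick")


async def main():
    ticks = []
    engine = FakeEngine()
    outputs, _ = await asyncio.gather(
        engine.async_generate(["a", "b"], {"temperature": 0.8}),
        heartbeat(ticks),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(main())
print(outputs, ticks)
```

Swapping `FakeEngine` for a real engine should not change the structure of `main`, as long as the engine exposes an awaitable call with the same shape.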

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert character's name]. I am a [insert genre, such as fiction, non-fiction, or romantic fiction] author. I have a passion for [insert favorite subject, such as travel, art, or science fiction]. I love to write about [insert topic, such as the power of storytelling, the importance of being true to oneself, or the future of technology]. I enjoy exploring new writing techniques and experimenting with different genres. I am always looking for fresh writing ideas and want to create something that is both original and engaging for my readers. I strive to be a supportive and honest friend to the characters I write about, while

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its rich history, vibrant culture, and numerous museums and art galleries. 

(Note: I'll complete the sentence, mak

### Streaming Asynchronous Generation
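Asynchronous streaming is consumed with `async for`. The sketch below shows the pattern with a hypothetical `fake_stream` generator in place of a real engine. It assumes each chunk carries the cumulative text generated so far, so the consumer yields only the part it has not seen yet; whether real chunks are cumulative or incremental depends on the engine and its settings.

```python
import asyncio


async def fake_stream(tokens):
    """Hypothetical stand-in for a streaming call: yields cumulative text."""
    text = ""
    for tok in tokens:
        await asyncio.sleep(0)  # hand control back to the event loop
        text += tok
        yield {"text": text}


async def stream_deltas(chunks):
    """Yield only the text that was added since the previous chunk."""
    seen = 0
    async for chunk in chunks:
        delta = chunk["text"][seen:]
        seen = len(chunk["text"])
        if delta:
            yield delta


async def main():
    pieces = []
    async for delta in stream_deltas(fake_stream([" Paris", ".", " It", " is"])):
        pieces.append(delta)
    return pieces


pieces = asyncio.run(main())
print(pieces)
```

Printing each `delta` with `end=""` and `flush=True`, as the cell below does with `cleaned_chunk`, gives the effect of text appearing as it is generated.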

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [type of character]! I have a background in [major field of study or career], and I'm passionate about [mention something that describes your interests or hobbies]. Despite my profession, I enjoy [mention a hobby or activity that I enjoy]. I love [mention a trait or quality that sets me apart from others], and I strive to [mention a personal goal or aspiration]. I value [mention a value or characteristic that is important to me], and I'm dedicated to [mention a specific activity or pursuit that I commit to]. I'm always looking for [mention something that challenges or inspires me

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

This statement is factually correct and accurately describes the capital city of France.

To elaborate further:

1. The capital city of France is Paris (also known as "la capitale française").

2. Located in the Loire Valley region of southern France, Paris is the largest and most populous city in France.

3. It is the seat of the French government, the French Parliament, and the heart of the French culture, art, and architecture.

4. Paris is famous for its romantic ambiance, historic architecture, and vibrant cultural scene, which attracts millions of visitors annually.

5. The city is home to iconic landmarks

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting, and there are many potential trends that we can expect to see in the years to come. Here are some of the key trends that are likely to shape the future of AI:

1. Increased automation and AI applications: AI is set to become even more integrated into our daily lives, with more and more AI applications becoming widespread. This could include everything from self-driving cars to personalized content recommendations to even some of the most basic tasks we do.

2. Deep learning and reinforcement learning: These are two of the most advanced forms of AI, and they are likely to become even more important in the future. Deep learning is a type of



In [6]:
llm.shutdown()