# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed, working example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
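
As a rough illustration of the custom-server idea, the sketch below puts an engine behind a plain request handler. It is not the linked `custom_server.py`: `StubEngine` and `handle_generate` are hypothetical names, and the stub only mimics a `generate(prompt, sampling_params)` call that returns a dict with a `"text"` field, so the snippet runs without a GPU.

```python
import json


class StubEngine:
    """Hypothetical stand-in for an engine; it only echoes the prompt."""

    def generate(self, prompt, sampling_params=None):
        return {"text": f" (echo of: {prompt})"}


def handle_generate(engine, request_body: str) -> str:
    """Turn a JSON request body into a JSON response body."""
    try:
        payload = json.loads(request_body)
        prompt = payload["prompt"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return json.dumps({"error": "expected a JSON object with a 'prompt' field"})

    output = engine.generate(prompt, payload.get("sampling_params", {}))
    return json.dumps({"text": output["text"]})


print(handle_generate(StubEngine(), '{"prompt": "Hello, my name is"}'))
```

In a real server, a web framework route would call the handler, and the stub would be replaced by an actual engine instance.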



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
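
To see why this matters, the standard-library-only sketch below reproduces the underlying problem: calling `asyncio.run` while an event loop is already running (as it is inside a notebook cell) raises a `RuntimeError`. The `inner` and `outer` coroutines are hypothetical names used only for this demonstration.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # Simulates a notebook cell: an event loop is already running here.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Applying `nest_asyncio` patches the loop so that such nested calls are allowed.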

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Lin. I’m a software engineer, and my major is computer science. I like to solve problems, which is my goal in life. I’m always learning. I used to have a dream to become a singer, but because of the pandemic, I decided to move on and become a software engineer. As a software engineer, I like to solve problems, which is my goal in life. I’m always learning, which is my goal. I hope to become a successful software engineer in the future. (1) My major is computer science. (2) I like to solve problems. (3) I used to have a dream
Prompt: The president of the United States is
Generated text:  a public office held by the head of state of the United States. The United States president is elected by the citizens of the United States, by a nationwide popular vote. The president of the United States is a member of the executive branch of the U.S. government, with the other branches consisting of the legislative branch and the judicial branch. In the even
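
For offline batch jobs, the results usually need to be stored rather than printed. The sketch below pairs each prompt with its generated text and serializes the pairs as JSON Lines. `to_jsonl` is a hypothetical helper, and the sample data is hand-written to stand in for real engine outputs; the only assumption about the outputs is the `"text"` field used in the cell above.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text, one JSON object per line."""
    lines = [
        json.dumps({"prompt": prompt, "text": output["text"]}, ensure_ascii=False)
        for prompt, output in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Hand-written sample data standing in for real engine outputs.
sample_prompts = ["The capital of France is", "The future of AI is"]
sample_outputs = [{"text": " Paris."}, {"text": " uncertain."}]
print(to_jsonl(sample_prompts, sample_outputs))
```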

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city that serves as the political, cultural, and economic center of the country. It is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also famous for its rich history, including the French Revolution and the French Revolution Museum. The city is home to many famous museums, including the Musée d'Orsay and the Musée Rodin. Paris is a vibrant and diverse city with a rich cultural heritage that continues to attract visitors from around the world. It is a city that is constantly evolving and changing, with new developments and attractions being added

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies are expected to continue to improve and become more integrated into our daily lives, from self-driving cars and robots in factories to personalized medicine and virtual assistants. Additionally, AI will likely continue to be used for tasks that require human-like intelligence, such as language translation and emotional intelligence, and will be integrated into various industries and sectors. However, there are also potential risks and challenges associated with AI, including issues of bias, privacy, and security, and the need for careful consideration and regulation of these technologies. Overall, the future
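
The idea behind "overlap removal" can be sketched in plain Python: when consecutive streamed chunks may repeat text that was already received, only the new suffix of each chunk is appended. `merge_chunks` below is a hypothetical illustration of that idea and is not necessarily how `stream_and_merge` is implemented.

```python
def merge_chunks(chunks):
    """Concatenate chunks, dropping any prefix that repeats the text so far."""
    merged = ""
    for chunk in chunks:
        # Find the longest suffix of `merged` that is also a prefix of `chunk`.
        overlap = 0
        for size in range(min(len(merged), len(chunk)), 0, -1):
            if merged.endswith(chunk[:size]):
                overlap = size
                break
        merged += chunk[overlap:]
    return merged


print(merge_chunks(["Hello wor", "world", "ld!"]))  # prints "Hello world!"
```

One trade-off of this heuristic is that genuinely repeated text (for example, `"ha"` followed by another `"ha"`) is also treated as overlap and collapsed.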



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name] and I am a [Your Job Title] at [Your Company]. My background in [Your Relevant Experience/Title] is [Your Relevant Experience], which includes [Your Relevant Experience], and I have [X amount of relevant experience]. I am enthusiastic about [Your Passion]. I am [X amount of relevant experience] with [Your Relevant Experience], and I have [X amount of relevant experience]. I am highly skilled in [Your Relevant Skills/Job Title], and I am known for [Your Achievements/Professional Achievements]. I am also a strong communicator and I am [X amount of relevant experience] with

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a historical and cultural center renowned for its rich history and vibrant cultural scene. Paris is known for its stunning architecture, delicious cuisine, and iconic landmark
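
The cell above passes the whole prompt list in a single awaited call. When requests arrive separately, one possible pattern is to issue one awaitable per prompt and collect them with `asyncio.gather`, which returns results in input order. The sketch below uses a hypothetical `StubAsyncEngine` so it runs without a GPU; that a real engine's `async_generate` accepts a single prompt string this way is an assumption here.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in that mimics an awaitable generate call."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # pretend to do some work
        return {"text": f" (reply to: {prompt})"}


async def generate_all(engine, prompts, sampling_params):
    # Create one awaitable per prompt; gather returns results in input order.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


results = asyncio.run(
    generate_all(StubAsyncEngine(), ["first", "second"], {"temperature": 0.8})
)
for result in results:
    print(result["text"])
```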

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a [job title] at [company name], working on [your job title]. What can you tell me about your experience with [company's product/service]?

Hello, my name is [Name]. I'm a [job title] at [company name], working on [your job title]. What can you tell me about your experience with [company's product/service]? [Tell me about your experience with the product/service]. That's a great experience, thanks! What do you do at work? [Tell me about your job duties]. That sounds like fun work! Any hobbies or interests? [Tell me



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the historical and cultural center known as "La Ville-\U00E9roise". It is the largest city in Europe and a major European economic and political center. Paris is the French capital and was originally the capital of ancient France and the first capital of the French Empire. It was renamed in 1793 to coincide with the opening of the Parc Louis Vuitton in Paris. Paris is the world's largest and most popular tourist destination, with over 16 million visitors in 2019. Paris is also the center of art, music, fashion, film, and the sciences,



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain and unpredictable, but here are some possible trends that we can expect to see in the coming years:

1. Increased collaboration between humans and machines: More and more AI will be integrated with human intelligence, leading to more efficient and effective decision-making. This will likely involve more human oversight and collaboration, allowing AI to make decisions based on complex data and human knowledge.

2. Self-learning and self-correction: AI systems will become more capable of learning and adapting to new situations, without relying on pre-programmed instructions. This will enable AI to improve its performance over time, making it more reliable and effective.

3. Greater use of
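
The consumption pattern used in the streaming cell (`async for` over an asynchronous generator) can be tried without an engine. In the sketch below, `stub_stream` is a hypothetical chunk source that yields one word at a time, and `collect` gathers the chunks into a single string instead of printing them.

```python
import asyncio


async def stub_stream(text):
    """Hypothetical chunk source: yields the text one word at a time."""
    for word in text.split(" "):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word + " "


async def collect(stream):
    pieces = []
    async for chunk in stream:
        pieces.append(chunk)  # a real consumer could also print each chunk here
    return "".join(pieces).rstrip()


full_text = asyncio.run(collect(stub_stream("The future of AI is open")))
print(full_text)
```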




```python
llm.shutdown()
```
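
In a script, an exception raised during generation would skip a trailing `shutdown()` call. One way to guard against that is a small context manager that always calls `shutdown()` on exit. This is a minimal sketch: `managed` and `StubEngine` are hypothetical names, and the stub only records whether `shutdown()` was called.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed(engine):
    """Yield the engine and always call its shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed(engine) as llm:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.is_shut_down)  # prints True
```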