# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython, or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
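For context, the standard library refuses to start a second event loop from inside one that is already running, which is typically the situation in a Jupyter kernel. The sketch below uses only `asyncio` (no SGLang) to reproduce that refusal; `inner` and `outer` are hypothetical names chosen for illustration. `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, so a nested
    # asyncio.run() call is rejected by the standard library.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

If the engine is only ever used from a plain script with a single `asyncio.run(...)` call at the top level, this patch is likely unnecessary.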

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# generate() takes the whole list and returns one output dict per prompt
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Nancy, and I’m a writer and an educator. My first novel, “Museum Hooligan,” was published by HarperCollins in the United States. This story was written for a first graders, and in order to give them a realistic sense of what to expect, I included a long list of scenarios about what might happen in a real museum. You can find the story here.
I’ve also written several other books for children and middle schoolers, most recently “The Big World, ” a fantasy novel. “The Big World” includes a wide range of adventures and is about a young adventurer who discovers that she has
Prompt: The president of the United States is
Generated text:  a man. The current president of the United States is 71 years old. Which of the following can be inferred?

A) The president is 71 years old now.

B) The president will be 71 years old next year.

C) The president will be 71 years old in 10 years.

D) The president will be older than 71 years old in 3 years.

To solv
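
In an offline batch job the results are usually saved rather than printed. The following is a minimal sketch of one way to do that, assuming only what the cell above shows: each output is a dict with a `text` field. The helper name `to_jsonl` and the record keys `prompt` and `completion` are hypothetical, and the example data is a stand-in so that the snippet runs without an engine.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as one JSON object per line."""
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False) + "\n")
    return "".join(lines)


# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]

print(to_jsonl(example_prompts, example_outputs), end="")
```

The returned string can be written to a file as-is; one record per line keeps large result sets easy to append to and to read back incrementally.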

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    # stream_and_merge consumes the stream and returns the merged text
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history dating back to the Roman Empire and the Middle Ages. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also famous for its fashion industry, art, and cuisine. Paris is a major transportation hub and a popular tourist destination, with many attractions and events throughout the year. It is a cultural and economic center of France and a major hub for international trade and diplomacy. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased automation and efficiency: AI is expected to continue to automate tasks and processes, leading to increased efficiency and productivity. This will require significant changes in how we design and develop AI systems, as well as how we interact with them.

2. Enhanced human-computer interaction: AI will continue to improve its ability to interact with humans, allowing for more natural and intuitive interactions. This will require significant advancements in machine learning and natural language processing.

3. Greater reliance on AI for decision-making: AI will continue to play a larger role in making decisions, particularly in areas such as healthcare, finance
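
The "overlap removal" mentioned above can be illustrated without an engine. The sketch below is one plausible way to merge streamed chunks whose beginnings repeat the end of the text received so far; it is an illustration under that assumption and is not presented as the implementation inside `sglang.utils`. The names `trim_overlap`, `merge_chunks`, and `fake_stream` are hypothetical.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that existing_text already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks while skipping text that was already received."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Stand-in for a stream in which consecutive chunks overlap.
fake_stream = ["The capital", " capital of", " of France"]
print(merge_chunks(fake_stream))
```

Searching from the longest candidate overlap downward means the first match is the largest one, so no repeated text survives in the merged result.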



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the whole batch; outputs come back in prompt order
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [role or profession] for [Company name]. I am passionate about [why you chose this profession], and my goal is to [what you would like to achieve with this role], so I am excited to work with you and the [Company name]. I look forward to [what you would like to say to your first customers]. [Name] is a member of the [Company name] team, and I am excited to work with you to bring your vision to life! Looking forward to [what you would like to say to your first customers]. I look forward to [what you would like to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "La Chaux-de-Fonderie" and the "city of love," which is located in the center of the country and is the cultural, economic, and political capital of France. It is home to the Eiffel Tower and the Louvre Museum and i
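
Because `async_generate` is awaitable, several batches can in principle be submitted concurrently from the same coroutine. The sketch below shows that pattern with `asyncio.gather`. `StubEngine` is a hypothetical stand-in that only mimics the call shape used above (a list of prompts in, a list of dicts with a `text` field out) so that the example runs without a GPU; how much the real engine gains from concurrent submission is not established here.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; returns canned completions."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do some work
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Start every batch first, then wait for all of them together.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(run_batches(StubEngine(), batches, {"temperature": 0.8}))

for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(prompt, "->", output["text"])
```

`asyncio.gather` returns results in the order the awaitables were passed, so each result list lines up with its batch regardless of which one finishes first.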

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [age] year old [occupation] that loves to travel. I travel everywhere and never get bored. What are your passions and interests?

Always use a neutral tone and keep the introduction brief and to the point. Remember to make a personal connection with the reader by sharing a little bit about yourself and what makes you tick. Good luck with your self-introduction! **[Name]**, welcome to our world! I am **[Name]**, a **[Age]** year-old travel enthusiast who has been fascinated by the world and its cultures since I was a child. I am passionate about exploring new

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

The capital of France is Paris.
You are to answer the question above based on the following passage. Please ensure that your answer is as concisely factual as possible. If the passage doesn't explicitly state that a certain city is the capital of France, then provide your own opinion as to why it might be the case.
You are to answer the question below.
Where is the capital of France? The capital of France is Paris. This is based on the passage provided which states that the capital of France is Paris. It does not explicitly state that Paris is the capital, but it is widely recognized as the official capital

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  vast and changing at an unprecedented rate. Some potential trends that may occur include:

1. Increased automation: AI will continue to automate many tasks, such as data analysis, decision-making, and routine maintenance. However, the pace of automation will depend on the development of new AI technologies and their integration into existing systems.

2. Enhanced cognitive capabilities: AI will continue to learn and improve its ability to understand and interpret complex information. This means that AI systems will become more capable of recognizing patterns, making decisions, and solving problems that were previously unsolvable.

3. Enhanced emotional intelligence: AI will become more capable of understanding and responding to human emotions


```python
# Shut down the engine once inference is finished
llm.shutdown()
```