# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
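
The reason for this patch can be seen with the standard library alone. The sketch below uses only `asyncio` (the names `inner` and `outer` are made up for illustration) and shows that calling `asyncio.run()` while an event loop is already running raises a `RuntimeError`. This is the situation that `nest_asyncio.apply()` is meant to work around in IPython and similar environments.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, so a nested
    # asyncio.run() call is rejected by the standard library.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```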

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Trish and I am a 20 year old US Navy SEAL. In my spare time, I like to read, paint, play guitar and cook. I graduated with a degree in philosophy from the University of Notre Dame in 2009. I was born and raised in Massachusetts, USA.
I served in the 56th Infantry Division during Operation Desert Storm and participated in the anti-terrorist operation "CIA Nightline". I also served as a sniper for the 101st Airborne Division and was the deadliest sniper on the night of June 9, 2005, the day
Prompt: The president of the United States is
Generated text:  200 inches tall. If the president and his first two vice presidents were each 1/3 as tall as the president, what is the height of the president and his first two vice presidents combined?
To find the total height of the president and his first two vice presidents combined, we need to calculate the height of each of them and then sum them up.

1. The height of the president is given as 200 inches.


### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [Age] year old [Occupation]. I am a [Skill] who has been [Number of Years] years in the field of [Field of Interest]. I am [Gender] and I have [Number of Children] children. I am [Occupation] and I am [Age]. I am [Gender] and I have [Number of Children] children. I am [Occupation] and I am [Age]. I am [Gender] and I have [Number of Children] children. I am [Occupation] and I am [Age]. I am [Gender] and I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a bustling metropolis with a rich history and a diverse population of over 10 million people. The city is home to iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum, as well as numerous museums, theaters, and restaurants. Paris is known for its fashion industry, art scene, and cultural events, making it a popular destination for tourists and locals alike. The city is also home to the French Parliament, the French Academy of Sciences, and the French National Library. Paris is a city of contrasts, with its modern architecture

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some potential trends that are likely to shape the future of AI:

1. Increased automation and artificial intelligence: As AI technology continues to improve, we can expect to see more automation and artificial intelligence in our daily lives. This could include the development of more advanced robots and machines that can perform tasks that are currently done by humans, such as manufacturing, healthcare, and transportation.

2. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI
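
The cell above mentions overlap removal. One way to picture the idea, assuming consecutive streamed chunks can repeat text that was already received, is the sketch below. `merge_chunks` is a hypothetical helper written for illustration only; it is not the implementation of `stream_and_merge` in `sglang.utils`, and the sample chunks are invented.

```python
def merge_chunks(chunks):
    """Merge streamed text chunks, dropping any prefix of a chunk that
    repeats the tail of the text collected so far."""
    merged = ""
    for chunk in chunks:
        # Look for the longest suffix of `merged` that is also a prefix
        # of `chunk`, trying the largest candidate first.
        max_overlap = min(len(merged), len(chunk))
        overlap = 0
        for size in range(max_overlap, 0, -1):
            if merged.endswith(chunk[:size]):
                overlap = size
                break
        # Append only the part of the chunk that is new.
        merged += chunk[overlap:]
    return merged


# Invented chunks in which each piece repeats the end of the previous one.
chunks = ["The capital", " capital of", " of France", " is Paris."]
print(merge_chunks(chunks))
```

A heuristic like this can also trim text that the model repeated on purpose (for example `"ha"` followed by `"ha"`), so it only makes sense when the stream is known to contain overlaps.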



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  ___________ and I'm a/an _________________. I come from a/an _____________. My background is ____________. I'm ____________. What brings you to this place?

I'm excited to meet you and to learn about you and your background. I'd love to get to know you better and to share my experiences and knowledge with you.

---

Remember, my name is **Your Name** and I'm a/an **Your Profession**. I come from a **Your Background**. My background includes **Your Education**. I'm **Your Experience**. What brings you to this place? What do you want

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is the largest city in the country and serves as its political, cultural, and economic center.
Paris is often considered the "city of love" due to its famous romantic landmarks, including the Eiffel Tower, Louvre Museum, 
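
When prompts arrive one at a time rather than as a ready-made list, a common `asyncio` pattern is to start one request per prompt and await them together with `asyncio.gather`. The sketch below illustrates that pattern with `FakeEngine`, a hypothetical stand-in that only mimics the call shape of `async_generate` and performs no inference. Whether this is preferable to passing the whole list in one call, as the cell above does, depends on the workload and is not something this sketch measures.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for an engine object. It only mimics the
    shape of `async_generate` and does no real inference."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate waiting on the model
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params):
    # Create one coroutine per prompt and wait for all of them together.
    # asyncio.gather returns results in the same order as its inputs.
    requests = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*requests)


prompts = ["Hello, my name is", "The capital of France is"]
outputs = asyncio.run(generate_all(FakeEngine(), prompts, {"temperature": 0.8}))
for prompt, output in zip(prompts, outputs):
    print(prompt, "->", output["text"])
```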

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert character's name]. I'm [insert character's age], and I am [insert character's occupation]. I'm [insert character's height, weight, gender, etc.]. I have [insert character's hobbies, interests, etc.]. I enjoy [insert character's hobby or activity]. I'm also [insert character's personal trait or characteristic]. I like [insert character's favorite food, movie, book, etc.]. And [insert character's favorite place or activity]. Thank you for asking, and good luck with your journey. ¡Hasta luego! [insert character's name]

This is a neutral self-int



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

That statement is true and complete. Paris is the capital city of France. It is the largest city in France and the most populous city in the European Union by population. Paris is known for its stunning architecture, rich history, and vibrant culture. Many people consider Paris to be one of the most beautiful and modern cities in the world. Paris is home to the Louvre Museum, the Eiffel Tower, and other iconic landmarks. It is also the home to the Eiffed and Champs-Elysées districts, as well as the Notre-Dame Cathedral and the Sacré-Cœur Basilica. The city is also


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by several trends:

1. Increased focus on ethical AI: With increasing awareness about the impact of AI on society, there will likely be a greater emphasis on ethical AI. This includes issues such as bias, transparency, accountability, and privacy.

2. Integration of AI into various sectors: AI is increasingly being integrated into various sectors, from healthcare to finance to transportation. As this integration continues, we can expect to see more data being used to train AI models and more opportunities for AI to be used in more complex problem-solving scenarios.

3. Rise of AI-powered autonomous vehicles: Autonomous vehicles are already starting to become more prevalent




```python
llm.shutdown()
```
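
In a script, as opposed to a notebook, it can be useful to make sure `shutdown()` runs even when generation raises. A minimal sketch of that pattern with `contextlib.contextmanager` follows. `FakeEngine` and `managed_engine` are hypothetical names introduced here so the example runs without a model; the only assumption about the real engine is that it exposes `shutdown()`, as used above.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in for an engine; no real inference happens."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    # Hand the engine to the `with` block and always shut it down after,
    # whether the block finished normally or raised.
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed_engine(engine) as llm:
        llm.generate([], {"temperature": 0.8})  # raises on purpose
except ValueError as exc:
    print("generation failed:", exc)

print("shut down:", engine.is_shut_down)
```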