# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
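
As a rough illustration of the custom-server idea, the core of such a server can be a plain function that maps a request payload to an engine call. The sketch below is not taken from `custom_server.py`: `StubEngine` and `handle_generate` are hypothetical names, and the stub stands in for `sgl.Engine` so the snippet runs without a GPU. It assumes only the call shape used later in this document, namely `generate(prompt, sampling_params)` returning a dict with a `"text"` key.

```python
from typing import Any


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompt: str, sampling_params: dict[str, Any]) -> dict[str, Any]:
        # A real engine would run the model; the stub just echoes the prompt.
        return {"text": f" (echo of: {prompt})"}


def handle_generate(engine: Any, payload: dict[str, Any]) -> dict[str, Any]:
    """Validate a request payload and forward it to the engine."""
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        return {"status": 400, "error": "'prompt' must be a non-empty string"}
    # Fall back to the sampling parameters used elsewhere in this document.
    sampling_params = payload.get(
        "sampling_params", {"temperature": 0.8, "top_p": 0.95}
    )
    output = engine.generate(prompt, sampling_params)
    return {"status": 200, "text": output["text"]}


engine = StubEngine()
ok = handle_generate(engine, {"prompt": "The capital of France is"})
bad = handle_generate(engine, {})
print(ok)
print(bad)
```

Keeping the handler independent of any web framework means the same function could be wrapped by whichever HTTP layer the custom server uses.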



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
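
The patch is needed because IPython and Jupyter already run an event loop, and standard `asyncio` refuses to start a second one inside it. The standard-library-only sketch below reproduces that failure by calling `asyncio.run()` from inside a running loop; `inner` and `outer` are illustrative names. `nest_asyncio.apply()` patches `asyncio` so that such nested calls are allowed.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    coro = inner()
    try:
        # A loop is already running here, so starting another one fails.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```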

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Alex, and I'm a self-taught programmer. I'm learning Python for fun and to make money. I'm serious about my code, but sometimes it's difficult to type out. So, I'm asking for help with formatting.

Is it better to write code in single or double quotes? Here's an example:

```python
print('Hello, world!')
```

Is it better to write the same code like this:

```python
print("Hello, world!")
```

Is one format better than the other?

(I'm using my computer's terminal to run this code on the terminal. This is for educational purposes only
Prompt: The president of the United States is
Generated text:  traveling to Paris, France, to visit a museum. It takes 3 hours to get there by car, and 5 hours by train. What is the total amount of time that the president will spend traveling to the museum, taking the train, and visiting the museum? The president will spend 3 hours traveling to the museum.
The president will spend 5 hours on the train.
Therefore,

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Occupation]. I'm a [Skill or Hobby] enthusiast. I'm always looking for new and exciting things to do, and I'm always eager to learn new things. I'm always looking for ways to improve my skills and knowledge, and I'm always willing to share my knowledge with others. I'm a [Favorite Subject] lover, and I'm always eager to learn more about it. I'm a [Favorite Book] lover, and I'm always eager to read more books. I'm a [Favorite Music] lover, and I'm always eager to listen

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also home to the French Parliament, the French Academy of Sciences, and the French Quarter. Paris is a cultural and historical center with a rich history dating back to the Roman Empire and the French Revolution. It is a major transportation hub, with the Eiffel Tower serving as a symbol of the city's importance in international trade. Paris is also known for its cuisine, with its famous dishes such as croissants, boudin, and escargot. The city is also home to many museums

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way that AI is used and developed. Here are some of the most likely trends that could be expected in the future:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a greater emphasis on ethical considerations. This could include issues such as bias, transparency, and accountability. AI developers will need to be more mindful of the potential impact of their work on society and the environment.

2. Greater use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI becomes

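The "overlap removal" mentioned in the cell above refers to merging streamed chunks so that text repeated at a chunk boundary is not printed twice. A minimal sketch of that idea follows. It is an illustration with made-up chunk contents, not the implementation of `stream_and_merge`, and `trim_overlap_sketch` and `merge_chunks` are hypothetical names.

```python
def trim_overlap_sketch(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends `existing`."""
    max_len = min(len(existing), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: list[str]) -> str:
    """Concatenate chunks, trimming any overlap with the text so far."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


# Assumed chunks whose boundaries repeat a few characters.
chunks = [" Paris, known", " known for its", " its iconic landmarks"]
merged = merge_chunks(chunks)
print(merged)
```

One caveat of this naive approach: text that legitimately repeats across a boundary (for example `"ha"` followed by `"ha"`) would also be trimmed, so it only suits streams where overlap really is duplication.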


### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I am a [Age] year-old computer science student at [University Name]. I have a keen interest in [My favorite field of interest]. I enjoy [reason why I like [My favorite field of interest]]. I am always excited to learn new things and explore new technologies. I am always seeking to make the world a better place. I am [if applicable, a [country/ethnic group] or [gender]]. What is your favorite [activity] or hobby? What do you like to do in your free time? I am always eager to learn more about this character from a neutral perspective and provide a short

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is located in the center of the country and is known for its rich history, diverse cultural heritage, and iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre Dame Cathedral. Par
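
Because `async_generate` is awaited, other coroutines can make progress while generation is pending. The sketch below illustrates this with `asyncio.gather`. `StubAsyncEngine` and `heartbeat` are hypothetical names, and the stub only mimics the call shape shown above (a list of prompts in, a list of dicts with a `"text"` key out), so the snippet runs without a GPU.

```python
import asyncio
from typing import Any


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; only mimics the async call shape."""

    async def async_generate(
        self, prompts: list[str], sampling_params: dict[str, Any]
    ) -> list[dict[str, Any]]:
        await asyncio.sleep(0.01)  # pretend to wait on the model
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def heartbeat(ticks: list[int]) -> None:
    # Other work that keeps running while generation is awaited.
    for i in range(3):
        ticks.append(i)
        await asyncio.sleep(0)


async def run_demo() -> tuple[list[dict[str, Any]], list[int]]:
    engine = StubAsyncEngine()
    ticks: list[int] = []
    outputs, _ = await asyncio.gather(
        engine.async_generate(["The capital of France is"], {"temperature": 0.8}),
        heartbeat(ticks),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(run_demo())
print(outputs[0]["text"], ticks)
```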

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I am a [job title] with [number] years of experience. I started my career at [company name] in [year] and have always loved [career goal or interest]. I am a very hard worker, always looking to learn new things, and love to stay up-to-date with the latest technologies and trends. I am a great communicator, always willing to share my knowledge and insights with others. I am a dedicated and organized person, and I pride myself on being able to manage multiple projects and tasks with ease. I am passionate about [career interest] and I am looking forward to exploring more opportunities

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its iconic Eiffel Tower, beautiful canals, and vibrant arts scene.
France's capital city, Paris, is renowned for its iconic Eiffel Tower, picturesque canals, and thriving arts scene. The city is a bustling hub of culture and innovation, with numerous museums, theaters, and museums showcasing French art and design. Its historical significance as a major city in France and a global center of power, including the Louvre Museum, the Eiffel Tower, and the Champs-Élysées, further solidify its status as the capital. It's a city that combines history, modern

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be characterized by a range of trends, including:

1. Increased automation and artificial intelligence in various industries, such as healthcare, manufacturing, and transportation.

2. Enhanced personalization and customization of AI services, allowing users to interact with them in a more intuitive and personalized way.

3. Improved transparency and accountability in AI systems, as developers are encouraged to disclose their code and ensure that AI is developed and used ethically.

4. Expansion of AI to tackle complex problems beyond the domain of science and technology, such as climate change, poverty, and geopolitical instability.

5. Increasing focus on AI ethics and transparency in the development and deployment

In [6]:
llm.shutdown()