# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, for use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
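
For context, the snippet below is a small standard-library sketch (it does not use SGLang) of the failure that `nest_asyncio` works around: calling `asyncio.run` while an event loop is already running, as is the case inside an IPython or Jupyter cell, raises a `RuntimeError`. The names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # At this point an event loop is already running, much like in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # a nested run is rejected by plain asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio.apply()` in effect, nested calls of this kind are permitted, which is why the patch is needed before running the asynchronous examples in a notebook.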

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Mme Roseline. I am a senior lecturer at the University of Reading, where I teach Mathematics and Statistics. I have been working in the field of medical imaging since 2008, and I hold a PhD in Mathematical Imaging from Imperial College London. I am a member of the British Society of Medical Imaging and a former member of the Statistical Society of Australia. I am interested in nonparametric methods, machine learning and applications in medicine. In my research, I have worked on segmentation methods, image registration, inverse problems and registration of MR images.\nEmail: roseseline@reading.ac.uk\n\nI am a
Prompt: The president of the United States is
Generated text:  seeking a new term of office. How would you explain the term of office in this context? Please write a sentence to support your response. The term of office refers to the duration of the position or role that an individual holds, such as a president. In this case, the president
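
In an offline batch job the results are usually persisted rather than printed. The sketch below shows one way to do that, assuming only what the cell above shows: `outputs` is a list with one dict per prompt, and each dict has a `"text"` key. The helper `to_jsonl` and the record field names are hypothetical, and the sample data is a stand-in, not real model output.

```python
import io
import json
from typing import Dict, List


def to_jsonl(prompts: List[str], outputs: List[Dict[str, str]]) -> str:
    """Serialize prompt/completion pairs as JSON Lines (one record per line)."""
    if len(prompts) != len(outputs):
        raise ValueError("expected one output per prompt")
    buffer = io.StringIO()
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        buffer.write(json.dumps(record, ensure_ascii=False) + "\n")
    return buffer.getvalue()


# Stand-in data shaped like the `outputs` list above (illustrative only).
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]
jsonl = to_jsonl(example_prompts, example_outputs)
print(jsonl, end="")
```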

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French National Library. Paris is a cultural and economic center with a rich history dating back to the Roman Empire and the French Revolution. It is a major transportation hub, with the Eiffel Tower serving as a symbol of the city's importance in global affairs. Paris is also known for its fashion industry, with designers such as Coco Chanel and Yves Saint Laurent being famous. The city is home to many museums, including the Musée

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that could be expected in the coming years:

1. Increased integration of AI into everyday life: AI is already being integrated into many aspects of our lives, from self-driving cars to voice assistants like Siri and Alexa. As the technology continues to advance, we can expect to see even more integration of AI into our daily routines, from smart homes to virtual assistants that can assist with tasks like grocery shopping or scheduling appointments.

2. Greater emphasis on ethical and responsible AI: As AI becomes more
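
The "overlap removal" mentioned in the cell above is not explained in this document. One plausible way to implement it, sketched here as an assumption rather than as the library's actual code, is to drop from each new chunk the longest prefix that already appears at the end of the text collected so far. The names `trim_overlap`, `merge_stream`, and `fake_stream` are illustrative, and `fake_stream` is a stand-in for a streaming engine call.

```python
from typing import Dict, Iterable, Iterator


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks: Iterable[Dict[str, str]]) -> str:
    """Concatenate streamed chunks, skipping text that was already emitted."""
    final_text = ""
    for chunk in chunks:
        final_text += trim_overlap(final_text, chunk["text"])
    return final_text


def fake_stream() -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for a streaming call; not an SGLang API."""
    yield {"text": " Paris"}
    yield {"text": " Paris is"}  # repeats " Paris" from the previous chunk
    yield {"text": " is the capital"}  # repeats " is"


merged = merge_stream(fake_stream())
print(repr(merged))
```

A trade-off of this heuristic is that it cannot tell a re-sent fragment from text the model genuinely repeated, so a chunk that legitimately starts with the tail of the previous text would also be trimmed.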



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am a [your profession or profession theme]. I am currently working as a [your role] at [your company or organization]. I am passionate about [your interest or interest theme], and I am always looking to learn and grow. I am committed to [your mission or mission theme], and I am excited to contribute to the success of [your organization or company]. I am always looking for opportunities to grow and develop, and I am eager to be a part of a team that values and respects diversity. I am a [your character trait or character trait theme], and I am always willing to help and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is located on the banks of the Seine River and has a rich history dating back to ancient times. Paris is the second-largest city in France and the third-largest in the
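
The asynchronous variant is mainly useful when generation has to share an event loop with other work. The sketch below illustrates that pattern with `asyncio.gather`; `fake_async_generate` is a hypothetical stand-in that only mimics the shape of the awaited call above (a list of dicts with a `"text"` key), and `heartbeat` represents any other coroutine running alongside it.

```python
import asyncio
from typing import Dict, List, Tuple


async def fake_async_generate(prompts: List[str]) -> List[Dict[str, str]]:
    """Hypothetical stand-in for an awaitable batch call; not an SGLang API."""
    await asyncio.sleep(0.01)  # simulate time spent generating
    return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks: List[int]) -> None:
    """Other work that keeps making progress while generation is awaited."""
    for i in range(3):
        ticks.append(i)
        await asyncio.sleep(0)


async def run_batch(prompts: List[str]) -> Tuple[List[Dict[str, str]], List[int]]:
    ticks: List[int] = []
    # Both coroutines are scheduled on the same loop and interleave.
    outputs, _ = await asyncio.gather(fake_async_generate(prompts), heartbeat(ticks))
    return outputs, ticks


outputs, ticks = asyncio.run(run_batch(["a", "b"]))
print(outputs, ticks)
```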

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm a [Salaried or Freelancer] who loves to [mention something you enjoy doing]. I'm passionate about [mention something you're interested in or know a lot about]. I'm looking for a [aim for] to help you [explain what you'd like to accomplish with this position]. I'm always looking for new experiences and challenges to meet these goals. My background is [mention your relevant experience or education]. Thanks for taking the time to chat with me! How can I help you? [Your Name] (or whatever your full name is) Looking forward to hearing from you! [



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in Europe by population, with a population of over 2.5 million people. Paris is home to many iconic landmarks, including the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Palace of Versailles. The city is known for its historical significance, romantic ambiance, and culinary excellence, making it a popular tourist destination. Paris is also the capital of France and has a rich cultural and historical heritage. The city is constantly evolving and becoming more diverse and exciting, with new developments and events taking place regularly. The city is a world-renowned hub for finance, business,



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and promising, and it is set to continue to evolve and expand its applications in a number of ways. Here are some possible trends in AI that could shape the future of technology:

1. Increased automation: As more and more jobs become automated, AI will become an increasingly important tool in automating routine tasks. This could lead to a more efficient, less human, and more productive society.

2. Personalized AI: AI will become more personalized, with each person being able to receive a tailored recommendation based on their preferences and behaviors.

3. AI in healthcare: AI will be used to improve the accuracy of medical diagnoses and treatments,
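
The `async for` loop in the cell above prints each chunk with `end=""`, so the chunks form one continuous line as they arrive. The sketch below shows the same consumption pattern in isolation and also keeps the chunks so the full text is available afterwards. `fake_token_stream` is a hypothetical stand-in for the chunk stream and is not an SGLang API.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_token_stream(pieces: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for an async chunk stream."""
    for piece in pieces:
        await asyncio.sleep(0)  # hand control back to the loop between chunks
        yield piece


async def collect(pieces: List[str]) -> str:
    parts: List[str] = []
    async for piece in fake_token_stream(pieces):
        print(piece, end="", flush=True)  # show each chunk as soon as it arrives
        parts.append(piece)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = asyncio.run(collect([" Paris", ".", " It is", " the capital"]))
```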




In [6]:
llm.shutdown()