# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that runs inside a nested event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
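
To see why this is needed, the sketch below reproduces the underlying error using only the standard library: calling `asyncio.run` from code that is already running inside an event loop (as a notebook cell does) raises a `RuntimeError`. `nest_asyncio.apply()` patches `asyncio` so that this kind of nesting is allowed. The helper names `inner` and `outer` are illustrative and not part of SGLang.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running event loop here, which mimics a
    # notebook cell. Starting a second loop with asyncio.run() is rejected.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```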

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Penny and I'm a professional, freelance writer. I produce and edit content for websites and other media. My specialties are at home, at work, and on the go. I get a lot of queries about my services and I'm excited to help make your writing success. How can I get started with my freelance writing services? Can you recommend any specific resources or tools that can help me improve my writing skills and provide better content for clients? I want to make sure my clients are able to trust me with their work, so I want to ensure I'm providing quality content. To get started, I recommend that you start by researching your
Prompt: The president of the United States is
Generated text:  proposing to fund a new federal agency to help reduce the cost of producing vaccines. If each vaccine costs $500 and the agency aims to reduce the cost by 25%, how much would it cost to produce 500,000 vaccines under this new plan?
To determine the cost of producing 500,
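
The `temperature` and `top_p` entries in `sampling_params` control how the next token is sampled. As a rough, generic illustration of these two knobs (not SGLang's actual sampler), the sketch below applies temperature scaling and nucleus (top-p) filtering to a made-up four-token distribution. The function name `top_p_filter` and the logits are hypothetical.

```python
import math


def top_p_filter(logits: dict, temperature: float, top_p: float) -> dict:
    """Return renormalized probabilities of the tokens kept by top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())
    exps = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    kept = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    kept_total = sum(kept.values())
    return {tok: p / kept_total for tok, p in kept.items()}


# Toy logits, invented for this illustration only.
toy_logits = {" Paris": 4.0, " Lyon": 2.0, " a": 1.0, " the": 0.5}
loose = top_p_filter(toy_logits, temperature=0.8, top_p=0.95)
strict = top_p_filter(toy_logits, temperature=0.2, top_p=0.9)
print(loose)
print(strict)
```

With these toy numbers, the lower temperature and smaller `top_p` leave only the single most likely token, while the looser settings keep two candidates.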

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous museums, theaters, and other attractions. Paris is a popular tourist destination and a major hub for international business and diplomacy. The city is known for its rich history, art, and cuisine, and is a major center of politics, science, and culture in France. The city is also home to the French Parliament and the French government. Paris is a vibrant and dynamic city that continues to grow and evolve as a major global city. The city

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some possible future trends in AI include:

1. Increased use of AI in healthcare: AI is already being used in healthcare to diagnose and treat diseases, and it has the potential to revolutionize the field by improving diagnostic accuracy and personalized treatment plans.

2. Increased use of AI in transportation: AI is already being used in transportation to improve traffic flow, reduce congestion, and increase safety. As AI technology continues to advance, we can expect to see even more widespread adoption of this technology in the transportation industry.

3. Increased use of
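
The "overlap removal" mentioned above refers to merging streamed chunks without repeating text that was already received. A minimal sketch of the idea follows, assuming chunks may begin with text that the accumulated output already ends with. The names `strip_overlap` and `merge_chunks` and the simulated chunks are hypothetical; this is not necessarily how `stream_and_merge` is implemented.

```python
from typing import Iterable


def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Simulated stream in which each chunk repeats part of the previous one.
simulated_chunks = [" Paris", " Paris, the", " the city", " city of light"]
merged_text = merge_chunks(simulated_chunks)
print(merged_text)
```

One trade-off of this approach is that a chunk which legitimately repeats the end of the previous text would also be trimmed.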



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I'm a [Type of Work] who specializes in [Your Area of Expertise]. I've been with [Your Company/Agency] for [X years] and have worked on [X projects]. I bring a strong background in [specific skills or expertise], and am eager to learn and grow in my field. I enjoy [Your interests or hobbies]. I'm excited to add my skills to your team and contribute to your success. 
The use of the word "Hello" and "my name" is neutral, while the use of "I'm" is neutral. The use of "neutral self-introduction

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the capital city of France and is known for its iconic landmarks, vibrant culture, and beautiful architecture. It is also home to many famous museums, such as the Louvre and the Musée d'Orsay. In addition to being a major city, Paris is also an i
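
Because `async_generate` is awaitable, it can be combined with other `asyncio` tools. The sketch below shows one such pattern, issuing one request per prompt and collecting the results with `asyncio.gather`. To stay runnable without a GPU it uses `FakeEngine`, a hypothetical stand-in whose `async_generate` takes a single prompt and returns a dict with a `"text"` key; that single-prompt signature is an assumption made for this illustration.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine so the sketch runs without a GPU."""

    async def async_generate(self, prompt: str, sampling_params: dict) -> dict:
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params):
    # Issue one request per prompt and wait for all of them concurrently.
    # asyncio.gather returns results in the same order as the inputs.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


demo_prompts = ["Hello, my name is", "The capital of France is"]
results = asyncio.run(generate_all(FakeEngine(), demo_prompts, {"temperature": 0.8}))
for prompt, result in zip(demo_prompts, results):
    print(prompt, "->", result["text"])
```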

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name]. I'm a freelance writer with a love for writing descriptive prose that captivates the reader. I use various techniques to craft narratives and stories that engage and transport the reader to new worlds. My writing style has been praised for its ability to evoke emotions and evoke a sense of place. I am passionate about sharing my creative thoughts and ideas with the world. I'm looking forward to meeting you. [name] ===>

Please modify the given self-introduction by incorporating the following additional details:

1. Personal background and experiences that set me apart from other writers.
2. Unique skills and techniques I possess that have helped me become


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in the country and has a rich history dating back to the medieval times. The city is home to some of the world's most famous landmarks, including the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is known for its vibrant culture, sophisticated society, and urban living. Paris is also home to many world-renowned institutions and museums, such as the Louvre and the Centre Pompidou. With its diverse neighborhoods, modern architecture, and close proximity to neighboring cities, Paris is one of the most interesting and dynamic cities in the world.

The location of Paris in



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  set to be shaped by a number of emerging trends, including:

1. Increased integration of AI into all aspects of society: This could include everything from personal assistants like Siri and Alexa, to autonomous vehicles, to smart homes and buildings. AI could also be used to create more efficient and personalized healthcare systems, and to improve the quality of education and job creation.

2. Artificial general intelligence: This would be a more powerful form of AI that could perform tasks that would require more intelligence than currently known, such as reasoning, creativity, and decision-making.

3. Ethical AI: As AI systems become more advanced, there will likely be increased




In [6]:
llm.shutdown()
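
In a script, it can be useful to make sure `shutdown()` runs even if generation raises an exception. One possible pattern is a small context manager, sketched below. `StubEngine` and `engine_session` are hypothetical names used only for this illustration; the stub merely records whether `shutdown()` was called.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in that only records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    # Yield the engine and always call shutdown() afterwards,
    # whether the body finishes normally or raises.
    try:
        yield engine
    finally:
        engine.shutdown()


stub = StubEngine()
try:
    with engine_session(stub) as engine:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass
print(stub.is_shut_down)
```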