# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
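
As a rough illustration of the custom-server idea, the sketch below shows only the core of a request handler: parse a JSON body, call the engine, and serialize the generated texts. It is not the linked `custom_server.py`. `StubEngine`, `handle_generate`, and the JSON field names are hypothetical, and the stub merely mimics the list-of-dicts output shape with a `"text"` key so that the sketch runs without a GPU.

```python
import json


class StubEngine:
    """Hypothetical stand-in for the engine; it only echoes the prompts."""

    def generate(self, prompts, sampling_params):
        # Mimic a list of dicts that each carry a "text" field.
        return [{"text": f" (echo of {p!r})"} for p in prompts]


def handle_generate(engine, request_body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response by calling the engine."""
    payload = json.loads(request_body)
    prompts = payload["prompts"]
    sampling_params = payload.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    return json.dumps({"texts": [o["text"] for o in outputs]}).encode("utf-8")


body = b'{"prompts": ["Hello"], "sampling_params": {"temperature": 0.8}}'
response = handle_generate(StubEngine(), body)
print(response.decode("utf-8"))
```

In a real server, a function like this would sit behind whichever web framework you prefer, with a real engine instance passed in place of the stub.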



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
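
To see why this is needed, the standard-library sketch below reproduces the underlying problem without SGLang: calling `asyncio.run()` while an event loop is already running raises a `RuntimeError`. Notebook kernels typically already run such a loop, and `nest_asyncio.apply()` patches asyncio so that nested calls are allowed. The function names `inner` and `outer` are made up for this illustration.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are inside a running event loop here, similar to a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call; plain asyncio refuses this
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```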

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  George. I like playing football with my friends. My best friend is Bob. We often play football at the school. One day I had an accident ( . ) It hurt me very much. I couldn't go to play football with my friends anymore. I was very sad. One day I found a new toy. I brought it to my friend Bob. He loved the toy. He told me that it would be great for me. What can we learn from the story? A) It's important for people to have friends. B) People can't play football with friends after accidents. C) It's hard to get back your
Prompt: The president of the United States is
Generated text:  seeking his successor, and a popular candidate is named Mitt Romney. The president wants to convey a message of unity by choosing the person's slogan: "It's not about who you know, but about what you do." Let's represent the president's slogan as a function \(f(x) = 100\cos(\pi x) - 50\), where \(x\) is the rank of the candidate in the order of popularity. The functio
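
For intuition about the two sampling parameters used above, the sketch below applies temperature scaling and top-p (nucleus) filtering to a toy distribution. It follows the textbook definitions and is not SGLang's sampler. The function names and the example logits are made up for illustration.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the surviving probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for four candidate tokens
sharp = apply_temperature(logits, 0.2)  # low temperature: peaked
flat = apply_temperature(logits, 0.8)  # higher temperature: flatter
nucleus = top_p_filter(flat, 0.95)  # drops the unlikely tail

print("top token probability at T=0.2:", round(sharp[0], 3))
print("top token probability at T=0.8:", round(flat[0], 3))
print("tokens kept by top_p=0.95:", sorted(nucleus))
```

A lower temperature concentrates probability on the most likely token, while `top_p` removes the low-probability tail before sampling.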

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm passionate about [job title] and I'm always looking for ways to [job title] my skills and knowledge. I'm excited to learn and grow with you. How can I help you today? [Name] [Company name] [Job title] [Number of years] [Company name] [Job title] [Number of years] [Company name] [Job title] [Number of years] [Company name] [Job title]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting the world's largest museums, theaters, and opera houses. Paris is a popular tourist destination and a major hub for international business and diplomacy. The city is home to many famous French artists, writers, and musicians, and is known for its rich history and cultural heritage. Paris is a vibrant and dynamic city with a rich cultural and artistic heritage, and is a major center of European politics and diplomacy. The city is also known for its fashion industry

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some of the most likely trends that could be expected in the future:

1. Increased automation and artificial intelligence: As AI becomes more advanced, it is likely to become more prevalent in many industries, including manufacturing, healthcare, transportation, and finance. This could lead to increased automation and the creation of new jobs, but it could also lead to job displacement for some workers.

2. Improved privacy and security: As AI becomes more advanced, there will be an increased need for privacy and security measures to protect
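
The "overlap removal" mentioned in the cell above can be pictured with the sketch below. It assumes that a streamed chunk may repeat the tail of the text received so far, and it trims that repeated part before appending. `drop_overlap` and `merge_chunks` are hypothetical names, and the actual `stream_and_merge` helper in `sglang.utils` may work differently.

```python
def drop_overlap(existing: str, chunk: str) -> str:
    """Remove from the start of chunk the longest prefix that existing ends with."""
    longest = min(len(existing), len(chunk))
    for size in range(longest, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks while skipping text that was already received."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


# Simulated stream in which consecutive chunks partly repeat earlier text.
merged = merge_chunks(["Hello wor", "world", "ld!"])
print(merged)
```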



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am an AI. What can I do for you? Let's get started! I'm here to help you find the information you need. Whether you need help with [insert something like writing, researching, or something else], I'll do my best to assist you. Let's chat and explore how we can help you! Let's get started! [Insert any appropriate introductions or background information, such as [insert your profession or experience, etc.]] Let's get to know each other and find out what we can do together. [Insert any additional information, such as [insert any skills, education,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the eastern part of the country and capital of the Languedoc-Roumanie region. It is the largest city in France by population and is known for its romantic architecture, historical sites, an
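
One reason to prefer the asynchronous interface is that several requests can be awaited at the same time. The sketch below illustrates the pattern with `asyncio.gather`. `StubAsyncEngine` is a hypothetical stand-in that only imitates the `async_generate(prompts, sampling_params)` call shape used above, so the code runs without a GPU. How much a real engine gains from concurrent calls depends on its scheduler.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in; it sleeps briefly instead of running a model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def run_concurrently(engine, batches, sampling_params):
    # Start all requests, then wait for them together.
    # gather() returns results in the same order as the inputs.
    pending = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*pending)


batches = [
    ["The capital of France is"],
    ["The future of AI is", "Hello, my name is"],
]
results = asyncio.run(
    run_concurrently(StubAsyncEngine(), batches, {"temperature": 0.8})
)
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```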

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am [Your Profession]. I am a [Your Age], [Your Gender] who live in [Your Location]. I am passionate about [Your Passion]. I believe that [Your Passion] is a crucial aspect of our world, and I am committed to [Your Contribution]. I am a [Your Character Trait], and I am always willing to [Your Goal]. I hope to be a [Your Goal] in the future. I am proud to be [Your Profession] and I look forward to [Your Opportunities] with open arms. [Your Name] has always been passionate about [Your Passion], and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

This statement is factually correct and provides the complete answer to the question. It clearly defines the capital city of France, which is Paris. If you need any additional information or have another question, feel free to ask!

1. The capital of France is Paris.
2. Paris is the largest city in France.
3. The population of Paris is over 2.3 million.
4. Paris is a major cultural and economic center in Western Europe.
5. The city's architecture, including the Eiffel Tower, Montmartre, and the Louvre Museum, are well-known worldwide.

If you need to

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a rapidly evolving field with endless possibilities for improvement and innovation. Here are some possible future trends in AI:

1. Enhanced Interpretable AI: One of the biggest trends in AI is the development of more interpretable models. This means that AI systems will be able to explain their decisions and reasoning processes. This will make it easier to understand and trust AI systems and reduce the risk of bias and transparency issues.

2. Autonomous Systems: Autonomous systems will become more prevalent in the future, with the ability to make decisions and actions without human intervention. This will lead to significant improvements in safety, efficiency, and decision-making processes.

3. Improved
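
The consumption pattern in the streaming cell above, an `async for` loop that prints each chunk as it arrives, can be tried without a model. In the sketch below, `fake_stream` is a hypothetical async generator standing in for the engine's stream, and `collect` prints each chunk and also returns the assembled text.

```python
import asyncio


async def fake_stream(chunks):
    """Hypothetical stand-in for a token stream; yields one chunk at a time."""
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield chunk


async def collect(stream):
    """Print chunks as they arrive and return the full text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


chunks = [" Paris", ".", " It", " is", " the", " capital", "."]
full_text = asyncio.run(collect(fake_stream(chunks)))
```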




```python
llm.shutdown()
```