# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
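
To make the "custom server" idea concrete, here is a minimal standard-library sketch. It is not the linked `custom_server` example: `EchoEngine`, `handle_generate`, `make_handler`, `build_server`, and the `/generate` route are hypothetical names introduced only for illustration, and `EchoEngine` stands in for a real `sgl.Engine` so the sketch runs without a GPU. It assumes that a single-prompt `generate` call returns a dict with a `"text"` field.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompt, sampling_params=None):
        # Assumed output shape: a dict with a "text" field.
        return {"text": f" echo: {prompt}"}


def handle_generate(engine, raw_body):
    """Turn a JSON request body into an (HTTP status, JSON response body) pair."""
    try:
        payload = json.loads(raw_body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected a JSON object with a 'prompt' field"}
        return 400, json.dumps(error).encode("utf-8")
    output = engine.generate(prompt, payload.get("sampling_params", {}))
    return 200, json.dumps({"text": output["text"]}).encode("utf-8")


def make_handler(engine):
    """Build a request handler class bound to one engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/generate":
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            status, body = handle_generate(engine, self.rfile.read(length))
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return GenerateHandler


def build_server(engine, host="127.0.0.1", port=0):
    """Create (but do not start) an HTTP server; port=0 lets the OS pick a free port."""
    return HTTPServer((host, port), make_handler(engine))
```

With a real engine, something like `build_server(llm, port=8000).serve_forever()` would start serving requests. Keeping the request logic in `handle_generate` separates it from the transport, so it can be unit-tested or moved to an async framework later.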



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, Jupyter, or other code that already runs an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
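
For background, the standard library refuses to start a second event loop with `asyncio.run` while one is already running, which is the situation inside a notebook cell. The following stdlib-only sketch reproduces that error; the function names `inner` and `outer` are illustrative and not part of SGLang, and the assumption is that this is the kind of failure `nest_asyncio.apply()` is meant to avoid.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```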

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Veronica. I am 13 years old and I was born in China. I have a dream. My dream is to travel to space. I want to fly in the space rockets. I want to look at the stars and see many things that I have never seen before. I want to do it because I want to learn more about space. When I grow up, I want to work for the people. I want to be a scientist. I want to help more people around the world. So far, I have learned more about space. I want to fly in the space rockets. I have learnt about the moon. I have
Prompt: The president of the United States is
Generated text:  an elected official.

Does it follow that if "The president of the United States is an elected official?"?
Pick from:
 A). yes
 B). it is not possible to tell
 C). no

A). yes
You are a helpful assistant with no show. What is the question? Unfortunately, the prompt you provided is incomplete and doesn't specify the question. Could you please provide more information or the complete pro
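
In an offline batch job, the generated text is usually saved rather than printed. The sketch below pairs each prompt with its output and serializes the pairs as JSON Lines. It relies only on the `output['text']` field shown above; the helper name `to_jsonl`, the `"completion"` key, and the placeholder data are assumptions made for illustration.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text, one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": prompt, "completion": output["text"]}, ensure_ascii=False)
        for prompt, output in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Hand-written placeholder outputs with the same dict shape as above.
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]
print(to_jsonl(example_prompts, example_outputs))
```

With a real engine, `to_jsonl(prompts, outputs)` could be called directly on the result of `llm.generate(prompts, sampling_params)` and the string written to a file.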

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [occupation] who has been [number of years] in the industry. I am passionate about [reason for interest] and I am always looking for ways to [action or goal]. I am a [type of person] who is [positive or negative] about life and I enjoy [reason for enjoyment]. I am a [type of person] who is [positive or negative] about life and I enjoy [reason for enjoyment]. I am a [type of person] who is [positive or negative] about life and I enjoy [reason for enjoyment]. I am a [type of person] who

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. 

A. True
B. False
A. True

Paris is the capital of France and is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. It is also a major cultural and economic center in Europe. The city is home to many famous museums, including the Louvre, the Musée d'Orsay, and the Musée d'Art Moderne. Paris is also known for its cuisine, fashion, and music scene. The city is a popular tourist destination and is home to many international organizations and institutions. Overall, Paris is a vibrant and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some possible future trends in AI include:

1. Increased use of AI in healthcare: AI is already being used in healthcare to diagnose and treat diseases, and it has the potential to revolutionize the field. AI-powered diagnostic tools, such as AI-powered X-ray machines, are already being used to improve patient outcomes.

2. AI in manufacturing: AI is already being used in manufacturing to optimize production processes, reduce waste, and improve quality control. AI-powered robots and drones are also being used to automate tasks and increase efficiency.

3
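
The "overlap removal" mentioned above can be pictured as follows. This is a minimal sketch of one way to merge streamed chunks whose text partially repeats what has already been received. It is not necessarily how `stream_and_merge` is implemented, and the helper names and sample chunks are illustrative assumptions.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Merge streamed chunk dicts into one string without repeated text."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk["text"])
    return merged


# Illustrative chunks whose text partially repeats what came before.
sample_chunks = [{"text": " Paris"}, {"text": " Paris is"}, {"text": " is the capital"}]
print(merge_chunks(sample_chunks))
```

One caveat of this heuristic: a chunk that legitimately repeats the end of the previous text (for example `"ha"` followed by `"ha"`) would also be trimmed, so it is best suited to streams that are known to resend earlier text.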



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [Profession/Activity]. I'm [Age] years old and [Field of Study]. I have [Background] experience in [Field of Study]. I'm [Favorite Subject/Interest/Job]. I'm [Level of Interest]. I'm [Lifestyle]. I'm [Job Title]. I'm [What you hope to achieve with this career]. I'm [What you hope to achieve with this career]. [If you haven't already, please state the current location, major, and primary job of the character]. [If you haven't already, please state the current location, major,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its vibrant nightlife, beautiful architecture, and rich cultural heritage. It is the largest city in Europe, with a population of over 2. 8 million people and a population density of over 2, 000 people per square kilometre. Paris has been a major center of E
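
When requests arrive independently, for example inside a custom server, an awaitable generate call can be combined with `asyncio.gather` so that several requests are in flight at once. The sketch below shows the pattern with a hypothetical stand-in coroutine, `fake_async_generate`, so that it runs without a GPU. With a real engine you would presumably pass `llm.async_generate` instead, assuming it accepts a single prompt in the same way.

```python
import asyncio


async def fake_async_generate(prompt, sampling_params):
    """Hypothetical stand-in for llm.async_generate on a single prompt."""
    await asyncio.sleep(0.01)  # simulate time spent generating
    return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(generate_fn, prompts, sampling_params):
    """Issue one request per prompt and wait for all of them together."""
    tasks = [generate_fn(prompt, sampling_params) for prompt in prompts]
    # gather returns results in the same order as the awaitables it was given
    return await asyncio.gather(*tasks)


demo_prompts = ["Hello, my name is", "The capital of France is"]
demo_outputs = asyncio.run(
    generate_concurrently(fake_async_generate, demo_prompts, {"temperature": 0.8})
)
for prompt, output in zip(demo_prompts, demo_outputs):
    print(f"{prompt!r} -> {output['text']!r}")
```

For a fixed list of prompts known up front, passing the whole list in a single call, as in the cell above, is simpler and leaves batching to the engine.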

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name], and I'm a [career] with [number] years of experience in [industry]. I'm a reliable, hardworking, and highly motivated individual who thrives on [job responsibility], and I enjoy [job responsibility] a lot. I'm always ready to learn, grow, and improve, and I'm committed to making a positive impact on [industry] through my work. Thank you.

[Job Title] Introduction:

Hello, my name is [name] and I'm a [career] with [number] years of experience in [industry]. I'm a reliable, hardworking, and highly motivated individual


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a city with a rich history and culture. Paris is located in the center of France and is home to many iconic landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. It is also a UNESCO World Heritage site and is known for its fashion, art, and cuisine. Paris has a vibrant street culture and is a cosmopolitan city with a diverse population. It is a popular tourist destination and is a significant economic player in Europe. It is home to many museums, galleries, and concert halls, making it a popular destination for cultural events and conferences. Additionally, Paris has a strong European identity



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and rapidly evolving, with many possibilities to consider. Here are some potential trends that could emerge in the coming years:

1. Increased focus on ethics: AI is becoming more advanced and pervasive, but it also raises ethical concerns. Future AI systems will need to consider the implications of their actions on society, including issues such as privacy, bias, and fairness. As a result, there will likely be a greater focus on ethical guidelines and best practices for AI development and deployment.

2. More autonomous and smart robots: In the coming years, we may see more advanced robots that are capable of making decisions, learning, and adapting to new situations




In [6]:
llm.shutdown()