# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in the form of a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
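
To see why this patch matters, the following sketch reproduces the underlying problem using only the standard library: calling `asyncio.run` while an event loop is already running, as is the case inside an IPython or Jupyter cell, raises a `RuntimeError`. The function names `inner` and `outer` are illustrative and are not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # A loop is already running here, just as it is inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.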

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Lisa. I am a teacher. I have a new class. It's the first day. I'm telling the students what to do. They are in fourth grade. I'm very happy to see them. I'm getting excited about teaching this class. The first day is important. It is called the first day of school. It's called the first day of school because it always begins the day. I know some of you have been here before. There will be some of you who did not have a first day of school. But that doesn't bother me. I will continue to do this every year. The first day is always
Prompt: The president of the United States is
Generated text:  a highly publicized office, with many people trying to influence the decisions made by the office. To make the job easier, the president can adopt a number of different strategies. One common strategy that the president can use is a so-called "billionaire plan." The concept behind this strategy is that the president can hire people to buy and sell bonds and
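
Batch jobs usually persist their results. The following is a minimal sketch, assuming each output is a dict with a `text` key as in the cell above. The `fake_outputs` list is a hypothetical stand-in for the return value of `llm.generate`, and `to_jsonl` is an illustrative helper, not an SGLang function.

```python
import io
import json


def to_jsonl(prompts, outputs, fh):
    """Write one JSON object per prompt/output pair to the file-like object fh."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


prompts = ["Hello, my name is", "The capital of France is"]
# Hypothetical stand-in for `llm.generate(prompts, sampling_params)`.
fake_outputs = [{"text": " Lisa."}, {"text": " Paris."}]

buffer = io.StringIO()  # swap in open("results.jsonl", "w") to write to disk
to_jsonl(prompts, fake_outputs, buffer)
lines = buffer.getvalue().splitlines()
print(lines[1])
```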

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? I'm a [insert a short description of your personality or skills]. What do you like to do in your free time? I enjoy [insert a hobby or activity you enjoy]. What's your favorite book or movie? I love [insert a favorite book or movie]. What's your favorite hobby? I love [insert a hobby you enjoy]. What's your favorite place to go? I love [insert a favorite place you've been to]. What's

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in the country and is known for its rich history, beautiful architecture, and vibrant culture. Paris is home to many famous landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. The city is also known for its annual festivals and events, including the Eiffel Tower Festival and the Parisian Carnival. Paris is a popular tourist destination and is a major economic and cultural center in France. It is also home to many international organizations and institutions, including the French Academy of Sciences and the French National Library. The city is known for its cuisine, including French cuisine, and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes, reduce costs, and improve the quality of care. As AI technology continues to improve, we can expect to see even more widespread use of AI in healthcare.

2. Increased use of AI in finance: AI is already being used in finance to improve fraud detection, risk management, and portfolio optimization. As AI technology continues to improve, we can expect to
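
The banner above mentions overlap removal. One way to picture it: if a streamed chunk repeats the tail of the text received so far, only the new part is appended. The sketch below illustrates that idea with the hypothetical helpers `trim_overlap_sketch` and `merge_chunks`. It is an assumption about the general technique, not necessarily how `stream_and_merge` is implemented.

```python
def trim_overlap_sketch(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that is also a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found, keep the chunk unchanged


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that repeats the end of the merged result."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


merged = merge_chunks(["Hello wor", "world, how", "how are you"])
print(merged)
```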



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text: ... um, maybe not me. I'm just a character in a story, you know. Do you have any questions? I'm... well, I'm a character in a story. No, really, that's not quite right either. How about you? What's your name? That's great, and that's a very concise way to describe you. Let's call me Alex, then. Nice to meet you. What brings you to this place?

Hi, I'm Alex, and I'm just a character in this story. I wanted to join you and be an actor or director. What do you think, and how

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "La Défense," and is the cultural, political, and economic center of France. It is located on the Mediterranean coast, in the Upper Rhône Valley, and is the largest city in Europe by population. Paris has a rich history, including ancient ruins and medieval castles, and is k
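
An async API is mainly useful when generation has to share the event loop with other work. The following is a minimal sketch of issuing several requests concurrently with `asyncio.gather`. `FakeEngine` is a hypothetical stand-in so that the example runs without a model, and its single-dict return value is an assumption.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine so the sketch runs without a GPU."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate waiting on generation
        return {"text": f" <completion for: {prompt}>"}


async def run_concurrently(engine, prompts, sampling_params):
    # Start one request per prompt and wait for all of them.
    # gather returns results in the same order as the inputs.
    requests = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*requests)


prompts = ["The capital of France is", "The future of AI is"]
results = asyncio.run(run_concurrently(FakeEngine(), prompts, {"temperature": 0.8}))
for prompt, result in zip(prompts, results):
    print(prompt, "->", result["text"])
```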

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert fictional character's name here]. I am a [insert a characteristic of the character here, for example, "outgoing," "witty," "ambitious," "loving," etc.]. I am a [insert one of the following: [insert a profession], [insert a hobby], [insert a public image], or [insert a personal interest or hobby]. I am a [insert one of the following: [insert an age range], [insert a social class], or [insert a gender]. I am a [insert a genre or type of writing]. I am [insert one of the following: [insert



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is the largest and most populous city in the country. It is also known as "La Ville Blanche" because of its snow-capped Montmartre hill. Paris is known for its rich history, beautiful architecture, and famous landmarks such as Notre-Dame Cathedral and the Eiffel Tower. The city is also home to many world-renowned museums, theaters, and cultural institutions. Paris is a popular tourist destination, attracting millions of visitors each year. The French language, French cuisine, and French culture are also a major part of Paris.

France's capital city is Paris. It is home to many world-ren



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain, but several trends are likely to shape its development and evolution. Here are some of the most likely future trends in AI:

1. Increased AI integration with other technologies: AI will become more integrated with other technologies, such as sensors, drones, and robotics. This integration will allow AI systems to better understand and interact with the physical world around us.

2. AI will become more autonomous: Autonomous AI systems will become more advanced and capable, with the ability to make decisions and take actions without human intervention. This could lead to a greater reliance on AI in decision-making and robotics.

3. AI will become more ethical and transparent:
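
The streaming loop above prints each chunk as soon as it arrives. When the full text is also needed afterwards, the chunks can be collected while they are printed. The following is a minimal sketch in which `fake_stream` is a hypothetical stand-in for `async_stream_and_merge`, so that the example runs without a model.

```python
import asyncio


async def fake_stream(chunks):
    """Hypothetical stand-in for a streaming call: yields text chunks one by one."""
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield chunk


async def collect(stream):
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)  # show progress as chunks arrive
        parts.append(chunk)
    print()  # newline once the stream is exhausted
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream([" Paris", ",", " which", " is"])))
```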




In [6]:
llm.shutdown()