# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
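
As a rough illustration of what the request path of such a server might look like, the sketch below shows a framework-agnostic handler that parses a JSON body, forwards the prompt to the engine, and returns JSON. `handle_generate_request`, the request fields, and `FakeAsyncEngine` are hypothetical names used only for this illustration. The stub stands in for `sgl.Engine` so the snippet runs without a GPU, and it assumes only that the engine exposes `async_generate(prompts, sampling_params)` returning a list of dicts with a `"text"` key, as used later in this document. The linked custom_server script remains the reference.

```python
import asyncio
import json


class FakeAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; mimics only async_generate()."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real engine call would
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


async def handle_generate_request(engine, raw_body: str) -> str:
    """Turn a JSON request body into a JSON response using the engine."""
    request = json.loads(raw_body)
    prompt = request["prompt"]
    # Assumed request field: optional sampling parameters, defaulting to none.
    sampling_params = request.get("sampling_params", {})
    outputs = await engine.async_generate([prompt], sampling_params)
    return json.dumps({"prompt": prompt, "text": outputs[0]["text"]})


if __name__ == "__main__":
    body = json.dumps({"prompt": "The capital of France is"})
    print(asyncio.run(handle_generate_request(FakeAsyncEngine(), body)))
```

Keeping the handler independent of any web framework means the same function could be called from whichever server library you prefer.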



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, Jupyter, or any other environment that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
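
If the same script has to run both as a plain Python program and inside a notebook, one option is to apply the patch only when an event loop is already running. The sketch below is an assumption about how you might structure this, not something SGLang requires. `in_running_event_loop` and `apply_nest_asyncio_if_needed` are hypothetical helper names, and `nest_asyncio` is imported lazily so the plain-script path does not depend on it.

```python
import asyncio


def in_running_event_loop() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False
    return True


def apply_nest_asyncio_if_needed() -> bool:
    """Patch asyncio with nest_asyncio only when a loop is already running.

    Returns True if the patch was applied, False otherwise.
    """
    if not in_running_event_loop():
        return False
    import nest_asyncio  # imported lazily; only needed in nested-loop setups

    nest_asyncio.apply()
    return True


if __name__ == "__main__":
    print("patch applied:", apply_nest_asyncio_if_needed())
```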

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Dan and I am currently a Senior at the University of the West of England. I am a mathematics student studying statistics and physics. I will be attending university in the fall of 2023 in the United Kingdom. I have a keen interest in programming and learning new things. I also enjoy spending time with my family and friends and trying new foods. I am a true self-made man who has been working hard to achieve my goals. I believe in the power of knowledge and I am eager to continue learning and grow as an individual. I am excited to have the opportunity to share my knowledge and passion with others! What is your
Prompt: The president of the United States is
Generated text:  a title of honor that is not commonly given. President Bush, the president of the United States from 1989 to 2009, was elected president in a special election where two-thirds of the nation's voters cast their vote. The actual number of votes cast in this election was 517,020,4
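
For long prompt lists you may want to receive results incrementally, for example to write them to disk as you go. The sketch below shows one way to do that by calling `generate` on fixed-size slices. It assumes only what the cell above shows: `generate(prompts, sampling_params)` returns one dict per prompt with a `"text"` key. `generate_in_chunks` and `FakeEngine` are hypothetical names, and the stub replaces `sgl.Engine` so the snippet runs without a GPU.

```python
from typing import Dict, Iterator, List, Tuple


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine; mimics only generate()."""

    def generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict]:
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


def generate_in_chunks(
    engine, prompts: List[str], sampling_params: Dict, chunk_size: int = 2
) -> Iterator[Tuple[str, str]]:
    """Yield (prompt, generated_text) pairs, calling the engine once per slice."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = engine.generate(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            yield prompt, output["text"]


if __name__ == "__main__":
    prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
    params = {"temperature": 0.8, "top_p": 0.95}
    for prompt, text in generate_in_chunks(FakeEngine(), prompts, params):
        print(f"Prompt: {prompt}\nGenerated text: {text}")
```

Because the engine already schedules a batch efficiently, slicing is unlikely to improve throughput; it is only a convenience for handling results in smaller pieces.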

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French National Library. Paris is a cultural and economic center that plays a significant role in French politics and society. It is also a popular tourist destination, known for its rich history, art, and cuisine. The city is home to many famous French artists, writers, and musicians, and is considered one of the most beautiful cities in the world. Paris is a city of contrasts, with its modern architecture and historical landmarks

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way that AI is used and developed. Here are some of the most likely trends that are likely to shape the future of AI:

1. Increased focus on ethical AI: As more people become aware of the potential risks and ethical concerns associated with AI, there is likely to be an increased focus on ethical AI. This could include things like ensuring that AI systems are transparent, accountable, and fair, and that they are used in a way that is consistent with human values and principles.

2. Greater use of AI in healthcare: AI is already being used in healthcare to
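
The "overlap removal" mentioned in the cell above can be pictured with a small standalone sketch. It is an illustrative version of the idea, not necessarily how `sglang.utils.stream_and_merge` is implemented: when a new chunk begins with text that the accumulated output already ends with, only the non-overlapping remainder is appended. `trim_overlap` and `merge_chunks` are names chosen for this example.

```python
from typing import Iterable


def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that existing already ends with."""
    max_overlap = min(len(existing), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks while removing overlapping text."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


if __name__ == "__main__":
    print(merge_chunks(["Hello wor", "world", "!"]))
```

One caveat of this simple approach is that a chunk which legitimately repeats the preceding characters would also be trimmed, so it suits streams where repeats are known to be artifacts.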



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I am [Age] years old, [Gender] and [Occupation/Profession]. I come from [Place]. I enjoy [Favorite Activity/Interest]. And, of course, I have a [Hobby/Professional Skill] that I enjoy playing with [Person]. What brings you to this world?

Hello, my name is [Name]. I am [Age] years old, [Gender] and [Occupation/Profession]. I come from [Place]. I enjoy [Favorite Activity/Interest]. And, of course, I have a [Hobby/Professional Skill] that I enjoy playing with [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in France and the third-largest city in the European Union. It is known for its rich history, art, architecture, and fashion, and is home to the Eiffel Tower, Louvre Museum, and the Notre-Dame Cathedral. The city is also known for its cuisine, with dishes like beignets 
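
One reason to prefer the asynchronous API is that several requests can be in flight at once. The sketch below submits multiple prompt batches concurrently with `asyncio.gather`. It assumes that `async_generate(prompts, sampling_params)` can be awaited concurrently and returns one dict per prompt with a `"text"` key; `generate_batches` and `FakeAsyncEngine` are hypothetical names, and the stub replaces `sgl.Engine` so the snippet runs without a GPU.

```python
import asyncio
from typing import Dict, List, Tuple


class FakeAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; mimics only async_generate()."""

    async def async_generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict]:
        await asyncio.sleep(0)  # yield control, as a real engine call would
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


async def generate_batches(
    engine, batches: List[List[str]], sampling_params: Dict
) -> List[List[Tuple[str, str]]]:
    """Run one async_generate call per batch concurrently, preserving order."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    results = await asyncio.gather(*tasks)  # results follow the order of tasks
    return [
        [(prompt, output["text"]) for prompt, output in zip(batch, outputs)]
        for batch, outputs in zip(batches, results)
    ]


if __name__ == "__main__":
    batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
    params = {"temperature": 0.8, "top_p": 0.95}
    for batch_result in asyncio.run(generate_batches(FakeAsyncEngine(), batches, params)):
        for prompt, text in batch_result:
            print(f"Prompt: {prompt}\nGenerated text: {text}")
```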

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [First Name] and I am [Last Name]. I'm here to help with [specific project or service]. Thank you for considering me for this opportunity. Let's get started! Let me know if there's anything I can help you with.
Your intro should be brief, friendly, and informative, capturing the essence of your character's role and expertise. Use a conversational tone and make sure to convey your enthusiasm and passion for your work. Consider the audience for your intro and tailor it accordingly. Good luck with your self-introduction! #SelfIntroduction #Charter #ProjectStart #Attentive #Friendly #Professional #

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

The answer is Paris.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by a number of different trends and factors, but some of the most likely ones include:

1. Increased reliance on AI in healthcare: AI is already being used to help diagnose and treat diseases, and this trend is likely to continue. AI-powered systems may become more accurate and efficient at predicting the course of diseases, as well as helping to identify early signs of illness.

2. AI in manufacturing: AI is already being used to automate manufacturing processes and improve efficiency, and this trend is likely to continue. AI-powered systems may be able to predict manufacturing needs, optimize supply chains, and improve product quality.

3. AI
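
When you need the complete text as well as the live chunks, the stream can be accumulated while it is being displayed. The sketch below consumes any async iterator of string chunks, such as the one returned by `async_stream_and_merge(llm, prompt, sampling_params)` in the cell above. `collect_stream` and `fake_chunk_stream` are hypothetical names; the fake stream replaces the engine so the snippet runs without a GPU.

```python
import asyncio
from typing import AsyncIterator, Callable, Iterable, Optional


async def fake_chunk_stream(chunks: Iterable[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming generation call."""
    for chunk in chunks:
        await asyncio.sleep(0)  # yield control between chunks
        yield chunk


async def collect_stream(
    stream: AsyncIterator[str], on_chunk: Optional[Callable[[str], None]] = None
) -> str:
    """Consume a chunk stream, optionally reporting each chunk, and return the full text."""
    parts = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)  # e.g. print the chunk as soon as it arrives
        parts.append(chunk)
    return "".join(parts)


if __name__ == "__main__":
    def show(chunk: str) -> None:
        print(chunk, end="", flush=True)

    full_text = asyncio.run(collect_stream(fake_chunk_stream([" Paris", ".", " The end"]), show))
    print(f"\nCollected {len(full_text)} characters")
```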




In [6]:
llm.shutdown()
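
In a script, an exception raised before `llm.shutdown()` would skip the cleanup. A small context manager is one way to guarantee that the call always runs. The sketch below is a suggestion rather than part of the SGLang API: `managed_engine` and `FakeEngine` are hypothetical names, and the only assumption about the real engine is the `shutdown()` method shown above.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine; records whether shutdown() ran."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def shutdown(self) -> None:
        self.is_shut_down = True


@contextmanager
def managed_engine(factory: Callable[[], object]) -> Iterator[object]:
    """Create an engine via factory and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()  # runs even if the body of the with-block raises


if __name__ == "__main__":
    # With the real engine, the factory could be, for example:
    #   lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
    with managed_engine(FakeEngine) as llm:
        print("engine in use, shut down:", llm.is_shut_down)
    print("after the block, shut down:", llm.is_shut_down)
```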