# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed, working example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop (for example, a Jupyter notebook), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
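
To see why this is needed, the following minimal sketch uses only the standard library (no SGLang) to reproduce the situation: calling `asyncio.run()` while an event loop is already running raises a `RuntimeError`. The function names `inner` and `outer` are illustrative only; `outer` plays the role of a notebook cell that already runs inside a loop.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # An event loop is already running here, much like inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call into an already running loop
    except RuntimeError as exc:
        coro.close()  # the coroutine never started; close it to avoid a warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Applying `nest_asyncio` patches asyncio so that such nested calls are allowed.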

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# Generate completions for the whole batch in one blocking call
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  J. I'm from the Philippines and I'm a marketing professional. I have been working with social media and marketing for two years now. I have been working for digital marketing agency in the Philippines for the last 1 year and I have been successfully working with brands such as Mobiles, KFC, and Google. 

I want to apply for a position as a marketer at a major advertising agency, but I'm not sure how to start looking for that job. I have done research, but I'm still not sure about the process. Can you provide me with some tips on how to approach job hunting and ensure that I stand out
Prompt: The president of the United States is
Generated text:  54 years old this year. He was 48 years old when he became president in 1960. How old will he be in 2040?

To determine the president's age in 2040, we need to calculate the number of years between 1960 and 2040 and then find his age at that time.

First, we calculate the number of years from 1960 to 2

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    # Stream the response and merge the chunks into one string
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [occupation] who has been [number of years] in the industry. I am passionate about [reason for passion], and I am always looking for ways to [action or goal]. I am a [character trait or quality] who is always [description of a trait or quality]. I am [character description], and I am [character trait or quality]. I am [character description], and I am [character trait or quality]. I am [character description], and I am [character trait or quality]. I am [character description], and I am [character trait or quality]. I am [character

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

This statement is accurate and brief, capturing the essential information about the capital city's name and its role in French politics and culture. It provides a clear and concise overview of the capital's importance in French society and government. 

To further elaborate on this statement, it could

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that could be expected in the future:

1. Increased automation: As AI continues to improve, it is likely to become more efficient and capable of performing tasks that were previously done by humans. This could lead to a significant increase in automation in various industries, including manufacturing, transportation, and healthcare.

2. AI ethics and privacy: As AI becomes more integrated into our daily lives, there will be increasing concerns about its impact on society. This includes issues such as bias, privacy, and
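
The "overlap removal" mentioned above can be illustrated with a small, self-contained sketch. It is not the `sglang.utils` implementation; the helper names `trim_overlap_sketch` and `merge_chunks_sketch` are hypothetical, and the sketch assumes that a streamed chunk may repeat text that was already received. Each new chunk is trimmed by the longest suffix of the merged text that is also a prefix of the chunk.

```python
def trim_overlap_sketch(existing: str, new_chunk: str) -> str:
    """Drop the part of new_chunk that repeats the end of existing (hypothetical helper)."""
    max_len = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shorter ones.
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks_sketch(chunks):
    """Concatenate chunks while removing overlapping text (hypothetical helper)."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


chunks = [" Paris", " Paris is", " is the capital"]
print(repr(merge_chunks_sketch(chunks)))
```

One trade-off of this approach is that a genuine repetition at a chunk boundary would also be trimmed, so it suits streams whose chunks are known to overlap.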



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the whole batch without blocking the event loop
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name] and I am [insert age]. I'm an [insert occupation], [insert favorite hobby] and [insert notable achievement].
Tell me a little bit about yourself, like what you're passionate about, what kind of person you are, and what motivates you. How do you handle difficult situations? I'm always looking for new things to do and feel like learning new things every day. I'm also a bit of a perfectionist and try to make sure I'm always up to date with the latest trends and technologies in the field of [insert relevant field]. Lastly, I'm an [insert profession] and have

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located on the Seine River in the center of the country.

What is the capital of France?

Paris, the seat of government and the cultural and commercial center of France, is located on the Se
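
Because the call is awaited, the event loop stays free for other work while generation is in progress. The sketch below shows that pattern with the standard library only. `fake_async_generate` is a hypothetical stand-in that merely simulates latency; it is not part of SGLang, and the returned dictionaries only mimic the `{"text": ...}` shape used in this document.

```python
import asyncio


async def fake_async_generate(prompts, delay=0.01):
    # Hypothetical stand-in for an engine call; it only simulates latency.
    await asyncio.sleep(delay)
    return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(batches):
    # Submit every batch at once; gather returns results in submission order.
    return await asyncio.gather(*(fake_async_generate(b) for b in batches))


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(run_batches(batches))

for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```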

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert your full name] and I am a [insert your age] year old, [insert your occupation] and [insert your nationality]. I'm an [insert your profession] and [insert your position] and I'm [insert your profession] and [insert your profession]. I'm [insert your profession] and [insert your profession]. I'm [insert your profession] and [insert your profession]. I'm [insert your profession] and [insert your profession]. I'm [insert your profession] and [insert your profession]. I'm [insert your profession] and [insert your profession]. I'm [insert your profession

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light and the Eternal City. It is the largest city in France, with an estimated population of over 2 million people in 2021. The city is famous for its stunning architecture, particularly the Eiffel Tower, and its historical heritage. Paris is home to numerous museums, art galleries, and cultural institutions, and is also a major financial center, with the Eiffel Tower being the world's tallest structure. Its rich history, arts scene, and cosmopolitan culture have made Paris one of the most visited cities in the world, attracting visitors from around the world. It is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and diverse, and it's hard to predict exactly where the technology will lead. However, there are several possible trends that could shape the future of AI:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be increased pressure to ensure that AI is developed and used in a way that is fair and respectful of human rights and values. This will likely lead to a greater emphasis on ethical AI design and development, as well as greater regulation of AI systems.

2. Integration with other technologies: AI is becoming increasingly integrated with other technologies, such as the Internet of Things (IoT) and

```python
# Shut down the engine when finished
llm.shutdown()
```