# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython or in any other code that already runs inside an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
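
The snippet below is a minimal standard-library sketch of the kind of failure that `nest_asyncio` works around: starting an event loop while another one is already running raises `RuntimeError`. The coroutine names `inner` and `outer` are illustrative only; `outer` plays the role of a notebook cell that executes inside an already-running loop. It does not use SGLang at all.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # An event loop is already running here, as it would be in a notebook cell.
    coro = inner()
    try:
        # Attempting to start a second loop from inside the running one.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

With `nest_asyncio.apply()` in place, nested calls of this kind are permitted, which is why the engine can then be driven from a notebook.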

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Sara Johnson. I am 30 years old. I have a beautiful smile and I love to travel. I live in the city centre of the city. I like to go shopping, eat out and have fun. I have a passion for cats and I have a cat named Mandy. Mandy is a big feline and she is my best friend. I have lived with Mandy since the day I was born. I love her and she loves me. How long has Sara Johnson been living with Mandy? Answer this question: How long has Sara Johnson been living with Mandy? To answer this question, I will follow
Prompt: The president of the United States is
Generated text:  seeking to reduce the national debt by $1 trillion over the next 10 years. If he achieves this goal by increasing the debt-to-GDP ratio to 1.35, what would be the new national debt? Assume the current national debt is $14.85 trillion and the debt-to-GDP ratio was 1.25 in the previous year. To determine the new national debt after the president reduces the national debt by $1 trillio
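
The `sampling_params` dictionary above sets `temperature` and `top_p`. As a rough illustration of what these two knobs mean, here is a textbook sketch of temperature scaling followed by nucleus (top-p) filtering over a toy logit vector. The function name `apply_temperature_and_top_p` and the example logits are assumptions made for this illustration; this is not SGLang's sampler code.

```python
import math


def apply_temperature_and_top_p(logits, temperature, top_p):
    """Return a renormalized {token_index: probability} map after filtering."""
    # Lower temperature sharpens the distribution; higher temperature flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


dist = apply_temperature_and_top_p(
    [2.0, 1.0, 0.1, -1.0], temperature=0.8, top_p=0.95
)
print(sorted(dist))
```

In this toy case the least likely token falls outside the 0.95 nucleus, so only three candidates remain to sample from.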

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Age] year old [Occupation]. I'm a [Skill or Hobby] enthusiast who enjoys [mention a hobby or interest]. I'm also a [Skill or Hobby] lover who enjoys [mention a hobby or interest]. I'm a [Skill or Hobby] enthusiast who enjoys [mention a hobby or interest]. I'm a [Skill or Hobby] enthusiast who enjoys [mention a hobby or interest]. I'm a [Skill or Hobby] enthusiast who enjoys [mention a hobby or interest]. I'm a [Skill or Hobby] enthusiast who enjoys [mention a hobby or interest]. I'm a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city that is known for its iconic Eiffel Tower and its rich history and culture. It is also a major financial and business center, and is home to many of the world's most famous museums and landmarks. Paris is a vibrant and diverse city with a rich cultural heritage that has been shaped by its history and its role as a major European city for centuries. The city is also known for its delicious cuisine, including its famous croissants and its many traditional French dishes. Paris is a city that is constantly evolving and changing, with new developments and attractions being added to the city's impressive list of landmarks and attractions.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some possible future trends include:

1. Increased integration of AI into everyday life: AI is already being integrated into our daily lives, from voice assistants like Siri and Alexa to self-driving cars. As AI becomes more integrated into our daily lives, we can expect to see even more widespread adoption.

2. AI becoming more autonomous: As AI becomes more integrated into our daily lives, we can expect to see more autonomous vehicles and robots that can operate without human intervention.

3. AI becoming more ethical and responsible: As AI becomes more integrated
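
The cell above relies on `stream_and_merge` to remove overlap between streamed chunks. To make the idea concrete, here is one possible way to merge chunks whose beginnings repeat the end of the text received so far. The function `merge_with_overlap` and the sample chunks are hypothetical and written for this illustration; the actual helper in `sglang.utils` may work differently.

```python
def merge_with_overlap(chunks):
    """Concatenate chunks, dropping any chunk prefix that repeats the current tail."""
    merged = ""
    for chunk in chunks:
        overlap = 0
        # Look for the longest prefix of the chunk that the merged text ends with.
        for size in range(min(len(merged), len(chunk)), 0, -1):
            if merged.endswith(chunk[:size]):
                overlap = size
                break
        merged += chunk[overlap:]
    return merged


# Made-up chunks in which each piece may repeat part of the previous one.
chunks = [" Paris", " Paris is the", " the capital", " of France."]
text = merge_with_overlap(chunks)
print(text)
```

A purely textual heuristic like this cannot tell a repeated chunk boundary from a genuine repetition in the generated text, so it is a sketch of the concept rather than a robust merger.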



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [job title or interest]. I'm passionate about [mention something specific about yourself or your work]. I enjoy spending time [mention a hobby, such as reading, hiking, or playing music]. I'm always looking for ways to [mention something about my passions or interests that you'd like to hear about]. What's your favorite hobby or pastime and how do you find it so enjoyable? I'm excited to hear your responses! Let's chat! #self-introduction #Characterised #Pleasant #Fictional #Career

Hello, my name is [Name] and I'm a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  full of exciting possibilities and potential challenges. Here are some possible trends and future developments in AI:

1
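
An awaitable generate call can be scheduled alongside other coroutines. The sketch below shows the general `asyncio.gather` pattern with a stand-in object; `StandInEngine` and `run_batches` are hypothetical names, and the stand-in only mimics the result shape (`[{"text": ...}]`) seen above. Whether issuing several concurrent calls is advisable with the real engine is not covered here, so treat this as a pattern sketch only.

```python
import asyncio


class StandInEngine:
    """Hypothetical stand-in that mimics the shape of async_generate results."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # placeholder for real inference work
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Schedule one awaitable per batch and wait for all of them together.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(
    run_batches(StandInEngine(), batches, {"temperature": 0.8, "top_p": 0.95})
)
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(prompt, "->", output["text"])
```

`asyncio.gather` returns results in the order the awaitables were passed, so each result list lines up with its batch.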

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name], and I am a [role] who has been [number of years] in the industry.
I've always loved [career field] and have been passionate about it since I was a child. My biggest strength is [strength], and I have always tried to [achievement] it.
I have a lot of energy and creativity, and am always trying to push the boundaries of what's possible in my work. I am always looking for new opportunities and trying to learn new skills to keep up with the fast-paced nature of the industry.
I'm a team player, and I thrive on working with others, both on and off

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the largest city in France and the most populous city in the European Union. It is located in the north-central part of the country and is known as the “City of Love” for its romantic and artistic attractions. The city is home to many important historical and cultural landmarks, including Notre-Dame Cathedral, the Eiffel Tower, and the Louvre Museum. Paris is also a major center of business, finance, and tourism, and is known for its food, fashion, and art scenes. The city is a UNESCO World Heritage site and is home to many notable landmarks and museums.

France's capital city is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  bright and we can expect a wide range of developments in the coming years. Here are some possible trends we can expect:

1. AI will continue to improve in accuracy and efficiency, with the goal of becoming more human-like in decision-making and problem-solving.

2. AI will become more prevalent in areas such as healthcare, transportation, and manufacturing, with the goal of improving quality of life and reducing costs.

3. AI will continue to be used for educational purposes, with the goal of providing students with a better understanding of complex concepts and allowing them to develop critical thinking skills.

4. AI will continue to be used for commercial applications, with

In [6]:
llm.shutdown()
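
In a script, as opposed to a notebook, it can be convenient to guarantee that `shutdown()` runs even if inference raises. One way to do this is a small context manager, sketched below. `engine_session` and `StandInEngine` are hypothetical names introduced for this example; the stand-in only records whether `shutdown()` was called, so the snippet runs without a GPU or SGLang installed.

```python
from contextlib import contextmanager


class StandInEngine:
    """Hypothetical stand-in for an engine; it only records whether shutdown ran."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call its shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StandInEngine()
try:
    with engine_session(engine) as active:
        active.generate(["Hello, my name is"], {"temperature": 0.8})
        raise ValueError("simulated failure during inference")
except ValueError:
    pass

print(engine.is_shut_down)
```

The same wrapper could be applied to an object created with `sgl.Engine(...)`, since it only depends on the `shutdown()` method used above.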