# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
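
To illustrate why this is needed, the sketch below uses only the standard library (no SGLang calls) to reproduce the same kind of failure: `asyncio.run()` refuses to start while an event loop is already running, which is the situation inside IPython or Jupyter. The function names `inner`, `outer`, and `nested_run_fails` are illustrative and not part of SGLang.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> bool:
    """Return True if starting a nested event loop raises RuntimeError."""
    coro = inner()
    try:
        # A loop is already running here, so this call is rejected.
        asyncio.run(coro)
    except RuntimeError:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return True
    return False


def nested_run_fails() -> bool:
    return asyncio.run(outer())


print(nested_run_fails())
```

`nest_asyncio.apply()` patches asyncio so that this kind of re-entrant call is allowed.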

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Katniss. I am 16 years old and I live in the fictitious world of Apes. I am a sweet girl who loves to play with my friends and help out the people who are in need. I also like to read books and watch movies and have been a big fan of Jane Austen's books for a long time. I've also been a fan of the Twilight series for a long time, and I enjoy reading and watching the movies. I love to cook and make yummy food. I'm planning to become a doctor when I grow up. 
Based on the above article, answer a question. Which
Prompt: The president of the United States is
Generated text:  a political office, and the position is typically filled by a person who has been nominated by the party in power at the time. Who was the last person to hold the office of president?
The last person to hold the office of president was Donald J. Trump.
Donald J. Trump was the President of the United States from January 20, 2017 to January 20, 2021.
The other options are not co
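
For offline batch jobs it is often convenient to persist results rather than print them. The following is a minimal sketch, assuming each output is a dict with a `"text"` key as in the loop above. `results_to_jsonl` is a hypothetical helper, and `fake_outputs` stands in for the return value of `llm.generate(...)` so the snippet runs without a GPU.

```python
import io
import json
from typing import Iterable, Mapping, TextIO


def results_to_jsonl(
    prompts: Iterable[str],
    outputs: Iterable[Mapping[str, object]],
    sink: TextIO,
) -> int:
    """Write one JSON object per (prompt, output) pair; return the count."""
    count = 0
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        sink.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


# Stand-in data shaped like the outputs printed above (assumption: each
# output is a dict with a "text" key). Replace with llm.generate(...).
demo_prompts = ["The capital of France is", "The future of AI is"]
fake_outputs = [{"text": " Paris."}, {"text": " bright."}]

buffer = io.StringIO()
written = results_to_jsonl(demo_prompts, fake_outputs, buffer)
print(written)
print(buffer.getvalue(), end="")
```

In a real run, `sink` would typically be a file opened with `open(path, "w", encoding="utf-8")`.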

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French National Library. Paris is a bustling metropolis with a rich cultural heritage and is a major tourist destination. It is also known for its fashion industry, with Paris Fashion Week being one of the largest in the world. The city is also home to the French Parliament, the French Academy of Sciences, and the French National Library. Paris is a major tourist destination and is known for its fashion industry, with Paris Fashion

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn and adapt to human behavior and preferences. This could lead to more personalized and efficient AI systems.

2. Enhanced machine learning capabilities: AI is likely to become more powerful and capable, with the ability to learn from large amounts of data and make more accurate predictions and decisions.

3. Increased focus on ethical and social implications: As AI becomes more integrated with human society, there will be increased focus on ethical and social implications, including issues such as bias, privacy
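
The "overlap removal" mentioned in the banner refers to merging streamed chunks so that text repeated at a chunk boundary is not emitted twice. The sketch below illustrates the idea in pure Python. `trim_overlap` and `merge_chunks` are illustrative names, and this is not necessarily how `stream_and_merge` is implemented.

```python
from typing import Iterable


def trim_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that `existing` already ends with."""
    limit = min(len(existing), len(chunk))
    for size in range(limit, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing text duplicated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(merge_chunks([" Paris is", " is the capital", " capital of France."]))
```

One limitation of this simple approach: it cannot tell an overlapping chunk from a genuine repetition, so `merge_chunks(["ha", "ha"])` yields `"ha"` rather than `"haha"`.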



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I am a [Insert Occupation] with experience in [Insert field or area of expertise]. I am a [insert age] year old, and I am [insert nationality]. I have been working in [insert field or area of expertise] for [insert number of years] years, and I have been involved in [insert number of projects or achievements]. I am passionate about [insert interests or hobbies]. I love to [insert hobbies or activities]. I am known for [insert achievements or accomplishments], and I am [insert personality traits]. I have a deep respect for [insert profession or area of study], and I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a historical city located on the Mediterranean coast, known for its iconic Eiffel Tower, Notre Dame Cathedral, and vibrant French culture. 

Note: This statement is based on the fa
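
The asynchronous API is most useful when requests arrive independently instead of as one prepared list. As a sketch of that pattern, the code below issues one call per prompt and awaits them together with `asyncio.gather`. `FakeAsyncEngine` is a hypothetical stand-in so the snippet runs without a GPU. It assumes that `async_generate` accepts a single prompt and returns a dict with a `"text"` key, which should be checked against the installed SGLang version.

```python
import asyncio
from typing import Any, Dict, List


class FakeAsyncEngine:
    """Hypothetical stand-in for the engine; not part of SGLang."""

    async def async_generate(
        self, prompt: str, sampling_params: Dict[str, Any]
    ) -> Dict[str, str]:
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(
    engine: Any, prompts: List[str], sampling_params: Dict[str, Any]
) -> List[Dict[str, str]]:
    # gather() runs the coroutines concurrently and preserves input order.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


demo_prompts = ["Hello, my name is", "The capital of France is"]
results = asyncio.run(
    generate_concurrently(FakeAsyncEngine(), demo_prompts, {"temperature": 0.8})
)
for prompt, result in zip(demo_prompts, results):
    print(f"{prompt!r} -> {result['text']!r}")
```

Because `asyncio.gather` returns results in the order the awaitables were passed, zipping them with the prompts stays correct even when calls finish out of order.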

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name], and I'm a [insert occupation] with a passion for [insert hobby or interest]. I've always been fascinated by [insert something that interests you], and I'm always up for learning new things and trying new things. Whether it's [insert a specific skill or hobby], or [insert another specific thing], I'm constantly looking for new challenges and opportunities to expand my knowledge and skills. Thank you for considering me for a position.

Remember to keep the tone neutral and friendly, without sounding overly enthusiastic or boastful. Use your character's name in your introduction and throughout the rest of the text to give them

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Algebra is a branch of mathematics that deals with the study of equations involving one or more variables.

Algebraic equations are mathematical expressions that are equivalent to zero.

Examples of algebraic equations include:

1) 2x + 3 = 7
2) x^2 - 5x + 6 = 0
3) 3y + 4z = 12

Algebra is fundamental to many other branches of mathematics and has been used in various fields, including physics, engineering, and economics.

In contrast, geometry is a branch of mathematics that deals with the study

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  bright, and here are some possible trends:

1. AI will continue to evolve and improve, with more sophisticated algorithms and machine learning models being developed.

2. AI will become more integrated into everyday life, with more people using AI-powered devices and services.

3. AI will become more personalized, with more advanced techniques and algorithms being used to personalize the user experience.

4. AI will continue to integrate with other technologies, such as the Internet of Things (IoT) and the Internet of Things (IoT), to create even more advanced applications.

5. AI will become more ethical and responsible, with more people using AI in a responsible

In [6]:
llm.shutdown()
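
In a script, as opposed to a notebook, it can help to guarantee that `shutdown()` runs even if generation raises. Below is a minimal sketch using `try`/`finally`. `FakeEngine` and `run_batch` are hypothetical names; the stand-in class exposes only the `generate()` and `shutdown()` methods used in this document, so the snippet runs without a GPU.

```python
from typing import Any, Dict, List


class FakeEngine:
    """Hypothetical stand-in exposing only generate() and shutdown()."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self) -> None:
        self.is_shut_down = True


def run_batch(
    engine: Any, prompts: List[str], sampling_params: Dict[str, Any]
) -> List[Dict[str, str]]:
    """Generate for all prompts and always shut the engine down afterwards."""
    try:
        return engine.generate(prompts, sampling_params)
    finally:
        engine.shutdown()


engine = FakeEngine()
outputs = run_batch(engine, ["The future of AI is"], {"temperature": 0.8})
print(outputs[0]["text"], engine.is_shut_down)
```

With the real engine, `FakeEngine()` would be replaced by the `llm` object created at the start of this document.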