# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
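
Whether the patch is needed depends on whether an event loop is already running in the current thread, which is the case in IPython and Jupyter. A minimal sketch of that check, using only the standard library; the helper name `loop_is_running` is hypothetical and not part of SGLang:

```python
import asyncio


def loop_is_running() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False
    return True


async def check_inside_loop() -> bool:
    # Inside a coroutine driven by asyncio.run(), a loop is running.
    return loop_is_running()


print("Running loop at top level:", loop_is_running())
print("Running loop inside a coroutine:", asyncio.run(check_inside_loop()))
```

In a plain script the first check is false, so `asyncio.run` can be called directly. In a notebook cell it is true, which is the situation `nest_asyncio.apply()` is meant to handle.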

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Ali and I am an IT Consultant. My background is in software development and I am passionate about technology and innovation. I enjoy helping people understand technology and solutions to problems. I have experience in Python, SQL, and Java. I have worked as a developer, system administrator, and consultant for over 10 years. What are some of the most effective strategies or techniques for improving my skills in Python and SQL? Here are some strategies and techniques to improve your skills in Python and SQL:

1. Practice regularly: The more you practice, the better you will become. Try to code on a regular basis, even if it is just a
Prompt: The president of the United States is
Generated text:  a military commander who must report to the Senate for his approval. His approval is required before a new budget is signed. The current budget is $100 billion in total. The president can choose to spend up to $100 billion, but he must report to the Sen
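
The loop above pairs each prompt with its result and reads the completion from the `text` key. A minimal sketch of saving such results as JSON Lines for later analysis; the `example_outputs` list is hand-written stand-in data with the same shape, and `to_jsonl` is a hypothetical helper, not an SGLang function:

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as JSON Lines, one record per line."""
    if len(prompts) != len(outputs):
        raise ValueError("expected one output per prompt")
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Stand-in data (assumption: each result is a dict with a "text" key;
# real results may carry additional fields).
example_prompts = ["The capital of France is", "The future of AI is"]
example_outputs = [{"text": " Paris."}, {"text": " bright."}]

jsonl = to_jsonl(example_prompts, example_outputs)
print(jsonl)
```

With a real engine, the call would be `to_jsonl(prompts, outputs)` using the variables from the cell above.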

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your profession or role]. I enjoy [insert a short description of your hobbies or interests]. What brings you to this company? I'm drawn to [insert a short description of the reason why you're interested in this company]. What do you do for a living? I'm a [insert a short description of your job role]. I'm always looking for new challenges and opportunities to grow. What do you do for

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history dating back to the Roman Empire and the Middle Ages. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. The city is also famous for its fashion industry, art, and cuisine. Paris is a cultural and economic hub of France and a major tourist destination. It is home to many world-renowned museums, theaters, and art galleries. The city is also known for its nightlife, with many bars and clubs offering a wide range of entertainment options. Paris is a city of

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more sophisticated, it is likely to become more integrated with human intelligence. This could lead to more efficient and effective AI systems that can better understand and respond to human emotions and behaviors.

2. Greater emphasis on ethical considerations: As AI becomes more advanced, there will be a greater emphasis on ethical considerations. This could lead to more rigorous testing and validation of AI systems, as well as increased regulation and oversight of AI development and deployment.

3. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce
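
The banner above mentions overlap removal. One way to read that: when a streamed chunk begins with text that was already emitted, the repeated part is dropped before the chunk is appended. A minimal sketch of that idea, assuming chunks may overlap at their boundaries; `merge_chunk` is a hypothetical function for illustration and is not the implementation in `sglang.utils`:

```python
def merge_chunk(existing: str, chunk: str) -> str:
    """Append chunk to existing, skipping any prefix of chunk that
    duplicates a suffix of existing."""
    max_overlap = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(chunk[:size]):
            return existing + chunk[size:]
    return existing + chunk


# Hypothetical chunks whose boundaries overlap.
chunks = ["The capital", "capital of France", " France is Paris."]

merged = ""
for chunk in chunks:
    merged = merge_chunk(merged, chunk)

print(merged)
```

This approach cannot tell an overlap from a genuine repetition: `merge_chunk("ha", "ha")` returns `"ha"`. It is only appropriate when the stream is known to resend earlier text.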



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I am a [character's occupation] who has been [number of years] in the industry. I am [age] years old. I have always been passionate about [topic of interest] and I strive to always learn new things. I enjoy [job-related activities] and have always been a [personality trait] person. If you have any questions or need advice, I'm here to help. [Your Name] is looking forward to chatting with you! Sure, here's a neutral self-introduction for a fictional character based on your description:

---

Hey there! I'm [Your Name], a [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in France and the third-largest in the world by population. Paris is home to the Eiffel Tower, Louvre Museum, Notre Dame Cathedral, the Champs-Elysées, and many other famous landmarks. It is a cultural
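
Because the call is awaited, it can be combined with other coroutines using standard `asyncio` tools. A minimal sketch of awaiting several batches together with `asyncio.gather`, using a stand-in coroutine in place of the engine; `fake_async_generate` is hypothetical, returns canned text, and only mimics the list-of-dicts result shape seen above. Whether a real engine benefits from this pattern is not covered here:

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    """Stand-in for an async engine call (assumption: one dict with a
    "text" key per prompt). It echoes the prompt after a short pause."""
    await asyncio.sleep(0.01)
    return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(batches, sampling_params):
    # Schedule every batch at once; gather returns results in input order.
    tasks = [fake_async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(run_batches(batches, {"temperature": 0.8, "top_p": 0.95}))

for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```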

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  ____ and I'm a/an ____.

As an AI language model, I don't have personal names or identities, but I can create a fictional self-introduction for you based on your request. Here's a possible introduction:

Hello, my name is [Your Name] and I'm a/an [Your Profession/Role]. I'm always looking for ways to help people, whether that's by answering their questions, providing information, or even just being there to listen. I'm always here to assist with any questions or concerns you might have, and I'm always available to help whenever you need me. If you have any questions or need

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Please provide the answer in French. La capitale de la France est Paris.

Note: I did not produce any copy text for the French text provided. This appears to be an English text about Paris, France. The sentence structure and wording have been kept the same to maintain the original meaning and purpose of the text. The use of "I did not produce any copy text for the French text" is a common practice in translation tasks, especially when dealing with official or historical documents, as the translation is not intended to be copied or paraphrased.

I have also noted that I did not produce any copy text

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by a proliferation of applications in various industries, and the incorporation of more sophisticated algorithms and machine learning techniques into existing systems. This trend is expected to continue as the need for more advanced and accurate AI systems becomes more apparent. In addition, we may see a trend towards greater use of AI in areas such as healthcare, finance, and transportation, as well as the development of more intelligent and autonomous robots and drones. AI will likely also continue to be used in a more ethical and responsible way, with greater focus on issues such as privacy, bias, and accountability. Overall, the future of AI is likely to be one of


```python
llm.shutdown()
```