# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an extra HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
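
The patch is presumably needed because IPython and Jupyter already run an event loop, and plain `asyncio` refuses to start a second loop inside a running one. The following standard-library sketch reproduces that failure without SGLang; the names `inner` and `outer` are illustrative only and are not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, similar to a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # plain asyncio does not allow nested event loops
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"
    return "nested run succeeded"


result = asyncio.run(outer())
print(result)
```

With `nest_asyncio.apply()` in effect, the nested call is permitted, which is why the notebook cells in this document can call `asyncio.run(main())` directly.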

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Maximilian B. Weis and I am an assistant professor in the Department of Mathematical Sciences at the University of Akron. As a mathematics graduate student, I studied under the direction of Dr. Robert A. Ford at the University of Wisconsin-Madison. I have a PhD in Mathematics and have been teaching at the University of Akron since 2017.\nMy primary research interest is in the field of dynamical systems and its applications to economics and operations research.\nI have taught the following courses at the University of Akron during my time there:\nDynamical Systems - Spring 2017\nGeneral Dynamical
Prompt: The president of the United States is
Generated text:  3 feet 3 inches tall. The vice president of the United States is 2 feet 11 inches tall. If a man can walk 5 feet per minute, how many minutes does it take for him to walk from his home to the vice president's office? To determine how long it takes for the president to walk to the vice presi
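
The engine accepts the whole prompt list in one call, as shown above. For very large prompt sets you may still prefer to submit fixed-size slices so that partial results can be saved along the way. The sketch below relies only on the call shape used in this section (`generate(prompts, sampling_params)` returning a list of dicts with a `"text"` key). `generate_in_chunks` and `EchoEngine` are hypothetical names, and `EchoEngine` is a stand-in so the example runs without a GPU.

```python
from typing import Dict, List


def generate_in_chunks(
    engine, prompts: List[str], sampling_params: Dict, chunk_size: int = 2
) -> List[str]:
    """Call engine.generate on fixed-size slices and collect texts in prompt order."""
    texts: List[str] = []
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = engine.generate(chunk, sampling_params)
        texts.extend(output["text"] for output in outputs)
    return texts


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine; it only mimics the output shape."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


chunked_texts = generate_in_chunks(
    EchoEngine(), ["a", "b", "c"], {"temperature": 0.8, "top_p": 0.95}
)
print(chunked_texts)
```

In real use you would pass the `llm` object created earlier instead of `EchoEngine()`.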

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Gender] [Occupation]. I'm a [Occupation] who has always been passionate about [What interests you about your occupation]. I'm always looking for new challenges and opportunities to grow and learn. I'm a [What is your favorite hobby or activity]? I'm always up for a good challenge and love to explore new places and experiences. I'm a [What is your greatest strength or weakness?]. I'm a [What is your greatest achievement so far?]. I'm a [What is your dream job?]. I'm a [What is your favorite

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French National Library. Paris is a bustling metropolis with a rich cultural heritage and is a major tourist destination. The city is also known for its fashion industry, with Paris Fashion Week being one of the largest in the world. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. It is a city that is both a cultural and political center of France. The city is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies are expected to continue to improve and become more integrated into our daily lives, from self-driving cars and robots to personalized medicine and virtual assistants. Additionally, AI will likely continue to be used for tasks such as fraud detection, cybersecurity, and environmental monitoring, as well as for tasks such as language translation and image recognition. As AI becomes more integrated into our daily lives, we can expect to see a greater emphasis on ethical considerations and the development of responsible AI systems. Finally, AI will likely continue to evolve and change as new
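
The "overlap removal" mentioned in the banner refers to merging streamed chunks whose beginning may repeat the end of the text received so far. The following is a minimal sketch of that idea, not necessarily how `stream_and_merge` is implemented; `trim_overlap` and `merge_chunks` are illustrative names.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while skipping text that repeats the current tail."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged_text = merge_chunks(["Hello", "Hello wor", "world!"])
print(merged_text)
```

The search starts from the longest possible overlap, so a chunk that fully repeats earlier text contributes nothing, while a chunk with no overlap is appended unchanged.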



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I'm a [Your Profession] with [Your Education] in [Your Major]. I've always been fascinated by the world of technology, especially in the field of [Your Field of Expertise]. My journey in the industry has been full of challenges, but my passion never wavers. I am always looking for ways to improve my skills and learn from others. My love for innovation and technology has driven me to always stay up-to-date with the latest advancements. I'm a [Your Personality] who is always seeking to learn and grow, and I believe that every day is an opportunity to make a difference. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is located on the Île de France on the River Seine. It is known as the "city of love" and is home to the Eiffel Tower and the Louvre Museum. The city is also famous for i
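
The asynchronous API becomes useful when several independent requests should be in flight at once. The sketch below assumes the engine tolerates overlapping `async_generate` calls and caps concurrency with a semaphore. `generate_concurrently` and `FakeAsyncEngine` are hypothetical names; the fake engine only mimics the output shape shown above so the example runs without a GPU.

```python
import asyncio
from typing import Dict, List


async def generate_concurrently(
    engine, prompt_batches: List[List[str]], sampling_params: Dict, max_in_flight: int = 2
):
    """Run one async_generate call per batch, with at most max_in_flight active."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def run_one(batch):
        async with semaphore:
            return await engine.async_generate(batch, sampling_params)

    # gather returns results in the same order as the input batches
    return await asyncio.gather(*(run_one(batch) for batch in prompt_batches))


class FakeAsyncEngine:
    """Hypothetical stand-in for sgl.Engine."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real engine call would
        return [{"text": p.upper()} for p in prompts]


results = asyncio.run(
    generate_concurrently(FakeAsyncEngine(), [["a", "b"], ["c"]], {"temperature": 0.8})
)
print(results)
```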

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am a [character] who has been in love with [love interest] for [time] years. I love [love interest] and I want to take the next step in our relationship by [step] in the future. What's your name and what's your relationship with [love interest]?
[Your Name] [Character] is a [character] who has been in love with [love interest] for [time] years. They love [love interest] and want to take the next step in their relationship by [step] in the future. What's your name and what's your relationship



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

What is the capital of France? Paris



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  poised to be a revolutionary leap forward that will transform many aspects of our lives. Some of the potential trends that we can expect to see in the coming years include:

1. Increased integration of AI into everyday technologies: AI will continue to become more integrated into the fabric of our daily lives, from smartphones and smart homes to virtual assistants and other specialized AI-powered tools.

2. Emphasis on ethical and responsible AI: As AI becomes more integrated into society, there will be a growing emphasis on ethical and responsible AI development. This means that we will need to ensure that AI systems are developed and used in a way that is fair, transparent,
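
The streamed text arrives in small pieces, often a single token each. If you also need the complete text afterwards, you can collect the pieces while printing them. The sketch below works with any async generator of strings; `collect_stream` and `fake_chunk_stream` are illustrative names, and the fake stream stands in for the chunks produced in the cell above.

```python
import asyncio


async def collect_stream(chunks) -> str:
    """Print each chunk as it arrives and return the concatenated text."""
    parts = []
    async for chunk in chunks:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


async def fake_chunk_stream():
    """Hypothetical stand-in for a stream of cleaned chunks."""
    for piece in [" Paris", ".", "\n\n", "What", " is"]:
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield piece


full_text = asyncio.run(collect_stream(fake_chunk_stream()))
```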




In [6]:
llm.shutdown()
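
Calling `shutdown()` releases the engine once inference is finished. In a script, it can help to guarantee that call even when an error occurs midway. The following sketch wraps the engine in a context manager; `managed_engine` and `DummyEngine` are hypothetical names, and `DummyEngine` only records whether `shutdown()` was called.

```python
from contextlib import contextmanager


@contextmanager
def managed_engine(factory):
    """Create an engine via factory() and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


class DummyEngine:
    """Hypothetical stand-in for sgl.Engine."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


engine_ref = None
try:
    with managed_engine(DummyEngine) as engine:
        engine_ref = engine
        raise ValueError("simulated failure during inference")
except ValueError:
    pass

print(engine_ref.closed)
```

With the real engine, the factory could be a lambda that builds `sgl.Engine(...)` with the same arguments used at the top of this document.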