# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
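
As a framework-agnostic sketch of that idea, the request-handling core of such a server might look like the following. The function name `handle_generate`, the payload keys, and the default sampling values are hypothetical choices for illustration; only the `generate` call and the `text` field of its output are taken from the cells in this document. The linked script remains the reference example.

```python
from typing import Any, Dict


def handle_generate(engine: Any, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Validate a request payload and run one generation on the engine.

    Assumption: `engine` exposes `generate(prompts, sampling_params)` and
    returns one dict with a "text" key per prompt.
    """
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        return {"status": 400, "error": "'prompt' must be a non-empty string"}

    # Hypothetical defaults; the caller may override any of them.
    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    sampling_params.update(payload.get("sampling_params") or {})

    outputs = engine.generate([prompt], sampling_params)
    return {"status": 200, "text": outputs[0]["text"]}
```

Keeping the handler independent of any web framework means the same function can be wired into whichever server library you prefer and exercised with a stub engine in tests.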



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
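
To apply the patch only when it is needed, a small guard can first check whether an event loop is already running. This is a minimal sketch: `in_running_loop` and `apply_nest_asyncio_if_needed` are hypothetical helper names, and it assumes `nest_asyncio` is installed wherever a loop is already running.

```python
import asyncio


def in_running_loop() -> bool:
    """Return True when called from code that runs inside an event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


def apply_nest_asyncio_if_needed() -> bool:
    """Patch asyncio for nested use only if a loop is already running."""
    if not in_running_loop():
        return False
    # Imported lazily so plain scripts do not need the dependency.
    import nest_asyncio

    nest_asyncio.apply()
    return True
```

In a plain script no loop is running, so the helper returns `False` and leaves asyncio untouched.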

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Ana. I like to play tennis and I love to cook. I also enjoy reading and listening to music. What are your hobbies? Enjoy your day! Sure, I'm happy to share my hobbies and interests with you! My favorite hobbies include:

1. **Playing Tennis**: Tennis is a great sport that I enjoy both to entertain and to exercise. It's not just about winning but about the satisfaction of getting close to my opponent and being satisfied with the outcome.

2. **Cooking**: I love trying new recipes and creating my own dishes. Cooking is not just about food preparation but also about experimentation, trying different flavors, and
Prompt: The president of the United States is
Generated text:  a political office. What is the person in charge of a country? I'm sorry, but I can't answer this question. This might be a political question, but I don't have enough context or information to answer it. If you could provide more context or clarify your question, I may be abl
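
For offline batch jobs it is often convenient to persist each prompt together with its generated text. The sketch below serializes the pairs as JSON Lines. `to_jsonl` and the record keys are hypothetical names for illustration; the only assumption about the engine output is the `text` field used in the cell above.

```python
import json
from typing import Any, Dict, Iterable, List


def to_jsonl(prompts: List[str], outputs: Iterable[Dict[str, Any]]) -> str:
    """Serialize prompt/output pairs as one JSON object per line."""
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        # json.dumps escapes embedded newlines, so each record stays on one line.
        lines.append(json.dumps(record))
    return "\n".join(lines)
```

The returned string can be written to a file as-is, for example with `Path("results.jsonl").write_text(to_jsonl(prompts, outputs))`.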

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm a [job title] at [company name], and I'm excited to be here today. I'm a [job title] at [company name], and I'm excited to be here today. I'm a [job title] at [company name], and I'm excited to be here today. I'm a [job title] at [company name], and I'm excited to be here today. I'm a [job title] at [company name], and I'm excited to be here today. I'm a [job title]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous museums, theaters, and other attractions. Paris is a popular tourist destination and is known for its rich history, art, and cuisine. The city is also home to many international organizations and institutions, including the French Academy of Sciences and the European Parliament. Paris is a vibrant and dynamic city with a rich cultural and historical heritage. The city is also known for its diverse population and its role as a major economic and political center in Europe. The

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that could be expected in the future:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI technology continues to improve, we can expect to see even more widespread use of AI in healthcare, with more sophisticated algorithms and machine learning techniques being developed to improve diagnosis, treatment, and patient care.

2. AI in finance: AI is already being used in finance to improve risk management, fraud detection, and trading algorithms
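
The "overlap removal" mentioned in the cell above refers to dropping text that a streamed chunk repeats from what has already been received. The following is an illustrative sketch of that idea and not the implementation of `stream_and_merge`; the names `trim_overlap` and `merge_chunks` are chosen for this sketch only.

```python
from typing import Iterable


def trim_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that already ends `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged
```

One caveat of this approach is that a genuine repetition that happens to straddle a chunk boundary is trimmed as well, so it fits best when chunks are known to overlap.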



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am a [occupation/role]! I bring a unique blend of creativity, intelligence, and a passion for [topic or subject] to my work. My expertise lies in [specific skills or areas], and I believe I can contribute valuable insights and solutions to your challenges. Let's collaborate to create something truly remarkable! [Optional: Mention any particular skills or experiences you bring to the table, or any specific projects you have recently completed.] [Your Name] [Your Profession] [Your Specialization] [Project Title] [Your Achievements] [Your Motivation or Why You're Interested in Working with

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the largest city in the country and the seat of the French government and administration. The city is known for its rich history, architecture, cuisine,
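
When prompts come from independent tasks rather than one list, the number of in-flight requests can be capped with a semaphore. This is a sketch under assumptions: `generate_bounded` and `limit` are hypothetical names, and the engine is only assumed to accept a list of prompts in `async_generate` and return a list of dicts, as in the cell above.

```python
import asyncio
from typing import Any, Dict, List


async def generate_bounded(
    engine: Any,
    prompts: List[str],
    sampling_params: Dict[str, Any],
    limit: int = 2,
) -> List[Dict[str, Any]]:
    """Run one async_generate call per prompt, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def one(prompt: str) -> Dict[str, Any]:
        async with semaphore:
            outputs = await engine.async_generate([prompt], sampling_params)
            return outputs[0]

    # asyncio.gather returns results in the same order as the inputs.
    return await asyncio.gather(*(one(p) for p in prompts))
```

Passing the whole prompt list in a single call, as the cell above does, is simpler and leaves batching to the engine; a pattern like this is mainly useful when requests are produced by separate coroutines.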

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Character's Name]. I have always been passionate about [what you like to do or read]. I love spending time outdoors, hiking, camping, and exploring new places. I also enjoy [writing, painting, writing about my experiences, etc.]. I recently started reading [book or series]. I enjoy my life and am always looking for new experiences and opportunities to grow. How can I get to know you better? Please provide some context about yourself and your interests. As a new friend, I am excited to get to know you better. I am a [occupation or hobby] who enjoys [activities, hobbies, etc.].



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, an iconic city known for its rich history, diverse culture, and striking architecture. It is located on the Seine River in the Loire Valley, and is home to numerous landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Louvre Museum. Paris has a thriving arts and culture scene, and is home to numerous museums, theaters, and restaurants that draw visitors from around the world. The city also plays a prominent role in France’s political, economic, and cultural life, and is often referred to as the “city of a thousand islands” due to its many islands, such



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to see significant advancements in several areas, including:

1. Increased automation and robotics: AI technology is becoming more advanced, and it is expected to automate many tasks that are currently performed by humans, including manufacturing, healthcare, and retail. This will lead to increased automation in industries and create new jobs in AI-related fields.

2. Improved decision-making and prediction: AI will continue to become more sophisticated, leading to more accurate predictions and decisions based on data. This will enable organizations to make better-informed decisions and improve their outcomes.

3. Enhanced privacy and security: AI systems will need to be designed and implemented with greater consideration of privacy
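
If you need the complete text after streaming rather than printing each chunk, the chunks can be collected from the async iterator. The helper below is a small sketch; `collect_stream` is a hypothetical name, and it only assumes that the iterator yields strings, as the loop in the cell above does.

```python
import asyncio
from typing import AsyncIterator, List, Tuple


async def collect_stream(chunks: AsyncIterator[str]) -> Tuple[str, int]:
    """Consume an async stream of text chunks.

    Returns the concatenated text and the number of chunks received.
    """
    parts: List[str] = []
    async for chunk in chunks:
        parts.append(chunk)
    return "".join(parts), len(parts)
```

With the engine from this document, the call would presumably look like `text, n = await collect_stream(async_stream_and_merge(llm, prompt, sampling_params))`.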




In [6]:
llm.shutdown()
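
In scripts, it can help to guarantee that `shutdown()` runs even when generation raises. One way to do that is a small context manager, sketched below. `engine_session` and the `factory` argument are hypothetical names; the only engine method it relies on is the `shutdown()` call shown above.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def engine_session(factory: Callable[[], Any]) -> Iterator[Any]:
    """Create an engine, hand it to the caller, and always shut it down."""
    engine = factory()
    try:
        yield engine
    finally:
        # Runs on normal exit and when the body raises.
        engine.shutdown()
```

A call such as `with engine_session(lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")) as llm:` would then release the engine when the block exits.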