# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()
```
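
To see why this is needed, here is a minimal sketch using only the standard library. It assumes nothing about SGLang itself: `inner` and `outer` are hypothetical names, and `outer` stands in for code that is already running inside an event loop, as a notebook cell is. Calling `asyncio.run()` from there is rejected with a `RuntimeError`.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # a second, nested asyncio.run() is rejected
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio.apply()` in place, nested calls like this are expected to be allowed, which is why the notebook cells in this document can call `asyncio.run(main())`.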

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Ted. I am a young man. I have a great imagination and I love to read books. My favorite book is "The Giver" by Lois Lowry. It's a book about a society where everyone lives in an environment that is completely controlled by government and a few people who are chosen to lead the society. 

The book is a very peaceful book, and it gives people a clear view of what life is like in the government-controlled society. The government is in charge of all the actions that people take, such as buying, selling, and spending money. The people are told what to do, and if they don't
Prompt: The president of the United States is
Generated text:  a high-ranking leader of the country. Most people think that the president's job is to ensure that the country runs smoothly and to make major decisions. However, the role of the president is not easy. It is important to understand that the president has to make a lot of difficult decisions. Because the president has 

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Job Title] at [Company Name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm passionate about [job title] and I'm always looking for ways to [job title] at [company name]. What excites you about your job? I'm always looking for ways to [job title] at [company name]. What do you enjoy doing in your free time? I enjoy [job title] and I love [job title]. What do you

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French National Library. Paris is a bustling city with a rich cultural heritage and is a major tourist destination. It is also known for its cuisine, including its famous croissants and its famous French fries. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. The city is also home to a diverse population, with French, English, and other languages spoken. Paris is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we interact with technology and the world around us. Here are some of the most likely trends that could shape the future of AI:

1. Increased automation: As AI continues to advance, we can expect to see more and more automation in our daily lives. This could include the automation of tasks such as data entry, customer service, and administrative work. As AI becomes more advanced, we may see even more automation in areas such as manufacturing and transportation.

2. Enhanced privacy: As AI becomes more advanced, we can expect to see more and more privacy concerns.
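
The "overlap removal" mentioned in the cell above can be illustrated with a small, self-contained sketch. This is not necessarily how `stream_and_merge` is implemented in SGLang. It assumes a stream whose chunks may repeat the tail of the text received so far, and `trim_overlap` and `merge_chunks` are hypothetical helper names used only for illustration.

```python
from typing import Iterable


def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the tail of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, dropping any prefix that duplicates the merged tail."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats part of the previous text.
chunks = [" Paris", " Paris is", " is the capital", " capital of France."]
merged = merge_chunks(chunks)
print(merged)
```

One limitation of this approach is that it cannot tell an accidental overlap from text the model legitimately repeats, so it is only appropriate when the stream is known to resend earlier text.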



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I'm a [X] with over [Y] years of experience in [Z] industry, and I enjoy [M] activities. I'm always looking for new opportunities to [N]. How can I be a valuable asset to you? Feel free to add any personal anecdotes or examples to make your self-introduction more engaging and relatable. I look forward to meeting you! [Your Name] With [X] years of experience in [Z] industry, and a passion for [M], I have a solid background in [Z] and have worked in [Y] roles. I enjoy [M]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is located in the center of the country and is the seat of government and the largest city. Paris is known for its art, cuisine, and architecture, as well as its rich history and culture. Paris is also the birthplace of many famous figures, including the writer Ernest Hemin

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [First Name] and I'm [Last Name], a [type of occupation] with [number of years of experience] years of experience in the [industry or field]. I'm currently working as a [title or position] at [company name]. I'm passionate about [something I love to do or have accomplished]. I am a [character trait or hobby] and I'm always up for learning new things. I'm always looking for opportunities to grow and improve myself. My main goal is to become [career goal or accomplishment]. I'm [character trait or hobby] and always strive to do my best. I'm a [



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located on the Seine River and known for its famous landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral.

In addition to being the nation's cultural and political center, Paris also has a rich history, featuring museums like the Louvre and the Musée d'Orsay, as well as the Arc de Triomphe. The city also has a lively nightlife and a wide variety of restaurants, as well as a large population of tourists who flock to the area for cultural experiences and attractions.

Paris is also known for its gastronomy, with its famous bouquets of saus



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be highly unpredictable, but there are a few potential areas of development that are likely to be significant:

1. Autonomous vehicles: With the increasing number of driverless vehicles on the road, autonomous driving technologies are likely to become more common. This could result in a revolution in transportation, as self-driving cars could reduce traffic congestion, improve safety, and decrease the need for human drivers.

2. Personalized healthcare: AI is already being used to analyze patient data and identify patterns and trends, and there's potential for even more personalized healthcare through AI technologies.

3. Enhanced education: AI can be used to analyze student performance data and identify
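
The cell above streams one prompt at a time. As a sketch of the general asyncio pattern for collecting several streams concurrently, the example below uses a stub in place of the engine. `fake_stream`, `collect`, and `run_all` are hypothetical names, the chunks are fixed placeholder strings, and whether concurrent streaming calls behave well against a real engine is an assumption that should be verified before relying on it.

```python
import asyncio
from typing import AsyncIterator, Dict, List


async def fake_stream(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming engine call; yields fixed chunks."""
    for word in ("alpha", "beta", "gamma"):
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield f" {word}"


async def collect(prompt: str) -> str:
    """Consume one stream and join its chunks into a single string."""
    parts: List[str] = []
    async for chunk in fake_stream(prompt):
        parts.append(chunk)
    return "".join(parts)


async def run_all(prompts: List[str]) -> Dict[str, str]:
    """Drive all streams concurrently; gather preserves the input order."""
    results = await asyncio.gather(*(collect(p) for p in prompts))
    return dict(zip(prompts, results))


results = asyncio.run(run_all(["first prompt", "second prompt"]))
print(results)
```

Collecting each stream into its own buffer keeps the outputs of different prompts from interleaving on screen, at the cost of not printing tokens as they arrive.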




In [6]:
llm.shutdown()