# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed, working example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
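
As a rough illustration of the custom-server idea, the sketch below shows a framework-agnostic request handler that turns a JSON body into a call to `generate`. The `handle_generate_request` function, the JSON field names (`prompts`, `sampling_params`, `texts`), and the `StubEngine` stand-in are assumptions made for this illustration and are not part of SGLang. In real use you would pass an `sgl.Engine` instance and wire the handler into the web framework of your choice.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params=None):
        # Mimic the shape used in this document: one dict with a "text" key per prompt.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def handle_generate_request(engine, body: bytes) -> bytes:
    """Parse a JSON request, call the engine, and return a JSON response."""
    payload = json.loads(body)
    prompts = payload["prompts"]  # assumed field name
    sampling_params = payload.get("sampling_params", {})  # assumed field name
    outputs = engine.generate(prompts, sampling_params)
    texts = [out["text"] for out in outputs]
    return json.dumps({"texts": texts}).encode("utf-8")


request = json.dumps({"prompts": ["The capital of France is"]}).encode("utf-8")
response = json.loads(handle_generate_request(StubEngine(), request))
print(response["texts"])
```

Keeping the handler independent of any web framework makes it easy to test in isolation. The linked `custom_server` script remains the reference for a complete server.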



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
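
To decide whether the patch is needed, it can help to check whether code is already running inside an event loop. The following is a minimal sketch using only the standard library; the helper names `in_running_loop` and `check_inside` are made up for this example.

```python
import asyncio


def in_running_loop() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop raises RuntimeError when no loop is running.
        return False
    return True


async def check_inside() -> bool:
    return in_running_loop()


outside = in_running_loop()           # plain script: no loop is running
inside = asyncio.run(check_inside())  # inside a coroutine: a loop is running
print(outside, inside)
```

In a notebook, where an event loop is typically already running, `in_running_loop()` would return `True` at the top level. That is the situation in which `nest_asyncio.apply()` is needed.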

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Ehab. I'm a computer scientist working on algorithms for high-dimensional data. I study computer science and mathematics at Bar Ilan University, and I've been working on deep learning and computer graphics for the last 3 years.

My research interests include:

  * Combinatorial optimization problems
  * Computational geometry
  * Computational geometric algorithms
  * Geometric and geometric optimization
  * Data mining and information retrieval

My thesis project at Bar Ilan University is "A constrained geometric optimization problem". The problem is a variation of the combinatorial optimization problem, which aims to find the minimum cost assignment problem for the constraint set
Prompt: The president of the United States is
Generated text:  trying to decide whether to attend a press conference that is scheduled to start at 9:00 AM local time. He knows that the local time zone is UTC-5. On the same day, the president watches a news report th
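
When the prompt list is very large, you may want to submit it in slices so that results can be saved incrementally. The sketch below pairs each prompt with its generated text. It relies only on the `generate(prompts, sampling_params)` call and the `output["text"]` field shown above; `generate_in_chunks` and `StubEngine` are hypothetical names introduced here so the example runs without a model.

```python
from typing import Any, Dict, List


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; returns one dict per prompt."""

    def generate(self, prompts, sampling_params=None):
        return [{"text": f" [reply to {p!r}]"} for p in prompts]


def generate_in_chunks(
    engine, prompts: List[str], sampling_params: Dict[str, Any], chunk_size: int
) -> List[Dict[str, str]]:
    """Call engine.generate on fixed-size slices and pair each prompt with its text."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    records = []
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = engine.generate(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            records.append({"prompt": prompt, "text": output["text"]})
    return records


records = generate_in_chunks(
    StubEngine(),
    ["Hello, my name is", "The capital of France is", "The future of AI is"],
    {"temperature": 0.8, "top_p": 0.95},
    chunk_size=2,
)
for record in records:
    print(record)
```

With a real engine, pass `llm` in place of `StubEngine()`. Since the engine already schedules requests within a batch, slicing is mainly useful for checkpointing results, not for speed.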

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French National Museum, and the French National Radio and Television Broadcasting Company. Paris is a bustling metropolis with a rich cultural heritage and is a major tourist destination. The city is known for its fashion, art, and cuisine, and is a major economic center in Europe. It is also home to many important historical sites and landmarks, including the Palace of Versailles and the Champs-Élysées. Paris is a city of contrasts, with its

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we interact with technology and the world around us. Here are some of the most likely trends that could be expected in the future:

1. Increased automation: As AI becomes more advanced, it is likely to become more capable of performing tasks that were previously done by humans. This could lead to a significant increase in automation in industries such as manufacturing, transportation, and healthcare.

2. AI-powered healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI technology continues to advance, it is likely to become even more sophisticated and capable
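
The "overlap removal" mentioned in the cell above can be pictured as trimming any text a new chunk repeats from the end of what has already been merged. The sketch below is one plausible reading of that idea and is not necessarily how `stream_and_merge` is implemented; `strip_overlap` and `merge_chunks` are hypothetical helpers written for this illustration.

```python
from typing import Iterable


def strip_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that is also a suffix of `existing`."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk  # no overlap found, keep the chunk unchanged


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


merged = merge_chunks(["The cap", "capital of", " of France"])
print(merged)
```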



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert fictional character's name]. I am [insert fictional character's age and nationality]. I grew up in [insert fictional character's hometown] and I attended [insert fictional character's school's name]. I am a [insert fictional character's profession] with a passion for [insert fictional character's hobbies or interests]. I love to [insert fictional character's personal interests or hobbies]. I am known for [insert fictional character's personality traits or qualities]. I am always [insert fictional character's personality traits or qualities]. I am [insert fictional character's personality traits or qualities]. I am [insert fictional character's personality traits or qualities].

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light. It is a historical city with many icon
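
Because `async_generate` is awaited, several independent requests can be issued together and awaited as a group. The sketch below uses `asyncio.gather` for this. It assumes only the `async_generate(prompts, sampling_params)` call shown above; `generate_groups` and `StubAsyncEngine` are hypothetical names so that the example runs without a model.

```python
import asyncio
from typing import Dict, List


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine exposing async_generate."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0)  # yield control, as a real call would while waiting
        return [{"text": f" [reply to {p!r}]"} for p in prompts]


async def generate_groups(
    engine, groups: List[List[str]], sampling_params: Dict
) -> List[List[str]]:
    """Issue one async_generate call per group and await them together."""
    tasks = [engine.async_generate(group, sampling_params) for group in groups]
    results = await asyncio.gather(*tasks)  # results keep the order of `groups`
    return [[out["text"] for out in outputs] for outputs in results]


texts = asyncio.run(
    generate_groups(
        StubAsyncEngine(),
        [["Hello, my name is"], ["The capital of France is", "The future of AI is"]],
        {"temperature": 0.8, "top_p": 0.95},
    )
)
print(texts)
```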

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name here]. I am [insert role here], an [insert profession or position here]. I have been working at this company [insert company name] for [insert number of years] years. I work in the [insert department or area of the company here]. My name is [insert name here]. I am a [insert profession or role here]. And I would like to give you a little introduction, but please do not see it as a compliment. I really like your company and the people here. I am very impressed by your company culture, the variety of projects you do, the great jobs you do, the friendly



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the largest city in France and the third-largest in the European Union. It is home to many of the country’s cultural landmarks and world-renowned attractions, including the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Seine River. The city is also known for its fashion industry and is home to the annual Eiffel Tower Fashion Week. Paris is a major tourist destination and a popular European capital for business, politics, and leisure. The city was founded in the 12th century and has since become a center of culture, business, and politics in the world.



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  full of exciting possibilities and possibilities for growth. Here are some possible future trends in artificial intelligence:

1. AI will become more accessible to the general public. AI-powered tools like virtual assistants, chatbots, and voice assistants are becoming increasingly popular, and there is a growing demand for them. This will mean that AI will become more accessible to people of all ages and backgrounds, making it easier for people to integrate AI into their daily lives.

2. AI will become more integrated into various industries. AI is already being integrated into many industries, such as healthcare, transportation, finance, and manufacturing, but there is still a lot of potential for
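
If you need the complete text as well as the live printout, the streamed chunks can be accumulated while they are displayed. The sketch below shows that pattern with a stand-in stream. `fake_stream` and `collect_stream` are hypothetical helpers; in real use the stream would be the async iterator returned by `async_stream_and_merge(llm, prompt, sampling_params)`.

```python
import asyncio
from typing import AsyncIterator, List, Tuple


async def fake_stream(chunks: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for a chunk stream such as async_stream_and_merge."""
    for chunk in chunks:
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield chunk


async def collect_stream(stream: AsyncIterator[str]) -> Tuple[str, int]:
    """Print chunks as they arrive and return the full text and chunk count."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # newline once the stream is exhausted
    return "".join(parts), len(parts)


full_text, num_chunks = asyncio.run(
    collect_stream(fake_stream([" Paris", ".", " It", " is"]))
)
```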




In [6]:
llm.shutdown()
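
To make sure `shutdown()` runs even when inference raises an exception, the engine can be wrapped in a context manager. This is a minimal sketch that relies only on the `shutdown()` method shown above; `managed_engine` and `StubEngine` are hypothetical names used so the example runs without a model.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(factory):
    """Create an engine with `factory` and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


stub = None
try:
    with managed_engine(StubEngine) as engine:
        stub = engine
        raise RuntimeError("simulated failure during inference")
except RuntimeError:
    pass

print(stub.is_shut_down)
```

With a real engine, the factory could be a function that returns `sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`.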