# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
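
To illustrate why this is needed, the sketch below uses only the standard library (no SGLang) to reproduce the error raised when `asyncio.run()` is called from code that is already running inside an event loop, as is the case in a notebook cell. The names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # a nested asyncio.run() is rejected by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that nested calls of this kind are allowed.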

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Nima. I’m a software engineer, designer, and a visual artist. I’m a computer programmer with a love of the digital and visual arts. I’m also an avid gamer, and enjoy balancing my programming work with my gaming, spending as much time as possible on the computer as I can.
I am not a prolific writer, and have not published any works of art, but I have a collection of works that I’m passionate about. I’m not a member of any organization, and I have not been asked to contribute to any academic or other professional publications.
I believe that creativity and making art is a way to express emotions
Prompt: The president of the United States is
Generated text:  a very important person. He is like the boss of the country. But he is also very important to other people. He is the leader of the country and the leader of the people. Other people's lives are very important to him. The president is the person who has the most power in the United States. He
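
Offline batch results are often saved for later analysis. The following is a minimal sketch, not part of SGLang, that writes each prompt/output pair as one JSON line. The helper name `write_jsonl` and the record keys `prompt` and `completion` are assumptions made for illustration, and the sample data only mimics the list-of-dicts shape with a `"text"` key used in the cell above.

```python
import io
import json


def write_jsonl(prompts, outputs, stream):
    """Write one JSON object per prompt/output pair to a text stream."""
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")


# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
example_prompts = ["The capital of France is", "The future of AI is"]
example_outputs = [{"text": " Paris."}, {"text": " hard to predict."}]

buffer = io.StringIO()
write_jsonl(example_prompts, example_outputs, buffer)
lines = buffer.getvalue().splitlines()
print(lines[0])
```

In practice the `StringIO` buffer would be replaced by a file opened in text mode.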

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? I'm a [insert a short description of your personality or skills]. What do you like to do in your free time? I enjoy [insert a short description of your hobbies or interests]. What's your favorite book or movie? I love [insert a short description of your favorite book or movie]. What's your favorite hobby? I love [insert a short description of your favorite hobby]. What's your favorite place to go? I love [insert a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history dating back to the Roman Empire and the Middle Ages. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also famous for its cuisine, fashion, and music, and is a major tourist destination. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly together. It is a city that has been a hub of culture and commerce for centuries, and continues to be a major cultural and economic center of France. The city is also

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more sophisticated, it is likely to become more integrated with human intelligence, allowing for more complex and nuanced decision-making. This could lead to a more human-like experience for users.

2. Greater reliance on data: AI will become more data-driven, with more data being collected and analyzed to improve its performance. This could lead to more accurate and reliable predictions and recommendations.

3. Increased ethical considerations: As AI becomes more integrated into our lives, there will be increased ethical considerations around its use. This could lead to more stringent regulations and guidelines to
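
The "overlap removal" mentioned in the cell above refers to merging streamed chunks without repeating text that has already been received. The sketch below shows one way this could be done and is not necessarily how `stream_and_merge` is implemented. It finds the longest suffix of the accumulated text that is also a prefix of the new chunk and appends only the remainder. The names `strip_overlap` and `merge_chunks` are hypothetical.

```python
def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the end of existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, dropping any overlap with the text merged so far."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


merged_text = merge_chunks(["The capital", " capital of", " of France"])
print(merged_text)  # The capital of France
```

A heuristic like this cannot tell an overlap from a genuine repetition at a chunk boundary, so text that legitimately repeats there would be dropped.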



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I am a [Occupation] with a passion for [Objective]. I am known for [Summary of Character's Characteristics, Examples, Achievements, etc.]. I love [Favorite Activity/Interest/Religion], and I strive to [Motivational Goal or Personal Journey]. I am always looking for [What I seek in a Career or Experience]. I'm here to provide [Your Purpose, Benefits, or Additional Information]. I'm excited to meet you. What's your name, and what do you do for a living?
Hello! My name is [Name] and I am a [Occupation].

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known for its iconic Eiffel Tower and vibrant fashion scene. 

(Note: As of 2023, Paris has a population of approximately 2.1 million people and is the largest city in the world by population. ) 

Does the statement accurately reflect the fac
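
The cell above passes the whole prompt list in a single call. When requests arrive one at a time, for example in a custom server, a possible pattern is to issue one awaitable call per prompt and cap how many are in flight. The sketch below uses only the standard library. `fake_async_generate` is a hypothetical stand-in for a per-prompt awaitable call, and `generate_bounded` is an illustrative helper, not an SGLang API. The engine does its own scheduling, so whether such a cap helps depends on the workload.

```python
import asyncio


async def fake_async_generate(prompt: str) -> dict:
    """Hypothetical stand-in for an awaitable generate call; returns a dict with "text"."""
    await asyncio.sleep(0.01)  # simulate inference latency
    return {"text": f" <completion of {prompt!r}>"}


async def generate_bounded(prompts, max_concurrency: int = 2):
    """Run one request per prompt, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(prompt: str) -> dict:
        async with semaphore:
            return await fake_async_generate(prompt)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(one(p) for p in prompts))


demo_prompts = ["a", "b", "c"]
demo_outputs = asyncio.run(generate_bounded(demo_prompts))
for prompt, output in zip(demo_prompts, demo_outputs):
    print(prompt, "->", output["text"])
```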

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  ____________ and I am a/an _____________. I am a/an ___________ and I love to ___________. If you could share a goal with me, what would it be? Please provide the character's name, profession, and their love for a specific goal in the following sentence: "Hello, my name is [name] and I am a/an [profession]. I am a/an [profession] and I love to [love goal]. If you could share a goal with me, what would it be?" I don't want to hear any personal information about the character. I only want a neutral self-introduction



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also known for its rich history and unique culture, and is a major economic and political center in Europe. It has a diverse population and is home to various cultural institutions, including the Mairie of Saint-Louis (Summer Palace). Additionally, Paris is known for its fashion industry, with iconic fashion boutiques and designers such as Pierre Cardin and Paul Poiret. Overall, Paris is a vibrant and dynamic city with a rich cultural heritage. Paris has been described as "the most beautiful place



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly dynamic, and it is difficult to predict with certainty. However, some possible future trends in AI include:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As the technology becomes more advanced, we can expect to see even more widespread adoption of AI in healthcare in the coming years.

2. Emergence of self-driving cars: As AI technology improves, we are likely to see the development of fully autonomous vehicles. This could lead to significant changes in traffic patterns and public transportation systems, as well as a reduction in accidents and fatalities.

3. Integration of AI into various
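
When streaming asynchronously, it is often useful to both display chunks as they arrive and keep the full text for later use. The sketch below shows that pattern with the standard library only. `fake_stream` is a hypothetical stand-in for an async generator of text chunks such as the one consumed in the cell above, and `collect` is an illustrative helper rather than an SGLang function.

```python
import asyncio


async def fake_stream(text: str):
    """Hypothetical stand-in for a streaming call: yields one word-sized chunk at a time."""
    for word in text.split(" "):
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield word + " "


async def collect(stream) -> str:
    """Print chunks as they arrive and also return the full text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream("The future of AI")))
```

Collecting the chunks in a list and joining once at the end avoids repeated string concatenation for long generations.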




```python
# Shut down the engine when finished
llm.shutdown()
```
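
If inference code can raise an exception, the final `shutdown()` call may never be reached. One possible safeguard, sketched below with the standard library only, is to wrap the engine in a small context manager. `DummyEngine` and `managed` are hypothetical names used for illustration, and `DummyEngine` models nothing beyond the `shutdown()` method shown above.

```python
from contextlib import contextmanager


class DummyEngine:
    """Hypothetical stand-in exposing only the shutdown() method used above."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def managed(engine):
    """Yield the engine and call shutdown() on exit, even if an exception is raised."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = DummyEngine()
try:
    with managed(engine):
        raise ValueError("simulated failure during inference")
except ValueError:
    pass

print(engine.closed)  # True
```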