# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed, working example is available as a Python script in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
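
To make the idea of a custom server more concrete, here is a minimal sketch that uses only the Python standard library. It is not the linked `custom_server.py` example. `EchoEngine`, `handle_generate`, `make_handler`, and `serve` are hypothetical names, and `EchoEngine` is a stand-in that only mimics the `generate(prompts, sampling_params)` call shape used later in this document; it performs no inference.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoEngine:
    """Hypothetical stand-in for the engine; swap in a real engine object."""

    def generate(self, prompts, sampling_params):
        # Same call shape as in this document: list of prompts in, list of dicts out.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def handle_generate(engine, body: bytes) -> bytes:
    """Decode a JSON request, call the engine, and encode a JSON response."""
    payload = json.loads(body)
    params = payload.get("sampling_params", {})
    output = engine.generate([payload["prompt"]], params)[0]
    return json.dumps({"text": output["text"]}).encode("utf-8")


def make_handler(engine):
    """Build a request handler class bound to one engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            response = handle_generate(engine, self.rfile.read(length))
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(response)))
            self.end_headers()
            self.wfile.write(response)

    return GenerateHandler


def serve(engine, host="127.0.0.1", port=8000):
    """Blocking helper; not called here so the sketch stays side-effect free."""
    HTTPServer((host, port), make_handler(engine)).serve_forever()


# Exercise the request logic directly, without opening a socket.
reply = handle_generate(EchoEngine(), b'{"prompt": "Hello, my name is"}')
print(reply.decode("utf-8"))
```

Keeping the request logic in a plain function (`handle_generate`) means it can be tested without a running server, and the web framework can be changed without touching the engine call.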



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
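
The reason for this patch can be reproduced without SGLang: by default, `asyncio.run` refuses to start while an event loop is already running, which is the situation inside IPython or Jupyter. The following standard-library sketch shows that failure; the helper names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # a nested asyncio.run() is rejected by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return None


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that nested calls of this kind are allowed.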

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Elena, I'm 22 years old, and I'm from the United States. I work as a teacher and spend my time teaching kids. I have a very sweet personality, and I love to play sports. I'm really creative and enjoy painting, reading, and writing. I also love to cook. What are some of your hobbies and interests? Elena, thanks for taking the time to chat with me. Yes, I'd love to chat with you too! Hobbies and interests can vary greatly depending on your personality and interests. However, I'd be happy to have a conversation about your hobbies and interests in a relaxed setting
Prompt: The president of the United States is
Generated text:  a very important person. He helps to make sure that the country is safe and that the people get their fair share of what they need to eat. He is also the leader of the country and the leader of the people. He is responsible for making sure that the people are able to travel to other countries to visit their friends and famil
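
The `temperature` and `top_p` entries in `sampling_params` control how the next token is drawn. The sketch below shows the textbook definitions of temperature scaling and nucleus (top-p) filtering on made-up logits. It is an illustration only and not SGLang's sampler; `apply_temperature` and `top_p_filter` are hypothetical helper names.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative mass reaches top_p,
    then renormalize the kept probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
probs = apply_temperature(logits, temperature=0.8)
nucleus = top_p_filter(probs, top_p=0.95)
print(probs)
print(nucleus)
```

With these made-up logits, the least likely token falls outside the nucleus and can no longer be sampled. Lower temperatures sharpen the distribution, which is why the streaming example below, with `temperature` set to 0.2, tends to produce more repetitive text.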

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as "La Ville de Paris" and the "City of Light". It is the largest city in France and the second-largest city in the European Union. Paris is known for its rich history, art, and culture, and is a major tourist destination. It is also a major center for business, finance, and government. The city is home to many famous landmarks, including the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is a vibrant and dynamic city with a rich cultural heritage that continues to thrive today. The city is also home to many international organizations and institutions, including the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with other technologies: AI is likely to become more integrated with other technologies, such as sensors, cameras, and machine learning algorithms, to enable more sophisticated and accurate applications.

2. Enhanced privacy and security: As AI systems become more sophisticated, there will be a greater need for privacy and security measures to protect user data and prevent misuse of AI systems.

3. Greater focus on ethical considerations: As AI systems become more complex and sophisticated, there will be a greater need for ethical considerations to ensure that AI systems are used in a responsible and ethical manner.
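
The "overlap removal" mentioned in the banner presumably refers to merging streamed chunks so that text repeated at a chunk boundary appears only once. The internals of `stream_and_merge` are not shown in this document, so the following is only a sketch of the general idea, using hypothetical helper names (`trim_overlap_sketch`, `merge_chunks`) and made-up chunks.

```python
def trim_overlap_sketch(existing, chunk):
    """Return the part of `chunk` that does not repeat the tail of `existing`."""
    longest = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(longest, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks, dropping any prefix already present at the end."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


# Made-up chunks whose boundaries repeat some text.
chunks = ["The capital", " capital of France", " France is Paris."]
merged = merge_chunks(chunks)
print(merged)
```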





### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name] and I am a/an [insert profession/education/other identifying characteristic] from [insert location]. I am a/an [insert age] year old, [insert gender] and I am of [insert height] inches. I love [insert hobby or interest] and I am always looking for new adventures and experiences. I am [insert personality trait or characteristic] and I always strive to be the best version of myself. I believe in [insert personal belief or motto] and I am always open to learning and evolving as a person. I am a/an [insert occupation] who is always [insert strength, endurance

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is the largest city and a major French-speaking and Francophone city in Europe. It is situated in the north-central region of the country and is the administrative and cultural center
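
Because `async_generate` is awaited, the calling coroutine can share the event loop with other work. The sketch below illustrates that pattern with `asyncio.gather`. `FakeEngine` and `run_batches` are hypothetical: the stand-in class only mimics the call shape used above (a list of prompts in, a list of dicts with a `text` key out) and performs no inference. Whether concurrent calls suit a real workload should be verified against the engine itself.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine; it only echoes prompts."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do some work
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Submit all batches at once and wait for every result, in order.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(
    run_batches(FakeEngine(), batches, {"temperature": 0.8, "top_p": 0.95})
)
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(prompt, "->", output["text"])
```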

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm [Name] from [City]. I'm a [job title] at [Company]. I'm excited to meet you and help you achieve your [goals/ambitions]. I'm eager to learn from you and share my knowledge with you. And, I'm always willing to help, no matter what. I'm ready to get started. How can I help you? Let's get to know each other and see what we can do together! [Name] is looking forward to learning more about you and exploring opportunities together. I look forward to meeting you and getting to know you better. How can I



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city of love and of culture. Located on the Seine River, Paris is a historic city famous for its museums, monuments, and many of the world's most famous landmarks, including the Louvre, the Notre-Dame Cathedral, and the Arc de Triomphe. Known for its beauty and romance, Paris is also a major commercial and financial center, hosting the Eiffel Tower and the Notre Dame Cathedral. Despite its fame, Paris remains a relatively small and young city, with a population of about 2.2 million people. Its cuisine is influenced by its French heritage and its proximity to the ocean, and



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to involve a number of significant trends and developments. Here are some potential trends:

1. Increased integration: AI will become more integrated into our lives, from the devices we use to the services we access. This integration will likely involve more complex interactions between humans and machines, as well as between different AI systems.

2. Advancements in machine learning: There will be continued progress in machine learning, particularly in areas such as natural language processing, computer vision, and speech recognition. This will allow AI systems to become more sophisticated and able to solve complex problems more effectively.

3. Autonomous agents: Autonomous agents will become more common in our daily
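
The `async for` loop above prints each chunk as it arrives. If the full text is also needed afterwards, the chunks can be collected while they are printed. The sketch below shows that consumer pattern with a hypothetical async generator, `fake_stream`, which yields fixed pieces of text instead of calling the engine; `consume` is likewise an illustrative name.

```python
import asyncio


async def fake_stream(prompt):
    """Hypothetical stand-in for a streaming call; yields fixed text pieces."""
    for piece in [" Paris", ",", " the", " city", " of", " light", "."]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield piece


async def consume(prompt):
    pieces = []
    async for piece in fake_stream(prompt):
        print(piece, end="", flush=True)  # show text as soon as it arrives
        pieces.append(piece)
    print()
    return "".join(pieces)


full_text = asyncio.run(consume("The capital of France is"))
```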




```python
llm.shutdown()
```
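
In a script, as opposed to a notebook, it can be useful to make sure `shutdown()` runs even when generation raises an exception. A minimal sketch of that pattern with `try`/`finally` follows; `StubEngine` is a hypothetical stand-in that only records whether `shutdown()` was called.

```python
class StubEngine:
    """Hypothetical stand-in; records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.closed = True


engine = StubEngine()
try:
    engine.generate([], {"temperature": 0.8})  # fails on purpose
except ValueError as exc:
    print("generation failed:", exc)
finally:
    engine.shutdown()  # runs whether or not generation succeeded
```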