# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an HTTP layer would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
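
To give a rough idea of what "a custom server on top of the engine" involves, here is a framework-agnostic sketch of a request handler. It is not taken from the linked script: `StubEngine` and `handle_generate` are hypothetical names, and the stub only imitates the `{"text": ...}` shape that `llm.generate` returns in the cells of this document, so the sketch runs without a GPU.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompt, sampling_params=None):
        # Assumption: mirrors the {"text": ...} output shape used in this document.
        return {"text": f" <completion for: {prompt}>"}


def handle_generate(engine, raw_body: bytes):
    """Turn a JSON request body into a (status_code, JSON response) pair."""
    try:
        payload = json.loads(raw_body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, or JSON without a "prompt" field.
        return 400, json.dumps({"error": "expected JSON with a 'prompt' field"})

    sampling_params = payload.get("sampling_params", {})
    output = engine.generate(prompt, sampling_params)
    return 200, json.dumps({"text": output["text"]})


engine = StubEngine()
status, body = handle_generate(engine, b'{"prompt": "The capital of France is"}')
print(status, body)
```

In a real server, `StubEngine()` would be replaced by an `sgl.Engine` instance created once at startup, and `handle_generate` would be called from whichever HTTP framework you prefer.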



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, Jupyter, or any other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
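
The reason for this patch lies in `asyncio` itself: by default, `asyncio.run()` refuses to start while another event loop is already running in the same thread, which is the situation inside IPython or Jupyter. The stdlib-only sketch below reproduces that error without SGLang; the coroutine names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    coro = inner()
    try:
        # We are already inside a running loop, so this call is rejected.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine never started; close it to avoid a warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Without the patch, the nested call fails with a `RuntimeError`. Applying `nest_asyncio.apply()` is intended to make such nested calls possible.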

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Kip. I am a young man from California who has been learning to read since I was 7 years old. I have a passion for learning and studying. I love to travel and take trips that challenge my mind and open my eyes to different cultures. I find that reading can help me to better understand and learn about the world around me. How can I improve my reading skills and my understanding of reading? Can you suggest some resources or activities I can do to enhance my reading and learning abilities? Additionally, do you have any tips or advice for someone who wants to improve their reading comprehension and writing skills? Let me know! K
Prompt: The president of the United States is
Generated text:  trying to decide whether to lead a war on Russia or not. The president says, "If we are leading a war on Russia, we will not lose any military personnel," and "If we are not leading a war on Russia, we will have more military personnel." How many military person

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [insert your profession or role here]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert your profession or role here], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert your profession or role here], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert your profession or role here], and I'm excited to meet

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic Eiffel Tower, Notre-Dame Cathedral, and diverse cultural scene. 

(Note: The statement provided is a factual statement about Paris, not a fictional one.) 

Facts about Paris:

1. The capital of France, Paris is the largest city in Europe by population.
2. It is home to the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral.
3. The city is known for its rich history, art, and cuisine.
4. Paris is also a major transportation hub, with many famous landmarks and transportation options. 

Note: The statement provided is a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased automation: AI will continue to automate tasks that are currently done by humans, such as data analysis, decision-making, and routine maintenance. This will lead to increased efficiency and productivity, but it will also create new jobs that are not yet created.

2. Enhanced human interaction: AI will continue to improve the way we interact with machines, making it easier to communicate with them and to understand their behavior. This will lead to more natural and intuitive interactions between humans and machines.

3. AI will become more integrated with the physical world: AI will become more integrated with the physical world,
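
The "overlap removal" mentioned in the cell above can be pictured as follows. This is a minimal sketch of the idea, not necessarily how `stream_and_merge` is implemented: it assumes each streamed chunk is a dict with a `"text"` field that may repeat text already received, and it uses a hand-written `fake_stream` instead of a real engine. The names `trim_overlap` and `merge_stream` are chosen here for illustration.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks) -> str:
    """Concatenate streamed chunk texts, skipping any repeated overlap."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk["text"])
    return merged


# Hypothetical chunks: each one repeats part of the text seen so far.
fake_stream = [
    {"text": " Paris"},
    {"text": " Paris, known"},
    {"text": " known for its"},
]
merged_text = merge_stream(fake_stream)
print(merged_text)
```

With a real engine, the chunks would presumably come from a streaming call such as `llm.generate(prompt, sampling_params, stream=True)`; treat that call shape as an assumption and check it against your installed SGLang version.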



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm an [age] year old [occupation], and I'm a friendly and helpful person. I love spending my time with friends and family, and I enjoy reading books and traveling. I'm always looking for new experiences and am always open to learning new things. I have a passion for helping people, and I'm always looking for ways to make a difference. Thank you for asking! I'd love to hear about some of your hobbies or interests. [Name]! Let me know if you'd like me to share any of my hobbies or interests. I'm happy to share! I'm an [age]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

That's a true fact. Paris is the capital city of France. It is located in the northwestern part of the country and is the largest city by population in the European Union. It is also one of the most famous cities in the world a
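
Because the asynchronous call is an ordinary coroutine, it can be combined with other `asyncio` tools. As one possible pattern, the sketch below issues one request per prompt and awaits them together with `asyncio.gather`. It runs against `StubAsyncEngine`, a hypothetical stand-in whose `async_generate` only imitates the call shape used above, so it illustrates the control flow rather than real engine behavior.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in exposing an async_generate-like coroutine."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params):
    # Launch one coroutine per prompt; gather returns results in input order.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


prompts = ["Hello, my name is", "The capital of France is"]
outputs = asyncio.run(
    generate_all(StubAsyncEngine(), prompts, {"temperature": 0.8})
)
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Passing the whole prompt list in a single call, as the cell above does, is simpler; per-prompt coroutines are mainly useful when requests arrive at different times.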

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [occupation] who has been living in [city or country] for [number of years]. I'm currently in my [age]th season of life. I've always been passionate about [thing], and I'm dedicated to always [do something]. I'm always looking for new experiences and challenges, and I'm always eager to learn and grow. I enjoy [thing] and I'm always seeking new ways to improve myself and expand my horizons. I'm a [character type] with a sense of [positive attribute] and I'm always looking to make a difference in the world. Thanks



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest and most populous city in the country, with an estimated population of 2.1 million inhabitants as of 2021. It is located in the Île-de-France region of France and is known for its rich history, art, and culture. Paris is home to many iconic landmarks, including the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum, as well as its vibrant nightlife and fashion scene. The city is also home to many cultural institutions, including the Metropolitan Museum of Art, the Musée d'Orsay, and the Pompidou Center. The



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  extremely promising and will continue to evolve rapidly. Here are some potential trends in AI that are currently in the early stages of development, but are likely to become more common over time:

1. Increased focus on ethical AI: As more and more people become aware of the potential risks and ethical concerns surrounding AI, there will be an increased focus on developing AI that is more ethical and responsible. This will involve developing AI that is designed to minimize harm and maximize benefits for individuals and society as a whole.

2. Improved privacy and data protection: As more data becomes available, there will be a need to develop technologies that can protect individual privacy and data
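
The streaming loop above prints each cleaned chunk as soon as it arrives. The sketch below shows the same consumption pattern with plain `asyncio`, under the assumption that the underlying stream yields dicts whose `"text"` field holds the cumulative text so far. `fake_chunk_stream` and `stream_deltas` are hypothetical helpers written for this illustration, not SGLang functions.

```python
import asyncio


async def fake_chunk_stream():
    """Hypothetical async stream whose chunks carry the cumulative text."""
    for text in [" The", " The future", " The future of AI"]:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield {"text": text}


async def stream_deltas(chunk_stream):
    """Yield only the text that is new relative to what was already seen."""
    seen = ""
    async for chunk in chunk_stream:
        text = chunk["text"]
        # Cumulative chunk: keep the new tail. Otherwise pass it through.
        delta = text[len(seen):] if text.startswith(seen) else text
        seen += delta
        yield delta


async def collect():
    pieces = []
    async for delta in stream_deltas(fake_chunk_stream()):
        print(delta, end="", flush=True)  # print incrementally, no newline
        pieces.append(delta)
    print()
    return pieces


pieces = asyncio.run(collect())
```

Printing with `end=""` and `flush=True` is what makes the text appear piece by piece on one line instead of one token per line.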




```python
llm.shutdown()
```