# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, for use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
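
To see why the patch is needed, the following standard-library sketch reproduces the underlying restriction: `asyncio.run()` refuses to start when an event loop is already running, which is the situation inside IPython or Jupyter. The names `inner` and `outer` are illustrative only and are not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, just like in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # starting a second loop from inside the first one
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` before such nested calls patches `asyncio` so that they are allowed.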

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Maria and I am writing a biography of a deceased family member. The subject's name is Alvin. I need assistance with creating the first few sentences of the biography. Can you help me?

Certainly! Writing a biography of a deceased family member can be a deeply personal and emotional process, so it's important to find a way to connect the biography with the individual's life in a meaningful way. Here are a few sentences that could be included in your biography:

1. **Introduction**: Write a brief introduction to your book, setting the scene and introducing the key events of Alvin's life, including how he came to be and what
Prompt: The president of the United States is
Generated text:  seeking endorsements for a new political campaign. The campaign staff has compiled the following data to help him:

1. The average annual income of all candidates is $150,000.
2. The average annual income of the first candidate is $100,000.
3. The average annual i
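
In an offline batch job the results are usually persisted rather than printed. The sketch below pairs each prompt with its output and serializes the pairs as JSON Lines. It assumes only what the cell above shows, namely that `llm.generate` returns one dict with a `"text"` key per prompt; the helper name `to_jsonl` and the stand-in `outputs` list are hypothetical and not part of SGLang.

```python
import json
from typing import Dict, List


def to_jsonl(prompts: List[str], outputs: List[Dict[str, str]]) -> str:
    """Serialize prompt/output pairs as one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
prompts = ["Hello, my name is", "The capital of France is"]
outputs = [{"text": " Maria."}, {"text": " Paris."}]

jsonl = to_jsonl(prompts, outputs)
print(jsonl)
```

With a real engine, `outputs` would come from `llm.generate(prompts, sampling_params)`, and the string could be written to a file.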

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Job Title] at [Company Name]. I'm a [Number] year old, [Gender] and [Country]. I'm [Number] of [Number] years old, [Gender] and [Country]. I'm [Number] of [Number] years old, [Gender] and [Country]. I'm [Number] of [Number] years old, [Gender] and [Country]. I'm [Number] of [Number] years old, [Gender] and [Country]. I'm [Number] of [Number] years old, [Gender] and [Country]. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, with a rich history dating back to ancient times. Paris is home to many famous museums, including the Louvre, the Musée d'Orsay, and the Musée d'Art Moderne. The city is also known for its fashion industry, with many famous designers and boutiques. Paris is a popular tourist destination, with millions of visitors each year. It is a major hub for international trade and diplomacy, with the French government and emb

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn and adapt to human behavior and preferences. This could lead to more sophisticated and personalized AI systems that can better understand and respond to human needs.

2. Enhanced machine learning capabilities: AI is likely to become even more powerful and capable, with the ability to learn from vast amounts of data and make more accurate predictions and decisions. This could lead to more efficient and effective use of resources, as well as better decision-making in various industries.

3. Greater emphasis on
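
The cell above mentions "overlap removal" without showing how it works. One plausible reading is that consecutive streamed chunks may repeat the tail of the text already received, so the repeated prefix is dropped before appending. The sketch below illustrates that idea under this assumption; `trim_overlap_sketch` and `merge_chunks` are hypothetical names, and this is not necessarily how `stream_and_merge` is implemented.

```python
from typing import Iterable


def trim_overlap_sketch(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that already ends `existing`."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while removing overlapping text between neighbors."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


merged = merge_chunks(["Hello wor", "world", "ld!"])
print(merged)  # Hello world!
```

A suffix/prefix match like this can also remove text that repeats legitimately, so it only suits streams where overlap is known to occur.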



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name] and I'm a/an [Your Profession/Field of Work] with [Your Education/Experience] in the field of [Your Main Profession/Field of Work]. I have [Your Age/Current Age] years old, [Your Gender] and I'm [Your Job Title]. I'm here to learn, grow, and [Your Main Profession/Field of Work]. I'm always looking for new experiences and opportunities to improve myself, and I'm looking forward to making a positive impact on the world through [Your Main Profession/Field of Work]. It's my dream to [Your Main Profession/Field of Work

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, and it is the largest city in both the country and Europe. 

A. True B. False

To determine whether the statement "The capital of France is Paris, and it is the largest city in both the country and Europe" is true or false, we will f
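
The asynchronous API is most useful when requests arrive independently, for example from a custom server. The sketch below shows a generic `asyncio` pattern for that case: issue one awaitable per prompt, cap the number in flight with a semaphore, and collect results in input order with `asyncio.gather`. `fake_async_generate` is a stand-in so the example runs without a model; passing a single prompt string to `llm.async_generate` is an assumption, since the cell above only shows the list form.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def fake_async_generate(prompt: str, sampling_params: Dict) -> Dict[str, str]:
    # Stand-in for an engine call; returns the same dict shape as the outputs above.
    await asyncio.sleep(0.01)
    return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(
    generate_fn: Callable[[str, Dict], Awaitable[Dict[str, str]]],
    prompts: List[str],
    sampling_params: Dict,
    max_in_flight: int = 2,
) -> List[Dict[str, str]]:
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt: str) -> Dict[str, str]:
        async with semaphore:  # at most max_in_flight requests run at once
            return await generate_fn(prompt, sampling_params)

    # gather returns results in the same order as the prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


results = asyncio.run(
    generate_concurrently(fake_async_generate, ["a", "b", "c"], {"temperature": 0.8})
)
for result in results:
    print(result["text"])
```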

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Sarah, and I'm a writer. I enjoy writing short stories and novels, and I spend a lot of time thinking about ideas and characters. I love to experiment with different genres and styles, and I'm always learning and growing. I'm also a great listener, and I enjoy meeting people and talking about my work with them. In short, I'm a creative writer who loves to explore the world of writing.

Any questions about my writing? Let me know! :)

---

**Sarah's Character Profile: A Curious Freelance Writer**

Hello, my name is Sarah, and I'm a freelance writer who enjoys writing short

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Does this next sentence follow, given the above sentence? "Paris has no capital city."

OPTIONS: 1. yes 2. it is not possible to tell 3. no 1. yes

The given sentence "Paris has no capital city" does not logically follow from the given information. The original statement is that Paris is the capital of France, which is accurate. Therefore, Paris does have a capital city. So, the correct option is "no". However, the original sentence is a factual statement that directly contradicts the statement in the second option. This answer would typically be rejected in answer choices if the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly speculative, but here are some possible trends that could shape the technology:

1. Increased accuracy and reliability: As AI continues to improve, it is likely to become even more accurate and reliable. This means that we can expect to see a range of applications for AI, from self-driving cars to personalized medicine to virtual assistants.

2. Automation of routine tasks: AI is already being used to automate routine tasks, from repetitive office work to customer service to administrative tasks. As AI becomes more sophisticated, we can expect to see a larger portion of these tasks be automated, freeing up more time for humans to focus on more complex and creative tasks.
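
Streaming output arrives as small text chunks, as the cell above prints them one at a time. When the full text is also needed afterwards, the chunks can be collected while they are consumed. The sketch below shows this with a plain async generator; `fake_stream` is a stand-in for a chunk source such as `async_stream_and_merge(llm, prompt, sampling_params)`, and `collect_stream` is a hypothetical helper name.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(prompt: str) -> AsyncIterator[str]:
    # Stand-in chunk source; a real stream would yield model output for `prompt`.
    for piece in [" Paris", ".", " It", " is", " large", "."]:
        await asyncio.sleep(0)  # give control back to the event loop
        yield piece


async def collect_stream(stream: AsyncIterator[str]) -> str:
    """Consume an async stream of text chunks and return the joined text."""
    parts: List[str] = []
    async for chunk in stream:
        parts.append(chunk)  # a real consumer could also print each chunk here
    return "".join(parts)


full_text = asyncio.run(collect_stream(fake_stream("The capital of France is")))
print(full_text)
```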






In [6]:
llm.shutdown()
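
If generation raises an exception, the final `llm.shutdown()` cell is never reached in a script. A context manager is one way to make sure the engine is always shut down. The sketch below uses a stand-in `FakeEngine` so it runs without a model; `managed_engine` is a hypothetical helper and not part of SGLang.

```python
from contextlib import contextmanager
from typing import Callable, Dict, Iterator, List


class FakeEngine:
    """Stand-in exposing the two methods used in this document."""

    def __init__(self) -> None:
        self.closed = False

    def generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict[str, str]]:
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self) -> None:
        self.closed = True


@contextmanager
def managed_engine(factory: Callable[[], FakeEngine]) -> Iterator[FakeEngine]:
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()  # runs even if the body raises


with managed_engine(FakeEngine) as engine:
    outputs = engine.generate(["Hello, my name is"], {"temperature": 0.8})

print(outputs[0]["text"], engine.closed)
```

With the real engine, the factory could be `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`, matching the launch cell earlier in this document.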