# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
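The reason a patch is needed can be seen with the standard library alone. The sketch below is only an illustration, not part of SGLang: it shows that `asyncio.run()` refuses to start when an event loop is already running, which is the situation inside IPython or Jupyter. The names `inner` and `outer` are hypothetical.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: plain asyncio rejects this
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.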

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Aisha. I'm 15 years old. I live in Paris. My favorite activity is reading. I love stories. I like to imagine the characters and how they would feel. My favorite place is a great bookshop. I like to browse and find new books to read. I like reading about science fiction. I also like to read about fantasy, and I like it because there are more adventures. I enjoy reading a book that has a good plot. I love meeting characters in a book. I think that's one of the best parts about reading. Reading books is relaxing for me, but it is also exciting and dangerous
Prompt: The president of the United States is
Generated text:  trying to decide whether to continue using the 100-dollar bill or the $1 bill. The president estimates that the 100-dollar bill is worth $10,000,000, while the $1 bill is worth $10,000. If the president decides to use the $1 bill instead of the 100-dollar bill, how much money will he save? To determine how much money the president 
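The `temperature` and `top_p` entries in `sampling_params` control how each next token is drawn. The following is a minimal, generic sketch of temperature scaling followed by nucleus (top-p) filtering. It is an assumption-laden illustration of the general technique, not SGLang's sampler, and `sample_top_p` and `toy_logits` are hypothetical names.

```python
import math
import random


def sample_top_p(logits, temperature=0.8, top_p=0.95, rng=None):
    """Pick one token index using temperature scaling and nucleus (top-p) filtering."""
    rng = rng or random.Random()
    # Temperature scaling: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most likely tokens whose cumulative
    # probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # random.choices renormalizes the weights of the kept tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -3.0]  # made-up scores for a 4-token vocabulary
rng = random.Random(0)
samples = [sample_top_p(toy_logits, rng=rng) for _ in range(1000)]
print({i: samples.count(i) for i in range(len(toy_logits))})
```

With these toy values the first three tokens already cover more than 95% of the probability mass, so index 3 is filtered out and never drawn. Lowering `temperature` or `top_p` narrows the choice further, which is why the streaming examples with `temperature` 0.2 read as more repetitive and predictable.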

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your profession or role]. I enjoy [insert a short description of your hobbies or interests]. I'm always looking for new challenges and opportunities to grow and learn. What's your favorite hobby or activity? I'm always up for a challenge and love to try new things. What's your favorite book or movie? I love to read and watch movies, but I also enjoy trying new things. What's your favorite place

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic Eiffel Tower, Notre-Dame Cathedral, and vibrant French culture. 

This statement encapsulates the key points about Paris, including its historical significance, notable landmarks, and cultural attractions. It provides a brief overview of the city's importance and appeal to visitors. 

To ensure accuracy, it should be updated to reflect any recent developments or changes in Paris' status as the capital. For example, if Paris has been renamed to "Lyon" in the past, this would be reflected in the statement. 

The statement should be clear and concise, allowing readers to quickly grasp the essence of Paris' importance

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more advanced, it is likely to become more integrated with human intelligence, allowing it to learn and adapt to new situations and tasks. This could lead to more complex and sophisticated AI systems that can perform tasks that are currently beyond the capabilities of humans.

2. Enhanced privacy and security: As AI becomes more integrated with human intelligence, there will be increased concerns about privacy and security. AI systems will need to be designed with privacy and security in mind, and there will be a need for robust privacy and security measures to protect user data.

3.
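The "overlap removal" mentioned in the cell above can be pictured as follows. This is a minimal sketch of one way to merge streamed chunks whose beginnings may repeat the end of the text received so far. It is not claimed to be how `stream_and_merge` is implemented, and `trim_overlap_sketch` and `merge_chunks` are hypothetical names.

```python
def trim_overlap_sketch(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that repeats the end of existing."""
    max_len = min(len(existing), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, removing overlap between consecutive pieces."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


merged = merge_chunks(["Hello wor", "world", "ld!"])
print(merged)
```

One caveat of this simple approach: a chunk that legitimately starts with the same characters the text ends with (for example a repeated letter) would also be trimmed, so it only suits streams that are known to resend overlapping text.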



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [specific occupation or career]. I enjoy [the thing that you would find interesting about yourself]. What is your profession or field of work? [Name] enjoys [reason for interest] and spends most of their time [specific activities or tasks]. I'm [job role] and I'm excited to learn more about you. What are your hobbies or interests outside of work? [Name] is [specific hobby or interest] and loves [explanation for why this is a hobby or interest]. I'm glad to meet you, and I hope to have a great conversation with you. [Name] [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its historical and cultural landmarks, beautiful beaches, and modern architecture. 

Paris is France's largest city and the heart of the French Riviera, famous for its iconic landmarks such as Notre-Dame Cath
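An awaitable API makes it possible to keep several batches in flight from one coroutine. The sketch below shows that pattern with `asyncio.gather`. To stay runnable without a model or GPU it uses `StubEngine`, a hypothetical stand-in that only mimics the call shape used above (a list of prompts in, a list of dicts with a `"text"` key out). Whether concurrent calls improve throughput on a real engine is not shown here.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in so the sketch runs without a model or GPU."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0.01)  # pretend to do some work
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Start every batch, then wait for all of them; results keep the input order.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(
    run_batches(StubEngine(), batches, {"temperature": 0.8, "top_p": 0.95})
)
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(prompt, "->", output["text"])
```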

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name] and I am a [Age] year old girl who is passionate about [Your Interest]. I am an [Occupation] who loves [Your Hobby/Interest]. I am always ready to share my experiences, thoughts, and feelings with anyone who is interested in listening. I am confident, independent, and always up for learning and exploring new things. My infectious enthusiasm for [Your Interest] and my love for [Your Hobby/Interest] make me a great listener and friend. I hope to have the opportunity to connect with more people like myself and make new friends. [Your Name] [Your Address] [Your


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in Europe by population and the seat of the French government, where the French Parliament meets. The city is known for its architecture, museums, and world-renowned museums like the Louvre and the Évry-Montmartre district. Paris also has a rich cultural history, with many well-known artists, writers, and musicians. It has a major airport and is located on the French Riviera, which offers excellent weather and easy access to tourist attractions. The city is known for its beauty, from the majestic Eiffel Tower to the narrow streets and historic architecture of the 19th and


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be characterized by a rapid expansion of its applications and capabilities, with a number of potential trends shaping the course of AI development. Here are some of the key trends that are likely to be significant in the near future:

1. Increased Integration with Other Technologies: AI is already being integrated with other technologies, such as machine learning, natural language processing, and computer vision. It is expected that this integration will continue, with more and more applications being developed that leverage these technologies.

2. Enhanced Transparency and Explainability: As AI systems become more complex, it is becoming increasingly important for them to be transparent and explainable. This will require
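
The chunk-by-chunk arrival of the text above follows from the `async for` consumption pattern. The sketch below reproduces that pattern with a hypothetical `fake_token_stream` async generator in place of a real engine, so it runs anywhere. It makes no claim about how SGLang produces its chunks.

```python
import asyncio


async def fake_token_stream(text, delay=0.0):
    """Hypothetical stand-in for a streaming call: yields one word-sized chunk at a time."""
    for word in text.split(" "):
        await asyncio.sleep(delay)  # simulate waiting for the next token
        yield word + " "


async def collect(stream):
    """Print chunks as they arrive and return the full text at the end."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()
    return "".join(parts)


text = asyncio.run(collect(fake_token_stream("Paris is the capital of France.")))
```

Printing with `end=""` and `flush=True`, as the cell above also does, shows each chunk immediately instead of waiting for a full line.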




```python
llm.shutdown()
```
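
In a script, it can help to guarantee that `shutdown()` runs even when generation raises. One possible pattern is a small context manager, sketched below. `engine_session` and `StubEngine` are hypothetical names introduced for illustration. The stub only records whether `shutdown()` was called, so the example runs without a model.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for an engine; it only records whether shutdown ran."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params=None):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call shutdown() afterwards, even on errors."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with engine_session(engine) as llm_stub:
        llm_stub.generate([])  # raises ValueError inside the session
except ValueError as exc:
    print("generation failed:", exc)

print("shut down:", engine.is_shut_down)
```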