# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
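
To see the failure that `nest_asyncio.apply()` works around, the following minimal sketch uses only the standard library. It calls `asyncio.run` while a loop is already running, which is roughly the situation inside IPython or Jupyter. The names `inner` and `outer` are illustrative and are not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # A loop is already running here, much like inside IPython or Jupyter.
    coro = inner()
    try:
        asyncio.run(coro)  # starting a second loop from a running loop fails
    except RuntimeError as exc:
        coro.close()  # close the unused coroutine to avoid a "never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio.apply()` called beforehand, nested calls like this are permitted, which is why the patch is needed in notebook environments.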

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Anh, and I'm a 35-year-old psychologist with a passion for mental health and wellness. I have a bachelor's degree in psychology from California State University, Long Beach, and a master's degree in psychology from the University of Texas at Austin. I have been practicing as a psychologist for over ten years and have worked with individuals of all ages and backgrounds. I am dedicated to using evidence-based practices and research to help people feel more in control of their lives. I enjoy incorporating mindfulness and other self-care techniques into my practice. My books, "The Psychology of ADHD" and "The Human Heart," are published by Av
Prompt: The president of the United States is
Generated text:  30 years older than the president of Brazil, and the president of Brazil is half the age of the president of France. If the president of France is currently 300 years old, what will be the president of France's age in 5 years?

To determine the pr
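
Batch results are often saved for later inspection. The sketch below pairs each prompt with its generated text and serializes the pairs as JSON Lines. It relies only on the fact, visible in the cell above, that each output is a dict with a `"text"` key. The helper name `to_jsonl`, the record field names, and the demo data are assumptions made for illustration.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text, one JSON object per line."""
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
demo_prompts = ["The capital of France is"]
demo_outputs = [{"text": " Paris."}]

jsonl = to_jsonl(demo_prompts, demo_outputs)
print(jsonl)
```

In a real run you would pass the `prompts` and `outputs` variables from the cell above instead of the demo data.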

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [Age] year old [Gender] [Occupation]. I am a [Occupation] who has been [Number of Years] years in the industry. I have a passion for [Industry] and I am always looking for ways to [Industry] myself. I am a [Industry] expert and I have a [Number of Projects] project in the industry. I am a [Industry] expert and I have a [Number of Projects] project in the industry. I am a [Industry] expert and I have a [Number of Projects] project in the industry. I am a [Industry

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also home to the French Parliament, the French Academy of Sciences, and the French Quarter. Paris is a cultural and historical center with a rich history dating back to the Roman Empire and the French Revolution. It is a major transportation hub and a major tourist destination, with its famous landmarks and museums attracting millions of visitors annually. Paris is also known for its cuisine, with its famous dishes such as croissants, pastries, and cheese. The city is a major economic and financial center, with its numerous

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn and adapt to human behavior and preferences. This could lead to more natural and intuitive interactions between humans and machines.

2. Enhanced privacy and security: As AI becomes more integrated with human intelligence, there will be increased concerns about privacy and security. There will be a need for more robust privacy and security measures to protect against potential misuse of AI technology.

3. Greater automation and efficiency: AI is likely to become more integrated with human intelligence, leading to greater automation and efficiency in various industries
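
The "overlap removal" mentioned above can be pictured as follows: before appending a new chunk, drop any prefix of the chunk that already appears at the end of the accumulated text. This is a sketch of the idea only. The helper names `trim_overlap` and `merge_chunks` and the sample chunks are assumptions, and the actual `stream_and_merge` in `sglang.utils` may be implemented differently.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while removing overlap between neighbours."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks that partially repeat the text emitted so far.
merged = merge_chunks(["The capital", " capital of", " of France"])
print(merged)
```

One limitation of this approach is that it cannot tell accidental overlap from genuine repetition, so text that legitimately repeats across a chunk boundary would also be trimmed.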



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [occupation]!

I'm constantly learning new things and exploring new experiences, which is why I find myself always on the go. Whether I'm running errands or taking a break at the gym, I'm always trying to stay sharp and adapt to whatever the day throws at me.

I'm very happy to share that I'm a fan of [substance], and I've been using it for a couple of years now, so I have a strong understanding of its effects and how to use it effectively.

I've been involved in various activities and interests, including [list of interests], and I'm

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

Options:
A). yes
B). no

A). yes
You are a helpful assistant with no show] [Format your answer as a question. The question should be specific enough to be considered a factual statement, but not so specifi
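
The benefit of the asynchronous call is that other coroutines can make progress while generation is pending. The sketch below illustrates this with a hypothetical `StubEngine` that sleeps instead of running a model, so it runs without a GPU. Only the `async_generate(prompts, sampling_params)` call shape and the `"text"` key are taken from the cell above; everything else is an assumption for illustration.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; it sleeps instead of running a model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.2)  # pretend the model is busy
        return [{"text": f" (stub completion for {p!r})"} for p in prompts]


async def heartbeat(events, ticks=3):
    """Other work that keeps running while generation is pending."""
    for i in range(ticks):
        events.append(f"tick {i}")
        await asyncio.sleep(0.01)


async def run_demo():
    events = []
    engine = StubEngine()
    # Run the generation call and the heartbeat concurrently on one event loop.
    outputs, _ = await asyncio.gather(
        engine.async_generate(["Hello, my name is"], {"temperature": 0.8}),
        heartbeat(events),
    )
    events.append("generation finished")
    return outputs, events


outputs, events = asyncio.run(run_demo())
print(events)
```

The same pattern applies when the engine is embedded in a custom server, where request handling must stay responsive while a batch is being generated.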

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], I am 35 years old, I am a [profession]. I'm from [location]. I speak [language].
As an AI language model, I don't have personal experiences or emotions, but I can create a short and neutral self-introduction for any fictional character. The character could be a person, a team, a product, a concept, or a topic. Let me know if you have a specific character in mind, and I can tailor my response accordingly. If you're not sure, just go ahead and share the character you'd like me to introduce, and I'll be happy to write a short


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, often referred to as the "City of Light" due to its rich cultural heritage and vibrant atmosphere. Paris is the eighth-largest city in Europe and is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major economic and political center in France and plays a significant role in the country's identity and culture. Paris is an important city in French society, hosting major events such as the Elysee Palace and the World Cup. Its status as the capital is recognized internationally and its influence extends beyond the city to other parts of France. Paris has a



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  full of possibilities and possibilities abound. Here are some potential trends in AI that we can expect to see in the coming years:

1. Increased accuracy: With the advent of machine learning and deep learning, the accuracy of AI systems is expected to improve. This means that AI will become more capable of performing tasks that were previously considered impossible, such as diagnosing diseases and predicting stock prices.

2. Integration with natural language processing: AI systems will become more integrated with natural language processing, allowing them to understand and interpret human language in new and exciting ways. This could lead to more sophisticated chatbots, virtual assistants, and even language translation systems.

In [6]:
llm.shutdown()
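
In a script, an exception raised before `shutdown()` would skip the cleanup. One way to guard against that is a small context manager that always calls `shutdown()` on exit. The sketch below demonstrates the pattern with a hypothetical `StubEngine`; the helper `engine_session` is an assumption for illustration, not an SGLang API.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing the same generate/shutdown call shape."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": " (stub)"} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    """Yield the engine and call shutdown() even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with engine_session(engine) as session:
        session.generate(["Hello"], {})
        raise ValueError("simulated failure")
except ValueError:
    pass

print(engine.is_shut_down)
```

A plain `try`/`finally` around the generation code achieves the same effect if you prefer not to add a helper.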