# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in the form of a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, Jupyter, or any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
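
The reason for this patch can be reproduced with the standard library alone. The sketch below is independent of SGLang (the names `inner` and `outer` are illustrative): it shows that `asyncio.run()` refuses to start while another event loop is already running in the same thread, which is the situation inside IPython or Jupyter. `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running event loop here, as in IPython/Jupyter.
    coro = inner()
    try:
        # Without nest_asyncio, a nested asyncio.run() raises RuntimeError.
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


result = asyncio.run(outer())
print(result)
```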

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Carissa, and I'm a lawyer. I have extensive experience with real estate, estate planning, and corporate law. I specialize in helping individuals and businesses navigate complex legal issues in the areas of tax, fiduciary duty, contract, and corporate structure. I am also committed to providing helpful, professional, and thoughtful legal advice. I strive to help my clients achieve their goals and best interests. How can Carissa assist individuals and businesses in navigating complex legal issues in the areas of tax, fiduciary duty, contract, and corporate structure? Carissa can help individuals and businesses navigate complex legal issues in the areas of tax, fid
Prompt: The president of the United States is
Generated text:  now trying to secure a deal for the partial cancellation of the debt ceiling.
That’s why the United States does a little bit of research before filing a bill, as is their custom.
Let’s go through the steps.
First, the Presi
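
In a batch job the results are usually stored rather than printed. The following is a minimal sketch of that step, assuming only what the loop above shows: each output is a dict with a `text` key. The helper `to_jsonl` is hypothetical, and `fake_outputs` is stand-in data that replaces the return value of `llm.generate` so the snippet runs without a model.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text, one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": p, "text": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Stand-in data shaped like the outputs above; not real model output.
demo_prompts = ["The capital of France is"]
fake_outputs = [{"text": " Paris."}]
jsonl = to_jsonl(demo_prompts, fake_outputs)
print(jsonl)
```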

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm passionate about [reason for interest in the industry]. I'm always looking for ways to [action or goal]. I'm a [reason for interest in the industry] and I'm always eager to learn and grow. I'm a [reason for interest in the industry] and I'm always eager to learn and grow. I'm a [reason for interest in the industry] and I'm always eager to learn and grow. I'm a [reason

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in Europe and the third-largest city in the world by population. The city is known for its rich history, art, and cuisine. It is also home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is a popular tourist destination and a major economic center in Europe. It is also a cultural hub for France and the world. The city is home to many important institutions such as the French Academy of Sciences and the French Parliament. It is a major center for research and innovation in science, technology, and culture. Paris is a city

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we interact with technology and the world around us. Here are some potential trends that are likely to shape the future of AI:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI technology continues to advance, we can expect to see even more widespread use of AI in healthcare, with more sophisticated algorithms and machine learning techniques being developed to improve diagnosis, treatment, and patient care.

2. Increased use of AI in finance: AI is already being used in finance to improve risk management, fraud
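
The banner above mentions overlap removal. If successive stream chunks can repeat text that was already emitted, one way to merge them is to drop the longest prefix of each new chunk that matches the end of the accumulated text. The sketch below illustrates that idea only; `trim_overlap` and `merge_chunks` are illustrative names, and this is not necessarily how `stream_and_merge` is implemented.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that already ends `existing`."""
    max_len = min(len(existing), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Made-up chunks with overlapping boundaries, for illustration only.
merged = merge_chunks(["Paris is", " is the capital", "capital of France"])
print(merged)  # Paris is the capital of France
```

A caveat of this approach: it cannot tell an accidental overlap from a genuine repetition, so a chunk that legitimately repeats the preceding characters would be shortened as well.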



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  ___________ and I am a/an ___________. My ___________ is located in __________. I am ___________ and I am ___________________. I believe in ____________. I am a/an ___________, I have ___________. My most significant achievement is ___________________. I am a/an ___________, and I have a/an ___________ job. If anyone would like to meet me, they can come to ___________. I'm a/an ___________. I'm ___________. 

Remember to use a neutral and appropriate tone throughout your introduction. It is important that you do not make any personal attacks or negative remarks about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the 19th-largest city in the world and is one of the world’s most populous cities, with a population of over 6 million people. It is the largest city in metropolitan France and one of the lar
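
Because the asynchronous call is awaited, it can share an event loop with other coroutines. The sketch below uses a hypothetical stand-in coroutine, `fake_async_generate`, in place of `llm.async_generate` so that it runs without a GPU. It shows that results collected with `asyncio.gather` come back in input order even when later requests finish first.

```python
import asyncio


async def fake_async_generate(prompt: str, delay: float) -> dict:
    # Stand-in for an engine call; returns a dict with a "text" key
    # shaped like the outputs above.
    await asyncio.sleep(delay)
    return {"text": f" <completion for: {prompt}>"}


async def run_all(prompts):
    # Later prompts get shorter delays, so they finish first,
    # yet gather() returns results in input order.
    tasks = [
        fake_async_generate(p, delay=0.03 - 0.01 * i) for i, p in enumerate(prompts)
    ]
    return await asyncio.gather(*tasks)


results = asyncio.run(run_all(["a", "b", "c"]))
print([r["text"] for r in results])
```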

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I am a young woman with a desire to pursue my passions. I am creative and ambitious, and I am constantly learning and growing. I love to travel, be outdoors, and explore new places. I am eager to connect with people and help them grow their own dreams. I believe that everyone has something special, and I want to help others discover their own paths to happiness and fulfillment. I am open to new experiences and different ways of living. Thank you for having me! Can you summarize the character's core values and beliefs? Based on the fictional character's self-introduction, it seems that the character values creativity,



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city of light and culture, where the Eiffel Tower stands as a testament to its beauty and significance. It is a bustling metropolis of over 2.5 million people and is known for its iconic landmarks, such as the Louvre Museum and Notre-Dame Cathedral. Paris is also a beloved destination for tourists, thanks to its rich history, beautiful architecture, and delicious cuisine. The city's culture, particularly its fashion and music scenes, are also celebrated throughout the world. Overall, Paris is a city of innovation, creativity, and cultural enrichment. Paris is located on the Île-de-France region and has



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to continue to evolve and develop in exciting ways. Here are some potential trends that are expected to shape the field in the coming years:

1. Increased integration of AI with other technologies: As AI becomes more prevalent in our daily lives, we can expect to see even more integration with other technologies such as voice recognition, natural language processing, and machine learning. This will likely lead to even more complex and diverse AI systems that can work alongside humans in a wide variety of applications.

2. Enhanced human-AI collaboration: As AI becomes more capable, we can expect to see increased collaboration between humans and AI. This could lead to more efficient




In [6]:
llm.shutdown()
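
In a standalone script it may be worth making sure that `shutdown()` runs even when generation raises. The sketch below shows one way to do that with a small context manager. `StubEngine` and `managed` are hypothetical names: the stub only mimics the `shutdown()` call shown above and a failing `generate`, so the snippet runs without a model.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for the engine; only mimics shutdown()."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        raise RuntimeError("simulated failure during generation")

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed(engine):
    try:
        yield engine
    finally:
        # Runs on both normal exit and exceptions.
        engine.shutdown()


engine = StubEngine()
try:
    with managed(engine) as llm_like:
        llm_like.generate(["hi"], {"temperature": 0.0})
except RuntimeError as exc:
    print(f"generation failed: {exc}")

print(engine.is_shut_down)  # True
```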