# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
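
If the same script has to work both in a notebook and as a plain Python process, the patch could be applied only when an event loop is already running. The following is a minimal sketch under that assumption; the helper names `running_inside_event_loop` and `apply_nest_asyncio_if_needed` are hypothetical and not part of SGLang or `nest_asyncio`.

```python
import asyncio


def running_inside_event_loop() -> bool:
    """Return True when the caller already runs inside an asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, e.g. a plain `python script.py` run.
        return False
    return True


def apply_nest_asyncio_if_needed() -> bool:
    """Patch asyncio only when a loop is already running; report whether we patched."""
    if not running_inside_event_loop():
        return False
    import nest_asyncio  # imported lazily so plain scripts do not require it

    nest_asyncio.apply()
    return True
```

Calling `apply_nest_asyncio_if_needed()` once at the top of a notebook or script would be enough in this sketch; applying the patch unconditionally, as shown above, remains the simpler option.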

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Alysia and I am a former high school senior, now a 24 year old widow with a child. I am really good at writing. I don't have a particular genre but I tend to write shorter stories, novellas, and sometimes non-fiction. I write in a variety of styles and are always looking for new and interesting stories to write. I am very good at identifying the main themes and the challenges of writing stories. I am also very good at editing and revising my stories so that they are polished. If you are an aspiring writer, I am willing to help you learn to write better stories. My favorite
Prompt: The president of the United States is
Generated text:  a post that comes with a long history. Initially, it was appointed by the head of the executive branch. In the modern era, the president is both appointed and elected, as opposed to the traditional system which required the head of the executive branch to be chosen by the legislature. The term of a president is t
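
For offline jobs it is often convenient to keep each prompt together with its completion, for example before writing results to a file. The sketch below assumes, as the cell above does, that `generate` returns one dictionary with a `"text"` key per prompt, in prompt order. `EchoEngine` and `run_batch` are hypothetical names; `EchoEngine` only stands in for `sgl.Engine` so the example runs without a GPU.

```python
from typing import Any, Dict, List


class EchoEngine:
    """Hypothetical stand-in for `sgl.Engine`, used only to make the sketch runnable."""

    def generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, Any]]:
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def run_batch(
    engine: Any, prompts: List[str], sampling_params: Dict[str, Any]
) -> List[Dict[str, str]]:
    """Generate once for the whole batch and pair each prompt with its text."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        raise ValueError("expected one output per prompt")
    return [{"prompt": p, "completion": o["text"]} for p, o in zip(prompts, outputs)]


records = run_batch(
    EchoEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
```

With a real engine, `run_batch(llm, prompts, sampling_params)` would be called the same way.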

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Occupation]. I'm a [Type of Vehicle] with [Number of Wheels] wheels. I'm [Favorite Color] and I love [Favorite Food]. I'm [Favorite Book] and I enjoy [Favorite Activity]. I'm [Favorite Movie] and I love [Favorite Music]. I'm [Favorite Sport] and I play [Favorite Game]. I'm [Favorite Animal] and I love [Favorite Pet]. I'm [Favorite Place] and I love [Favorite Thing to Do]. I'm [Favorite Person] and I admire [Favorite Person's Character]. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, a city with a rich history and diverse culture. It is the largest city in France and the second-largest city in the European Union, with a population of over 2.7 million people. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Palace of Versailles. It is also a popular tourist destination, with many visitors coming to explore its historical and cultural attractions. Paris is a city that has played a significant role in French history and continues to be a major cultural and economic center in the country. The

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more advanced, it is likely to become more integrated with human intelligence, allowing for more complex and nuanced interactions between humans and machines.

2. Greater emphasis on ethical considerations: As AI becomes more advanced, there will be a greater emphasis on ethical considerations, including issues such as bias, privacy, and accountability.

3. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI becomes more advanced, it is likely to be used in even more areas of healthcare, including diagnosis, treatment
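
The "overlap removal" mentioned in the cell above can be pictured as follows: when a new chunk starts with text that the merged output already ends with, only the unseen suffix is appended. This is an illustrative sketch of that idea, not necessarily how `stream_and_merge` is implemented; the function names here are assumptions.

```python
from typing import Iterable


def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of `new_chunk` that `existing` already ends with."""
    max_overlap = min(len(existing), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks, skipping text that repeats the current tail."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged
```

A caveat of this simple approach is that it cannot tell an accidental overlap from a genuine repetition: a chunk that legitimately repeats the current tail of the text would also be trimmed.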



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [Age] year old. I started playing the piano when I was [Age] and I've been playing since then. I love [What I like to play or play at home] and I love to [What I enjoy doing]. I'm always ready to learn new things and I'm always looking for ways to improve my skills. I enjoy spending time with my friends and playing music with them. I love the feeling of playing music and I'm really happy when I can share it with others. How would you describe your personality and what makes you tick? As a neutral self-introduction, I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in the European Union, and is known for its rich history, art, architecture, and cuisine. It is also a major hub for finance, technology, and luxury goods, as well as being the birthplace of several fa
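
Because `async_generate` is awaitable, several independent requests can be awaited together. The sketch below uses `asyncio.gather`, which returns results in the order the awaitables were passed. `FakeAsyncEngine` and `generate_groups` are hypothetical names; the fake engine only mimics the call shape used in the cell above so the example runs without a GPU.

```python
import asyncio
from typing import Any, Dict, List


class FakeAsyncEngine:
    """Hypothetical stand-in for `sgl.Engine`; it only mimics `async_generate`."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, Any]]:
        await asyncio.sleep(0)  # yield control, as a real awaitable call would
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def generate_groups(
    engine: Any, groups: List[List[str]], sampling_params: Dict[str, Any]
) -> List[List[Dict[str, Any]]]:
    """Await one `async_generate` call per prompt group, concurrently."""
    calls = [engine.async_generate(group, sampling_params) for group in groups]
    return await asyncio.gather(*calls)


groups = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(
    generate_groups(FakeAsyncEngine(), groups, {"temperature": 0.8, "top_p": 0.95})
)
```

Whether splitting prompts into groups helps in practice depends on the workload, since a single batched call already lets the engine schedule all prompts together.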

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a computer science student majoring in [Major]. I am currently [Year of Graduation] and I am currently [Favorite Hobby]. I am an [Age] year old person. I am looking forward to [What the Person Wants to Be/Do].
Sure! Here's a short, neutral self-introduction for a fictional character:

---

I'm [Your Name], a [Your Major] student majoring [Your Year of Graduation] at [Your University]. I'm an [Your Age] year old person, and I currently focus on [Your Favorite Hobby]. My journey as a [Your



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the world-famous city renowned for its rich history, famous art museums, and iconic landmarks. It is also known for its vibrant culture, delicious cuisine, and annual festivals. Paris is a popular tourist destination, known for its beauty, architecture, and world-class cuisine, making it a must-visit city for visitors to France. It is also a UNESCO World Heritage site and home to many famous landmarks, including the Eiffel Tower, Notre Dame Cathedral, and Montmartre. Additionally, Paris is known for its extensive French language schools and cultural institutions. It is a great place to experience France’s rich cultural heritage and enjoy



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain and largely dependent on a number of factors, including technological advancements, economic development, and societal shifts. Some possible future trends in AI include:

1. Increased AI integration into healthcare: With the increasing demand for precision medicine, AI is likely to play a key role in improving the accuracy and efficacy of medical diagnoses and treatments.

2. AI-driven autonomous vehicles: Autonomous vehicles have the potential to reduce accidents and improve safety on the roads, but they also raise ethical concerns about privacy and responsibility.

3. AI in manufacturing: AI is being used in manufacturing to optimize production processes, improve efficiency, and reduce waste, but it is also raising concerns
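
When the streamed chunks are needed as one string rather than printed as they arrive, they can be accumulated while iterating. The sketch below shows that pattern with a hypothetical `fake_stream` generator standing in for `async_stream_and_merge(llm, prompt, sampling_params)`; `collect_stream` is also an illustrative name, not an SGLang function.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(chunks: List[str]) -> AsyncIterator[str]:
    """Hypothetical chunk source standing in for the engine's streaming helper."""
    for chunk in chunks:
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield chunk


async def collect_stream(stream: AsyncIterator[str]) -> str:
    """Consume an async stream of text chunks and join them into one string."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


text = asyncio.run(
    collect_stream(fake_stream([" Paris", ",", " the", " world", "-f", "amous", " city"]))
)
```

Joining the chunks in arrival order reproduces the continuous text, which is why sub-word pieces such as `-f` and `amous` read as a single word once merged.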




In [6]:
llm.shutdown()
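
In a script, it may be useful to guarantee that `shutdown()` runs even when generation raises an exception. One possible pattern is a small context manager, sketched below. `engine_session` and `FakeEngine` are hypothetical names; the sketch only assumes that the engine exposes a `shutdown()` method, as used in the cell above.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in for `sgl.Engine` that records whether it was shut down."""

    def __init__(self) -> None:
        self.closed = False

    def shutdown(self) -> None:
        self.closed = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call `shutdown()` when the block exits."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with engine_session(engine) as llm:
        raise RuntimeError("generation failed")  # simulate an error mid-run
except RuntimeError:
    pass
```

With a real engine this would read `with engine_session(sgl.Engine(model_path=...)) as llm:`, followed by the generation calls inside the block.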