# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
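
As a rough illustration of the second use case, the sketch below shows how a request handler might wrap an engine. It is not taken from the linked `custom_server.py`. `handle_generate_request` and `StubEngine` are hypothetical names, and the JSON request shape is an assumption. The only engine behavior relied on is the `generate(prompts, sampling_params)` call used later in this document, which returns one dict with a `"text"` key per prompt.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params):
        # Mimic the output shape used in this document: one dict per prompt.
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


def handle_generate_request(engine, body: bytes) -> bytes:
    """Decode a JSON request, run the engine, and encode a JSON response."""
    try:
        payload = json.loads(body)
        prompts = payload["prompts"]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a missing key, or a non-object payload.
        return json.dumps({"error": "expected JSON with a 'prompts' list"}).encode()

    sampling_params = payload.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    return json.dumps({"texts": [out["text"] for out in outputs]}).encode()


response = handle_generate_request(
    StubEngine(), b'{"prompts": ["The capital of France is"]}'
)
print(response.decode())
```

Keeping the handler independent of any web framework means the same function could sit behind whichever server library you prefer, with the stub replaced by a real engine instance.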



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
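
If the same script has to run both as a plain Python program and inside a notebook, one option is to apply the patch only when an event loop is already running. The sketch below uses the standard-library `asyncio.get_running_loop()` for the check. `running_inside_event_loop` is a helper name chosen for this example, not part of SGLang.

```python
import asyncio


def running_inside_event_loop() -> bool:
    """Return True when the caller is executing inside a running event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False
    return True


if running_inside_event_loop():
    # Typically only reached in environments such as IPython or Jupyter.
    import nest_asyncio

    nest_asyncio.apply()
```

In a regular script the condition is false, so `nest_asyncio` is never imported and need not be installed.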

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Nicholas and I am a filmmaker based in Birmingham, England. My work focuses on the exploration of everyday experiences and the emergence of a language that can describe them. While engaging with the visual and verbal aspects of the experience, my aim is to produce work that is emotionally resonant, socially relevant and economically viable. I take my inspiration from a multitude of sources, and believe that it is important to consider the complexities of the human experience in order to produce work that is engaging and thought-provoking.

I have been working with social media platforms like Facebook and Twitter for over a decade, but I believe that online sharing can be a powerful tool
Prompt: The president of the United States is
Generated text:  seeking to appoint a new Secretary of Education. Each of the eight schools in the country has proposed five candidates, and the president must choose five of them to interview. In how many different
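
For offline batch jobs it is often convenient to persist results rather than print them. The sketch below writes prompt/completion pairs as JSON Lines. It assumes only what the cell above shows, namely that each output is a dict with a `"text"` key. `write_jsonl` and the example data are hypothetical.

```python
import io
import json


def write_jsonl(prompts, outputs, fp):
    """Write one JSON object per prompt/completion pair and return the count."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(prompts)


# Hypothetical data with the same shape as the list returned by llm.generate.
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]

buffer = io.StringIO()  # swap for open("results.jsonl", "w", encoding="utf-8")
count = write_jsonl(example_prompts, example_outputs, buffer)
print(count, buffer.getvalue(), end="")
```

With a real engine, `example_outputs` would be replaced by the return value of `llm.generate(prompts, sampling_params)`.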

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Occupation]. I'm a [Skill/Ability] who has been [Number of Years] years in the [Field/Industry] industry. I'm passionate about [Why I love my job]. I'm always looking for ways to [What I'm trying to improve]. I'm a [What I'm good at]. I'm [What I'm looking forward to doing next]. I'm [What I'm looking forward to doing next]. I'm [What I'm looking forward to doing next]. I'm [What I'm looking forward to doing next]. I'm

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous world-renowned museums, theaters, and art galleries. Paris is a popular tourist destination, attracting millions of visitors each year. The city is also known for its rich history, including the influence of French colonialism and the influence of various European cultures. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. The city is a major hub for international business and trade, with many international corporations and institutions

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way that AI is used and developed. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a growing emphasis on ethical considerations. This will include issues such as bias, transparency, accountability, and the impact of AI on society.

2. Greater use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI becomes more advanced, we may see even more widespread use of AI
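
The cell above describes `stream_and_merge` as streaming "with overlap removal". One way such merging could work, assuming successive chunks may repeat text that was already received, is sketched below. This is an illustration only and is not claimed to match the implementation in `sglang.utils`. `trim_overlap` and `merge_chunks` are names chosen for this sketch.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first so the maximum repeat is removed.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunk sequence in which neighbouring chunks repeat some text.
print(merge_chunks([" Paris is", " is the capital", " capital of France."]))
```

A purely textual check like this cannot distinguish a repeated chunk boundary from a model that legitimately emits the same words twice in a row, which is a trade-off to keep in mind.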



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I'm a professional writer and editor with over 10 years of experience. I have a degree in creative writing from [Your University], and have published my work in numerous magazines and anthologies. I'm always looking for new ideas and creative challenges to push myself. I'm a great collaborator and have a talent for bringing ideas to life. Thank you! [Your Name] 1. [Your Name]: Welcome, [Recipient's Name]! I'm thrilled to have you as my editor. I've been looking to get your feedback on my latest work and I'm excited to hear your thoughts on it

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It also has a rich cultural history and is a major economic center, with a thriving transportation ne
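
Because `async_generate` is awaitable, several batches can be in flight at once. The sketch below caps the number of concurrent batches with a semaphore. `StubAsyncEngine` is a hypothetical stand-in that echoes prompts, and `generate_in_chunks`, `chunk_size`, and `max_in_flight` are names invented for this example. The only engine call assumed is `async_generate(prompts, sampling_params)` with a list of prompts, as in the cell above.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for the engine; echoes prompts instead of generating."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real call would while waiting
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def generate_in_chunks(engine, prompts, sampling_params, chunk_size=2, max_in_flight=2):
    """Submit prompts in chunks, with at most max_in_flight chunks awaited at once."""
    limiter = asyncio.Semaphore(max_in_flight)

    async def run_chunk(chunk):
        async with limiter:
            return await engine.async_generate(chunk, sampling_params)

    chunks = [prompts[i : i + chunk_size] for i in range(0, len(prompts), chunk_size)]
    results = await asyncio.gather(*(run_chunk(c) for c in chunks))
    # gather preserves submission order, so flattening keeps outputs aligned with prompts.
    return [output for chunk_outputs in results for output in chunk_outputs]


demo_prompts = ["a", "b", "c", "d", "e"]
demo_outputs = asyncio.run(
    generate_in_chunks(StubAsyncEngine(), demo_prompts, {"temperature": 0.8})
)
print(len(demo_outputs))
```

Whether chunking like this helps depends on the workload, since the engine already schedules batched requests itself. The pattern is mainly useful when prompts arrive from several independent producers.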

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm [Your Age]. I'm [Your Profession] and I'm excited to meet you all. As you can see, I'm quite simple and straightforward, and I strive to keep things simple and straightforward. I'm an avid reader, and I enjoy exploring different cultures and learning about the world. I'm always looking for new ways to make the world a better place, and I'm happy to share my thoughts and ideas with you. I'm excited to meet you and to help you on your journey to understanding and appreciating the world around us. Thanks for taking the time to meet me. [Your



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its rich history, beautiful architecture, and vibrant cultural scene. It is the largest city in the country and the seat of government for the French Republic. The city is famous for its iconic landmarks such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral, which are UNESCO World Heritage sites. Paris is also a major hub for French culture and cuisine, hosting world-renowned landmarks such as the Moulin Rouge, the Champs-Élysées, and the Montmartre neighborhood. The city is also home to numerous cultural institutions and festivals, including the famous Opera House and the annual



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by many factors, including advances in hardware, software, and data, as well as changes in how AI is used and integrated into society. Here are some potential future trends in AI that are currently being discussed and debated:

1. Greater emphasis on ethical and responsible AI: With more and more AI systems becoming involved in decision-making, there is a growing emphasis on ensuring that AI systems are fair, transparent, and accountable. This could involve developing more ethical guidelines and standards for AI development and deployment, as well as more rigorous testing and evaluation processes.

2. Increased use of AI in healthcare: As AI becomes more integrated into
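
When streaming asynchronously you may want both the incremental display and the final text. The sketch below consumes any async iterator of string chunks, echoes each chunk, and returns the concatenation. `fake_stream` is a hypothetical stand-in for a real chunk stream such as the one returned by `async_stream_and_merge(llm, prompt, sampling_params)` in the cell above, and `print_and_collect` is a name chosen for this example.

```python
import asyncio


async def fake_stream(text_chunks):
    """Hypothetical async chunk source used in place of a real engine stream."""
    for chunk in text_chunks:
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield chunk


async def print_and_collect(stream) -> str:
    """Echo each chunk as it arrives and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = asyncio.run(print_and_collect(fake_stream([" Paris", ",", " known", " for"])))
```

Collecting chunks in a list and joining once at the end avoids repeatedly rebuilding a growing string.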




In [6]:
llm.shutdown()
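
In a script, as opposed to a notebook, an exception raised during generation would skip a trailing `shutdown()` call. A minimal sketch of one way to guard against that is a context manager that always shuts the engine down. `engine_session` and `StubEngine` are hypothetical names. The only engine method assumed is `shutdown()`, as used above.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(make_engine):
    """Create an engine, yield it, and always call shutdown() afterwards."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


with engine_session(StubEngine) as engine:
    pass  # call engine.generate(...) here

print(engine.closed)
```

With a real engine, the factory argument could be something like `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`, mirroring the launch cell at the top of this section.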