# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
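
For context, the need for `nest_asyncio` follows from a standard-library restriction: `asyncio.run()` refuses to start when an event loop is already running in the current thread, which is the situation inside IPython or Jupyter. The sketch below reproduces that restriction with plain `asyncio` only. The names `inner` and `outer` are hypothetical and no SGLang code is involved, so treat it as an illustration of the underlying error rather than of the engine's internals.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # At this point an event loop is already running, as it would be in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` first patches `asyncio` so that such nested calls are permitted.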

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Paul. I work for the local government and manage my colleagues at the council. I am responsible for the maintenance, repairs and management of all the council buildings. I also manage the health and safety of council staff. I have worked in public administration for many years and I have received various training and qualifications in health and safety and building maintenance. How can I be a good person to work with in this role at the council?
As someone who has worked in public administration for many years and has received various training and qualifications in health and safety and building maintenance, I can say that it is important to always prioritize safety and to demonstrate a strong work
Prompt: The president of the United States is
Generated text:  visiting a small town in need. The president has a budget of $500 for a gift. Each gift costs $20. If the president wants to buy a gift for each of 50 people in the town, how many gifts 
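
The `sampling_params` dictionary above sets `temperature` and `top_p`. As a conceptual aid, the sketch below applies the textbook definitions of both knobs to a made-up four-token distribution: temperature divides the logits before the softmax, and top-p keeps the smallest set of most likely tokens whose cumulative probability reaches the threshold. The function names and the toy logits are assumptions for illustration; this is not SGLang's sampler.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
probs = apply_temperature(toy_logits, temperature=0.8)
nucleus = top_p_filter(probs, top_p=0.95)
print(sorted(nucleus))
```

With these toy values, the three most likely tokens already cover more than 0.95 of the probability mass, so the least likely token is excluded from sampling. Lower temperatures sharpen the distribution and shrink the kept set further.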

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French National Library. Paris is a cultural and historical center with a rich history dating back to ancient times. It is also a major economic and financial center, with a thriving fashion industry and a large number of international companies. The city is known for its cuisine, including its famous French cuisine, and its annual Eiffel Tower Festival. Paris is a popular tourist destination, with millions of visitors each year. It is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased focus on ethical AI: As more people become aware of the potential risks of AI, there is a growing emphasis on developing AI that is designed to be ethical and responsible. This could mean developing AI that is designed to minimize harm to individuals, or that is designed to be transparent and accountable.

2. Integration of AI with other technologies: AI is likely to become more integrated with other technologies, such as the Internet of Things (IoT) and the Internet of Things (IoT).
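
"Overlap removal" in the example above refers to a streaming concern: if a newly received chunk repeats text that was already emitted, naive concatenation duplicates it. The sketch below shows one simple way to handle this, by dropping the longest prefix of the new chunk that already ends the merged text. It only illustrates the idea and is not claimed to be how `stream_and_merge` is implemented; `strip_overlap` and `merge_chunks` are hypothetical helper names, and the chunks are made up.

```python
def strip_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


chunks = [" Paris, the city", " the city known for", " known for its landmarks"]
merged = merge_chunks(chunks)
print(merged)
```

A caveat of this heuristic: it cannot tell an accidental repeat from an intentional one, so text that legitimately repeats at a chunk boundary would also be trimmed.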



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Jane, and I'm a writer from Los Angeles. I'm a bit of an introvert by nature, and I prefer to work in my own space, perhaps in my cozy home studio. I love to create, and I often find myself lost in my own mind while writing, even when I'm in a rush. I think my unique style is perfect for the creative space, and I'm always looking for new ways to express myself through writing. Thank you for the opportunity to share a little bit of myself with you. What kind of projects do you work on most often? As an introvert, my writing often leads to more

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the historical and cultural center of the country. 

This statement is factual and provides the most current information about Paris, including its name, historical context, and cultural significance. It is concise a
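
One reason to prefer the asynchronous API is that other coroutines can make progress while a generation call is pending. The sketch below shows the general `asyncio.gather` pattern for awaiting several requests at once and receiving the results in input order. To keep it runnable without a model, `fake_async_generate` is a hypothetical stand-in for `llm.async_generate`, returning a dictionary with a `text` key like the outputs above. Whether issuing separate overlapping calls helps with a real engine depends on your setup; the example above passes the whole prompt list in a single call.

```python
import asyncio


async def fake_async_generate(prompt, sampling_params):
    """Hypothetical stand-in for `await llm.async_generate(prompt, sampling_params)`."""
    await asyncio.sleep(0.01)  # simulate waiting on the engine
    return {"text": f" <completion for {prompt!r}>"}


async def generate_concurrently(prompts, sampling_params):
    # gather() schedules all requests at once and returns results in input order.
    requests = [fake_async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*requests)


demo_prompts = ["Hello, my name is", "The capital of France is"]
results = asyncio.run(
    generate_concurrently(demo_prompts, {"temperature": 0.8, "top_p": 0.95})
)
for prompt, output in zip(demo_prompts, results):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```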

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  _____. I come from the world of _____, and I am a _____. I'm a ___ student at the ___ University.

And what do you think of me? How would you describe my personality?

And what is your favorite part of your job?

I look forward to meeting you, and I hope we can have a great conversation.

I'd love to hear from you!

(If you have any questions or you want to tell me more about your character, feel free to ask! )

Please let me know if you have any questions.

Thank you for taking the time to meet me. I'm

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the city that serves as the country’s official capital, seat of government and cultural and political center. Paris is known as the City of Love and was the birthplace of many influential figures such as the Enlightenment philosophers and writers, the French Revolution leaders, and the painter Eugène Delacroix. The city is renowned for its rich history, art, culture, and cuisine, and it is one of the most popular tourist destinations in the world. Paris is a bustling and cosmopolitan city, with a diverse population of over 2 million people and over 500, 000 tourist attractions, including the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by continued development and innovation, with many new technologies and applications emerging. Here are some possible trends that could shape the future of AI:

1. Increased transparency: AI systems are likely to become more transparent, allowing users to see how the system arrived at its decisions. This would be particularly important in areas like healthcare and finance, where trust is crucial.

2. Improved accuracy: AI is likely to get even more accurate, thanks to advances in machine learning and deep learning. This could lead to more precise and reliable predictions, applications, and solutions.

3. Customization: AI systems are likely to become more customizable, allowing


```python
llm.shutdown()
```