# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when a separate HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
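
To illustrate the general shape of a custom server handler, here is a minimal sketch. It does not use SGLang itself: `StubEngine` and `handle_request` are hypothetical names introduced for this example, and the stub only mimics the `async_generate(prompts, sampling_params)` call shape used later in this document. A real server would pass an `sgl.Engine` instance instead and wrap the handler in a web framework of your choice.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for an engine; returns placeholder completions."""

    async def async_generate(self, prompts, sampling_params):
        # Yield control once, as a real engine call would while it computes.
        await asyncio.sleep(0)
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def handle_request(engine, payload):
    """Turn a request payload into a list of prompt/completion records."""
    prompts = payload["prompts"]
    # Fall back to the sampling parameters used elsewhere in this document.
    sampling_params = payload.get(
        "sampling_params", {"temperature": 0.8, "top_p": 0.95}
    )
    outputs = await engine.async_generate(prompts, sampling_params)
    return [{"prompt": p, "text": o["text"]} for p, o in zip(prompts, outputs)]


response = asyncio.run(
    handle_request(StubEngine(), {"prompts": ["The capital of France is"]})
)
print(response)
```

Keeping the handler a plain coroutine that takes the engine as an argument makes it easy to test with a stub before loading a real model.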



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
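
The underlying restriction is general to `asyncio` and not specific to SGLang: a thread's running event loop cannot be re-entered with `asyncio.run`. The sketch below uses only the standard library to reproduce the error that patching with `nest_asyncio` is meant to avoid. The function names `inner` and `outer` are illustrative.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in IPython.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```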

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Gerald and I'm a professional chef in London. I have been working in the catering industry for over 25 years and I have experience of operating the business from an operating office, kitchen and production facilities. My experience has taken me to all the major cities in the UK and I have worked in lots of different locations and for lots of different types of restaurants. I have a keen interest in exploring new ways of cooking and experimenting with new cuisines and techniques. I have a keen eye for detail and I make sure that all the food that is cooked is of the highest quality and that the food is cooked as well as possible. I
Prompt: The president of the United States is
Generated text:  from which country?
What is the answer? (A). France
B. Russia
C. Canada
D. United Kingdom
D. United Kingdom
The answer is D. United Kingdom. The president of the United States is from the United Kingdom, as the U. S. president is the head of state and the

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm [age], [gender], and I have [number] years of experience in [industry]. I'm a [type of person] and I'm always looking for ways to [advantage]. I'm always eager to learn and grow, and I'm always willing to help others. I'm a [character trait] person and I'm always ready to make a difference. I'm a [character trait] person and I'm always ready to make

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic Eiffel Tower and vibrant cultural scene. It is also a major financial center and a popular tourist destination. Paris is home to many famous landmarks such as the Louvre Museum and Notre-Dame Cathedral. The city is also known for its rich history and cultural heritage, including the Notre-Dame Cathedral and the Louvre Museum. Paris is a major hub for international business and tourism, and is a popular destination for French and international visitors alike. The city is also home to many museums, including the Musée d'Orsay and the Musée Rodin. Paris is a city of contrasts,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies will continue to improve, leading to more advanced AI systems that can perform a wider range of tasks and solve more complex problems. AI will also continue to be integrated into various industries, from healthcare and finance to transportation and manufacturing, leading to increased efficiency and productivity. Additionally, AI will continue to be used for ethical and social reasons, such as in the development of autonomous vehicles and the design of policies that promote social equity and justice. Finally, AI will continue to be used for entertainment and leisure, with the development of more
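
The "overlap removal" mentioned in the banner above can be pictured with a small standalone sketch. This is an illustration of the idea only, not necessarily how `stream_and_merge` is implemented: `trim_overlap` and `merge_chunks` are names chosen for this example, and the sample chunks are made up. The assumption is that a streamed chunk may repeat the tail of the text received so far, so the repeated prefix is dropped before appending.

```python
def trim_overlap(existing: str, chunk: str) -> str:
    """Return the part of `chunk` that does not repeat the tail of `existing`."""
    max_len = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that overlaps with what came before."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Made-up chunks in which each one repeats part of the previous text.
chunks = [" Paris", " Paris is", " is the capital", " of France."]
print(merge_chunks(chunks))
```

One limitation of this simple approach: a chunk that legitimately repeats the preceding text (for example `"ha"` followed by `"ha"`) would also be trimmed.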



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name]. I'm a [major in college] majoring in [major], and I'm really excited to meet you. What can you tell me about yourself? I can tell that you're a [major in college] student and that you are very enthusiastic about your studies and the future. You are not only interested in your major, but also in other areas like [major]. What's your favorite subject to study, and why do you like it so much? I can tell that you have a keen interest in [major], and I am excited to learn more about it. What do you think makes a great teacher, and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city where the Eiffel Tower, Notre-Dame Cathedral, and Montmartre are located. The city is renowned for its beautiful architecture, rich history, and lively culture. Paris is a popular tourist destination, known for its m

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name] and I'm a [insert age, experience level, profession, etc.]. I grew up in [insert hometown or place of origin], where my parents always taught me the value of hard work and perseverance. I had always been interested in [insert relevant subject], but I never gave it much thought because I was too busy with my daily life. But now that I'm older, I've realized that I've gained a lot more knowledge and skills, and I want to help others find their own path. I'm excited to share my knowledge and experience with others, and I believe that by doing so, I can

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which has a population of over 2.5 million people and is known for its medieval architecture, elegant cafes, and vibrant arts scene. The city is also home to the Eiffel Tower and the Louvre Museum, one of the world's most famous and iconic landmarks. Paris is a lively and diverse city with a rich cultural and artistic heritage, attracting visitors from all over the world. The city is also known for its fashion industry, which is home to high-end boutiques and fashion designers. It's a city with a rich history and a lively atmosphere, making it a popular destination for tourists and locals alike. The

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid innovation, increased use in diverse fields, and continued development in areas like language processing, decision-making, and cybersecurity. Here are some of the most likely trends we can expect to see in the near and long term:

1. Increased automation and integration: As AI technology continues to improve, we can expect to see more automation and integration of AI into various industries. This could lead to the automation of certain tasks, as well as the development of new types of AI that can perform previously impossible tasks. This could have a significant impact on employment and job markets, as well as on the development of new industries.

2

```python
llm.shutdown()
```
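
In a standalone script, it can help to guarantee that `shutdown()` runs even if generation raises an exception. The sketch below shows one way to do that with a context manager. `managed_engine` and `StubEngine` are hypothetical names for this example, and the stub only imitates the `generate` and `shutdown` calls used in this document; with SGLang you would pass a factory such as `lambda: sgl.Engine(model_path=...)` instead.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in with the two methods this sketch relies on."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for {p!r}>"} for p in prompts]

    def shutdown(self):
        self.closed = True


@contextmanager
def managed_engine(factory):
    """Create an engine and always shut it down when the block exits."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


with managed_engine(StubEngine) as engine:
    outputs = engine.generate(["Hello, my name is"], {"temperature": 0.8})

print(engine.closed)
```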