# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two common use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed, working example is available as a Python script in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
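To see why this is needed, the following minimal sketch uses only the standard library (no SGLang) to reproduce the error that stock `asyncio` raises when `asyncio.run` is called from inside an already running event loop, which is the situation in IPython or Jupyter. The names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # a nested run is rejected by stock asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` patches `asyncio` so that such nested calls are allowed.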

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Calvin. I'm a student at the University of Colorado Boulder, studying computer science. I'm a strong believer in open-source and software as a service (SaaS) models. I'm currently working on a project for a class that involves using a new framework to build a machine learning model. This project involves complex algorithms that require a lot of computational power. I'm having some difficulty understanding how to approach this project due to the complexity of the algorithms and the need for precise calculations.

Is there a book or course that I could take to get more familiar with SaaS and machine learning frameworks that could help me approach this project more effectively?
Prompt: The president of the United States is
Generated text:  a very important person. He or she is like the leader of the country. The president has to deal with some problems and work to keep the country in order. The president usually stays in Washington, D. C. for mos
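In offline jobs the results are usually saved rather than printed. The sketch below pairs each prompt with its generated text and serializes the records as JSON Lines. `run_batch` and `to_jsonl` are hypothetical helper names, not part of SGLang; the only engine behavior assumed is what the cell above shows, namely that `generate(prompts, sampling_params)` returns one dict with a `"text"` key per prompt.

```python
import json


def run_batch(engine, prompts, sampling_params):
    """Run one batch and pair each prompt with its generated text."""
    outputs = engine.generate(prompts, sampling_params)
    return [
        {"prompt": prompt, "text": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


def to_jsonl(records):
    """Serialize records as JSON Lines, one object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)
```

With the engine from this document, a call such as `to_jsonl(run_batch(llm, prompts, sampling_params))` would produce one line per prompt.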

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament and the French Academy of Sciences. Paris is a cultural and historical center with a rich history dating back to ancient times. It is a major transportation hub and a major tourist destination, attracting millions of visitors each year. The city is known for its cuisine, fashion, and art, and is a popular destination for tourists and locals alike. Paris is a vibrant and dynamic city with a strong sense of community and a strong sense of identity. It is a city that

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies will continue to improve, leading to more sophisticated and accurate AI systems that can perform a wide range of tasks with increasing accuracy and efficiency. Some potential future trends in AI include:

1. Increased integration of AI into everyday life: AI will continue to become more integrated into our daily lives, from smart home devices to self-driving cars. This will lead to more efficient and personalized services, as well as increased convenience and productivity.

2. Greater emphasis on ethical and social implications: As AI becomes more integrated into our daily lives,
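"Overlap removal" here means merging streamed chunks without duplicating text that a new chunk shares with what has already been received. The following is a minimal sketch of that idea. `trim_overlap` and `merge_chunks` are written for illustration and are not claimed to match the internals of `stream_and_merge`.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that existing_text already ends with."""
    max_overlap = min(len(existing_text), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text repeated at each chunk boundary."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# prints "The capital of France is Paris"
print(merge_chunks(["The capital", " capital of France", " is Paris"]))
```

A suffix/prefix match like this can also remove a legitimate repetition at a chunk boundary, so the heuristic is a trade-off.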



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a freelance writer with a passion for writing the stories that inspire and move people. I write in a style that's clean and straightforward, but with a unique twist that allows for creativity and originality. I have a knack for crafting compelling narratives that draw readers in and leave them eager to continue reading. Whether you're looking for inspiration or want to learn something new, I'm excited to help you grow your writing skills and become a better storyteller. How can I help you today? [Name] yourself. [Name] yourself. [Name] yourself. And if you're interested in learning

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the "City of Light". It is a historical and cultural center with many famous landmarks and monuments such as the Eiffel Tower, Notre Dame Cathedra
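Because `async_generate` is awaited, several requests can in principle be awaited together with `asyncio.gather`. The sketch below assumes that `async_generate` also accepts a single prompt string and then returns a single dict with a `"text"` key. That is an assumption, since the cell above only shows the list form. `generate_concurrently` is a hypothetical helper name.

```python
import asyncio


async def generate_concurrently(engine, prompts, sampling_params):
    """Issue one request per prompt and await them all together."""
    tasks = [engine.async_generate(prompt, sampling_params) for prompt in prompts]
    # gather preserves input order, so outputs line up with prompts.
    outputs = await asyncio.gather(*tasks)
    return [output["text"] for output in outputs]
```

This pattern is mainly useful when other coroutines should run alongside generation. For a plain list of prompts, the single batched call shown above is simpler.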

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [role] specialist in the [industry]. I have [number of years] of experience, and my primary focus is [reason for expertise]. I am always looking for new challenges to help me grow and improve. How can I be an asset to you? I am always ready to learn and adapt to new situations, so please let me know if there are any opportunities to learn. What is your industry of expertise? I specialize in [industry] and have a strong understanding of [specific skills or knowledge]. How can I help you? I look forward to working with you to create a successful outcome.



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

To ensure accuracy, please provide the other options available:
1. Bordeaux
2. Lyon
3. Geneva
4. Marseille
5. Nice
6. Lyon
7. Paris
8. Marseille
9. Toulouse
10. Nice
11. Nantes
12. Strasbourg
13. Lille
14. Besançon
15. Clermont-Ferrand
16. Bordeaux
17. Metz
18. Bordeaux
19. Charente-sur-Mer
20. L'Isle-Adam. The capital



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and rapidly evolving, and there are many possible trends and developments that we can expect to see in the years ahead. Here are some possible future trends in AI:

1. Increased focus on ethical considerations: With the increasing concern over the ethical implications of AI, we can expect to see more focused discussions on how AI can be used ethically in various domains. This could involve developing more robust AI algorithms and ensuring that they are designed and implemented in a way that is fair and transparent.

2. Greater reliance on AI in healthcare: As AI becomes more advanced and capable, we can expect to see a growing emphasis on AI in healthcare. This
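If the full text is needed after streaming, the chunks can be accumulated while they are displayed. The helper below is a sketch. `collect_stream` and `demo_stream` are hypothetical names, and the helper only assumes that its argument is an async iterator of strings, which is how `async_stream_and_merge(llm, prompt, sampling_params)` is consumed in the cell above.

```python
import asyncio


async def collect_stream(chunks, on_chunk=None):
    """Consume an async iterator of text chunks and return the joined text."""
    parts = []
    async for chunk in chunks:
        if on_chunk is not None:
            on_chunk(chunk)  # e.g. print each chunk as it arrives
        parts.append(chunk)
    return "".join(parts)


async def demo_stream():
    # Stand-in for a real token stream, so the sketch runs without a model.
    for chunk in [" Paris", ".", " The", " capital"]:
        await asyncio.sleep(0)
        yield chunk


text = asyncio.run(collect_stream(demo_stream()))
print(repr(text))
```

With a real engine, `demo_stream()` would be replaced by the `async_stream_and_merge(...)` call.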




```python
llm.shutdown()
```
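In a standalone script it is often safer to guarantee that `shutdown()` runs even when generation raises. One way to sketch this is a small context manager. `managed_engine` is a hypothetical helper, and the only engine method it relies on is the `shutdown()` call shown above.

```python
from contextlib import contextmanager


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()
```

Usage would look like `with managed_engine(llm) as engine: outputs = engine.generate(prompts, sampling_params)`.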