# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
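
To see why this is needed: by default, `asyncio.run` refuses to start when an event loop is already running, which is typical in notebook environments. The sketch below reproduces that error using only the standard library, with no SGLang involved. The names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


nested_result = asyncio.run(outer())
print(nested_result)
```

Calling `nest_asyncio.apply()` beforehand patches the event loop so that nested calls of this kind are allowed.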

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Marek and I have a keen interest in the region of Baltic Sea and the marine ecosystems surrounding it. As a child, I have always been fascinated by the sea and the ocean. I have always loved to dive and take photos of the sea and sea animals. I have always been fascinated by the marine life and marine ecosystem and the impact of human activities on it. I was attracted to the Baltic Sea because of the beautiful beaches and its tranquility.
As a child, I was fascinated by the sea and the ocean. I had a keen interest in the region of Baltic Sea and the marine ecosystems surrounding it. As a child, I
Prompt: The president of the United States is
Generated text:  seeking to improve his presidential term. He plans to use the first 12 terms of the digits to create a special code. Each term will be a single-digit number. The president wants to maximize the sum of the digits of the codes. If the code is formed by arranging the digits in ascending order
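
For offline batch jobs, results are often written to disk for later analysis. The following is a minimal sketch, assuming each output is a dict with a `text` key as in the loop above. The `save_results` helper and the sample data are hypothetical stand-ins, not part of SGLang.

```python
import json
import os
import tempfile


def save_results(prompts, outputs, path):
    """Write one JSON object per line, pairing each prompt with its generated text."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt, output in zip(prompts, outputs):
            record = {"prompt": prompt, "text": output["text"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Stand-in data shaped like the list returned by llm.generate(prompts, ...).
demo_prompts = ["The capital of France is"]
demo_outputs = [{"text": " Paris."}]

demo_path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
save_results(demo_prompts, demo_outputs, demo_path)

# Read the file back to confirm the round trip.
with open(demo_path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(loaded)
```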

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Age] year old [Occupation]. I'm a [Type of Character] who is [Describe your character's personality and background]. I'm always [Describe your character's personality traits or qualities]. I'm [Describe your character's hobbies or interests]. I'm [Describe your character's strengths and weaknesses]. I'm [Describe your character's goals and aspirations]. I'm [Describe your character's overall personality]. I'm [Describe your character's unique selling point or special ability]. I'm [Describe your character's unique personality]. I'm [Describe your character's unique background or education]. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as "La Ville de Paris" and "La Ville de la Rose". It is the largest city in France and the second-largest city in the European Union, with a population of over 2. 5 million people. Paris is known for its rich history, beautiful architecture, and vibrant culture, and is a major tourist destination. It is also home to the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. The city is known for its fashion industry, art scene, and cuisine, and is a popular tourist destination for its romantic ambiance and cultural events. Paris is a major hub

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased automation: As AI becomes more advanced, it is likely to become more and more integrated into our daily lives. This will lead to increased automation, where machines will be able to perform tasks that were previously done by humans. This will result in a more efficient and productive workforce, as well as a reduction in the need for human labor.

2. AI ethics and privacy: As AI becomes more advanced, it
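
The "overlap removal" mentioned above can be illustrated with a small standalone sketch. This is an assumption about the general technique, not the implementation in `sglang.utils`; the helpers `drop_overlap` and `merge_chunks` are hypothetical names. The idea is that if a new chunk begins with text that already ends the accumulated output, only the non-repeated tail is appended.

```python
def drop_overlap(existing_text, new_chunk):
    """Return the part of new_chunk that does not repeat the end of existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, dropping any repeated overlap."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


demo_chunks = [" Paris is", " is the capital", " capital of France."]
merged_text = merge_chunks(demo_chunks)
print(repr(merged_text))
```

A purely textual check like this cannot distinguish an accidental overlap from a genuine repetition in the generated text, so it is best treated as a heuristic.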



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name]. I am a [insert profession or role]. I enjoy [insert something about your profession or role], and I strive to be the best version of myself. I am a [insert your profession or role] and I am passionate about [insert something about your profession or role]. I love [insert something about your profession or role] and I believe that my [insert something about your profession or role] can make the world a better place. I am always looking for new experiences and learning opportunities, and I am always ready to improve myself. What do you think I could say to welcome you to my world? [Insert

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in France and has a population of over 2 million people. The city is known for its rich history, beautiful architecture, and vibrant 
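
One reason to use an asynchronous interface is to let several awaitable calls make progress concurrently. The sketch below shows the pattern with `asyncio.gather`. `fake_async_generate` is a hypothetical stand-in for an awaitable generation call, so the example runs without a model. How the real engine behaves under concurrent calls is not covered here.

```python
import asyncio


async def fake_async_generate(prompt, delay):
    """Hypothetical stand-in for an awaitable generation call."""
    await asyncio.sleep(delay)  # simulates time spent generating
    return {"text": f" <completion for: {prompt}>"}


async def run_concurrently(prompts):
    # Schedule all requests at once; gather returns results in input order.
    tasks = [fake_async_generate(p, delay=0.01) for p in prompts]
    return await asyncio.gather(*tasks)


demo_prompts = ["Hello, my name is", "The capital of France is"]
gathered = asyncio.run(run_concurrently(demo_prompts))
for prompt, output in zip(demo_prompts, gathered):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```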

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], I'm a professional graphic designer with 10 years of experience in web and print design. I specialize in creating visually stunning designs that are both functional and aesthetically pleasing. My design process is based on collaboration, feedback, and a deep understanding of the client's needs and goals. I'm a collaborative, problem solver who thrives on innovation and continuously learning. My dedication to my work is genuine and I am committed to creating a lasting impact for my clients. With a knack for creativity, I believe that my expertise is unparalleled and I'm excited to bring my creativity to new projects. Thank you for taking the time

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

This statement is accurate and concise, providing the key information about the capital city's location and title.

Please let me know if you need any further clarification or assistance on this topic. Let me know and we can continue the conversation. If you have any questions or need more information on this topic, feel free to ask! Let me know!

Thank you for your understanding and assistance. Is there anything else you would like me to add or provide more details on? Let me know! Please provide more context if needed. I'm here to help! Sure, here's a more detailed and accurate statement about Paris:

Paris

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and constantly evolving. Here are some potential trends that are expected to shape the AI landscape in the coming years:

1. Autonomous vehicles: Self-driving cars are becoming more common, and AI is playing a key role in their development. Autonomous vehicles will be equipped with advanced sensors, algorithms, and machine learning systems that can make complex decisions in real-time.

2. Emotional AI: AI is already being used to provide emotional support and empathy to people, but as the technology advances, we may see even more sophisticated emotional AI that can understand and respond to complex emotional states.

3. Biometric AI: AI is already being used to unlock doors
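
The consumption pattern used in this section (an `async for` loop that prints each chunk as it arrives) can be tried without a model. In the sketch below, `fake_token_stream` is a hypothetical async generator standing in for a streaming source such as `async_stream_and_merge`; it only illustrates how chunks are received and accumulated.

```python
import asyncio


async def fake_token_stream(tokens):
    """Hypothetical stand-in for an async generator of text chunks."""
    for token in tokens:
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield token


async def collect_stream(tokens):
    pieces = []
    async for chunk in fake_token_stream(tokens):
        print(chunk, end="", flush=True)  # show text as it arrives
        pieces.append(chunk)
    print()  # newline once the stream is exhausted
    return "".join(pieces)


streamed_text = asyncio.run(
    collect_stream([" Paris", ".", " It", " is", " a", " city", "."])
)
```

Printing with `end=""` and `flush=True` keeps the chunks on one line and makes them visible immediately, rather than one chunk per line.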




```python
llm.shutdown()
```
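
In a standalone script, it can help to guarantee that `shutdown()` runs even when generation raises an exception. The following is a minimal sketch of that pattern using `try`/`finally`. `DummyEngine` is a hypothetical stand-in that only mimics the `generate()` and `shutdown()` calls shown above, so the example runs without a GPU.

```python
class DummyEngine:
    """Hypothetical stand-in exposing generate() and shutdown() like the engine above."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.closed = True


engine = DummyEngine()
try:
    engine.generate([], {"temperature": 0.8})  # fails on purpose
except ValueError as exc:
    print(f"Generation failed: {exc}")
finally:
    engine.shutdown()  # runs whether or not generation succeeded

print(engine.closed)
```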