# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Philip. I am a student at the University of Sydney in Australia. I have a Chinese name. I like the time to celebrate the Spring Festival. The Spring Festival is a traditional festival in China. In the Spring Festival, people wear new clothes and celebrate with their families. They eat dumplings, which are round and white. They also eat rice balls and noodles, which are round and colored. People go to visit their relatives and friends during the festival. They sing songs and dance. This year, the Spring Festival will be on February 1st. This is my favorite date to celebrate with my friends. We go to the park
Prompt: The president of the United States is
Generated text:  visiting several American colleges and universities. He will visit 7 colleges and each college will have one student. The president will travel with 5 other people from his administration, which includes himself. He will give out a medal to 2 students from each college. How many
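The call shape used above (a list of prompts in, a list of dictionaries with a `"text"` key out) can be wrapped in a small helper when a prompt list is too large to submit at once. This is a minimal sketch, not part of SGLang: `chunked`, `generate_in_batches`, and the stand-in `EchoEngine` are hypothetical names, and the stand-in only mimics the `generate(prompts, sampling_params)` call shown above so that the example runs without a GPU. The engine already schedules batches itself, so manual chunking is only an assumption about how you might bound the size of each call.

```python
from typing import Any, Dict, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start : start + size])


def generate_in_batches(
    engine: Any,
    prompts: Sequence[str],
    sampling_params: Dict[str, Any],
    batch_size: int,
) -> List[str]:
    """Call engine.generate once per chunk and collect the generated texts."""
    texts: List[str] = []
    for batch in chunked(prompts, batch_size):
        outputs = engine.generate(batch, sampling_params)
        texts.extend(output["text"] for output in outputs)
    return texts


class EchoEngine:
    """Hypothetical stand-in that mimics the output shape used above."""

    def generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        return [{"text": f" <completion for: {p}>"} for p in prompts]
```

With a real engine, the call would look like `generate_in_batches(llm, prompts, sampling_params, batch_size=2)`; results come back in the same order as the prompts.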

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich cultural heritage and is the largest city in France by population. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. The city is also famous for its fashion industry, art, and cuisine. Paris is a popular tourist destination and is home to many museums, theaters, and other cultural institutions. It is also a major financial center and a major hub for international trade. The city is known for its romantic atmosphere and is a popular destination for tourists and locals alike. Paris is a vibrant

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased automation: AI is expected to become more and more integrated into various industries, from manufacturing to healthcare to finance. This will lead to increased automation of tasks, which will require more human workers to perform these tasks.

2. AI ethics and privacy: As AI becomes more integrated into our daily lives, there will be a growing concern about the ethical implications of AI. This will include issues such as bias, transparency, and accountability.

3. AI for education: AI is
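The "overlap removal" mentioned in the example refers to merging streamed chunks so that text repeated at a chunk boundary is not duplicated. The following is a minimal sketch of that idea, assuming chunks are plain strings that may begin with text already emitted. The helper names `trim_overlap` and `merge_chunks` are illustrative here, and this is not presented as the actual implementation behind `stream_and_merge`.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing text repeated at each boundary."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged
```

For example, `merge_chunks([" Paris", " Paris, also", " also known"])` yields `" Paris, also known"`. One caveat of this approach is that a genuinely repeated string at a boundary (such as `"ha"` followed by `"ha"`) is also collapsed, so it fits best when chunks are known to overlap.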



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a [background information on the character, such as a hobby, age, etc.]. I enjoy [or am interested in learning about] [interest or hobby]. I also love [or am interested in learning about] [interest or hobby], which is why I love [occupation or career]. What brings you to [city or country]?

[Name] is the founder of [company name] and has been passionate about [interest or hobby] for [number of years].

I am confident in my ability to bring [character trait or personal characteristic] to [occupation or career] and believe that I can

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is a historic and culturally rich city known for its landmarks such as Notre-Dame Cathedral, the Arc de Triomphe, and the Louvre Museum. It is also home to numerous famous museums, such as the Musée d'Orsay an
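Because `async_generate` is awaited, it can be composed with other coroutines, for example to submit several prompt groups and wait for all of them together. The sketch below illustrates the pattern with `asyncio.gather`. `FakeAsyncEngine` and `generate_groups` are hypothetical names, and the stand-in only mimics the `async_generate(prompts, sampling_params)` call shape shown above so that the example runs without a GPU. Whether concurrent calls improve throughput with the real engine is an assumption that this document does not confirm.

```python
import asyncio
from typing import Any, Dict, List


class FakeAsyncEngine:
    """Hypothetical stand-in that mimics the async output shape used above."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        await asyncio.sleep(0)  # Yield control, as a real awaitable call would.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def generate_groups(
    engine: Any,
    groups: List[List[str]],
    sampling_params: Dict[str, Any],
) -> List[List[str]]:
    """Submit every prompt group concurrently and return texts per group."""
    tasks = [engine.async_generate(group, sampling_params) for group in groups]
    # gather preserves the order of the awaitables it was given.
    results = await asyncio.gather(*tasks)
    return [[output["text"] for output in outputs] for outputs in results]
```

With a real engine, `await generate_groups(llm, [group_a, group_b], sampling_params)` would be called from inside a coroutine such as `main()` above.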

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name] and I'm [age]. I'm a [occupation] and I enjoy [job responsibilities] to [number of employees]. I'm passionate about [what I love to do with my life]. 

My background is in [education level]. I've been working in [job title] for [number of years] and I've taken on many leadership roles. I'm a very reliable, reliable person and I'm always looking for ways to improve and learn from my experiences. 

What brings me to your world? It's my passion for [something that brings me joy and fulfillment in life]. 

I hope you enjoy our



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

Options:
a) correctly

b) incorrectly

a) correctly

Paris is the capital of France. The statement is accurate and correct. It accurately identifies Paris as the capital of France, the world's third-largest country in terms of area, and a major financial, cultural, and political center. Paris is known for its historic architecture, iconic landmarks, world-renowned museums, and a rich cultural heritage that draws millions of visitors each year. It is also renowned for its gastronomy, fashion, and cuisine, and for hosting world-renowned events such as the Eiffel Tower parade, the World Cup, and the



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to involve a number of technological advancements and shifts in the way that the technology is used. Here are some possible future trends in AI:

1. Enhanced AI capabilities: As AI technology continues to improve, we can expect to see greater enhancements in the capabilities of machines. This could include faster and more powerful AI systems that can handle more complex tasks and data.

2. Deep Learning: Deep learning is a subset of AI that involves the use of neural networks with multiple layers to learn complex patterns and relationships in data. As this technology becomes more advanced, we can expect to see deeper and more powerful models that can handle more complex tasks.

3
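The streaming example prints each chunk as it arrives. When the complete text is also needed afterwards, the chunks can be accumulated while iterating. The following is a minimal sketch, assuming the stream yields string chunks as the `print(cleaned_chunk, ...)` call in the example suggests. `fake_stream` and `collect` are hypothetical names, and the fake stream stands in for `async_stream_and_merge` so that the example runs without a GPU.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(chunks: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for a stream of already de-duplicated chunks."""
    for chunk in chunks:
        await asyncio.sleep(0)  # Simulate waiting for the next chunk.
        yield chunk


async def collect(stream: AsyncIterator[str]) -> str:
    """Consume an async stream of text chunks and return the joined text."""
    parts: List[str] = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)
```

Collecting into a list and joining once at the end avoids repeated string concatenation for long generations.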




```python
llm.shutdown()
```
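In a script, it can be convenient to guarantee that `shutdown()` runs even when generation raises an exception. The sketch below wraps engine creation in a context manager from the standard library. `managed_engine` and `FakeEngine` are hypothetical names, not part of SGLang; the only assumption about the real engine is the `shutdown()` method shown above.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def managed_engine(factory: Callable[[], Any]) -> Iterator[Any]:
    """Create an engine with `factory` and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        # Runs on normal exit and when the body raises.
        engine.shutdown()


class FakeEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self) -> None:
        self.closed = False

    def shutdown(self) -> None:
        self.closed = True
```

With the real engine, usage would look like `with managed_engine(lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")) as llm: ...`, after which no explicit `llm.shutdown()` call is needed.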