# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
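To see why the patch is needed: environments such as IPython already run an event loop, and the standard library refuses to start a second one inside it. The sketch below reproduces that failure using only `asyncio`. The function names are illustrative, and the snippet does not involve SGLang itself.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # An event loop is already running here, much like inside a notebook cell.
    coro = inner()
    try:
        # asyncio.run() refuses to start a new loop inside a running one.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches `asyncio` so that such nested calls are permitted, which is why it is required in those environments.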

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Yurii Ivanovich and I am a senior student of the Faculty of Applied Economics and Management, so I want to discuss about the problem of influence of the morphology of the resulting artifactual culture (i.e., its constitution as a result of the influence of culture) on the character of cultural development. I hope to follow the path of this topic for several years and to put the topic of my research into practical practice.
The first generation of the artificial culture has developed in the new democratic societies, so it is not a new cultural phenomenon in the world. It is a phenomenon that has developed in the 20th century and now
Prompt: The president of the United States is
Generated text:  a person who holds the office of the President of the United States. This position can be held by anyone in the country, but certain qualifications must be met. These qualifications include:
1. Being a citizen of the United States.
2. Having the ability 

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French National Library. Paris is a cultural and economic hub, with a rich history dating back to the Roman Empire and a modern city that has undergone significant development over the centuries. It is a popular tourist destination and a major center of politics, business, and culture in Europe. The city is also known for its cuisine, including its famous croissants and its traditional French wine. Paris is a vibrant and dynamic city

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing for more complex and nuanced decision-making. This could lead to more sophisticated and personalized AI systems that can better understand and respond to human emotions and behaviors.

2. Greater reliance on data: AI will become more data-driven, with more data being collected and analyzed to improve its performance. This could lead to more efficient and effective AI systems that can make better decisions based on data.

3. Greater use of machine learning: Machine learning will become more prevalent, with more sophisticated algorithms and techniques being developed to
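
The "overlap removal" mentioned in the cell above refers to merging streamed chunks so that text already received is not emitted twice. The following is a minimal sketch of that idea, not the actual `stream_and_merge` implementation: `trim_overlap_sketch`, `merge_chunks`, and the sample chunks are hypothetical names and data introduced here for illustration.

```python
from typing import Iterable


def trim_overlap_sketch(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is also a suffix of existing."""
    max_len = min(len(existing), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, trimming whatever each one repeats from the text so far."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


# Hypothetical chunks in which each one resends the tail of the previous one.
chunks = [" Paris, the", " the city known", " known for its landmarks"]
merged = merge_chunks(chunks)
print(merged)
```

A suffix/prefix match cannot tell a resent fragment from a genuine repetition in the generated text, so this kind of merging is best treated as a heuristic.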



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I am [Age], and I am a [Occupation]. I have always been fascinated by [Topic/Interest] and have a deep understanding of [Skill or Area of Expertise]. I love to [What You Love About Your Job or Profession]. I have a passion for [Favorite Hobby/Activity/Connection with Others]. I am always looking for new experiences and ideas to explore and grow as a person. I am [Your Ideal Character Type]. I am a [Name] by nature, but I am constantly learning and evolving. I value [Skill or Quality that You Admire in Me]. I am excited

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

Pronunciation: Paris (PRIN-tuh)
Explanation of pronunciation: 
- "P" - Short for "p" in "Paris"
- "r" - Short for "r" in "Paris"
- "n" - Short for "n" in "Paris"
- "t" - Short for "t" in "Paris"

Note: The pronunciation of "Paris"
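
Because `async_generate` is awaited, other coroutines can make progress while generation is pending, which is useful when the engine is embedded in a larger asynchronous program. The sketch below illustrates that behaviour with a stand-in coroutine. `fake_async_generate`, `heartbeat`, and the simulated delays are hypothetical; they only mimic the shape of the call shown above, returning one `{"text": ...}` dict per prompt, and no model is involved.

```python
import asyncio
from typing import Dict, List, Tuple


async def fake_async_generate(prompts: List[str], delay: float = 0.05) -> List[Dict[str, str]]:
    # Hypothetical stand-in for an engine call; it sleeps instead of running a model.
    await asyncio.sleep(delay)
    return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks: List[int], interval: float = 0.01) -> None:
    # Unrelated background work that keeps running while generation is awaited.
    while True:
        ticks.append(1)
        await asyncio.sleep(interval)


async def demo() -> Tuple[List[Dict[str, str]], int]:
    ticks: List[int] = []
    background = asyncio.create_task(heartbeat(ticks))
    outputs = await fake_async_generate(
        ["The capital of France is", "The future of AI is"]
    )
    # Stop the background task once the outputs have arrived.
    background.cancel()
    try:
        await background
    except asyncio.CancelledError:
        pass
    return outputs, len(ticks)


outputs, tick_count = asyncio.run(demo())
print(len(outputs), "outputs;", tick_count, "heartbeat ticks while waiting")
```

With the synchronous `generate` call, the background task would not get a chance to run until generation had finished.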

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware streaming helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [background information on your character]. I have always been passionate about [specific hobby or activity] and have a deep appreciation for [reason for interest]. My creative writing skills have enabled me to create [number of words] words of poetry and essay. What brings you to this world? I am a [job or hobby], where my passion for [hobby or activity] finds its fulfillment. Please tell me more about you. [Tell us about your character's name, background information, and hobby or activity. Use a friendly and conversational tone in your introduction.]
Hello, my name is [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

That's correct! Paris is the capital city of France, the largest country in Europe, located in the south of the country. It is the largest city in the European Union and has a population of around 17 million people. Paris is known for its rich history, diverse culture, and world-class attractions such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. The city is also home to the iconic Eiffel Tower, which has stood for over a century and still stands as a symbol of France. Paris is often referred to as "the City of Light" due to its vibrant night

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  quite exciting and will likely continue to evolve in many different directions. Here are some possible trends in the AI field:

1. Increased Autonomy: AI is becoming more and more capable of making decisions without human oversight. This means that AI systems will be able to make autonomous decisions in areas such as self-driving cars, medical diagnosis, and even daily life tasks like grocery shopping.

2. Automation: AI is also becoming more widely used for automation in various industries, from manufacturing to healthcare. As AI systems become more efficient and accurate, they are likely to replace human workers in some jobs, freeing up more time and resources for human workers to focus




In [6]:
llm.shutdown()