# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested event loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
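
As a sketch of why this patch is needed: `asyncio.run` refuses to start when an event loop is already running in the current thread, which is the situation inside IPython and Jupyter. The snippet below reproduces that error with the standard library only. The coroutine names `inner` and `outer` are illustrative, and it is an assumption here that the engine's synchronous entry points drive an event loop in a comparable way.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, as in a Jupyter cell.
    coro = inner()
    try:
        asyncio.run(coro)  # starting a second loop from here is not allowed
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` first patches asyncio so that such nested calls are permitted.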

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Niko. I'm an AI language model. I'm here to provide information on a wide range of topics.

1. How do I improve my language skills?

I can't tell you how to improve your language skills. However, there are several ways to improve your language skills:

- Read and listen to language materials such as books, magazines, and news articles.
- Practice speaking with native speakers or language exchange partners.
- Use language learning apps and tools.
- Immerse yourself in language use and engage in conversation.
- Take language courses or language classes.

2. How do I know if I'm making progress in my language
Prompt: The president of the United States is
Generated text:  a very important person. His job is very important, but it also has a lot of responsibility. For example, he is responsible for the economy of the country. The president decides how much money and goods to spend. He also makes sure that there is enough food and medicine for the p
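
As a rough illustration of what the `temperature` and `top_p` values used above control, the sketch below applies temperature scaling and nucleus (top-p) filtering to a toy distribution. It is not SGLang's sampler: the helper names `softmax_with_temperature` and `top_p_filter` and the toy logits are made up for this example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalize over the kept tokens


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
probs = softmax_with_temperature(toy_logits, temperature=0.8)
candidates = top_p_filter(probs, top_p=0.95)
print(candidates)
```

With these toy numbers, the least likely token falls outside the nucleus and can never be sampled, while the remaining three are renormalized to sum to one.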

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm passionate about [reason for interest] and I'm always looking for ways to [action or goal]. I'm a [reason for interest] and I'm always looking for ways to [action or goal]. I'm a [reason for interest] and I'm always looking for ways to [action or goal]. I'm a [reason for interest] and I'm always looking for ways to [action or goal]. I'm a [reason for interest

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich cultural heritage and is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also known for its vibrant nightlife and is a popular tourist destination. The city is known for its cuisine, including its famous French cuisine, and is a major center for art, music, and literature. Paris is a city of contrasts, with its modern architecture and high-tech industries blending with its traditional French culture and history. The city is also home to many international organizations and events, including the World Cup

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some possible future trends in AI include:

1. Increased integration with other technologies: AI will become more integrated with other technologies such as sensors, actuators, and actuaries, which will allow for more sophisticated and efficient applications.

2. Enhanced privacy and security: As AI becomes more integrated with other technologies, there will be increased concerns about privacy and security. There will be efforts to develop more secure and transparent AI systems.

3. Greater emphasis on ethical considerations: As AI becomes more integrated with other technologies, there will be greater emphasis
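
The "overlap removal" mentioned above can be pictured as follows. This is a minimal sketch that assumes streamed chunks may repeat the tail of the text already received. The helpers `trim_overlap` and `merge_chunks` and the sample chunks are illustrative, and are not necessarily how `stream_and_merge` is implemented.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Made-up chunks in which each one repeats part of the previous text.
chunks = [" Paris, also", " also known as", " as the City of Light."]
merged = merge_chunks(chunks)
print(merged)
```

A suffix/prefix match like this cannot tell an accidental repetition from a genuine one, so it is only appropriate when the stream is known to resend overlapping text.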



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [occupation]. I'm a [type of software] developer. I have a passion for [reason for passion]. My [most significant accomplishment] was [description]. If you want to learn more about me, please ask me anything. [Name]. I'm excited to hear from you and see what makes you unique and special. Remember to keep me updated on your progress and let me know if you need any help or support. Thanks for having me, [Name]. How can I assist you today? [Name]. I'm excited to hear from you. [Name]. I'll keep you updated on

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic Eiffel Tower and rich cultural heritage.

Paris, known for its iconic Eiffel Tower and rich cultural heritage, serves as the capital of France. Its iconic landmark, the Eiffel Tower, attracts millions of visitors 
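
One reason to prefer the asynchronous interface is that generation can be awaited alongside other coroutines. The sketch below shows that pattern with `asyncio.gather`. `FakeEngine` is a hypothetical stand-in so the snippet runs without a model; only the `async_generate` method name and the `{"text": ...}` result shape are taken from the cell above, and whether several concurrent calls help with a real engine is an assumption to verify.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine; it echoes prompts instead of running a model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Submit every batch at once; gather returns results in submission order.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
results = asyncio.run(run_batches(FakeEngine(), batches, sampling_params))

for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```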

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I’m a [Type] with [Experience] years of experience in [Field]. I’m [Age] years old and [Gender]. I’m [Occupation] and I live in [Your Location]. I’m currently [Status] at [Your Position] and I’m passionate about [What You Do or What You Love To Do]. I enjoy [What You Do or What You Love To Do] because [Why]. I believe in [Why] and I’m dedicated to [What You Do or What You Love To Do]. I’m always [How I Handle Difficulties] and [How I Manage


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "La Grande Orléans" and "La République" (meaning "Republic") and is the most populous city in the country. It is located in the south of the country and is the largest city in Europe by land area. The city is located on the River Seine, which forms the Paris Canal. Paris is also home to the Eiffel Tower, the Tour Eiffel, Louvre Museum, and the Arc de Triomphe. It is known for its iconic architecture, including the Palace of Versailles and the Arc de Triomphe, as well as its rich history


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid development, adoption, and integration with other technologies. Here are some possible future trends in AI:

1. Automation: AI is expected to become more prevalent in manufacturing and service industries, where automation can significantly reduce human errors and costs. The use of AI-powered robots and drones in agriculture, manufacturing, and logistics will also become more common.

2. Intelligence: As AI becomes more sophisticated, it is likely to become more intelligent than humans. This includes the ability to learn from experience, make decisions based on data, and even form unique ideas.

3. Natural language processing: AI is already capable of natural language processing

```python
llm.shutdown()
```