# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
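
As a rough illustration of the second use case, a custom server mostly needs a thin asynchronous handler around the engine. The sketch below assumes an engine object whose `async_generate(prompt, sampling_params)` coroutine returns a dict with a `"text"` key for a single prompt, modeled on the examples later in this document. `EngineLike`, `EchoEngine`, and `handle_request` are hypothetical names, and the stand-in engine exists only so the sketch runs without a GPU. The linked `custom_server.py` remains the reference example.

```python
import asyncio
from typing import Any, Dict, Protocol


class EngineLike(Protocol):
    """Hypothetical interface: the subset of the engine this sketch relies on."""

    async def async_generate(
        self, prompt: str, sampling_params: Dict[str, Any]
    ) -> Dict[str, Any]: ...


class EchoEngine:
    """Stand-in engine (assumption) that echoes the prompt instead of running a model."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real engine call would
        return {"text": f" [echo] {prompt}"}


async def handle_request(engine: EngineLike, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Validate a JSON-like payload and forward it to the engine."""
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        return {"status": 400, "error": "field 'prompt' must be a non-empty string"}
    sampling_params = payload.get("sampling_params", {"temperature": 0.8, "top_p": 0.95})
    output = await engine.async_generate(prompt, sampling_params)
    return {"status": 200, "text": output["text"]}


response = asyncio.run(handle_request(EchoEngine(), {"prompt": "The capital of France is"}))
print(response)
```

Any web framework can call a handler of this shape from its route function; the framework choice is independent of the engine.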



## Nest Asyncio
If you use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Dany and I come from the Cote d'Ivoire. I am interested in the different religions of the world. My interest is to understand the reason why the followers of different religions share commonalities. I would like to know the history of Christianity and Judaism. Can you provide a detailed explanation of the origins and development of these religions? 

Additionally, can you provide information on the impact of these religions on global society, such as their influence on art, architecture, and politics? Please include at least two examples of religious texts that have had a significant impact on human history. Finally, can you share your thoughts on the role
Prompt: The president of the United States is
Generated text:  a chief executive of the government, and he is the head of the executive branch. True or False? To determine whether the statement "The president of the United States is a chief executive of the government, and he is the head of 
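
For offline batch jobs it is often convenient to persist each prompt together with its completion. The following sketch assumes only what the loop above shows, namely that `llm.generate` returns one dict per prompt with a `"text"` key. `to_jsonl` is a hypothetical helper, and the hard-coded `outputs` list stands in for a real engine call.

```python
import json
from typing import Any, Dict, List


def to_jsonl(prompts: List[str], outputs: List[Dict[str, Any]]) -> str:
    """Serialize prompt/completion pairs as one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    lines = [
        json.dumps({"prompt": prompt, "completion": output["text"]})
        for prompt, output in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Stand-in for `outputs = llm.generate(prompts, sampling_params)`
prompts = ["The capital of France is", "The future of AI is"]
outputs = [{"text": " Paris."}, {"text": " bright."}]

jsonl = to_jsonl(prompts, outputs)
print(jsonl)
```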

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm passionate about [job title] and I'm always looking for ways to improve my skills and knowledge. I'm also a [job title] at [company name], and I'm always looking for ways to improve my skills and knowledge. I'm a [job title] at [company name], and I'm always looking for ways to improve my skills and knowledge. I'm a [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. 

A. True
B. False
A. True

Paris is the capital city of France, located in the south of the country and is the largest city in Europe. It is known for its rich history, beautiful architecture, and vibrant culture. The city is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also known for its fashion industry, with many famous fashion houses and designers based in the city. The city is a popular tourist destination and is home to many museums, theaters, and other cultural institutions. Overall, Paris is a fascinating

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased automation and robotics: As AI technology continues to advance, we are likely to see an increase in automation and robotics in various industries. This will lead to the development of new jobs and the creation of new industries that require specialized skills.

2. AI ethics and privacy concerns: As AI technology becomes more advanced, there will be a growing concern about the ethical implications of AI. This will include issues such as bias, transparency,
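
The phrase "overlap removal" refers to merging streamed chunks whose beginning may repeat the end of the text received so far. The sketch below illustrates that idea in plain Python. It is not the implementation of `stream_and_merge` in `sglang.utils`, and the function names `strip_overlap` and `merge_chunks` are chosen for this example only.

```python
from typing import Iterable


def strip_overlap(existing: str, chunk: str) -> str:
    """Return the part of `chunk` that does not repeat the tail of `existing`."""
    max_overlap = min(len(existing), len(chunk))
    # Try the longest candidate first so the largest repeated prefix is removed
    for size in range(max_overlap, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks, dropping repeated boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


merged = merge_chunks(["The cap", "capital of", " of France"])
print(merged)
```

A caveat of this simple approach is that text which legitimately repeats at a chunk boundary would also be trimmed, so it only suits streams that are known to resend overlapping text.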



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [Age] year old [Occupation]. I am a [Skill] who has a passion for [Interest] and have always been passionate about [Favorite Hobby]. Despite my age, I'm always determined to [Challenge], [Motivate], and [Inspire]. I'm very [Extroverted], [Introverted], [Loyal], [Independent], [Flexible], [Patient], [Committed], [Mentor], and [Adventurous]. I'm always [Adventurous], [Adventurous], [Adventurous], and [Adventurous]. I'm very

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a city located in the center of the country.

Paris, often referred to as "The City of Light," is the largest city in France and the seat of the government of France. It is known for its historical architecture, vibrant cultural scene, and annual literary festivals. The city is also known for its fashion industry, annual
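
Because `async_generate` is awaited, it can run alongside other asynchronous work. The sketch below shows the general `asyncio.gather` pattern for issuing one request per prompt and collecting the results in prompt order. `FakeEngine` is a stand-in so the example runs without a model, and its single-prompt return shape (a dict with a `"text"` key) is an assumption modeled on the output above.

```python
import asyncio
from typing import Any, Dict, List


class FakeEngine:
    """Stand-in for the engine (assumption): returns a canned completion."""

    async def async_generate(
        self, prompt: str, sampling_params: Dict[str, Any]
    ) -> Dict[str, Any]:
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(
    engine, prompts: List[str], sampling_params: Dict[str, Any]
) -> List[str]:
    """Issue one request per prompt concurrently; gather keeps prompt order."""
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    outputs = await asyncio.gather(*tasks)
    return [output["text"] for output in outputs]


texts = asyncio.run(
    generate_all(
        FakeEngine(),
        ["Hello, my name is", "The capital of France is"],
        {"temperature": 0.8, "top_p": 0.95},
    )
)
for text in texts:
    print(text)
```

Passing the whole list in a single call, as in the cell above, is usually simpler; separate awaits are mainly useful when prompts arrive at different times.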

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  ____________. I am a/an ____________ who is currently ____________. I came to ____________ to ____________ at ____________ in the year ____________. I am excited to ____________ with you. Feel free to share your opinion on my personality, background, and any other relevant information you have about me. Remember, your input is invaluable to me in shaping my character. Welcome to the world of ____________. I hope you enjoy our shared experience.
I am ____________. I am ____________.
I am the ____________ of the ____________. I love ____________. I am __________


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in Europe and the largest metropolitan area in the world. Paris was founded in the 12th century by Philip the Fair as the royal capital of France. Paris has many famous landmarks, including the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Palace of Versailles. The city is also known for its vibrant French culture and cuisine. Paris is a popular tourist destination and a major center of international business, politics, and culture. According to the 2020 census, the population of Paris is 2.2 million people.

I apologize, but there



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to continue to be shaped by rapid advancements in various fields, including machine learning, robotics, and the development of new hardware. Some possible trends that may be seen in the future include:

1. Increased integration of AI into everyday technology: More and more devices and systems will become interconnected, with AI playing a central role in their operation. This could lead to a more seamless and intelligent user experience, making everyday tasks and decisions more efficient and personalized.

2. The development of more advanced AI systems that can learn and adapt to new situations: As AI technology continues to improve, it is likely that we will see the development of more advanced systems

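When the streamed text is needed afterwards, for example for logging, the chunks can be accumulated while they are printed. The sketch below assumes only that the stream is an async iterator of strings, as in the loop above. `fake_stream` is a hypothetical stand-in for `async_stream_and_merge(llm, prompt, sampling_params)`, and `print_and_collect` is an illustrative helper name.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream() -> AsyncIterator[str]:
    """Stand-in async generator (assumption) that yields text chunks."""
    for chunk in [" Paris", ".", " It", " is", " the", " capital", "."]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield chunk


async def print_and_collect(stream: AsyncIterator[str]) -> str:
    """Print chunks as they arrive and return the full text."""
    parts: List[str] = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = asyncio.run(print_and_collect(fake_stream()))
```
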



In [6]:
llm.shutdown()
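
To make sure `shutdown()` runs even if generation raises an exception, the engine can be wrapped in a context manager. This is a sketch built on `contextlib` from the standard library. `managed_engine` and `DummyEngine` are hypothetical names, and the dummy class only records the call so the example runs without a GPU.

```python
from contextlib import contextmanager


class DummyEngine:
    """Stand-in (assumption) exposing a `shutdown()` method like the engine above."""

    def __init__(self) -> None:
        self.closed = False

    def shutdown(self) -> None:
        self.closed = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call `shutdown()` on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = DummyEngine()
try:
    with managed_engine(engine) as llm:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print("engine shut down:", engine.closed)
```

With a real engine, the object created by `sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")` would take the place of `DummyEngine()`.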