# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
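
As a rough illustration of the custom-server use case, the sketch below wraps an engine-like object in a plain request handler that takes and returns JSON bytes. `handle_generate` and `EchoEngine` are hypothetical names invented for this example. `EchoEngine` is a stand-in so the snippet runs without a GPU, and it assumes that a single prompt yields a dict with a `"text"` key. In a real setup you would pass an `sgl.Engine` and call the handler from the web framework of your choice; the linked `custom_server.py` remains the reference example.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs anywhere."""

    def generate(self, prompt, sampling_params=None):
        # Assumption: a single prompt returns a dict with a "text" key.
        return {"text": f" (echo) {prompt}"}


def handle_generate(engine, raw_body: bytes) -> bytes:
    """Decode a JSON request, call the engine, and encode a JSON response."""
    request = json.loads(raw_body)
    output = engine.generate(request["prompt"], request.get("sampling_params", {}))
    return json.dumps({"text": output["text"]}).encode("utf-8")


# Example request body, as a web framework might hand it to the handler.
body = json.dumps(
    {"prompt": "Hello", "sampling_params": {"temperature": 0.8}}
).encode("utf-8")
print(handle_generate(EchoEngine(), body).decode("utf-8"))
```

Keeping the handler independent of any particular framework makes it easy to test without starting a server.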



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
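
If the same script has to run both inside a notebook and as a plain Python program, one option is to patch the loop only when one is already running. The sketch below is an assumption about how you might structure that check, not something SGLang requires. `running_in_event_loop` and `apply_nest_asyncio_if_needed` are hypothetical helper names, and `nest_asyncio` is imported lazily so the plain-script path does not need it installed.

```python
import asyncio


def running_in_event_loop() -> bool:
    """Return True when called from code that an event loop is currently running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # asyncio raises RuntimeError when no loop is running in this thread.
        return False
    return True


def apply_nest_asyncio_if_needed() -> bool:
    """Patch asyncio for nested use only when a loop is already running."""
    if not running_in_event_loop():
        return False
    import nest_asyncio  # imported lazily; only needed in the nested case

    nest_asyncio.apply()
    return True


# In a plain script no loop is running, so nothing is patched.
print(apply_nest_asyncio_if_needed())
```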

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Claudia, a 16-year old girl. I usually stay at home and play computer games. I like playing computers so much that I have to work very hard to be successful. I have my own computer and can use it whenever I want to. There's also a lot of information on the Internet, and I use it to get information on almost everything. I have been to many countries, and have visited many places. I even went to the moon with my parents. But I have a lot of problems. I've never been to a hospital and I can't even take care of myself. I have to take care of my
Prompt: The president of the United States is
Generated text:  a very important person. He is in charge of the country. He is the leader of the country. He is the boss of the country. He is the leader of the country. He makes all the important decisions. He is like a king. He is the leader of the country. He makes all the important decisions. He is the leader of the country. He makes all the important decis
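
For offline batch jobs it is often convenient to store results rather than print them. The sketch below pairs each prompt with the `"text"` field of its output (the only field this document relies on) and serializes the pairs as JSON Lines. `to_records` and `to_jsonl` are hypothetical helper names, and the sample output is made-up data, not real model output.

```python
import json


def to_records(prompts, outputs):
    """Pair each prompt with the generated text from the matching output dict."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


def to_jsonl(records) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


# Made-up output in the same shape as the list returned by llm.generate(...).
sample_prompts = ["The capital of France is"]
sample_outputs = [{"text": " Paris."}]
print(to_jsonl(to_records(sample_prompts, sample_outputs)))
```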

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm passionate about [reason for interest in the field]. I'm always looking for new challenges and opportunities to grow and learn. I'm a [reason for interest in the field] and I'm always eager to learn and improve. I'm a [reason for interest in the field] and I'm always eager to learn and improve. I'm a [reason for interest in the field] and I'm always eager to learn and improve. I'm a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic Eiffel Tower, Notre-Dame Cathedral, and diverse cultural scene. It is also home to the Louvre Museum, the most famous art museum in the world, and the Notre-Dame Cathedral, which is a UNESCO World Heritage site. Paris is a bustling metropolis with a rich history and a vibrant cultural scene, making it a popular destination for tourists and locals alike. The city is also home to many famous landmarks and attractions, including the Champs-Élysées, the Louvre, and the Eiffel Tower. Overall, Paris is a city of contrasts and beauty that is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased automation: AI will continue to automate many tasks, from manufacturing and transportation to customer service and healthcare. This will lead to increased efficiency and productivity, but it will also create new jobs and raise concerns about job displacement.

2. Enhanced human intelligence: AI will continue to improve its ability to understand and interpret human language, emotions, and behaviors. This will lead to more intelligent and empathetic AI that can better understand and respond to human needs and emotions.

3. AI will become more integrated with the physical world: AI will continue to be integrated into our daily lives, from smart home
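
The banner in the cell above mentions overlap removal. To show what that idea can mean, the sketch below drops the part of a new chunk that repeats the end of the text collected so far. It is a simplified illustration under that assumption and is not presented as SGLang's implementation of `stream_and_merge`. The function names are chosen for this example.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return new_chunk without the prefix that repeats the end of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, skipping text that overlaps with what came before."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(merge_chunks(["The cap", "capital of", " of France"]))
```

A purely textual suffix/prefix match like this can also remove legitimate repetition (for example, two identical consecutive chunks), so it only suits streams where chunks are known to overlap.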



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am [Age]. I am a [Occupation] who has been [Number of Years] in the industry. I am a [Professional Role] with [Number of Years] experience in the industry. I am an [Professional Role] with [Number of Years] experience in the industry. I am an [Professional Role] with [Number of Years] experience in the industry. I am an [Professional Role] with [Number of Years] experience in the industry. I am an [Professional Role] with [Number of Years] experience in the industry. I am an [Professional Role] with [Number

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the northwestern region of the country, and is one of the most important cities in the country. It is home to many world-famous landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. Paris is also know
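
When prompts arrive independently, for example inside a custom server, you may prefer to issue one asynchronous call per prompt and cap how many are in flight. The sketch below shows that pattern with `asyncio.Semaphore` and `asyncio.gather`. `FakeAsyncEngine` is a hypothetical stand-in that mimics only the `async_generate` call used in this document, assuming a single prompt returns a dict with a `"text"` key; `generate_concurrently` and `max_concurrency` are names invented for this example.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0)  # yield control, as a real engine call would
        return {"text": f" (fake) {prompt}"}


async def generate_concurrently(engine, prompts, sampling_params, max_concurrency=2):
    """Run one async_generate call per prompt with a cap on concurrent calls."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def generate_one(prompt):
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather returns results in the same order as the input prompts.
    return await asyncio.gather(*(generate_one(p) for p in prompts))


results = asyncio.run(
    generate_concurrently(FakeAsyncEngine(), ["a", "b", "c"], {"temperature": 0.8})
)
for result in results:
    print(result["text"])
```

For a fixed list of prompts, the single batched `async_generate` call shown above is simpler and leaves scheduling to the engine.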

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emily and I am a free spirit who loves to travel and explore new places. I am a seasoned traveler who has traveled around the world and have lived in many different cultures. I have a passion for learning new languages and cultures and I love to share my experiences with others through my blog. I am a kind-hearted and compassionate person who is always willing to lend a helping hand to those in need. I am always looking for new adventures and new experiences and I am excited to meet new people and expand my horizons. I am a free spirit who loves to travel and explore new places, a seasoned traveler who has lived in many different cultures


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
A). True
B). False
A). True

The statement is true. Paris, officially known as the "Metropolis of France," is the capital city of France. It is a historical and cultural center that has been a major hub of French politics, culture, and economy for over 300 years. The city is known for its beautiful architecture, rich historical sites, and vibrant nightlife. Paris is also home to many world-renowned attractions, including the Eiffel Tower and the Louvre Museum. It is one of the most popular tourist destinations in the world and a major international center for business,



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be highly diverse and will continue to evolve, with many different trends and possibilities emerging.

1. Advancements in machine learning: The main focus of AI research in the future will be to improve the capabilities of machine learning models, especially in areas such as image and speech recognition, natural language processing, and predictive analytics.

2. Integration with traditional industries: AI is expected to play a significant role in industries such as healthcare, finance, and transportation. These industries will benefit from AI's ability to process large amounts of data, analyze complex patterns, and predict outcomes.

3. Emergence of autonomous vehicles: As the use of autonomous vehicles
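
The streaming loop in this section prints each chunk as it arrives. If you also need the complete text afterwards, you can accumulate the chunks while forwarding them. The sketch below does this for any async iterator of strings. `collect_stream` and `fake_stream` are hypothetical names, and `fake_stream` only imitates a chunked stream so that the example runs without an engine.

```python
import asyncio


async def fake_stream(text, chunk_size=4):
    """Hypothetical chunk source: yields text in fixed-size pieces."""
    for start in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield text[start:start + chunk_size]


async def collect_stream(chunks, on_chunk=None):
    """Consume an async iterator of strings and return the joined text."""
    parts = []
    async for chunk in chunks:
        if on_chunk is not None:
            on_chunk(chunk)  # e.g. print the chunk as it arrives
        parts.append(chunk)
    return "".join(parts)


full_text = asyncio.run(
    collect_stream(
        fake_stream("Paris is the capital."),
        on_chunk=lambda chunk: print(chunk, end="", flush=True),
    )
)
print()
print(len(full_text))
```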




In [6]:
llm.shutdown()
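
In a script, an exception between engine creation and `llm.shutdown()` would skip the shutdown call. One way to guard against that is a small context manager that always calls `shutdown()` on exit. This is a sketch under the assumption that calling `shutdown()` once at the end is all the cleanup needed. `engine_session` and `FakeEngine` are hypothetical names, and `FakeEngine` exists only so that the example runs without loading a model.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def engine_session(factory):
    """Create an engine via factory() and always shut it down afterwards."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


# With a real engine, the factory could be:
#   lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
with engine_session(FakeEngine) as engine:
    print(engine.closed)
print(engine.closed)
```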