# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
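To see why the patch is needed, the sketch below reproduces the underlying problem with plain `asyncio` and no SGLang dependency: calling `asyncio.run` from inside a loop that is already running raises a `RuntimeError`. The coroutine names `inner` and `outer` are made up for this illustration; `outer` stands in for an environment such as IPython where a loop is already active.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # A loop is already running at this point, as it is inside IPython/Jupyter.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by plain asyncio
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches `asyncio` so that nested calls of this kind are permitted.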

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Emma and I'm a 15 year old female. I was born with a neurological condition that causes a weakness in my legs and a tightness in my lower back. I am currently 2.5 feet tall with a 50 pound weight. I also have a painful, thick, sore spot where the posterior sacral hiatus is located. I also have a kink in the lower back. I'm not sure if I am in need of surgery. My question is: Should I go to a doctor or is it better to just observe the situation?

Based on your description, it seems like you have a condition that could
Prompt: The president of the United States is
Generated text:  a person. If the statement "A person is a president of the United States" is true, then is it true that "a person is president of the United States"? To determine whether the statement "A person is a president of the United States" is true given the statement "A person is a president of the United States," we need to carefully analyze the logical relationship between t
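In a batch job the results are usually stored rather than printed. The following is a minimal sketch of one way to do that, assuming only what the example above shows: `llm.generate` returns one dictionary per prompt, each with a `"text"` key. The helper name `to_jsonl` and the sample data are hypothetical, and the sample list stands in for real engine output so that the snippet runs without a model.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/output pairs as JSON Lines, one record per prompt."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": prompt, "text": output["text"]}, ensure_ascii=False)
        for prompt, output in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Stand-in for the list returned by llm.generate(...); only "text" is assumed.
sample_prompts = ["The capital of France is"]
sample_outputs = [{"text": " Paris."}]
print(to_jsonl(sample_prompts, sample_outputs))
```

With a running engine, the same helper could be called as `to_jsonl(prompts, outputs)` on the variables from the cell above.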

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Age] year old [Gender] [Occupation]. I'm currently [Current Location] and I'm here to [Purpose of Visit]. I'm excited to meet you and learn more about you. How can I assist you today? [Name] [Age] [Gender] [Occupation] [Current Location] [Purpose of Visit] [Your Name] [Your Age] [Your Gender] [Your Occupation] [Your Current Location] [Your Purpose of Visit] [Your Name] [Your Age] [Your Gender] [Your Occupation] [Your Current Location] [Your Purpose

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic Eiffel Tower and the Louvre Museum. It is also home to the Louvre Museum, the most famous art museum in the world, and the Notre-Dame Cathedral, a stunning Gothic cathedral. Paris is a vibrant and diverse city with a rich history and a thriving economy. It is a popular tourist destination and a cultural hub for France and the world. The city is also known for its fashion industry, with Paris Fashion Week being one of the largest and most prestigious in the world. The French language is spoken by millions of people in Paris, and the city is home to many famous French

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some of the most likely trends in AI that are expected to shape the future:

1. Increased automation and robotics: As AI continues to advance, we are likely to see an increase in automation and robotics in various industries. This will lead to the creation of more efficient and cost-effective solutions, as well as the creation of new jobs in areas such as robotics and software development.

2. Enhanced personalization: AI will continue to improve our ability to personalize our experiences with technology. This will involve the use
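The phrase "overlap removal" in the example above refers to merging streamed chunks so that text repeated between consecutive chunks appears only once. The sketch below illustrates that idea in isolation. The names `strip_overlap` and `merge_chunks` are hypothetical, and the code is not claimed to match how `stream_and_merge` is implemented in SGLang.

```python
def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is also a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first so the largest repeat is removed.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that repeats the end of the merged output."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Each chunk repeats part of what was already received.
print(merge_chunks([" Paris", " Paris, the", " the city"]))
```

One limitation of this heuristic is that it cannot tell a repeated chunk from a genuine repetition in the text, so output such as `"ha"` followed by `"ha"` would collapse to a single `"ha"`.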



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [main profession] with [number of years] years of experience in [main profession]. I am [age] years old and [occupation] is [job title]. I am passionate about [main profession], and I am always eager to learn and grow. I am always willing to put in the extra time and effort required to achieve success, and I am always looking for new opportunities to learn and grow. I am always ready to listen and offer my support, and I am always willing to help others when necessary. Thank you for taking the time to meet me. [Name]. [Name] is

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city with a population of over 2 million people. 

Explain the significance of Paris in French culture and history, including notable landmarks, cultural institutions, and annual events. One of the city's mos
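The asynchronous API is also convenient when prompts arrive one at a time, as they would in a custom server. The following is a minimal sketch of fanning out several requests while capping how many are in flight. `StubEngine` is a hypothetical stand-in so that the snippet runs without a model or GPU, and the sketch assumes that `async_generate` also accepts a single prompt string and returns a dictionary with a `"text"` key. The helper name `generate_bounded` and its `max_concurrency` parameter are illustrative as well.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; it fabricates a placeholder completion."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_bounded(engine, prompts, sampling_params, max_concurrency=2):
    """Issue one request per prompt, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(prompt):
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather returns results in the same order as the input prompts.
    return await asyncio.gather(*(one(prompt) for prompt in prompts))


demo_prompts = ["a", "b", "c"]
results = asyncio.run(
    generate_bounded(StubEngine(), demo_prompts, {"temperature": 0.8})
)
print([result["text"] for result in results])
```

When all prompts are known up front, passing the whole list in one call, as in the cell above, is simpler and leaves scheduling to the engine.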

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am an [occupation or profession] with a passion for [a particular hobby or activity]. I love spending time with friends and family, reading, and exploring new places. I believe in the power of unity and cooperation to overcome challenges. I also enjoy learning new things, whether it's through books, movies, or cooking. What is your favorite hobby or activity to do? [Your Name]. What inspires you to pursue your interests? To me, it's the thrill of discovery, the process of growth, and the joy of creating something unique. What do you hope to achieve in the next year? I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city where the Eiffel Tower and the Louvre Museum stand tall. It is also the birthplace of many notable figures including Michelangelo and Napoleon Bonaparte, and the home of the French Revolution. Paris is known for its rich history, art, food, and fashion, and is often considered the world’s most exciting destination for tourists. It is also one of the most beautiful cities in the world, with its stunning views of the Eiffel Tower and the Pyrenees mountains. Paris is home to the Eiffel Tower, the Louvre Museum, the Champs-Élysées, the R

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  set to be exciting and unpredictable, with a wide range of potential developments and applications. Here are some possible trends that could shape the future of artificial intelligence:

1. Increased precision and accuracy: One of the most promising areas for AI development is in increasing the precision and accuracy of its applications. This could involve developing more sophisticated algorithms, better models, and more accurate data analysis techniques.

2. AI will become more integrated with other technologies: AI is already integrated with a variety of other technologies, such as robotics, autonomous vehicles, and smart cities. It is likely that this trend will continue, with more integration between AI and these other technologies,




```python
llm.shutdown()
```
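In a script, as opposed to a notebook, an exception raised during generation could skip the final `shutdown()` call. One way to guard against that is a small context manager that always shuts the engine down. The sketch below assumes only that the engine exposes `generate` and `shutdown` as used in this document. `StubEngine` and `managed_engine` are hypothetical names introduced so that the snippet runs without a model, and whether `sgl.Engine` already supports the `with` statement directly is not assumed here.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for the engine; it records whether shutdown ran."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params=None):
        return [{"text": f" <completion for: {prompt}>"} for prompt in prompts]

    def shutdown(self):
        self.closed = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and call shutdown() on exit, even if an error occurs."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
with managed_engine(engine) as llm:
    outputs = llm.generate(["Hello, my name is"])

print(outputs[0]["text"], engine.closed)
```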