# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
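
As a rough, framework-agnostic sketch of what such a server does per request, the handler below parses a JSON body, forwards the prompt to the engine, and returns a status code with a JSON response. `StubEngine`, `handle_generate`, and the request fields `prompt` and `sampling_params` are hypothetical names chosen for illustration. The stub only imitates an `async_generate` call that returns a dictionary with a `text` key, so the sketch runs without a GPU. The linked script remains the reference example.

```python
import asyncio
import json
from typing import Tuple


class StubEngine:
    """Hypothetical stand-in for the engine so the sketch runs without a GPU."""

    async def async_generate(self, prompt, sampling_params=None):
        return {"text": f" (completion for {prompt!r})"}


async def handle_generate(engine, raw_body: bytes) -> Tuple[int, str]:
    """Turn a JSON request body into a (status code, JSON response) pair."""
    try:
        payload = json.loads(raw_body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a missing field, or a non-object body.
        return 400, json.dumps({"error": "expected a JSON object with a 'prompt' field"})

    sampling_params = payload.get("sampling_params", {})
    output = await engine.async_generate(prompt, sampling_params)
    return 200, json.dumps({"text": output["text"]})


status, body = asyncio.run(
    handle_generate(StubEngine(), b'{"prompt": "The capital of France is"}')
)
print(status, body)
```

Keeping the handler independent of any web framework means the same function can be wired into whichever server library you prefer.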



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
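
The patch matters because `asyncio` refuses to start or re-enter an event loop from a thread where one is already running. A minimal sketch for checking this and applying the patch only when needed follows. The helper names `loop_is_running` and `apply_nest_asyncio_if_needed` are hypothetical and not part of SGLang.

```python
import asyncio


def loop_is_running() -> bool:
    """Return True if the current thread already has a running event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, e.g. in a plain Python script.
        return False
    return True


def apply_nest_asyncio_if_needed() -> bool:
    """Apply nest_asyncio only when called from inside a running loop."""
    if not loop_is_running():
        return False
    import nest_asyncio  # third-party package: pip install nest_asyncio

    nest_asyncio.apply()
    return True


print("Running inside an event loop:", loop_is_running())
```

In a plain script the check reports no running loop and the patch is skipped.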

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Huahua. I like travelling, playing computer games and listening to music. I usually have breakfast at 6:00, and I usually go to bed at 10:00. In the afternoon, I play computer games for about three hours and then go to see a movie. Usually, I have lunch at 2:00, and I have dinner at 8:00. I often read books on my computer during the day. I like to listen to music when I have free time. What's your favorite subject? My favorite subject is Science. I like science because it's interesting. I
Prompt: The president of the United States is
Generated text:  76 years old now. In $5$ years, the president's age will be half of the age he will be in $n$ years from now. Find the value of $n$.

To solve the problem, let's define the current age of the president as \( x \). According to the problem, the president's current age is 76 years, so \( x = 76 \).

In 5 years, the president's age will be \( x + 5 = 76 + 5 = 81 \) years old. The problem also states 
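
The loop in this cell relies on `llm.generate` returning one dictionary per prompt, in prompt order, with the completion under the `text` key. A small sketch of collecting those pairs into records for saving is shown here. It uses hand-written stand-in outputs instead of a live engine, and the helper names `to_records` and `to_jsonl` are hypothetical.

```python
import json


def to_records(prompts, outputs):
    """Pair each prompt with its completion, assuming outputs follow prompt order."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


def to_jsonl(records) -> str:
    """Serialize records as JSON Lines, one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


# Stand-in data shaped like the outputs used above (not real model output).
example_prompts = ["The capital of France is", "The future of AI is"]
example_outputs = [{"text": " Paris."}, {"text": " uncertain."}]

print(to_jsonl(to_records(example_prompts, example_outputs)))
```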

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous world-renowned museums, theaters, and art galleries. Paris is a popular tourist destination and a major hub for international business and diplomacy. The city is known for its rich history, diverse culture, and vibrant nightlife. It is the largest city in France and one of the most visited cities in the world. Paris is also home to the French Parliament, the Eiffel Tower, and the Louvre Museum. The city is known for its

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and decision-making processes. This could lead to more sophisticated and personalized AI systems that can better understand and respond to human emotions and preferences.

2. Greater emphasis on ethical considerations: As AI becomes more integrated with human intelligence, there will be increased emphasis on ethical considerations and guidelines for its development and use. This could lead to more stringent regulations and standards for AI systems, as well as greater transparency and accountability in their development and deployment.

3. Increased use
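
"Overlap removal" here means merging streamed chunks without repeating text that consecutive chunks share. The sketch below illustrates the idea only and is not necessarily how `stream_and_merge` is implemented. The function names are chosen for illustration; see `sglang.utils` for the actual helper.

```python
def trim_overlap(existing: str, chunk: str) -> str:
    """Drop the prefix of `chunk` that repeats the tail of `existing`."""
    max_len = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, removing text duplicated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(merge_chunks(["Hello wor", "world", "d!"]))
```

One trade-off of this heuristic is that it cannot tell an accidental overlap from text that genuinely repeats: merging `"ha"` with a second `"ha"` yields a single `"ha"`.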



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [Your profession or role] who is currently [Your current job title or role]. I enjoy [Your hobbies, interests, or what you enjoy doing for fun] and I also enjoy [Your personal talents or skills]. I'm always looking for new experiences and things to do, and I'm always willing to learn new things. I'm a [Your character trait or quality] and I believe in [Your values or beliefs]. I hope to [Your future goals or aspirations]. How would you describe yourself?
[Name]: Hello, my name is [Name] and I'm a [Your profession or

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is a cosmopolitan city known for its rich history, culture, and stunning architecture. It is the largest city in France and a UNESCO World Heritage site, featuring numerous iconic landmarks such as the Eiffel Tower, Notr
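
The asynchronous API is most useful when several requests are in flight at once. The sketch below issues one request per prompt and caps the number running concurrently. `StubEngine` is a hypothetical stand-in so the example runs without a GPU, and it assumes that `async_generate` also accepts a single prompt string and returns a dictionary with a `text` key.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; returns a canned completion."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0)  # yield control, as a real request would
        return {"text": f" completion for {prompt}"}


async def generate_all(engine, prompts, sampling_params, max_concurrency=2):
    """Run one request per prompt with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def generate_one(prompt):
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather() returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(generate_one(p) for p in prompts))


results = asyncio.run(
    generate_all(StubEngine(), ["a", "b", "c"], {"temperature": 0.8, "top_p": 0.95})
)
for result in results:
    print(result["text"])
```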

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name]. I'm a [insert profession] who has been working in this field for [insert number of years] years now. I love the work that I do and I enjoy the challenge of constantly learning and growing. I believe in the power of teamwork and I'm always willing to support and encourage my colleagues. I am a [insert hobby or interest]. I am happy to have the opportunity to learn and grow with you! Let's chat! [insert an opening line] Hello, and welcome to [insert name's name]. It's nice to meet you. [insert name] is a [insert profession], [insert

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is located in the center of the country, near the River Seine. It is the largest city in France and the second largest city in the world by population. The city is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. Paris is also a cultural center with a rich history, including the artistic and cultural scene of the French Quarter and the Eiffel Tower. The city is known for its innovative and modern architecture, including the Eiffel Tower and the Louvre Museum. It is a major transportation hub with a well-developed public transportation system,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and diverse, with many possible trends shaping the technology's future. Here are some potential areas of development and innovation:

1. More advanced models: As AI technology continues to evolve, researchers are pushing to develop more sophisticated models that can understand and respond to complex human behaviors, emotions, and social contexts.

2. Robust ethics and fairness: The development of AI systems that can be deployed in a responsible way, with consideration for diverse societal and cultural factors, is a growing concern.

3. Universal AI: The idea of a universal AI, capable of understanding human language, emotions, and other complex human behaviors, is becoming more and more
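
The streaming cell prints each chunk as it arrives. If you also need the full text afterwards, you can accumulate the chunks while displaying them. The sketch below shows that pattern with a hypothetical `fake_stream` async generator standing in for the engine's stream; both function names are illustrative only.

```python
import asyncio


async def fake_stream(text, chunk_size=4):
    """Hypothetical stand-in for a streaming call: yields pieces of `text`."""
    for start in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield text[start:start + chunk_size]


async def collect_stream(stream, on_chunk=None):
    """Consume an async stream, optionally reporting each chunk, and return the full text."""
    parts = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)


seen = []
full_text = asyncio.run(collect_stream(fake_stream("Paris is the capital."), seen.append))
print(full_text)
```

Passing a callback such as `lambda chunk: print(chunk, end="", flush=True)` as `on_chunk` reproduces the incremental display from the cell while still returning the complete string.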




In [6]:
llm.shutdown()
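
If an exception interrupts the script before this final cell, `llm.shutdown()` is never reached. A minimal sketch of guarding against that with a context manager follows. `managed_engine` and `StubEngine` are hypothetical names; the stub only records whether `shutdown()` was called so the example runs without a GPU.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in that only records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(factory):
    """Create an engine via `factory` and always shut it down on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


with managed_engine(StubEngine) as engine:
    pass  # generation calls would go here

print("Engine shut down:", engine.is_shut_down)
```

With the real engine, the factory could be something like `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`.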