# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you use the **Offline Engine** in IPython, Jupyter, or any other environment where an event loop is already running, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()
```
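
To see why the patch is needed, the following standard-library sketch reproduces the restriction that `nest_asyncio` works around: `asyncio.run()` refuses to start while another event loop is already running, which is the situation inside IPython or Jupyter. It does not touch SGLang; the function names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # An event loop is already running here, as it would be in a notebook,
    # so a nested asyncio.run() call is rejected by the standard library.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` before such code patches asyncio so that the nested call is allowed.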

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Calvin. I have a friend named Sarah. They both have their own hobbies and interests. Sarah enjoys reading books and painting. She likes to travel to new places and try new things. She has a love for the outdoors, especially hiking and camping. On the other hand, Calvin is a person who enjoys cooking, playing sports, and spending time with his family. He likes to watch sports on TV and goes to concerts with his friends. He also enjoys making fresh lemonade and sharing his recipes with others. 

Based on the information provided, who is the better writer? To determine who is the better writer between Calvin and Sarah, we
Prompt: The president of the United States is
Generated text:  trying to pick a new mascot for the United States. He comes up with the idea of having a bipedal bird that can stand on two legs to represent the United States. To test how well this idea works, he decided to use a duck. He wants to know if the duck will fly on its t
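
A common next step in offline batch inference is to persist the results. The sketch below assumes, as the cell above suggests, that the engine returns one dict per prompt, in prompt order, each with a `"text"` field. The sample data and the helper name `to_jsonl` are hypothetical, and no engine is involved.

```python
import json
from typing import Dict, List

# Hypothetical stand-ins for `prompts` and the list returned by
# llm.generate(prompts, sampling_params).
demo_prompts = ["Hello, my name is", "The capital of France is"]
demo_outputs = [{"text": " Alice."}, {"text": " Paris."}]


def to_jsonl(prompts: List[str], outputs: List[Dict]) -> str:
    """Pair each prompt with its completion and serialize as JSON Lines."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    records = (
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    )
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


jsonl = to_jsonl(demo_prompts, demo_outputs)
print(jsonl)
```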

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Age] year old [Occupation]. I'm a [Type of Character] who has always been [What motivates you to be a [Type of Character]].

I'm passionate about [What interests me or what I enjoy doing], and I believe that my unique combination of [What makes me unique] has led me to become the [Type of Character] I am today. I'm always looking for new challenges and opportunities to grow and learn, and I'm always eager to share my experiences and insights with others.

I'm a [Type of Character] who is always looking for ways to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament and the French National Museum. Paris is a bustling city with a rich history and culture, and it is a popular tourist destination. The city is known for its fashion, art, and cuisine, and it is a major economic center in Europe. Paris is a city that is both beautiful and exciting, and it is a must-visit destination for anyone interested in French culture and history. 

Therefore, the answer to the question "What is the capital city of

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased integration with human intelligence: As AI becomes more advanced, it is likely to become more integrated with human intelligence. This could lead to more sophisticated forms of AI that can learn from and adapt to human behavior.

2. Greater emphasis on ethical considerations: As AI becomes more advanced, there will be a greater emphasis on ethical considerations. This could lead to more stringent regulations and guidelines for the development and use of AI.

3. Increased use of AI in healthcare: AI is already being used in
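
The "overlap removal" mentioned above can be pictured as follows: if a streamed chunk begins with text that has already been received, that shared part is dropped before the chunk is appended. This is an illustrative reimplementation under that assumption, not necessarily what `stream_and_merge` does internally. The names `strip_overlap` and `merge_chunks` and the sample chunks are hypothetical.

```python
from typing import Iterable


def strip_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that is also a suffix of `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Each chunk repeats the tail of the text received so far.
demo_chunks = [" Paris", " Paris, which", " which is known"]
print(merge_chunks(demo_chunks))
```

One limitation of this approach: it cannot tell an accidental overlap from a legitimate repetition, so a chunk that genuinely repeats the preceding text would also be trimmed.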



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name], and I'm a [professional or personal] [role]. I have [number of years experience] years of experience working in the [industry]. I specialize in [major skill or expertise], which I use to [describe a specific success or accomplishment]. I'm a [level of professionalism] professional and always strive to [mention a positive trait or quality]. I'm [age], and I am passionate about [why you're passionate about this field]. I believe in [reason for passion], and I believe that [reason for passion] is important. I'm a team player and always aim to [mention something positive about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a city with a rich history dating back to ancient times. The city is home to the Louvre Museum, the Eiffel Tower, and many iconic landmarks such as Notre-Dame Cathedral. Paris i
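
One reason to prefer the asynchronous API is that several requests can be awaited together. The sketch below shows that pattern with `asyncio.gather`. `FakeEngine` is a hypothetical stand-in that only mimics the `async_generate(prompts, sampling_params)` call shape used above, so the code runs without a GPU; whether overlapping calls suit your workload on the real engine is an assumption to verify.

```python
import asyncio
from typing import Dict, List


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine; returns canned completions."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict
    ) -> List[Dict]:
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def run_batches(
    engine: FakeEngine, batches: List[List[str]], sampling_params: Dict
) -> List[List[Dict]]:
    # Start every batch at once and wait for all of them to finish.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


demo_batches = [["a", "b"], ["c"]]
results = asyncio.run(
    run_batches(FakeEngine(), demo_batches, {"temperature": 0.8, "top_p": 0.95})
)
for batch, outputs in zip(demo_batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']}")
```

`asyncio.gather` returns results in the order the awaitables were passed, so each result list lines up with its input batch.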

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [job title] at [Company]. I'm a seasoned professional with over [number] years of experience in [industry/field]. I have a strong work ethic and am dedicated to providing exceptional service to [specific clients or clients group]. I'm always looking for ways to improve myself and continuously learn new skills. I'm friendly, approachable, and I value teamwork and collaboration. I'm always eager to learn and adapt to new challenges. I am a [positioning] of [role]. Let me know if you'd like to know more about me! [Name]...

I'm [Name



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic landmarks like the Eiffel Tower and Louvre Museum, as well as its rich history and world-renowned fashion industry. It is a bustling city with a diverse population of about 11 million people and is one of the most important cities in the world. Paris is located in the Loire Valley region, on the north bank of the Seine River, and is known for its romantic ambiance and fine dining. It is a popular tourist destination and home to many world-renowned institutions, including the French Academy of Fine Arts, the Notre-Dame Cathedral, and the Louvre Museum. Paris is a



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to see significant advancements in several areas, including machine learning, natural language processing, and robotics. These advancements are expected to lead to new applications and applications in fields such as healthcare, finance, and manufacturing. AI is also expected to play a role in shaping the future of technology, with the ability to create new forms of interaction and interaction between humans and machines. Additionally, AI is expected to continue to be used for improving the efficiency of industries and processes, as well as reducing costs. Overall, the future of AI looks promising and will continue to shape the world in exciting and transformative ways. However, it is important to note that the
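
If you need the complete text after streaming, and not only the printed chunks, you can accumulate the chunks while displaying them. The sketch below shows the pattern with a hypothetical async generator, `fake_stream`, standing in for `async_stream_and_merge(llm, prompt, sampling_params)`. It assumes each yielded chunk is a plain string, as in the loop above.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(chunks: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming call: yields one chunk at a time."""
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield chunk


async def collect(stream: AsyncIterator[str]) -> str:
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream([" Paris", ",", " known", " for"])))
```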




In [6]:
llm.shutdown()