# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython, Jupyter, or any other code that already runs inside an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
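The reason for this requirement is that IPython and Jupyter already run an event loop, and `asyncio.run` refuses to start a second one inside it. The sketch below reproduces that failure using only the standard library, without SGLang; `inner` and `outer` are illustrative names and not part of the SGLang API.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` patches the loop so that such nested calls are allowed, which is why the snippet above is needed in notebooks.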

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Harriet and I’m a 31-year-old woman from New York. I’m now 18 years old and I'm a mother of one. I have a boyfriend who is a dancer and I get along with him very well. One day, I heard a comment about me that made me feel upset. It was a comment that said that my girlfriend was not good enough to have an actual partner. I was a little hurt and I couldn’t find the right words to express it. I wanted to tell my boyfriend the truth but I didn’t want to upset him. I didn’t want to tell him that I had a
Prompt: The president of the United States is
Generated text:  now the most powerful person in the world. However, I can't imagine that most people in the world have the same level of power. So, I will suggest a way to make people feel comfortable with their power.

The idea is to give everyone a chance to tell the president about any problem or challenge they face, to discuss their ideas and solutions, to ask for information, and to express their h
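For offline batch jobs it is often convenient to persist each prompt together with its completion. The following is a minimal sketch, assuming only what the example above shows: one result dict per prompt, in prompt order, with the generated string under the `"text"` key. The `to_jsonl` helper and the sample data are hypothetical and use the standard library only.

```python
import io
import json

# Stand-in data shaped like the example above: one dict with a "text" key per prompt.
prompts = ["The capital of France is", "The future of AI is"]
outputs = [{"text": " Paris."}, {"text": " uncertain."}]


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    buffer = io.StringIO()
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        buffer.write(json.dumps(record, ensure_ascii=False) + "\n")
    return buffer.getvalue()


jsonl = to_jsonl(prompts, outputs)
print(jsonl, end="")
```

With a real engine, the `outputs` list would be the return value of `llm.generate(prompts, sampling_params)` instead of the stand-in data.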

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm passionate about [reason why you're passionate about your job], and I'm always looking for ways to [what you're looking for in your job]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years] years of experience in [industry].

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history dating back to the Roman Empire and the Middle Ages. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also famous for its fashion industry, art, and cuisine. Paris is a major cultural and economic center in Europe and is home to many world-renowned museums, theaters, and restaurants. It is a popular tourist destination and a major hub for international business and diplomacy. Paris is a city of contrasts, with its modern architecture and vibrant culture blending with

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some possible future trends in AI include:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI technology continues to improve, we can expect to see even more widespread use of AI in healthcare.

2. AI in manufacturing: AI is already being used in manufacturing to improve efficiency and reduce costs. As AI technology continues to improve, we can expect to see even more widespread use of AI in manufacturing.

3. AI in finance: AI is already being used in
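The "overlap removal" mentioned in the cell above can be pictured as follows. This is a sketch of the general idea, not necessarily how `stream_and_merge` is implemented: before appending a chunk, drop whatever prefix of it already appears at the end of the accumulated text. The names `strip_overlap` and `merge_stream` and the fake chunks are hypothetical.

```python
def strip_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    limit = min(len(existing_text), len(new_chunk))
    for size in range(limit, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks):
    """Accumulate chunk texts, skipping any part that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk["text"])
    return merged


# Hypothetical chunks: each dict mimics a streamed result with a "text" field.
fake_stream = [{"text": "Hello"}, {"text": "lo wor"}, {"text": "world!"}]
merged = merge_stream(fake_stream)
print(merged)
```

A suffix/prefix heuristic like this can also trim text that legitimately repeats across a chunk boundary, so it is a trade-off rather than an exact reconstruction.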



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I am [Age]. I am a [occupation]. I'm [body type]. I'm [any experience or skills relevant to the character's background]. I have [any hobbies or interests relevant to the character's background]. I'm [any personality traits or attributes relevant to the character's background]. I'm [any unique physical or mental abilities or features]. I'm [any specific preferences or hobbies that make me unique to this role]. I'm [any awards, recognition, or accolades I've received as a result of this role].
Hey there, [Name], I'm [your name] from [your profession

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a city renowned for its rich history, architecture, and vibrant culture. It serves as the seat of the French government, national and international organizations, and the capital of the European Union. T
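One practical benefit of the asynchronous call is that other coroutines can make progress while generation is pending. The sketch below illustrates this with `asyncio.gather`; `FakeEngine` is a hypothetical stand-in that only mimics the shape of `async_generate` (a list of dicts with a `"text"` key) so the example runs without a model.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in that mimics an awaitable batch-generation call."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to wait on the model
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def heartbeat(ticks):
    """Unrelated work that keeps running while generation is pending."""
    for _ in range(3):
        ticks.append("tick")
        await asyncio.sleep(0)
    return len(ticks)


async def main():
    engine = FakeEngine()
    ticks = []
    # gather returns results in the same order as its arguments.
    outputs, tick_count = await asyncio.gather(
        engine.async_generate(["a", "b"], {"temperature": 0.8}),
        heartbeat(ticks),
    )
    return outputs, tick_count


outputs, tick_count = asyncio.run(main())
print(len(outputs), tick_count)
```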

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a computer programmer with a degree in computer science from [University/Institute]. I'm passionate about coding, problem-solving, and learning new technologies. My favorite hobby is playing video games, and I'm always up for solving puzzles. I have a lot of energy and always want to keep my brain active. I enjoy being outdoors and enjoying the outdoors, so I'm always on the go. Overall, I'm a kind and enthusiastic individual who thrives on challenging tasks and seeking new knowledge. I'm excited to get started on this adventure together. [Name] [Phone number here] [Email address here]



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the largest city and largest metropolitan area in Europe, located on the Seine River in the French Pyrenees mountains. It is a cosmopolitan and culturally rich city with a diverse population of around 11 million people, including 1.2 million immigrants. Paris is known for its stunning architecture, rich cultural heritage, and vibrant street life. It is a leading global city and a major economic and political hub, with a history dating back to the Roman Empire and the medieval era. Its skyline features iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum, which attract millions of



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by a combination of rapid innovation, scalability, and ethical considerations. Here are some possible trends in AI in the coming years:

1. Increased use of AI in healthcare: AI is already being used in medical diagnosis, treatment planning, and drug discovery. In the future, we may see increased use of AI in diagnosing diseases, predicting patient outcomes, and improving treatment plans.

2. Integration with other technologies: AI is already being integrated with other technologies like machine learning, natural language processing, and computer vision. In the future, we may see even more integration of AI with other technologies, such as voice recognition, robotics
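The asynchronous streaming pattern above, where only newly generated text is printed for each chunk, can be sketched with an async generator. This is an illustration under stated assumptions, not the implementation of `async_stream_and_merge`: `fake_async_stream` and `new_text_only` are hypothetical names, and each chunk is assumed to be a dict with a `"text"` field.

```python
import asyncio


async def fake_async_stream(chunks):
    """Hypothetical async source that yields result dicts one at a time."""
    for text in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield {"text": text}


async def new_text_only(stream):
    """Yield only the part of each chunk that has not been emitted yet."""
    emitted = ""
    async for chunk in stream:
        text = chunk["text"]
        limit = min(len(emitted), len(text))
        overlap = 0
        # Find the longest prefix of the chunk that already ends the emitted text.
        for size in range(limit, 0, -1):
            if emitted.endswith(text[:size]):
                overlap = size
                break
        fresh = text[overlap:]
        emitted += fresh
        if fresh:
            yield fresh


async def main():
    pieces = []
    async for piece in new_text_only(fake_async_stream([" Paris", " Paris,", ", the"])):
        pieces.append(piece)
    return pieces


pieces = asyncio.run(main())
print(pieces)
```

Printing each piece with `end=""` and `flush=True`, as the cell above does, produces continuous text as it arrives.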




In [6]:
llm.shutdown()
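
In a script, it can help to guarantee that `shutdown()` runs even when generation raises an exception. The following is a minimal sketch using `contextlib`; the `managed` helper and `FakeEngine` class are hypothetical and assume only that the engine object exposes the `shutdown()` method called above.

```python
from contextlib import contextmanager


@contextmanager
def managed(engine):
    """Yield the engine and always call its shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Hypothetical stand-in exposing only the shutdown() method used above."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


engine = FakeEngine()
try:
    with managed(engine):
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.closed)
```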