# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
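
The patch is needed because environments such as IPython already run an event loop, and the standard library refuses to start a second one from inside it. The following stdlib-only sketch reproduces that failure without SGLang; `inner` and `outer` are illustrative names, not part of any library.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are now inside a running event loop, similar to a notebook cell.
    coro = inner()
    try:
        # Without nest_asyncio, starting a nested loop raises RuntimeError.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

After `nest_asyncio.apply()` has been called, the same nested call is permitted, which is what the notebook cells in this document rely on.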

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Zavolgy and I am a designer, artist and writer. I work with digital and physical mediums, creating original digital and physical art, illustrations and prints, and online content for businesses and personal projects. I am also a member of the Society for Creative Reporting. I created the first digital magazine, "Art Out Loud," for the Journal of Creative Writing and I am a regular contributor to "People Magazine" and "Modern Art" magazines.
When I am not in the studio, I am a husband, dad of three, avid reader, traveler, and listener to pop culture podcasts. I love to share my love of stories with
Prompt: The president of the United States is
Generated text:  trying to decide how many military bases to have. There are currently 100 bases, and the number of bases is decreasing by 5 each year. However, the president wants to maintain a total of 500 military bases. How many military bases will there be after two years?

To determine the number of
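
For very large prompt lists it can be convenient to submit prompts in chunks, for example to write results to disk as they arrive. The sketch below is one possible way to do that and is not part of SGLang: `generate_in_chunks` and `fake_generate` are hypothetical helpers, and the only assumption about the engine is the call shape shown above (a list of prompts in, a list of dicts with a `"text"` key out).

```python
from typing import Callable, Dict, List, Sequence, Tuple

GenerateFn = Callable[[List[str], Dict[str, float]], List[Dict[str, str]]]


def generate_in_chunks(
    generate_fn: GenerateFn,
    prompts: Sequence[str],
    sampling_params: Dict[str, float],
    chunk_size: int = 2,
) -> List[Tuple[str, str]]:
    """Call generate_fn on fixed-size chunks and pair each prompt with its text."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    paired: List[Tuple[str, str]] = []
    for start in range(0, len(prompts), chunk_size):
        chunk = list(prompts[start : start + chunk_size])
        outputs = generate_fn(chunk, sampling_params)
        if len(outputs) != len(chunk):
            raise ValueError("expected one output per prompt")
        paired.extend((p, o["text"]) for p, o in zip(chunk, outputs))
    return paired


# Stand-in for llm.generate so the sketch runs without a GPU (hypothetical).
chunk_sizes_seen: List[int] = []


def fake_generate(prompts: List[str], sampling_params: Dict[str, float]):
    chunk_sizes_seen.append(len(prompts))
    return [{"text": f" echo: {p}"} for p in prompts]


demo_prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
paired = generate_in_chunks(
    fake_generate, demo_prompts, {"temperature": 0.8, "top_p": 0.95}
)
for prompt, text in paired:
    print(f"Prompt: {prompt}\nGenerated text: {text}")
```

With a real engine, `llm.generate` would be passed in place of `fake_generate`. Since the engine already schedules batches internally, chunking like this is mainly a host-side convenience.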

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [job title] at [company name]. I have been working at [company name] for [number of years] years. I am passionate about [job title] and have always been interested in [job title] since I was a [age] year old. I am always looking for new challenges and opportunities to grow and learn. I am a team player and enjoy working with people from all backgrounds and cultures. I am always looking for ways to improve my skills and knowledge in order to be a better fit for the job. I am excited to be a part of [company name] and contribute

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic Eiffel Tower and its rich history dating back to the Middle Ages. It is a bustling metropolis with a diverse population and a rich cultural heritage. The city is home to many famous landmarks such as the Louvre Museum, Notre-Dame Cathedral, and the Palace of Versailles. Paris is also known for its fashion industry, with many famous designers and fashion houses based in the city. The city is a popular tourist destination and a major economic center in Europe. It is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly together. Paris is a city of art, culture

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way that we interact with technology and the world around us. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased automation: One of the most significant trends in AI is the increasing automation of tasks that were previously done by humans. This could include tasks such as data analysis, decision-making, and problem-solving, as well as tasks that were previously done by machines.

2. Improved privacy and security: As AI becomes more integrated into our daily lives, there is a growing concern about the potential for AI to be used
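
The idea of "overlap removal" can be illustrated without the engine. The following is a minimal sketch of one way to merge streamed chunks whose beginnings may repeat the end of the text received so far. It is not SGLang's actual implementation of `stream_and_merge`; the helper names `trim_overlap` and `merge_chunks` are chosen here for illustration.

```python
from typing import Iterable


def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that existing already ends with."""
    max_overlap = min(len(existing), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, dropping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged = merge_chunks(["The capital", " capital of France", " is Paris"])
print(merged)  # The capital of France is Paris
```

The search starts from the longest possible overlap, so a repeated phrase is removed in full rather than only its last character.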



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [age] year old human with [profession or occupation]. I am a [occupation] and I have always been [what is a characteristic or trait] in my personality. I have [number] goals and objectives that I want to achieve, and I am [relatively positive or optimistic]. My communication style is [younger or more mature] and I enjoy [what is something you enjoy doing]. I am also [something you are passionate about] and I try to [what I am doing] to help [someone or something]. Thank you for asking! That's a great introduction! What can

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the largest city and most populous city in the European Union. It is located on the Île de la Cité and the Île de la Cité de Paris, France, and is the seat of the government, administration, and culture for Fr
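
When requests arrive independently rather than as one list, a common asyncio pattern is to issue them as separate coroutines and cap how many are in flight. The sketch below shows that pattern with the standard library only. `fake_async_generate` is a hypothetical stand-in for `llm.async_generate`, and it is an assumption here that the real coroutine can be awaited with a single prompt string.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

AsyncGenerateFn = Callable[[str, Dict[str, float]], Awaitable[Dict[str, str]]]


async def fake_async_generate(prompt: str, sampling_params: Dict[str, float]):
    # Stand-in for an engine call (hypothetical); yields control once.
    await asyncio.sleep(0)
    return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(
    async_generate_fn: AsyncGenerateFn,
    prompts: List[str],
    sampling_params: Dict[str, float],
    max_concurrency: int = 2,
) -> List[Dict[str, str]]:
    """Await one request per prompt, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(prompt: str) -> Dict[str, str]:
        async with semaphore:
            return await async_generate_fn(prompt, sampling_params)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(run_one(p) for p in prompts))


demo_prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
outputs = asyncio.run(
    generate_concurrently(fake_async_generate, demo_prompts, {"temperature": 0.8})
)
for prompt, output in zip(demo_prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Because `asyncio.gather` preserves input order, the results can be zipped with the prompts exactly as in the batch example above.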

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a [Occupation] and I work in [Your Company/Field]. In my spare time, I enjoy [Affinity or Interest]. My favorite hobby is [Favorite Activity], and it's been a passion for [How long have you been passionate about this?]. In my free time, I enjoy [Other Affinity or Interest]. I'm always looking for [Why do you want to learn more about this field?]. If you'd like to know more about me, I can share [What's your most exciting achievement so far?]. I'm a team player and enjoy helping others. If you're

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its iconic Eiffel Tower, iconic landmarks such as the Louvre Museum, and lively nightlife.
Paris is the capital city of France and is known for its iconic Eiffel Tower, iconic landmarks such as the Louvre Museum, and lively nightlife. (The answer is Paris, which is known for its iconic Eiffel Tower, iconic landmarks such as the Louvre Museum, and lively nightlife.)

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  bright and promising, with many possibilities and potential applications emerging. Some potential areas of future AI trends include:

1. Autonomous and semi-autonomous vehicles: As AI continues to improve, we can expect to see autonomous vehicles becoming more common and widespread. These vehicles will be able to navigate roads and highways on their own, reducing the risk of accidents and providing a safer transportation option.

2. Personalized health and wellness: AI will allow for more accurate and personalized health and wellness recommendations. This could include things like creating personalized workout plans, suggesting dietary changes, and even predicting specific health risks for individuals.

3. Emotional intelligence: AI can be used

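
Printing a stream "without repeats" amounts to emitting only the text that has not been printed yet. The sketch below shows this for the case where each streamed chunk carries the full text generated so far. That cumulative format is an assumption made for illustration, and `fake_cumulative_stream` and `yield_deltas` are hypothetical names, not SGLang APIs.

```python
import asyncio
from typing import AsyncIterator, Dict, List


async def fake_cumulative_stream() -> AsyncIterator[Dict[str, str]]:
    # Stand-in for a streaming engine call; each chunk repeats all prior text.
    for text in [" Paris", " Paris is", " Paris is the capital"]:
        await asyncio.sleep(0)
        yield {"text": text}


async def yield_deltas(stream: AsyncIterator[Dict[str, str]]) -> AsyncIterator[str]:
    """Yield only the part of each chunk that has not been seen before."""
    seen = 0
    async for chunk in stream:
        text = chunk["text"]
        delta = text[seen:]
        seen = len(text)
        if delta:
            yield delta


async def collect() -> List[str]:
    return [delta async for delta in yield_deltas(fake_cumulative_stream())]


deltas = asyncio.run(collect())
print("".join(deltas))
```

Tracking a character offset is enough when every chunk is a strict extension of the previous one. If chunks can overlap only partially, a suffix and prefix comparison is needed instead.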
In [6]:
llm.shutdown()
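
In a script, as opposed to a notebook, it can help to guarantee that `shutdown()` runs even when generation raises. The following is a small sketch of that pattern using a context manager. `engine_session` and `FakeEngine` are hypothetical names used so the example runs without a GPU; the only method assumed on the engine is the `shutdown()` call shown above.

```python
from contextlib import contextmanager


class FakeEngine:
    """Stand-in for an engine object; only records whether shutdown ran."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def shutdown(self) -> None:
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


fake = FakeEngine()
try:
    with engine_session(fake) as llm_like:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(fake.is_shut_down)  # True
```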