# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
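
As a rough illustration of the custom-server use case, the sketch below shows a framework-agnostic request handler that a server could wrap around an engine. It is not taken from the linked example: `StubEngine`, `handle_request`, and the JSON payload fields (`prompt`, `sampling_params`) are all hypothetical, and the stub only imitates the list-in, list-of-dicts-out shape of `generate` used later in this document.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params):
        # Mimic the shape used in this document: one dict with a "text" key per prompt.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def handle_request(engine, body: str) -> str:
    """Turn a JSON request body into a JSON response using the engine."""
    payload = json.loads(body)
    prompt = payload["prompt"]
    sampling_params = payload.get("sampling_params", {})
    outputs = engine.generate([prompt], sampling_params)
    return json.dumps({"text": outputs[0]["text"]})


response = handle_request(
    StubEngine(),
    json.dumps({"prompt": "The capital of France is", "sampling_params": {"temperature": 0.2}}),
)
print(response)
```

In a real server, a function like `handle_request` would be called from whatever web framework you choose, with an actual `sgl.Engine` in place of the stub.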



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following first:
```python
import nest_asyncio

nest_asyncio.apply()

```
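
To see why this is needed, the standard-library-only sketch below reproduces the underlying problem: calling `asyncio.run()` while an event loop is already running raises a `RuntimeError`. The coroutine names are illustrative only; `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # An event loop is already running here, similar to a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call into a running loop
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```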

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Shingo and I am a photographer. I specialize in landscape photography. I have traveled all over the world to photograph landscapes and history, taking millions of shots. I capture the beauty of the natural world. I want to capture the reality of the people, and those who make that reality. I love to share my love of nature with you, and let you know the extraordinary beauty and wonder that lies in every landscape.

This website is a work in progress and I am always adding new things and sharing new insights into how to photograph landscapes. Don't forget to follow me on social media, join me on Facebook, and follow my personal instagram
Prompt: The president of the United States is
Generated text:  considered to be the leader of the country, and the president of the United Kingdom is considered to be the leader of the country. However, there is no president of the United States or the United Kingdom who serves as the leader of the country. Thi
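
For downstream processing it can help to collect results into plain records rather than printing them. The helper below is a hypothetical convenience, not part of SGLang; it relies only on the fact shown above that each output is a dict with a `text` key, and it is exercised with hand-written sample outputs instead of a live engine.

```python
def collect_results(prompts, outputs):
    """Pair each prompt with its generated text (hypothetical helper)."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    return [
        {"prompt": prompt, "completion": output["text"].strip()}
        for prompt, output in zip(prompts, outputs)
    ]


# Hand-written stand-ins shaped like the outputs of llm.generate(...) above.
sample_prompts = ["The capital of France is", "Hello, my name is"]
sample_outputs = [{"text": " Paris."}, {"text": " Ada."}]

records = collect_results(sample_prompts, sample_outputs)
for record in records:
    print(record)
```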

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [occupation] who has been [number of years] in the industry. I'm passionate about [reason for passion], and I'm always looking for ways to [action or goal]. I'm excited to meet you and learn more about your interests and experiences. What's your name, and what's your profession? [Name] [Occupation] [Number of Years] [Reason for passion] [Action or goal] [Your name] [Occupation] [Number of Years] [Reason for passion] [Action or goal] [Your name] [Occupation] [Number of Years]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also the birthplace of the French Revolution and the home of the French language. Paris is a bustling metropolis with a rich cultural heritage and a diverse population. It is the largest city in France and the second-largest in the world by population. The city is known for its fashion, art, and cuisine, and is a major tourist destination. Paris is a city of contrasts, with its elegant architecture, vibrant nightlife, and diverse cultural scene. Its history and culture have made it a popular

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more advanced, it is likely to become more integrated with human intelligence, allowing for more complex and nuanced interactions between the 
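
The "overlap removal" mentioned in the cell above can be pictured as follows: if a streamed chunk repeats text that has already been received, only the new suffix is appended. The sketch below is an assumption about the general technique, not a copy of `stream_and_merge`; `trim_overlap`, `merge_chunks`, and the sample chunks are illustrative names and data.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Accumulate streamed chunks, skipping text that was already received."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk["text"])
    return merged


# Simulated stream in which the second and third chunks repeat earlier text.
fake_stream = [{"text": " Paris"}, {"text": " Paris, the"}, {"text": " the city"}]
merged_text = merge_chunks(fake_stream)
print(merged_text)
```

A heuristic like this can also drop text that is genuinely repeated at a chunk boundary, so it is best suited to streams where overlap between chunks is expected.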

### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], I am [Age] years old. I am a [Occupation] who is [Description of your character's personality or background]. I enjoy [reason why you like it]. I am always [how you like to be comfortable].
Life as an [occupation] is quite [something]. I have a wide range of interests, hobbies, and friends. I am always [how you like to be there]. I like to [what you like to do]. I love [how you feel about yourself].
My favorite place is [name of location]. I love [reason why you love it]. I also love [another reason

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located on the banks of the Seine River, and is the heart of the country. It has a rich history, including the famous Louvre Museum, Eiffel Tower, and Notre-Dame Cathedral. The city is known for its vibrant culture, beautiful architecture, and annual f
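
Because `async_generate` is awaitable, independent requests can in principle be issued concurrently from one coroutine. The sketch below shows that pattern with `asyncio.gather`. `FakeEngine` is a hypothetical stand-in that only imitates the call shape; in particular, returning a single dict for a single prompt is an assumption of the stub, so the example runs without a model.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in that imitates the async_generate call shape."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do some work
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    # Schedule one request per prompt and wait for all of them;
    # gather returns results in the same order as the inputs.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


demo_prompts = ["Hello, my name is", "The capital of France is"]
results = asyncio.run(
    generate_concurrently(FakeEngine(), demo_prompts, {"temperature": 0.8})
)
for prompt, result in zip(demo_prompts, results):
    print(prompt, "->", result["text"])
```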

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Age] year old [Occupation]. I'm known for my [Strength, Expertise, or Personality Traits] that I possess. I also have a love for [Something], which I consider my passion. [Describe something you enjoy doing, such as hiking, playing sports, or spending time with friends]. Thank you for asking!
I'm a [Name] at [Age]. I'm a [Occupation] with a passion for [Something]. I have [Strength, Expertise, or Personality Traits] that I enjoy using to help people and making the world a better place. I love


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Task: Prepare a complete sentence that begins with "The capital of France is Paris."

The capital of France is Paris.

This is a complete sentence that begins with "The capital of France is Paris." It provides the specific information that the capital of France is Paris. The sentence is grammatically correct and flows well within the given format.

In French, this would be phrased as "La capitale de la France est Paris."

I have included the capital city name in parentheses to maintain the structure of the original statement. This maintains the original statement's format while adding the capital city name for clarity and proper



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  undeniably vast and fascinating, with significant potential and potential challenges. Here are some possible trends that could shape the field:

1. Personalization: As AI becomes more advanced, it will become increasingly possible to tailor AI systems to individual users' needs and preferences. This could lead to a more personalized and context-aware AI system that adapts to the user's behavior and preferences over time.

2. Autonomous robots: Autonomous robots are expected to become more common in the future, with more and more applications expected to take advantage of this technology. This could lead to significant changes in employment and social structures.

3. Ethical considerations: As AI



In [6]:
llm.shutdown()
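
In scripts, one way to make sure `shutdown()` runs even if generation raises is to wrap the engine in a context manager. The sketch below is a suggestion rather than an SGLang feature: `managed_engine` and `DummyEngine` are hypothetical, and only the `shutdown()` call comes from this document.

```python
from contextlib import contextmanager


class DummyEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = DummyEngine()
try:
    with managed_engine(engine) as llm:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.is_shut_down)
```

With a real engine, the same wrapper would be used as `with managed_engine(sgl.Engine(model_path=...)) as llm:`.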