# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two common use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
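The core of such a server is a function that turns a request body into an engine call. The following is a minimal sketch of that idea, not the linked example: `EchoEngine` and `handle_generate` are hypothetical names, and the stand-in engine assumes that a single-prompt `generate` call returns a dict with a `"text"` field, as the outputs later in this document suggest.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompt, sampling_params=None):
        # Assumed return shape: a dict with a "text" field for a single prompt.
        return {"text": f" [completion for {prompt!r}]"}


def handle_generate(engine, raw_body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response body."""
    request = json.loads(raw_body)
    result = engine.generate(request["prompt"], request.get("sampling_params"))
    return json.dumps({"text": result["text"]}).encode("utf-8")


body = json.dumps({"prompt": "The capital of France is"}).encode("utf-8")
response = json.loads(handle_generate(EchoEngine(), body))
print(response["text"])
```

Because `handle_generate` only depends on the engine's `generate` method, it could be called from a route handler in whichever web framework you prefer.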



## Nest Asyncio
If you use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested-loop setting), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
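The underlying problem can be reproduced with the standard library alone. This is a minimal sketch that only uses `asyncio`: calling `asyncio.run()` while an event loop is already running, as happens inside a notebook cell, raises a `RuntimeError`. The names `inner` and `outer` are illustrative.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running loop here, as in a notebook cell.
    coro = inner()
    try:
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return f"RuntimeError: {exc}"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches `asyncio` so that such nested calls are permitted.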

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Alina, and I'm 18 years old. I'm a high school student at the University of Michigan. I have been attending classes successfully with a strong academic focus, and I am an active member of the Student Council of the United States Naval Academy. I love to play frisbee, and I'm a beginner in this sport. I'm also good at English and have a good command of vocabulary. My mother is a teacher, and I am curious to learn more about different languages.
I don't have any hobbies, and I like to stay busy with my studies and a lot of time with my friends. I'm
Prompt: The president of the United States is
Generated text:  seeking to deliver on his promise to help his children: \"I will choose a child to grow up with a high level of intelligence and with a strong character.\"\nWhat are the possible values of $x$ in the inequality $x^2 - 5x + 6 < 0$? To solve the inequality \(x^2 - 5x + 6 < 0\), we start by finding the roots of the corresponding quadratic equ
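The `sampling_params` used above set `temperature` and `top_p`. As a rough illustration of what these two knobs usually mean, here is a toy sketch of temperature scaling followed by nucleus (top-p) filtering over a made-up four-token distribution. It illustrates the general technique under simplifying assumptions and is not SGLang's sampling implementation; `top_p_filter` and `toy_logits` are hypothetical names.

```python
import math


def top_p_filter(logits, temperature, top_p):
    """Return the renormalized nucleus distribution as {token: probability}."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # Softmax with max-subtraction for numerical stability.
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    probs = {tok: weight / total for tok, weight in weights.items()}
    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    kept, mass = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        mass += prob
        if mass >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {tok: prob / mass for tok, prob in kept.items()}


toy_logits = {"Paris": 4.0, "Lyon": 2.0, "Nice": 1.0, "banana": -3.0}
nucleus = top_p_filter(toy_logits, temperature=0.8, top_p=0.95)
print(sorted(nucleus))
```

With these toy numbers only the two most likely tokens survive the filter; a lower `temperature` or `top_p` narrows the candidate set further.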

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [Age] year old [Gender] [Occupation]. I am a [Skill or特长] that I have honed over the years. I am [Favorite Activity] and I enjoy [Reason for Enjoyment]. I am [Favorite Book] and I love [Reason for Enjoyment]. I am [Favorite Movie] and I enjoy [Reason for Enjoyment]. I am [Favorite Music] and I love [Reason for Enjoyment]. I am [Favorite Sport] and I play [Sport] with [Friend or Family]. I am [Favorite Hobby] and I enjoy [Reason for Enjoyment

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, a historic city with a rich history and a vibrant culture. It is located on the Seine River and is the largest city in France by population. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. The city is also famous for its fashion industry, art scene, and its role in the French Revolution. Paris is a popular tourist destination and a cultural hub, attracting millions of visitors each year. It is a city that has played a significant role in French history and continues to be a major economic and cultural center in

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing for more sophisticated and nuanced decision-making. This could lead to more personalized and context-aware AI that can better understand and respond to the needs of individuals.

2. Enhanced privacy and security: As AI becomes more integrated with human intelligence, there will be increased concerns about privacy and security. There will be a need for more robust privacy and security measures to protect the data and information that is generated and processed by AI.

3. Greater emphasis on ethical considerations: As AI becomes more integrated with human intelligence,
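"Overlap removal" here refers to merging streamed chunks without repeating text that a previous chunk already delivered. The sketch below shows one way such a merge could work; `trim_overlap` and `merge_chunks` are hypothetical helpers written for illustration, and `stream_and_merge` may be implemented differently.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that existing_text already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, skipping any text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged_text = merge_chunks(["Hello wor", "world, I", ", I am"])
print(merged_text)
```

The check runs from the longest candidate overlap to the shortest, so a chunk that repeats several characters is trimmed as much as possible.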



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [role]! 

Tell me more about yourself, and I'll do my best to answer all of your questions! 

I'm [Name] and I've been in this industry for [number] years now. I've been a [specific role] for [number] years, and I've won [number] awards! 

I'm also [specific skill or interest] in this field, and I'm always looking for ways to [specific goal] in my work. 

I enjoy [specific hobby or activity], and I try to [specific behavior or characteristic] while doing so. 



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the northwestern part of the country, and is a major city of France with a population of over 2 million people. It is the most populous city in Europe, and one of the largest in the world. Paris is known for its rich culture, history, and cuisine, and is a major tourist
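The asynchronous interface is also convenient when requests arrive independently, for example in a custom server. The following sketch issues one request per prompt and awaits them together with `asyncio.gather`. It runs against `FakeAsyncEngine`, a hypothetical stand-in that assumes an `async_generate` coroutine returning a dict with a `"text"` field; it does not call SGLang.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in exposing an async_generate coroutine."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" [completion for {prompt!r}]"}


async def generate_concurrently(engine, prompts, sampling_params):
    # One awaitable per prompt; gather returns results in input order.
    pending = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*pending)


demo_prompts = ["The capital of France is", "The future of AI is"]
results = asyncio.run(
    generate_concurrently(FakeAsyncEngine(), demo_prompts, {"temperature": 0.8})
)
texts = [result["text"] for result in results]
print(texts)
```

`asyncio.gather` preserves the order of its arguments, so `texts[i]` corresponds to `demo_prompts[i]` even if requests finish out of order.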

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am [Age]. I am a [occupation] who has been in this industry for [number] years. I have always been passionate about [description of my interests and passions]. I enjoy [mention any hobbies or activities you enjoy]. I am [weight] pounds. I have [number of years of experience]. I am [addressable characteristics of a good friend]. My [personal characteristic or trait] is [mention something specific, such as your favorite color, your favorite hobby, your favorite movie, etc.]. Lastly, I am a [type of person] who is always [describe any personality traits or

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

**Facts about France's capital city:**

- It is the largest city in France and the third-largest city in Europe.
- It is the administrative centre of France.
- It is situated on the Left Bank of the Seine.
- It is known for its beautiful architecture, gastronomy, and music.

Paris is recognized as the most important city in the world for the arts, literature, film, and fashion. It is also the birthplace of many famous figures in France and throughout the world. Paris is one of the most visited cities in the world, with an estimated 90 million visitors annually.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  set to be driven by two key trends: the ubiquity of data and the democratization of access to AI technologies.

The first trend is the increasing ubiquity of data. With the rise of big data and machine learning, there is a growing recognition of the importance of collecting, processing, and analyzing vast amounts of data. This trend will likely lead to more sophisticated and personalized AI systems, as well as a greater reliance on data for decision-making.

The second trend is the democratization of access to AI technologies. AI systems are becoming more accessible to individuals and organizations, as the cost and complexity of building and training AI models have decreased significantly
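The consumption pattern in this example, `async for` over a stream of chunks, can be tried without a model. The sketch below uses `fake_stream`, a hypothetical async generator that stands in for the engine's stream, and collects the chunks into the full text.

```python
import asyncio


async def fake_stream(text, chunk_size=4):
    """Hypothetical stand-in for an engine stream: yields text in small pieces."""
    for start in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield text[start:start + chunk_size]


async def collect(stream):
    pieces = []
    async for chunk in stream:
        # A real consumer might print or forward each chunk here.
        pieces.append(chunk)
    return pieces


pieces = asyncio.run(collect(fake_stream(" Paris is the capital.")))
full_text = "".join(pieces)
print(full_text)
```

Because each chunk is handled as soon as it arrives, the same loop can forward partial output to a client while generation is still in progress.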




```python
llm.shutdown()
```
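In a script, it can be useful to guarantee that `shutdown()` runs even if generation raises an exception. One possible pattern is a small context manager, sketched below with `FakeEngine` as a hypothetical stand-in; `managed_engine` is not part of SGLang.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in that only records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(factory):
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()  # runs on normal exit and on exceptions


engine_ref = None
try:
    with managed_engine(FakeEngine) as engine:
        engine_ref = engine
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine_ref.is_shut_down)
```

With a real engine, the factory could be something like `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`, assuming the engine is constructed as shown earlier in this document.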