# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, a Jupyter notebook, or any other code that already runs inside an event loop, add the following first:
```python
import nest_asyncio

nest_asyncio.apply()

```
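
To see why the patch is needed, the sketch below uses only the standard library (no SGLang) to reproduce the error that a plain `asyncio.run` call raises when an event loop is already running, which is the situation inside IPython or Jupyter. The function names `inner` and `outer` are made up for this illustration.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # At this point an event loop is already running, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio.apply()` called beforehand, the nested call is expected to succeed instead of raising.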

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Jia. What is your name, please?
A. I'm sorry, but I don't know your name.
B. You're welcome, but I have no idea about your name either.
C. You're quite right; I know who you are. Your name is Jia, and you are welcome.
D. I'm a language model, I don't have a real name.
Answer:
C

In a hydraulic system, what is the primary function of the pressure oil?
A. Lubrication
B. Heat dissipation
C. Sealing
D. Power transmission
Answer:
D

In the past
Prompt: The president of the United States is
Generated text:  3/4 the age of the secretary. The president is 80 years old. How old is the secretary? Let's denote the age of the secretary by \( S \). According to the problem, the president is \(\frac{3}{4}\) the age of the secretary. The president is currently 80 years old. Therefore, we can set up the following equation to represent this relationship:

\[
\frac{3}{4}S = 80
\]

To find the age of the secretary \( S \), we need to solve this equation. First, 
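
The `temperature` and `top_p` entries in `sampling_params` control how each next token is drawn. As a rough illustration of the usual definitions (temperature scaling followed by nucleus filtering), and not of SGLang's actual sampler, the sketch below applies both to a toy distribution. The function name `top_p_filter` and the toy logits are assumptions made up for this example.

```python
import math


def top_p_filter(
    logits: dict[str, float], temperature: float, top_p: float
) -> dict[str, float]:
    """Return renormalized probabilities of the tokens kept by nucleus filtering."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = {tok: value / temperature for tok, value in logits.items()}

    # Softmax, shifted by the maximum for numerical stability.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    kept: dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    kept_total = sum(kept.values())
    return {tok: p / kept_total for tok, p in kept.items()}


# Toy logits (hypothetical values, not taken from any model).
toy_logits = {"Paris": 3.0, "Lyon": 2.0, "Rome": 1.0, "banana": -3.0}
filtered = top_p_filter(toy_logits, temperature=0.8, top_p=0.95)
print(filtered)
```

In this toy case the very unlikely token is filtered out and the remaining probabilities are renormalized before sampling.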

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Job Title] at [Company Name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [Age] year old, [Gender] and [Occupation]. I have a [Skill] in [Skill] and I enjoy [Favorite Activity]. I'm always looking for new challenges and opportunities to grow and learn. What's your favorite hobby or activity? I love [Favorite Activity] and I'm always looking for new ways to explore and discover new things. What's your favorite book or movie? I love [Favorite Book/Movie

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major center for art, culture, and politics, and is home to many world-renowned museums and attractions. Paris is a bustling metropolis with a rich history and a diverse population, making it a popular tourist destination. The city is also known for its cuisine, with many famous French dishes and restaurants serving up delicious meals. Overall, Paris is a city of contrasts and beauty that is a must-visit for anyone interested in French culture and history. 

Paris is the capital of

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and decision-making processes.

2. Enhanced natural language processing: AI will continue to improve its ability to understand and respond to natural language, leading to more sophisticated and intuitive interfaces.

3. Increased use of AI in healthcare: AI will be used to improve the accuracy and efficiency of medical diagnosis and treatment, leading to more effective and personalized healthcare.

4. Greater use of AI in transportation: AI will be used to improve the safety and efficiency of transportation systems
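
The "overlap removal" mentioned in the cell above refers to merging streamed chunks without duplicating text when a chunk repeats the tail of what has already been received. The sketch below shows one way such a merge could work. `strip_overlap` and `merge_stream` are hypothetical helpers written for this example, and they are not guaranteed to match what `stream_and_merge` does internally.

```python
from typing import Iterable


def strip_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that already ends `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_stream(chunks: Iterable[str]) -> str:
    """Concatenate chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Stand-in for a stream in which some chunks repeat earlier text.
fake_chunks = [" Paris", " Paris is", " is the capital", " of France."]
result = merge_stream(fake_chunks)
print(result)
```

The same trimming step applies unchanged to an asynchronous stream; only the loop that feeds it chunks differs.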



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name]. I am a [insert age] year old [insert occupation]. I enjoy [insert hobbies or interests]. My favorite hobby is [insert favorite hobby]. I am an [insert gender] and my birthday is [insert birthday date]. I love to [insert favorite activity]. I have always been [insert relevant personality trait]. And I'm always looking for [insert why you're interested in this field]. I'm a fan of [insert a book, movie, or other media] and I love to [insert reason why]. I believe in [insert religion, philosophy, or other beliefs]. I'm a [insert any

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

[Mark the correct answer] The correct answer is: Paris is the capital of France. 

This answer is factual and correct. 

To arrive at this answer, I examined the question and the provided statement about Paris. T
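
Because `async_generate` is awaited, it can be combined with standard `asyncio` tools, for example when requests arrive separately rather than as one batch. The sketch below illustrates the general pattern of issuing several requests concurrently with `asyncio.gather`. `fake_async_generate` is a stand-in coroutine invented for this example (it only sleeps and echoes the prompt), so the snippet runs without a model.

```python
import asyncio


async def fake_async_generate(prompt: str, sampling_params: dict) -> dict:
    """Stand-in for an engine call: waits briefly, then returns a dict with 'text'."""
    await asyncio.sleep(0.01)
    return {"text": f" <completion for: {prompt!r}>"}


async def run_concurrently(prompts: list[str], sampling_params: dict) -> list[dict]:
    # Schedule one request per prompt; gather returns results in input order.
    tasks = [fake_async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


demo_prompts = ["The capital of France is", "The future of AI is"]
demo_outputs = asyncio.run(run_concurrently(demo_prompts, {"temperature": 0.8}))
for prompt, output in zip(demo_prompts, demo_outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Since `asyncio.gather` preserves input order, results can be zipped back to their prompts exactly as in the batch example.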

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job title] at [company name]. I love [reason for job], and I'm constantly learning new things and growing as a person. I enjoy [something I do for fun], and I'm always looking for ways to make the world a better place. If you'd like to talk about my work, I'm always here to listen and learn. What's something you're passionate about? I love [reason for passion]. What's the coolest thing you've ever done? The coolest thing I've ever done was [reason for accomplishment]. What's your favorite hobby? I love [reason for



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

To answer the other options:

- Rome is the capital of Italy.
- Tokyo is the capital of Japan.

Please provide the correct answer. Paris is the capital of France.

To break it down:

- Paris is the capital of France, the largest country in Europe.
- It's known for its iconic landmarks like the Eiffel Tower and the Louvre Museum.
- French cuisine, fashion, and culture are deeply rooted in Paris.
- Paris is often called the "City of Light" and is home to many museums and art galleries.

The other options are incorrect for the following reasons:

- Rome is



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and constantly evolving. Some of the possible trends in AI include:

1. Increased autonomy and self-optimization: As AI becomes more complex and capable, it will become more autonomous and able to make decisions on its own without human intervention. This will be the case with AI-powered robots, autonomous vehicles, and advanced medical devices.

2. Integration with other technologies: AI will become more integrated with other technologies such as blockchain, internet of things (IoT), and quantum computing, leading to new possibilities such as secure communication, data analytics, and personalized user experiences.

3. Interoperability and interoperability: With the increasing amount of




```python
llm.shutdown()
```
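
The final step calls `shutdown()` once inference is finished. To make sure this also happens when an exception is raised partway through a run, one option is a small context manager. In the sketch below, `engine_session` and `DummyEngine` are hypothetical names introduced for illustration. `DummyEngine` only records whether `shutdown()` was called, so the snippet runs without SGLang.

```python
from contextlib import contextmanager


class DummyEngine:
    """Hypothetical stand-in that mimics only the shutdown() method."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def shutdown(self) -> None:
        self.is_shut_down = True


@contextmanager
def engine_session(factory):
    """Create an engine via `factory` and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


with engine_session(DummyEngine) as engine:
    pass  # generation calls would go here

print(engine.is_shut_down)
```

With the real engine, the factory could be a callable such as `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`.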