# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server adds unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
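
For context, notebook kernels typically have an event loop running already, and the standard library refuses to start a second one inside it. The stdlib-only sketch below reproduces that failure without SGLang. The names `inner` and `outer` are illustrative and not part of any SGLang API.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second loop inside the first
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Patching with `nest_asyncio.apply()` is meant to lift this restriction so that nested calls such as the one in `outer` can proceed.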

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Susan, I'm from the UK and I'm a student in New Zealand. I'm 16 years old and I hope to find a job in the future. My resume is simple - no extra information needed. However, I'm quite self-conscious and would like to improve my overall impression and confidence. I want to start by learning how to write a resume, which I'm rather good at. I have a friend who is good at writing resumes as well. She was recently a student in New Zealand and she says it was very helpful. Is she right? I've never really done any writing before, so I'm a bit
Prompt: The president of the United States is
Generated text:  a ____
A. political party
B. politician
C. member of the government
D. assistant of the government

To determine the correct answer, let's analyze each option:

A. Political party: While the President of the United States is indeed a member of a political party, it is not the only role.

B. Politician: While the President of the United States is a po
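
In an offline batch job the completions are usually stored rather than printed. The sketch below pairs each prompt with its completion and serializes one JSON object per line. It assumes only what the loop above shows, namely that each output is a dict with a `text` key. The helper `to_jsonl` and the sample values are hypothetical stand-ins, not SGLang API or real model output.

```python
import json

# Stand-in data shaped like the loop above: one dict with a "text" key per prompt.
demo_prompts = ["Hello, my name is", "The capital of France is"]
demo_outputs = [{"text": " Susan."}, {"text": " Paris."}]  # hypothetical values


def to_jsonl(prompts, outputs):
    """Pair each prompt with its completion; return one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    lines = [
        json.dumps({"prompt": p, "completion": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)


jsonl = to_jsonl(demo_prompts, demo_outputs)
print(jsonl)
```

The length check guards against `zip` silently dropping items if the two lists ever get out of step.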

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, a city with a rich history and culture. It is located in the south of the country and is known for its beautiful architecture, vibrant nightlife, and annual festivals. Paris is also a major center for business, finance, and art, and is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. The city is also known for its cuisine, with many famous dishes such as croissants, escargot, and escargot frites. Paris is a city of contrasts, with its modern and historic districts,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased focus on ethical AI: As more people become aware of the potential risks of AI, there will be a greater emphasis on developing AI that is designed to be ethical and responsible. This could involve developing AI that is designed to minimize harm to individuals and society as a whole, and that is transparent and accountable.

2. Integration of AI with other technologies: AI is already being integrated into a wide range of technologies, from smartphones and computers to healthcare and transportation. As more of these technologies become
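
The "overlap removal" mentioned in the cell above can be pictured as follows: if a streamed chunk begins with text that has already been received, that repeated prefix is dropped before the chunk is appended. The sketch below illustrates the idea on plain strings. `trim_overlap` and `merge_chunks` are hypothetical names for this illustration, and the code is not claimed to be the implementation behind `stream_and_merge`.

```python
def trim_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that `existing` already ends with."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk  # no overlap found, keep the chunk unchanged


def merge_chunks(chunks):
    """Concatenate chunks, removing text repeated across chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats the tail of the previous text.
demo_chunks = [" Paris", " Paris is", " is the capital"]
merged = merge_chunks(demo_chunks)
print(repr(merged))
```

One limitation of this approach is that a genuine repetition in the generated text (for example `"ha"` followed by `"ha"`) is indistinguishable from an overlap and would be trimmed.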



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm a [short, impactful, creative, and respectful] personality. My unique perspective and approach to problem-solving are evident in my responses, and I thrive on challenging the status quo with my ideas. My journey of self-discovery and growth has led me to believe that knowledge and understanding are the key to unlocking the potential of all individuals. I am a visionary, a strategist, and a problem solver who is always seeking to make the world a better place. I am confident, confident in my ability to find innovative solutions to complex problems and to build relationships with people who are passionate about making the world a better

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light and the City of Love. It is a large city with a rich history, famous 
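
An awaitable call is useful because the event loop can run other coroutines while it waits. The sketch below shows the general `asyncio` pattern with a stub: `fake_async_generate` is a hypothetical stand-in that only sleeps, so the example runs without a model and says nothing about how SGLang schedules requests internally.

```python
import asyncio


async def fake_async_generate(prompt: str) -> dict:
    """Hypothetical stand-in for an awaitable generation call."""
    await asyncio.sleep(0.01)  # simulate waiting on the engine
    return {"text": f" <completion for: {prompt}>"}


async def run_all(prompts):
    # gather runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(fake_async_generate(p) for p in prompts))


demo_results = asyncio.run(run_all(["a", "b", "c"]))
print([r["text"] for r in demo_results])
```

Because `asyncio.gather` preserves input order, results can still be paired with their prompts using `zip`, as in the cell above.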

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name here]. I'm a/an [insert occupation or profession] who is passionate about [insert something that you like about yourself]. What kind of hobbies or interests do you have?
I'm always looking for new experiences and challenges, so please feel free to ask me anything you like. You can write down your questions for me, I'm here to listen and answer them. Good luck with your [insert challenge or challenge of your dreams]. How can I help you today? [insert personality traits or qualities] I'm always up for a good laugh, so [insert a joke or humorous anecdote]. I'm excited to meet



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

That statement accurately captures the core facts about Paris, the most populous city and capital of France, where it is known for its iconic Eiffel Tower, beautiful canals, and rich cultural heritage. Paris serves as the political, cultural, and economic center of France, influencing decisions regarding government policy, international relations, and cultural institutions. It's a metropolis with a rich history that extends far beyond the city limits, with historical landmarks, museums, and major events such as the Notre-Dame Cathedral and the Louvre Museum. Paris plays a significant role in European and global politics. Let me know if you'd like me to



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  rapidly evolving and there are numerous possibilities and trends to consider. Here are some potential future trends that may shape the development of artificial intelligence:

1. Increased Personalization: As AI becomes more capable of understanding and learning from user data, it is expected to become increasingly personal to individuals, leading to a more personalized experience.

2. Autonomous Agents: AI-driven autonomous agents are expected to become more common in our daily lives, such as in vehicles, machinery, and even healthcare.

3. AI Ethics: As AI becomes more integrated into our lives, there will be a need for ethical considerations to be developed and implemented. This will include designing AI systems




In [6]:
llm.shutdown()
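
In a notebook, this last cell runs only if you reach it. In a standalone script, one option is to wrap the work in `try`/`finally` so that `shutdown()` is called even when generation raises. The sketch below demonstrates the pattern with `StubEngine`, a hypothetical stand-in that mimics only the `generate` and `shutdown` method names used in this document. It is not the real engine.

```python
class StubEngine:
    """Hypothetical stand-in exposing only the two methods used in this document."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": ""} for _ in prompts]

    def shutdown(self):
        self.closed = True


engine = StubEngine()
error = None
try:
    engine.generate([], {"temperature": 0.8})  # fails on purpose
except ValueError as exc:
    error = str(exc)
finally:
    engine.shutdown()  # runs whether or not generate() raised

print(engine.closed, error)
```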