# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
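
As a rough, framework-agnostic sketch of the custom-server idea (this is not the code in the linked `custom_server.py`): because the engine is an in-process object, a request handler can call it directly. The function `handle_generate`, the JSON field names, and `StubEngine` are all hypothetical. The stub stands in for `sgl.Engine` so the snippet runs without a GPU, and it only imitates the `generate(prompts, sampling_params)` call shape used later in this document.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params):
        # Mimics the shape used in this document: one dict with a "text" key per prompt.
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


def handle_generate(engine, body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response by calling the engine in-process."""
    request = json.loads(body)
    prompts = request["prompts"]
    sampling_params = request.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    response = {"texts": [out["text"] for out in outputs]}
    return json.dumps(response).encode("utf-8")


# Example request, as it might arrive from any web framework.
body = json.dumps({"prompts": ["The capital of France is"]}).encode("utf-8")
print(handle_generate(StubEngine(), body).decode("utf-8"))
```

Keeping the handler independent of the web framework means the same function can sit behind whichever server library the project already uses.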



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
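
For context, the patch is needed because `asyncio.run()` refuses to start when an event loop is already running, which is the situation inside an IPython or Jupyter cell. The following standard-library-only sketch reproduces that error without SGLang; `inner` and `outer` are illustrative names.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a Jupyter cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: expected to be rejected
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

After `nest_asyncio.apply()`, such nested calls are permitted, which is why the later cells in this document can call `asyncio.run(main())` from a notebook.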

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  David. I am the President of the United States. That is the highest title of a person in the United States. 

I am president because I am the head of government of the United States, and I have the power to make important decisions for the country. 

The first president of the United States was George Washington. He was the first president of the country and also became the first President of the United States. 

I have more than 40 years of experience in the government. I worked for President Obama in 2010 and he appointed me the President of the United States. 

I have to tell you something
Prompt: The president of the United States is
Generated text:  a ______.  A.  A.  Chief of Staff B.  B.  President C.  C.  Secretary of State D.  D.  President of the Senate
Answer: B.  B.  President

Please answer the following question: Where is the largest international organization of the United Nations that deals with human rights issues?  A. in New 
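
In an offline batch job the generations are usually written to disk rather than printed. A minimal sketch, assuming `outputs` is a list of dicts with a `text` key as in the loop above; `save_jsonl` and the sample data are hypothetical, and no engine is needed to run it.

```python
import json
import os
import tempfile


def save_jsonl(prompts, outputs, path):
    """Write one JSON object per line, pairing each prompt with its generated text."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt, output in zip(prompts, outputs):
            record = {"prompt": prompt, "text": output["text"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Placeholder data shaped like the return value of llm.generate(prompts, ...).
prompts = ["The capital of France is"]
outputs = [{"text": " Paris."}]

path = os.path.join(tempfile.mkdtemp(), "generations.jsonl")
save_jsonl(prompts, outputs, path)

# Read the file back to confirm the round trip.
with open(path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
print(records)
```

JSON Lines keeps each record independent, so a long batch can be inspected or resumed without parsing the whole file.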

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? I'm a [insert a short, positive, enthusiastic, or neutral description of your personality or skills]. I enjoy [insert a short, positive, enthusiastic, or neutral description of your hobbies or interests]. I'm always looking for new challenges and opportunities to grow and learn. What's your favorite hobby or activity? I'm always up for a challenge and love to try new things. What's your favorite book or movie? I love [insert a short

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous museums, theaters, and festivals throughout the year. Paris is a popular tourist destination and a major hub for international business and diplomacy. It is also home to many famous French artists, writers, and musicians. The city is known for its rich history, including the influence of the French Revolution and the influence of the French Revolution on the arts and culture of the world. Paris is a vibrant and dynamic city that continues to be a major center of global culture

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased automation: AI will continue to automate many tasks, from manufacturing to customer service, and will likely become more efficient and accurate as technology advances.

2. Enhanced human intelligence: AI will continue to improve in terms of its ability to understand and interpret human language, emotions, and behaviors, and will likely become more capable of empathy and emotional intelligence.

3. Personalization: AI will continue to improve in terms of its ability to personalize experiences for users, from personalized recommendations to targeted advertising.

4. Ethical and responsible AI: As AI becomes more integrated into our daily lives, there will
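
The "overlap removal" mentioned in the cell above guards against streamed chunks that repeat text already received. The sketch below shows one way to do this. It is an illustration under the assumption that a chunk may start with a suffix of the text accumulated so far, not a copy of `stream_and_merge` from `sglang.utils`; `trim_overlap` and `merge_chunks` are names chosen for this example.

```python
def trim_overlap(existing: str, new: str) -> str:
    """Return `new` without the longest prefix that duplicates a suffix of `existing`."""
    max_len = min(len(existing), len(new))
    for size in range(max_len, 0, -1):
        if existing.endswith(new[:size]):
            return new[size:]
    return new


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks, dropping any overlapping text."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks: the second and third each repeat part of what was already received.
chunks = [" Paris is", " is the capital", " capital of France."]
print(merge_chunks(chunks))
```

A purely textual heuristic like this cannot tell an accidental overlap from a genuine repetition (for example, "ha" followed by "ha"), so it only suits streams where repeated boundaries are known to be duplicates.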



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I am a digital AI assistant designed to assist with various tasks, including answering questions, generating text, and providing information. How can I assist you today? 
I am a versatile AI that can perform a variety of functions, including language translation, text completion, and information retrieval. I am here to provide you with the best assistance possible and I am always here to help. How can I assist you today? 
[Name] is a digital AI assistant designed to assist with various tasks, including answering questions, generating text, and providing information. Here's a short, neutral self-introduction for [Name]: 
Name:

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

Please expand on the cultural significance of Paris in French history and architecture. 

For example: 
- Paris is the bi
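
One reason to prefer the asynchronous call is that other coroutines can run while generation is in flight. The sketch below illustrates this with `asyncio.gather`. `StubAsyncEngine` is a hypothetical stand-in that only imitates the `await llm.async_generate(prompts, sampling_params)` call shape used above, so it runs without a model; `heartbeat` is likewise an invented placeholder for unrelated work.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; sleeps instead of running a model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend the model is busy
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


async def heartbeat(log):
    """Some unrelated work that proceeds while generation is awaited."""
    for i in range(3):
        log.append(f"tick {i}")
        await asyncio.sleep(0)


async def main():
    llm = StubAsyncEngine()
    log = []
    # gather() runs both coroutines concurrently and returns results in argument order.
    outputs, _ = await asyncio.gather(
        llm.async_generate(["Hello, my name is"], {"temperature": 0.8}),
        heartbeat(log),
    )
    return outputs, log


outputs, log = asyncio.run(main())
print(outputs[0]["text"], log)
```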

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through async_stream_and_merge, which removes overlapping text between
        # chunks, instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name]. I'm a [occupation or profession]. I have always been passionate about [mention an area of interest or hobby] and I've been pursuing this passion with great enthusiasm. I'm always learning and trying new things to grow as a person. I love having a great time with my friends, but I also enjoy going to the gym and eating healthy. I'm always looking for new ways to challenge myself and achieve my goals. If you have any questions or need any information, don't hesitate to reach out. #YourselfIntro

[If you're new to writing, you can skip this part. If you're already


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic Eiffel Tower and numerous art museums. Paris is also the birthplace of modern fashion, and has been an international capital since the 13th century. Additionally, Paris is a cultural hub for science, art, and literature, attracting visitors from all over the world. It is a popular tourist destination for its beautiful architecture, and its rich culinary traditions. Paris is also home to numerous museums, including the Louvre and the Musée d'Orsay, which are world-renowned cultural institutions. The city is also known for its vibrant nightlife and its annual Haute-Marne Carnival, which draws


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  complex and constantly evolving, with many possibilities and challenges. Here are some of the trends that are likely to shape the industry in the coming years:

1. Increased precision and accuracy: One of the most exciting areas of AI research is improving the precision and accuracy of AI systems. Advances in machine learning algorithms and neural networks are making it possible to perform complex tasks with greater accuracy than ever before.

2. Natural language processing: The ability to understand and respond to natural language has become increasingly important in many industries. AI systems are learning to understand language more deeply and use it to generate human-like responses.

3. Autonomous and semi-autonomous vehicles:
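
The `async for` pattern in the cell above works with any asynchronous generator. As a sketch that runs without a model, `fake_stream` below is a hypothetical generator that yields text pieces the way the loop above receives `cleaned_chunk` values, and `collect` (also a name invented here) shows how to keep the full text while still handling each piece as it arrives.

```python
import asyncio


async def fake_stream(pieces, delay=0.0):
    """Hypothetical stand-in for a streaming generation call: yields one piece at a time."""
    for piece in pieces:
        await asyncio.sleep(delay)
        yield piece


async def collect(stream, on_chunk=None):
    """Consume an async stream, optionally reporting each chunk, and return the full text."""
    parts = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)


seen = []
text = asyncio.run(
    collect(fake_stream([" Paris", ",", " known", " for"]), on_chunk=seen.append)
)
print(text)
```

Passing `print` with `end=""` and `flush=True` as the callback would reproduce the incremental display used in the cell above, while the return value keeps the complete string for later use.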




In [6]:
llm.shutdown()
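
If a script raises before reaching `llm.shutdown()`, the engine's resources may not be released. One way to guard against that is a small context manager that always calls `shutdown()`. `engine_session` and `StubEngine` below are hypothetical names, and the stub replaces `sgl.Engine` so the sketch runs anywhere.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether shutdown() ran."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(factory):
    """Create an engine with `factory` and shut it down even if the body raises."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


engine_ref = None
try:
    with engine_session(StubEngine) as llm:
        engine_ref = llm
        raise ValueError("simulated failure during inference")
except ValueError:
    pass

print(engine_ref.is_shut_down)
```

With the real engine, the factory could be something like `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`.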