# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested event loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
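
To see why this is needed, the following stdlib-only sketch shows that `asyncio.run()` refuses to start while an event loop is already running, which is the situation inside IPython or Jupyter. The names `inner` and `outer` are illustrative and not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, similar to a Jupyter cell.
    coro = inner()
    try:
        asyncio.run(coro)  # a nested asyncio.run() is rejected by default
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` beforehand patches asyncio so that such nested calls are allowed.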

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

[2026-01-17 07:14:25] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2026-01-17 07:14:25] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2026-01-17 07:14:25] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2026-01-17 07:14:28] INFO server_args.py:1646: Attention backend not specified. Use fa3 backend by default.
[2026-01-17 07:14:28] INFO server_args.py:2545: Set soft_watchdog_timeout since in CI
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.87it/s]
Capturing batches (bs=1 avail_mem=58.52 GB): 100%|██████████| 20/20 [00:01<00:00, 13.97it/s]


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Bob. I'm a student. I'm a very kind and honest man. I work in a bank. It's a very good place. I have a lot of friends there. I'm happy to talk about what I like and dislike. I also like to listen to music. I like to eat various kinds of food. I enjoy playing basketball. I like to learn from history. I like to read. I like to travel. My favorite place to eat is the Cha-Cha-Samba restaurant. I like to play tennis very well. I'm not a good student. I've had to work very hard to learn. I
Prompt: The president of the United States is
Generated text:  a very important person. He is in charge of the government and he makes important decisions to keep the country running smoothly. But he does not make all the important decisions on his own. He works with other important people to make the decisions. For example, he may listen to the ideas of other important people who are in the United Nations. He may talk to other important people from the Middle Eas
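
For very large prompt lists it can be convenient to process prompts in slices, for example to save intermediate results. The sketch below is one way to do that, under stated assumptions: `generate_in_chunks` and `fake_generate` are hypothetical helpers, not SGLang APIs. The stub mimics the call shape used above (a list of prompts and a sampling-parameter dict in, a list of dicts with a `"text"` key out), so the example runs without a model.

```python
from typing import Callable, Dict, List


def generate_in_chunks(
    generate: Callable[[List[str], Dict], List[Dict]],
    prompts: List[str],
    sampling_params: Dict,
    chunk_size: int = 2,
) -> List[Dict]:
    """Call `generate` on consecutive slices of `prompts`, preserving order."""
    results: List[Dict] = []
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        results.extend(generate(chunk, sampling_params))
    return results


def fake_generate(prompts: List[str], sampling_params: Dict) -> List[Dict]:
    # Stand-in for the engine so the sketch runs without a GPU or model.
    return [{"text": f" <completion of {p!r}>"} for p in prompts]


demo_prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
demo_outputs = generate_in_chunks(
    fake_generate, demo_prompts, {"temperature": 0.8}, chunk_size=2
)
for prompt, output in zip(demo_prompts, demo_outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Passing `llm.generate` in place of `fake_generate` should work as long as it is called the same way as in the cell above. Since the engine already schedules a batch efficiently, chunking is a convenience rather than a performance requirement.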

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your job or profession]. I enjoy [insert a short description of your hobbies or interests]. I'm [insert a short description of your personality or character traits]. I'm always looking for new experiences and learning new things. What's your favorite hobby or activity? I love [insert a short description of your favorite activity]. I'm always looking for new challenges and opportunities to grow and learn. What's your favorite book

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history dating back to the Roman Empire and the Middle Ages. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre Dame Cathedral, and Louvre Museum. The city is also famous for its fashion industry, art scene, and its role in the French Revolution. Paris is a bustling metropolis with a diverse population and a rich cultural heritage. It is a popular tourist destination and a major economic center in Europe. The city is home to many famous landmarks and museums, including the Louvre, the Musée d'Orsay

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some possible future trends include:

1. Increased use of AI in healthcare: AI is already being used to diagnose and treat diseases, and it has the potential to revolutionize the field of medicine. In the future, we may see even more advanced AI systems that can analyze medical data and provide personalized treatment plans.

2. AI in manufacturing: AI is already being used to optimize production processes and improve quality control. In the future, we may see even more advanced AI systems that can analyze data from sensors and machines to improve efficiency and
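
The "overlap removal" mentioned in the cell above can be illustrated with a small standalone sketch. This is an assumption about the general technique, not a copy of the `stream_and_merge` implementation: the helpers `trim_overlap_sketch` and `merge_chunks` are hypothetical names. The idea is to drop any prefix of a new chunk that repeats the tail of the text collected so far.

```python
from typing import Iterable


def trim_overlap_sketch(existing: str, new_chunk: str) -> str:
    """Return the part of `new_chunk` that does not repeat the tail of `existing`."""
    max_len = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


merged_demo = merge_chunks(["Hello", "lo wor", "world"])
print(merged_demo)
```

A suffix-prefix match like this cannot distinguish a true overlap from text that legitimately repeats, so it is only appropriate when chunks are known to overlap.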



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Occupation]. I'm confident and skilled in [Skill/Position], and I'm eager to learn more about what you're looking for.

I enjoy [Reason for Interest]. I'm enthusiastic about [Project/Goal]. And I'm always up for [Project/Goal]. I'm eager to help you achieve your goals. What's the point of having me here? My name is [Name], and I'm a [Occupation]. I'm confident and skilled in [Skill/Position], and I'm eager to learn more about what you're looking for.

I enjoy [Reason for Interest].

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the northwestern part of the country and is the most populous city in the country. It is also known as "la ville grande" because it is the largest city in France in terms of population. Paris is a historical, cultural, and artistic center of the wo
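
One reason to prefer the asynchronous API is that a generation call can be awaited alongside other coroutines. The sketch below shows the pattern with `asyncio.gather`, using a hypothetical stub `fake_async_generate` in place of the engine so it runs without a model. Whether a real engine call accepts a single prompt string per call is an assumption here.

```python
import asyncio
from typing import Dict, List


async def fake_async_generate(prompt: str, sampling_params: Dict) -> Dict:
    # Stand-in for an awaitable generation call; sleeps instead of running a model.
    await asyncio.sleep(0.01)
    return {"text": f" <completion of {prompt!r}>"}


async def generate_concurrently(prompts: List[str], sampling_params: Dict) -> List[Dict]:
    # gather() runs the coroutines concurrently and returns results in input order.
    tasks = [fake_async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


gathered = asyncio.run(
    generate_concurrently(["prompt A", "prompt B"], {"temperature": 0.8})
)
for item in gathered:
    print(item["text"])
```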

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm a [Your Profession] who loves [Your passion or hobby]. I enjoy [Your hobbies or interests] and am always looking for new adventures and experiences. I'm always ready to learn and grow as a person. How are you? That's my first question to get to know you better. Here's how I will respond: Hi [Recipient's Name], it's nice to meet you! My name is [Your Name] and I'm a [Your Profession] who loves [Your passion or hobby]. I enjoy [Your hobbies or interests] and am always looking for new adventures and experiences. I


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

A summary of the above text is:

Paris is the capital of France, the second most populous city in the world and a major international metropolis. It is the most visited city in the world by cruise ships and the world's most traveled city by millions of tourists. At 203,541 sq km (78,371 sq mi), Paris is the 12th largest city in the world by population. The city was founded in the 8th century as the capital of the Principality of Auvergne. It is located on the Right Bank of the Seine in the Î



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to continue to grow and evolve, with new technologies and applications emerging on a regular basis. Here are some possible future trends in AI:

1. Artificial intelligence that can learn from and adapt to new data sources: With the increasing amount of data available on the web, it's becoming increasingly challenging for AI to learn from it. As a result, it's possible that future AI systems will have the ability to learn and adapt to new data sources, allowing them to improve and adapt over time.

2. AI that can recognize and respond to diverse human emotions: As more people engage with AI-powered systems, there's a possibility that AI will
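
The consumption pattern used in the streaming cell above (iterate with `async for`, print each chunk immediately, keep the full text) can be tried without a model. In this sketch, `fake_stream` and `collect_stream` are hypothetical helpers; the stub yields fixed chunks in place of a real streaming generator.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(words: List[str]) -> AsyncIterator[str]:
    # Stand-in for a streaming generator; yields one text chunk at a time.
    for word in words:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word


async def collect_stream(stream: AsyncIterator[str]) -> str:
    parts: List[str] = []
    async for chunk in stream:
        print(chunk, end="", flush=True)  # show text as soon as it arrives
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = asyncio.run(collect_stream(fake_stream([" Paris", ".", " It", " is"])))
```

Printing with `end=""` keeps the chunks on one line, while the returned string holds the complete text for later use.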




In [6]:
llm.shutdown()