# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an additional HTTP server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
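
The patch is needed because environments such as Jupyter kernels already run an event loop, and standard `asyncio` refuses to start a second one from inside it. The following sketch uses only the standard library (no SGLang, no `nest_asyncio`) to reproduce that restriction; the function names `inner` and `outer` are made up for illustration.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like notebook code.
    coro = inner()
    try:
        # Starting a second loop from inside a running one is rejected.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # Avoid a "coroutine was never awaited" warning.
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio.apply()` in place, this kind of nested call is permitted, which is why the patch is recommended before using the engine in a notebook.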

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Clara. This is the first time I'm here. I don't know where I am or who I am. I don't know why I'm here. I don't know what my next move should be. I'm always wondering what I should do. I have no idea what to do. 

What is probably true about Clara? (If the question is unanswerable, reply "unanswerable"). Based on the given information, Clara is probably feeling lost or unsure about where she is and what her next step should be. 

The answer is: Clara is probably feeling lost or unsure about where she is and what her next
Prompt: The president of the United States is
Generated text:  30 years older than the president of Brazil. The president of Brazil is 30 years younger than the president of the United States. If the president of the United States is currently 248 years old, how old would the president of Brazil be in 5 years?
To determine the age of the president of Brazil in 5 years, we need to follow a step-by-step approach based on the giv

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your profession or role]. I enjoy [insert a short description of your hobbies or interests]. I'm always looking for new experiences and learning opportunities. What are some of your favorite things to do? I love [insert a short description of your favorite activities or hobbies]. I'm always looking for new challenges and opportunities to grow and learn. What's your favorite thing to do? I love [insert a short description of

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French National Library, and the French Parliament building. Paris is a bustling metropolis with a rich history and a diverse population, making it a popular tourist destination. The city is known for its fashion, art, and cuisine, and is a major hub for business and commerce. It is also home to many international organizations and institutions, including the European Parliament and the United Nations. Paris is a city of contrasts, with its modern architecture and historical landmarks blending

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased automation: AI will continue to automate many tasks, from manufacturing to customer service, and will become more efficient and accurate. This will lead to increased productivity and lower costs for businesses.

2. Enhanced human intelligence: AI will continue to improve its ability to understand and interpret human language, emotions, and behaviors. This will lead to more intuitive and personalized interactions with humans.

3. AI will become more integrated with other technologies: AI will become more integrated with other technologies, such as the Internet of Things (IoT), to create more intelligent and connected systems.

4. AI will become
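
The "overlap removal" mentioned in the cell above suggests that consecutive streamed chunks may repeat text that has already been received. The sketch below shows one way such merging could work. It is not taken from the SGLang source: the helper names `strip_repeated_prefix` and `merge_chunks` and the sample chunks are hypothetical, and the code runs without a model.

```python
def strip_repeated_prefix(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks while skipping any repeated overlap."""
    merged = ""
    for chunk in chunks:
        merged += strip_repeated_prefix(merged, chunk)
    return merged


# Hypothetical stream in which each chunk repeats the tail of the previous one.
chunks = [" Paris", " Paris, the city", " the city of light"]
print(merge_chunks(chunks))
```

Checking the longest overlap first keeps the merged text free of duplicated fragments even when a chunk repeats several previously seen words.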



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am [Your Age], [Your Experience], [Your Profession, including your area of expertise]. I am a [Your Area of Expertise], [Your Main Area of Expertise]. I have a deep passion for [Your Field of Interest], [Your Area of Interest], and I am an expert in [Your Subarea of Interest], [Your Subspecialty of Interest]. I am very experienced and can assist you in [Your Relevant Area of Assistance]. If you have any questions or need help, please feel free to ask me. Thank you! #Self-Introduction

---

Dear [Recipient's Name

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the most populous city and the largest metropolitan area in the European Union and has a population of approximately 2.7 million. The city is located in the western part of France and is home to the country's ruling house 
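
The asynchronous API is most useful when several requests arrive independently, as in a custom server built on top of the engine. The sketch below illustrates that pattern with `asyncio.gather`. To stay runnable without a GPU it uses `FakeEngine`, a hypothetical stand-in that only mimics the result shape seen above (a dict with a `"text"` key). Whether the real `async_generate` behaves identically for a single prompt is an assumption here.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine; it only mimics the result shape."""

    async def async_generate(self, prompt, sampling_params):
        # Simulate inference latency without loading a model.
        await asyncio.sleep(0.01)
        return {"text": f" <completion for {prompt!r}>"}


async def handle_requests(engine, prompts, sampling_params):
    # Create one coroutine per incoming request and await them together.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*tasks)


prompts = ["The capital of France is", "The future of AI is"]
results = asyncio.run(handle_requests(FakeEngine(), prompts, {"temperature": 0.8}))
for prompt, result in zip(prompts, results):
    print(f"Prompt: {prompt}\nGenerated text: {result['text']}")
```

Because each call is awaited rather than blocked on, the event loop can interleave the pending requests instead of processing them one after another.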

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Alex. I'm a seasoned professional who has spent the last few years working in tech, specializing in web development and cybersecurity. I have a deep understanding of modern web technologies and am skilled at troubleshooting and optimizing website performance. I'm also a strong leader, with a passion for creating innovative solutions for businesses and individuals alike. I believe that with hard work and dedication, anyone can achieve their goals, and that's what I aim to do with my life. Thank you for considering me for a job. I'm excited to meet you and discuss how I can contribute to your team or organization. Let me know if you'd like to talk more



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its architecture, art, and vibrant culture, with attractions such as the Eiffel Tower, Louvre Museum, and the Notre-Dame Cathedral.

*Note: The statement may vary based on the context, but this is an example of a common fact about Paris. The city is often referred to as "La Ville Flaque" due to its characteristic, flaky appearance from erosion. The statement above reflects this common French nickname.

The statement about the capital city is factually correct, and it's common to include specific attractions and cultural landmarks in a brief factual statement about the city. For example, some


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  very promising, but it is not certain what the exact future looks like. However, we can make some educated guesses about what the future of AI might hold.

One of the most exciting possibilities is the use of AI to create highly accurate, fast, and accurate simulations of the natural world. With the advancement of AI, we are able to create simulations that can mimic the behavior of complex biological systems such as the human body, climate systems, and ecosystems. This could be used for a variety of applications, such as food safety, medical research, and climate modeling.

Another area of potential for AI is the development of AI that can work in




```python
llm.shutdown()
```