# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
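
As a rough illustration of why the patch is needed, the sketch below uses only the standard library to show that `asyncio.run()` is rejected when an event loop is already running, which is the situation inside IPython or Jupyter. The helper names `loop_is_running` and `try_nested_run` are hypothetical and are not part of SGLang.

```python
import asyncio


def loop_is_running() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


async def try_nested_run() -> bool:
    """Return True if asyncio.run() is rejected inside a running loop."""
    coro = asyncio.sleep(0)
    try:
        # Without nest_asyncio, this raises RuntimeError because a loop is active.
        asyncio.run(coro)
    except RuntimeError:
        return True
    finally:
        coro.close()  # avoid a "coroutine was never awaited" warning
    return False


if __name__ == "__main__":
    print("loop running at top level:", loop_is_running())
    print("nested asyncio.run() rejected:", asyncio.run(try_nested_run()))
```

In a plain script `loop_is_running()` is false at the top level, so the patch is only needed in environments that start a loop for you.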

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Robert and I'm a cat. I'm 7 years old, and I'm very smart, very clever and very kind. I have my own favorite toy - a cat toothpaste tube (I'm not allowed to use it for anything! ), and I have my own name - The Cat Toothpaste. I'm very good with my tongue and I'm good with my nose. I always go there first in the morning and finish off the toothpaste there last in the afternoon. 

1. Can you find the toothpaste tube? 2. What does the cat think the toothpaste tube should be called? 3. What
Prompt: The president of the United States is
Generated text:  trying to decide between two different plans for increasing the national debt. The first plan involves increasing the amount of money that the government collects from the sale of bonds, while the second plan involves increasing the amount of money that the government collects from the sale of silver. The president wants to know which plan will lead to a higher national debt. Can you help him determ
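
For very large prompt lists, it can be convenient to submit prompts in chunks so that results can be saved incrementally. The sketch below is one possible pattern, not an SGLang feature: `batched` and `generate_in_chunks` are hypothetical helpers, and `generate` is assumed to be any callable with the same shape as `llm.generate(prompts, sampling_params)` that returns a list of dicts with a `"text"` key.

```python
from typing import Callable, Dict, Iterator, List, Sequence, Tuple


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def generate_in_chunks(
    generate: Callable[[List[str], Dict], List[Dict]],
    prompts: Sequence[str],
    sampling_params: Dict,
    chunk_size: int,
) -> List[Tuple[str, str]]:
    """Call `generate` once per chunk and pair each prompt with its text."""
    results: List[Tuple[str, str]] = []
    for chunk in batched(prompts, chunk_size):
        outputs = generate(chunk, sampling_params)
        results.extend(zip(chunk, (output["text"] for output in outputs)))
    return results
```

With a live engine this would be called as `generate_in_chunks(llm.generate, prompts, sampling_params, chunk_size=2)`. The engine already schedules requests within a call, so chunking is mainly useful for checkpointing results.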

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a bustling metropolis with a rich history and culture, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also a major center for fashion, art, and music, and is home to many world-renowned museums, theaters, and restaurants. The city is known for its diverse population, including French, Italian, and other nationalities, and is a major economic and cultural hub in Europe. Paris is a popular tourist destination, with millions of visitors each year, and is a UNESCO World Heritage site.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and decision-making processes.

2. Enhanced creativity and innovation: AI is likely to become more capable of generating creative and innovative solutions to complex problems, as well as performing tasks that were previously thought to be beyond the capabilities of humans.

3. Greater transparency and accountability: AI systems are likely to become more transparent and accountable, allowing users to understand how the system is making decisions and to hold the system accountable for its actions.

4. Increased ethical considerations:
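
The "overlap removal" mentioned in the cell above can be pictured as follows. This is a minimal sketch of one way to merge streamed chunks whose beginnings may repeat text that has already been received; it is not necessarily how `stream_and_merge` is implemented, and the names `trim_overlap_sketch` and `merge_chunks` are hypothetical.

```python
from typing import Iterable


def trim_overlap_sketch(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that already ends `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


if __name__ == "__main__":
    print(merge_chunks(["Hello, wor", "world", "ld!"]))
```

One caveat of this simple approach is that a genuine repetition at a chunk boundary (for example `"ha"` followed by `"ha"`) is indistinguishable from an overlap and would be trimmed.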



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [Age] year old self-proclaimed "perfect" individual. I am an [Occupation] with a [Brief Description of Your Occupation] and I hold [Achievement or Interest] in my heart. I am always ready to learn and improve, and I am passionate about [Favorite Hobby or Activity]. I have always been a hardworking individual, and I strive to be the best version of myself. I value relationships and people, and I am an [Favorite Person/Relationship] in my life. My journey towards becoming a better version of myself is ongoing, and I am always learning from my

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

Conclude your sentence with a question that encourages further exploration of French culture and history. 

Paris, often referred to as the city of love, is one of the most iconic cities in the wor
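
If you prefer to issue one asynchronous request per prompt rather than a single batched call, a common asyncio pattern is to bound the number of in-flight requests with a semaphore and collect results with `asyncio.gather`, which preserves input order. The sketch below uses a hypothetical stand-in coroutine, `fake_generate`, so that it runs without a model; in practice you would pass a coroutine function with the same shape as `llm.async_generate`.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Sequence


async def fake_generate(prompt: str, sampling_params: Dict) -> Dict:
    """Hypothetical stand-in for an async engine call; returns a dict with 'text'."""
    await asyncio.sleep(0)
    return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(
    prompts: Sequence[str],
    sampling_params: Dict,
    generate: Callable[[str, Dict], Awaitable[Dict]] = fake_generate,
    max_in_flight: int = 2,
) -> List[Dict]:
    """Run one request per prompt with at most `max_in_flight` running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def run_one(prompt: str) -> Dict:
        async with semaphore:
            return await generate(prompt, sampling_params)

    # gather() returns results in the same order as the prompts.
    return await asyncio.gather(*(run_one(prompt) for prompt in prompts))


if __name__ == "__main__":
    for output in asyncio.run(generate_concurrently(["a", "b", "c"], {})):
        print(output["text"])
```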

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [First Name] and I'm a [Job Title] with [Number of Years in Position] years of experience. I'm dedicated to excellence in [Specific Field or Skill], and I'm always eager to learn and grow. What can I expect from me as a new hire? I am willing to take on new challenges and be a valuable asset to your team. Let's build a successful career together. [Your Name] is looking to make new friends and explore the world through social media. [Your Name] is a [Number of Years in Position] year veteran of [Specific Field or Skill], with extensive experience in [Specific

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is a historical city known for its iconic Eiffel Tower, fashion industry, and bustling nightlife. As the French capital, Paris is a popular tourist destination and home to numerous cultural institutions, including the Louvre Museum. It is also a major transportation hub, with the Notre-Dame Cathedral being one of the city's most famous landmarks. Paris's reputation as a world-class city has made it a popular destination for business and international diplomacy. Paris's charm and cultural richness continue to attract millions of visitors each year. Its status as the capital of France has made it an essential hub for French politics, culture, and economy.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and changing at a rapid pace. Some potential future trends include:

1. Increased automation and robotics: With advancements in AI, it is likely that more tasks will be automated, making them more efficient and reducing the need for human intervention. This could lead to a greater reliance on robots and AI in various industries, such as manufacturing, healthcare, and transportation.

2. AI-innovation: The convergence of AI and other fields such as neuroscience, computer science, and engineering could lead to breakthroughs in fields such as artificial intelligence, nanotechnology, and biotechnology. This could have profound impacts on medicine, agriculture, and energy, among other

In [6]:
llm.shutdown()
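
In a script, it can help to make sure `shutdown()` runs even when generation raises an exception. The following is a minimal sketch using a context manager. `shutting_down` is a hypothetical helper, and `DummyEngine` is a stand-in used only so that the example runs without a model; the only assumption about the real engine is the `shutdown()` method shown above.

```python
from contextlib import contextmanager


@contextmanager
def shutting_down(engine):
    """Yield `engine` and always call its shutdown() method on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


class DummyEngine:
    """Stand-in object that only records whether shutdown() was called."""

    def __init__(self) -> None:
        self.closed = False

    def shutdown(self) -> None:
        self.closed = True


if __name__ == "__main__":
    dummy = DummyEngine()
    with shutting_down(dummy) as engine:
        pass  # run generation calls here
    print("shut down:", dummy.closed)
```

With a real engine, the same pattern would read `with shutting_down(sgl.Engine(model_path=...)) as llm:` followed by the generation calls.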