# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation
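
The four modes differ only in how results reach the caller. The sketch below illustrates the four call shapes in plain Python. It assumes a hypothetical `fake_tokens` stand-in instead of a real model, so it runs without SGLang, and none of the `demo_*` names belong to the SGLang API.

```python
import asyncio


def fake_tokens(prompt):
    # Hypothetical stand-in for a model: always "generates" three chunks.
    return [" alpha", " beta", " gamma"]


def demo_generate(prompt):
    # Non-streaming, synchronous: block, then return the full text.
    return "".join(fake_tokens(prompt))


def demo_generate_stream(prompt):
    # Streaming, synchronous: a generator yields chunks one at a time.
    for chunk in fake_tokens(prompt):
        yield chunk


async def demo_async_generate(prompt):
    # Non-streaming, asynchronous: an awaitable that resolves to the full text.
    await asyncio.sleep(0)  # give control back to the event loop once
    return "".join(fake_tokens(prompt))


async def demo_async_generate_stream(prompt):
    # Streaming, asynchronous: an async generator consumed with `async for`.
    for chunk in fake_tokens(prompt):
        await asyncio.sleep(0)
        yield chunk


async def collect_async(prompt):
    full = await demo_async_generate(prompt)
    streamed = [chunk async for chunk in demo_async_generate_stream(prompt)]
    return full, streamed


sync_full = demo_generate("hi")
sync_streamed = list(demo_generate_stream("hi"))
async_full, async_streamed = asyncio.run(collect_async("hi"))
print(sync_full, sync_streamed, async_full, async_streamed)
```

The synchronous variants suit simple scripts, while the asynchronous variants let the caller do other work while waiting.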

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
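
To give a rough idea of what such a server has to do, here is a framework-independent sketch of a request handler. It is not taken from the linked example. `EchoEngine` and `handle_generate` are hypothetical names, and the stand-in engine only mimics the `{"text": ...}` result items shown in this document.

```python
import json


class EchoEngine:
    # Hypothetical stand-in for an engine object; it does not run a model.
    def generate(self, prompt, sampling_params=None):
        return {"text": f" (echo) {prompt}"}


def handle_generate(engine, body: bytes) -> bytes:
    # Parse a JSON request, call the engine, and serialize a JSON response.
    request = json.loads(body)
    result = engine.generate(request["prompt"], request.get("sampling_params"))
    return json.dumps({"text": result["text"]}).encode("utf-8")


response = handle_generate(EchoEngine(), b'{"prompt": "Hello"}')
print(response.decode("utf-8"))
```

Keeping the handler free of any web framework makes it easy to test and to mount behind whichever HTTP library the server uses.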



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other environment where an event loop is already running (nested event loops), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
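
The patch is needed because standard `asyncio` refuses to start an event loop from inside one that is already running, and environments such as Jupyter already run one. The stdlib-only sketch below reproduces that failure when executed as a plain script. The `inner` and `outer` names are illustrative, and `nest_asyncio.apply()` patches `asyncio` so that this kind of nested call is allowed.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    coro = inner()
    try:
        # An event loop is already running here, so this call is rejected.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```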

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  John and I work as a journalist and public speaker. I have been a reporter and columnist at newspapers for 20 years, and have since become a writer and speaker. I have a PhD in psychology from the University of Michigan, and a master's in journalism from the University of Texas. I am the author of the book "The Complete Guide to Persuasive Speech: How to Get Your Message Heard and Stay in the News," and have given over 80 presentations on the subject, including 1000 minutes of speaking engagements at the national and international levels.
I have also written numerous articles in the "Journal"
Prompt: The president of the United States is
Generated text:  a member of which organization?
The president of the United States is a member of the United States Senate, which is part of the legislative branch of the government. The Senate is responsible for passing and voting on major pieces of legislation, such as the Foreign Relations of the United St
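
For offline batch jobs it is often convenient to persist each prompt together with its generation. The following is a minimal sketch, assuming each output is a dictionary with a `text` key as in the loop above. `write_jsonl` is a hypothetical helper, and the `sample_outputs` list is made-up stand-in data rather than real model output.

```python
import io
import json


def write_jsonl(prompts, outputs, fp):
    # Write one JSON object per line, pairing each prompt with its text.
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        fp.write(json.dumps(record) + "\n")


sample_prompts = ["The capital of France is"]
sample_outputs = [{"text": " Paris."}]  # stand-in data, not model output

buffer = io.StringIO()  # any writable text file object works here
write_jsonl(sample_prompts, sample_outputs, buffer)
lines = buffer.getvalue().splitlines()
print(lines)
```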

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? I'm a [insert a characteristic or trait that you are passionate about or enjoy doing]. I enjoy [insert a hobby or activity that you enjoy doing]. What do you like to do in your free time? I enjoy [insert a hobby or activity that you enjoy doing]. What is your favorite book or movie? I love [insert a favorite book or movie]. What are your hobbies? I enjoy [insert a hobby or activity that you enjoy

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history and is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also a major cultural and economic center, with a diverse population and a thriving arts scene. The city is home to many famous museums, including the Louvre, the Musée d'Orsay, and the Musée d'Art Moderne. Paris is also known for its food scene, with many famous restaurants and cafes serving traditional French cuisine. The city is also home to many international organizations and events

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased accuracy and precision: AI systems are becoming more accurate and precise in their predictions and decisions, leading to better outcomes in various fields such as healthcare, finance, and transportation.

2. Integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing for more complex and nuanced decision-making. This could lead to new forms of artificial intelligence, such as "intelligent agents" that can think and act like humans, but with the ability to learn and improve over time.

3. Personalization and customization: AI systems are becoming more personalized and
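
The cell above describes the merge as "overlap removal". One way such a merge could work is sketched below: before appending a chunk, drop the longest prefix of it that already ends the accumulated text. `strip_overlap` and `merge_chunks` are hypothetical helpers written for illustration, not the implementation of `stream_and_merge`, and the chunks are made-up data.

```python
def strip_overlap(existing_text, new_chunk):
    # Remove the longest prefix of new_chunk that is a suffix of existing_text.
    max_len = min(len(existing_text), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    # Accumulate chunks, appending only the part that is not already present.
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


chunks = ["The capital", " capital of France", " France is Paris."]
merged = merge_chunks(chunks)
print(merged)  # The capital of France is Paris.
```

Scanning from the longest candidate overlap downward means the first match found is the largest one, so repeated text is never appended twice.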



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [Occupation] with a passion for [What interests you about your occupation]. I am always looking to learn and expand my knowledge, so I strive to be a [Specific skill or characteristic]. If you have any questions about [What you do for a living], I would love to help you understand more about it. I believe that being a role model and mentor to others is very important to me, so I strive to be a [What you are as a role model or mentor]. I am passionate about [Why this field or career is important] and I believe that [What you believe is the

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is known for its iconic landmarks, such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral, as well as its rich cultural heritage, including French cuisine, art, and fashion. The city is als
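
Because the asynchronous call is awaited, it can be combined with other coroutines. The sketch below shows one common pattern: issuing several independent awaits concurrently while capping how many are in flight. `fake_async_generate` is a hypothetical stand-in for an awaitable generation call, and the example makes no claim about how the engine schedules requests internally.

```python
import asyncio


async def fake_async_generate(prompt):
    # Hypothetical stand-in: pretend to wait for a model, then reply.
    await asyncio.sleep(0.01)
    return {"text": f" reply to: {prompt}"}


async def run_all(prompts, max_concurrency=2):
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(prompt):
        # At most max_concurrency coroutines pass this point at a time.
        async with semaphore:
            return await fake_async_generate(prompt)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(one(p) for p in prompts))


demo_prompts = ["a", "b", "c"]
demo_outputs = asyncio.run(run_all(demo_prompts))
print([o["text"] for o in demo_outputs])
```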

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [occupation]. I've always loved [occupation], and I'm always ready to learn and grow in my field. I'm excited to be a [new role] and I'm always willing to learn new things. If you're interested in learning more about me, I'm ready to share my experiences and knowledge. How can I reach out to you? [Name] [Phone number] [Email address] [Website link] [LinkedIn profile link] [Optional: Show a profile picture or a photo of yourself that reflects your personality and interests] [Optional: Briefly mention your main skills or interests

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris, often referred to as "La City of Love" due to its romantic architecture and culture, is the capital city of France and serves as the largest city and economic center of the country. Its population is approximately 2.3 million people. Paris is the world's 2nd largest city and is often referred to as the "City of a Thousand Sees" due to its numerous views and landmarks. It has a rich and diverse cultural scene, with attractions such as the Louvre Museum and the Eiffel Tower, and is home to many world-renowned landmarks and museums. The city is also known for its

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be characterized by rapid advancements and further integration of AI into various industries, leading to new innovations and developments. Some potential trends that may emerge in the near future include:

1. AI integration with natural language processing: As AI continues to gain greater understanding of human language, it is expected to continue to integrate more with natural language processing, enabling machines to understand and interpret human speech and text more accurately.

2. AI in healthcare: AI-powered healthcare systems are already being developed and used to assist in diagnosis, treatment, and patient care. In the future, we may see even more advanced AI that can analyze and interpret medical data in real
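
The `async for` loop above prints each cleaned chunk as soon as it arrives. The sketch below illustrates the same consumption pattern without SGLang, under the assumption of a stream whose items are cumulative snapshots of the text so far. `fake_cumulative_stream` and `only_new_text` are hypothetical names, and the snapshots are made-up data.

```python
import asyncio


async def fake_cumulative_stream():
    # Hypothetical stand-in: each item repeats all text produced so far.
    for snapshot in ["Par", "Paris", "Paris is", "Paris is big."]:
        await asyncio.sleep(0)
        yield snapshot


async def only_new_text(stream):
    # Yield just the part of each snapshot that has not been seen before.
    seen = 0
    async for snapshot in stream:
        delta = snapshot[seen:]
        seen = len(snapshot)
        if delta:
            yield delta


async def collect():
    return [piece async for piece in only_new_text(fake_cumulative_stream())]


pieces = asyncio.run(collect())
print(pieces)  # ['Par', 'is', ' is', ' big.']
```

Wrapping one async generator in another keeps the consuming loop simple: it only ever sees text that is new.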




```python
llm.shutdown()
```
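
In a script, an exception raised before the shutdown call would skip it. A minimal sketch of guarding the call with a context manager follows. `DummyEngine` is a hypothetical stand-in that only records whether `shutdown()` ran, and `managed` is a hypothetical helper; neither is part of SGLang.

```python
from contextlib import contextmanager


class DummyEngine:
    # Hypothetical stand-in for an engine; it does not load a model.
    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def managed(engine):
    # Hand the engine to the `with` body and always shut it down afterwards.
    try:
        yield engine
    finally:
        engine.shutdown()


engine = DummyEngine()
try:
    with managed(engine):
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.closed)  # True
```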