# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other environment that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
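
The likely reason this is needed is that `asyncio.run()` refuses to start while an event loop is already running, which is the situation inside IPython and Jupyter. The sketch below reproduces that error with the standard library only; it does not import `nest_asyncio`, and the function names `inner` and `outer` are illustrative.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # Create the coroutine first so it can be closed cleanly if the nested run fails.
    coro = inner()
    try:
        # A loop is already running here, so starting a second one is rejected.
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()
        return f"RuntimeError: {exc}"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.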

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine.
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Rachel and I am 22 years old. I have been a volunteer for a year now. I have been a writer for the last six years, and the most interesting thing I've learned about writing is that I'm not good at it. I can't write great poetry or prose. I can write quick, wordy sentences. But I do have a long history of writing that I feel like I don't have a way to express or let out. I feel like I don't have a way to express myself. So, I've been thinking about how I can help myself better. I have a question that I have been
Prompt: The president of the United States is
Generated text:  a very important person in the country. He is supposed to make important decisions. He also has to do a lot of other work. But he has to keep his job. This is his job. But sometimes, he finds himself having a lot of personal problems. He may have to go to other countries. This is not always a good thing. He may go to a country where he feels very unhappy. He may have to go t
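
For larger workloads it can help to submit prompts in fixed-size groups so that results can be saved incrementally. The sketch below is engine-agnostic and rests on assumptions: `generate_fn` is a hypothetical stand-in for a callable such as `llm.generate`, and `fake_generate` only mimics the `{"text": ...}` output shape shown above so the example runs without a GPU.

```python
from typing import Callable, Dict, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start : start + size]


def run_in_batches(
    generate_fn: Callable[[List[str], Dict], List[Dict]],
    prompts: List[str],
    sampling_params: Dict,
    batch_size: int,
) -> List[Dict[str, str]]:
    """Call `generate_fn` once per group and keep each prompt with its text."""
    results = []
    for batch in chunked(prompts, batch_size):
        outputs = generate_fn(batch, sampling_params)
        for prompt, output in zip(batch, outputs):
            results.append({"prompt": prompt, "text": output["text"]})
    return results


def fake_generate(prompts: List[str], sampling_params: Dict) -> List[Dict]:
    # Hypothetical stand-in for an engine call; it returns canned text.
    return [{"text": f" <completion for: {p}>"} for p in prompts]


prompts = [f"Prompt {i}" for i in range(5)]
params = {"temperature": 0.8, "top_p": 0.95}
results = run_in_batches(fake_generate, prompts, params, batch_size=2)
print(len(results), results[0])
```

Since the engine already schedules requests internally, this kind of grouping is mainly a convenience for checkpointing long jobs rather than a throughput requirement.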

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic Eiffel Tower, Notre-Dame Cathedral, and diverse cultural scene. 

(Note: The statement should be a single, clear sentence that captures the essence of Paris's importance and cultural significance.) 

Please provide the French translation of the statement in the following format: "French statement about Paris's capital city: [French translation]". 

For example:
French statement about Paris's capital city: "Paris is known for its iconic Eiffel Tower, Notre-Dame Cathedral, and diverse cultural scene." 

French translation: "Paris est connue pour son Eiffel Tower, Notre-Dame

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased automation and efficiency: AI is expected to continue to automate many tasks, freeing up human workers to focus on more complex and creative work. This could lead to increased efficiency and productivity, as well as the creation of new jobs in areas like data analysis and machine learning.

2. Enhanced human-AI collaboration: AI is likely to become more integrated with human AI, allowing for more complex and nuanced interactions between humans and machines. This could lead to new forms of collaboration and communication, as well as the development of new forms of AI that can better understand and respond to human emotions and motivations
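
The "overlap removal" in this example matters when successive streamed chunks repeat text that was already emitted. The following is a minimal sketch of one way to merge such chunks, not necessarily how `stream_and_merge` is implemented; `trim_overlap` and `merge_chunks` are illustrative names, and the chunk list is made-up data.

```python
from typing import Iterable


def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of `new_chunk` that `existing` already ends with."""
    max_len = min(len(existing), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while skipping text that repeats the current tail."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Made-up chunks in which each piece partly repeats the previous one.
chunks = [" Paris", " Paris, the city", " city known for", " its culture."]
merged = merge_chunks(chunks)
print(merged)
```

A suffix match can also trim characters that were legitimately repeated, so this approach suits overlapping or cumulative streams rather than streams of pure deltas.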



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a [role] at [company name]. I'm excited to work with you.

I'm a [role] at [company name]. I'm excited to work with you. Can you tell us a bit about yourself? I'm [name] and I'm a [role] at [company name]. I'm [age] years old and I love [role]! What can you tell me about yourself? 

I'm [name] and I'm a [role] at [company name]. I'm [age] years old and I love [role]! What can you tell me about yourself

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "La Papouelle."

This statement encapsulates the main facts about Paris:

1. It is the capital city of France.
2. Its name is derived from "Papouelle," a fictional village in a French fairy tale.
3. It is the most populous city in France.
4. It is the seat of the French government and a major financial center.
5. Paris is renowned 
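
One reason to prefer the asynchronous API is that independent requests can be awaited concurrently alongside other work. The sketch below shows the pattern with `asyncio.gather`; `fake_async_generate` is a hypothetical stand-in that imitates the `{"text": ...}` result shape, so nothing here calls SGLang itself.

```python
import asyncio
from typing import Dict, List


async def fake_async_generate(prompt: str, sampling_params: Dict) -> Dict[str, str]:
    # Hypothetical stand-in for an awaitable engine call; it simulates latency.
    await asyncio.sleep(0.01)
    return {"text": f" <completion for: {prompt}>"}


async def generate_all(prompts: List[str], sampling_params: Dict) -> List[Dict[str, str]]:
    # Start one awaitable per prompt; gather returns results in input order.
    tasks = [fake_async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


prompts = ["The capital of France is", "The future of AI is"]
outputs = asyncio.run(generate_all(prompts, {"temperature": 0.8, "top_p": 0.95}))
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Whether per-prompt calls are preferable to a single batched `async_generate` call depends on the workload; the batched form shown earlier is the simpler default.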

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm an AI assistant designed to help people find information and answer their questions. I'm here to assist you with anything you need. How can I assist you today? What's the most important thing that I can do to make your experience with me as pleasant as possible? Let's get started! You're welcome. Let's get started! Hello, my name is [Name], and I'm an AI assistant designed to help people find information and answer their questions. I'm here to assist you with anything you need. How can I assist you today? What's the most important thing that I can do to make

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

- The statement is factually accurate.
- It does not contain any inaccuracies or omitted information.
- It is presented clearly and in a concise manner.
- It does not deviate from the provided context.

Is the statement "Paris is the capital of France" accurate? Yes, the statement "Paris is the capital of France" is accurate.

- The statement accurately describes the capital city of France.
- It provides the name of the capital city and the country it belongs to.
- It does not include any additional information or context beyond what is explicitly stated.

Is there any other information in the provided text that

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be highly promising, with many exciting developments on the horizon. Here are some possible future trends in AI:

1. Increased Personalization: As AI technology improves, we can expect to see more personalized experiences and services, including personalized recommendations and targeted advertising. This will require us to be more efficient at collecting and analyzing data, as well as developing better algorithms for analyzing and understanding user behavior.

2. More Advanced Machine Learning: AI is getting better at understanding and learning from data, which means that we can expect to see even more sophisticated and accurate models as technology advances. This will require us to continue investing in research and development to stay



```python
llm.shutdown()
```
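
To make sure `shutdown()` still runs when generation raises an exception, the call can be wrapped in `try`/`finally` or in a small context manager. This is a sketch under assumptions: `managed_engine` is a hypothetical helper, not part of SGLang, and `StubEngine` stands in for `sgl.Engine` so the example runs anywhere.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing only generate() and shutdown()."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.closed = True


@contextmanager
def managed_engine(factory):
    """Create an engine and always shut it down when the block exits."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


with managed_engine(StubEngine) as engine:
    outputs = engine.generate(["Hello, my name is"], {"temperature": 0.8})

print(len(outputs), engine.closed)
```

With the real engine, the factory could be a lambda that constructs `sgl.Engine` with the desired `model_path`.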