# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in the form of a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
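
The patch is needed because `asyncio.run()` refuses to start when an event loop is already running in the current thread, which is the situation inside an IPython or Jupyter cell. The following is a minimal sketch that reproduces the error using only the standard library; the coroutine names `inner` and `outer` are illustrative and not part of SGLang.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # Raises RuntimeError unless nest_asyncio is applied.
    except RuntimeError as exc:
        coro.close()  # Avoid a "coroutine was never awaited" warning.
        return f"RuntimeError: {exc}"
    return "nested run succeeded"


result = asyncio.run(outer())
print(result)
```

After `nest_asyncio.apply()`, the nested call is allowed, which is why the asynchronous cells in this document can call `asyncio.run(main())` from a notebook.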

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Sam. I was born in 1976. I have three children, and my wife works as a graphic designer. I like to travel and see new things. I love reading and taking long walks in the park. I like to cook and eat all kinds of delicious foods. I also have a few hobbies. One is to write blogs, and I make my blog posts on serious business topics. I also write something on a website about my cooking and travel. 
What is the main reason why Sam enjoys reading? 

pick from the following. a. To learn new things about business. b. To find out interesting facts
Prompt: The president of the United States is
Generated text:  a president who represents a president who is the president of the United States, not the president who is the president of a country. Now, we need to find the president of the United States who is the president of a country.
To solve this problem, we need to identify the president of the United States who is also the president of a country. 

The
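
The `temperature` and `top_p` entries in `sampling_params` control how the next token is drawn. As a rough illustration of the general idea (not SGLang's actual sampler), the sketch below applies temperature scaling and nucleus (top-p) filtering to a toy distribution. The helper names `apply_temperature` and `top_p_filter` and the logits are made up for this example.

```python
import math


def apply_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # Subtract the max for numerical stability.
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize over the surviving tokens; all other tokens are dropped.
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # Toy scores for a four-token vocabulary.
sharp = apply_temperature(logits, 0.2)
soft = apply_temperature(logits, 0.8)
nucleus = top_p_filter(soft, 0.95)
print(max(sharp), max(soft), sorted(nucleus))
```

A lower temperature concentrates probability on the most likely token, while a lower `top_p` removes more of the unlikely tail before sampling.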

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Job Title] at [Company Name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [Age], [Gender], [Nationality], [Occupation], and I have [Number] years of experience in [Field of Work]. I'm always looking for new challenges and opportunities to grow and learn. What's your favorite hobby or activity? I enjoy [Favorite Activity], and I'm always looking for new ways to expand my skills and knowledge. What's your favorite book or movie? I love [Favorite Book/Movie], and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in Europe and the third largest city in the world by population. Paris is known for its rich history, beautiful architecture, and vibrant culture. It is home to many famous landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. Paris is also a major transportation hub, with many ma

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased automation: AI is expected to become more and more integrated into the production process, from manufacturing to healthcare. This will lead to increased automation of tasks, which will require more human oversight and control.

2. AI ethics: As AI becomes more advanced, there will be a growing concern about its ethical implications. This will lead to increased regulation and oversight of AI development and deployment.

3. AI for human benefit: AI is likely to be used for human benefit, such as in healthcare,
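
The "overlap removal" mentioned in the cell above refers to merging streamed chunks when a new chunk may repeat text that has already been received. The sketch below shows one way to do this by trimming the longest overlap between the end of the accumulated text and the start of the new chunk. The names `drop_overlap` and `merge_chunks` and the chunk list are hypothetical; this is an illustration of the idea and not necessarily how `stream_and_merge` is implemented.

```python
def drop_overlap(existing: str, new_chunk: str) -> str:
    """Remove from new_chunk the longest prefix that already ends existing."""
    limit = min(len(existing), len(new_chunk))
    for size in range(limit, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


# Made-up chunks that partially repeat the text streamed so far.
chunks = [" Paris", " Paris is", " is the capital", " capital of France."]
merged = merge_chunks(chunks)
print(merged)
```

With a real engine, the chunks would come from iterating over a streaming generation call instead of a fixed list.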



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm here to assist you in any way that I can. How can I assist you today? Is there anything in particular you're looking for that I can help with? As an AI language model, I'm always here to help you, and I'm here to provide you with the best possible assistance possible. So, what's one thing in particular you're looking for? Whether it's information about a specific topic, a query about a problem you're facing, or simply a friendly chat, I'm here to help. So, what's one thing you're looking for today? I'm here to help

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is known for its iconic landmarks such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. The city is also famous for its rich cultural heritage and its annual Fête de la Saint-Jean festival. It is one
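
One reason to use the asynchronous interface is that independent requests can be awaited concurrently from the same event loop. The sketch below illustrates the pattern with `asyncio.gather`. The coroutine `fake_async_generate` is a hypothetical stand-in that only sleeps; with a real engine, the awaited call would be `llm.async_generate(prompt, sampling_params)` as in the cell above.

```python
import asyncio


async def fake_async_generate(prompt: str, delay: float = 0.05) -> dict:
    """Hypothetical stand-in for an engine call; it only sleeps."""
    await asyncio.sleep(delay)
    return {"text": f" <completion for {prompt!r}>"}


async def run_concurrently(prompts):
    # gather schedules all coroutines at once and preserves input order.
    return await asyncio.gather(*(fake_async_generate(p) for p in prompts))


prompts = ["Hello, my name is", "The capital of France is"]
outputs = asyncio.run(run_concurrently(prompts))
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Because `asyncio.gather` returns results in the order of its arguments, zipping the outputs with the prompts stays correct even if individual requests finish at different times.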

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I come from [Country], and I am [Age], but I'm not quite sure how old I am. How did you get to where you are now? I'd love to hear about your life and experiences. How do you feel about yourself? I'm trying to determine how you feel about yourself, but I'm not quite sure where to start. I'd love to hear your thoughts and experiences too. How do you plan on finding the answers? I'm interested in learning more about you. What do you want to know? I'd love to hear your answers. I'm looking forward to hearing from you.




Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Note: The French capital is also known as the "City of Light" due to its iconic status and cultural significance.

The statement uses the information provided to create a concise, factual statement that captures the essence of the capital city. It highlights the capital's role as the capital of France and its reputation as the city of light, which refers to its historical importance and cultural significance.

The statement is concise and captures the key information provided in the prompt, making it easy for readers to quickly understand the capital city's status. Additionally, it's concise enough to be understood by those who may not have a prior knowledge



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be full of exciting possibilities. Some possible future trends in AI include:

1. Autonomous vehicles: AI is already being used in autonomous vehicles, and it is expected to become even more sophisticated and autonomous in the future. Self-driving cars, drones, and trucks could be expected to make transportation more efficient and reduce traffic accidents.

2. Smart cities: AI is being used to improve the efficiency and sustainability of cities by predicting traffic patterns, managing waste, and providing data on public health.

3. Medical diagnosis and treatment: AI is already being used in medical diagnosis and treatment, with models able to identify diseases and predict treatment outcomes. AI




In [6]:
llm.shutdown()
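
In a script, it can be useful to guarantee that `shutdown()` runs even when generation raises an exception. The sketch below shows one possible pattern using a small context manager. Both `managed_engine` and `DummyEngine` are hypothetical names introduced for illustration; with a real engine, the factory could be something like `lambda: sgl.Engine(model_path=...)`.

```python
from contextlib import contextmanager


class DummyEngine:
    """Hypothetical stand-in exposing only the methods used below."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for {p!r}>"} for p in prompts]

    def shutdown(self):
        self.closed = True


@contextmanager
def managed_engine(factory):
    """Create an engine and always shut it down when the block exits."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()  # Runs on normal exit and on exceptions.


with managed_engine(DummyEngine) as llm:
    outputs = llm.generate(["Hello"], {"temperature": 0.8})

print(outputs[0]["text"], llm.closed)
```

The `try`/`finally` inside the helper is what releases the engine on both the success path and the error path.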