# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you use the **Offline Engine** in IPython, or in any other code that already runs an event loop, add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
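
The sketch below uses only the standard library (no SGLang involved) to illustrate the situation this patch addresses: `asyncio.run()` refuses to start when an event loop is already running, which is the case inside IPython or Jupyter. The names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # A loop is already running at this point, as it is inside IPython or Jupyter.
    coro = inner()
    try:
        asyncio.run(coro)  # starting a second, nested loop is rejected
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` first patches asyncio so that such nested calls are allowed.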

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Mathew. My girlfriend and I are in the beginning of our relationship. It is really scary to think that someone is not as cool as me. I am a bit of a geek out of necessity, I think I'm good at math and science. My girlfriend wants me to be more "normal" and work with people like me. I feel like I am not as "cool" as she wants me to be. She has been saying the same things for a while now and I just don't know if she is mad at me or is being nice. She is very kind, I think and she is just really nice. She
Prompt: The president of the United States is
Generated text:  a very important person in the country. He has many duties to do. He is the leader of the country, he is the head of government. He decides what the president of the United States does and what the government does. The president is also responsible for making important decisions that affect the country. The president also has to make sure that the country does not have a lot of probl
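
For offline batch jobs it is often convenient to keep each prompt together with its result. The following is a minimal sketch of that pattern. `StubEngine` and `collect_records` are hypothetical names introduced here so the example runs without a GPU; the stub only mirrors the call shape shown above (a list of prompts in, one dict with a `text` key per prompt out) and is not part of SGLang.

```python
import json
from typing import Any


class StubEngine:
    """Hypothetical stand-in for the engine so the sketch runs without a GPU."""

    def generate(
        self, prompts: list[str], sampling_params: dict[str, Any]
    ) -> list[dict[str, Any]]:
        # Mirrors only the shape used above: one dict with a "text" key per prompt.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def collect_records(engine, prompts, sampling_params):
    """Run one batched call and pair every prompt with its generated text."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        raise ValueError("expected one output per prompt")
    return [
        {"prompt": prompt, "text": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


records = collect_records(
    StubEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)

# Serialize as JSON Lines, one record per line, for later processing.
jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```

Swapping `StubEngine()` for a real engine instance should work as long as it returns the same shape as in the cell above.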

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [job title] at [company name]. I am passionate about [job title] and I love [job title] because [reason for passion]. I am always looking for ways to [action or goal], and I am always eager to learn and grow. I am a [job title] who is always [positive trait or quality]. I am a [job title] who is always [positive trait or quality]. I am a [job title] who is always [positive trait or quality]. I am a [job title] who is always [positive trait or quality]. I am a [job

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a bustling metropolis with a rich history and a diverse population. The city is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also known for its food, fashion, and art scene, and is a popular tourist destination. The city is a cultural hub and a major economic center in Europe. It is also a symbol of France's rich history and culture. Paris is the largest city in France by population and is considered the cultural and economic center of the country. It is also the capital of the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased automation and artificial intelligence: As AI becomes more advanced, it is likely to become more integrated into various industries, leading to increased automation and artificial intelligence. This could lead to job displacement, but also create new opportunities for workers.

2. Improved privacy and security: As AI becomes more integrated into our daily lives, there will be a need for increased privacy and security measures. This could include measures to protect user data, prevent cyber attacks, and ensure that AI systems are
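
The phrase "overlap removal" refers to merging streamed chunks without repeating text that has already been received. The sketch below shows one way such merging could work; it is an illustration under that assumption, not necessarily how `stream_and_merge` is implemented. `drop_repeated_prefix` and `merge_chunks` are hypothetical names, and the chunk list is made up for the demonstration.

```python
def drop_repeated_prefix(accumulated: str, chunk: str) -> str:
    """Return the part of `chunk` that does not repeat the end of `accumulated`."""
    longest = min(len(accumulated), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(longest, 0, -1):
        if accumulated.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that overlaps what is already merged."""
    merged = ""
    for chunk in chunks:
        merged += drop_repeated_prefix(merged, chunk)
    return merged


# Made-up chunks in which each piece repeats the tail of the previous one.
chunks = [" Paris", " Paris, also", " also known as", " the City of Light."]
merged = merge_chunks(chunks)
print(merged)
```

A purely textual rule like this cannot tell an accidental overlap from a genuine repetition (for example, the chunk `"a"` arriving after text that ends in `"a"` would be dropped), so it is only suitable when chunks are known to overlap.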



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [field or profession] [independent, professional, or volunteer], currently located [location]. As a [profession], I am [insert the first name of the character], and I am here to [insert what the character does], [insert their role in the field or profession]. I am a [insert the first name of the character], and I am here to [insert what the character does], [insert their role in the field or profession]. And as a [insert the first name of the character], I am [insert what the character does], [insert their role in the field or profession

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a historic city with a rich cultural heritage and notable landmarks such as Notre-Dame Cathedral, the Palace of Versailles, and the Louvre Museum. It is an international center for education, art, and mu
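
One reason to prefer the asynchronous call is that several requests can be awaited at the same time. The sketch below illustrates this with `asyncio.gather`. `StubAsyncEngine` and `run_batches` are hypothetical names; the stub mimics only the awaitable call shape used above (a list of prompts in, a list of dicts out) and counts how many calls are in flight at once. How a real engine behaves under concurrent calls is not covered here.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in that mimics only the awaitable call shape."""

    def __init__(self):
        self.in_flight = 0
        self.peak_in_flight = 0

    async def async_generate(self, prompts, sampling_params):
        self.in_flight += 1
        self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
        await asyncio.sleep(0.01)  # pretend the engine is busy generating
        self.in_flight -= 1
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # gather() schedules every call right away and preserves the input order.
    return await asyncio.gather(
        *(engine.async_generate(batch, sampling_params) for batch in batches)
    )


engine = StubAsyncEngine()
batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(
    run_batches(engine, batches, {"temperature": 0.8, "top_p": 0.95})
)
print(engine.peak_in_flight, [len(r) for r in results])
```

Both stub calls overlap in time because neither blocks the event loop while it waits.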

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], I'm a [Occupation/Role] who has been dedicated to [Reason for being] for [Number of Years]. I'm here to introduce myself, and I'm excited to share some of the things that make me who I am. My name is [Name]. I'm a [Occupation/Role] who has been dedicated to [Reason for being] for [Number of Years]. I'm here to introduce myself, and I'm excited to share some of the things that make me who I am. My name is [Name], I'm a [Occupation/Role] who has been dedicated to [



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its unique architecture, historical landmarks, and rich cultural heritage. It is a bustling city with a population of over 1 million people, making it the largest city in Europe by population. Paris is also famous for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The French government has invested heavily in infrastructure development and tourist attractions to promote tourism and cultural exchange in the city. Paris is a city of contrasts, with its traditional square, cobblestone streets, and vibrant nightlife. Its status as the capital of France is a symbol of the country's history, culture



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be characterized by significant advancements in multiple areas, including:

1. Increased efficiency and productivity: As AI becomes more integrated into everyday life, it is likely to become more efficient and effective at performing a wide range of tasks, from administrative tasks to medical diagnosis to optimizing supply chain logistics.

2. Improved safety and reliability: AI-powered systems are becoming more robust and reliable, and are being developed to handle complex and unpredictable environments. This includes the ability to process large amounts of data quickly and accurately, and to make decisions that are both timely and safe.

3. Personalization and customization: AI-powered systems are becoming increasingly capable of analyzing large
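
The consumption pattern in the cell above, printing each chunk as it arrives while the loop stays free for other work, can be tried without an engine. In the sketch below, `stub_stream` is a hypothetical async generator that yields fixed pieces of text; it stands in for the real chunk stream and is not an SGLang API. `consume` is likewise an illustrative name.

```python
import asyncio
from typing import AsyncIterator


async def stub_stream(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for an async chunk stream; yields fixed pieces."""
    for piece in [" Paris", ",", " known", " for", " its", " architecture", "."]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield piece


async def consume(prompt: str) -> str:
    parts = []
    async for chunk in stub_stream(prompt):
        print(chunk, end="", flush=True)  # show text as soon as it arrives
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = asyncio.run(consume("The capital of France is"))
```

Collecting the chunks in a list and joining once at the end keeps the full text available after streaming finishes.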




```python
llm.shutdown()
```