# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
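
For context, the snippet below sketches the failure that `nest_asyncio` works around. It uses only the standard library (no SGLang), and `inner` and `outer` are illustrative names: calling `asyncio.run()` while an event loop is already running, which is the situation inside a notebook cell, raises a `RuntimeError`.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like code
    # executed in an IPython/Jupyter cell.
    coro = inner()
    try:
        asyncio.run(coro)  # starting a second loop is not allowed
    except RuntimeError as exc:
        coro.close()  # suppress the "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

With `nest_asyncio.apply()` in place, nested calls of this kind are permitted, which is why the engine examples below can call `asyncio.run(main())` from a notebook.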

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  John. I am currently in grade 11 and I have been using the personal computer for a few years now. Since I was little, I have been very much interested in mathematics. I have taken some elementary algebra classes, but I never did very well and never took any more classes in algebra. I am not very good at math.

I feel like I am having a lot of trouble with math. I am starting to think that I am not very good at math. I have not been able to get my grades up and my overall performance has been down. I am not very good at understanding mathematical concepts.

I have always been
Prompt: The president of the United States is
Generated text:  a high-ranking government official, serving as the head of state and government, and presiding over the daily operations of the nation. The president’s duties include leading the nation’s defense, foreign policy, national security, foreign and military relations, and negotiations and legislation related to inte
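
Because `llm.generate` returns one dictionary per prompt with a `text` field (as used in the loop above), batch results are straightforward to persist. The helper below is only a sketch: `save_results`, the output path, and the stand-in `fake_outputs` list are illustrative names, not part of SGLang.

```python
import json
import os
import tempfile


def save_results(prompts, outputs, path):
    """Write one JSON object per line, pairing each prompt with its text."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    with open(path, "w", encoding="utf-8") as f:
        for prompt, output in zip(prompts, outputs):
            record = {"prompt": prompt, "text": output["text"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Stand-in for the list returned by llm.generate(prompts, sampling_params).
fake_prompts = ["The capital of France is"]
fake_outputs = [{"text": " Paris."}]

out_path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
save_results(fake_prompts, fake_outputs, out_path)

# Read the file back to confirm the round trip.
with open(out_path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
print(rows)
```

In a real run, the call would be `save_results(prompts, outputs, path)` with the `outputs` list produced by the cell above.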

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light. It is the largest city in France and the third-largest city in the European Union. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe. It is also a major center for art, music, and literature, and is home to many museums, theaters, and other cultural institutions. Pa

Prompt: Explain possible future trends in artificial intelligence. The future of AI is

Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased focus on ethical AI: As more people become aware of the potential risks of AI, there is a growing emphasis on developing ethical AI that is designed to minimize harm and maximize benefits. This could involve developing AI that is designed to be transparent, accountable, and accountable, and that is used to make decisions that are fair and just.

2. Greater integration with human decision-making: AI is likely to become more integrated with human decision-making, allowing for more complex and nuanced decision-making. This
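
To make "overlap removal" concrete, here is one plausible way to merge streamed chunks when a chunk may repeat text that was already emitted. This is a sketch under that assumption and is not necessarily how `stream_and_merge` is implemented; `strip_overlap` and `merge_chunks` are hypothetical names.

```python
def strip_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hypothetical stream in which the second chunk repeats "lo, " from the first.
chunks = ["Hello, ", "lo, my name", " is"]
merged = merge_chunks(chunks)
print(merged)  # Hello, my name is
```

One caveat of this simple approach: a chunk that legitimately repeats the preceding text (for example `"ha"` followed by `"ha"`) is also treated as overlap and dropped, so it only suits streams where repeats signal duplication.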



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Job Title] at [Company Name]. I'm excited to meet you today and learn more about our unique product. What can you tell me about your company and your role within it?

Great, thanks for introducing yourself. I'm excited to learn more about our product. Can you tell me more about the product and what sets it apart from other similar products in the market?

Certainly, the product is a cloud-based project management tool that allows users to easily collaborate and track tasks, projects, and team members. It's designed to be user-friendly and intuitive, making it easy for anyone to use and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its historical landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. It also has a vibrant arts and culture scene, popul
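
Asynchronous generation becomes more useful when several independent batches are in flight at once. The sketch below shows a common pattern for that, bounding concurrency with `asyncio.Semaphore`. `FakeEngine` is a stand-in so the example runs without a GPU; it mimics only the `async_generate(prompts, sampling_params)` call shape and the list-of-`{"text": ...}` return value used above. `run_batches` and `max_in_flight` are illustrative names.

```python
import asyncio


class FakeEngine:
    """Stand-in for an engine; only mimics the call shape used above."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do work
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params, max_in_flight=2):
    """Run several prompt batches concurrently, at most max_in_flight at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def run_one(batch):
        async with semaphore:
            return await engine.async_generate(batch, sampling_params)

    # gather preserves the order of the input batches.
    return await asyncio.gather(*(run_one(b) for b in batches))


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(run_batches(FakeEngine(), batches, {"temperature": 0.8}))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(prompt, "->", output["text"])
```

Whether an external concurrency limit helps with a real engine depends on the workload, since the engine already schedules requests internally; treat the limit as a tuning knob rather than a requirement.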

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream chunks through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert first name and last name]. I am a [insert age range] year old boy, and I am [insert occupation or profession] at the moment. I have always been fascinated by [insert something about your past or childhood, such as a hobby, experience, or story]. I enjoy [insert something you do in your free time, such as reading, playing music, or spending time with friends]. I am always looking for [insert something you are passionate about or interested in, such as learning new things, discovering interesting facts, or connecting with others]. I am dedicated to [insert something you are proud of or what you



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
A) Correct
B) Incorrect

A) Correct

Paris is the capital and largest city of France, located on the North Bank of the Seine in the Centre-Val de Loire region. It is the 15th-largest city in the European Union and is an important cultural, economic, and political center in Western Europe. The city is renowned for its historic architecture, art, music, and food. Paris is known as the "City of Light" for its iconic architecture and its role as the seat of government for France. It also hosts numerous cultural events and festivals throughout the year. Paris is one



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  bright, with many exciting developments on the horizon. Here are some of the possible trends:

1. Increased AI integration with other technologies: AI is becoming more integrated with other technologies, such as the Internet of Things (IoT) and the Internet of Things (IoT) with AI. This will allow more intelligent devices and systems to operate autonomously, making them much more efficient and convenient.

2. Advancements in AI ethics and transparency: As AI becomes more prevalent, it is becoming increasingly important to consider the ethical implications of its use. This will likely lead to more research and development in AI ethics, as well as more transparency and
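
When streaming, it is often useful to keep the full text while still printing chunks as they arrive. The sketch below shows that pattern with a stand-in async generator, `fake_stream`, in place of `async_stream_and_merge(llm, prompt, sampling_params)`; both `fake_stream` and `collect` are hypothetical helpers, not SGLang functions.

```python
import asyncio


async def fake_stream(text):
    """Stand-in async generator that yields one word-sized chunk at a time."""
    for i, word in enumerate(text.split(" ")):
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield word if i == 0 else " " + word


async def collect(stream):
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream("Paris is the capital of France.")))
```

Collecting chunks in a list and joining once at the end avoids repeated string concatenation for long generations.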




In [6]:
llm.shutdown()
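
In a script, an exception raised during generation would skip a trailing `llm.shutdown()` call. One way to guard against that is a small context manager, sketched below. `managed_engine` is a hypothetical helper, and `FakeEngine` is a stand-in that exposes only the `shutdown()` method used above.

```python
from contextlib import contextmanager


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call its shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Stand-in exposing only the shutdown() method used above."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


engine = FakeEngine()
try:
    with managed_engine(engine):
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.closed)  # True: shutdown ran despite the exception
```

With a real engine, the same wrapper would be used as `with managed_engine(sgl.Engine(model_path=...)) as llm:`.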