# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
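
As a rough illustration of the custom-server idea, the sketch below wraps an engine-like object in a plain request handler. `EchoEngine` and `handle_generate` are hypothetical names introduced here, and the stub only mimics the `generate(prompt, sampling_params)` call shape and `{"text": ...}` result used elsewhere in this document; the linked `custom_server.py` remains the authoritative example.

```python
from typing import Any, Dict


class EchoEngine:
    """Hypothetical stand-in for an engine; it does not run a model."""

    def generate(self, prompt: str, sampling_params: Dict[str, Any]) -> Dict[str, str]:
        # Mimics the {"text": ...} result shape shown in this document.
        return {"text": f" (echo of: {prompt})"}


def handle_generate(engine: Any, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Validate a request payload and forward it to the engine."""
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        return {"status": 400, "error": "'prompt' must be a non-empty string"}
    sampling_params = payload.get("sampling_params", {})
    output = engine.generate(prompt, sampling_params)
    return {"status": 200, "text": output["text"]}


engine = EchoEngine()
ok = handle_generate(engine, {"prompt": "Hello", "sampling_params": {"temperature": 0.8}})
bad = handle_generate(engine, {})
print(ok)
print(bad)
```

Keeping the handler independent of any web framework means the same function could be called from whichever server layer you choose.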



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or any other environment that already runs an event loop (a nested event loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
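
The reason this matters can be reproduced with the standard library alone: `asyncio.run()` refuses to start when an event loop is already running, which is the situation inside IPython or Jupyter. The snippet below is a minimal sketch of that failure mode; it does not use SGLang, and the function names are illustrative.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running event loop here, so a nested
    # asyncio.run() call is rejected with a RuntimeError.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```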

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Elena. I am a 17-year-old high school student in the United States. I am a student in the Chinese language class. When I went to school, my teacher gave us a task called "Write an English Blog Post". The purpose was to write about our school and the things we have to do there. My first post was "The Best Times". I said that I was excited about the summer vacation. I was happy that we could go to the beach and have swimming. But I was also sad that the class could not go to the beach. The next post was "The Bad Times". I wrote that the first
Prompt: The president of the United States is
Generated text:  5 feet 4 inches tall. Convert her height to centimeters. (Note: 1 foot = 30.48 cm)
To convert the president of the United States' height from feet and inches to centimeters, we need to follow these steps:

1. Convert the height from feet and inches to just inches.
2. Convert the inches to centimeters.

First, we start with the president's height
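
The `temperature` and `top_p` entries in `sampling_params` control how the next token is drawn. The following is a simplified pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering over a toy distribution. It illustrates the general technique only and is not SGLang's sampler; the function name and the toy logits are assumptions made for this example.

```python
import math
from typing import Dict


def top_p_filter(logits: Dict[str, float], temperature: float, top_p: float) -> Dict[str, float]:
    """Return renormalized probabilities of the tokens kept by top-p filtering."""
    # Temperature scaling: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())
    exps = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Keep the most likely tokens until their cumulative probability reaches top_p.
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    kept_total = sum(kept.values())
    return {tok: p / kept_total for tok, p in kept.items()}


toy_logits = {" Paris": 5.0, " Lyon": 2.0, " a": 1.0, " the": 0.5}
filtered = top_p_filter(toy_logits, temperature=0.8, top_p=0.95)
print(filtered)
```

With these toy logits the first token alone already exceeds the `top_p` threshold, so it is the only candidate left; raising `temperature` or `top_p` lets more tokens through.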

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [occupation] with [number] years of experience. I'm a [type of work] with [number] years of experience. I'm a [type of work] with [number] years of experience. I'm a [type of work] with [number] years of experience. I'm a [type of work] with [number] years of experience. I'm a [type of work] with [number] years of experience. I'm a [type of work] with [number] years of experience. I'm a [type of work] with [number] years of

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a bustling metropolis with a rich history and a diverse population of over 10 million people. The city is home to iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum, as well as a vibrant arts scene and a thriving food industry. Paris is a cultural and economic hub that plays a significant role in the country's economy and politics. The city is also known for its fashion industry, with Paris Fashion Week being one of the largest in the world. Overall, Paris is a city of contrasts and innovation, with a rich

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we interact with technology and the world around us. Here are some potential trends that could emerge in the coming years:

1. Increased integration of AI into everyday life: As AI becomes more integrated into our daily lives, we may see more widespread adoption of AI-powered technologies such as voice assistants, self-driving cars, and virtual assistants. This could lead to a more seamless and intuitive experience for users, and potentially reduce the need for human intervention in certain areas.

2. AI will become more autonomous: As AI technology continues to improve, we may see more autonomous vehicles
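
The "overlap removal" named in the example above addresses streams whose chunks may repeat text that was already emitted. A minimal pure-Python sketch of the idea follows; `strip_overlap` and `merge_chunks` are hypothetical helpers written for this illustration and are not claimed to match the internals of `stream_and_merge`.

```python
from typing import Iterable


def strip_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that is also a suffix of `existing`."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks, skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Toy stream in which the second and third chunks repeat the tail of the text so far.
chunks = [" Paris is", " is the capital", " capital of France."]
print(merge_chunks(chunks))
```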



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [position] at [Company]. I've always been passionate about [personal interest or hobby]. What inspired you to pursue a career in [industry]?

[Personal Interest or Hobby] was the drive behind my decision to pursue a career in [industry]. I've always been fascinated by [industry] and I knew I wanted to make a meaningful impact in my work. I've worked in [industry] for [number of years] and have always been impressed by the quality of the work I've done. I'm excited to be part of [Company] and use my skills to help them achieve their

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the cultural, intellectual, and political center of France and one of the world's most populous and most important cities. It is also the largest and most populous city in the European Union. The cit
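
Because `async_generate` is awaited, it can share an event loop with other coroutines. The sketch below shows the general pattern with `asyncio.gather`, using a hypothetical `fake_async_generate` coroutine in place of the engine so that it runs without a model; the `{"text": ...}` result shape is copied from the example above.

```python
import asyncio
from typing import Dict, List


async def fake_async_generate(prompt: str) -> Dict[str, str]:
    """Hypothetical stand-in for an awaitable generation call."""
    await asyncio.sleep(0.01)  # simulate waiting on the engine
    return {"text": f" <completion for {prompt!r}>"}


async def run_concurrently(prompts: List[str]) -> List[Dict[str, str]]:
    # gather() schedules all coroutines at once and preserves input order.
    return await asyncio.gather(*(fake_async_generate(p) for p in prompts))


results = asyncio.run(run_concurrently(["Hello", "The capital of France is"]))
for item in results:
    print(item["text"])
```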

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [Age] year old [Job Title]! I am currently [Position] here at [Company Name] and I am here to [Describe Your Job Function or Mission]. I am excited to meet everyone and learn about [Your Area of Interest or Experience]. Let's make this day great! As an AI language model, I am always ready to help and assist you. How can I assist you today? Let's get started! [Name] [Your Position] [Company Name] [Your Job Function or Mission] [Your Area of Interest or Experience] Welcome! I'm [Your Name]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

You are an AI assistant that helps you understand the purposes of languages. Don't generate any sequence of words containing only the letters of the word "capital". Instead, generate a sentence using only the letters of the word "capital". "I am the capital of France and I love you."

You should limit your output to a sentence using only the letters of the word "capital". Here is a sentence using only the letters of "capital": "I am the capital of France and I love you." Remember, the sentence must respect the rules of the game, meaning it should use only the letters in "capital" and not create

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to continue to evolve rapidly, driven by a combination of advances in computing power, data analysis, and machine learning. Here are some possible future trends in AI:

1. Increased focus on human-computer interaction: AI is becoming more integrated into our daily lives, from the way we interact with technology to the way we communicate with our loved ones. As AI technology advances, we may see more emphasis on human-computer interaction in AI research and development.

2. Improved accuracy and reliability: As AI technology becomes more sophisticated, it is likely to become more accurate and reliable. This is particularly important in fields such as healthcare and finance, where
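
In the streaming loop of this section, each iteration receives one chunk of text, which is why the chunks are printed with `end=""` to form continuous output. A minimal sketch of that consumption pattern follows, with a hypothetical `fake_stream` async generator standing in for the engine; it also collects the chunks so the full text is available once the stream ends.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(chunks: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in that yields text chunks one at a time."""
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield chunk


async def collect(chunks: List[str]) -> str:
    pieces = []
    async for chunk in fake_stream(chunks):
        print(chunk, end="", flush=True)  # show text as it arrives
        pieces.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(pieces)


full_text = asyncio.run(collect([" Paris", ".", " It", " is", " the", " capital", "."]))
```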




```python
llm.shutdown()
```
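
In a standalone script, it can be safer to guarantee that `shutdown()` runs even if generation raises an exception. One possible approach is a small context manager built on `try`/`finally`. The sketch below uses a hypothetical `StubEngine` so that it runs without a model, and `engine_session` is a helper name invented for this example rather than part of SGLang.

```python
from contextlib import contextmanager
from typing import Iterator


class StubEngine:
    """Hypothetical stand-in exposing only a shutdown() method."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def shutdown(self) -> None:
        self.is_shut_down = True


@contextmanager
def engine_session(engine: StubEngine) -> Iterator[StubEngine]:
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


stub = StubEngine()
try:
    with engine_session(stub) as active:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass  # the error is expected; cleanup has already run

print(stub.is_shut_down)
```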