# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
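
To illustrate why the patch is needed, the stdlib-only sketch below reproduces the underlying restriction: `asyncio.run()` refuses to start when an event loop is already running in the same thread, which is the situation inside IPython or Jupyter. The names `inner` and `outer` are illustrative, and nothing here calls SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by plain asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

With `nest_asyncio.apply()` in effect, nested calls of this kind are allowed instead of raising.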

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Greg. I was born in 1979 in Regina, Saskatchewan, Canada. I now live and work in San Francisco, California, United States. I have an undergraduate degree in Industrial Engineering from Concordia University, an MBA from the University of San Francisco, and a PhD in Industrial Engineering from the University of Notre Dame. I was a Research Associate in the Laboratory of Industrial Engineering at the University of Notre Dame. My current research interests are in industrial design and technology. I work primarily in the areas of automotive engineering and design, robotics, and industrial measurement. My research focuses on research how people use computer aided design software and how
Prompt: The president of the United States is
Generated text:  a very important person in our country. As such, he/she must be _______ ( 1. ) in front of the public. ( 2. ) to the public.
A. in a position B. in charge C. in need D. in charge D. in charge

Explanation
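
For offline batch jobs the results usually need to be stored rather than printed. The following is a minimal sketch, assuming only what the cell above shows: each output is a dict with a `"text"` key, returned in the same order as the prompts. The helper name `to_jsonl` and the record field names are hypothetical, and the data is a stand-in rather than real engine output.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text, one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
prompts = ["The capital of France is", "The future of AI is"]
outputs = [{"text": " Paris."}, {"text": " bright."}]

jsonl = to_jsonl(prompts, outputs)
print(jsonl)
```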

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also famous for its rich history, including the French Revolution and the French Revolution Museum. Paris is a bustling city with a diverse population and is a major economic and cultural center in Europe. It is home to many famous landmarks and attractions, including the Louvre, the Notre-Dame Cathedral, and the Champs-Élysées. Paris is also known for its cuisine, including French cuisine, and is a popular tourist destination. The city is also home to many museums, including the Mus

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a growing emphasis on ethical considerations. This will include issues such as bias, transparency, accountability, and privacy.

2. Integration of AI with other technologies: AI is already being integrated into a wide range of technologies, including healthcare, finance, transportation, and manufacturing. As more technologies become integrated with AI, we can expect to see even more integration in the future.

3. Development
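
The "overlap removal" mentioned above can be pictured with a small sketch. It assumes that successive streamed chunks may repeat the tail of the text received so far; whether `stream_and_merge` works exactly this way is not stated here, and the helper names `strip_overlap` and `merge_chunks` are hypothetical.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Return the part of `chunk` that does not repeat the tail of `existing`."""
    longest = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(longest, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks while dropping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Made-up chunks that overlap at their boundaries.
chunks = [" Paris", " Paris is", " is the capital", " capital of France."]
print(merge_chunks(chunks))
```

One trade-off of this approach: a chunk that legitimately starts with text identical to the current tail would also be trimmed, so it fits streams that are known to resend overlapping text.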



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Occupation]. I am [Current Profession]. I live in [City/Country]. I bring a variety of skills to the table, including [List of skills]. I have a strong work ethic and a positive attitude, and I am always looking for ways to improve and grow. I am [Age], [Gender], and I am [Religion/ Cultural Background]. What about you? What brings you to the table today?
[Name], may I have your name and profession so I can provide you with a more accurate self-introduction? What about you? Are you interested in learning more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, often known as "The City of Light". It is a historic city with a rich history dating back thousands of years, with a modern skyline that reflects its status as a global cultural hub. The city is home to many renowned museums,
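
When requests arrive one at a time rather than as a ready-made list, a common asyncio pattern is to issue them concurrently with a cap on how many are in flight. The sketch below uses a hypothetical `StubEngine` so that it runs without a model; its `async_generate` takes a single prompt string, which is an assumption made for the illustration. `generate_all` and `max_in_flight` are likewise illustrative names.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in that only mimics an awaitable generate call."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do work
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params, max_in_flight=2):
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        # At most `max_in_flight` requests run at the same time.
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather returns results in the same order as the prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
results = asyncio.run(generate_all(StubEngine(), prompts, {"temperature": 0.8}))
for prompt, result in zip(prompts, results):
    print(f"{prompt} ->{result['text']}")
```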

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am [Age]. I was born in [City] and raised in [State]. I love [X] and [Y] a lot, and I am always looking for new ways to [Z] and [A]. I believe that [X] and [Y] are the key to achieving [Z], and that by working together, we can create a better [X] and [Y]. I am confident in my abilities and I am always willing to learn new things. Thank you. That's great to hear! Can you tell me more about your background and how you got into mathematics? The more

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its rich history and artistic heritage. It is home to the iconic Eiffel Tower and is one of the most visited cities in the world. Paris is also known for its vibrant nightlife, fashion, and cultural attractions. In terms of infrastructure, it has a well-developed public transportation system and is home to numerous museums, art galleries, and historic landmarks. The city is also known for its diverse culinary scene, with Parisian cuisine being widely recognized around the world.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  diverse and rapidly evolving, with a range of potential trends that could shape the technology's direction and impact. Here are some possible future trends in artificial intelligence:

1. Enhanced Natural Language Processing: With the growth of internet and social media, natural language processing has become more advanced and sophisticated, allowing AI to understand and interpret human language in new ways. This could lead to more intelligent and natural-sounding AI, capable of producing more context-aware responses and answering questions in a more human-like way.

2. Enhanced Computer Vision: With the development of computer vision, AI is becoming more capable of identifying and understanding objects, people, and situations in the

In [6]:
llm.shutdown()
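
In a script, an exception raised before `llm.shutdown()` would skip the cleanup. One way to guard against that is a small context manager, sketched below. `engine_session` is a hypothetical helper, and `StubEngine` is a stand-in that only mirrors the `generate` and `shutdown` calls used in this document, so the example runs without a model.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(make_engine):
    """Create an engine and always call its shutdown(), even on error."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


class StubEngine:
    """Hypothetical stand-in exposing generate() and shutdown()."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self):
        self.closed = True


with engine_session(StubEngine) as llm:
    outputs = llm.generate(["Hello, my name is"], {"temperature": 0.8})

print(llm.closed, outputs[0]["text"])
```

With the real engine, the factory could be a lambda such as `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`, reusing the constructor call from the first cell.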