# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython, or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
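The need for this patch can be seen with plain `asyncio`: the asynchronous examples in this document call `asyncio.run`, which refuses to start while an event loop is already running, as is the case inside IPython or Jupyter. The sketch below reproduces that restriction with the standard library only; `inner` and `outer` are illustrative names and no SGLang code is involved.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # A loop is already running here, just as it is inside IPython/Jupyter.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call into an already running loop
    except RuntimeError as exc:
        coro.close()  # the coroutine never started; close it to avoid a warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Applying `nest_asyncio.apply()` beforehand patches the loop so that such nested calls are permitted.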

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine.
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```





### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Leatrice and I'm the founder of Children's Hospital Northwest (CHN) in Seattle. My organization is a vision board for the 21st century - a comprehensive, evidence-based approach to family medicine that integrates the best of our best practices to serve our communities. I'm also the CEO of Education Northwest and the President of the Board of Directors of the American Academy of Pediatrics.
As a physician, I was passionate about serving the needs of sick and disabled children. From my undergraduate years at Stanford University, I chose the health care profession as a passion, and even after graduation, I continued to study medicine as a medical student.
Prompt: The president of the United States is
Generated text:  considered the head of state, but how does he represent the country? As the head of state, the president serves as the representative of the United States government to the people of the country. He is responsible for carrying out th
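When the prompt list is large, or results should be saved incrementally, one option is to submit prompts in chunks. The following is a minimal sketch of that pattern, not a recommendation from the SGLang documentation: `StubEngine`, `chunked`, and `run_in_chunks` are hypothetical names, and the stub only mimics the call shape used above (a list of prompts in, a list of dicts with a `"text"` key out) so that the example runs without a GPU.

```python
from typing import Dict, Iterator, List


class StubEngine:
    """Hypothetical stand-in for the engine; it only echoes prompts."""

    def generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict]:
        # Same call shape as the example above: one dict with "text" per prompt.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start : start + size]


def run_in_chunks(engine, prompts, sampling_params, chunk_size=2):
    """Call engine.generate once per chunk and keep results in prompt order."""
    results = []
    for batch in chunked(prompts, chunk_size):
        outputs = engine.generate(batch, sampling_params)
        results.extend(zip(batch, (o["text"] for o in outputs)))
    return results


prompts = [f"Prompt {i}:" for i in range(5)]
pairs = run_in_chunks(StubEngine(), prompts, {"temperature": 0.8, "top_p": 0.95})
for prompt, text in pairs:
    print(f"{prompt!r} -> {text!r}")
```

With a real engine, the chunk size would be a tuning choice; since the engine schedules requests itself, very small chunks may leave throughput unused.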

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [occupation] who has been [number of years] in the industry. I'm passionate about [reason for passion], and I'm always looking for ways to [action or goal]. I'm always eager to learn and grow, and I'm always willing to take on new challenges. I'm confident in my abilities and I'm always ready to help others. I'm a [reason for confidence] and I'm always looking for ways to [action or goal]. I'm a [reason for confidence] and I'm always looking for ways to [action or goal]. I'm a [reason for confidence

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich cultural heritage, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. Paris is also a major economic and financial center, with a thriving fashion industry, art scene, and a diverse population of over 2 million people. The city is home to many world-renowned museums, including the Louvre and the Musée d'Orsay, and is a popular tourist destination for its beautiful architecture, vibrant nightlife, and cultural events. Paris is a city of contrasts, with its historic architecture and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased automation: AI is expected to become more prevalent in manufacturing, transportation, and other industries, where it can automate repetitive tasks and increase efficiency. This will lead to the development of new types of AI, such as cognitive agents and neural networks, that can perform tasks that are currently performed by humans.

2. Enhanced human-computer interaction: AI will continue to improve its ability to interact with humans in a more natural and intuitive way. This will involve the development of more sophisticated natural language processing and machine learning algorithms that can understand and respond to human language in a more human-like way.
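The banner printed by this example mentions overlap removal. As a rough illustration of what that can mean, assume that a streamed chunk may repeat the tail of the text already received; merging then amounts to dropping the repeated prefix before appending. The sketch below is not taken from the `stream_and_merge` source; `strip_overlap` and `merge_chunks` are hypothetical names used for illustration only.

```python
from typing import Iterable


def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing text that overlaps what is already merged."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Illustrative chunks in which each one may repeat the previous tail.
chunks = [" Paris", " Paris, also", " also known as", " the City of Light."]
merged = merge_chunks(chunks)
print(merged)
```

A suffix/prefix heuristic like this cannot distinguish an accidental overlap from text the model genuinely repeats, so it should be treated as a sketch rather than a general solution.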





### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a/an [Occupation/Position] with [Number of Years] years of experience in this field. I enjoy [Reason for Interest in the Job]. What is your favorite part of your job? What's something you wish you knew more about? And finally, what's something you wish you had more of? [Optional, personal anecdotes or stories] [Optional, photos or videos] [Optional, a quote or quote from a favorite book or movie] What would your boss or manager say to you if you asked them to describe you? How do you typically approach a situation where a problem arises or a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also home to a rich cultural and historical heritage, with various museums and art galleries fea
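Because `async_generate` is awaited, several calls can in principle be awaited together with `asyncio.gather`. The sketch below shows only the asyncio pattern: `StubAsyncEngine` and `run_groups` are hypothetical, the stub merely imitates the call shape used above, and whether the real engine gains throughput from concurrent calls is not verified here.

```python
import asyncio
from typing import Dict, List


class StubAsyncEngine:
    """Hypothetical stand-in exposing an async_generate coroutine."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict
    ) -> List[Dict]:
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_groups(engine, groups, sampling_params):
    """Await several async_generate calls concurrently; results keep group order."""
    tasks = [engine.async_generate(group, sampling_params) for group in groups]
    return await asyncio.gather(*tasks)


groups = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(run_groups(StubAsyncEngine(), groups, {"temperature": 0.8}))
for group, outputs in zip(groups, results):
    for prompt, output in zip(group, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```

`asyncio.gather` returns results in the order the awaitables were passed, so each result list can be zipped back to its prompt group.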

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly.
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name]. I am a [career field or profession] expert with over [number of years of experience] years of experience in [related field]. I have a passion for [reason or interest] and have always been driven to continue learning and improving. Whether it's through my work or my personal development, I'm always striving to do better and reach my goals. I'm here to help you achieve your goals, too. What do you want to learn or grow into? [Name] is ready to help you learn and grow. Let's start by discussing your goals and what you want to achieve. [Name] is here to



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is home to numerous historical sites, museums, and landmarks. It is also known for its rich cultural heritage and is a popular tourist destination. French cuisine, known for its dishes like galette, croissants, and crêpes, is also famous in Paris. The city has a strong focus on education and culture, with numerous educational institutions and arts venues. It is also known for its vibrant nightlife and is a popular destination for tourists looking for a slice of French culture. Paris, also known as “la ville verte” meaning the green city, is a sprawling metropolis that is a popular tourist destination and the capital



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  rapidly evolving, with many potential trends and developments shaping how we use and integrate artificial intelligence in our daily lives. Here are some possible future trends in AI:

1. Enhanced intelligence: AI is increasingly being used to enhance human intelligence, particularly in fields such as language translation, speech recognition, and decision-making. As AI improves, we may see more sophisticated AI that can learn and adapt in new ways, allowing it to perform tasks that were once thought impossible.

2. Personalized AI: AI is becoming more capable of understanding and interacting with people in ways that are more natural and intuitive. As AI becomes more personal, we may see more emphasis
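The example above prints each chunk as it arrives and then discards it. If the complete text is also needed afterwards, the chunks can be accumulated while they are printed. The following sketch shows that pattern with a hypothetical chunk source, `stub_stream`, standing in for the engine's stream; `collect` is likewise an illustrative name.

```python
import asyncio
from typing import AsyncIterator, List


async def stub_stream(text: str) -> AsyncIterator[str]:
    """Hypothetical chunk source: yields text word by word, keeping spaces."""
    for i, word in enumerate(text.split(" ")):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word if i == 0 else " " + word


async def collect(stream: AsyncIterator[str]) -> str:
    """Print chunks as they arrive and return the concatenated text."""
    pieces: List[str] = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        pieces.append(chunk)
    print()  # final newline after the stream ends
    return "".join(pieces)


full_text = asyncio.run(collect(stub_stream("Paris is the capital of France.")))
```

Collecting pieces in a list and joining once at the end avoids repeated string concatenation for long outputs.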




```python
llm.shutdown()
```
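When the same steps run as a script rather than as notebook cells, an exception in an earlier step would skip the final `shutdown()` call. One way to guard against that is a `try`/`finally` block. The sketch below uses a hypothetical `StubEngine` that only records whether `shutdown()` was called; it makes no claim about what the real engine releases on shutdown.

```python
class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


engine = StubEngine()
try:
    engine.generate([], {"temperature": 0.8})  # fails on purpose
except ValueError as exc:
    print(f"generation failed: {exc}")
finally:
    engine.shutdown()  # runs on success and on failure

print("shut down:", engine.is_shut_down)
```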