# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other code that already runs an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
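
The reason is that IPython and Jupyter already run an event loop, and the standard library refuses to start a second one inside it. The sketch below reproduces that error using only `asyncio`. It makes no SGLang call, and the names `inner` and `outer` are illustrative. `nest_asyncio.apply()` patches `asyncio` so that this kind of nested call is allowed.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by the standard library
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "ok"


message = asyncio.run(outer())
print(message)
```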

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Sasha. I want to learn how to make a movie trailer.

I want to make a movie trailer for a romantic comedy about a couple who fall in love and have a son together. The trailer should have a romantic tone, a happy ending, and a sense of nostalgia.

How can I get started with the original screenplay and original character names for the movie? And how can I find inspiration for the romantic comedy genre?
What are the steps to create a movie trailer for a romantic comedy with a happy ending?
What are some successful trailers for a romantic comedy that have a happy ending?
What is the difference between a romantic comedy and a romantic
Prompt: The president of the United States is
Generated text:  now considering new tax rates on various income brackets. The marginal tax rates on each bracket are given by the functions:

- $m_1 = \frac{1}{2}\left(1 + \frac{r}{100}\right)^{-2}$
- $m_2 = \frac{1}{2}\left(1 + \frac{r}{100}\right)^{-1}$
- $m_3 = \frac{1
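
The `temperature` and `top_p` entries in `sampling_params` control how random the output is. As a rough illustration of how these two settings are commonly applied, here is a minimal pure-Python sketch. It is not SGLang's implementation; the helper names and the four made-up logits are assumptions for the example.

```python
import math


def apply_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
probs = apply_temperature(logits, temperature=0.8)
candidates = top_p_filter(probs, top_p=0.95)
print(candidates)
```

With these inputs the least likely token falls outside the 0.95 probability mass and is dropped, so sampling would choose among the remaining three. A lower temperature concentrates more probability on the top token, which is why the streaming examples that use `temperature=0.2` produce more predictable text.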

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm passionate about [job title] and [job title] at [company name]. I'm always looking for new challenges and opportunities to grow and learn. What's your favorite hobby or activity? I love [favorite hobby or activity], and I'm always looking for new ways to explore and discover new things. What's your favorite book or movie? I love [favorite book or movie],

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in France and the second-largest city in the European Union. It is also the seat of the French government and the country's cultural, political, and economic center. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe. It is also famous for its cuisine, fashion, and music. Paris is a popular tourist destination and a major cultural hub in Europe. It is home to many world-renowned museums, theaters, and art galleries. The city is also known for its annual festivals and events, such

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies are expected to continue to improve and become more integrated into our daily lives, from self-driving cars to personalized medicine. Additionally, AI is likely to continue to be used for a wide range of applications, from financial services to healthcare to manufacturing. As AI becomes more integrated into our daily lives, it is likely to have a significant impact on the way we work, live, and interact with each other. However, it is also likely to raise important ethical and social issues, such as the potential for AI to be used for
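
The cell above prints the result of `stream_and_merge`, which merges streamed chunks with overlap removal. The following is a minimal sketch of one way such overlap removal could work, assuming that a new chunk may begin with text that was already received. The function names and sample chunks are hypothetical, and the actual helper in `sglang.utils` may differ.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the prefix of new_chunk that repeats the end of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping text that was already seen."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


chunks = ["The capital", " capital of France", " France is Paris."]
merged = merge_chunks(chunks)
print(merged)
```

One caveat of this approach: a chunk that legitimately repeats the end of the previous text (for example `"ha"` followed by `"ha"`) is also treated as overlap and trimmed.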



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm a [职业] who has been [accomplished goal] for [number] years. I'm always looking for ways to make the world a better place and I'm committed to using my skills and experience to help others.

[Your Name] is a self-proclaimed "hacker" who loves to explore new technologies and solve complex problems. From coding to AI, I'm always eager to learn and improve.

I'm a perfectionist and I'm always on the lookout for new challenges to take me further. I'm always open to new ideas and innovative solutions.

I'm always up for a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the largest city in Europe and the birthplace of the French Revolution. Its location in the Pyrenees mountains makes it a popular tourist destination and a symbol of French culture. The city has a rich history dating back 
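
Awaiting generation lets other coroutines make progress while a request is in flight. The sketch below illustrates the pattern with a stub in place of the real engine, so it runs without a GPU or SGLang installed. `StubEngine` and `generate_all` are hypothetical names, and the stub only echoes its prompt.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for an engine; it only echoes the prompt."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # pretend to work without blocking the loop
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params):
    # Start one request per prompt; gather returns results in input order.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


prompts = ["Hello, my name is", "The capital of France is"]
outputs = asyncio.run(generate_all(StubEngine(), prompts, {"temperature": 0.8}))
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Because `asyncio.gather` preserves input order, pairing results with `zip(prompts, outputs)` stays correct even when requests finish in a different order.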

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I am a passionate and versatile person who loves to explore the world and learn new things. I am always up for a challenge and have a natural curiosity that makes me endlessly curious. I am passionate about being a good listener and making others feel heard and understood. I believe in the power of knowledge and always strive to learn and grow. I enjoy having conversations and engaging with people, and I am always ready to share my thoughts and experiences with those around me. What is your name? How can I get to know you better? Start by asking me a question or saying something that you think will spark a conversation. How



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

This statement encapsulates the city's essential characteristics: being the capital, its status as the largest city in Europe, and its cultural prominence. The information provided provides a clear and concise representation of what Paris stands for and is known for in the broader context of France and the world.

The statement is factual because it utilizes accurate and readily available information about Paris, the capital of France. It does not include any speculation or conjecture, but rather provides straightforward facts about the city's significance.

A more comprehensive statement could be: "Paris, the capital city of France, is a world-renowned city known for its rich



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by increased automation, globalisation, and decentralisation. Here are some possible future trends in AI:

1. Increased automation: AI is expected to become more and more integrated into our daily lives, from industrial machinery to personal assistants. Robots and AI-powered automation will become more prevalent in various sectors, such as manufacturing, agriculture, and healthcare.

2. Globalisation: AI is expected to become more and more widespread, with more countries adopting AI technologies. Globalisation will lead to increased competition between countries, and AI will become an increasingly important factor in economic growth and development.

3. Decentralisation: AI is increasingly becoming
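
The asynchronous streaming pattern above consumes an async generator and prints each piece as it arrives. Here is a minimal sketch of that pattern using a stub stream, so it runs without SGLang. It assumes each chunk carries the full text generated so far, which is an assumption for illustration and may not match what the engine actually yields. `stub_stream` and `stream_new_text` are hypothetical names.

```python
import asyncio


async def stub_stream(chunks):
    """Hypothetical stand-in for a streaming engine call."""
    for chunk in chunks:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield {"text": chunk}


async def stream_new_text(stream):
    """Yield only the text that was not already emitted."""
    emitted = ""
    async for chunk in stream:
        text = chunk["text"]
        # Assumption: each chunk repeats everything generated so far.
        new_part = text[len(emitted):] if text.startswith(emitted) else text
        emitted += new_part
        yield new_part


async def main():
    pieces = []
    cumulative = [" Paris", " Paris.", " Paris. It is"]
    async for piece in stream_new_text(stub_stream(cumulative)):
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()  # newline once the stream is exhausted
    return pieces


pieces = asyncio.run(main())
```

Printing with `end=""` and `flush=True`, as the cell above does, makes the pieces appear as one continuous line of text while the stream is still running.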




In [6]:
llm.shutdown()