# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an HTTP server would only add complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython or in any other code that already runs inside an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
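
The underlying restriction can be reproduced with the standard library alone. The sketch below uses illustrative names (`inner`, `outer`) that are not part of SGLang; it shows plain `asyncio` refusing to start a second event loop from inside a running one, which is presumably the situation the engine runs into in IPython. `nest_asyncio.apply()` patches `asyncio` so that such nested calls are permitted.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: plain asyncio refuses this
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

In a regular Python script with no event loop already running, the patch should not be needed.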

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Brandon.

I am a digital marketing consultant. I offer consulting services to business owners looking to develop, optimize, and grow their online presence.

I believe that everyone has the ability to develop a brand, and what's most important is that they know how to use their brand to make a tangible difference.

Brandon is a licensed digital marketing and SEO consultant who specializes in:

  1. Personal branding
  2. Website development
  3. SEO optimization
  4. Content marketing
  5. Social media marketing

I have experience in all aspects of marketing including content marketing, SEO, web development, social media
Prompt: The president of the United States is
Generated text:  visiting a country with an interesting tradition: each president visits every fifth house in the town. The town has 25 houses numbered 1 through 25. How many houses will the president visit? To determine how many houses the president of the United States will visit,

### Streaming Synchronous Generation
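
The cell in this section calls `stream_and_merge` from `sglang.utils` and describes the result as streaming "with overlap removal". As a rough illustration of that idea only, the sketch below merges text chunks while dropping any prefix of a new chunk that repeats the end of the text collected so far. The helpers `strip_overlap` and `merge_chunks` and the sample chunks are hypothetical, written for this example, and are not claimed to match the SGLang implementation.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Return `chunk` without the longest prefix that `existing` already ends with."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks, skipping text that overlaps the running result."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Made-up chunks: the second and third repeat the tail of what came before.
sample_chunks = ["The capital", " capital of France", "France is Paris."]
print(merge_chunks(sample_chunks))
```

A suffix/prefix heuristic like this cannot tell an accidental overlap from text that is legitimately repeated (for example, the chunks `"ha"` and `"ha"` merge to `"ha"`), so it is only appropriate when chunks are expected to overlap.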

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [occupation] with [number of years] years of experience in [field]. I am a [type of person] who is always [positive trait]. I am [type of person] and I am always [positive trait]. I am [type of person] and I am always [positive trait]. I am [type of person] and I am always [positive trait]. I am [type of person] and I am always [positive trait]. I am [type of person] and I am always [positive trait]. I am [type of person] and I am always [positive trait].

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French National Library, and the French Academy of Sciences. Paris is a cultural and economic hub, with a rich history dating back to the Roman Empire and a modern city that has undergone significant development over the centuries. It is a popular tourist destination and a major center for business and finance in Europe. Paris is also known for its cuisine, with its famous dishes such as croissants, escargot, and escargot frites. The city

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies will continue to improve, leading to more sophisticated and accurate AI systems that can perform a wide range of tasks with increasing accuracy and efficiency. Some potential future trends in AI include:

1. Increased integration with other technologies: As AI becomes more integrated with other technologies, such as IoT, blockchain, and quantum computing, it is likely that AI will become even more powerful and versatile.

2. Greater emphasis on ethical considerations: As AI systems become more advanced, there will be a greater emphasis on ethical considerations, such as privacy,



### Non-streaming Asynchronous Generation
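
One reason to prefer an asynchronous call is that the calling code can do other work on the same event loop while generation is pending. The sketch below illustrates that pattern with a stand-in coroutine, `fake_generate`, which only sleeps and echoes its prompts. It is a hypothetical placeholder for `llm.async_generate` so that the example runs without a model; `heartbeat` is likewise made up for the illustration.

```python
import asyncio


async def fake_generate(prompts, delay=0.05):
    """Hypothetical stand-in for an async engine call: waits, then echoes."""
    await asyncio.sleep(delay)
    return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def heartbeat(ticks, interval=0.01):
    """Unrelated work that keeps running while generation is pending."""
    count = 0
    for _ in range(ticks):
        await asyncio.sleep(interval)
        count += 1
    return count


async def main():
    prompts = ["The capital of France is", "The future of AI is"]
    # Run both coroutines concurrently on the same event loop.
    outputs, ticks = await asyncio.gather(fake_generate(prompts), heartbeat(3))
    return prompts, outputs, ticks


prompts, outputs, ticks = asyncio.run(main())
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
print(f"heartbeat ticks while waiting: {ticks}")
```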

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [major] majoring in [major] at [University Name]. I'm [Age] years old and I'm currently [location] for my current studies. I enjoy [interest/activities], and I'm always looking for new challenges and experiences to try. My [job title] is [Job Title] and my [job title] is [Job Title] at [Employer Name]. I'm always looking for the next opportunity and I'm eager to learn new things and make new friends. I hope that I can make [name] proud and that we can work together in [project/

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

Why is Paris considered one of the most beautiful cities in the world?
Paris is considered one of the most beautiful cities in the world for the following reasons:

1. Grandes Plages: Paris is known for its stunning beaches, such as the Seine River and the Palace

### Streaming Asynchronous Generation
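
The cell in this section consumes an async generator and prints each piece as soon as it arrives. The same consumer pattern can be tried without a model, as sketched below. The sketch assumes that each streamed chunk carries the full text generated so far, so the consumer has to print only the new part; that assumption, the `fake_stream` stand-in, and the `deltas` helper are all hypothetical and not taken from the SGLang source.

```python
import asyncio


async def fake_stream(snapshots, delay=0.01):
    """Hypothetical stand-in for a streaming engine call: yields growing text."""
    for text in snapshots:
        await asyncio.sleep(delay)
        yield {"text": text}


async def deltas(stream):
    """Turn cumulative snapshots into pieces containing only the new text."""
    seen = ""
    async for chunk in stream:
        text = chunk["text"]
        if text.startswith(seen):
            piece, seen = text[len(seen):], text
        else:
            # The snapshot does not extend what we have; pass it through as-is.
            piece, seen = text, seen + text
        if piece:
            yield piece


async def main():
    snapshots = [" Paris", " Paris.", " Paris. It is", " Paris. It is also"]
    pieces = []
    async for piece in deltas(fake_stream(snapshots)):
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()  # New line after the stream ends
    return pieces


pieces = asyncio.run(main())
```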

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [age] year old [gender] with [work or profession] experience. I have a strong work ethic and dedication to my career. I love to challenge myself and take risks to achieve my goals. I am always learning new things and always seeking feedback to improve. I have a friendly and approachable personality and always try to make people smile. I'm confident and determined to achieve my best performance in everything I do. Thank you for taking the time to meet me. [Name] [Age] [Gender] [Occupation/Profession] [Work Experience] [Education] [Skills/



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Please answer the following question about the statement:

How many inhabitants does Paris have?

There are around 2. 5 million inhabitants in Paris. The population of Paris, the capital of France, is around 2. 5 million in 2021. Paris is a large city located in the French region of Île-de-France. It has a population of over 2. 5 million and is the largest city in France by area and population. Despite its large size, Paris is considered a city of culture and art and is known for its rich history, architecture, and culinary traditions. It is also



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be highly complex and unpredictable, and the trend of development is likely to remain continuous. Here are some possible future trends in artificial intelligence:

1. Deep Learning: Deep learning is likely to play a more significant role in AI development, and it will be able to learn and improve from large amounts of data. It will also be able to learn new concepts and patterns more efficiently than previous methods.

2. Autonomous Robots: Autonomous robots will continue to improve their capabilities and become more common in everyday life. They will be able to perform tasks such as grocery shopping, cleaning, and even driving cars.

3. Explainability: As AI becomes




In [6]:
llm.shutdown()
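
If generation raises an exception in a script, a `shutdown()` call placed at the very end is never reached. A common Python pattern, sketched below, is to wrap the engine in a small context manager that always calls `shutdown()`. `FakeEngine` is a hypothetical stand-in so the example runs without a model, and `managed` is an illustrative helper, not an SGLang API.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in exposing only the methods this sketch needs."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params=None):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": f" <completion for {p!r}>"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed(engine):
    """Yield the engine and call shutdown() afterwards, even on error."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed(engine) as llm:
        llm.generate([])  # raises, simulating a failure mid-run
except ValueError as exc:
    print(f"generation failed: {exc}")
print(f"engine shut down: {engine.is_shut_down}")
```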