# Offline Engine API

SGLang provides a direct inference engine that does not require an HTTP server. This is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
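For context, the sketch below uses only the standard library to show the failure that `nest_asyncio` works around: calling `asyncio.run()` while an event loop is already running (as is the case inside an IPython or Jupyter cell) raises `RuntimeError`. The coroutine names `inner` and `outer` are illustrative and not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, similar to a notebook cell.
    coro = inner()
    try:
        # Starting a second loop from here is rejected by asyncio.
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"


result = asyncio.run(outer())
print(result)
```

Applying `nest_asyncio.apply()` beforehand patches asyncio so that such nested calls are permitted.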

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```
Prompt: Hello, my name is
Generated text:  Patrick and I'm a 24 year old double majoring in both Psychology and Political Science. I am an enthusiastic and passionate individual, but I am also someone who constantly pushes the boundaries of what is considered normal in my field. I have experience in both qualitative and quantitative data analysis and have worked with clients in a variety of fields, ranging from psychology to political science and beyond. I am a big believer in the power of the human brain, and I am always looking for ways to improve my skills and knowledge in order to better serve my clients. I enjoy being involved in research projects, and I am constantly seeking out new ways to
Prompt: The president of the United States is
Generated text:  considering whether to continue with the US$100bn-a-year push to curb carbon emissions, or explore other options for reducing global warming.
The president has put a lot of thought into his decision.
It's not just about the cost of
```
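To give a rough intuition for the two sampling parameters used above, the following sketch applies temperature scaling and nucleus (top-p) filtering to a toy distribution in plain Python. It illustrates the general technique only: the function names and the toy logits are made up for this example, and this is not SGLang's sampler.

```python
import math


def softmax_with_temperature(logits, temperature):
    # Dividing logits by the temperature sharpens (<1) or flattens (>1) the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    # Keep the smallest set of most likely tokens whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize over the kept tokens; every other token is excluded from sampling.
    return {i: probs[i] / mass for i in kept}


toy_logits = [4.0, 3.0, 1.0, -2.0]  # hypothetical scores for four candidate tokens
probs = softmax_with_temperature(toy_logits, temperature=0.8)
nucleus = top_p_filter(probs, top_p=0.95)
print(sorted(nucleus))  # indices of the tokens that remain eligible
```

Lower temperatures and smaller `top_p` values both concentrate sampling on the most likely tokens, which is why the streaming example later in this document, with `temperature` 0.2, produces more repetitive text.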

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history and a vibrant culture. The city is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also a major center for art, music, and literature, and is home to many famous museums, theaters, and restaurants. The city is known for its fashion industry, with many famous designers and boutiques. Paris is a city of contrasts, with its modern architecture and historical landmarks blending together to create a unique and beautiful city. The city is also home to many international organizations

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased focus on ethical AI: As more people become aware of the potential risks of AI, there is a growing emphasis on developing AI that is designed to be ethical and responsible. This includes developing AI that is transparent, accountable, and accountable to human values.

2. Integration of AI with other technologies: AI is likely to become more integrated with other technologies, such as machine learning, natural language processing, and computer vision. This integration could lead to new applications and opportunities for AI, such as
```
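The idea of overlap removal can be sketched without an engine: if a streamed chunk may repeat the tail of the text already received, the repeated prefix is trimmed before the chunk is appended. The helper names and the fake chunk list below are assumptions made for illustration; the actual `stream_and_merge` helper in `sglang.utils` may be implemented differently.

```python
def trim_overlap_sketch(existing_text, new_chunk):
    # Find the longest suffix of existing_text that is also a prefix of new_chunk.
    max_len = min(len(existing_text), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream_sketch(chunks):
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


# Hypothetical stream in which each chunk repeats part of the previous one.
fake_chunks = [" Paris", " Paris, also", " also known as", " as the City of Light."]
merged = merge_stream_sketch(fake_chunks)
print(merged)
```

A suffix/prefix heuristic like this cannot tell a repeated chunk boundary apart from text that legitimately repeats, so it is best suited to streams that are known to resend overlapping text.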



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a 35-year-old freelance graphic designer with over 10 years of experience in the field, specializing in design work for a variety of industries such as healthcare, finance, and retail. I am a highly skilled and efficient designer with a strong attention to detail and a passion for creativity. I have a unique and innovative approach to design that sets me apart from other professionals in my field. I am always looking for new challenges and opportunities to grow and improve my skills. Thank you for asking. Let me know if you have any other questions. I look forward to meeting you. Cheers! [Name] [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the world-renowned city of light, a sprawling metropolis with towering architecture and an endless river of traffic. Its annual Gross Domestic Produc
```
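One reason to prefer the asynchronous API is that the awaiting coroutine can share the event loop with other work. The sketch below shows that pattern with `asyncio.gather`. `StubEngine` is a hypothetical stand-in that only mimics the `async_generate` call shape used above (a list of dicts with a `text` key), so the example runs without a model.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for an engine; it echoes prompts instead of generating."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to wait for the model
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def other_work():
    # Any other coroutine, e.g. logging or I/O, can make progress meanwhile.
    await asyncio.sleep(0.01)
    return "other work finished"


async def run_both():
    engine = StubEngine()
    prompts = ["The capital of France is", "The future of AI is"]
    # gather schedules both coroutines on the same loop and waits for both.
    outputs, status = await asyncio.gather(
        engine.async_generate(prompts, {"temperature": 0.8, "top_p": 0.95}),
        other_work(),
    )
    return prompts, outputs, status


prompts, outputs, status = asyncio.run(run_both())
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
print(status)
```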

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [age] year old [gender] [occupation].
Hello, my name is [Name], and I'm a [age] year old [gender] [occupation]. I love playing sports, hiking, and trying new foods. I like spending time with my family, reading books, and writing short stories.
This is my friendly intro, but please remember that I am just one person. Everyone is unique and unique people are special. Let's keep those unique attributes and special qualities in mind.
I hope you find my personality and interests helpful. Let me know if you need anything. I'm always here

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the south of the country, and serves as the nation's capital, economic, cultural, and political centre. The city is the largest city in France and the second-largest in Europe, and is home to a population of approximately 2.7 million people. Paris is known for its rich history, artistic and cultural scene, and world-class cuisine, and is one of the most visited cities in the world. The city is home to numerous museums, galleries, theaters, and concert halls, and is considered one of the most important cultural and artistic centers in the world. The city's landmarks, such as the Eiff

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain and depends on a variety of factors, including advances in technology, policy, and societal shifts. Some possible future trends in AI include:

1. Increased precision and accuracy: With the development of more powerful AI systems, we may see an increase in the accuracy and precision of predictions, diagnoses, and recommendations.

2. Enhanced natural language processing: AI systems are becoming more capable of understanding and generating human-like language, which could lead to more natural and intuitive interactions between humans and machines.

3. Increased integration with other technologies: AI systems will likely become more integrated with other technologies, such as computers, sensors, and mobile devices, to enable
```
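The loop above streams one prompt at a time. If several streams should be consumed concurrently, one possible pattern is to wrap each `async for` loop in its own coroutine and combine the coroutines with `asyncio.gather`. `fake_stream` below is a hypothetical async generator standing in for a real token stream, so this is a sketch of the control flow only.

```python
import asyncio


async def fake_stream(prompt):
    # Hypothetical stand-in for a token stream: yields a few fixed chunks.
    for chunk in [" [", "Name", "]", "."]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield chunk


async def collect(prompt):
    pieces = []
    async for chunk in fake_stream(prompt):
        pieces.append(chunk)
    return prompt, "".join(pieces)


async def stream_all(prompts):
    # Each prompt gets its own consumer coroutine; gather preserves input order.
    return await asyncio.gather(*(collect(p) for p in prompts))


results = asyncio.run(stream_all(["Hello, my name is", "My friend's name is"]))
for prompt, text in results:
    print(f"Prompt: {prompt}\nGenerated text:{text}")
```

Collecting the chunks per prompt, rather than printing them as they arrive, keeps the output of concurrent streams from interleaving.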




```python
llm.shutdown()
```