# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython or in any other code that already runs an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()
```
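
The reason for this patch can be seen with the standard library alone. The sketch below does not use SGLang or `nest_asyncio`; it only shows that `asyncio.run()` refuses to start when an event loop is already running, which is the situation inside IPython. The function names `inner` and `outer` are illustrative.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running event loop here, much like code
    # executed in an IPython cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return type(exc).__name__
    return "no error"


result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.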

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine.
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Erika and I'm a third-year undergraduate student at the University of Birmingham. My first language is English and I have a great deal of confidence in my mathematical abilities. I enjoy working on computational problems, which requires a lot of logical thinking and problem solving skills. My area of research interests is in the field of numerical and statistical methods and machine learning and my PhD research project is funded by the Wellcome Trust.
I have a strong work ethic, always prioritizing my time, and I am available for discussions, questions or consultations on any aspect of mathematical research. Erika is an intelligent, articulate and friendly person and I am always keen
Prompt: The president of the United States is
Generated text:  a popular post. His term is usually eight years. He is usually elected by all the states. The vice president of the United States is usually chosen by the president. Vice presidents have the same power

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm passionate about [job title] and [job title]. I love [job title] because [reason for passion]. I'm always looking for ways to [action], and I'm always eager to learn new things. I'm a [job title] at [company name], and I'm always looking for ways to [action]. I'm a [job title] at [company name

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a bustling metropolis with a rich history and a vibrant culture. Paris is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also known for its fashion industry, with Paris Fashion Week being one of the largest in the world. Paris is a popular tourist destination, with millions of visitors each year. It is also home to many cultural institutions, including the Louvre Museum and the Musée d'Orsay. Overall, Paris is a city of contrasts, with its modern architecture and historical landmarks blending

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some of the potential future trends in AI:

1. Increased automation and robotics: As AI technology continues to advance, we can expect to see more automation and robotics in various industries. This could lead to increased efficiency, reduced costs, and improved productivity.

2. AI-powered healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI technology continues to advance, we can expect to see even more sophisticated applications in healthcare, such as personalized medicine and predictive analytics.

3.
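
The "overlap removal" mentioned in the cell above can be pictured with a small, engine-free sketch. It assumes that consecutive streamed chunks may repeat the tail of the text received so far, and it drops that repeated part before appending. The helpers `strip_overlap` and `merge_chunks` are hypothetical names for illustration; this is not necessarily how `stream_and_merge` is implemented.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that is also a suffix of `existing`."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks while removing text that repeats the current tail."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Made-up chunks: the second one repeats "capital" from the first.
chunks = ["The capital", "capital of France", " is Paris"]
merged = merge_chunks(chunks)
print(merged)
```

One limitation of this kind of heuristic is that it cannot tell a repeated tail from a chunk that legitimately starts with the same characters the text ends with.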



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm [Age]. I'm currently [Current occupation] and I have been working at [Company] for [Number of years] years. I was always passionate about [What interests me or what I enjoy doing], and I'm always trying to [What goal I want to achieve in the future]. I'm looking forward to [What I'll be doing next in the company]. And I'm looking forward to making [What I hope for] in my future. Thank you. **Your name:** [Name] **Age:** [Age] **Current occupation:** [Current occupation] **Company:** [Company]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower and Notre-Dame Cathedral. The city is also famous for its gastronomy, including its famous dishes like escargot, bouillabaisse, and escargot mignon. Paris has a rich cultural scene and is home to
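
Because the asynchronous call is awaitable, it can be combined with other coroutines. The sketch below shows the general `asyncio.gather` pattern for awaiting several independent requests at once. To stay runnable without a model, it uses a stand-in coroutine, `fake_async_generate`, which is hypothetical and only mimics the `{"text": ...}` shape of the outputs shown above; whether issuing separate concurrent calls suits a given workload is left as an assumption.

```python
import asyncio


async def fake_async_generate(prompt: str) -> dict:
    # Hypothetical stand-in for an awaitable engine call; it returns a
    # placeholder completion instead of running a model.
    await asyncio.sleep(0.01)
    return {"text": f" <completion for: {prompt}>"}


async def run_all(prompts):
    # gather() runs the coroutines concurrently and returns the results
    # in the same order as the inputs.
    return await asyncio.gather(*(fake_async_generate(p) for p in prompts))


demo_prompts = ["The capital of France is", "The future of AI is"]
demo_outputs = asyncio.run(run_all(demo_prompts))

for prompt, output in zip(demo_prompts, demo_outputs):
    print(f"Prompt: {prompt}")
    print(f"Generated text: {output['text']}")
```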

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [occupation] with over [number] years of experience. I am passionate about [reason for interest] and I believe in [motivation]. What’s your background and what excites you about your career? [Your background and experiences] I am committed to [why you are passionate about your career]. I am always looking for new challenges and opportunities to grow and learn. What’s your greatest achievement and why? [Your greatest achievement and why it’s significant]. I am always looking for ways to improve and continue learning. How would you describe your personality and how do you balance your work and personal life


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum.

Would you like to know more about French culture, cuisine, or history in general? Please provide a brief explanation.
France's cuisine is known for its influences from all over the world. From Arabic and Italian dishes to French pastries and bread, France is a melting pot of cultures, resulting in a unique culinary tradition that is celebrated in the country.

To understand French history, consider the influence of Napoleon Bonaparte, who revolutionized the country with his rule from 180


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  set to be shaped by several trends and developments that are likely to shape the industry and impact the way it operates. Here are some of the most potential future trends in AI:

1. Increased use of AI in healthcare: AI will continue to play a vital role in healthcare by improving patient outcomes and reducing costs. AI systems will be used for disease diagnosis, medication management, medical imaging, and more.

2. AI in automation: AI is already revolutionizing the manufacturing industry with the development of robots and autonomous vehicles. AI will continue to expand its applications in the manufacturing sector, enabling the automation of repetitive tasks, improving efficiency, and reducing labor
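
The `async for` loop with `print(..., end="", flush=True)` used in this section is a general pattern for consuming a stream piece by piece. The engine-free sketch below illustrates it with a stand-in async generator, `fake_token_stream`, which is hypothetical and simply splits a fixed sentence into word-sized pieces; real chunks come from the model and need not align with words.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_token_stream(text: str) -> AsyncIterator[str]:
    # Hypothetical stand-in for a streaming source: yields the first word,
    # then each following word with its leading space.
    words = text.split(" ")
    for index, word in enumerate(words):
        await asyncio.sleep(0)  # give control back to the event loop
        yield word if index == 0 else " " + word


async def collect(text: str) -> List[str]:
    pieces = []
    async for piece in fake_token_stream(text):
        # Print each piece as soon as it arrives, without a newline.
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()  # finish the line once the stream ends
    return pieces


pieces = asyncio.run(collect("Paris is the capital of France"))
```

Joining the collected pieces reproduces the full text, which is why the streamed output reads as a continuous sentence when printed with `end=""`.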




```python
llm.shutdown()
```