# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you use the **Offline Engine** in IPython, Jupyter, or any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
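
The patch is needed because `asyncio.run()` refuses to start while another event loop is already running in the same thread, which is the situation inside an IPython or Jupyter cell. The following is a minimal standard-library sketch of that failure. It does not use SGLang, and the coroutine names `inner` and `outer` are hypothetical:

```python
import asyncio


async def inner():
    # Stand-in for any coroutine you might want to run.
    return "done"


async def outer():
    # At this point an event loop is already running, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: not allowed by plain asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return type(exc).__name__
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.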

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Margaret Green and I am a cook from Pennsylvania. I have spent my life in Philadelphia. I have done much of the cooking I have done since I was 9 years old, including summer camps, church and community gatherings, and volunteer work. In the last decade, I have traveled a lot to do my own cooking, and now I have become a travel cook. I have traveled all over the country, including to New York City, New Jersey, Florida, Ohio, Texas, California, and even the Bahamas.
My favorite place is a place that is fairly close to home. I like to live where I can be by the water
Prompt: The president of the United States is
Generated text:  a member of the [ ]
A. Political party
B. Congress
C. Senate
D. Executive branch

To determine the correct answer, let's analyze each option step by step:

A. Political party - This is a political structure, not a member of any branch of the government.
B. Congress - This is a legislative body, not a member of any branch 

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, with a rich history dating back to the Roman Empire and the French Revolution. Paris is home to many famous museums, including the Louvre, the Musée d'Orsay, and the Musée Rodin. The city is also known for its vibrant nightlife and fashion scene, with many famous fashion designers and street artists. Paris is a popular tourist destination, with millions of visitors annually. It is a major hub for international business and trade, with

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased automation: AI will continue to automate tasks that are currently done by humans, such as data analysis, decision-making, and routine maintenance. This will lead to increased efficiency and productivity, but it will also create new jobs that require specialized skills.

2. Enhanced human-computer interaction: AI will continue to improve its ability to understand and respond to human language, emotions, and behaviors. This will lead to more natural and intuitive interactions between humans and machines, and will also create new opportunities for collaboration and communication.

3. AI will become more integrated with other technologies: AI will continue to
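
The "overlap removal" mentioned in the cell above can be pictured as follows: when a newly streamed chunk begins with text that the merged output already ends with, only the non-repeated remainder is appended. The sketch below illustrates that idea in plain Python. The names `strip_overlap` and `merge_chunks` are hypothetical, and the real `stream_and_merge` helper in `sglang.utils` may be implemented differently:

```python
def strip_overlap(existing: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the tail of existing."""
    max_len = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, dropping any prefix that repeats the merged tail."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Made-up chunks with overlapping boundaries (not real engine output).
chunks = ["The capital", " capital of France", " France is Paris"]
merged = merge_chunks(chunks)
print(merged)  # The capital of France is Paris
```

One limitation of this approach is that it cannot tell an accidental overlap from a genuine repetition in the generated text, so it suits streams whose chunks are known to overlap.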



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [occupation] with [number of years] years of experience. I'm here to help you if you need anything and I'm always ready to learn from you. What is your name? (Hint: It might be "I'm here to help you if you need anything and I'm always ready to learn from you. What is your name? (Hint: It might be "I'm here to help you if you need anything and I'm always ready to learn from you. ") ) ) (Please type the name of the character you are looking for in the field of professions, hobbies, or

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a city that is known for its iconic Eiffel Tower, Notre-Dame Cathedral, and a rich cultural and historical heritage.

This statement provides a brief, informative overview of the major aspects of Paris, including its importance, key landmarks, and cultural si
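
An awaitable generation call can be combined with other coroutines, for example to submit several batches and wait for them together. The sketch below shows the pattern with `asyncio.gather`. `FakeEngine` is a hypothetical stand-in that only imitates the `async_generate(prompts, sampling_params)` call shape and the `{"text": ...}` output shape used above, so that the example runs without a GPU. Whether the real engine handles concurrent calls in the same way is an assumption you should verify:

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine; returns canned completions."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do inference
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Create one awaitable per batch and wait for all of them together.
    calls = [engine.async_generate(batch, sampling_params) for batch in batches]
    results = await asyncio.gather(*calls)
    # gather() preserves order, so each prompt stays paired with its output.
    return [
        (prompt, output["text"])
        for batch, outputs in zip(batches, results)
        for prompt, output in zip(batch, outputs)
    ]


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
params = {"temperature": 0.8, "top_p": 0.95}
pairs = asyncio.run(run_batches(FakeEngine(), batches, params))
for prompt, text in pairs:
    print(f"Prompt: {prompt}\nGenerated text:{text}")
```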

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name], and I'm an engineer with a background in [specific field of expertise]. I've been working in the [specific industry or domain] for [number of years] years. In my spare time, I enjoy [example of leisure activity or hobby]. What brings you to this profession and what do you do? As an engineer, I'm always seeking new ways to innovate and improve existing processes. I enjoy experimenting with new technologies and developing new ideas. I'm also passionate about sustainability and I believe that we need to keep our planet healthy and thriving for future generations. What would you like to know about me? [Optional]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city with a population of over 2 million. It is located on the banks of the Seine River in the Île de la Cité. The city is known for its historical and cultural landmarks, including the Louvre Museum and the Notre-Dame Cathedral. Paris is a vibrant and cosmopolitan city with a rich cultural heritage, known for its art, food, and fashion. Its iconic landmarks, including the Eiffel Tower and the Arc de Triomphe, are also popular tourist destinations. Overall, Paris is a major economic and cultural center in France and plays a significant role in the country's political and social

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  incredibly exciting and has the potential to revolutionize a wide range of industries. Here are some potential trends that may emerge:

1. Increased Use of AI in Healthcare: AI is already being used in healthcare to help diagnose diseases, predict medical outcomes, and even personalize treatment plans. In the future, we may see even more advanced AI that can diagnose and treat diseases in real-time, and identify new potential medical treatments.

2. Autonomous vehicles: AI is already being used in autonomous vehicles to help improve safety and reduce accidents. As the technology continues to develop, we may see even more advanced AI that can drive itself, navigate traffic, and even
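
The streamed text arrives as many small chunks, and printing each one with `end=""` is what makes them read as continuous text. The sketch below shows the same consumption pattern with a stand-in async generator, so it runs without the engine. `fake_stream` and `consume` are hypothetical names and are not part of SGLang:

```python
import asyncio


async def fake_stream(chunks):
    """Hypothetical stand-in for a streaming generation call."""
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the loop between chunks
        yield chunk


async def consume(stream):
    pieces = []
    async for chunk in stream:
        # Print each chunk as it arrives, without adding a newline.
        print(chunk, end="", flush=True)
        pieces.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(pieces)


full_text = asyncio.run(consume(fake_stream([" Paris,", " the", " city"])))
```

Collecting the chunks in a list and joining them once at the end also gives you the complete text for later use.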




```python
llm.shutdown()
```
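
If generation can raise an exception, you may want `shutdown()` to run regardless. One possible approach is a small context manager, sketched below. `managed_engine` is a hypothetical helper and `FakeEngine` is a stand-in that only records whether `shutdown()` was called. Neither is part of SGLang:

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine; records whether shutdown() ran."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        raise RuntimeError("simulated failure during generation")

    def shutdown(self):
        self.closed = True


@contextmanager
def managed_engine(engine):
    # Yield the engine, then always call shutdown(), even after an error.
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed_engine(engine) as llm:
        llm.generate(["Hello, my name is"], {"temperature": 0.8})
except RuntimeError:
    pass  # the simulated failure is expected here

print(engine.closed)  # True
```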