# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
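
To see why this is needed: environments such as IPython already run an event loop, and standard `asyncio` refuses to start a second one inside it. The following minimal sketch uses only the standard library to reproduce that error, which `nest_asyncio.apply()` is meant to work around. The coroutine names `inner` and `outer` are illustrative and not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, so a nested
    # asyncio.run() call is rejected with a RuntimeError.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

In a plain Python script there is no outer loop, so the engine examples in this document should not need the patch there.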

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Ruth. I'm a 14-year-old female and I am a new student at this school. My name is Ruth. I'm a 14-year-old female and I am a new student at this school. My name is Ruth. I'm a 14-year-old female and I am a new student at this school. My name is Ruth. I'm a 14-year-old female and I am a new student at this school. My name is Ruth. I'm a 14-year-old female and I am a new student at this school. My name is Ruth. I'm a 14
Prompt: The president of the United States is
Generated text:  trying to decide whether to visit Europe or Asia. He has two options: visit Europe or visit Asia. However, he decides that he should choose the option that is the least likely to cause him to be upset. The president is a professional traveler and has visited Europe and Asia at least once before. Given that he will be in Europe for no more than 2 weeks and Asia for no more than 4 weeks, how many countries will he visit if he decides to choose Europe?
To determine the lea

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm passionate about [reason for interest] and I'm always looking for ways to [action or goal]. I'm a [reason for interest] and I'm always looking for ways to [action or goal]. I'm a [reason for interest] and I'm always looking for ways to [action or goal]. I'm a [reason for interest] and I'm always looking for ways to [action or goal]. I'm a [reason for interest

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as "La Ville Blanche" and "La Ville Blanche de l'Est". It is the largest city in France and the second-largest city in the European Union, with a population of over 2. 5 million people. Paris is known for its rich history, beautiful architecture, and vibrant culture, and is a major center for politics, arts, and commerce in Europe. It is also home to many famous landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. Paris is a popular tourist destination and a cultural hub for the French people. The city is also known

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased focus on ethical AI: As more people become aware of the potential risks of AI, there will be an increased focus on developing AI that is designed to be ethical and responsible. This could involve developing AI that is transparent, accountable, and accountable to human values.

2. Greater use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes, reduce costs, and increase efficiency. As AI becomes more advanced, we can expect to see even greater use of AI in
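
The "overlap removal" mentioned in the cell above can be illustrated with a small, self-contained sketch. It assumes that consecutive streamed chunks may repeat text that was already received, and it trims the repeated prefix before appending. The names `strip_overlap` and `merge_chunks` are hypothetical and written for this illustration; this is not SGLang's implementation of `stream_and_merge`.

```python
def strip_overlap(existing: str, new_chunk: str) -> str:
    """Return the part of new_chunk that is not already at the end of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while dropping text repeated at the chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hand-written chunks with deliberately overlapping boundaries.
chunks = ["The capital", " capital of France", " France is Paris."]
merged = merge_chunks(chunks)
print(merged)
```

A heuristic like this cannot tell a re-sent fragment from text the model genuinely repeated, so it would also drop a legitimate repetition that happens to fall on a chunk boundary.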



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text: ... [Your Name], and I'm [Your Age] years old. I'm a [Your Field of Study] at [Your University/College]. What brings you to [Your Location]?

My name is... [Your Name], and I'm [Your Age] years old. I'm a [Your Field of Study] at [Your University/College]. What brings you to [Your Location]? Hello, my name is [Your Name] and I'm [Your Age] years old. I'm a [Your Field of Study] at [Your University/College]. What brings you to [Your Location]? Hello, my

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is also known as "La Paix" and is the most important city in the country.

The capital of France is Paris, which is also known as "La Paix" and is the most important city in the country. The city is a major cultural, economic, and political center, and it is home to many famous landmarks such as Notre-
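
One reason to prefer the asynchronous call is that other coroutines can make progress while generation is awaited. The sketch below shows that pattern without loading a model. `FakeEngine` is a hypothetical stand-in written for this example, and it only mimics the call shape used above (`async_generate(prompts, sampling_params)` returning a list of dicts with a `text` key); it is not part of SGLang.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine; NOT part of SGLang."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate time spent generating
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def heartbeat(ticks):
    # Any other coroutine; it keeps running while generation is awaited.
    for _ in range(3):
        ticks.append("tick")
        await asyncio.sleep(0.001)


async def run_demo():
    engine = FakeEngine()
    ticks = []
    # gather() schedules both coroutines on the same event loop.
    outputs, _ = await asyncio.gather(
        engine.async_generate(["a", "b"], {"temperature": 0.8}),
        heartbeat(ticks),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(run_demo())
print(outputs, ticks)
```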

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [age] year-old [gender] who is passionate about [career or hobby], [cultural background, such as religion, language, or ethnicity]. I enjoy [job or hobby], [participation in social activities, such as sports, art, music, or volunteering]. I have [number of friends], and I love [vocation or hobby] [sport, hobby, or activity], [any other interests or hobbies]. I am a [professional, hobbyist, or travel enthusiast]. And I am [your ideal self]. I am [positive and confident, mature and intelligent, friendly and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the capital city of France, located on the Île de France, a landlocked island in the Mediterranean Sea, and is the largest city in the European Union and the world's fifth-largest city by population. The city is renowned for its rich history, beautiful architecture, and vibrant culture, and is an important cultural and political center in Europe. It is home to the Louvre Museum, the Eiffel Tower, and the Sacré-Cœur Basilica, among other landmarks. Paris has a diverse population of around 2.7 million people and is an important financial, economic, and political center in Europe.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be highly diverse and transformative. Here are some potential trends that could shape the AI landscape:

1. Increased efficiency and productivity: AI is expected to become more efficient and productive, especially in fields like healthcare and finance, where data is abundant and complex. Advanced AI algorithms could help predict market trends, automate repetitive tasks, and even predict disease outbreaks.

2. More personalized experiences: AI is expected to provide more personalized experiences to consumers, enhancing convenience and accessibility. For example, voice-activated assistants and chatbots could improve customer service and provide personalized recommendations.

3. Enhanced cybersecurity: AI systems are becoming more sophisticated and capable of detecting and


```python
llm.shutdown()
```
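
The "no repeats" idea behind the asynchronous streaming section can be sketched with an async generator that yields only newly arrived text. The sketch assumes, purely for illustration, a stream whose chunks each carry the cumulative text so far under a `text` key; whether a given engine version streams cumulative or incremental text is not stated here. `fake_cumulative_stream` and `stream_deltas` are hypothetical names and not SGLang APIs.

```python
import asyncio


async def fake_cumulative_stream():
    """Hypothetical stream in which each chunk repeats all text so far."""
    for text in ["The", "The future", "The future of AI"]:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield {"text": text}


async def stream_deltas(stream):
    """Yield only the text that was not present in the previous chunk."""
    seen = ""
    async for chunk in stream:
        text = chunk["text"]
        if text.startswith(seen):
            delta = text[len(seen):]
        else:
            delta = text  # fall back to the full chunk if the prefix changed
        seen = text
        if delta:
            yield delta


async def collect():
    return [delta async for delta in stream_deltas(fake_cumulative_stream())]


deltas = asyncio.run(collect())
print(deltas)
```

Printing each delta with `end=""` and `flush=True`, as in the streaming cell above, then shows the text once as it arrives.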