# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
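
As a rough illustration of what "a custom server on top of the engine" can mean, the sketch below shows only the request-handling core: decode a JSON body, call `generate`, and encode a JSON reply. `StandInEngine` and `handle_generate` are hypothetical names introduced here, and the JSON field names are assumptions. The stand-in only mimics the `generate(prompts, sampling_params)` call shape and the `output["text"]` field used later in this document. This is not the linked `custom_server` example.

```python
import json


class StandInEngine:
    """Hypothetical stand-in that mimics the generate() call shape used in this document."""

    def generate(self, prompts, sampling_params):
        # One {"text": ...} dict per prompt, like the outputs printed in this document.
        return [{"text": f" (echo of {p!r})"} for p in prompts]


def handle_generate(engine, raw_body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response by calling the engine."""
    request = json.loads(raw_body)
    prompt = request["prompt"]
    sampling_params = request.get("sampling_params", {})
    outputs = engine.generate([prompt], sampling_params)
    return json.dumps({"text": outputs[0]["text"]}).encode("utf-8")


response = handle_generate(
    StandInEngine(),
    json.dumps({"prompt": "The capital of France is"}).encode("utf-8"),
)
print(response.decode("utf-8"))
```

Any web framework could call a function like `handle_generate` from its route handler. Swapping the stand-in for a real engine object is left as an assumption to verify against the linked example.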



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
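
To see the kind of failure that nested event loops cause, the following sketch uses only the standard library. It calls `asyncio.run()` from inside a coroutine that is already running on an event loop, which is the situation in IPython or Jupyter. It illustrates the general problem and makes no claim about how the engine drives its loop internally. `inner` and `outer` are names made up for this example.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # An event loop is already running here, as it is inside IPython or Jupyter.
    coro = inner()
    try:
        asyncio.run(coro)  # starting a second loop from a running one is rejected
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

The `RuntimeError` captured above is the class of error that `nest_asyncio.apply()` is designed to work around by allowing re-entrant use of the event loop.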

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Jules, I’m 17 years old and I’m from a small village in a mountainous area in the west of China. I’ve been living in China for 2 years and I love the place and it’s been really fun to see how it has changed. Now I’m in the third grade and I have to help my parents and my brother with their school work. My brother’s in high school and I help him with his lessons. I like to help my parents and my brother with their school work and it’s always really fun. What’s your hobby? I love reading. I like to read books and magazines
Prompt: The president of the United States is
Generated text:  52 years older than the president of Central America. The president of Central America is half the age of the president of Asia. If the president of Asia is 3 times the age of the president of Africa, how old is the president of Africa? To determine the age of the president of Africa, we will follow a step-by-step approach using the given information and performing

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in France and the third-largest city in the European Union. Paris is known for its rich history, beautiful architecture, and vibrant culture. It is home to many famous landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. Paris is also known for its fashion industry, art scene, and food culture. It is a popular tourist destination and a major economic center in France. Paris is a city that has a rich history and a unique culture that attracts millions of visitors each year. The city is also home to many important institutions and organizations, including the French Academy of Sciences

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased automation: One of the most significant trends in AI is the increasing automation of tasks that are currently done by humans. This could lead to the creation of more efficient and cost-effective systems that can perform a wide range of tasks with minimal human intervention.

2. Improved privacy and security: As AI systems become more sophisticated, there is a risk that they could be used to collect and analyze personal data without the consent
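
The "overlap removal" mentioned in the cell above can be pictured with a small standalone sketch. It assumes that successive streamed chunks may repeat text that has already been received, and it trims the longest repeated prefix before appending. `strip_overlap` and `merge_chunks` are hypothetical helpers written for this illustration. They are not the implementation of `stream_and_merge`.

```python
def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk without the prefix that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while dropping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Made-up chunks in which each one repeats the tail of the previous text.
chunks = [" Paris", " Paris. It", " It is the capital"]
merged_text = merge_chunks(chunks)
print(merged_text)
```

Checking the longest overlap first keeps a short accidental match from hiding a longer genuine repetition.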



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I am a [Age] year old [Gender] who was born in [Birthplace] and I was raised in [Town/Region]. I have always been fascinated by [Their Major Interest], and I strive to be the best [Their Specialty/Ability]. I am passionate about [Their Passion], and I love to [My Main Activity]. I am a [Your Relationship Status] to this character and have a good sense of humor. I am always trying to learn more about this character and their life. How would you describe your character in a short paragraph? My character is [Name] and I am a [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city is renowned for its rich history, architecture, and artistic culture.

That's correct! The capital of France is Paris, the city is renowned for its rich history, architecture, and artistic culture. With its stunningly 
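
One reason to prefer an awaitable call is that several requests can be in flight at once. The sketch below shows that pattern with `asyncio.gather`. `fake_async_generate` is a hypothetical stand-in with the same call shape as the `async_generate` call above, so the snippet runs without a model. Whether concurrent calls on a real engine behave the same way is an assumption to verify.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    """Hypothetical stand-in: returns one {"text": ...} dict per prompt."""
    await asyncio.sleep(0.01)  # simulated latency
    return [{"text": f" [completion for: {p}]"} for p in prompts]


async def main():
    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    batch_a = ["Hello, my name is"]
    batch_b = ["The capital of France is", "The future of AI is"]
    # Both awaitables make progress concurrently; results keep argument order.
    results_a, results_b = await asyncio.gather(
        fake_async_generate(batch_a, sampling_params),
        fake_async_generate(batch_b, sampling_params),
    )
    return results_a, results_b


results_a, results_b = asyncio.run(main())
for output in results_a + results_b:
    print(output["text"])
```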

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a friendly, outgoing individual. I'm always ready to lend a helping hand and I love spending time with people. What's your name, and what kind of job or hobby are you currently involved with? That way, I can tailor my responses to your specific needs and interests. Hello, my name is [Name] and I'm a friendly, outgoing individual. I'm always ready to lend a helping hand and I love spending time with people. What's your name, and what kind of job or hobby are you currently involved with? That way, I can tailor my responses to your specific needs and interests

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in France and the second-largest city in Europe by population. It is also the seat of government, the capital of the French Republic and the headquarters of the French government.

Paris is known for its rich history, vibrant culture, and architectural wonders. It has numerous museums, theaters, and landmarks like the Eiffel Tower and the Louvre Museum. The city is also famous for its cuisine, particularly French cuisine, which is characterized by its use of fresh ingredients and a love of pasta and seafood. Paris is also home to the French national anthem, "La Marseillaise," which is performed

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by a number of trends, including:

1. Increased use of AI in healthcare: As AI becomes more powerful, it will be used to analyze medical data, predict patient outcomes, and help doctors make more accurate diagnoses. This could lead to earlier detection of diseases, more personalized treatments, and a better understanding of the root causes of diseases.

2. Integration of AI in manufacturing: AI is already being used in manufacturing to optimize production processes, identify quality issues, and make recommendations for improving efficiency and reducing waste.

3. AI-powered automation: As AI technology advances, it is likely to become more prevalent in manufacturing, finance
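
The consumption pattern used in this section, an `async for` loop that prints each chunk as it arrives, can be tried without a model. In the sketch below, `fake_token_stream` is a hypothetical async generator that stands in for the streaming source and `collect` is a made-up helper. Only the standard library is used.

```python
import asyncio


async def fake_token_stream(text, delay=0.0):
    """Hypothetical stand-in for a streaming source: yields one word-sized chunk at a time."""
    for index, word in enumerate(text.split(" ")):
        await asyncio.sleep(delay)  # give control back to the event loop between chunks
        yield word if index == 0 else " " + word


async def collect(stream):
    """Print chunks as they arrive and return the full text."""
    pieces = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        pieces.append(chunk)
    print()  # New line once the stream is exhausted
    return "".join(pieces)


full_text = asyncio.run(collect(fake_token_stream("Paris is the capital of France")))
```

Printing with `end=""` and `flush=True` keeps the chunks on one line, so the text appears incrementally instead of one chunk per line.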




```python
llm.shutdown()
```