# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
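
IPython and Jupyter already run an event loop, and plain `asyncio.run` refuses to start a second loop inside a running one, which is why the patch above is needed there. As a minimal sketch using only the standard library, the check below reports whether the current code is executing inside a running loop. The helper name `loop_is_running` is hypothetical and is not part of SGLang or `nest_asyncio`.

```python
import asyncio


def loop_is_running() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is active.
        return False
    return True


async def check_inside_loop() -> bool:
    # A coroutine always executes inside a running loop.
    return loop_is_running()


if __name__ == "__main__":
    print("Top level:", loop_is_running())
    print("Inside asyncio.run:", asyncio.run(check_inside_loop()))
```

In a plain script the top-level check is false, so `nest_asyncio.apply()` is only relevant in environments where it is true.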

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Ishaan and I am 24 years old. I have a very bright mind and I use my mind to create content that engages my audience and makes them want to learn more about me and my work. I started my career as an independent writer but I have since moved on to collaborate with authors, I have a writing style and my voice are my biggest selling point. I use my craft to create education for both adults and children. I am passionate about giving people the tools and tools to improve their lives, and I believe in the power of the written word. I have written several books and articles for adults and children, and
Prompt: The president of the United States is
Generated text:  running for a second term. He must now choose between three candidates, A, B, and C. The president has interviewed A and B and found that A is the clear winner in a poll, but B has a significant lead over C in a poll. The president plans to vote in a way that maximizes the probability of wi
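
When the prompt list is large, it can help to submit it in slices and keep every output paired with its prompt. The sketch below shows only that bookkeeping. `EchoEngine` is a hypothetical stand-in that mimics the return shape used above (a list of dicts with a `text` key) so the snippet runs without a GPU, and the slice size of 2 is an arbitrary assumption. The real engine already schedules batches on its own, so slicing is optional.

```python
from typing import Dict, Iterator, List, Tuple


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine; it only mimics the output shape."""

    def generate(self, prompts: List[str], sampling_params: dict) -> List[Dict[str, str]]:
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def generate_in_slices(
    engine, prompts: List[str], sampling_params: dict, size: int = 2
) -> List[Tuple[str, str]]:
    """Run generation slice by slice and return (prompt, text) pairs in order."""
    results: List[Tuple[str, str]] = []
    for batch in chunked(prompts, size):
        outputs = engine.generate(batch, sampling_params)
        results.extend(zip(batch, (o["text"] for o in outputs)))
    return results


if __name__ == "__main__":
    demo_prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
    for prompt, text in generate_in_slices(EchoEngine(), demo_prompts, {"temperature": 0.8}):
        print(f"Prompt: {prompt}\nGenerated text:{text}")
```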

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history dating back to the Roman Empire and the Middle Ages. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also famous for its fashion industry, with Paris Fashion Week being one of the largest in the world. Paris is a cultural hub with a diverse population and a thriving economy. It is a popular tourist destination and a major center of politics and government. The city is known for its cuisine, art, and music, and is home to many famous museums

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased focus on ethical AI: As more people become aware of the potential risks and biases in AI systems, there will be a greater emphasis on ethical considerations. This could lead to the development of more transparent and accountable AI systems that are designed to minimize harm to individuals and society.

2. Integration of AI with other technologies: AI is becoming increasingly integrated into other technologies, such as smart homes, self-driving cars, and virtual assistants. This integration could lead to new opportunities for AI to improve the
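
The helper `stream_and_merge` hides how streamed chunks are combined. To illustrate what "overlap removal" can mean, the sketch below appends each chunk after trimming the longest prefix that repeats the end of the text collected so far. This is an assumption about the general technique, not a copy of SGLang's implementation, and `trim_overlap` and `merge_chunks` are hypothetical names.

```python
from typing import Iterable


def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of `new_chunk` that duplicates a suffix of `existing`."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest candidate first so the largest overlap wins.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


if __name__ == "__main__":
    print(merge_chunks(["Hello", "lo wor", "world"]))
```

A plain suffix match like this can also swallow a genuine repetition (for example, two identical consecutive chunks), so it is only suitable when chunks are known to overlap.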



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a professional software developer. I'm passionate about creating high-quality software that can make people's lives easier. My experiences have taught me that communication is key, and I'm always striving to improve my skills in this area. I'm confident in my abilities and excited to bring my knowledge and skills to anyone who wants to work with me. Let's connect and start building something amazing together! #SoftwareDev #Professional #Motivation #Success #Friendly #Collaborative #Self-Improvement #TechSavvy #PositiveAttitude #FutureProof #TechExpert #Creative #ChallengeByExample #Openness

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is located in the northwestern region of France, on the Seine river, on the Île de France. It is known for its historical significance, as it was the
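
Because `async_generate` is awaited, several independent requests can also be issued together rather than as one list. The sketch below does this with `asyncio.gather` and a semaphore that caps the number of in-flight requests. `FakeAsyncEngine` is a hypothetical stand-in that mimics only the call shape, under the assumption that a single prompt string yields a single dict with a `text` key. The cap of 2 is arbitrary.

```python
import asyncio
from typing import Dict, List


class FakeAsyncEngine:
    """Hypothetical stand-in for the engine; it only mimics the call shape."""

    async def async_generate(self, prompt: str, sampling_params: dict) -> Dict[str, str]:
        await asyncio.sleep(0)  # Yield control, as a real request would while waiting.
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(
    engine, prompts: List[str], sampling_params: dict, max_in_flight: int = 2
) -> List[str]:
    """Await one request per prompt, with at most `max_in_flight` running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one_request(prompt: str) -> str:
        async with semaphore:
            output = await engine.async_generate(prompt, sampling_params)
            return output["text"]

    # gather() returns results in the same order as the awaitables it was given.
    return await asyncio.gather(*(one_request(p) for p in prompts))


if __name__ == "__main__":
    texts = asyncio.run(
        generate_concurrently(FakeAsyncEngine(), ["a", "b", "c"], {"temperature": 0.8})
    )
    print(texts)
```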

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [Type] person. I'm confident, reliable, and always ready to help. I enjoy solving problems and making people's lives better. I don't have any hobbies or interests, but I love learning new things and trying new experiences. I'm always looking for ways to help others and make their lives happier. I'm friendly, approachable, and always ready to assist with any questions you may have. Please let me know if you'd like to establish a connection. [Name] [Age] [Occupation] [Why do you want to be a superhero?] [What are your goals

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Please answer the following question about this factual statement:
Does the phrase "Paris, France's capital" have a definition? Yes, Paris is the capital of France.

Please answer the following question about this factual statement:
Is Paris, France's capital? Yes, Paris is the capital of France.

Does this mean that Paris is the largest city in France? No, Paris is not the largest city in France. Paris is the capital of France, but it is not the largest city in the country.

Does Paris, France's capital, have a capital city? Yes, Paris is a city that serves as the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain, and there is no guarantee that it will continue to evolve in the way it has in the past. However, there are several potential trends that could affect the way AI is developed, used, and integrated into society in the coming years.

One potential trend is the increasing reliance on AI for decision-making. As AI becomes more sophisticated and capable of making decisions based on data and patterns, it is likely that decision-making will become more automated and more personalized. This could lead to more efficient and effective decision-making in various industries, but it could also lead to a loss of jobs for human decision-makers.

Another potential trend is the increasing

```python
llm.shutdown()
```
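
The final cell releases the engine with `llm.shutdown()`. In a script, it may be worth guaranteeing that call even when generation raises an exception. The sketch below wraps the engine in a context manager for that purpose. `StubEngine` and `shutting_down` are hypothetical names, and the stub only records that `shutdown()` was called so the snippet runs without a GPU.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether shutdown() ran."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def shutdown(self) -> None:
        self.is_shut_down = True


@contextmanager
def shutting_down(engine):
    """Yield the engine and always call engine.shutdown() on exit."""
    try:
        yield engine
    finally:
        # Runs on normal exit and when the body raises.
        engine.shutdown()


if __name__ == "__main__":
    engine = StubEngine()
    try:
        with shutting_down(engine):
            raise RuntimeError("simulated failure during generation")
    except RuntimeError:
        pass
    print("Engine shut down:", engine.is_shut_down)
```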