# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython or in other code that already runs inside an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
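
One way to see why this matters: the asynchronous examples in this document call `asyncio.run(main())`, and `asyncio.run()` refuses to start while another event loop is already running in the same thread. The sketch below reproduces that failure with the standard library only. The names `inner` and `outer` are illustrative, and `outer` merely stands in for an environment that already owns a running loop.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, similar to a notebook cell.
    coro = inner()
    try:
        # Starting a second loop from inside a running one is rejected.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Applying `nest_asyncio` patches asyncio so that such nested calls are allowed.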

## Advanced Usage

The engine also supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) and [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Lina and I'm a 19-year-old English girl. I live in a big city and I study in a middle school. I like doing some sports, such as swimming and running. But I have a problem, I'm very short. I want to become taller. What should I do? I'm worried about it. Lina
Answer:

Lina, you're not alone. Many young people like Lina have a similar problem. There are several ways to improve your height:

1. Eat a healthy diet: Eating a balanced diet rich in protein, vegetables, and fruits can help you grow taller. Try to
Prompt: The president of the United States is
Generated text:  visiting three different countries. If he spends 8 hours in the first country, 7 hours in the second country, and 5 hours in the third country, what is the total number of hours he spends in these three countries?

To determine the total number of hours the president of the United States spends in these three countries, we need to add the time he spends in each country together. He

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm passionate about [job title] and I'm always looking for ways to [job title] in my work. I'm also a [job title] at [company name], and I'm always looking for ways to [job title] in my work. I'm a [job title] at [company name], and I'm always looking for ways to [job title] in my

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, a historic city with a rich history dating back to the Roman Empire. It is the largest city in France and the second-largest city in the European Union, with a population of over 2.7 million people. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Palace of Versailles. The city is also famous for its fashion industry, with Paris Fashion Week being one of the largest in the world. Paris is a cultural and artistic center, with many museums, theaters, and art galleries. It is also

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased automation: AI is expected to become more and more integrated into our daily lives, from manufacturing to customer service. We may see more automation in areas like manufacturing, healthcare, and transportation, where machines can perform tasks that are currently done by humans.

2. AI ethics and privacy: As AI becomes more integrated into our lives, there will be a growing concern about its impact on society. There will be a need for ethical guidelines and regulations to ensure that AI is used
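
The "overlap removal" mentioned in the example above can be pictured as follows. This is a minimal sketch that assumes each streamed chunk may repeat the tail of the text received so far; the actual `stream_and_merge` helper in `sglang.utils` may be implemented differently. The names `trim_overlap` and `merge_chunks` and the sample chunks are hypothetical.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing."""
    max_len = min(len(existing), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that an earlier chunk already covered."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks: each one repeats the tail of the previous text.
chunks = ["The capital", " capital of France", " France is Paris."]
merged = merge_chunks(chunks)
print(merged)
```

A suffix/prefix match like this cannot tell a duplicated tail apart from a genuine repetition in the generated text, so it is a heuristic and can occasionally trim text that was meant to appear twice.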



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm [Age]. I'm a [Occupation] and I live [Location]. I'm here to introduce myself and ask you all your questions. Please tell me a little bit about yourself and how you came to be in this position. I hope you'll take the time to answer me honestly and kindly, but feel free to give me whatever answer you think is best for me. I'm interested in learning more about you and how you got here. Thank you! [Name] Self-introduction

Hello, my name is [Name] and I'm [Age]. I'm a [Occupation] and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is located on the Seine River and is the largest city in the European Union. 

City facts about Paris:
- Paris is famous for its iconic Eiffel Tower, Marne River, and the Louvre Museum.
- It has been home to many prominent figures including Napoleon Bonaparte 
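
Because `async_generate` is awaited, it can be combined with other coroutines on the same event loop. The sketch below illustrates that pattern with `asyncio.gather`. `FakeEngine` is a hypothetical stand-in, not part of SGLang; it only mimics the call shape used above (a list of prompts in, a list of dicts with a `"text"` key out), so the example runs without a GPU or a model.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the real engine; NOT part of SGLang."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do inference
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_two_batches(engine):
    batch_a = ["Hello, my name is"]
    batch_b = ["The capital of France is", "The future of AI is"]
    params = {"temperature": 0.8, "top_p": 0.95}
    # gather() awaits both coroutines concurrently on one event loop
    # and returns their results in the order they were passed in.
    return await asyncio.gather(
        engine.async_generate(batch_a, params),
        engine.async_generate(batch_b, params),
    )


out_a, out_b = asyncio.run(run_two_batches(FakeEngine()))
print(len(out_a), len(out_b))
```

Whether concurrent submission improves throughput with the real engine depends on how it schedules requests, which this stub does not model.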

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [job title] at [company name]. I love [occupation], and I'm constantly striving to improve [skills]. What's your profession and what kind of work do you do?

I look forward to meeting you and discussing my career journey. Have you had any memorable experiences that inspired you to pursue your passion?

Hi there! I'm a [job title] at [company name], and I've always been a [occupation] and have always wanted to [describe a specific goal or hobby]. What inspired you to pursue your passion in [field]? It's important that you share how you got into

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

I apologize, but there seems to be a misunderstanding. France does not have a capital city called "Paris". The capital of France is indeed Paris, not "Paris".

If you'd like me to provide a factual statement about Paris, it would be:

Paris is the capital city of France.

However, if you meant to ask about the city of Paris, it is in fact the capital, with Nice as the administrative capital.

If you have any other questions about France, I'd be happy to help! Let me know how else I can assist you. Paris is a beautiful city with rich history and stunning architecture

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by the following trends:

1. Increased integration with other technologies: AI will continue to be integrated with other emerging technologies such as blockchain, artificial intelligence, and machine learning. This integration will enable AI to gain more control over and influence these technologies.

2. Greater automation of AI: AI will be increasingly integrated into various processes to automate tasks that were previously done by humans. This automation will enable AI to handle more complex and unpredictable tasks, leading to more efficient and effective use of AI.

3. Development of more ethical AI: AI systems will become more ethical as they are developed to address social, political, and ethical issues
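
When the streamed text should be kept rather than printed, the chunks from an `async for` loop can be collected and joined. The sketch below shows that consumption pattern only. `fake_stream` is a hypothetical async generator standing in for a streaming helper that yields text chunks; it is not an SGLang function, and its chunks are made up for illustration.

```python
import asyncio


async def fake_stream(prompt):
    """Hypothetical stand-in for a streaming helper; yields text chunks."""
    for chunk in [" Paris", ".", " It", " is", " on", " the", " Seine", "."]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield chunk


async def collect(prompt):
    parts = []
    # Consume the stream chunk by chunk, exactly as a print loop would.
    async for chunk in fake_stream(prompt):
        parts.append(chunk)
    return "".join(parts)


text = asyncio.run(collect("The capital of France is"))
print(repr(text))
```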




```python
llm.shutdown()
```