# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop (nested-loop code), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
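
The need for this patch can be seen without SGLang. The sketch below uses only the standard library to show that `asyncio.run()` refuses to start while another event loop is already running, which is the situation inside a notebook cell. The names `inner` and `outer` are hypothetical and exist only for illustration.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, which mirrors
    # what happens when code runs in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: not allowed by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are permitted, which is why it has to be applied before the engine is used in those environments.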

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  John and I am currently in my first year of college. I'm a third year student, and I'm majoring in biochemistry. I'm just a student who is not very successful, but I am not depressed. I'm going to the grocery store, and I saw a bag of generic bread that caught my eye. I'm not familiar with the product, but I'm not going to look at it again until I know what it is. I'm not depressed. I'm not upset at myself. The fact that I'm not depressed or upset is not a factor of the reason why I bought the bread. It was because
Prompt: The president of the United States is
Generated text:  elected for a term of two years. If the president is 65 years old now, how old will the president be when his term ends? 
a) 25 years old
b) 35 years old
c) 45 years old
d) 55 years old
To determine the age of the president when his term ends, we need to understand that the president's term is two years long. If the president is currently 65 years old, we can calculate t
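
For offline batch jobs it is often convenient to persist the results rather than print them. The following is a minimal sketch, assuming only that each result is a dict with a `"text"` key as in the loop above. The helper name `write_results_jsonl` and the sample data are hypothetical, and an in-memory buffer stands in for a real file.

```python
import io
import json


def write_results_jsonl(prompts, outputs, fh):
    """Write one JSON object per prompt/output pair to an open text file."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


# Stand-in data shaped like the results above (a list of dicts with a "text" key).
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]

# In a real job, replace the buffer with open("results.jsonl", "w", encoding="utf-8").
buffer = io.StringIO()
write_results_jsonl(example_prompts, example_outputs, buffer)
print(buffer.getvalue(), end="")
```

One record per line keeps the file easy to append to and to process in a streaming fashion later.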

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm passionate about [job title] and I'm always looking for ways to [job title] and improve my skills. I'm a [job title] and I'm always looking for ways to [job title] and improve my skills. I'm a [job title] and I'm always looking for ways to [job title] and improve my skills. I'm a [job title] and I'm always looking for ways to [job title]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French National Library. Paris is a cultural and economic hub, known for its rich history, art, and cuisine. It is also a popular tourist destination, with millions of visitors each year. The city is known for its fashion industry, with Paris Fashion Week being one of the largest in the world. Paris is also home to the French Parliament, the French Academy of Sciences, and the French National Library. It is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies are expected to continue to improve and become more integrated into our daily lives, from self-driving cars to personalized healthcare and financial services. Additionally, AI is likely to play an increasingly important role in solving some of the world's most complex problems, such as climate change and global health crises. As AI becomes more integrated into our daily lives, it is likely to have a significant impact on the way we work, communicate, and interact with each other. However, there are also potential risks and challenges associated with the development and use
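
The "overlap removal" mentioned in the cell above can be illustrated without the engine. The sketch below is one plausible way to merge streamed chunks whose beginnings may repeat the end of the text received so far. It is not necessarily how `stream_and_merge` is implemented; `strip_overlap`, `merge_chunks`, and the sample chunks are hypothetical.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that already ends `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, skipping text that repeats the current tail."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Made-up chunks in which the second and third repeat part of the earlier text.
fake_stream = [" Paris", " Paris is", " is the capital", " of France."]
merged_text = merge_chunks(fake_stream)
print(merged_text)
```

A suffix/prefix match of this kind cannot tell a repeated fragment from a genuine repetition (for example "ha" followed by "ha"), so it is best suited to streams that are known to resend trailing text.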



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I am a [Type of Expert] with [Number of Years of Experience] years of experience in [Field of Expertise]. My expertise lies in [Specific Skills or Areas of Expertise]. In my spare time, I enjoy [Your Hobby or Interests]. I am passionate about [Your Passion], and I am always eager to learn and grow. I am excited to help you achieve your goals. What is your profession, and what type of expertise do you have in it? [Your Name]. [Type of Expert]. [Number of Years of Experience]. [Specific Skills or Areas of Expertise]. My hobbies

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is located in the south of the country and is one of the most important cities in Europe. It is known for its rich history, beautiful architecture, and lively city life. The city is home to some of the world's most fa
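
An awaitable generation call like the one above can share an event loop with other coroutines. The sketch below shows that pattern with `asyncio.gather`, using a stand-in coroutine so that it runs without a model. `fake_async_generate` and `run_batch` are hypothetical names, and the stand-in only mimics the `{"text": ...}` result shape seen above.

```python
import asyncio


async def fake_async_generate(prompt: str) -> dict:
    """Hypothetical stand-in for an awaitable generation call."""
    await asyncio.sleep(0.01)  # simulate time spent waiting on the engine
    return {"text": f" <completion for: {prompt}>"}


async def run_batch(prompts):
    # gather() returns results in the same order as its arguments,
    # so they can be zipped back onto the prompts.
    return await asyncio.gather(*(fake_async_generate(p) for p in prompts))


demo_prompts = ["Hello, my name is", "The capital of France is"]
demo_outputs = asyncio.run(run_batch(demo_prompts))
for prompt, output in zip(demo_prompts, demo_outputs):
    print(f"{prompt!r} -> {output['text']!r}")
```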

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I'm a [Your Profession] who has been working in the [Field of Expertise] field for [Your Years] years now. My expertise lies in [Specific Skill/Project/Experience]. I am always looking for new challenges and opportunities to learn and grow. What can you tell me about yourself? [Your Name] is a [Your Profession], a [Your Field of Expertise], with a [Your Years] of experience. My expertise lies in [Specific Skill/Project/Experience]. I am always looking for new challenges and opportunities to learn and grow. What can you tell me about yourself? [



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the largest city in France and the 6th largest city in the world. It is home to many of France's most famous landmarks, including the Eiffel Tower and the Louvre Museum. The city is also known for its rich history, including the influence of ancient Roman and Gothic architecture, and its reputation as a cosmopolitan and vibrant city. With a population of around 2. 4 million, Paris is one of the most popular tourist destinations in the world. Paris is a highly culturally and intellectually stimulating city with a long and storied history, and is widely considered one of the most iconic and recognizable



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain, but here are some possible trends to watch:

1. Deep Learning: As AI technology continues to improve, deep learning is becoming more prevalent. Deep learning can recognize and learn from complex patterns and relationships in data, making it potentially the most powerful AI technology of the future.

2. Natural Language Processing: NLP is becoming more advanced and enabling machines to understand and generate human language. This technology is likely to revolutionize chatbots, speech recognition, and virtual assistants.

3. Autonomous Vehicles: With advancements in AI and machine learning, autonomous vehicles are becoming more common in various industries. These vehicles are capable of making decisions and taking actions
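
When streaming, an application often wants to display chunks as they arrive and still keep the full text afterwards. The sketch below shows that consumption pattern with a stand-in async generator, so it runs without a model. `fake_token_stream` and `collect` are hypothetical names and do not reflect the engine's actual streaming interface.

```python
import asyncio


async def fake_token_stream(text: str):
    """Hypothetical stand-in for a streaming call: yields one word at a time."""
    for i, word in enumerate(text.split(" ")):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word if i == 0 else " " + word


async def collect(stream) -> str:
    """Consume an async stream piece by piece and return the joined text."""
    parts = []
    async for piece in stream:
        parts.append(piece)  # a real application could also display `piece` here
    return "".join(parts)


full_text = asyncio.run(collect(fake_token_stream("Paris is the capital of France.")))
print(full_text)
```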




In [6]:
llm.shutdown()