# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where a server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
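
To make the idea of a custom server concrete, here is a minimal sketch that uses only the Python standard library. It is not the linked `custom_server.py` example. `StubEngine`, `handle_generate`, `make_handler`, `serve`, the `/generate` path, and the JSON field names are all hypothetical names chosen for illustration. The stub only mimics the call shape used later in this document, where `generate(prompts, sampling_params)` returns one dict with a `"text"` key per prompt.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubEngine:
    """Hypothetical stand-in for sgl.Engine (assumption: generate() returns
    a list with one {"text": ...} dict per prompt)."""

    def generate(self, prompts, sampling_params=None):
        return [{"text": f" <stub completion for: {p}>"} for p in prompts]


def handle_generate(engine, body: bytes) -> bytes:
    """Decode a JSON request, call the engine, and encode a JSON response."""
    request = json.loads(body)
    outputs = engine.generate(
        request["prompts"], request.get("sampling_params", {})
    )
    return json.dumps({"texts": [o["text"] for o in outputs]}).encode()


def make_handler(engine):
    """Build a request handler class bound to one engine instance."""

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/generate":  # hypothetical route
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            payload = handle_generate(engine, self.rfile.read(length))
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return Handler


def serve(engine, host="127.0.0.1", port=8000):
    """Block forever, serving POST /generate for the given engine."""
    HTTPServer((host, port), make_handler(engine)).serve_forever()
```

Keeping the request logic in `handle_generate` means it can be exercised without opening a socket. Under the assumptions above, a real engine object would be passed to `serve` in place of `StubEngine()`.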



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, Jupyter, or any other code that already runs inside an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
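
The snippet below is a small standard-library sketch of the underlying problem, with no SGLang involved. Calling `asyncio.run()` while an event loop is already running raises `RuntimeError`, which is the situation inside IPython or Jupyter. The names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, similar to a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are allowed, which is why the cells that call `asyncio.run(main())` later in this document can run inside a notebook.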

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Tashkin and I’m a science student majoring in Biochemistry. I have already been studying and researching in this field for several years now. I am constantly studying and talking about the topic of genetic engineering.
I am also a graduate of a medical school. I studied medicine for four years and had experience in medical diagnostics and treatment.
I am also a biologist who researches about viruses.
I am also a blogger and I have a lot of experience in writing, editing and publishing my articles.
My education and training are limited to the areas of medical science, genetics and biochemistry.
In order to get a better understanding of the topic
Prompt: The president of the United States is
Generated text:  trying to decide whether to have a new war on climate change or not. In the past, the president's advisers have said that he would not have a new war on climate change if they had a better reason to believe that it is occurring, but the new 
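
For larger offline jobs it can help to store each prompt together with its output. The sketch below pairs them into records and serializes the records as JSON Lines. `StubEngine`, `run_batch`, and `chunk_size` are hypothetical names. The stub assumes only what the cell above shows, namely that `generate` returns one dict with a `"text"` key per prompt. Chunking is optional here and mainly makes it easy to write partial results as the job progresses.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine (assumption: generate() returns
    one {"text": ...} dict per prompt, in prompt order)."""

    def generate(self, prompts, sampling_params=None):
        return [{"text": f" <stub completion {i}>"} for i, _ in enumerate(prompts)]


def run_batch(engine, prompts, sampling_params, chunk_size=2):
    """Generate in fixed-size chunks and pair each prompt with its text."""
    records = []
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start:start + chunk_size]
        outputs = engine.generate(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            records.append({"prompt": prompt, "text": output["text"]})
    return records


prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
records = run_batch(StubEngine(), prompts, {"temperature": 0.8, "top_p": 0.95})

# One JSON object per line, ready to be written to a .jsonl file.
jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```

With a real engine, `llm` would take the place of `StubEngine()`, assuming its outputs keep the same shape.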

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Job Title] at [Company Name]. I'm excited to meet you and learn more about your interests and experiences. What can you tell me about yourself? I'm a [insert a brief description of your character or personality]. I enjoy [insert a brief description of your hobbies or interests]. What do you like to do in your free time? I like to [insert a brief description of your hobbies or interests]. What do you think is the most important thing in your life? I think it's [insert a brief description of your life goal or aspiration]. I'm looking forward to meeting you and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also a major cultural and economic center, hosting numerous museums, theaters, and restaurants. Paris is a popular tourist destination and a major hub for international business and diplomacy. The city is known for its rich history, art, and cuisine, and is home to many famous French artists and writers. It is a city of contrasts, with its modern architecture and historical landmarks blending together to create a unique and dynamic urban landscape. Paris is a city of love, passion, and innovation, and continues to be

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that could be expected in the future:

1. Increased integration with human intelligence: As AI becomes more advanced, it is likely to become more integrated with human intelligence. This could lead to more sophisticated forms of AI that can learn from and adapt to human behavior and preferences.

2. Greater use of AI in healthcare: AI is already being used in healthcare to help diagnose and treat diseases, but there is a lot of potential for even greater use in the future. AI could be used to
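
The phrase "overlap removal" can be illustrated with a short sketch. It assumes that a streamed chunk may repeat text that was already received, and that the repeated prefix should be dropped before appending. This is an illustration of the idea, not necessarily how `stream_and_merge` is implemented; `trim_overlap` and `merge_chunks` are names chosen for this sketch.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the end of existing."""
    max_len = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping text that overlaps what came before."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


chunks = [" Paris", " Paris, which", " which is known"]
merged = merge_chunks(chunks)
print(repr(merged))  # ' Paris, which is known'
```

One caveat of this heuristic is that it cannot tell an accidental repeat from an intended one. For example, merging `"ha"` with a second `"ha"` yields `"ha"`, so it is only appropriate when chunks are known to overlap.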



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [job title]. I've always been passionate about [field of interest or hobby] and have spent a lot of time [reason for being passionate]. I've always been a [student, employee, etc.], but I've always felt [why] I was drawn to this field. Now, my [description of current job or hobby]. 
As an AI language model, I don't have personal experiences or emotions, but I can generate a short self-introduction for you. Just give me your name and what you do. 

Hello! My name is [Your Name] and I'm a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the largest city in France and the seat of government, culture, and commerce in the country. It is also known as the "City of Light" and "the City of Love" due to its famous landmarks such as the Eiffel Tower, the Louvre Museum, the Notre-Dame Cat
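
An asynchronous API also makes it possible to keep several requests in flight at once. The sketch below shows that pattern with `asyncio.gather` and a stub. `StubAsyncEngine` and `run_groups` are hypothetical names, and whether the real engine accepts overlapping `async_generate` calls is an assumption that this document does not confirm.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine (assumption: async_generate()
    returns one {"text": ...} dict per prompt)."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <stub completion for: {p}>"} for p in prompts]


async def run_groups(engine, groups, sampling_params):
    """Submit every prompt group at once and wait for all of them."""
    tasks = [engine.async_generate(group, sampling_params) for group in groups]
    # gather() returns results in the same order as the submitted tasks.
    return await asyncio.gather(*tasks)


groups = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(
    run_groups(StubAsyncEngine(), groups, {"temperature": 0.8, "top_p": 0.95})
)

for group, outputs in zip(groups, results):
    for prompt, output in zip(group, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Because `gather` preserves submission order, each result list lines up with its prompt group even if the requests finish in a different order.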

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Sarah. I'm a software developer with a background in graphic design. I enjoy working on complex projects, taking on challenging tasks, and building things that make people's lives better. I love coding and learning new tools and techniques to improve my skills. I'm always looking for ways to make the world a better place through technology.
I'm excited to get started on a new project with you. Let's get started! Have a great day!
The self-introduction is neutral and does not contain any political, religious, or cultural values. It is written in a straightforward and non-judgmental tone. The character is introduced



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the northwestern region of the country. It is the largest and most populous city in the European Union and is known for its iconic landmarks such as the Eiffel Tower and Notre-Dame Cathedral. Paris is also one of the most visited cities in the world, with an estimated 67 million tourists visiting annually. The city's rich history, culture, and cuisine make it a popular destination for tourists and locals alike. It has also been identified as a UNESCO World Heritage Site twice, with its iconic landmarks and historical sites attracting millions of visitors each year. Overall, Paris is a globally renowned and beloved city that continues



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting, and there are many potential directions it could take in the years to come. Here are some possible trends in AI:

1. Greater automation: AI will become more efficient and accurate, potentially automating many tasks that would otherwise be done by humans. This could lead to increased productivity and reduced costs, but it could also result in job loss in certain sectors.

2. Enhanced human-AI collaboration: AI will continue to become more integrated with humans, allowing for more efficient and effective collaboration between humans and machines. This could lead to more positive outcomes, such as faster problem-solving and higher-quality decision-making.

3. Enhanced security: AI
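
When streaming, it is often useful to display chunks as they arrive and still keep the full text afterwards. The sketch below shows that pattern with a stub async generator. `stub_stream` is a hypothetical stand-in for a call such as `async_stream_and_merge(llm, prompt, sampling_params)`, its chunk values are made up for illustration, and `collect` is a name chosen for this sketch.

```python
import asyncio


async def stub_stream(prompt):
    """Hypothetical async generator that yields made-up text chunks."""
    for chunk in [" Paris", ",", " located", " in", " France", "."]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield chunk


async def collect(stream):
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # newline after the streamed text
    return "".join(parts)


full_text = asyncio.run(collect(stub_stream("The capital of France is")))
```

Collecting the chunks in a list and joining once at the end avoids repeated string concatenation and leaves `full_text` available for logging or post-processing.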




In [6]:
llm.shutdown()