# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
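As a rough illustration of what "a custom server on top of the engine" can mean, the sketch below isolates the request-handling step: parse a JSON body, call the engine, and serialize the result. `StubEngine` and `handle_generate` are hypothetical names invented for this example, and the JSON field names are assumptions. Only the `generate(prompts, sampling_params)` call shape and the `text` key mirror the usage shown later in this document. The linked script remains the reference for a real server.

```python
import json


class StubEngine:
    """Hypothetical stand-in; a real server would hold an sgl.Engine."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def handle_generate(engine, body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response body."""
    request = json.loads(body)
    prompts = request["prompts"]  # assumed field name
    sampling_params = request.get("sampling_params", {})  # assumed field name
    outputs = engine.generate(prompts, sampling_params)
    return json.dumps({"texts": [o["text"] for o in outputs]}).encode("utf-8")


response = handle_generate(
    StubEngine(),
    json.dumps({"prompts": ["The capital of France is"]}).encode("utf-8"),
)
print(response.decode("utf-8"))
```

Keeping the handler a plain function makes it usable from any web framework's route callback.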



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
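To see why this is needed, the following minimal sketch reproduces the underlying failure using only the standard library: `asyncio.run()` refuses to start when an event loop is already running, which is the situation inside IPython or Jupyter. The coroutine names `inner` and `outer` are illustrative and not part of SGLang.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # outer() already runs inside an event loop, like code in a Jupyter cell.
    coro = inner()
    try:
        asyncio.run(coro)  # Starting a second loop from here is rejected.
    except RuntimeError as exc:
        coro.close()  # Avoid a "coroutine was never awaited" warning.
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.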

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Lina. I am a doctor. I have some interesting news for you. I have a new movie called "Life 100". I've always been a big fan of the movie, and I want to give you a sneak peek of what I have in store for you. I have an interesting story, and I want you to take a moment to watch it. Please, you have to take it very seriously and consider the impact that my movie could have on you. 

The movie is called "Life 100" and it is a thriller. It stars John Candy, a pop star, and a famous actress
Prompt: The president of the United States is
Generated text:  at the helm of the country and holds the office for a term of four years. Please provide a summary of the current president and their main accomplishments, as well as a brief overview of the current federal budget and the current administration's major policy initiatives. Please use the following table to provide further context: 

Current President: Donald Trump
Federal Budget for the Current Year: $

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Occupation]. I'm a [Type of Person] who is [Describe your personality traits here]. I'm always [Describe your favorite hobby or activity]. I'm [Describe your most memorable achievement]. I'm [Describe your most interesting fact about yourself]. I'm [Describe your current location]. I'm [Describe your current mood]. I'm [Describe your current state of mind]. I'm [Describe your current state of health]. I'm [Describe your current physical appearance]. I'm [Describe your current social media presence]. I'm [Describe your current interests and hobbies

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also a major cultural and economic center, hosting numerous museums, theaters, and festivals throughout the year. Paris is a popular tourist destination and a major hub for international business and diplomacy. The city is known for its rich history, art, and cuisine, and is a major tourist destination for visitors from around the world. Paris is often referred to as the "City of Light" due to its vibrant nightlife and modern architecture. The city is also home to many famous museums, including the Louvre and the Mus

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn and adapt to human behavior and preferences. This could lead to more personalized and adaptive AI systems that can better understand and respond to human needs.

2. Enhanced machine learning capabilities: AI is likely to become even more powerful and capable, with the ability to learn from vast amounts of data and make more accurate predictions and decisions. This could lead to more efficient and effective use of resources, as well as more accurate predictions of human behavior and outcomes.

3. Increased
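The phrase "overlap removal" refers to merging streamed chunks without duplicating text that appears both at the end of what was already received and at the start of a new chunk. The sketch below illustrates that idea in plain Python. `merge_chunk` is a hypothetical helper written for this example, not the implementation inside `stream_and_merge`, and the sample chunks are made up.

```python
def merge_chunk(existing: str, new_chunk: str) -> str:
    """Append new_chunk to existing, skipping any overlapping prefix."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return existing + new_chunk[size:]
    return existing + new_chunk


# Made-up chunks whose boundaries overlap by one word each.
chunks = ["The capital", " capital of France", " France is Paris."]
merged = ""
for chunk in chunks:
    merged = merge_chunk(merged, chunk)
print(merged)
```

A genuine repetition across a chunk boundary would also be trimmed, so this heuristic is only appropriate when chunks are known to overlap.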



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emily and I am an experienced journalist with a passion for uncovering the truth. I am always looking for interesting stories and seeking out the most compelling and timely news. I enjoy writing articles for newspapers and online publications, as well as managing a team of reporters. I believe in the importance of keeping society informed and providing a platform for diverse voices to be heard. I am always learning and evolving, always seeking new ways to improve my skills and stay up-to-date with the latest news and trends. Thanks for asking! What's your favorite book or movie to read or watch, and why do you like it so much? As an AI language

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a city renowned for its picturesque streets, iconic landmarks, and rich history. It is located in the Île-de-Fra
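One reason to prefer the asynchronous API is that a coroutine can await generation while other coroutines make progress. The sketch below shows that pattern with `asyncio.gather`. `fake_async_generate` is a hypothetical stand-in that sleeps instead of calling a model, so the example runs without SGLang or a GPU. With a real engine, the awaited call would be `llm.async_generate(...)` as used in this section.

```python
import asyncio


async def fake_async_generate(prompt: str, sampling_params: dict) -> dict:
    # Hypothetical stand-in for an awaitable generation call.
    await asyncio.sleep(0.01)
    return {"text": f" <completion for: {prompt}>"}


async def run_all(prompts: list[str]) -> list[dict]:
    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    tasks = [fake_async_generate(p, sampling_params) for p in prompts]
    # gather() awaits all requests concurrently and preserves input order.
    return await asyncio.gather(*tasks)


results = asyncio.run(run_all(["Hello, my name is", "The capital of France is"]))
for result in results:
    print(result["text"])
```

Because `gather` returns results in the order the awaitables were passed, the outputs can be zipped back to their prompts.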

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I'm a [Age] year-old [Occupation] who currently works at [Your Company]. I have [Number of Years at the Company] years of experience in [Industry or Field] and [Number of Projects/Events] projects I've been involved in that have helped the company achieve its [Financial/Marketing/Quality] goals. I'm always looking for opportunities to grow and learn from my experiences, and I enjoy collaborating with other talented professionals. Please let me know if you'd like to meet me or learn more about me. [Your Name] [LinkedIn Profile] [Contact Information] [Follow Me

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its historical significance and vibrant culture.
Paris, the iconic French capital, is renowned for its rich history, artistic heritage, and bustling culture, making it a must-visit destination for anyone interested in French culture and art. The city's landmarks like the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum are just a few of the iconic sites that attract millions of visitors each year. From gourmet food and wine to the opulent Louvre Museum, Paris offers a comprehensive culinary and cultural experience. Its annual Carnaval, a vibrant celebration of traditional French culture, and its iconic fashion scenes also make it a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  very exciting, with potential applications in many areas including healthcare, transportation, and even entertainment. Here are a few possible future trends:

1. Increased AI-Accelerated Medical Imaging: As AI technology continues to improve, it could be used to analyze medical images more quickly and accurately, leading to earlier and more accurate diagnoses of diseases. This could lead to more effective treatment and a better overall patient experience.

2. AI in Transportation: AI could be used to improve the efficiency and safety of transportation systems like autonomous vehicles and self-driving cars. This could lead to more convenient, faster, and safer travel for millions of people.

3. AI


```python
llm.shutdown()
```