# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
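To see why this is needed, the following sketch uses only the standard library to reproduce the error raised when `asyncio.run` is called while an event loop is already running, which is the situation inside a notebook cell. The names `inner` and `outer` are illustrative and not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # a nested asyncio.run is rejected by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` first patches asyncio so that this kind of nested call is allowed.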

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Jules and I have been a saxophonist for 20 years and have been a bandleader and music educator for 20 years. I have been teaching for 17 years and have been a member of The Neighborhood Saxophone Band for 12 years.
I have been a member of the 13th Street Jazz Band since the fall of 2014 and a member of the 14th Street Jazz Band since the fall of 2015. I have been a member of the Chicago Symphony Saxophone Quartet since the fall of 2015. I have been
Prompt: The president of the United States is
Generated text:  3/4 the age of the president of Canada, who is 30 years old. If both individuals are planning a surprise birthday party for their parents, and each one is planning to invite an equal number of guests, how many guests will each person need to invite if the sum of their ages is 240? Let's break down the problem step by step:

1. The president of the United States is 3/4 the age of the president of Canada. If the president of the United Sta
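In offline batch jobs, a common next step is to persist the results. The sketch below pairs each prompt with its generated text and serializes one JSON object per line. It assumes only that each output is a dict with a `"text"` key, as in the loop above. The helper `to_jsonl` and the sample outputs are hypothetical stand-ins so the example runs without a model.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text, one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Stand-in for the list returned by llm.generate(prompts, sampling_params);
# only the "text" key is assumed here.
prompts = ["The capital of France is", "The future of AI is"]
outputs = [{"text": " Paris."}, {"text": " bright."}]

jsonl = to_jsonl(prompts, outputs)
print(jsonl)
```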

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament and the French Quarter, where many famous French artists and writers have lived and worked. Paris is a bustling metropolis with a rich cultural heritage and is a popular tourist destination. The city is known for its fashion, cuisine, and art, and is a major economic and political center in Europe. Paris is also home to many international organizations and institutions, including the European Union and the United Nations. The city is known for its beautiful architecture, including the Louvre

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some possible future trends in AI:

1. Increased automation and robotics: As AI technology continues to advance, we can expect to see more automation and robotics in various industries, from manufacturing to healthcare. This will lead to increased efficiency, cost savings, and job displacement, but it will also create new opportunities for workers.

2. AI-powered healthcare: AI will play a crucial role in healthcare, with the ability to analyze large amounts of medical data and provide personalized treatment recommendations. This will lead to better diagnoses
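The idea of "overlap removal" can be illustrated without the engine. The sketch below is not the implementation of `stream_and_merge`. It is a minimal stand-alone illustration, assuming that consecutive chunks may repeat the tail of the text received so far. The function names `trim_overlap_sketch` and `merge_chunks` are hypothetical.

```python
def trim_overlap_sketch(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends `existing`."""
    max_overlap = min(len(existing), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


chunks = ["The capital", "capital of France", " is Paris", "Paris."]
print(merge_chunks(chunks))  # prints "The capital of France is Paris."
```

One limitation of this approach is that a chunk which legitimately starts with the same characters that end the previous text would also be trimmed.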



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [role in your story]. I'm an [insert your profession, such as "astronomer", "teacher", "writer", etc.]. I have a [insert your field of expertise, such as "astronomy", "psychology", "history", etc.]. I'm an [insert your age, such as "25", "35", "45", etc.]. I was born in [insert your birthplace, if applicable], and I've lived my entire life in [insert your current home, if applicable]. I'm a [insert your gender, such as

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in the country and the most populous city in Europe, with a population of approximately 2.2 million people. Paris is renowned for its historical landmarks, vibrant arts scene, and exquisite cuisine. It has a rich cultural history and is known for its art museums, theaters, and iconic landmarks such a
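An asynchronous interface is useful when generation has to be interleaved with other work or when requests are issued independently. The sketch below shows one way to run several awaitable calls with a concurrency cap. `fake_async_generate` is a hypothetical stand-in that only mimics a result dict with a `"text"` key; it is not an SGLang function, and `generate_all` is an illustrative helper.

```python
import asyncio


async def fake_async_generate(prompt: str, delay: float = 0.01) -> dict:
    """Hypothetical stand-in for an async engine call."""
    await asyncio.sleep(delay)
    return {"text": f" <completion for: {prompt}>"}


async def generate_all(prompts, max_concurrency: int = 2):
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(prompt):
        # At most max_concurrency workers are inside this block at once.
        async with semaphore:
            return await fake_async_generate(prompt)

    # gather returns results in argument order, so they align with prompts.
    return await asyncio.gather(*(worker(p) for p in prompts))


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
results = asyncio.run(generate_all(prompts))
for prompt, result in zip(prompts, results):
    print(prompt, "->", result["text"])
```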

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Character's Name], and I'm a [Role/Character] at heart! My journey has been filled with challenges, but I've found my way through. I believe in [a specific aspect or trait of yourself] and I'm dedicated to [a specific goal or mission]. I'm confident in my abilities, driven by my passion, and always willing to take on new challenges. Whether I'm solving a complex problem or tackling a difficult task, I always strive to learn and grow, making me a unique and valuable asset to any team. I'm here to inspire and motivate others to achieve their goals, and I'm always looking



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the world-famous city known for its historical landmarks such as Notre-Dame Cathedral, the Eiffel Tower, and the Louvre Museum. Paris is also famous for its cuisine, art, and culture. It is a major tourist destination and a cultural and political center of the country. Its urban architecture is notable for its grandiose buildings and iconic landmarks. Paris has a long history and is an important center of intellectual, scientific, and artistic development. It is home to over 10 million people, making it the largest city in the European Union by population. Paris is the cultural capital of France, with many renowned



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  rapidly evolving, with several potential trends that are shaping the technology landscape. Here are some of the most promising areas of development:

1. Autonomous vehicles: With the increasing demand for mobility, autonomous vehicles will become increasingly important. AI will play a key role in making these vehicles safer, more efficient, and more affordable.

2. Personalized medicine: AI is increasingly being used in the medical field to provide better diagnoses and treatment options. AI algorithms can analyze large amounts of patient data to identify patterns and predict outcomes, allowing doctors to make more accurate and personalized treatments.

3. Natural language processing: This technology will enable AI to understand and interpret natural
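The consumption pattern used in the streaming cell, `async for` over an asynchronous generator while printing each chunk as it arrives, can be tried without a model. In the sketch below, `fake_stream` is a hypothetical stand-in that yields pieces of a fixed string, and `collect` is an illustrative helper that both prints the chunks and returns the full text.

```python
import asyncio


async def fake_stream(text: str):
    """Hypothetical stand-in for a streaming call: yields space-delimited pieces."""
    for piece in text.split(" "):
        await asyncio.sleep(0)  # hand control back to the loop between chunks
        yield piece + " "


async def collect(stream) -> str:
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)  # show each chunk as soon as it arrives
        parts.append(chunk)
    print()
    return "".join(parts).rstrip()


full_text = asyncio.run(collect(fake_stream("Paris is the capital of France.")))
```

Keeping the chunks in a list and joining them once at the end avoids repeated string concatenation for long outputs.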




```python
llm.shutdown()
```
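If generation can raise an exception, it may be worth guaranteeing that `shutdown()` still runs. The sketch below shows one possible pattern using a small context manager. `FakeEngine` is a hypothetical stand-in that exposes only `generate()` and `shutdown()`, and `managed` is an illustrative helper, not an SGLang API.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in exposing only generate() and shutdown()."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params=None):
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed(engine):
    """Yield the engine and always call shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed(engine) as llm:
        llm.generate(["Hello, my name is"])
        raise RuntimeError("simulated failure during a batch job")
except RuntimeError:
    pass

print(engine.is_shut_down)
```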