# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an extra HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
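
To illustrate why this patch is needed, the following standard-library sketch reproduces the situation in a notebook: code that is already running inside an event loop tries to start a second one with `asyncio.run`. The function names `inner` and `outer` are made up for illustration and are not part of SGLang.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running event loop here, as in an IPython/Jupyter cell.
    coro = inner()
    try:
        asyncio.run(coro)  # starting a nested loop is refused by stock asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "nested run succeeded"


message = asyncio.run(outer())
print(message)
```

Without the patch, the nested call fails with a `RuntimeError`; `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.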

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Sava. I am a young student from Serbia. I have been learning English for a year. I really enjoy learning English, and I am really good at it.
I have a great body, strong enough to tackle all kinds of physical exercise. I'm very fit. I can run faster than my sister, who is even faster than me. I have a good sense of hearing and can hear sounds of the night, but I can't hear the sound of my own voice.
I have a very fast walking speed, and I'm also very strong. I am very good at sports. I'm in the top 10 in
Prompt: The president of the United States is
Generated text:  seeking to bring in a new policy that will help reduce the amount of carbon emissions. The policy proposal involves implementing a carbon tax, which is a tax levied on the sale of fossil fuels. The president is considering two different options: 

Option A: Implementing a flat tax rate of 20% on the price of a gallon of gasoline for cars and trucks.

Option B: Implementing a tax ra
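
The `temperature` and `top_p` entries in `sampling_params` control how random the sampled continuation is. The sketch below shows the textbook definitions of temperature scaling and nucleus (top-p) filtering on a toy list of logits. It is a conceptual illustration only: `sample_top_p` is a hypothetical helper, not SGLang's sampler, and it assumes `temperature` is greater than zero.

```python
import math
import random


def sample_top_p(logits, temperature=0.8, top_p=0.95, rng=None):
    """Sample a token index using temperature scaling and top-p filtering."""
    rng = rng or random.Random()
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most likely tokens whose cumulative
    # probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # random.choices treats the weights as relative, so no renormalization is needed.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


toy_logits = [5.0, 1.0, 0.0, -1.0]
print(sample_top_p(toy_logits, temperature=0.8, top_p=0.95, rng=random.Random(0)))
```

Lower values of either parameter concentrate sampling on the most likely tokens, which is why the streaming example that follows uses `temperature=0.2` for more predictable text.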

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [age] year old, and I have a [job title] at [company name]. I'm always looking for new challenges and opportunities to grow and learn. What's your favorite hobby or activity? I love [hobby or activity], and I enjoy spending time with my family and friends. What's your favorite book or movie? I love [book/movie], and I find myself drawn to the characters and storylines in these works.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is the largest city in France and the third-largest city in the world by population. The city is located on the Seine River and is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is known for its rich history, art, and culture, and is a popular tourist destination. The city is also home to many important institutions such as the French Academy of Sciences and the French National Library. Paris is a vibrant and dynamic city with a rich cultural and artistic heritage. It is a major transportation hub and a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a greater emphasis on ethical considerations. This includes issues such as bias, transparency, accountability, and privacy.

2. Integration with other technologies: AI is likely to become more integrated with other technologies, such as machine learning, natural language processing, and computer vision. This will enable AI to perform tasks that are currently the domain of humans, such as image and speech recognition, autonomous vehicles, and personalized
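
The banner printed by this example mentions overlap removal. As a rough idea of what that can mean, the sketch below merges streamed chunks while dropping any prefix of a new chunk that the accumulated text already ends with. The names `drop_overlap` and `merge_chunks` are hypothetical, and this is an assumption about the general technique, not a copy of what `stream_and_merge` does internally.

```python
def drop_overlap(existing: str, chunk: str) -> str:
    """Return `chunk` without the longest prefix that `existing` already ends with."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


# Toy chunks in which each one repeats the tail of the text so far.
chunks = [" Paris", " Paris is", " is the capital"]
print(repr(merge_chunks(chunks)))
```

One caveat of this approach is that it cannot tell an accidental repeat from an intended one, so a chunk that legitimately repeats the previous text would also be trimmed.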



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name]. I'm an [insert age] year old [insert profession] with [insert education or experience] and have been working hard to achieve [insert specific achievement or goal]. I'm always looking to learn and grow, and I'm always looking for ways to make a positive impact on the world. I'm excited to be a part of a team and contribute to our success. Thank you for asking! Let's do this! #SelfIntroduction #Character #TeamPlayer #PositiveImpact #GoalAchievement #PositiveAttitude #TeamMember
I'm [insert name], a [insert profession] with [insert education or

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

This statement is factual and concise. Paris is the capital city of France, located in the northwestern part of the country. It is the largest city in France by area, with a population of over 2.3 mill
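
Asynchronous generation is mostly useful when requests arrive independently and should overlap in time. The sketch below shows that pattern with `asyncio.gather`. So that it runs without a GPU, it uses `StubEngine`, a hypothetical stand-in that only mimics an awaitable `async_generate` call taking a single prompt; it makes no claim about how the real engine schedules requests.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for an engine; it only mimics an awaitable generate call."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do some work
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params):
    # Start one request per prompt and wait for all of them.
    # asyncio.gather returns results in the same order as its inputs.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


prompts = ["The capital of France is", "The future of AI is"]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
outputs = asyncio.run(generate_all(StubEngine(), prompts, sampling_params))
for prompt, output in zip(prompts, outputs):
    print(prompt, "->", output["text"])
```

Because results come back in input order, pairing them with the prompts via `zip` stays correct even when individual requests finish at different times.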

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job title] with [number] years of experience in [specialization]. I enjoy [job-related hobby or interest], and I'm always eager to learn new things. I'm excited to meet you and contribute to your journey in [field or area of interest]. Let's strike up a conversation and see where it takes us. #selfintroduction #jobtitle #specialization #jobrelatedhobby #newconnection #learninggoals #journeystart #opencompanynegotiations #innovation #careerdevelopment #beginning #growthmindset #careerpath

I'm not trying to



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

A concise factual statement about France's capital city is: Paris is the capital of France.

This concise statement captures the essential information required to describe the capital city and its role in the country. It's easy to understand and can be conveyed in a single sentence.

To elaborate, this statement provides:

1. The capital city's name, which is Paris
2. The country it represents, which is France
3. A brief description of the capital city, which is a major city in France

These elements together convey the main information about Paris, making it easier for someone to grasp the core concept and accurately state



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by an explosion of new applications and ways to use AI in the world. One trend is the development of AI that can learn from and adapt to new data, which could lead to more efficient and personalized solutions to complex problems. Another trend is the integration of AI into everyday life, from home automation to transportation to healthcare. Additionally, there is a growing interest in AI that can be used to create new forms of consciousness and self-awareness, as well as new forms of intelligence and understanding. Finally, the development of AI that can operate on quantum computers and handle tasks that are beyond the capabilities of traditional AI systems is also




```python
llm.shutdown()
```
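
In a script, as opposed to a notebook, it can be convenient to make sure `shutdown()` runs even when generation raises an exception. The sketch below shows one possible way to do that with a small context manager. `engine_session` and `StubEngine` are hypothetical names introduced for illustration; with the real engine, the assumption is that you would pass a factory such as `lambda: sgl.Engine(model_path=...)`.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in that only records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def engine_session(factory):
    """Create an engine with `factory` and always call its shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


try:
    with engine_session(StubEngine) as engine:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.closed)
```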