# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
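As a rough illustration of when the patch matters, the sketch below checks whether an event loop is already running before choosing how to execute a coroutine. The helper names `has_running_loop`, `run_coroutine`, and `probe` are hypothetical and not part of SGLang; `nest_asyncio` is imported only in the nested branch.

```python
import asyncio


def has_running_loop() -> bool:
    """Return True when called from inside an already-running event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


def run_coroutine(coro):
    """Run `coro` to completion from synchronous code (hypothetical helper)."""
    if has_running_loop():
        # Nested case (IPython, Jupyter): patch the loop so it can be re-entered.
        import nest_asyncio

        nest_asyncio.apply()
        return asyncio.get_event_loop().run_until_complete(coro)
    # Plain script: no loop is running, so asyncio.run works as usual.
    return asyncio.run(coro)


async def probe() -> bool:
    """Report whether a loop is running at the point this coroutine executes."""
    return has_running_loop()
```

In a plain script `has_running_loop()` returns `False` at module level, so `run_coroutine` falls through to `asyncio.run`; in a notebook cell it would typically take the `nest_asyncio` branch instead.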

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Ashley and I am a massage therapist in Northbrook, Illinois. I am a Registered Massage Therapist and I practice in the Heartland and beyond. I provide many different massage services, including my unique technique of "Neptune Neck Oil Massage." This massage uses my unique technique to help you relax, relieve muscle tension, and help with recovery from neck and shoulder pain. My massage therapy services also include my latest addition to the team, my newest innovation: "The Radial Distraction Massage." This massage offers gentle, soothing movement and pressure on the muscles that help with recovery from shoulder pain, neck pain, and shoulder rotator cuff problems
Prompt: The president of the United States is
Generated text:  a military commander who has a $4$-hour workday and a $50$-hour workweek. The president usually spends $30\%$ of his time in meetings and the remaining time in attending to private business. 

How many minutes does the pres
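Because the batch call returns one output per prompt in the same order, the results can be paired up and stored for later processing. The sketch below assumes only what the cell above shows, namely that each output is a dict with a `"text"` key; `pair_results` and `to_jsonl` are hypothetical helper names, and the sample data in the usage note is made up.

```python
import json
from typing import Dict, List


def pair_results(prompts: List[str], outputs: List[Dict]) -> List[Dict[str, str]]:
    """Pair each prompt with its generated text, stripping outer whitespace."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    return [
        {"prompt": prompt, "completion": output["text"].strip()}
        for prompt, output in zip(prompts, outputs)
    ]


def to_jsonl(records: List[Dict[str, str]]) -> str:
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)
```

For example, `pair_results(["Hello, my name is"], [{"text": " Ashley"}])` yields a single record whose `completion` is `"Ashley"`, and `to_jsonl` turns such records into text that can be written to a file.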

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [job title] at [company name]. I am passionate about [reason for interest in the company]. I am always looking for new challenges and opportunities to grow and learn. I am a [type of person] and I am always willing to put in the extra effort to achieve my goals. I am a [character trait] and I am always ready to help others. I am [character trait] and I am always willing to take risks. I am [character trait] and I am always ready to adapt to new situations. I am [character trait] and I am always ready to learn from

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city that serves as the political, cultural, and economic center of the country. It is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum, as well as its rich history and diverse culture. Paris is also home to many famous museums, including the Musée d'Orsay, the Musée Rodin, and the Musée d'Orsay. The city is also known for its fashion industry, with many famous fashion houses and boutiques located in the city. Paris is a vibrant and dynamic city that is a must-visit for anyone interested in

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some potential trends include:

1. Increased integration of AI into everyday life: AI is already being integrated into our daily lives, from voice assistants like Siri and Alexa to self-driving cars. As AI continues to advance, we can expect to see even more integration into our daily routines.

2. Greater focus on ethical considerations: As AI becomes more integrated into our lives, there will be a greater emphasis on ethical considerations. This includes issues such as bias, privacy, and transparency.

3. AI will become more personalized: As AI becomes
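The cell above mentions overlap removal when merging streamed chunks. One way such merging could work is sketched below: before appending a chunk, drop the longest prefix of it that repeats the end of the text collected so far. This is an illustrative assumption about the idea, not a description of what `stream_and_merge` does internally; `trim_overlap` and `merge_chunks` are names chosen for this sketch.

```python
from typing import Iterable


def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return `new_chunk` without the prefix that duplicates the end of `existing`."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest candidate overlap first so the largest duplicate is removed.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing text repeated across chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged
```

A purely textual check like this cannot tell a duplicated boundary from a genuine repetition (for example `"ha"` followed by `"ha"`), so it only suits streams where chunks may resend already-delivered text.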



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [Position] at [Company]. I am passionate about [Your Passion]. I am a [Your Interests/Enthusiasm]. I am always ready to [Your Accomplishments]. I am [Your Personality]. I look forward to [Your Goals/Goals]. What inspired you to become a [Your Job Title]? I was introduced to [Your Job Title] by [Your Mentor/Teacher], who was passionate about [Your Passion]. I was particularly drawn to [Your Passion] because [Your Expertise or Achievement]. My experience with [Your Company] has given me a strong foundation in [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is a city that is known for its rich history, vibrant culture, and stunning architecture. Paris is a popular tourist destination, known for its museums, theaters, and distinctive landmarks such as the Eiffel Tower and the Louvre 
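An asynchronous API makes it possible to await several requests together. The sketch below shows the pattern with `asyncio.gather`, using a stand-in class so it runs without a GPU: `StubEngine` is hypothetical and only mimics the call shape of `async_generate` seen above (a list of prompts in, a list of dicts with a `"text"` key out). Whether concurrent calls bring any benefit with the real engine is an assumption to verify for your setup.

```python
import asyncio
from typing import Dict, List


class StubEngine:
    """Hypothetical stand-in for the engine; returns canned text per prompt."""

    async def async_generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict]:
        await asyncio.sleep(0)  # yield control, as a real awaitable call would
        return [{"text": f" <completion for: {prompt}>"} for prompt in prompts]


async def generate_groups(engine, groups: List[List[str]], sampling_params: Dict):
    """Issue one async_generate call per prompt group and await them together."""
    tasks = [engine.async_generate(group, sampling_params) for group in groups]
    # gather returns results in the same order as the awaitables it was given.
    return await asyncio.gather(*tasks)
```

Calling `asyncio.run(generate_groups(StubEngine(), [["a", "b"], ["c"]], {"temperature": 0.8}))` returns one result list per group, in the order the groups were passed.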

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert fictional character name], and I'm a [insert fictional character role]! I'm here today to share my experiences and adventures, and to help anyone with questions or concerns. What's your name, and what can you tell me about yourself? Maybe you can ask me about my favorite hobbies, or my favorite foods. Or maybe we can talk about something you've been looking forward to trying out for a while. Whatever you choose to share, please feel free to go ahead! [insert fictional character name] #name# Self-Introduction

Hey there! I'm [insert fictional character name], and I'm here to share



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The city is located in the northwestern part of the country, near the French Riviera and the Mediterranean Sea. It is the largest city in France by population and the most populous city in Europe. Paris is the seat of the French government, the capital of the French department of Île-de-France, and the administrative center of the Île-de-France region. It is also a cultural and educational center in France and is one of the most visited cities in the world. The city is known for its museums, art galleries, and art museums. Paris is also a hub for science and technology, with numerous research institutions



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  rapidly evolving and unpredictable, with new technologies and advancements constantly emerging. Here are some possible future trends in AI:

1. Personalization and Targeted Advertising: As AI continues to improve its ability to understand and interpret human behavior, we can expect to see more personalized advertising and targeted marketing. AI will be able to analyze user data to tailor ads to individual preferences, ultimately increasing the effectiveness of marketing efforts.

2. Autonomous and Self-Driving Vehicles: With advancements in AI, we can expect to see more autonomous and self-driving vehicles on the road. AI will be able to control the vehicles with minimal human input, freeing up human workers to focus
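When streaming asynchronously, it is often useful to both display chunks as they arrive and keep the full text at the end. The sketch below shows that pattern over a generic async iterator of string chunks. `fake_stream` is a hypothetical stand-in that splits a fixed string into word-sized pieces so the example runs without the engine, and `collect_stream` is a name chosen for this sketch.

```python
import asyncio
import re
from typing import AsyncIterator, Callable, List, Optional


async def fake_stream(text: str) -> AsyncIterator[str]:
    """Hypothetical chunk source: yields each word with its leading whitespace."""
    for piece in re.findall(r"\s*\S+", text):
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield piece


async def collect_stream(
    stream: AsyncIterator[str],
    on_chunk: Optional[Callable[[str], None]] = None,
) -> str:
    """Consume a chunk stream, optionally reporting each chunk, and return the full text."""
    parts: List[str] = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)
```

Passing `on_chunk=lambda c: print(c, end="", flush=True)` reproduces the incremental printing used in the cell above while still returning the joined text.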




In [6]:
llm.shutdown()
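
In a longer script, an exception raised before the final cell would skip the `shutdown()` call. A small context manager is one way to make the cleanup unconditional. The sketch below relies only on the `shutdown()` method shown above; `managed_engine` and `StubEngine` are hypothetical names, and the stub exists only so the example runs without loading a model.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self) -> None:
        self.closed = False

    def shutdown(self) -> None:
        self.closed = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call its shutdown() on exit, even after errors."""
    try:
        yield engine
    finally:
        engine.shutdown()
```

With the real engine, the same wrapper would be used as `with managed_engine(sgl.Engine(model_path=...)) as llm:` around the generation calls.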