# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in the form of a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
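
To illustrate the kind of failure that `nest_asyncio` works around, here is a minimal sketch that uses only the standard library and does not touch SGLang. It assumes that the notebook situation can be mimicked by calling `asyncio.run` from a coroutine that is already running; the names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, similar to a
    # Jupyter/IPython cell. asyncio refuses to start a second loop.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


nested_loop_result = asyncio.run(outer())
print(nested_loop_result)
```

Without patching, the nested call is rejected with a `RuntimeError`. Applying `nest_asyncio` at the top of the notebook is the workaround this section recommends.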

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Lila and I am the assistant of a social media marketing company. I am passionate about creating content and brands for a variety of clients, and I enjoy collaborating with people who share my goals. My background is in digital marketing and I have experience working with various social media platforms and digital advertising.
Lila is a great person to work with because she is dedicated, professional, and always ready to go. She takes care of all the details and makes sure that everything is running smoothly. She is always up for a challenge and has a great sense of humor. Lila is always ready to lend a helping hand and she is always open to
Prompt: The president of the United States is
Generated text:  now 45 years old. In 20 years, if the president is any age, the age difference between the president and the president of Russia will be the same. How old will the president of Russia be in 20 years? Let's denote the current age of the president

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? I'm a [insert a short description of your profession or experience here]. I enjoy [insert a short description of your hobbies or interests here]. What's your favorite hobby or activity? I love [insert a short description of your favorite activity here]. What's your favorite book or movie? I love [insert a short description of your favorite book or movie here]. What's your favorite color? I love [insert a short description of your favorite color here

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in Europe and the third-largest city in the world by population. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. The city is also famous for its cuisine, fashion, and art scene. Paris is a cultural and political center of France and a major tourist destination. It is home to many famous landmarks and attractions, including the Louvre, the Eiffel Tower, and the Champs-Élysées. The city is also known for its annual festivals and events, such as the Eiffel Tower Festival and the Paris

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and decision-making processes.

2. Enhanced machine learning capabilities: AI is likely to become more capable of learning from large amounts of data and making more accurate predictions and decisions.

3. Improved natural language processing: AI is likely to become more capable of understanding and generating human-like language, allowing for more natural and intuitive interactions with machines.

4. Increased use of AI in healthcare: AI is likely to become more integrated with healthcare,
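
The "overlap removal" named in the banner above can be pictured with a small standalone sketch. This is not the actual implementation of `stream_and_merge`; it only illustrates one plausible way to merge streamed chunks when a new chunk may repeat the tail of the text received so far. The helper names `trim_overlap_sketch` and `merge_chunks_sketch` and the sample chunks are hypothetical.

```python
def trim_overlap_sketch(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found, keep the chunk as is


def merge_chunks_sketch(chunks):
    """Concatenate chunks while skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


# Hypothetical chunks in which each chunk repeats the end of the previous one.
sample_chunks = [" Paris", " Paris is the", " the capital of France."]
merged_text = merge_chunks_sketch(sample_chunks)
print(repr(merged_text))
```

Checking the longest overlap first avoids cutting too little when a short suffix happens to match as well.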



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [age] year old [Occupation]. I have always been a [Major Profession], but my journey has taken me to other worlds, into the unknown. I have a great interest in [Field of Interest], and I am always looking for ways to [Experience, Skill, or Challenge]. I have a passion for [Favorite Hobby or Sport], and I am always looking for ways to [Improve, Develop, or Learn]. I am a [Realistic or Perfectionist], and I strive to do my best no matter what. I am always looking for ways to [Challenge, Increase,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

Facts about France's capital city:

1. It is the largest city in Europe by population.
2. It has a rich history dating back to ancient times.
3. It is home to the Eiffel Tower and the Louvre Museum.
4. It is known for its French language, cuisin
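
An asynchronous interface also makes it possible to issue several independent requests and await them together, which is the usual pattern in a custom server. The sketch below shows that pattern with `asyncio.gather`. To stay runnable without a GPU it uses `FakeEngine`, a hypothetical stand-in whose `async_generate` merely mimics the call shape used above. Whether the real `async_generate` accepts a single prompt string in exactly this way is an assumption to verify against the API.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine; it does not run a model."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    # Create one coroutine per prompt and await them all together.
    # asyncio.gather returns results in the same order as its inputs.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


demo_prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
demo_outputs = asyncio.run(
    generate_concurrently(FakeEngine(), demo_prompts, {"temperature": 0.8, "top_p": 0.95})
)

for prompt, output in zip(demo_prompts, demo_outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Because `asyncio.gather` preserves input order, the outputs can be zipped back onto the prompts exactly as in the cell above.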

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a computer scientist, specializing in artificial intelligence and machine learning. I've been working in this field for the past 5 years, and I've always been fascinated by the power of algorithms and data analysis. I've also taken a lot of courses and workshops to improve my skills in this area. I'm always interested in finding new ways to solve problems and improve the way we work with data. What would you like to know about me? What are your hobbies and interests? What do you do for work? How do you stay up-to-date with the latest advancements in artificial intelligence? What kind of projects do

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a historic city located in the south-central region of the country.

(Note: The question does not provide any additional context or detail that would extend the factual statement beyond the given information. It is a simple and straightforward statement about the capital of France. However, if you had additional information, such as historical significance, cultural importance, or attractions, it would be better to incorporate that into the response.)

Example:

"The capital of France is Paris, a historic city located in the south-central region of the country."

This statement provides a concise and factual statement about Paris, including its location, historical significance, and cultural importance. If

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be marked by continued progress and innovation, with new applications and developments emerging at an accelerating pace. Some possible future trends in AI include:

1. Greater integration with human decision-making: AI is becoming more integrated with human decision-making, as it becomes more capable of learning from human experiences and providing feedback. This could lead to more personalized and adaptive AI systems that can provide better recommendations and solutions to users.

2. Increased focus on ethical and responsible AI: As AI becomes more widespread and complex, there will be a growing demand for ethical and responsible AI practices. This could involve designing AI systems that are transparent, unbiased, and ensure that



```python
llm.shutdown()
```