# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop (nested-loop code), add the following lines first:
```python
import nest_asyncio

nest_asyncio.apply()

```
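
For context, the error that this patch works around can be reproduced with the standard library alone. The sketch below is illustrative and does not use SGLang: the names `inner` and `outer` are made up for the example, and `outer` plays the role of an environment, such as a notebook, that already runs an event loop.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # Build the coroutine first so it can be closed cleanly if the call fails.
    coro = inner()
    try:
        # An event loop is already running here, so asyncio.run() refuses
        # to start a second one and raises RuntimeError.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` beforehand patches asyncio so that this kind of nested call is permitted.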

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Darius, I'm 22 years old, and I'm a very playful and outgoing person. I like to enjoy different activities and be around people who share my interests. I also have a bit of a competitive streak, and I enjoy competing in various games and sports.

What kind of hobbies do you have and what activities do you enjoy?

As a playful and outgoing person with a competitive streak, I enjoy various activities. I like to play games like chess, checkers, and Go, and I love to travel to new places and try new experiences. I also enjoy spending time with friends and family, and I enjoy listening to
Prompt: The president of the United States is
Generated text:  now trying to secure a new term of office. The president has 200 employees, and each employee has a personal weight that affects their performance, with a standard deviation of 10 pounds. The president wants to select 10 employees for a special program. However, the president is concerned that the aver

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Occupation]. I'm a [Skill/Ability] who has been [Number of Years] years in this field. I'm passionate about [What I Love About My Profession]. I'm always looking for ways to [What I Want to Improve/Develop]. I'm always eager to learn and grow, and I'm always willing to help others. I'm a [Favorite Thing to Do] and I enjoy [What I Like to Do]. I'm a [Favorite Book/Artist/Artist/Book] and I love [Why I Love It]. I'm a [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history dating back to the Roman Empire and the Middle Ages. Paris is famous for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also known for its vibrant culture, including its annual festivals and events, and its role as a major transportation hub. Paris is a popular tourist destination and a cultural center, attracting millions of visitors each year. The city is also home to many renowned museums, including the Louvre and the Musée d'Orsay. Paris is a city of

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that could be expected in the future:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI technology continues to improve, we can expect to see even more widespread use of AI in healthcare, particularly in areas such as diagnosis, treatment planning, and patient care.

2. Greater integration of AI into everyday life: AI is already being integrated into many aspects of our lives, from self-driving cars to virtual assistants.
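
The "overlap removal" mentioned in the cell above can be pictured as follows. This is a minimal sketch under an assumption: that a streamed chunk may repeat text that was already received. The helper names `trim_overlap` and `merge_stream` and the fake chunks are illustrative; the actual `stream_and_merge` helper in `sglang.utils` may be implemented differently.

```python
from typing import Dict, Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks: Iterable[Dict[str, str]]) -> str:
    """Accumulate streamed chunks into one string without repeated text."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk["text"])
    return merged


# Hypothetical chunks: each one repeats part of what was already streamed.
fake_chunks = [
    {"text": " Paris"},
    {"text": " Paris, also known"},
    {"text": " known as the City of Light."},
]
merged_text = merge_stream(fake_chunks)
print(merged_text)
```

Note that a suffix/prefix heuristic like this one can also trim a genuine repetition (for example, "ha" followed by "ha"), so it is best suited to streams that are known to resend earlier text.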



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name] and I'm a [insert profession]. I've always been [insert a trait that reflects your character], and I'm a big fan of [insert something you enjoy doing]. How are you, [insert your name]? I'm excited to meet you and see all that you have to offer. 

Remember, it's always a pleasure to get to know someone! Let's chat about [insert something specific about your profession or interests]. Have a great day! 

Feel free to add any additional information or details that may help the person get to know the character better. Good luck! 

---

Keep it brief and friendly

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located on the River Seine in the suburbs of Paris.

The capital of France is Paris, located on the River Seine in the suburbs of Paris. This city is known for its rich history, beautiful
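
One reason to prefer the asynchronous call is that other coroutines can make progress while a batch is being generated. The sketch below illustrates that idea with a stub only: `StubEngine` and `heartbeat` are hypothetical names, and the stub merely imitates the awaitable call shape of `llm.async_generate` used above, not the behavior of the real engine.

```python
import asyncio
from typing import Dict, List


class StubEngine:
    """Hypothetical stand-in that only imitates an awaitable generate call."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict[str, float]
    ) -> List[Dict[str, str]]:
        await asyncio.sleep(0.05)  # pretend the model is busy
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks: List[int]) -> None:
    """Unrelated work that runs while the batch is pending."""
    for i in range(3):
        ticks.append(i)
        await asyncio.sleep(0.01)


async def run_batch():
    engine = StubEngine()
    ticks: List[int] = []
    # gather() schedules both coroutines on the same event loop.
    outputs, _ = await asyncio.gather(
        engine.async_generate(["a", "b"], {"temperature": 0.8}),
        heartbeat(ticks),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(run_batch())
print(len(outputs), ticks)
```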

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [Role] [Character]. I love [Describe your favorite hobby or passion]. What's your name and how did you get into this character?

Please feel free to customize your introduction to better reflect your personality or interests. Use [Hobbies, Interests, Experiences, Personality Traits] to elaborate on your background and character traits. Keep your introduction brief and to the point.

Example: Hi there! I'm a professional graphic designer [Name] from [Location] who is passionate about [Describe your favorite hobby or passion]. I'm [Name] and I love [Describe your favorite hobby or

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is also known as "the city of a thousand gardens" due to its extensive network of parks and gardens.

(Note: The statement provided seems to be missing some key information, such as the year of the statement or the specific context in which it is given. This would be important for a more complete and accurate response.)

The capital of France is Paris. Known for its extensive network of parks and gardens, it is also known as "the city of a thousand gardens" due to the vast number of designated gardens in the city. It is a city of historical and artistic significance, and has played an important role in

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  quite exciting, and there are many potential trends that could shape the field in the coming years. Here are some possible trends:

1. Increased precision: AI will continue to get more precise, allowing for even higher levels of accuracy in tasks like image recognition, natural language processing, and predictive analytics.

2. Larger language models: Language models, such as chatbots and virtual assistants, will continue to get even more sophisticated, allowing for more natural and contextually appropriate responses to user queries.

3. Autonomous vehicles: Autonomous vehicles will become more widespread, with the ability to navigate and make decisions on their own based on real-time data.

4.
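
The asynchronous streaming loop above prints each cleaned chunk as soon as it arrives. The following sketch shows that consumption pattern with a stub source, assuming each chunk carries the cumulative text so far. `fake_stream`, `stream_without_repeats`, and `trim_overlap` are illustrative names and are not the implementation of `async_stream_and_merge`.

```python
import asyncio
from typing import AsyncIterator, Dict, List


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


async def fake_stream() -> AsyncIterator[Dict[str, str]]:
    """Hypothetical stand-in for a streaming engine call; yields cumulative text."""
    for text in [" The", " The future", " The future of AI"]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield {"text": text}


async def stream_without_repeats(
    chunks: AsyncIterator[Dict[str, str]],
) -> AsyncIterator[str]:
    """Yield only the part of each chunk that has not been seen yet."""
    seen = ""
    async for chunk in chunks:
        cleaned = trim_overlap(seen, chunk["text"])
        seen += cleaned
        yield cleaned


async def collect() -> List[str]:
    return [piece async for piece in stream_without_repeats(fake_stream())]


pieces = asyncio.run(collect())
print(pieces)
```

Because the helper is an async generator, the caller decides what to do with each piece: print it immediately, as in the cell above, or collect the pieces and join them at the end.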




In [6]:
llm.shutdown()