# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
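
To see why this matters, the stdlib-only sketch below reproduces the underlying failure: calling `asyncio.run()` while an event loop is already running (the situation inside a notebook cell) raises `RuntimeError`. The names `inner` and `outer` are illustrative and not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # starting a second loop from inside the first one
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches asyncio so that this kind of nested call is permitted, which is why it is applied before the engine is used in such environments.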

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# Passing a list of prompts returns one output dict per prompt, in order
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Tom Smith. I'm a student at the University of the Arts, London. Here's a passage that I've written about myself:
I'm a person who is passionate about art and culture. I'm always looking for ways to create new and innovative ideas. I'm always trying to push the boundaries of what's possible. I've always been fascinated by different styles of art and how they relate to each other.
I'm also passionate about educating people about art and culture. I'm always eager to share my knowledge and help others discover the beauty and complexity of art. I'm determined to make art a subject that can be enjoyed by everyone
Prompt: The president of the United States is
Generated text:  a highly influential political figure. How does this influence play out in the federal government?

There are a few ways in which the presidency plays out in the federal government:

  1. The president has the power to appoint federal judges and lawyers.
  2. The president has t
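
The `temperature` and `top_p` entries in `sampling_params` control how each next token is drawn. As a rough illustration of the textbook definitions (this is not SGLang's sampler, whose implementation may differ in detail), the sketch below applies temperature scaling and nucleus (top-p) filtering to a toy list of logits. The function name `sample_top_p` and the logits are hypothetical.

```python
import math
import random


def sample_top_p(logits, temperature=0.8, top_p=0.95, rng=None):
    """Sample one token index using temperature scaling and top-p filtering."""
    rng = rng or random.Random()
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most likely tokens whose cumulative
    # probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Sample among the kept tokens in proportion to their probabilities.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


toy_logits = [4.0, 3.0, 0.5, -2.0]
rng = random.Random(0)
samples = [sample_top_p(toy_logits, rng=rng) for _ in range(1000)]
print({i: samples.count(i) for i in sorted(set(samples))})
```

With these toy logits, the two most likely tokens already cover more than 95% of the probability mass, so the two unlikely tokens are never sampled. Lowering `top_p` or `temperature` makes the output more deterministic; raising them makes it more varied.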

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

# Stream each prompt separately and merge the streamed chunks into one string
for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Gender] who has always been fascinated by [Interest or hobby]. I'm always looking for new experiences and learning new things, and I'm always eager to try new things. I'm always looking for new adventures and exciting new places to explore. I'm always looking for new ways to make my life more interesting and enjoyable. I'm always looking for new ways to learn and grow as a person. I'm always looking for new ways to make my life more fulfilling and meaningful. I'm always looking for new ways to connect with others and make new friends. I'm always

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous museums, theaters, and festivals throughout the year. Paris is a popular tourist destination, known for its rich history, beautiful architecture, and vibrant nightlife. The city is home to many notable French artists, writers, and musicians, and is a major center for the arts and culture in Europe. Paris is also known for its cuisine, with many famous restaurants and food festivals throughout the year. Overall, Paris is a vibrant and dynamic city that is a must

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way that AI is used and developed. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased Use of AI in Healthcare: AI is already being used in healthcare to improve patient outcomes, reduce costs, and improve efficiency. As AI technology continues to improve, we can expect to see even more widespread use of AI in healthcare.

2. Increased Use of AI in Finance: AI is already being used in finance to improve fraud detection, risk management, and investment decision-making. As AI technology continues to improve, we can
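
The idea of "overlap removal" when merging streamed chunks can be sketched as follows. This is a minimal illustration, assuming that a new chunk may begin with text that has already been received; whether `stream_and_merge` works exactly this way is an assumption, and the helpers `strip_overlap` and `merge_chunks` as well as the fake stream are hypothetical.

```python
def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# A fake stream in which each chunk repeats the tail of the previous one.
fake_stream = [" Paris", " Paris is", " is the capital", " capital of France."]
print(merge_chunks(fake_stream))
```

One limitation of this naive approach is that legitimately repeated text at a chunk boundary is also dropped: merging `["ha", "ha"]` yields `"ha"`.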



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the whole batch; one output dict is returned per prompt
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm [Age]. I'm a [occupation], and I'm currently [current job title]. I'm really excited to meet you and tell you about my adventures and experiences. 

I'm a [job title] because [reason for job title] and I've always had a passion for [occupation]. I enjoy [job title] because [reason for interest] and I have always dreamed of [job title] because [reason for passion]. 

What excites me most about my job is [reason for job excitement]. I love [job title] because [reason for job excitement] and I hope to make

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, an ancient city located in the region of Haut-Rhin on the French Riviera. It is known for its rich history, beautiful architecture, and lively culture. Paris is home to iconic landmarks such as the Eiffel Tower, the Louvre Museum, and the Ar
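
A practical reason to use the asynchronous form is that other coroutines can make progress while generation is awaited. The sketch below shows this with `asyncio.gather`, using a stand-in coroutine instead of the real engine so that it runs without a GPU. `fake_async_generate` and `heartbeat` are hypothetical names, and the output format (one dict with a `text` key per prompt) simply mirrors the example above.

```python
import asyncio


async def fake_async_generate(prompts):
    """Stand-in for an awaitable batch generation call."""
    await asyncio.sleep(0.01)  # simulate time spent generating
    return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(log):
    """Unrelated coroutine that keeps running while generation is awaited."""
    for i in range(3):
        log.append(f"tick {i}")
        await asyncio.sleep(0.001)


async def main():
    log = []
    # Run both coroutines concurrently on the same event loop.
    outputs, _ = await asyncio.gather(
        fake_async_generate(["a", "b"]),
        heartbeat(log),
    )
    return outputs, log


outputs, log = asyncio.run(main())
print(len(outputs), log)
```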

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name] and I'm a [insert profession] at [insert location]. I'm excited to meet you! As a writer and businesswoman, I'm always looking to create something new and different. I'm driven by a love for adventure and a desire to challenge myself in all aspects of life. My background in marketing and social media management has given me a unique perspective on how to connect with and influence others. I'm also passionate about using my writing to help others achieve their goals and give them the confidence to succeed. If you're looking to read about a new book, meet an interesting person, or have a great time

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris, officially known as the City of Paris and the French Capital of France, is the capital city of France. It is located on the northern bank of the Seine River and is the largest city in France by population, containing 2. 9 million people. The city is the seat of government, legislature, and highest administrative authority in France. Its skyline features tall buildings such as the Eiffel Tower, Musée d'Orsay, Notre-Dame Cathedral, and the Louvre Museum. Paris is known for its historical landmarks, vibrant music, chocolate, and elegant French cuisine. It is also renowned for its

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and constantly evolving. Here are some potential trends and developments that could shape the field:

1. Increased transparency and accountability: As AI becomes more advanced, we may see more emphasis on transparency and accountability in the development and use of AI systems. This could include increased data privacy and security measures, as well as clear communication about how AI decisions are made.

2. Enhanced emotional intelligence: AI is already capable of processing vast amounts of information, but there's still room for improvement in terms of emotional intelligence and empathy. As AI becomes more sophisticated, we may see more focus on developing algorithms that can understand and respond to human emotions and nuances.
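
The "no repeats" behaviour can be pictured as an async generator that wraps a stream and yields only text that has not been printed yet. The sketch below assumes, purely for illustration, that each streamed chunk carries the full text generated so far; the real chunk format and the internals of `async_stream_and_merge` may differ. `fake_cumulative_stream` and `only_new_text` are hypothetical names.

```python
import asyncio


async def fake_cumulative_stream():
    """Stand-in stream; each item holds the full text so far (an assumption)."""
    for text in [" Paris", " Paris is", " Paris is the capital."]:
        await asyncio.sleep(0)  # hand control back to the loop between chunks
        yield {"text": text}


async def only_new_text(stream):
    """Yield just the part of each cumulative chunk not emitted before."""
    emitted = 0
    async for chunk in stream:
        text = chunk["text"]
        yield text[emitted:]
        emitted = len(text)


async def main():
    pieces = []
    async for piece in only_new_text(fake_cumulative_stream()):
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()
    return pieces


pieces = asyncio.run(main())
```

Tracking the emitted length is simpler than searching for overlaps, but it only works if every chunk really is cumulative.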






```python
llm.shutdown()
```