# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
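
As a rough illustration of the custom-server use case, the core of a request handler can be a plain function that validates a JSON payload and forwards it to the engine. This sketch is not taken from `custom_server.py`: `handle_generate`, the payload fields, and `EchoEngine` (a stand-in so the snippet runs without a model) are all hypothetical. Only the `generate(prompts, sampling_params)` call shape and the `text` key mirror the usage shown in this document.

```python
import json
from typing import Any


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a model."""

    def generate(
        self, prompts: list[str], sampling_params: dict[str, Any]
    ) -> list[dict[str, str]]:
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


def handle_generate(engine: Any, body: str) -> tuple[int, str]:
    """Turn a JSON request body into an (HTTP status, JSON response) pair."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return 400, json.dumps({"error": "body is not valid JSON"})

    prompts = payload.get("prompts") if isinstance(payload, dict) else None
    if not isinstance(prompts, list) or not all(isinstance(p, str) for p in prompts):
        return 400, json.dumps({"error": "'prompts' must be a list of strings"})

    # Assumed payload field; defaults to an empty dict when omitted.
    sampling_params = payload.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    return 200, json.dumps({"texts": [o["text"] for o in outputs]})


status, response = handle_generate(EchoEngine(), '{"prompts": ["Hi"]}')
print(status, response)
```

Keeping the handler independent of any web framework makes it easy to unit-test and to mount behind whichever server library you prefer.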



## Nest Asyncio
To use the **Offline Engine** in IPython, Jupyter, or any other environment where an event loop is already running, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
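
To see why the patch is needed, the following standard-library sketch reproduces the situation inside a notebook cell: an event loop is already running, and a nested `asyncio.run()` call is rejected. The function names `inner` and `outer` are illustrative only, and `nest_asyncio` is deliberately not applied here.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are inside a running event loop here, like code in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested run: rejected by plain asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` beforehand patches the loop so that such nested calls are allowed.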

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Maarten de Boer and I am 54 years old and have been in this field of IT for 25 years. I've studied IT, and I'm a computer scientist. This year, I've decided to leave IT and change my career direction to work with virtual and augmented reality technology. I have some experience with the Oculus Rift, a popular VR headset, and my research focuses on virtual environments. As a virtual assistant, I'm looking for a new project that I can work on, which will allow me to apply my knowledge and experience. Can you please provide me with some ideas for virtual environments that I can create
Prompt: The president of the United States is
Generated text:  5 feet 3 inches tall. How tall is the president in feet if the height conversion from inches to feet is 12 inches per foot? To determine the president's height in feet, we need to convert the height given in inches to feet. We know that 12 inches per foot converts the height from inches to feet.

First, w
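
The engine schedules the whole prompt list itself, so splitting it is not required. Still, for very large workloads it can be convenient to submit prompts in chunks, for example to save results between chunks. The sketch below assumes only what the cell above shows: a callable taking `(prompts, sampling_params)` and returning one dict with a `text` key per prompt, in order. `generate_in_batches`, `batched`, and `fake_generate` are hypothetical names, and `fake_generate` stands in for `llm.generate` so the snippet runs without a model.

```python
from typing import Any, Callable, Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def generate_in_batches(
    generate_fn: Callable[[list[str], dict[str, Any]], list[dict[str, str]]],
    prompts: list[str],
    sampling_params: dict[str, Any],
    batch_size: int,
) -> list[tuple[str, str]]:
    """Run generate_fn chunk by chunk and return (prompt, text) pairs in order."""
    results: list[tuple[str, str]] = []
    for batch in batched(prompts, batch_size):
        outputs = generate_fn(batch, sampling_params)
        results.extend(zip(batch, (o["text"] for o in outputs)))
    return results


def fake_generate(prompts: list[str], sampling_params: dict[str, Any]) -> list[dict[str, str]]:
    """Stand-in for llm.generate; returns a dummy completion per prompt."""
    return [{"text": f" [{len(p)} chars]"} for p in prompts]


pairs = generate_in_batches(
    fake_generate, ["a", "bb", "ccc"], {"temperature": 0.8}, batch_size=2
)
print(pairs)
```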

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Occupation]. I'm a [Skill] with [Number] years of experience in [Field]. I'm passionate about [What I Love to Do]. I'm always looking for new challenges and opportunities to grow and learn. I'm [What I Do Best]. I'm [What I Can Do]. I'm excited to meet you and learn more about you. [Name] [Age] [Occupation] [Skill] [Number] [Field] [What I Love to Do] [What I Can Do] [What I Do Best] [What I Can

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history dating back to the Roman Empire and the Middle Ages. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. The city is also famous for its fashion industry, art, and cuisine. Paris is a popular tourist destination and a major economic center in France. It is home to many famous museums, theaters, and restaurants. The city is also known for its annual festivals and cultural events. Paris is a vibrant and dynamic city that is a must-visit for anyone interested in French

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased automation: AI is likely to become more prevalent in many industries, with automation becoming the norm rather than the exception. This will lead to the development of new types of AI that can perform tasks that are currently done by humans, such as data analysis, decision-making, and problem-solving.

2. Enhanced privacy and security: As AI becomes more prevalent, there will be a need to ensure that it is used in a way that respects privacy and security. This will require the development of new
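
The "overlap removal" named in the banner above can be pictured as follows: when a new chunk begins with text that the accumulated output already ends with, the repeated part is dropped before appending. This is a minimal sketch of that idea, not necessarily how `stream_and_merge` is implemented; `trim_overlap` and `merge_chunks` are illustrative names.

```python
def trim_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of chunk that is also a suffix of existing."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: list[str]) -> str:
    """Concatenate chunks while removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged = merge_chunks(["Hello", "lo wor", "world"])
print(merged)  # Hello world
```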



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [career objective]. I am a [position] specialist with [number of years of experience]. I bring [specific skill or expertise] to my career. What excites you the most about your career goals? I am always looking for opportunities to grow and challenge myself. What do you enjoy most about your job? I enjoy the variety of tasks and the ability to work in a team. Lastly, what are your goals for the next five years? I am excited to continue learning and growing as a professional. What's your future plan for the next five years? I aim to continue my education and seek

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is the largest city in the country and home to many of the country's cultural and historical landmarks. The city is known for its rich history, including its role as the cap
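
When requests arrive independently rather than as one list, the asynchronous API can be combined with standard `asyncio` tools. The sketch below issues one coroutine per prompt and caps the number in flight with a semaphore. It makes two assumptions not shown in this document: that `async_generate` accepts a single prompt string, and that it then returns a single dict. `FakeAsyncEngine` and `generate_all` are hypothetical, with the former standing in for the engine so the snippet runs without a model.

```python
import asyncio
from typing import Any


class FakeAsyncEngine:
    """Hypothetical stand-in for the engine's async interface."""

    async def async_generate(self, prompt: str, sampling_params: dict[str, Any]) -> dict[str, str]:
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion of {prompt!r}>"}


async def generate_all(
    engine: Any,
    prompts: list[str],
    sampling_params: dict[str, Any],
    max_in_flight: int = 2,
) -> list[dict[str, str]]:
    """Run one request per prompt, at most max_in_flight at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt: str) -> dict[str, str]:
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather returns results in the same order as the input prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


results = asyncio.run(
    generate_all(FakeAsyncEngine(), ["a", "b", "c"], {"temperature": 0.8})
)
texts = [r["text"] for r in results]
print(texts)
```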

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name], and I'm a 32-year-old software engineer with a passion for both technology and the arts. I'm excited to learn, grow, and collaborate with others. I enjoy exploring new technologies and trying out new projects to see what works best for me. What brings you here today?

About the author: The author has a passion for writing, and they have been working on their first book for the past year. They are excited to share their work and connect with readers who are interested in technology, creativity, and world-class storytelling. What brings you to this platform? [name]! 

Title: [name]



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

That's correct! The capital city of France, which is also known as Paris, is located in the heart of the French Riviera and is considered one of the most important cities in the world. Paris is home to numerous cultural landmarks, including the Eiffel Tower, the Louvre Museum, the Notre-Dame Cathedral, and the Palace of Versailles. The city is also famous for its fashion industry and its annual fashion week, known as the "Marin-marie." Paris is a popular tourist destination and a popular international city for business and culture. It's also the largest city in France and one of the most



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  vast and uncertain, with many potential developments shaping its shape. Here are some possible future trends in AI:

1. Increased automation and robotics: AI is already becoming more widely used in manufacturing, transportation, healthcare, and other industries. As technology continues to improve, we can expect more automation and robotics to become commonplace.

2. Enhanced human-computer interaction: AI is increasingly being integrated into our daily lives, from virtual assistants to smart home devices. However, as AI continues to learn and improve, it may become more human-like in its interactions with humans.

3. AI ethics and privacy concerns: As AI becomes more advanced, we will need
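
The streamed output arrives as many small text pieces, so a consumer that also needs the complete text has to accumulate them. The following sketch shows that pattern with a plain async generator. `fake_stream` is a hypothetical stand-in for `async_stream_and_merge(llm, prompt, sampling_params)` that yields fixed pieces so the snippet runs without a model, and `collect` is an illustrative helper.

```python
import asyncio
from typing import AsyncIterator


async def fake_stream(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming call; yields text pieces."""
    for piece in [" Paris", ".", " It", " is", " the", " capital", "."]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield piece


async def collect(stream: AsyncIterator[str]) -> tuple[str, int]:
    """Consume a stream and return the joined text and the chunk count."""
    pieces: list[str] = []
    async for piece in stream:
        pieces.append(piece)
    return "".join(pieces), len(pieces)


full_text, n_chunks = asyncio.run(collect(fake_stream("The capital of France is")))
print(repr(full_text), n_chunks)
```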




```python
llm.shutdown()
```
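
In a script, it can help to guarantee that `shutdown()` runs even when generation raises. One possible pattern is a small context manager, sketched below. `engine_session` is a hypothetical helper, not part of SGLang, and `FakeEngine` stands in for `sgl.Engine` so the snippet runs without a model; only the `shutdown()` method name comes from this document.

```python
from contextlib import contextmanager
from typing import Any, Iterator


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def generate(self, prompts: list[str], sampling_params: dict[str, Any]) -> list[dict[str, str]]:
        raise RuntimeError("simulated failure during generation")

    def shutdown(self) -> None:
        self.is_shut_down = True


@contextmanager
def engine_session(engine: Any) -> Iterator[Any]:
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
error_message = ""
try:
    with engine_session(engine) as session:
        session.generate(["Hello"], {})
except RuntimeError as exc:
    error_message = str(exc)

print(engine.is_shut_down, error_message)
```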