# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other code that already runs an event loop (a nested event loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
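
To see why this is needed: an IPython or Jupyter kernel already runs an event loop, and standard `asyncio` refuses to start a second one inside it. The sketch below reproduces that failure using only the standard library, without SGLang; `inner` and `outer` are illustrative names, not part of any API.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, much like a notebook cell.
    coro = inner()
    try:
        # Starting a second loop from inside a running one is rejected.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine never started; close it to avoid a warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches `asyncio` so that such nested calls are allowed, which is why the snippet above it is required in notebook-style environments.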

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Lucy and I am currently an undergraduate student at Texas Tech University. I am from the Western United States and I am a proud Texan. I have been studying abroad in Tokyo, Japan for the past year. I speak English and my favorite subject is Math and science.
I love to explore the world and visit different places, especially Japan and Korea. I am always looking for a good coffee and sushi! My favorite restaurant is "Bon Appetit" in Tokyo, Japan. I also like to cook and bake my own recipes. I have a pet cat named Spot, and I love to spend time with her. My favorite sports team is
Prompt: The president of the United States is
Generated text:  interested in the statistics of a large city. To do this, he selects a sample of 200 people. The mean age of these people is 40 and the standard deviation is 10. 

1. If the president wants to ensure that the probability that the sample mean age is within 2 years of the true mean age is at least 95%, what sh
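
The `temperature` and `top_p` entries in `sampling_params` control how the next token is drawn. As a rough illustration, here is a minimal sketch of the textbook formulation of temperature scaling followed by nucleus (top-p) filtering. It is not SGLang's sampler; `top_p_filter` and `toy_logits` are hypothetical names introduced only for this example.

```python
import math


def top_p_filter(logits, temperature=0.8, top_p=0.95):
    """Return {token_index: renormalized probability} for the kept tokens."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the most likely tokens until their cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -3.0]  # made-up scores for four candidate tokens
nucleus = top_p_filter(toy_logits)
print(nucleus)
```

With these toy values the least likely token falls outside the nucleus and can never be sampled, while the remaining three are renormalized to sum to 1.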

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I am a [Age] year old [Occupation]. I am a [Type of Character] who has always been [Positive Traits]. I am [Positive Traits] and I am [Positive Traits]. I am [Positive Traits] and I am [Positive Traits]. I am [Positive Traits] and I am [Positive Traits]. I am [Positive Traits] and I am [Positive Traits]. I am [Positive Traits] and I am [Positive Traits]. I am [Positive Traits] and I am [Positive Traits]. I am [Positive Traits] and I am [Positive Traits]. I am [Positive Traits

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous world-renowned museums, theaters, and art galleries. Paris is a popular tourist destination and a major hub for international business and diplomacy. Its rich history and diverse culture make it a fascinating city to explore and experience. 

The city is also home to the French Parliament, the French National Library, and the French Academy of Sciences. It is a major transportation hub, with the Eiffel Tower serving as a symbol of the city's importance. Paris is a city of

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies are expected to continue to evolve and improve, leading to more sophisticated and accurate AI systems that can perform a wide range of tasks with increasing accuracy and efficiency. Some possible future trends in AI include:

1. Increased focus on ethical considerations: As AI systems become more sophisticated, there will be a greater emphasis on ethical considerations, such as privacy, bias, and transparency. This will require developers to consider the potential impacts of their AI systems on society and to ensure that they are designed in a way that is fair and unbiased.
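
The "overlap removal" mentioned in the cell above refers to merging streamed chunks whose beginning may repeat the end of the text already received. The following is a minimal sketch of that idea under the assumption that overlap is detected by suffix/prefix matching. `merge_chunk` is a hypothetical helper written for illustration and is not claimed to be how `stream_and_merge` is implemented.

```python
def merge_chunk(existing: str, new_chunk: str) -> str:
    """Append new_chunk, dropping any prefix that repeats the tail of existing."""
    max_len = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return existing + new_chunk[size:]
    # No overlap found: plain concatenation.
    return existing + new_chunk


# Made-up chunks in which each one repeats part of the previous chunk.
chunks = ["Paris is", " is the capital", " capital of France."]
merged = ""
for chunk in chunks:
    merged = merge_chunk(merged, chunk)
print(merged)
```

Checking the longest candidate overlap first ensures the largest repeated span is removed when several overlaps are possible.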





### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name] and I am a [insert occupation or profession] with a passion for [insert a relevant hobby or activity]. I am an [insert age], [insert gender] and I am [insert nationality or culture]. I am from [insert location] and I have always been [insert a character trait or personality]. I have always loved [insert something specific] and I strive to make the world a better place by [insert relevant action or goal]. I am constantly learning and growing, and I am always looking for new experiences and ways to improve myself. I am [insert a character trait or personality]. I am excited to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a city located in the south of the country and served as the capital for much of the 20th century. It is home to many world-renowned cultural institutions and landmarks,
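
For readers less familiar with `asyncio`, the sketch below shows the general pattern of awaiting several requests concurrently while keeping results in input order. It does not call SGLang: `fake_generate` is a hypothetical stand-in that only sleeps and returns a dictionary shaped like the outputs above.

```python
import asyncio


async def fake_generate(prompt: str, delay: float) -> dict:
    # Hypothetical stand-in for an awaitable engine call.
    await asyncio.sleep(delay)
    return {"text": f" <completion for: {prompt!r}>"}


async def run_batch(prompts):
    # Earlier prompts get longer delays, so they finish last.
    calls = [
        fake_generate(p, delay=0.01 * (len(prompts) - i))
        for i, p in enumerate(prompts)
    ]
    # gather() runs the calls concurrently and returns results in input order.
    return await asyncio.gather(*calls)


demo_prompts = ["Hello, my name is", "The capital of France is"]
results = asyncio.run(run_batch(demo_prompts))
for prompt, output in zip(demo_prompts, results):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Because `asyncio.gather` preserves ordering, zipping prompts with outputs remains valid even when requests complete out of order.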

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I'm a [age] year old aspiring software engineer. What's your profession and what do you do for a living? I develop software, helping people use it to achieve their goals. What inspires you and what motivates you to continue learning and growing as a software engineer? I'm inspired by the power of collaboration, technology, and the endless possibilities of the future. How do you find your creative juices, and what drives you to pursue your passion? I love getting lost in a codebase, finding hidden features, and immersing myself in a world of possibility. Lastly, what's your biggest challenge so

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is located in the northwestern part of the country and has been the seat of government and capital of France since 1804.
Paris is the largest city in France and has a population of around 2 million people. It is known for its rich history, beautiful architecture, and lively cultural scene. Paris is also the birthplace of many famous artists, writers, and composers, including the famous painter Edouard Manet and the singer Edith Piaf. It is also known for its many museums, including the Louvre and the Musée d'Orsay, which are world-renowned. Finally

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and involves numerous possibilities. Here are some possible trends in AI that we can expect to see in the coming years:

1. Increased integration with human decision-making: As AI becomes more advanced, we can expect to see more integration between AI and human decision-making. AI systems will be able to make decisions based on human values and ethics, rather than just using a list of rules and instructions.

2. Greater autonomy for humans: AI will be able to make decisions autonomously, which will allow humans to have greater control over their lives. AI systems will be able to understand and respond to the world around them, allowing humans to focus on

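
The `async for` loop in the cell above consumes chunks as they become available. The following sketch shows the same consumption pattern without SGLang, so it can be run anywhere. `fake_stream` is a hypothetical stand-in that yields one word at a time; real chunk boundaries follow the tokenizer, as the output above suggests.

```python
import asyncio


async def fake_stream(text: str):
    # Hypothetical stand-in for a streaming engine call.
    for word in text.split(" "):
        await asyncio.sleep(0)  # give control back to the event loop
        yield word + " "


async def collect(text: str) -> str:
    pieces = []
    async for chunk in fake_stream(text):
        # Print each chunk as soon as it arrives, without a newline.
        print(chunk, end="", flush=True)
        pieces.append(chunk)
    print()  # final newline once the stream ends
    return "".join(pieces).rstrip()


streamed = asyncio.run(collect("Paris is the capital of France."))
```

Printing with `end=""` and `flush=True` is what makes the text appear incrementally on a single line.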
In [6]:
llm.shutdown()