# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
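
To see why this is needed, the following minimal sketch reproduces the nested-loop situation using only the standard library. It assumes nothing about SGLang itself: `inner` and `outer` are hypothetical coroutines, and `outer` stands in for a notebook cell that is already running inside an event loop.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running event loop here, which is the
    # situation in an IPython or Jupyter cell.
    coro = inner()
    try:
        return asyncio.run(coro)  # nested call: raises RuntimeError
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


result = asyncio.run(outer())
print(result)
```

Without `nest_asyncio.apply()`, the nested `asyncio.run` call is rejected; the patch exists to make this kind of re-entrant call possible.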

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Raoul and I am a C++, C#, and C++++ programmer. I have a knack for creating truly amazing programs with an incredible amount of functionality. My favorite programming languages are C++, C#, C++++, and my favorite programming libraries are STL, Qt, and Boost. I also have a skill for creating solutions for complex and intricate problems using a variety of approaches and techniques.
I have a passion for game development and have written numerous games for the Nintendo 3DS, Wii, and PC. I have also worked on a number of projects that required high levels of graphical complexity and performance, including a custom rendering engine for a
Prompt: The president of the United States is
Generated text:  32 years older than the president of Florida. If the president of Florida is 34 years old now, how old will the president of Florida be in 5 years?
To determine the president of Florida's age in 5 years, we need to follow these steps:

1. Identify the cu
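
In an offline batch job, the results are usually saved rather than printed. The sketch below pairs each prompt with its generated text and serializes the pairs as JSON Lines. `StubEngine` is a hypothetical stand-in so that the example runs without a model; it mimics only the return shape used above (one dict with a `"text"` key per prompt). `to_jsonl` and the record field names are illustrative choices, not part of SGLang.

```python
import json


class StubEngine:
    """Hypothetical stand-in for the engine so the sketch runs without a model."""

    def generate(self, prompts, sampling_params):
        # Mimics only the return shape used above: one dict with "text" per prompt.
        return [{"text": f" <completion {i}>"} for i, _ in enumerate(prompts)]


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text and serialize as JSON Lines."""
    lines = []
    for prompt, output in zip(prompts, outputs):
        lines.append(json.dumps({"prompt": prompt, "completion": output["text"]}))
    return "\n".join(lines)


prompts = ["Hello, my name is", "The capital of France is"]
outputs = StubEngine().generate(prompts, {"temperature": 0.8, "top_p": 0.95})
jsonl = to_jsonl(prompts, outputs)
print(jsonl)
```

With a real engine, `StubEngine()` would be replaced by the `llm` object created earlier; the pairing logic stays the same because outputs are matched to prompts by position, as in the `zip` loop above.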

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French National Museum, and the French National Radio and Television Network. Paris is a bustling city with a rich history and culture, and it is a popular tourist destination. It is also known for its fashion industry, with Paris Fashion Week being one of the largest in the world. The city is home to many famous French artists, including Pablo Picasso and Henri Matisse. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly. It

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies will continue to improve and become more integrated into our daily lives, from self-driving cars and robots in factories to personalized medicine and virtual assistants. AI will also continue to play an increasingly important role in solving complex problems and improving human well-being. However, there are also potential risks and challenges associated with AI, such as the potential for job displacement and the need for ethical and responsible development and deployment of AI systems. Overall, the future of AI is likely to be a rapidly evolving and complex field, with many opportunities and challenges
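
The phrase "overlap removal" refers to dropping text that a new chunk repeats from what has already been received. The sketch below shows one simple way to do that. It is an illustration of the idea under the assumption that chunks may repeat a tail of earlier text; it is not necessarily how `stream_and_merge` is implemented, and `strip_repeated_prefix` and `merge_chunks` are hypothetical names.

```python
def strip_repeated_prefix(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_repeated_prefix(merged, chunk)
    return merged


merged = merge_chunks(["The capital", " capital of France", " France is Paris"])
print(merged)  # The capital of France is Paris
```

A purely textual check like this cannot tell an accidental overlap from a genuine repetition (for example `"ha"` followed by `"ha"`), so it is best suited to streams that are known to resend overlapping text.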



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am a [Your Profession/Position] with [Your Company Name] working in [Your Role]. I am an [age/years] years old. I am fluent in [the language you speak] and enjoy [the hobbies or interests you have]. I am a [language] speaker and I am always looking for [how you can improve]. I am passionate about [the hobbies you love] and I am always eager to learn new things. I have always been [an example of] [a character trait or quality], so I am confident that I am a [positive trait] and a [positive

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as “La République” or “La Région.” It is a major city and the largest city in France, and the third largest in Europe. The city is located in the south of France, near the Mediterranean Sea, and is the cultural, economic, and political cente
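
Because `async_generate` is awaited, it can be combined with other asyncio tools. The sketch below starts several batches at once with `asyncio.gather`. `StubAsyncEngine` is a hypothetical stand-in that mimics only the awaitable call shape used above; whether the real engine benefits from overlapping calls in this way is an assumption that this document does not confirm.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in; mimics only the awaitable call shape used above."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate generation latency
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Start every batch at once and wait for all of them.
    # gather returns results in the same order as the awaitables passed in.
    calls = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*calls)


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(run_batches(StubAsyncEngine(), batches, {"temperature": 0.8}))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(prompt, "->", output["text"])
```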

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm 34 years old, and I have a strong work ethic and a great deal of dedication to my craft. I'm a person of my own, with a natural ability to see the beauty in everyday things and turn them into something beautiful. I'm passionate about using my creativity to create something truly unique and innovative. I love to challenge myself and push boundaries, and I'm always looking for new ideas and approaches to improve my craft. I believe that education is the key to success and that I believe that the world needs more people like me who can make a difference in the world. I'm excited to be



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the capital and largest city of France. It is located on the River Seine in the Western suburbs of the French department of Paris. The city is known as "la Ville Blanche", "the white city" or simply "Paris". Paris is also the second most populous city in France, after Paris. It is a European cultural and economic centre. In 2013, its population was 2,108,972. The French Parliament, called the Chambre des Conseillers, is located in the former Palais de Justice building. The Palace of Versailles, the former


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by a number of trends that are expected to shape the way we interact with technology and the world around us. Here are some of the most likely trends that could shape the future of AI:

1. Increasingly personalized technology: As AI becomes more sophisticated, it is expected to become more personalized and tailored to individual users. This could mean more complex algorithms and machine learning that can analyze vast amounts of data to provide highly accurate and personalized recommendations and suggestions.

2. More reliance on AI for decision-making: As AI becomes more integrated into everyday life, we may see a greater reliance on AI for decision-making in areas such as



In [6]:
llm.shutdown()
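
In a script, as opposed to a notebook, it can help to guarantee that `shutdown()` runs even when generation raises an exception. The sketch below wraps the engine in a small context manager. `StubEngine` and `managed` are hypothetical names used only for illustration; the stub records whether `shutdown()` was called and deliberately fails during generation.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        raise ValueError("simulated failure during generation")

    def shutdown(self):
        self.closed = True


@contextmanager
def managed(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed(engine) as eng:
        eng.generate(["Hello"], {"temperature": 0.8})
except ValueError as exc:
    print(f"generation failed: {exc}")

print("shutdown called:", engine.closed)
```

A plain `try`/`finally` around the generation calls achieves the same effect; the context manager only packages it for reuse.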