# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
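
To give a rough idea of what "a custom server on top of the engine" means, here is a dependency-free sketch that uses only the Python standard library. It is not the linked `custom_server` script. `StubEngine`, the `/generate` route, and the JSON field names are assumptions made for illustration; in a real server the stub would be replaced by the engine object, created once at startup.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class StubEngine:
    """Hypothetical stand-in for the engine so the sketch runs without a model."""

    def generate(self, prompt, sampling_params=None):
        return {"text": f" [completion for: {prompt}]"}


# Created once and shared by all requests (assumption: a real engine goes here).
engine = StubEngine()


class GenerateHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/generate":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length) or b"{}")
        output = engine.generate(
            request.get("prompt", ""), request.get("sampling_params")
        )
        payload = json.dumps({"text": output["text"]}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def query(port, prompt):
    """Send one POST request to the local demo server and decode the JSON reply."""
    conn = HTTPConnection("127.0.0.1", port, timeout=5)
    body = json.dumps(
        {"prompt": prompt, "sampling_params": {"temperature": 0.8}}
    ).encode("utf-8")
    conn.request(
        "POST", "/generate", body=body, headers={"Content-Type": "application/json"}
    )
    data = json.loads(conn.getresponse().read())
    conn.close()
    return data


# Port 0 asks the OS for any free port; the server runs in a background thread.
server = ThreadingHTTPServer(("127.0.0.1", 0), GenerateHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    reply = query(server.server_address[1], "The capital of France is")
finally:
    server.shutdown()
    server.server_close()

print(reply["text"])
```

The point of the sketch is the shape: one long-lived engine object, and request handlers that translate between the wire format and a `generate` call.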



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop (nested-loop code), add the following first:
```python
import nest_asyncio

nest_asyncio.apply()

```
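
To see why this patch is needed, the following standard-library sketch reproduces the error that plain `asyncio` raises when `asyncio.run()` is called from code that is already running inside an event loop. It uses neither SGLang nor `nest_asyncio`, and the function names are illustrative only.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running loop here, as a notebook cell would be.
    coro = inner()
    try:
        asyncio.run(coro)  # a nested run is rejected by plain asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Without the patch, the nested call fails with a `RuntimeError`; `nest_asyncio.apply()` patches `asyncio` so that such nested calls are allowed.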

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    # Allow asyncio.run() inside an already running event loop (e.g. a notebook).
    import nest_asyncio

    nest_asyncio.apply()


# Create the engine; no HTTP server is involved.
llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```text
Prompt: Hello, my name is
Generated text:  Alex. I am a high school student who loves to read. One day I went to a book fair and saw a book that was about finding missing things. I wanted to learn more about this topic and decided to read it. What book did I read? I read "The Lost Thing" by Sara J. Bendler. The book is about a young girl named Cassie who is living in the jungle. She is searching for something that is always missing from the jungle, which is a water pump. As she searches, she discovers that the pump has disappeared from her world and must be found. The book explores the theme of finding
Prompt: The president of the United States is
Generated text:  now running for a second term. As of 2021, he has been in office for 4 years. If his last term was from the year 2016 to the year 2019, how old is the president in 2021? To determine the president's age in 2021, we need to find out how many years he has been in office and then subtract that from 2021. Here are the steps:

1.
```

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```text
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you. What can you tell me about yourself? I'm a [job title] at

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "La Ville-Marie" or "La Ville de Paris". It is the largest city in France and the second-largest city in the European Union, with a population of over 2. 5 million people. Paris is known for its rich history, art, and culture, and is a major tourist destination. The city is also home to many important institutions, including the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. Paris is a popular destination for business and leisure, and is a major center for French culture and politics. The city is also home to many international organizations and institutions,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that could be expected in the future:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and decision-making processes. This could lead to more sophisticated and adaptive AI systems that can better understand and respond to human emotions and preferences.

2. Greater emphasis on ethical considerations: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical considerations. This could lead to more stringent
```
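
The "overlap removal" mentioned in this example can be pictured with a small, self-contained sketch. It assumes that consecutive streamed chunks may repeat the tail of the text already received, and it drops the repeated part before appending. The names `trim_overlap_sketch` and `merge_chunks` are hypothetical, and this is not necessarily how `stream_and_merge` is implemented.

```python
def trim_overlap_sketch(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while removing overlaps between neighbours."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


merged_text = merge_chunks([" Paris is", " is the capital", " capital of France."])
print(merged_text)
```

A suffix/prefix match like this is a heuristic: it would also remove text that the model legitimately repeats across a chunk boundary, so it only makes sense when chunks are known to overlap.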



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```text
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a/an [Age] year old. I have lived in [City/State/Place] for [Number] years, but I am [Age] years old now, which is my actual [Age]. I currently reside in [Current Residence], and I have worked in [Professional Title] for [Number] years, but I am now [Age] years old, which is my actual [Age]. I have traveled to [Number] countries and [Number] cities, but I have only [Number] of personal experiences, which is my actual [Number]. I am a [Gender] [Race

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a city renowned for its history, art, and architecture. It is the capital of the country and is the seat of government, politics, and culture. Paris is also known for its unique architecture, including the Eiffel Tower and the Louvre Museum. The city has a rich cultural heritage and is home to
```
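
An asynchronous interface is mainly useful when several independent callers issue requests at the same time. The sketch below shows that pattern with `asyncio.gather`. `StubEngine` is a hypothetical stand-in that only mimics the awaitable call shape used above, with one prompt per call as a simplifying assumption, so the code runs without a model.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in; a real engine object would be used instead."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion of {prompt!r}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    # Start one request per prompt and wait for all of them together.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather() returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*tasks)


demo_prompts = ["Hello, my name is", "The capital of France is"]
demo_outputs = asyncio.run(
    generate_concurrently(StubEngine(), demo_prompts, {"temperature": 0.8})
)

for prompt, output in zip(demo_prompts, demo_outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Because `gather` preserves input order, results can be paired with their prompts using `zip`, just as in the batch example above.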

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling
        # async_generate directly, so repeated text is not printed twice.
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```text
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm [Age] years old and [Occupation]. I'm [Profession] and [Title]. I'm [Current Occupation] and [Current Position]. I'm [Favorite Hobby]. I'm [Religious or Ethical] and [Political/Religious Beliefs]. I'm [Current State of Mind]. Thank you. [Greeting/Introduction] I'm [Name]. I'm [Age] years old and [Occupation]. I'm [Profession] and [Title]. I'm [Current Occupation] and [Current Position]. I'm [Favorite Hobby]. I'm [Religious or Eth

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in France and is known for its iconic Eiffel Tower, iconic Notre-Dame Cathedral, and its historical landmarks such as the Louvre and the Palace of Versailles. Paris is a major cultural, economic and political center of France and plays a key role in its ongoing development and growth. The French government and the European Union are headquartered in Paris, and the country is home to the headquarters of many of the world's leading media companies. The city is also known for its diverse and vibrant population, including French, African, Arab, and immigrant communities. Paris is a major tourist destination and is known

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and promising, with a range of potential trends shaping how we use and develop the technology. Here are some possible future trends in AI:

1. AI will become more accessible to everyone: With the widespread adoption of AI, we can expect to see more of it being used in everyday life, such as in voice assistants, virtual assistants, and even in the healthcare industry. This will bring a level of convenience and accessibility that has never been seen before.

2. AI will become more sophisticated: As technology continues to advance, we can expect AI to become even more sophisticated and capable. This will involve developing more powerful neural networks, better algorithms
```
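
When streaming, it is often useful to both display chunks as they arrive and keep the full text for later use. The following sketch shows that consumption pattern for any async iterator of string chunks. `fake_stream` is a hypothetical stand-in for a real chunk source, so the example runs without a model.

```python
import asyncio
from typing import AsyncIterator, Iterable


async def fake_stream(chunks: Iterable[str]) -> AsyncIterator[str]:
    """Hypothetical chunk source that yields pre-defined pieces of text."""
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield chunk


async def print_and_collect(stream: AsyncIterator[str]) -> str:
    """Print each chunk immediately and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = asyncio.run(
    print_and_collect(fake_stream([" Paris", ".", " It", " is", " the", " capital"]))
)
```

Collecting the parts in a list and joining once at the end avoids repeated string concatenation on long outputs.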




```python
llm.shutdown()
```