# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
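
As a rough illustration of the custom-server idea, the sketch below wraps an engine call in a plain function that maps a JSON request body to a JSON response body. The request schema (`prompt`, `sampling_params`), the `handle_generate` helper, and `StubEngine` are all hypothetical names introduced here so the example runs without a model; they are not taken from the linked `custom_server.py`. The only engine behavior assumed is the one shown in this document: `generate` takes a list of prompts and returns one dict with a `"text"` key per prompt.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params):
        # Same output shape as in this document: one dict with "text" per prompt.
        return [{"text": f" echo of: {p}"} for p in prompts]


def handle_generate(engine, raw_body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response body (assumed schema)."""
    request = json.loads(raw_body)
    prompt = request["prompt"]
    sampling_params = request.get("sampling_params", {})
    # Pass a one-element list so the result is a list, as in the batch examples.
    outputs = engine.generate([prompt], sampling_params)
    return json.dumps({"text": outputs[0]["text"]}).encode("utf-8")


response = handle_generate(
    StubEngine(),
    b'{"prompt": "Hello", "sampling_params": {"temperature": 0.0}}',
)
print(response.decode("utf-8"))
```

A function like this can be mounted behind whichever web framework you prefer; swapping `StubEngine()` for a real engine instance is the only change the sketch assumes.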



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
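
To see why this matters, the following standard-library sketch reproduces the failure that occurs when `asyncio.run()` is called while a loop is already running, which is the situation inside IPython or Jupyter. The helper names `loop_is_running` and `try_nested_run` are introduced here for illustration only; `nest_asyncio` itself is deliberately not used so the error stays visible.

```python
import asyncio


def loop_is_running() -> bool:
    """Return True when called from code that an event loop is already driving."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


async def try_nested_run() -> str:
    """Attempt asyncio.run() from inside a running loop and report what happens."""

    async def inner() -> str:
        return "inner finished"

    coro = inner()
    try:
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


outside = loop_is_running()  # evaluated at module level, outside any loop
message = asyncio.run(try_nested_run())
print(outside)
print(message)
```

In a plain script the first check reports that no loop is running, while the nested call fails with a `RuntimeError`. Applying `nest_asyncio` patches `asyncio` so that such nested calls are allowed.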

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.18it/s]



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Anna and I am from Austria. I am 28 and I have been using U-Boot for 3 years. I have a question about my U-Boot configuration. I am using an ARDUINO 1024CPU 32-bit microcontroller. I have the following configuration:

ARM Cortex-M4
Linker - u-boot-3.5.0-arm-none-eabi

I need to set the CLUSTER and FPROT options in the configuration file. I have also added the "CONFIG_COMPRESS" option in the USB device drivers in the configuration file. However, when I use the following
Prompt: The president of the United States is
Generated text:  an elected official. The last president of the United States to serve two full terms was Ronald Reagan. He served from 1981 to 1989. What is the president of the United States who was born in 1965?
To determine the president of the United States who was born in 1965, we need to follow these steps:

1. Identify the birth year of Ronald Reagan.
2. Calculate the president who served in the year 1965.
3. Confirm that the
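
For larger offline jobs it can help to collect results as records and to bound how many prompts go into each call. The sketch below is one way to do that under stated assumptions: `StubEngine`, `chunked`, and `run_in_batches` are hypothetical helpers written for this example, and the stub only imitates the output shape used above (a list of dicts with a `"text"` key, in prompt order). The engine already schedules requests within a single call, so chunking here is purely about keeping intermediate results manageable.

```python
from typing import Dict, Iterator, List


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; records how it was called."""

    def __init__(self) -> None:
        self.call_sizes: List[int] = []

    def generate(self, prompts, sampling_params):
        self.call_sizes.append(len(prompts))
        return [{"text": f" completion for: {p}"} for p in prompts]


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_batches(engine, prompts, sampling_params, batch_size) -> List[Dict[str, str]]:
    """Call engine.generate once per chunk and pair each prompt with its text."""
    records = []
    for batch in chunked(prompts, batch_size):
        outputs = engine.generate(batch, sampling_params)
        for prompt, output in zip(batch, outputs):
            records.append({"prompt": prompt, "text": output["text"]})
    return records


engine = StubEngine()
records = run_in_batches(engine, ["a", "b", "c"], {"temperature": 0.8}, batch_size=2)
for record in records:
    print(record)
```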

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your interests and experiences. What can you tell me about yourself? I'm a [age] year old, [gender] and [occupation]. I have a [number] degree in [field of study]. I'm a [job title] at [company name], and I enjoy [what you do best]. I'm always looking for new challenges and opportunities to grow and learn. What's your favorite hobby or activity? I love [what you do best] and I enjoy [what you do best].

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament and the French National Museum of Modern Art. Paris is a bustling city with a rich cultural heritage and is a popular tourist destination. The city is known for its diverse cuisine, including French cuisine, and is a popular destination for tourists and locals alike. Paris is a city of contrasts, with its historical architecture and modern amenities. It is a city that has a rich history and continues to be a major cultural and economic center in France. The city is also known

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way that we interact with technology and the world around us. Here are some potential trends that are likely to shape the future of AI:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI technology continues to improve, we can expect to see even more widespread use of AI in healthcare, including in areas such as diagnosis, treatment planning, and patient care.

2. Increased use of AI in finance: AI is already being used in finance to improve fraud detection and risk management. As AI technology
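
The "overlap removal" mentioned above can be pictured as follows: when a new chunk begins with text that the merged output already ends with, the duplicated prefix is dropped before appending. The sketch below is a minimal illustration of that idea, not necessarily how `stream_and_merge` is implemented; `trim_overlap` and `merge_chunks` are names chosen for this example.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that existing_text already ends with."""
    max_len = min(len(existing_text), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged = merge_chunks(["The capital", " capital of", " of France"])
print(merged)
```

One trade-off of this approach is that a chunk which legitimately repeats the end of the previous text (for example `"ha"` followed by `"ha"`) is also trimmed, so it fits streams that may resend text rather than streams that are guaranteed to be disjoint.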



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a [Age], [Job] at [Company], and I'm excited to start my new journey as an [Industry] professional. I am driven by a strong sense of [Motivational Trait or Goal], and I'm always looking for opportunities to make a positive impact. Whether it's through leadership, innovation, or problem-solving, I'm always looking to innovate and help others. I believe in the power of [Skill or Ability], and I'm always willing to learn and grow. I'm passionate about [Professional Interest or Industry], and I'm dedicated to [Short Statement About My Passion for [Industry

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

A) Yes
B) No
B) No
A) Yes
Paris is the capital of France and the largest city in the European Union. The city is known for its iconic Eiffel Tower, Notre-Dame Cathedral, and famous landmarks s
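
Because `async_generate` is awaitable, several independent prompt groups can be awaited together from one coroutine. The sketch below shows that pattern with `asyncio.gather`, under the assumption that the engine accepts overlapping calls; `StubAsyncEngine` and `generate_groups` are hypothetical helpers for this example, and the stub only mimics the list-of-dicts output shape used above. Whether this is faster than a single batched call depends on the engine's scheduler and is not claimed here.

```python
import asyncio
from typing import Dict, List


class StubAsyncEngine:
    """Hypothetical stand-in exposing an async_generate coroutine."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real engine call would
        return [{"text": f" reply to: {p}"} for p in prompts]


async def generate_groups(engine, groups: Dict[str, List[str]], sampling_params):
    """Await one async_generate call per named group and collect the texts."""
    names = list(groups)
    calls = [engine.async_generate(groups[name], sampling_params) for name in names]
    # gather returns results in the same order as the awaitables passed in.
    results = await asyncio.gather(*calls)
    return {
        name: [output["text"] for output in outputs]
        for name, outputs in zip(names, results)
    }


groups = {"greetings": ["Hello", "Hi"], "facts": ["The capital of France is"]}
texts = asyncio.run(generate_groups(StubAsyncEngine(), groups, {"temperature": 0.8}))
print(texts)
```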

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name] and I am a [occupation] at [company]. I have always been passionate about [interest or hobby]. I enjoy [details about your interests or hobbies]. I am [age or current date] and I am currently [occupation]. My character traits include [mention three traits or qualities that describe you]. I am [name] and I am [character traits].

As [character traits] I have a strong [motivation or goal] to [describe how you can achieve your goal]. I work tirelessly to [mention how you get your work done]. I am a [personality type] and I am [describe

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known as the "City of Love" due to its romantic history and cultural significance. The city is home to numerous museums, art galleries, and historical sites, and is a popular tourist destination for its stunning architecture and cuisine. Paris is a vibrant and diverse city with a rich history of cultural and artistic influence, making it a must-visit destination for anyone interested in France. The city is also known for its unique culinary traditions, including Parisian cuisine and local specialties such as croissants, petit fours, and terrines. Overall, Paris is a truly unique and unforgettable destination that offers a glimpse into the rich history

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to involve more integration with other technologies and systems, with the aim of improving efficiency, reducing errors, and expanding our understanding of complex systems.

AI will continue to become more widespread, with more and more applications in various fields such as healthcare, finance, transportation, and manufacturing. AI will also become more integrated with the internet of things (IoT), enabling devices and systems to communicate with each other and share data.

AI will also be more efficient, with more powerful algorithms and hardware available to perform tasks at a faster rate. AI will become more sophisticated, with the ability to learn and adapt to new situations, and to better understand human
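
When streaming asynchronously, you often want both the incremental chunks and the final text. The sketch below consumes an async iterator of text chunks with `async for`, the same consumption pattern used with `async_stream_and_merge` above, and returns the joined text plus a chunk count. `fake_stream` is a hypothetical stand-in for the real stream so the example runs without a model, and `collect_stream` is a helper name introduced here.

```python
import asyncio
from typing import AsyncIterator, List, Tuple


async def fake_stream(chunks: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming call; yields one chunk at a time."""
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield chunk


async def collect_stream(stream: AsyncIterator[str]) -> Tuple[str, int]:
    """Print chunks as they arrive and return the full text and chunk count."""
    parts: List[str] = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()
    return "".join(parts), len(parts)


text, n_chunks = asyncio.run(collect_stream(fake_stream([" Paris", ",", " which", " is"])))
```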




```python
llm.shutdown()
```
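
In scripts, it is easy to skip the `shutdown()` call when an exception interrupts inference. One way to guard against that is a small context manager that always calls `shutdown()` on exit. The sketch below assumes only that the engine object has a `shutdown()` method, as shown above; `managed_engine` and `StubEngine` are hypothetical names for this example, not part of SGLang.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that tracks whether it was shut down."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def shutdown(self) -> None:
        self.is_shut_down = True


@contextmanager
def managed_engine(factory):
    """Create an engine via factory() and guarantee shutdown() runs afterwards."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


stub = StubEngine()
try:
    with managed_engine(lambda: stub) as engine:
        raise ValueError("simulated failure during inference")
except ValueError:
    pass

print(stub.is_shut_down)
```

With a real engine, the factory would be something like `lambda: sgl.Engine(model_path=...)`, so the engine is released even if generation raises.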