# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
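To illustrate why the patch is needed, the sketch below uses only the standard library to reproduce the failure that nested event loops cause: calling `asyncio.run()` from code that is already running inside a loop raises a `RuntimeError`. The function names `inner` and `outer` are illustrative and not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, so a second
    # asyncio.run() call is rejected by the standard library.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio.apply()` in effect, the nested call is allowed, which is what code that calls `asyncio.run()` needs in such environments.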

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Daniel, and I am a 23-year-old musician who makes my living playing and writing music. I live in a small town outside of the city, and I have an interest in music for many different reasons, but music in general and in particular, the music of Bob Dylan is my favorite.
I started my career in the music business way back in 2015, when I founded my own band called the Punchmark Shouters. It was the first band I ever made with a friend, and I did it for fun, as a way to work with a new band and to mix it with the local music
Prompt: The president of the United States is
Generated text:  250 cm tall. If the president walks a distance of 150 meters to the east, how tall will he appear to be on a highway with a map scale of 1 cm : 50 meters?

To determine how tall the president will appear to be on the highway, we need to follow these steps:

1. **Understand the highway scale**: The highway scale is given as 1 cm : 50 meters. This means that every 1 
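The `sampling_params` dictionary above sets `temperature` and `top_p`. As a rough illustration of what these two knobs conventionally mean, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering over a toy logit vector. It is not SGLang's implementation; `sample_token` and `toy_logits` are hypothetical names introduced only for this example.

```python
import math
import random


def sample_token(logits, temperature=0.8, top_p=0.95, rng=None):
    """Sample one index using temperature scaling and top-p filtering.

    Assumes temperature > 0; a value of 0 would divide by zero here.
    """
    rng = rng or random.Random()
    # Lower temperatures sharpen the distribution, higher ones flatten it.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most-likely tokens whose cumulative
    # probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # random.choices renormalizes the weights of the kept tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -3.0]
rng = random.Random(0)
samples = [sample_token(toy_logits, 0.8, 0.95, rng) for _ in range(1000)]
greedy_like = sample_token(toy_logits, temperature=0.05, top_p=0.5, rng=rng)
print(sorted(set(samples)), greedy_like)
```

In this toy setup the least likely token falls outside the nucleus and is never drawn, while a very low temperature combined with a small `top_p` behaves almost like greedy decoding.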

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can I expect from our conversation? [Name] is a [job title] at [company name], and I'm excited to meet you and learn more about your career. What can I expect from our conversation? [Name] is a [job title] at [company name], and I'm excited to meet you and learn more about your career. What can I expect from our conversation? [Name] is a [job title] at [company name], and I'm excited to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French National Library, and the French National Opera. Paris is a bustling city with a rich cultural heritage and is a major tourist destination. The city is known for its cuisine, fashion, and art scene. It is also home to the French Riviera, a popular tourist destination for its beaches and luxury resorts. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly. It is a city of people, with a diverse population and a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some possible future trends include:

1. Increased integration of AI into everyday life: AI is already being integrated into our daily lives, from voice assistants like Siri and Alexa to self-driving cars. As AI becomes more integrated into our daily lives, we may see even more widespread adoption of AI in areas such as healthcare, finance, and transportation.

2. Greater emphasis on ethical and responsible AI: As AI becomes more integrated into our lives, there will be a greater emphasis on ensuring that AI is used ethically and responsibly. This may
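The streaming example above relies on `stream_and_merge` to produce one merged string "with overlap removal". The sketch below shows one way such merging could work, assuming that consecutive chunks may repeat the tail of the text received so far. The helper names and the sample chunks are illustrative, and the actual helper in `sglang.utils` may differ.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Made-up chunks with overlapping boundaries, for illustration only.
chunks = ["The capital", "capital of France", " is Paris", "Paris."]
merged = merge_chunks(chunks)
print(merged)
```

One caveat of this approach is that it cannot tell an accidental overlap from a real repetition in the generated text, so it suits streams that are known to resend trailing text.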



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name] and I am [age] years old. I am a [occupation] with [interest or hobby]. I am always eager to learn new things and [list any skills or qualities that distinguish you from others]. What can you tell us about yourself? Let's get to know you better! [Name] is a [specific skill or hobby], [specific interest or hobby], and [specific trait or quality]. What do you love to do? What do you like to do? [Name] is a [specific skill or hobby], [specific interest or hobby], and [specific trait or quality]. What do you like to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Its name comes from the French word "Paris," meaning "city of many" or "city of many people." It is the cultural and economic center of the country. The city is home to many world-renowned landmarks, including the Eiffel Tower and Notre-Dam
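The asynchronous example above awaits a single batched call. When requests arrive independently, as they might in a custom server, a common asyncio pattern is to start one coroutine per request and collect the results with `asyncio.gather`. The sketch below demonstrates only that pattern: `fake_generate` is a hypothetical stand-in that sleeps and returns a dict with a `"text"` key, and it does not call SGLang.

```python
import asyncio


async def fake_generate(prompt: str, delay: float) -> dict:
    """Hypothetical stand-in for an async engine call."""
    await asyncio.sleep(delay)  # simulate generation latency
    return {"text": f" <completion for: {prompt}>"}


async def run_all(prompts):
    # Start every request at once. Earlier prompts get longer delays here
    # to show that gather still returns results in input order.
    calls = [
        fake_generate(p, delay=0.01 * (len(prompts) - i))
        for i, p in enumerate(prompts)
    ]
    return await asyncio.gather(*calls)


prompts = ["Hello, my name is", "The capital of France is"]
results = asyncio.run(run_all(prompts))
for prompt, result in zip(prompts, results):
    print(f"Prompt: {prompt}\nGenerated text: {result['text']}")
```

Because `asyncio.gather` preserves input order, the results can be zipped with the prompts in the same way as in the cell above.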

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm an [Age] year old person. I've always been curious about the world around me, and I love to learn new things and explore new experiences. I enjoy sharing my knowledge with others and helping them to understand the world around me. I am always up for a challenge, and I'm always willing to learn new things and expand my own knowledge. My passion for learning and my enthusiasm for sharing my knowledge are what make me a great person to have on my team. Thank you. Greetings, my name is [Name]. I am a [Age] year old person. I love to learn new things



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in France, located on the Seine River, and is known for its rich history, art, and architecture. It is also home to many famous landmarks, including the Eiffel Tower, the Louvre Museum, and Notre Dame Cathedral. The city is home to a vibrant cultural scene, with a multitude of museums, galleries, and theaters, as well as many music and theater venues. Paris is also a major economic center and a major center for the arts, making it an important hub for the French economy. The city is home to the French government, the headquarters of many major companies, and



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be characterized by increasing sophistication, broader adoption, and a growing emphasis on ethical considerations. Some possible trends include:

1. Increased focus on AI ethics: AI systems are becoming increasingly integrated into our daily lives, and their use raises ethical concerns. As such, there is a growing emphasis on developing ethical guidelines and standards for AI development and deployment.

2. Rise of AI-driven automation: AI is already making significant inroads into many industries, and the trend is likely to continue. Automation will likely become more prevalent, with AI systems taking on tasks that were previously the domain of humans.

3. Integration with human decision-making: AI is




```python
llm.shutdown()
```