# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
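As a rough illustration of what a custom server on top of the engine might involve, the sketch below shows only the request-handling layer: it parses a JSON body, forwards the prompt to an engine-like object, and serializes the result. `EchoEngine` and `handle_generate_request` are hypothetical names invented for this example. `EchoEngine` merely stands in for `sgl.Engine` so the snippet runs without a model, and it assumes only that `generate` takes a list of prompts and returns one dict with a `"text"` key per prompt, as in the examples in this document. The linked `custom_server` script remains the reference example.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a model."""

    def generate(self, prompts, sampling_params=None):
        # Mirrors the shape used in this document: one dict with "text" per prompt.
        return [{"text": f" (echo) {p}"} for p in prompts]


def handle_generate_request(engine, body: bytes):
    """Turn a JSON request body into a (status_code, response_bytes) pair."""
    try:
        payload = json.loads(body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected a JSON object with a 'prompt' field"}
        return 400, json.dumps(error).encode()

    if not isinstance(prompt, str):
        error = {"error": "'prompt' must be a string"}
        return 400, json.dumps(error).encode()

    sampling_params = payload.get("sampling_params", {})
    outputs = engine.generate([prompt], sampling_params)
    return 200, json.dumps({"text": outputs[0]["text"]}).encode()


status, response = handle_generate_request(
    EchoEngine(), b'{"prompt": "The capital of France is"}'
)
print(status, response.decode())
```

Keeping the handler a plain function that takes the engine as an argument makes it easy to test without a model and to plug into whichever web framework the server uses.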



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


# Load the model into an in-process engine; no HTTP server is started
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Daniel. I am a 25-year-old man from the United States. I am currently residing in a small town in the Midwest. I am a bit of a homebody and enjoy spending my free time reading, listening to music, and playing video games. I am also an avid fan of the band Rush and have a great appreciation for the music of Neil Peart. I have a Bachelor's degree in music education and am currently pursuing a Master's degree in music composition. My favorite things in life are music, nature, and my family.
It's great to meet you! I'm a bit of a introvert, so it
Prompt: The president of the United States is
Generated text:  expected to be a leader in a global economy that is rapidly changing due to the rise of emerging markets and new technologies. In this context, it is essential that the president have a deep understanding of the major trends that are shaping the global economy and the implications of these trends for the United States. The president also needs

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy reading, hiking, and trying new foods. I'm currently working on a novel and experimenting with different writing styles. I'm looking forward to meeting new people and learning about their experiences.
This self-introduction is neutral because it doesn't reveal any personal biases or opinions. It simply states the character's name, age, occupation, and interests, without expressing any enthusiasm or passion. The introduction also mentions the character's current projects and goals, which gives a sense of their professional and personal aspirations. Overall, this self-introduction is a good starting point

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city. The capital of France is Paris.
The statement is already concise and factual. It simply states that the capital of France is Paris, without any additional information or opinions. This meets the requirements of a concise and factual statement. Therefore, the statement is correct and complete. The final answer is: The capital of France is Paris. ## Step 1: Identify the task
The task is to provide a concise factual statement about France’s capital city.

## Step 2: Recall the capital of France
The capital of France is Paris.

## Step 3: Formulate the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Some experts predict that AI will become increasingly integrated into our daily lives, while others warn of the potential risks and challenges associated with its development. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with the potential to revolutionize the way we diagnose and treat diseases.
2. Widespread adoption of AI in industries: AI is already being used in various industries, including
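The cell above relies on the `stream_and_merge` helper to assemble streamed chunks "with overlap removal". The sketch below illustrates that idea under the assumption that a streamed chunk may repeat text that has already been received. `strip_overlap`, `merge_chunks`, and the hand-written `fake_stream` are hypothetical and exist only for this example; this is not necessarily how `sglang.utils` implements it.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that already ends `existing`."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks, skipping text that was already received."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk["text"])
    return merged


# Hand-written stand-in for a stream of {"text": ...} chunks (an assumption).
fake_stream = [
    {"text": " Paris"},
    {"text": " Paris."},
    {"text": ". It is"},
]
print(repr(merge_chunks(fake_stream)))
```

One caveat of this suffix/prefix heuristic: it cannot tell a repeated chunk from a legitimately repeated character, so text such as `"hel"` followed by `"lo"` would be merged as `"helo"`.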



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Zephyr and I'm a 25-year-old wind mage who has been traveling the world for years, mastering my craft and studying the ancient magic of the wind. I'm a bit of a loner, but I enjoy meeting new people and learning from their experiences. What's your story? (Note: you can choose to add or remove details as you see fit to make the character more interesting or fitting to your story.)

# Introduction
Hello, my name is Zephyr and I'm a 25-year-old wind mage who has been traveling the world for years, mastering my craft and studying the ancient magic of the wind

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The capital of France is Paris. The city is situated in the northern part of the country, within the Île-de-France region. Paris is famous for its stunning architecture, rich history, and vibrant culture
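Because `async_generate` is awaited, other coroutines can make progress while a request is pending. The sketch below shows that pattern with `asyncio.gather`, using a hypothetical `FakeAsyncEngine` whose `async_generate` only mimics the call shape used above (a list of prompts in, a list of dicts with a `"text"` key out). `run_batches` is likewise an illustrative name, and the snippet makes no claim about whether splitting work into several batches helps with the real engine.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in for the engine so the sketch runs without a model."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0.01)  # pretend to do some work
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # gather() awaits all batches concurrently and preserves their order.
    results = await asyncio.gather(
        *(engine.async_generate(batch, sampling_params) for batch in batches)
    )
    # Flatten the per-batch lists into one list of outputs.
    return [output for batch_outputs in results for output in batch_outputs]


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
outputs = asyncio.run(run_batches(FakeAsyncEngine(), batches, {"temperature": 0.8}))
for output in outputs:
    print(output["text"])
```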

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Ingrid. I work as a librarian at the local public library. I've been working here for five years and really enjoy helping people find the books and resources they're looking for. In this introduction, identify the character's name, their job title, the place they work, and a personal detail about their personality or interests. In this case, the character's name is Ingrid, and she works as a librarian at the local public library. A personal detail is included about Ingrid's enjoyment of helping people find books and resources. This introduction is neutral and doesn't reveal any personal biases or opinions. Read more: <a href



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. This statement conveys a simple yet accurate piece of information about France’s capital city. It does not include any additional details or opinions, making it a clear and concise statement. This type of statement is useful for providing a quick and accurate answer to a question, and it serves as a foundation for further discussion or exploration of the topic.
The capital of France is Paris. This statement is a good starting point for a broader discussion about France’s capital city, including its history, culture, landmarks, and significance. For example, one might explore the reasons why Paris has been the capital of France for centuries, its role as a global



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a topic that is widely discussed and debated. As AI becomes more advanced and integrated into various aspects of life, several trends are expected to emerge or continue in the coming years.
1. Increased Focus on Edge AI: As the Internet of Things (IoT) expands, AI will need to operate more efficiently on devices and sensors, rather than relying on cloud-based processing. Edge AI will become more prevalent, allowing for real-time processing and decision-making at the edge of the network.
2. More Emphasis on Explainability and Transparency: As AI becomes more widespread, there is a growing need to understand how AI systems make
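The streaming cell above consumes an async generator that yields only text that has not been printed yet. A minimal sketch of such a generator follows, under the assumption that each streamed chunk is a dict whose `"text"` may carry the cumulative output so far. `fake_token_stream` and `stream_new_text` are hypothetical names for this example and do not describe how `async_stream_and_merge` is actually implemented.

```python
import asyncio


async def fake_token_stream():
    """Hypothetical stream whose chunks carry the cumulative text so far."""
    for text in [" Paris", " Paris.", " Paris. It"]:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield {"text": text}


async def stream_new_text(chunks):
    """Yield only the part of each chunk that has not been emitted yet."""
    seen = ""
    async for chunk in chunks:
        text = chunk["text"]
        if text.startswith(seen):
            # Cumulative chunk: emit just the newly added tail.
            new_text = text[len(seen):]
            seen = text
        else:
            # Incremental chunk: emit it as-is and remember it.
            new_text = text
            seen += text
        if new_text:
            yield new_text


async def collect():
    return [piece async for piece in stream_new_text(fake_token_stream())]


pieces = asyncio.run(collect())
print(pieces)
```

Yielding deltas rather than the full text lets the caller print with `end=""` and `flush=True`, as the cell above does, without repeating earlier output.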




In [6]:
llm.shutdown()