# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete, working Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Special Warning

**To launch the offline engine from a Python script, you must put the launch code under an `if __name__ == "__main__":` guard, because SGLang creates its subprocesses in `spawn` mode.** See this simple example:

[launch_engine.py](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py)
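
The need for the guard can be reproduced with Python's standard `multiprocessing` module alone. Under the `spawn` start method, each child process starts a fresh interpreter and re-imports the main script, so unguarded top-level code that starts processes would run again in every child. The sketch below does not use SGLang, and `square` and `run_workers` are hypothetical names; in an SGLang script, the `sgl.Engine(...)` construction and every call that uses the engine would sit inside the same kind of guarded block.

```python
import multiprocessing as mp


def square(x: int) -> int:
    # Worker function: defined at module top level so that spawned
    # children can find it when they re-import this module.
    return x * x


def run_workers(values: list[int]) -> list[int]:
    # "spawn" starts each worker in a fresh interpreter that re-imports
    # this module (under the name "__mp_main__") instead of forking.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        return pool.map(square, values)


if __name__ == "__main__":
    # Without this guard, every spawned child would reach the Pool creation
    # while re-importing the module and raise a RuntimeError.
    print(run_workers([1, 2, 3]))  # prints [1, 4, 9]
```

Moving the `print(run_workers(...))` call out of the guard to module level makes each spawned child raise that `RuntimeError` during startup.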

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/hidden_states.py).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch  # only used when this notebook runs in SGLang's CI


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
```

```text
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.23it/s]
```

### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```text
Prompt: Hello, my name is
Generated text:  Michelle, and I am a paranormal investigator.
I have been fascinated with the paranormal for as long as I can remember. As a child, I would often spend hours reading about ghosts, spirits, and other supernatural beings. As I got older, my interest in the paranormal only grew stronger, and I began to explore it more deeply.
I started investigating paranormal activity in my own home, trying to understand the strange noises and movements that I would experience. I quickly realized that I had a knack for it, and my investigations became more in-depth and thorough.
I decided to take my passion for the paranormal to the next level by starting my own
Prompt: The president of the United States is
Generated text:  responsible for defending the country against foreign threats. The role of the president as commander-in-chief is the foundation of the US military’s chain of command, with the president at the top. This position is outlined in the US Constit
```
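
The `temperature` and `top_p` keys in `sampling_params` correspond to the standard temperature-scaling and nucleus (top-p) sampling techniques. As an illustration of those textbook definitions, not of SGLang's own GPU sampler, the pure-Python sketch below applies both to a toy list of logits using the values from the cell above. `apply_temperature_top_p` is a hypothetical helper.

```python
import math


def apply_temperature_top_p(logits: list[float], temperature: float, top_p: float) -> list[float]:
    """Return the renormalized sampling distribution after temperature and top-p."""
    # Temperature scaling: divide logits by T before the softmax (assumes T > 0).
    # T < 1 sharpens the distribution, T > 1 flattens it.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus (top-p) filtering: keep the smallest set of most likely tokens
    # whose cumulative probability reaches top_p, then renormalize.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    return [probs[i] / kept_mass if i in kept else 0.0 for i in range(len(probs))]


probs = apply_temperature_top_p([2.0, 1.0, 0.0, -1.0], temperature=0.8, top_p=0.95)
# With these values the least likely token falls outside the nucleus and gets 0.0.
print([round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the most likely tokens, and a smaller `top_p` cuts off more of the tail.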

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```text
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy reading, trying new foods, and practicing yoga. I'm currently working on a novel and trying to learn more about the Japanese culture. That's me in a nutshell.
This is a good example of a neutral self-introduction. It provides some basic information about the character, such as their name, age, and occupation, without revealing too much about their personality or background. It also mentions some of their interests and hobbies, which can help to give a sense of who they are and what they value. However, it doesn't reveal too much about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris.
This statement is a concise factual statement about France’s capital city. It directly answers the question and provides a clear and accurate piece of information. There is no need for additional context or explanation, as the statement stands alone as a factual assertion. The use of the present tense verb "is" adds a sense of immediacy and certainty, emphasizing the fact that Paris is currently and has been the capital of France. Overall, this statement is a clear and concise example of a factual statement.
Here are a few more examples of concise factual statements

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a topic of much speculation and debate. Some experts predict that AI will become increasingly integrated into our daily lives, while others warn of the potential risks and challenges associated with its development. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with the potential to revolutionize the way we diagnose and treat diseases.
2. Widespread adoption of AI in education: AI has the potential to transform the way we learn
```
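
The output above comes from `stream_and_merge`, which the header describes as streaming "with overlap removal". The sketch below shows one way to merge streamed chunks under the assumption that a chunk may repeat text already received: strip the longest prefix of each new chunk that matches a suffix of the text merged so far. `trim_overlap` and `merge_chunks` are hypothetical names, the list of strings stands in for a real stream, and the actual helpers in `sglang.utils` may be implemented differently.

```python
def trim_overlap(merged: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that is already a suffix of `merged`."""
    # Try the longest possible overlap first, then shorter ones.
    for size in range(min(len(merged), len(chunk)), 0, -1):
        if merged.endswith(chunk[:size]):
            return chunk[size:]
    return chunk  # no overlap found: keep the whole chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks, removing text repeated across chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Simulated stream in which the second and third chunks repeat earlier text.
print(repr(merge_chunks([" Paris", "Paris is the", " the capital."])))  # ' Paris is the capital.'
```

A suffix/prefix match is a heuristic: if the merged text ends in `.` and the next chunk is a genuine `.` (as in an ellipsis), that chunk is dropped.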



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```text
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Zephyr. I'm a 17-year-old high school student who has been living in the small town of Willow Creek for as long as I can remember. I have short, dark hair and brown eyes. I'm an average student and enjoy playing music and hiking in my free time.
This is a good start, but let's break it down and make it even more specific and engaging. What do you think would make a better introduction? Maybe we could add more details about Zephyr's personality, interests, or background to make them more relatable and interesting?
Here are a few suggestions to consider:
* What are Z

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The capital of France is Paris. It is located in the north central part of the country and is the largest city in France. Paris has a population of approximately 2.1 million people and is known
```
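
An alternative to one batched call, for example in a custom server where prompts arrive one at a time, is to start one coroutine per prompt and await them together with `asyncio.gather`, which returns results in the order the coroutines were passed regardless of which finishes first. In the sketch below, `fake_async_generate` is a hypothetical stub that only stands in for an async engine call so the code runs without a model; whether per-prompt calls are batched as efficiently as a single list call depends on the engine's scheduler and is not measured here.

```python
import asyncio


async def fake_async_generate(prompt: str, delay: float) -> dict:
    # Hypothetical stand-in for an async engine call; it returns a dict with
    # a "text" key, like the outputs printed above.
    await asyncio.sleep(delay)  # yields control so other coroutines can run
    return {"text": f"<completion for {prompt!r}>"}


async def generate_all(prompts: list[str]) -> list[dict]:
    # Earlier prompts get longer delays, so later prompts finish first...
    delays = [0.03 * (len(prompts) - i) for i in range(len(prompts))]
    tasks = [fake_async_generate(p, d) for p, d in zip(prompts, delays)]
    # ...but asyncio.gather still returns results in the order of `tasks`.
    return await asyncio.gather(*tasks)


outputs = asyncio.run(generate_all(["Hello, my name is", "The capital of France is"]))
for output in outputs:
    print(output["text"])
```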

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # async_stream_and_merge streams the output for one prompt and yields
        # each chunk with text overlapping the already-printed output removed
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```text
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Maya Blackwood. I'm a 27-year-old graphic designer living in Portland, Oregon. I enjoy exploring the city's food trucks, practicing yoga, and reading science fiction novels. That's me in a nutshell. I'm always looking for new experiences and making the most of life.
That's a good start. You've given the reader a sense of your character's personality, interests, and background. However, you might want to consider adding a bit more depth and nuance to make your self-introduction more engaging.
Here are a few suggestions to help you revise your introduction:
* Add a unique detail that reveals something about your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Previous Previous post: Provide a concise factual statement about the capital of Switzerland. The capital of Switzerland is Bern.
Next Next post: Provide a concise factual statement about the capital of Brazil. The capital of Brazil is Brasília. Brazil was the only country in South America that was not colonized by the Spanish or the Portuguese. Brazil was actually colonized by Portugal. The capital was moved from Rio de Janeiro to Brasília in 1960. Brasília was designed by Oscar Niemeyer and Lúcio Costa. The city was named after the Brazilwood tree. The official name of the city is Plano Piloto.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  difficult to predict, but here are some possible trends that could shape its development:
  1. Increased emphasis on explainability and transparency: As AI becomes more ubiquitous, there will be a growing need to understand how decisions are being made and why. This could lead to the development of more transparent and explainable AI models.
  2. Greater focus on human-AI collaboration: As AI becomes more capable, humans and AI systems will need to work together more closely. This could lead to the development of new interfaces and tools that enable seamless collaboration between humans and AI.
  3. More emphasis on edge AI: With the proliferation
```
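
The asynchronous variant follows the same idea but yields each cleaned piece as soon as it arrives, which is what lets the loop above print text incrementally. The sketch below assumes, as a hypothetical, that streamed chunks can repeat the end of earlier text; `fake_stream` is a hypothetical async generator standing in for the engine's streaming output, `stream_and_trim` is a hypothetical name, and `async_stream_and_merge` in `sglang.utils` may differ in detail.

```python
import asyncio


async def fake_stream(chunks):
    # Hypothetical stand-in for a streaming engine call: yields text chunks
    # asynchronously, some of which repeat the end of earlier text.
    for chunk in chunks:
        await asyncio.sleep(0)  # give other coroutines a chance to run
        yield chunk


async def stream_and_trim(chunks):
    """Yield only the new part of each chunk, dropping text already emitted."""
    merged = ""
    async for chunk in fake_stream(chunks):
        # Length of the longest prefix of `chunk` that is a suffix of `merged`.
        overlap = next(
            (n for n in range(min(len(merged), len(chunk)), 0, -1)
             if merged.endswith(chunk[:n])),
            0,
        )
        new_text = chunk[overlap:]
        merged += new_text
        yield new_text


async def main():
    async for piece in stream_and_trim([" Maya", "Maya Black", "wood."]):
        print(piece, end="", flush=True)
    print()  # prints " Maya Blackwood." followed by a newline


asyncio.run(main())
```
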




```python
llm.shutdown()
```
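
`llm.shutdown()` stops the engine. In a script, it is worth making sure this call also runs when generation raises an exception, either with `try`/`finally` or with a small context manager like the sketch below. `engine_session` and `DummyEngine` are hypothetical: `DummyEngine` only mimics the `generate` and `shutdown` methods used in this document so the sketch runs without a GPU. With SGLang, `make_engine` could be `lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")`, and the `with` block would go inside the `if __name__ == "__main__":` guard described in the warning section.

```python
from contextlib import contextmanager


class DummyEngine:
    # Hypothetical stand-in for sgl.Engine that records whether it was shut down.
    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": f" <output for {p!r}>"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(make_engine):
    """Create an engine, hand it to the caller, and always shut it down."""
    engine = make_engine()
    try:
        yield engine
    finally:
        # Runs on normal exit and when the body raises an exception.
        engine.shutdown()


with engine_session(DummyEngine) as llm:
    outputs = llm.generate(["The capital of France is"], {"temperature": 0.8})
    print(outputs[0]["text"])
print(llm.is_shut_down)  # True
```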