# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

**To launch the offline engine in a Python script, guard the entry point with `if __name__ == "__main__":`, because SGLang uses `spawn` mode to create subprocesses. See this [simple example](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py) for details.**

You can also build a custom server on top of the SGLang offline engine. A detailed example that runs as a Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/hidden_states.py).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl

from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Jake!
Hello, my name is Jake! I am a biologist by training and have a passion for learning about the natural world. I have worked on various research projects studying animal behavior, ecology, and conservation biology. My goal is to communicate the importance of conservation to the public and inspire people to take action to protect our planet’s biodiversity.
What do you think about this image?
I'd love to hear your thoughts on this image! What do you think it's about? Is there anything in particular that stands out to you? Let's have a conversation about it!
What do you think about this image?
I'd love to hear your
Prompt: The president of the United States is
Generated text:  hosting a private dinner with a high-profile foreign leader at the White House. As the guests take their seats and the servers begin to circulate with appetizers, the dinner conversation turns to a sensitive topic: North Korea's nuclear program.
The president leans in,
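
The `temperature` and `top_p` values passed in `sampling_params` control how the next token is sampled. The block below is a minimal, standalone sketch of the general technique (temperature scaling followed by nucleus filtering) on a toy distribution. It is not SGLang's implementation; the function name `top_p_filter` and the toy logits are hypothetical and exist only for illustration.

```python
import math


def top_p_filter(logits, temperature=0.8, top_p=0.95):
    """Return {token: prob} after temperature scaling and top-p filtering.

    Illustrative only: `logits` is a hypothetical {token: score} mapping,
    and `temperature` must be greater than zero.
    """
    # Temperature scaling: values below 1.0 sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}

    # Numerically stable softmax over the scaled scores.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    kept = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so their probabilities sum to 1.
    kept_total = sum(kept.values())
    return {tok: p / kept_total for tok, p in kept.items()}


# Made-up scores for four candidate next tokens.
toy_logits = {" Paris": 5.0, " Lyon": 2.0, " a": 1.0, " the": 0.5}
print(top_p_filter(toy_logits, temperature=0.8, top_p=0.95))
```

In this sketch, lowering `temperature` concentrates probability on the top token, while lowering `top_p` shrinks the set of tokens that remain eligible for sampling.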

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy reading, hiking, and trying new foods. I'm currently working on a novel and experimenting with different writing styles. I'm a bit of a introvert, but I'm always up for a good conversation. I'm looking forward to meeting new people and learning about their experiences.
This is a good start, but it's a bit too focused on the character's professional life. You might want to add a bit more personality and flair to make it more engaging. Here's an example of how you could revise it:
"Hi, I'm K

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is located in the northern part of the country and is situated on the Seine River. Paris is known for its rich history, art, fashion, and culture. It is home to many famous landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. Paris is also a major economic and financial center, hosting many international organizations and companies. The city has a population of over 2.1 million people and is a popular tourist destination. Paris is known for its romantic atmosphere, beautiful architecture, and vibrant cultural scene. It is a city that is steeped in history and tradition, yet

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is likely to play a larger role in healthcare, with applications in medical diagnosis, personalized medicine, and patient care. AI-powered chatbots and virtual assistants may become more common in healthcare settings, helping patients navigate the healthcare system and providing support for chronic disease management.
2. Advancements in natural language processing (NLP): NLP is a key area of AI research, and future advancements in this field may enable more sophisticated and human-like language understanding
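
The "overlap removal" mentioned in the cell above can be pictured as trimming any prefix of a new chunk that repeats the tail of the text collected so far. The following is an independent sketch of that idea under that assumption; it may differ from what `stream_and_merge` does internally, and the helper `merge_chunks` and the sample chunks are hypothetical.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the tail of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest candidate first so the largest repeated span is removed.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks while dropping overlapping prefixes."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each piece repeats the end of the previous one.
chunks = [" Paris is", " is the capital", " capital of France."]
print(merge_chunks(chunks))
```

One limitation of this naive approach is that it cannot tell an accidental overlap from a legitimate repetition: merging `"ha"` with a second `"ha"` would drop the second chunk entirely.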



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Jamie Chen. I am a software engineer and have been working in the field for over five years. I am currently working on several projects to improve the efficiency of our company's software development processes. I enjoy learning about new technologies and contributing to open-source projects in my free time. What is your role in this story? I am the lead engineer of the development team, and I oversee the work of other engineers. My goal is to ensure that our software meets the needs of our users and is delivered on time. I am responsible for managing the team's workload and making key decisions about project direction. What are your strengths and weaknesses? My

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide a brief description of a country or place. France is a country located in Western Europ
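
A practical reason to prefer the asynchronous call is that other coroutines can make progress while generation is awaited. The sketch below shows that pattern with `asyncio.gather`. So that it runs without a model, it uses `StubEngine`, a hypothetical stand-in that only mimics the `async_generate(prompts, sampling_params)` call shape used above; `other_work` and `run_concurrently` are likewise illustrative names.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; it only echoes prompts back."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # Simulates time spent generating.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def other_work():
    """An unrelated coroutine that should not be blocked by generation."""
    await asyncio.sleep(0.01)
    return "other work done"


async def run_concurrently(engine, prompts, sampling_params):
    # gather schedules both coroutines on the same event loop and
    # returns their results in the order they were passed in.
    outputs, status = await asyncio.gather(
        engine.async_generate(prompts, sampling_params),
        other_work(),
    )
    return outputs, status


outputs, status = asyncio.run(
    run_concurrently(
        StubEngine(), ["The capital of France is"], {"temperature": 0.8}
    )
)
print(status, outputs[0]["text"])
```

With a real engine, the same structure would apply by passing `llm` in place of `StubEngine()`, assuming it is awaited from within a running event loop as in the cell above.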

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaida. I'm a 17-year-old high school student who enjoys playing guitar and listening to music. I spend most of my free time at home, where I like to read and relax. I'm a bit of a quiet person, but I'm friendly once you get to know me.
In this self-introduction, Kaida briefly mentions her name, age, and interests. She also gives a hint about her personality by describing herself as "a bit of a quiet person." The tone is neutral, which means she's not revealing too much about herself, making it easy for others to form their own opinions about her. The

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The city is known for its beautiful architecture, famous museums and art galleries, and historic landmarks such as the Eiffel Tower and Notre-Dame Cathedral.
Here are some possible ways to expand this basic statement:
Use descriptive language to create vivid images:
Paris, the City of Light, is the capital of France. Its stunning architecture, from Gothic cathedrals to elegant art nouveau buildings, creates a breathtaking backdrop for the city's famous museums and art galleries. The iconic Eiffel Tower, a symbol of French ingenuity, stands tall alongside the beautiful Notre-Dame Cathedral, a masterpiece of Gothic architecture.
Provide interesting facts

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be transformative, with potential implications across various industries and aspects of life. Future AI trends are likely to involve more sophisticated AI systems that integrate various technologies and capabilities, leading to significant improvements in efficiency, productivity, and decision-making.
1. Increased Adoption of Explainable AI (XAI):
XAI aims to develop AI systems that can provide transparent and understandable explanations for their decisions. As the demand for trust and accountability in AI grows, XAI is expected to become more prevalent, particularly in high-stakes applications like healthcare and finance.
2. Augmentation of Human Capabilities:
Future AI systems will focus on augmenting human capabilities rather

In [6]:
llm.shutdown()
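
The streaming asynchronous pattern from the last generation section can be tried without loading a model. The sketch below consumes an async generator with `async for` and prints each chunk with `end=""` so the pieces join into continuous text. `stub_stream` is a hypothetical stand-in for a real token stream, and `collect` is an illustrative helper, not part of SGLang.

```python
import asyncio


async def stub_stream(text, chunk_size=4):
    """Hypothetical token stream: yields `text` a few characters at a time."""
    for start in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # Yield control, as a real stream would.
        yield text[start : start + chunk_size]


async def collect(stream):
    """Print chunks as they arrive and return the full text."""
    pieces = []
    async for chunk in stream:
        # end="" keeps consecutive chunks on one line; flush shows them at once.
        print(chunk, end="", flush=True)
        pieces.append(chunk)
    print()  # Newline once the stream is exhausted.
    return "".join(pieces)


full_text = asyncio.run(
    collect(stub_stream(" Paris. The city is known for its architecture."))
)
```

Collecting the chunks in a list and joining them once at the end avoids repeated string concatenation when the stream is long.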