# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation
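
As a rough illustration of how the four modes differ in call shape, the sketch below uses a hypothetical `StubEngine` that echoes prompts instead of loading a model. The `generate` and `async_generate` names mirror the calls used later in this document; `generate_stream` and `async_generate_stream` are names invented for the stub and are not SGLang APIs.

```python
import asyncio
from typing import AsyncIterator, Dict, Iterator, List


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; echoes prompts instead of running a model."""

    @staticmethod
    def _pieces(prompt: str) -> List[str]:
        # Pretend the model produced three chunks for this prompt.
        return [" echo", ":", f" {prompt}"]

    def generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict[str, str]]:
        # Non-streaming synchronous: block, then return one dict per prompt.
        return [{"text": "".join(self._pieces(p))} for p in prompts]

    def generate_stream(self, prompt: str, sampling_params: Dict) -> Iterator[Dict[str, str]]:
        # Streaming synchronous: yield chunks one at a time (invented method name).
        for piece in self._pieces(prompt):
            yield {"text": piece}

    async def async_generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict[str, str]]:
        # Non-streaming asynchronous: awaitable, so other coroutines can run meanwhile.
        await asyncio.sleep(0)
        return self.generate(prompts, sampling_params)

    async def async_generate_stream(
        self, prompt: str, sampling_params: Dict
    ) -> AsyncIterator[Dict[str, str]]:
        # Streaming asynchronous: an async generator of chunks (invented method name).
        for chunk in self.generate_stream(prompt, sampling_params):
            await asyncio.sleep(0)
            yield chunk


def run_all_modes(engine: StubEngine, prompts: List[str], params: Dict) -> Dict[str, List[str]]:
    """Run the same prompts through all four modes and collect the final texts."""
    results: Dict[str, List[str]] = {}
    results["sync"] = [o["text"] for o in engine.generate(prompts, params)]
    results["sync_stream"] = [
        "".join(c["text"] for c in engine.generate_stream(p, params)) for p in prompts
    ]

    async def _async_part() -> None:
        outputs = await engine.async_generate(prompts, params)
        results["async"] = [o["text"] for o in outputs]
        streamed = []
        for p in prompts:
            parts = [c["text"] async for c in engine.async_generate_stream(p, params)]
            streamed.append("".join(parts))
        results["async_stream"] = streamed

    asyncio.run(_async_part())
    return results


MODES = run_all_modes(StubEngine(), ["Hello"], {"temperature": 0.0})
```

All four entries in `MODES` hold the same text; the modes differ only in whether the caller blocks and whether output arrives in pieces.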

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
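
To give a feel for what "a custom server on top of the engine" can mean, here is a minimal sketch that is not taken from the linked script. It assumes a newline-delimited JSON protocol over TCP, a hypothetical `StubEngine` in place of `sgl.Engine`, and an arbitrary port; the request and response field names (`prompts`, `sampling_params`, `texts`) are also invented for illustration.

```python
import asyncio
import json
from typing import Dict, List


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; returns canned text instead of running a model."""

    async def async_generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict[str, str]]:
        await asyncio.sleep(0)  # Yield control, as a real engine call would while generating.
        return [{"text": f" (generated for {p!r})"} for p in prompts]


async def handle_generate(engine, raw_body: bytes) -> bytes:
    """Turn one JSON request body into one JSON response by calling the engine directly."""
    request = json.loads(raw_body)
    prompts = request["prompts"]
    sampling_params = request.get("sampling_params", {})
    outputs = await engine.async_generate(prompts, sampling_params)
    return json.dumps({"texts": [o["text"] for o in outputs]}).encode("utf-8")


async def serve(engine, host: str = "127.0.0.1", port: int = 8765) -> None:
    """Serve requests over TCP: one JSON request per line, one JSON response per line."""

    async def on_client(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
        # readline() returns b"" at end of stream, which ends the loop.
        while line := await reader.readline():
            writer.write(await handle_generate(engine, line) + b"\n")
            await writer.drain()
        writer.close()

    server = await asyncio.start_server(on_client, host, port)
    async with server:
        await server.serve_forever()
```

Keeping the request handling in `handle_generate` separate from the transport in `serve` means the same handler could sit behind a different framework without changes.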



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # Only needed in the project's CI environment.
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Elizabeth and I am an MFA student in the Creative Writing program at Chatham University. I am thrilled to be a part of the 2018-2020 class of students, and I look forward to sharing my writing with you.
I am originally from the Philadelphia area, but I have lived in Pittsburgh for the past few years. I have always been passionate about writing, and I am excited to explore the many forms and styles that creative writing has to offer.
My interests include poetry, short story writing, and creative nonfiction. I enjoy writing about topics such as identity, social justice, and personal relationships. I am also interested
Prompt: The president of the United States is
Generated text:  calling for unity, but is that really what Americans want?
In the aftermath of the presidential election, many Americans are still reeling from the outcome. The nation is more divided than ever, and many people are left wondering if unity is even possible.
President-ele

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor living in Tokyo. I enjoy reading, hiking, and trying out new restaurants. I'm a bit of a homebody, but I love exploring the city and discovering new hidden gems. I'm a bit of a perfectionist, which can sometimes make it difficult for me to relax and enjoy the moment. But I'm working on it. I'm excited to meet new people and make new connections. That's me in a nutshell.
This self-introduction is neutral because it doesn't reveal too much about Kaida's personality, interests, or background. It provides a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is located in the northern part of the country and is situated on the Seine River. Paris is known for its rich history, art, fashion, and cuisine. It is home to many famous landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. Paris is also a major economic and cultural center, attracting millions of tourists and business travelers each year. The city has a population of over 2.1 million people and is a hub for international business, finance, and diplomacy. Paris is also known for its romantic atmosphere, beautiful parks and gardens, and vibrant cultural scene. The city

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. While it's difficult to predict exactly what the future will hold, here are some possible trends that could shape the development and impact of artificial intelligence in the years to come:
1. Increased Adoption in Industries: AI will continue to be adopted in various industries, including healthcare, finance, transportation, and education. This will lead to increased efficiency, productivity, and innovation in these sectors.
2. Advancements in Natural Language Processing (NLP): NLP will continue to improve, enabling AI systems to better understand and generate human-like language. This will lead to more effective communication between humans and machines.
3

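The streaming cell above relies on `stream_and_merge` to remove overlap between streamed chunks. The following self-contained sketch shows one way such overlap removal can work; it is an illustration under that assumption, not necessarily what `sglang.utils` does internally. The names `drop_overlap` and `merge_chunks` are invented for this example.

```python
from typing import Iterable


def drop_overlap(existing_text: str, new_chunk: str) -> str:
    """Remove the longest prefix of new_chunk that already ends existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while skipping text that earlier chunks already contributed."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


MERGED = merge_chunks(["The capital", "capital of France", " of France is Paris"])
CUMULATIVE = merge_chunks(["He", "Hello", "Hello world"])
```

A suffix/prefix heuristic like this has a known limitation: if a new chunk legitimately starts with the characters that end the existing text, they are treated as overlap, so `merge_chunks(["ha", "ha"])` yields `"ha"` rather than `"haha"`.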


### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Lyra Flynn, and I'm a 23-year-old musician and writer living in a small coastal town. I'm still figuring out my place in the world, but for now, I'm enjoying the simple things like playing my guitar on the beach and watching the sunsets. What do you think of my introduction?
Lyra Flynn's introduction is concise and informative, conveying the basics of her identity, interests, and lifestyle. It's neutral, avoiding any sensational or attention-grabbing language, which is suitable for a self-introduction. The details about playing her guitar on the beach and watching sunsets create a peaceful and serene atmosphere

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which has a population of more than 2.1 million people. It is situated in the northern part of the country, near the Seine River.
Paris, the capit
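
One reason to prefer the asynchronous call is that the event loop stays free for other work while a batch is being generated. The sketch below illustrates this with a hypothetical `StubEngine` whose `async_generate` only sleeps briefly; the `heartbeat` task and the `generate_with_progress` helper are invented for this example.

```python
import asyncio
from typing import Dict, List, Tuple


class StubEngine:
    """Hypothetical stand-in for sgl.Engine with an artificial delay instead of a model."""

    async def async_generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict[str, str]]:
        await asyncio.sleep(0.01)  # Simulated generation time.
        return [{"text": f" reply to {p!r}"} for p in prompts]


async def generate_with_progress(
    engine: StubEngine, prompts: List[str], sampling_params: Dict
) -> Tuple[List[Dict[str, str]], int]:
    """Await a batch while a second coroutine keeps running on the same event loop."""
    ticks = 0

    async def heartbeat() -> None:
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.001)

    beat = asyncio.create_task(heartbeat())
    try:
        outputs = await engine.async_generate(prompts, sampling_params)
    finally:
        # Stop the background task and wait for the cancellation to finish.
        beat.cancel()
        await asyncio.gather(beat, return_exceptions=True)
    return outputs, ticks


OUTPUTS, TICKS = asyncio.run(
    generate_with_progress(StubEngine(), ["a", "b"], {"temperature": 0.8})
)
```

`TICKS` ends up greater than zero because the heartbeat coroutine runs while the batch call is awaited, which a blocking `generate` call would not allow.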

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly.
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Zara and I'm a 25-year-old botanist working as a research assistant at a botanical garden. I'm from a small town in the Pacific Northwest, where I grew up surrounded by the forests and coastlines that now inspire my work. My parents were both scientists, and I think that's where I inherited my love of learning and exploration. Currently, I'm involved in a project studying the effects of climate change on plant species in the region. When I'm not in the lab or greenhouse, I enjoy hiking and trying out new recipes in the kitchen. I'm a bit of a introvert, but I'm always

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.  To answer this question, I have provided a concise factual statement.
The statement meets all the requirements of the question. It is concise, as it is short and to the point. It is also factual, as it accurately states the capital of France. The statement does not contain any unnecessary words or information, making it a clear and direct answer. Therefore, the statement is suitable for answering the question about France’s capital city.

However, the statement could be further improved by providing additional information, such as the population of the city or notable landmarks. This would make it more informative and engaging. But for the purpose of the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and it is expected to have a significant impact on various industries and our daily lives.
Explain possible future trends in artificial intelligence. The future of AI is exciting and it is expected to have a significant impact on various industries and our daily lives. In this discussion, I will outline some possible future trends in AI.
Increased use of machine learning and deep learning

One of the possible future trends in AI is the increased use of machine learning and deep learning. Machine learning is a subset of AI that enables computers to learn from data without being explicitly programmed. Deep learning is a type of machine learning that uses neural networks to analyze data. It


In [6]:
llm.shutdown()
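
Because the engine should be shut down once inference is finished, it can help to guarantee that `shutdown()` runs even when generation raises. The sketch below shows one way to do that with a context manager; `StubEngine` is a hypothetical stand-in for `sgl.Engine`, and `managed_engine` is a helper name invented here.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, List


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that only records whether shutdown() ran."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict[str, str]]:
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": " ok"} for _ in prompts]

    def shutdown(self) -> None:
        self.is_shut_down = True


@contextmanager
def managed_engine(engine: StubEngine) -> Iterator[StubEngine]:
    """Yield the engine and always call shutdown() afterwards, even on errors."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed_engine(engine) as llm:
        llm.generate([], {})  # Raises, but shutdown() still runs.
except ValueError:
    pass
```

A plain `try`/`finally` around the generation calls achieves the same effect; the context manager simply makes the cleanup harder to forget.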