# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Advanced Usage

The engine also supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) and [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Catherine Schuessler and I am excited to be your host for the upcoming Online Retreat and Conference. As a speaker, author, and educator, my passion is helping people find hope, healing, and transformation in the midst of life's challenges. I am grateful to be a part of this community and I look forward to connecting with you!
I have been in ministry for over 20 years, teaching and speaking at conferences, retreats, and local churches. My ministry focuses on helping people develop a deeper understanding of God's love, identity, and purpose. My books, "The Beauty of Brokenness" and "The Power of Un
Prompt: The president of the United States is
Generated text:  not the CEO of the country. The president is the head of the executive branch of the federal government, which is just one of the three branches of the federal government established by the Constitution.
The Constitution divides power among three branches:
The legislative branch (Congress
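
For offline batch jobs it is often convenient to persist results rather than print them. The following is a minimal sketch, assuming each element returned by `llm.generate` is a dict with a `"text"` key, as in the cell above. The helper names `to_records` and `to_jsonl` are hypothetical and not part of SGLang, and the example data is a stand-in so the snippet runs without a loaded model.

```python
import json


def to_records(prompts, outputs):
    """Pair each prompt with its generated text (hypothetical helper)."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    return [
        {"prompt": prompt, "text": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


# Stand-in data shaped like the list returned by llm.generate(prompts, sampling_params).
example_prompts = ["The capital of France is", "The future of AI is"]
example_outputs = [{"text": " Paris."}, {"text": " uncertain."}]

records = to_records(example_prompts, example_outputs)
print(to_jsonl(records))
```

With a live engine, the same helpers could be applied to `prompts` and `outputs` from the cell above.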

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor. I live in a small apartment in the city with my cat, Luna. I enjoy reading, hiking, and trying out new coffee shops. I'm a bit of a introvert, but I'm always up for a good conversation.
This self-introduction is neutral because it doesn't reveal any personal opinions or biases. It simply states the character's name, age, occupation, living situation, and interests. It also mentions a few personality traits, but in a way that is neutral and doesn't make any judgments. For example, calling herself a "bit of

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about the population of France’s capital city. The population of Paris is approximately 2.1 million people.
Provide a concise factual statement about the location of France’s capital city. Paris is located in the northern part of France, in the Île-de-France region.
Provide a concise factual statement about the climate of France’s capital city. Paris has a temperate oceanic climate, characterized by mild winters and warm summers.
Provide a concise factual statement about the economy of France’s capital city. Paris is a major economic hub, with a strong focus on finance, fashion, and tourism.
Provide a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. While it is difficult to predict exactly what the future will hold, here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with applications such as:
2. AI-powered robots: Robots are becoming increasingly common in industries such as manufacturing, logistics, and healthcare. In the future, AI-powered robots are likely to become even more sophisticated, with capabilities such as:
3. AI
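
The "overlap removal" mentioned in the cell above can be illustrated with a small, self-contained sketch. It assumes that a streamed chunk may begin by repeating the tail of the text received so far, and it strips that repeated prefix before appending. This is an illustration of the idea only, not SGLang's implementation of `stream_and_merge`; the names `strip_repeated_prefix` and `merge_chunks` are hypothetical.

```python
def strip_repeated_prefix(existing: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the tail of existing."""
    max_len = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, dropping any prefix that repeats the merged text's tail."""
    merged = ""
    for chunk in chunks:
        merged += strip_repeated_prefix(merged, chunk)
    return merged


merged_text = merge_chunks(["Hello, my", " my name", " name is"])
print(merged_text)
```

One caveat of this heuristic: text that legitimately repeats across a chunk boundary (for example `"ha"` followed by `"ha"`) is also treated as overlap, so it is only suitable when chunks are known to carry duplicated context.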



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Petra. I'm a 25-year-old artist living in a small coastal town in the Pacific Northwest. I spend most of my time painting landscapes and working on my latest project, a series of abstract pieces inspired by the ocean's moods.
I'm a creative person with a passion for art and a love for the outdoors. My favorite things to do are hiking, kayaking, and simply sitting on the beach, watching the waves roll in. I'm also an avid reader and enjoy collecting rare books on art and history.
I'm a bit of a introvert, but once you get to know me, I'm a warm and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
This is a factual statement that contains no opinion or bias. It simply presents information about the capital of France.
Provide a statement that conveys a positive opinion about Paris. Paris is a city of unpar
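
Because `async_generate` is awaitable, several batches can in principle be awaited together, which is the typical pattern when building a custom server on top of the engine. The sketch below is written under stated assumptions: `StubEngine` is a hypothetical stand-in that mimics the call shape used above so the code runs without a GPU, and `generate_groups` and `max_in_flight` are illustrative names. Whether issuing concurrent calls suits a particular deployment should be verified against the real engine.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in with the same call shape as llm.async_generate."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real awaitable call would
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def generate_groups(engine, prompt_groups, sampling_params, max_in_flight=2):
    """Await one async_generate call per group, with a cap on concurrent calls."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def run_one(prompts):
        async with semaphore:
            return await engine.async_generate(prompts, sampling_params)

    # asyncio.gather returns results in the same order as the inputs.
    return await asyncio.gather(*(run_one(group) for group in prompt_groups))


groups = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(
    generate_groups(StubEngine(), groups, {"temperature": 0.8, "top_p": 0.95})
)
for group, outputs in zip(groups, results):
    for prompt, output in zip(group, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```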

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Lena Taylor. I'm a young adult in my early twenties, living in the city. I've been working as a freelance artist, focusing on digital media and graphic design. I'm currently taking online courses to further develop my skills in animation. When I'm not working or studying, I enjoy exploring the city, trying out new restaurants and cafes, and spending time with friends. I'm a bit of a introvert, but I'm working on becoming more outgoing and confident. I'm excited to see where my creative journey takes me.
This self-introduction is neutral because it doesn't reveal any specific personality traits, interests, or goals

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide a concise factual statement about France’s largest city. The largest city in France is Lyon.
Provide a concise factual statement about the economic status of France. France has the sixth largest economy in the world.
Provide a concise factual statement about the population of France. The population of France is approximately 67 million people.
Provide a concise factual statement about France’s official language. The official language of France is French.
Provide a concise factual statement about France’s system of government. France is a semi-presidential constitutional republic.
Provide a concise factual statement about the currency of France. The official currency of France is the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  far from certain, and there are numerous potential trends that could shape the development of this technology. Some of these trends include: - Rise of Explainable AI (XAI): As AI becomes more integrated into our lives, there will be a growing need for transparency and accountability in AI decision-making. XAI aims to make AI more interpretable and transparent, allowing users to understand the reasoning behind AI-driven decisions. - Increased Adoption of Edge AI: As the number of IoT devices continues to grow, edge AI will become increasingly important for processing data locally and making decisions in real-time. This will enable faster and more efficient decision-making, while also

In [6]:
llm.shutdown()
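
If an exception is raised before the final cell runs, `llm.shutdown()` is never reached. One way to guard against that is a small context manager that always calls `shutdown()` on exit. This is a sketch rather than an SGLang feature: `managed_engine` is a hypothetical helper, and `StubEngine` is a stand-in that only mimics the `generate` and `shutdown` methods used in this document so the snippet runs without loading a model.

```python
from contextlib import contextmanager


@contextmanager
def managed_engine(factory):
    """Create an engine via factory() and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


class StubEngine:
    """Hypothetical stand-in exposing the generate/shutdown methods used above."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": ""} for _ in prompts]

    def shutdown(self):
        self.closed = True


with managed_engine(StubEngine) as engine:
    outputs = engine.generate(["Hello, my name is"], {"temperature": 0.8})

print("engine shut down:", engine.closed)
```

With a real engine, the factory could be something like `lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")`, matching the constructor call used at the top of this document.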