# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
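
At its core, a custom server is a thin layer that turns a request body into a call on the engine. The following is a minimal sketch of that layer only, not the linked example: `handle_generate` and `StubEngine` are hypothetical names, and the sketch assumes that `engine.generate(prompt, sampling_params)` with a single string prompt returns a single dict with a `"text"` key.

```python
import json


class StubEngine:
    """Stand-in for the engine so the sketch runs without a model (hypothetical)."""

    def generate(self, prompt, sampling_params):
        # Assumption: a single string prompt yields a single dict with a "text" key.
        return {"text": f" echo: {prompt}"}


def handle_generate(engine, body: bytes) -> tuple[int, bytes]:
    """Map a JSON request body to (status_code, JSON response body)."""
    try:
        payload = json.loads(body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        # Covers malformed JSON, a missing field, and non-object payloads.
        error = {"error": "expected a JSON object with a 'prompt' field"}
        return 400, json.dumps(error).encode()
    if not isinstance(prompt, str):
        return 400, json.dumps({"error": "'prompt' must be a string"}).encode()

    sampling_params = payload.get("sampling_params", {})
    output = engine.generate(prompt, sampling_params)
    return 200, json.dumps({"text": output["text"]}).encode()


status, raw = handle_generate(StubEngine(), b'{"prompt": "Hi"}')
print(status, raw.decode())
```

Keeping the handler independent of any web framework makes it easy to test and to mount behind whichever HTTP library you prefer.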



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Samantha (Sam) and I am the owner of Sam’s Artistic Creations, a unique boutique gift shop in downtown Marshfield, Wisconsin.
I have been in the gift shop business for over 20 years and have always had a passion for art and creating things. I decided to open my own business after I retired from my previous career as a school teacher.
My shop features a wide variety of items such as handmade jewelry, candles, soaps, knitted goods, and more. I also have a section of my shop dedicated to local artisans where they can sell their handmade items. It’s a great way for people to find one
Prompt: The president of the United States is
Generated text:  the leader of the federal government of the United States. The president serves a four-year term and is both the head of state and the head of government of the United States. The president is directly elected by the people through the Electoral College.
The president is responsible for the execution of la
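
In offline batch jobs the results are usually written to disk rather than printed. The following is a minimal sketch, not part of SGLang: `to_jsonl` is a hypothetical helper, and it relies only on what the cell above shows, namely that `llm.generate` returns one dict with a `"text"` key per prompt, in the same order as the prompts. The example data is a stand-in so the sketch runs without a model.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as JSON Lines, one record per prompt."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    records = (
        {"prompt": prompt, "text": output["text"]}
        for prompt, output in zip(prompts, outputs)
    )
    # ensure_ascii=False keeps non-ASCII characters readable in the file.
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


# Stand-in data shaped like the return value of llm.generate(prompts, sampling_params).
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]
print(to_jsonl(example_prompts, example_outputs))
```

With a real engine, you would pass `prompts` and `outputs` from the cell above and write the returned string to a file.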

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor. I live in a small apartment in the city with my cat, Luna. I enjoy reading, hiking, and trying out new recipes in my free time. I'm a bit of a introvert, but I'm always up for a good conversation.
This self-introduction is neutral because it doesn't reveal any personal biases or opinions. It simply states the character's name, occupation, and interests in a straightforward and factual way. This type of introduction is often used in professional or social settings where you want to present yourself in a clear and concise manner.
Here are

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. The city is known for its rich history, art museums, fashion, and cuisine. Paris is home to many famous landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. The city has a population of over 2.1 million people and is a major hub for international business, culture, and tourism. Paris is also known for its romantic atmosphere and is often referred to as the City of Light. The city has a long history dating back to the 3rd century BC and has been influenced

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems will be able to analyze large amounts of medical data, identify patterns, and make predictions about patient outcomes.
2. Rise of Explainable AI: As AI becomes more pervasive, there will be a growing need to understand how AI systems make decisions. Explainable AI (XAI) will become increasingly important to ensure that AI systems
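
The "overlap removal" mentioned above matters when successive streamed chunks can repeat text that was already received. The idea can be sketched as follows. This is an illustration under stated assumptions, not the implementation of `stream_and_merge`: `drop_overlap` and `merge_chunks` are hypothetical helpers, and the chunk values are made up.

```python
def drop_overlap(existing: str, chunk: str) -> str:
    """Return `chunk` without the longest prefix that is already a suffix of `existing`."""
    longest = min(len(existing), len(chunk))
    for size in range(longest, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, skipping text repeated at each chunk boundary."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats the tail of the previous text.
print(merge_chunks([" Paris", " Paris is", " is the capital"]))
```

One caveat of this approach: a chunk that legitimately starts with the same characters that end the text so far (for example `"ha"` followed by `"ha"`) is indistinguishable from a repeat and gets trimmed, so it is only appropriate when chunks are known to overlap.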



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Rowan Stone. I'm a 25-year-old freelance writer living in Portland, Oregon. I spend most of my free time exploring the city's hidden corners and writing in my small apartment.
The name of the character is Rowan Stone. Rowan is a 25-year-old freelance writer who lives in Portland, Oregon. The introduction is neutral, giving no indication of Rowan's personality, background, or motivations. The description of Rowan's life is also neutral, focusing on their occupation and daily habits without expressing any emotions or opinions. The tone is calm and matter-of-fact, which is typical for a neutral self-int

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Provide a concise factual statement about France’s capital city. The capital of France is Paris. This statement is already in a concise and factual format, s
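
The cell above submits all prompts in one call. When requests arrive separately, or each needs its own sampling parameters, one option is to issue one coroutine per prompt and await them together. The following is a minimal sketch, not an SGLang API: `generate_concurrently` and `StubAsyncEngine` are hypothetical names, and the sketch assumes that `async_generate` called with a single string prompt returns a single dict with a `"text"` key.

```python
import asyncio


class StubAsyncEngine:
    """Stand-in with the same call shape as llm.async_generate (hypothetical)."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real engine would while generating
        # Assumption: a single string prompt yields a single dict with a "text" key.
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    """Start one request per prompt and wait for all of them."""
    pending = [engine.async_generate(prompt, sampling_params) for prompt in prompts]
    # asyncio.gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*pending)


results = asyncio.run(
    generate_concurrently(StubAsyncEngine(), ["first", "second"], {"temperature": 0.8})
)
for result in results:
    print(result["text"])
```

Note that `asyncio.run` cannot be called while another event loop is already running in the same thread, so inside an `async` function you would `await generate_concurrently(...)` directly.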

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Hiroshi Akari. I'm a 25-year-old information security consultant living in Tokyo, Japan. My friends describe me as calm and collected under pressure, which has helped me excel in my line of work. I enjoy hiking and practicing aikido in my free time. What are some ways you can improve this introduction?
The introduction is clear and neutral. However, it is fairly standard and doesn’t reveal much about Hiroshi’s personality or character. Here are a few suggestions to make the introduction more engaging and character-driven:
1.  Add more details about Hiroshi’s background and experiences: What inspired him to become an information

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The statement is a straightforward declaration that names the capital of France as Paris. It avoids unnecessary descriptions or embellishments, presenting the information in a clear and direct manner.
The following example is a concise factual statement about the capital city of France. It meets the criteria by providing a straightforward declaration without additional information or descriptions.
Paris is the capital of France.
This statement adheres to the format by stating the name of the city (Paris) and declaring its relation to being the capital of France, without adding extra details. It is a clear and concise declaration.
Next, I will address the final example. This example is a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by increasing autonomy, decision-making, and integration with various aspects of life. Here are some possible future trends in AI:
1. More sophisticated and autonomous decision-making systems: As AI technology advances, we can expect to see more complex and autonomous decision-making systems that can learn from data, adapt to new situations, and make decisions with minimal human intervention.
2. Increased use of explainable AI: As AI becomes more pervasive, there will be a growing need for transparency and accountability in AI decision-making. Explainable AI (XAI) will become increasingly important to ensure that AI systems are transparent and understandable.
3. Rise
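
The streaming cell above prints each chunk as it arrives but does not keep the full text. If you need both, you can accumulate the chunks while printing them. The following is a minimal sketch: `collect_stream` and `fake_stream` are hypothetical names, and `fake_stream` stands in for an async generator of text chunks such as the one consumed in the cell above.

```python
import asyncio


async def fake_stream(text, size=4):
    """Stand-in async generator that yields `text` in fixed-size pieces (hypothetical)."""
    for start in range(0, len(text), size):
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield text[start : start + size]


async def collect_stream(chunks):
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    async for chunk in chunks:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = asyncio.run(collect_stream(fake_stream(" Paris is the capital of France.")))
```

Collecting into a list and joining once at the end avoids repeated string concatenation for long generations.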




In [6]:
llm.shutdown()
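
In a script, an exception raised before `llm.shutdown()` would skip the call. One way to guard against that is a small context manager that always shuts the engine down. This is a minimal sketch, not an SGLang API: `engine_session` and `StubEngine` are hypothetical names, and the only assumption about the engine is the `shutdown()` method used in the cell above.

```python
from contextlib import contextmanager


class StubEngine:
    """Stand-in that records whether shutdown() was called (hypothetical)."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def engine_session(engine):
    """Yield the engine and call its shutdown() on exit, even if an error occurs."""
    try:
        yield engine
    finally:
        engine.shutdown()


stub = StubEngine()
with engine_session(stub) as engine:
    pass  # run generation here
print(stub.closed)
```

With a real engine, you would wrap the `sgl.Engine(...)` instance from the first cell in the same way.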