# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
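
To give a feel for what "a custom server on top of the engine" involves, the following is a minimal, framework-free sketch of the request-handling core: it turns a JSON request body into a single call to the engine's `generate` method (demonstrated later in this document) and returns a JSON response. It is not taken from the linked script. `StubEngine`, `handle_generate`, and the `prompts` / `sampling_params` / `texts` field names are hypothetical, chosen only so the snippet runs without a model.

```python
import json


class StubEngine:
    """Hypothetical stand-in that mimics the result shape of the engine's generate()."""

    def generate(self, prompts, sampling_params):
        # One dict per prompt, in input order, with the completion under "text".
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def handle_generate(engine, body: bytes) -> bytes:
    """Translate a JSON request body into one engine call and a JSON response."""
    request = json.loads(body)
    prompts = request["prompts"]  # hypothetical request field
    sampling_params = request.get("sampling_params", {})  # hypothetical request field
    outputs = engine.generate(prompts, sampling_params)
    return json.dumps({"texts": [o["text"] for o in outputs]}).encode("utf-8")


response = handle_generate(
    StubEngine(),
    json.dumps({"prompts": ["The capital of France is"]}).encode("utf-8"),
)
print(response.decode("utf-8"))
```

A real server would wrap a function like this in the routing layer of whichever web framework you choose and pass in the actual engine object instead of the stub.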



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.09it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:02<00:02,  1.02s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:03<00:01,  1.03s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.31it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.17it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Sabine. I am a 31-year-old Registered Nurse (RN) and I am a part of the greatest profession in the world. I was born and raised in the beautiful country of the Netherlands. In my free time, I love to hike, practice yoga, read and enjoy good food and wine.
As a nurse, I have had the privilege of caring for patients from all walks of life and backgrounds. This has broadened my perspective and taught me the importance of empathy, compassion, and kindness. In my professional career, I aim to provide exceptional care and support to my patients, and to make a positive impact on their lives
Prompt: The president of the United States is
Generated text:  an elected official who serves as the head of state and government of the United States. The president is both the head of state and the head of government of the country. The president is directly elected by the people through the Electoral College system and serves a four-year term.
The president has
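
As the cell above shows, `generate` accepts a list of prompts and returns one result per prompt, in the same order, with the generated string under the `text` key. A minimal sketch of pairing prompts with results for later processing follows. `StubEngine` and `collect_records` are hypothetical names used so the snippet runs without a model; in practice you would pass the `llm` object created earlier.

```python
import json


class StubEngine:
    """Hypothetical stand-in that mimics the result shape of the engine's generate()."""

    def generate(self, prompts, sampling_params):
        # One dict per prompt, in input order, with the completion under "text".
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def collect_records(engine, prompts, sampling_params):
    """Run one batched call and pair every prompt with its generated text."""
    outputs = engine.generate(prompts, sampling_params)
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


records = collect_records(
    StubEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)

# Serialize as JSON Lines, a common format for offline batch results.
jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```

Keeping the prompt next to its completion makes the output self-describing, which helps when results are written to disk and inspected later.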

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor. I live in a small apartment in the city with my cat, Luna. I enjoy reading, hiking, and trying out new recipes in my spare time. I'm a bit of a introvert, but I'm always up for a good conversation.
This is a good start, but it's a bit too focused on the mundane aspects of your life. You might want to add a bit more depth or interest to your self-introduction. For example, you could mention a hobby or interest that you're particularly passionate about, or a goal or aspiration that you're

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris. This is a simple and straightforward statement that provides a basic fact about France’s capital city. It does not include any additional information or context, but rather serves as a clear and concise statement of fact. This type of statement is often used in educational or informational contexts, such as in a textbook or encyclopedia entry. It is a good example of a factual statement because it is based on verifiable evidence and is not subject to personal opinion or interpretation. In this case, the statement is supported by a wide range of sources, including official government websites

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even larger role in healthcare, with the potential to revolutionize the way we diagnose and treat diseases.
2. Widespread adoption of AI in industries: AI is already being used in various industries such as finance, transportation, and customer service. In the future, AI is likely to become even more widespread, with the potential to automate many tasks and improve
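
The "overlap removal" mentioned in the cell above refers to merging streamed chunks so that text repeated at the start of a new chunk is not emitted twice. The sketch below shows one way such a merge could work. It is an illustration under that assumption, not necessarily how `stream_and_merge` is implemented, and `strip_repeated_prefix` and `merge_chunks` are hypothetical names.

```python
def strip_repeated_prefix(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first so the largest repeat is removed.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks, skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += strip_repeated_prefix(merged, chunk)
    return merged


# Simulated stream in which consecutive chunks partially repeat earlier text.
simulated_stream = [" Paris", " Paris is", " is the capital", " of France."]
merged_text = merge_chunks(simulated_stream)
print(merged_text)
```

One trade-off of this suffix/prefix matching is that a chunk which legitimately begins with the same characters that end the text so far would also be trimmed, so it fits streams that are known to resend earlier text.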



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Odessa. I'm a 21-year-old student at the University of California, Berkeley. I'm a psychology major with a minor in creative writing. I enjoy reading and writing in my free time, and I'm currently working on my first novel. I also like hiking and trying new foods. I'm a bit of a introvert and value my alone time, but I'm always up for a good conversation when I feel comfortable.
Write a short, neutral self-introduction for a fictional character. Hello, my name is Ruby and I'm a 25-year-old freelance graphic designer. I'm originally from a small town in the

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is the most populous city in France and the most visited city in the world, with more than 23 million tourists per year. The city is known for its iconic landmarks, including the Eiffel Tower, Not
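
Because `async_generate` is awaited, the event loop stays free to run other coroutines while generation is in flight. The sketch below illustrates this by running a batched generation call alongside an unrelated coroutine with `asyncio.gather`. `StubAsyncEngine` and `load_next_batch` are hypothetical stand-ins so the snippet runs without a model; the stub only mirrors the list-in, list-out shape used in the cell above.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in with an awaitable async_generate()."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.05)  # Simulate inference latency.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def load_next_batch():
    """Hypothetical side task, e.g. reading the next prompts from storage."""
    await asyncio.sleep(0.05)
    return ["The future of AI is"]


async def generate_while_loading(engine, prompts, sampling_params):
    # gather() runs both awaitables concurrently and returns results in argument order.
    outputs, next_batch = await asyncio.gather(
        engine.async_generate(prompts, sampling_params),
        load_next_batch(),
    )
    return outputs, next_batch


outputs, next_batch = asyncio.run(
    generate_while_loading(
        StubAsyncEngine(),
        ["Hello, my name is", "The capital of France is"],
        {"temperature": 0.8, "top_p": 0.95},
    )
)
print(len(outputs), next_batch)
```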

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Sora Kitano. I'm a 17-year-old high school student from Tokyo. I enjoy reading science fiction and taking long walks in the city.
Sora Kitano is the protagonist of an original Japanese novel. Sora lives with her family in Tokyo and attends a local high school. She loves science fiction novels and takes long walks around the city to clear her mind. The other students at school think she's a bit quiet and reserved, but Sora is actually a very observant person who notices many things that others overlook.
Sora is an average student, but she's very interested in science and technology. She often spends



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. French and English are both widely spoken in France. The Eiffel Tower is an iconic landmark located in Paris.
The above statement is factual and concise, providing a simple overview of France's capital city. To make it more engaging and detailed, the statement could be expanded to include more information about Paris, such as its history, culture, and attractions. For example:
"The capital of France is Paris, a city steeped in history and culture. French and English are both widely spoken in France, making it easy for tourists to navigate. The Eiffel Tower, an iconic landmark, is a must-see attraction in Paris



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  not yet certain, but several trends are emerging as technology continues to evolve. Here are some possible future trends in artificial intelligence:
1. Increased Adoption of Edge AI: With the proliferation of IoT devices, edge AI is becoming increasingly important. Edge AI enables AI to be processed on devices at the edge of the network, reducing latency and improving real-time processing.
2. Rise of Explainable AI (XAI): As AI becomes more pervasive, there is a growing need to understand how AI decisions are made. XAI aims to provide transparency and explainability into AI decision-making processes, which will be essential for building trust in AI systems.
3
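
The streaming cell prints each chunk as it arrives but does not keep the full text. If you need both, a small wrapper around the `async for` loop can forward chunks to a callback and return the concatenated result. The following is a sketch under stated assumptions: `stub_chunk_stream` is a hypothetical async generator standing in for a real chunk stream, and `stream_and_collect` is a hypothetical helper name.

```python
import asyncio


async def stub_chunk_stream(prompt):
    """Hypothetical stand-in for an async chunk stream; yields text pieces."""
    for piece in [" Paris", ".", " It", " is", " in", " France", "."]:
        await asyncio.sleep(0)  # Hand control back to the event loop between chunks.
        yield piece


async def stream_and_collect(chunk_stream, on_chunk=None):
    """Forward each chunk to on_chunk as it arrives and return the full text."""
    pieces = []
    async for chunk in chunk_stream:
        if on_chunk is not None:
            on_chunk(chunk)
        pieces.append(chunk)
    return "".join(pieces)


def show(chunk):
    # Print without a newline so the chunks read as continuous text.
    print(chunk, end="", flush=True)


full_text = asyncio.run(
    stream_and_collect(stub_chunk_stream("The capital of France is"), show)
)
print()
```

Collecting pieces in a list and joining once at the end avoids repeatedly rebuilding a growing string inside the loop.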




In [6]:
llm.shutdown()
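
In a script, it can help to guarantee that `shutdown()` runs even when generation or post-processing raises. A minimal sketch using a context manager follows. `StubEngine` and `managed_engine` are hypothetical names; the stub only mirrors the `generate` and `shutdown` calls used in this document.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing generate() and shutdown()."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown(), even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed_engine(engine) as active_engine:
        active_engine.generate(["Hello, my name is"], {"temperature": 0.8})
        raise RuntimeError("simulated failure during post-processing")
except RuntimeError:
    pass

print(engine.is_shut_down)
```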