# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## SPECIAL WARNING!!!!

**To launch the offline engine in a Python script, the launch code must sit under an** `if __name__ == "__main__":` **guard, because SGLang uses** `spawn` **mode to create subprocesses. Please refer to this simple example**:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py
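
As a rough illustration of the guard described above, the sketch below keeps all engine work inside a `main()` function that only runs when the file is executed as a script. With `spawn`, child processes re-import the main module, so unguarded top-level launch code would run again in every child. The `sgl.Engine`, `generate`, and `shutdown` calls mirror the ones used in the cells of this document; the `find_spec` availability check is an addition of this sketch (an assumption, not part of SGLang) so that it exits cleanly on machines without SGLang.

```python
import importlib.util


def main() -> None:
    # Assumption of this sketch: skip gracefully when SGLang is not installed.
    if importlib.util.find_spec("sglang") is None:
        print("sglang is not installed; skipping engine launch")
        return

    import sglang as sgl

    # Same constructor call as in the notebook cells of this document.
    llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
    try:
        prompts = ["The capital of France is"]
        sampling_params = {"temperature": 0.8, "top_p": 0.95}
        for output in llm.generate(prompts, sampling_params):
            print(output["text"])
    finally:
        # Release the engine's subprocesses even if generation fails.
        llm.shutdown()


# Without this guard, subprocesses created in spawn mode would re-run
# the launch code when they import this module.
if __name__ == "__main__":
    main()
```

The linked `launch_engine.py` remains the reference example; this sketch only shows the shape of the guard.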

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.24it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Jane Smith, but my friends call me Smitty. I am a proud member of the Fort Worth Historical Society and the Fort Worth Stockyards National Historic District. I am an 8th generation Texan and have lived in Fort Worth all of my life. My family has been ranching in the area since 1870 and we have a long history of cattle ranching and farming.
I have always been fascinated with the history of our great city and the surrounding areas. I have spent countless hours researching and learning about the history of Fort Worth and the Stockyards. I have also been involved in various historical reenactments and living
Prompt: The president of the United States is
Generated text:  the leader of the country and the commander-in-chief of its armed forces. The president is elected by the people through the Electoral College, and serves a four-year term. The president is responsible for executing the laws of the land, commanding the military, and serving as the 

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor living in Tokyo. I enjoy reading, trying new foods, and practicing yoga. I'm currently working on a novel and experimenting with different writing styles. I'm a bit of a introvert, but I'm always up for a good conversation.
This self-introduction is neutral because it doesn't reveal any personal biases or opinions. It simply states the character's name, age, occupation, and interests. It also mentions a current project, which can give insight into the character's personality and goals. The tone is friendly and approachable, making it suitable for a character

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and cuisine. Paris is home to many famous landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. The city has a population of over 2.1 million people and is a major hub for business, culture, and tourism.
The best answer is: The capital of France is Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, from diagnosing diseases to developing personalized treatment plans. AI-powered chatbots and virtual assistants may become more common in healthcare settings, helping patients navigate the healthcare system and providing support for patients with chronic conditions.
2. Widespread adoption of AI in education: AI is likely to transform the education sector, from personalized learning to automated grading. AI-powered adaptive learning systems may become more prevalent, allowing students



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Astrid Larsen. I'm a 20-year-old student majoring in environmental science at a university in Portland, Oregon. I'm interested in sustainability and renewable energy. Outside of academics, I enjoy hiking and practicing yoga. That's me. I don't have any pets or significant relationships. I live alone in a small apartment near campus. I'm easy-going and like to keep things simple. That's it.
This self-introduction is neutral because it doesn't reveal any particularly exciting or unusual aspects of the character's personality or background. It doesn't even hint at any flaws or contradictions. It simply presents Astrid as a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
France is a country with a rich history and culture, and its capital city is a testament to this. Paris, the capital of France, is a city
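
One reason to prefer the asynchronous API is that several requests can be awaited concurrently from the same event loop. The sketch below shows that pattern with `asyncio.gather`. To stay runnable without a model, it uses `fake_async_generate`, a hypothetical stand-in for `await llm.async_generate(prompt, sampling_params)`, and it assumes each result is a dict with a `"text"` key, as in the outputs above.

```python
import asyncio


async def fake_async_generate(prompt: str, sampling_params: dict) -> dict:
    # Hypothetical stand-in for `await llm.async_generate(prompt, sampling_params)`.
    await asyncio.sleep(0.01)
    return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(prompts: list[str], sampling_params: dict) -> list[dict]:
    # Start one coroutine per prompt; gather returns results in input order.
    tasks = [fake_async_generate(prompt, sampling_params) for prompt in prompts]
    return await asyncio.gather(*tasks)


prompts = ["Hello, my name is", "The capital of France is"]
results = asyncio.run(generate_concurrently(prompts, {"temperature": 0.8, "top_p": 0.95}))

for prompt, result in zip(prompts, results):
    print(f"Prompt: {prompt}\nGenerated text: {result['text']}")
```

When all prompts are known up front, passing the whole list to a single `async_generate` call, as in the cell above, is the simpler option; the gather pattern is more relevant when requests arrive from independent tasks.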

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # async_stream_and_merge streams the response and yields each chunk with overlapping text removed
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Helen. I'm 25 years old, and I work as a data analyst in a small marketing firm. I enjoy hiking and trying new restaurants in my free time.
How to Write a Great Self-Introduction

When writing a self-introduction, keep it brief and to the point. Aim for a length of one or two paragraphs at most.
Focus on your professional experience and skills, as well as any personal qualities that are relevant to your work or goals.
Use a formal tone, but make sure it still sounds natural and authentic.
Here are some tips to help you write a great self-introduction:
Keep it concise: Aim for



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
When was the first prototype of the Eiffel Tower built? The first prototype of the Eiffel Tower was built in 1884.
What does the name ‘Eiffel’ refer to? The name of the Eiffel Tower comes from its designer, Gustave Eiffel.
The Eiffel Tower was originally meant to be temporary. What was its intended purpose in 1889? The Eiffel Tower was built as the entrance arch for the 1889 World’s Fair.
What was the height of the original Eiffel Tower? The original Eiffel Tower stood at a height of



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  vast and unpredictable. Here are some potential future trends:
1.  Increased Integration with the Internet of Things (IoT): As the IoT continues to expand, we can expect AI to play a larger role in managing and analyzing the vast amounts of data generated by connected devices.
2.  Advancements in Natural Language Processing (NLP): NLP will become increasingly sophisticated, enabling AI systems to better understand and respond to human language, potentially leading to more intuitive and user-friendly interfaces.
3.  Rise of Explainable AI (XAI): As AI becomes more pervasive, there will be a growing need for transparency and



In [6]:
llm.shutdown()