# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a separate server would only add complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation
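
The four modes differ along two axes: whether the call blocks or is awaited, and whether the result arrives whole or chunk by chunk. The sketch below shows only the calling shape of each mode. `StubEngine` is a hypothetical stand-in so the snippet runs without a model; its `generate`/`async_generate` methods and the `stream` flag are assumptions modeled on how the engine is used in this document, not a description of SGLang's internals.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for an inference engine (not SGLang code)."""

    def _chunks(self, prompt):
        # Pretend the model emits three fixed chunks for any prompt.
        return [" alpha", " beta", " gamma"]

    def generate(self, prompt, sampling_params=None, stream=False):
        chunks = self._chunks(prompt)
        if stream:
            # Streaming: hand back an iterator of partial results.
            return ({"text": c} for c in chunks)
        # Non-streaming: hand back the complete result at once.
        return {"text": "".join(chunks)}

    async def async_generate(self, prompt, sampling_params=None, stream=False):
        chunks = self._chunks(prompt)
        if stream:
            async def gen():
                for c in chunks:
                    await asyncio.sleep(0)  # yield control between chunks
                    yield {"text": c}
            return gen()
        await asyncio.sleep(0)
        return {"text": "".join(chunks)}


engine = StubEngine()

# 1. Non-streaming synchronous: one blocking call, one full result.
sync_full = engine.generate("hi")["text"]

# 2. Streaming synchronous: iterate over chunks as they arrive.
sync_streamed = "".join(c["text"] for c in engine.generate("hi", stream=True))


async def run_async_modes():
    # 3. Non-streaming asynchronous: await the full result.
    full = (await engine.async_generate("hi"))["text"]
    # 4. Streaming asynchronous: await the generator, then use async for.
    streamed = ""
    async for c in await engine.async_generate("hi", stream=True):
        streamed += c["text"]
    return full, streamed


async_full, async_streamed = asyncio.run(run_async_modes())
print(sync_full, sync_streamed, async_full, async_streamed, sep="|")
```

The stub emits independent chunks, so plain concatenation is enough here. Chunks from a real engine may repeat earlier text, which is why the streaming cells in this document use overlap-aware helpers.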

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Rebecca Hall and I am thrilled to be your tour guide today. I have lived in Savannah for most of my life and I love sharing my knowledge of this beautiful city with visitors. As we walk through the historic district, I will point out some of the city's most charming landmarks, share stories about the people who lived here, and give you a glimpse into the city's rich history.
We will start at the iconic Forsyth Park fountain, a popular spot for photos and a must-see for any visitor to Savannah. From there, we will make our way through the picturesque streets and alleys of the historic district, stopping at some
Prompt: The president of the United States is
Generated text:  also known as the head of the executive branch. The president is also the commander in chief of the United States Armed Forces. The president also has the power to issue executive orders. The president also has the power to appoint federal judges and ambassadors. The presiden
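
In an offline batch job, the generated text is usually saved rather than printed. The following is a minimal sketch of one way to do that, assuming only what the cell above shows: the batch call returns one dict per prompt, each with a `"text"` key. The helper name `save_batch_results`, the record field names, and the stand-in data are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_batch_results(prompts, outputs, path):
    """Write one JSON object per line pairing each prompt with its text."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    with open(path, "w", encoding="utf-8") as f:
        for prompt, output in zip(prompts, outputs):
            record = {"prompt": prompt, "completion": output["text"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Stand-in data shaped like the batch result above (a list of dicts with "text").
prompts = ["The capital of France is", "The future of AI is"]
outputs = [{"text": " Paris."}, {"text": " uncertain."}]

out_path = Path(tempfile.mkdtemp()) / "results.jsonl"
save_batch_results(prompts, outputs, out_path)

# Read the file back to confirm each line is a standalone JSON record.
lines = out_path.read_text(encoding="utf-8").splitlines()
records = [json.loads(line) for line in lines]
print(records[0])
```

JSON Lines keeps each result independent, so a partially written file is still readable up to the last complete line.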

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor living in Tokyo. I enjoy reading, hiking, and trying new foods. I'm currently working on a novel and experimenting with different writing styles. I'm looking forward to meeting new people and learning from their experiences.
This is a good example of a neutral self-introduction because it:
Provides basic information about the character, such as their name, age, and occupation.
Mentions their interests and hobbies, which can help to create a sense of personality and character.
Avoids making any bold or attention-seeking statements.
Does not reveal too much about the character's personal life

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city. The capital of France is Paris.
The capital of France is Paris.
The capital of France is Paris.
The capital of France is Paris.
The capital of France is Paris.
The capital of France is Paris.
The capital of France is Paris.
The capital of France is Paris.
The capital of France is Paris.
The capital of France is Paris.
The capital of France is Paris.
The capital of France is Paris.
The capital of France is Paris.
The capital of France is Paris.
The capital of France is Paris.
The capital of France is Paris.
The capital of

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in artificial intelligence:
1. Increased Adoption of AI in Everyday Life: AI is likely to become more ubiquitous in our daily lives, with applications in areas such as healthcare, finance, transportation, and education.
2. Advancements in Machine Learning: Machine learning, a subset of AI, is expected to continue to improve, enabling AI systems to learn from data and adapt to new situations.
3. Rise of Explainable AI: As AI becomes more pervasive, there will be a growing need to understand how AI systems make
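
The cell above relies on `stream_and_merge` to remove overlap between streamed chunks. As an illustration of the general idea, and not of SGLang's actual implementation, overlap removal can be sketched as dropping the longest prefix of each new chunk that already ends the merged text. The function names and the sample chunks below are hypothetical.

```python
def strip_repeated_prefix(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first and stop at the first match.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, removing any overlap with the text merged so far."""
    merged = ""
    for chunk in chunks:
        merged += strip_repeated_prefix(merged, chunk)
    return merged


# Hypothetical stream in which each chunk repeats the tail of the previous one.
chunks = [" Paris", " Paris is", " is the capital", " capital of France."]
print(merge_chunks(chunks))
```

A suffix/prefix heuristic like this cannot tell overlap from genuine repetition: merging `"ha"` followed by `"ha"` yields `"ha"`, so it suits streams known to resend earlier text rather than arbitrary token deltas.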



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Finnley and I am a historian of sorts, focusing on the intricacies of forgotten events. I spend most of my days researching and recording tales from the past, hoping to shed light on overlooked stories and experiences.
I'm a quiet and observant person who values knowledge and insight. I often find myself lost in thought, pondering the complexities of the world and the people within it. My work is a reflection of my curiosity and dedication to understanding the world around me.
I'm not one for grand speeches or flashy displays, preferring to keep a low profile and let my work speak for itself. I'm more interested in uncovering the

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is also the country’s largest city.
The statement is 1 sentence.
The answer is: The capital of France is Paris, which is 
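
The practical benefit of the asynchronous call is that other coroutines can make progress while generation is awaited. The sketch below illustrates this with `asyncio.gather`. `fake_async_generate` is a hypothetical stand-in that only sleeps, so the snippet runs without a model; it mirrors the return shape used above (a list of dicts with a `"text"` key) and nothing more.

```python
import asyncio


async def fake_async_generate(prompts, delay=0.05):
    """Hypothetical stand-in for an awaitable batch-generation call."""
    await asyncio.sleep(delay)  # simulates time spent waiting on the engine
    return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks, interval=0.01):
    """Unrelated work that keeps running while generation is awaited."""
    count = 0
    for _ in range(ticks):
        await asyncio.sleep(interval)
        count += 1
    return count


async def main():
    prompts = ["Hello, my name is", "The capital of France is"]
    # gather schedules both coroutines on the same event loop.
    outputs, beats = await asyncio.gather(
        fake_async_generate(prompts),
        heartbeat(ticks=3),
    )
    return outputs, beats


outputs, beats = asyncio.run(main())
print(len(outputs), beats)
```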

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Jaxon Lee. I'm a 22-year-old artist and writer living in a small town surrounded by nature. I enjoy hiking and reading in my free time. I'm currently working on a novel and experimenting with different mediums in my art studio. That's a bit about me. What do you think?
This introduction provides a clear and concise overview of the character's identity, interests, and current pursuits. It establishes a neutral tone, giving the reader a sense of who Jaxon is without revealing too much about his personality or backstory. The language is straightforward and easy to follow, making it accessible to a wide range of readers. The

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is located in the northern part of France and is situated in the middle of the Île-de-France region. It is home to over 2.1 million people, making it one of the most populous cities in the European Union. The city is known for its iconic landmarks, such as the Eiffel Tower and the Louvre Museum, and is a major cultural, artistic, and economic center. Paris has a rich history dating back to the 3rd century BC and has played a significant role in shaping European culture and politics. Source: Encyclopedia Britannica, Wikipedia. ## Step 1: Identify the capital

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain, and there are many different predictions and trends that experts are making about what AI will look like in the future. Some potential trends in AI that could be significant in the future include:
The development of more sophisticated machine learning algorithms that can learn and improve on their own, potentially leading to more efficient and effective decision-making.
The integration of AI with other technologies, such as robotics, the Internet of Things (IoT), and blockchain, to create more powerful and autonomous systems.
The development of more human-like AI systems that can understand and interact with humans in a more natural way, potentially leading to more widespread adoption of AI in areas such




In [6]:
llm.shutdown()
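
In a script, it is worth making sure `shutdown()` runs even when generation raises. One way to do that, sketched below, is a small context manager built with `contextlib.contextmanager`. `StubEngine` and `managed_engine` are hypothetical names introduced for illustration; the only assumption about the real engine is that it exposes the `shutdown()` method called above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing generate() and shutdown()."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompt):
        if prompt is None:
            raise ValueError("prompt must not be None")
        return {"text": " ok"}

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed_engine(engine) as llm:
        llm.generate(None)  # raises, but shutdown() still runs
except ValueError:
    pass

print(engine.is_shut_down)
```

A plain `try`/`finally` around the generation code achieves the same effect; the context manager just keeps the cleanup next to the place where the engine is created.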