# Offline Engine API

SGLang provides a direct inference engine that does not require an HTTP server. This suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Special Warning

**To launch the offline engine in a Python script, place the launch code under an `if __name__ == "__main__":` guard, because the engine uses `spawn` mode to create subprocesses. See this simple example:**

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py
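
The reason for the guard can be illustrated without SGLang. With the `spawn` start method, each child process starts a fresh interpreter and re-imports the launching script, so unguarded top-level code would run again inside the child. The sketch below uses only the standard `multiprocessing` module. The names `worker` and `launch` are hypothetical stand-ins, and a real script would create `sgl.Engine` where `launch()` is called.

```python
import multiprocessing as mp


def worker():
    """Stand-in for the work a subprocess would do (hypothetical)."""
    message = f"running in {mp.current_process().name}"
    print(message)
    return message


def launch():
    """Start one child with the spawn start method and wait for it to exit."""
    ctx = mp.get_context("spawn")
    proc = ctx.Process(target=worker, name="child-1")
    proc.start()
    proc.join()
    return proc.exitcode


# Without this guard, the child would call launch() again while
# re-importing this file, and multiprocessing would raise a RuntimeError.
if __name__ == "__main__":
    print("child exit code:", launch())
```

Keeping everything that starts processes inside the guarded block means the module can be imported safely by the children.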

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.25it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Maria Magdalena, and I am an expert in History of Art from the University of Barcelona. I have a degree in Art History and History of Art and have a wide range of knowledge in the field.
I am very passionate about art and history, and I love sharing my knowledge with others. I am a patient and enthusiastic teacher who will make sure you understand the concepts and ideas behind the artworks we explore together.
My experience as a teacher spans over 10 years, during which I have taught students of all ages and levels. I have also worked as a researcher and have published several articles on art history.
I am confident that my
Prompt: The president of the United States is
Generated text:  the head of state and head of government of the United States. The president is elected through the Electoral College system, where each state is allocated a certain number of electoral votes based on its population. The president serves a four-year term and is 

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor. I live in a small apartment in the city with my cat, Luna. I enjoy reading, hiking, and trying out new restaurants. I'm currently working on a novel and trying to learn more about the world through travel and exploration. That's me in a nutshell. What do you think? Is it too long or too short? Should I add or remove anything?
Your self-introduction is concise and to the point. It provides a good overview of who you are and what you do. However, it may be a bit too brief for some readers. Consider

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris. 
This statement is a concise factual statement about France’s capital city. It provides a clear and direct answer to the question, without any additional information or context. The statement is also grammatically correct and easy to understand. 
Note: This response is a simple and direct answer to the question, as requested. However, if you would like me to provide more information or context about Paris, I can do so in a separate response. Let me know! 
Here is a longer response with more information about Paris:
Paris is the capital and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems can analyze medical images, identify patterns in patient data, and provide personalized treatment recommendations.
2. Rise of explainable AI: As AI becomes more pervasive, there is a growing need for transparency and explainability in AI decision-making. Explainable AI (XAI) aims to provide insights into how AI models make decisions, enabling
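
The "overlap removal" mentioned in the cell above can be pictured with a small standalone sketch. It assumes that consecutive streamed chunks may repeat text that was already received, and it keeps only the new suffix of each chunk. The functions `drop_overlap` and `merge_chunks` are hypothetical names for illustration. This is not the implementation of `stream_and_merge`.

```python
def drop_overlap(existing_text: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the end of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


# Hypothetical chunks whose beginnings repeat earlier text.
chunks = [" Paris", " Paris is the", " is the capital"]
print(merge_chunks(chunks))
```

A suffix/prefix match like this is a heuristic: text that legitimately repeats across a chunk boundary would also be dropped, so it only fits streams where repeats are known to be artifacts.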



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Alexander “Lex” Jensen. I’m a 25-year-old journalist who recently moved to the city to work for a local newspaper. I enjoy hiking and reading in my free time, and I’m still getting used to the fast pace of urban life. I have a younger sister who’s in college and a cat named Luna who keeps me company at home. That’s me in a nutshell. I’m looking forward to seeing what the future holds.
What is the narrator's personality?
The narrator seems to be a reserved, possibly introverted person who is used to keeping to himself. He mentions that he enjoys reading and hiking, which suggests

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Located in the north-central part of the country, it is the country’s largest city, with a population of around 2.1 million people in the city proper and over 12 million in the me
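
One reason to prefer an awaitable call is that it can run alongside other coroutines in the same event loop. The sketch below shows that pattern with `asyncio.gather`. `fake_async_generate` is a hypothetical stand-in that sleeps instead of calling a model, so the snippet runs without SGLang. Only the `{"text": ...}` result shape is borrowed from the cells above.

```python
import asyncio


async def fake_async_generate(prompt, delay):
    """Hypothetical stand-in for an awaitable generation call."""
    await asyncio.sleep(delay)  # simulates time spent generating
    return {"text": f" <completion for: {prompt}>"}


async def run_requests():
    # gather() runs both coroutines concurrently and returns the
    # results in argument order, regardless of which finishes first.
    return await asyncio.gather(
        fake_async_generate("Hello, my name is", 0.02),
        fake_async_generate("The capital of France is", 0.01),
    )


results = asyncio.run(run_requests())
for result in results:
    print(result["text"])
```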

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Samantha. I'm a 25-year-old freelance writer living in Brooklyn. I have a cat named Luna and enjoy hiking, reading, and trying out new restaurants. I'm currently working on a novel and experimenting with different writing styles. How might you improve this introduction? One approach is to add more depth or nuance to the character's personality. Here's an example: I'm Samantha, a 25-year-old freelance writer with a passion for storytelling and a penchant for getting lost in the city. My cat, Luna, is my trusty sidekick, and we love to explore the outdoors together. When I'm not writing,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in France and is located in the northern region of the country. Paris is known for its cultural and artistic heritage and is home to many famous landmarks, such as the Eiffel Tower and Notre-Dame Cathedral. The city is a major economic and financial hub, with many multinational companies headquartered there. Paris is also a popular tourist destination, attracting millions of visitors each year.
The capital of France is Paris. It is the largest city in France and is located in the northern region of the country. Paris is known for its cultural and artistic heritage and is home to many famous landmarks, such as the E

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  not just about automation, but also about augmentation.
Artificial intelligence (AI) is advancing rapidly, and its impact on various industries and aspects of life is expected to be significant. Here are some possible future trends in AI:

1.  **Hyper-Personalization**: AI will enable businesses to offer highly personalized experiences to their customers, taking into account their preferences, behaviors, and emotions. This will lead to increased customer satisfaction and loyalty.
2.  **Edge AI**: As the number of IoT devices grows, AI will be used to process data closer to the source, reducing latency and improving efficiency. This will enable real-time decision

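The consumption pattern in the streaming cell above, an `async for` loop that prints each piece as it arrives, can be tried without a model. In the sketch below, `fake_token_stream` is a hypothetical async generator that yields pre-split text pieces, and `collect` is an illustrative helper. Neither is part of SGLang.

```python
import asyncio


async def fake_token_stream(tokens):
    """Hypothetical stand-in for a streaming generator that yields text pieces."""
    for token in tokens:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield token


async def collect(stream):
    """Print each piece as it arrives and return the full text."""
    pieces = []
    async for piece in stream:
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()  # newline once the stream is exhausted
    return "".join(pieces)


text = asyncio.run(collect(fake_token_stream([" Paris", ".", " It", " is"])))
```

Printing with `end=""` and `flush=True` keeps the pieces on one line and makes them visible immediately instead of waiting for the output buffer to fill.
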

In [6]:
llm.shutdown()