# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation
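As a rough map of these four modes to call shapes, the sketch below uses a hypothetical `StubEngine` in place of `sgl.Engine` so that it runs without a GPU. The stub mirrors the `generate` / `async_generate` methods and the `stream` flag as they are used later on this page; the cumulative chunk contents are an assumption made for illustration, not captured engine output.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; not part of SGLang."""

    def _chunks(self, prompt):
        # Assumed shape: each streamed chunk is a dict with a "text" key.
        return [{"text": " a"}, {"text": " a b"}, {"text": " a b c"}]

    def generate(self, prompt, sampling_params=None, stream=False):
        chunks = self._chunks(prompt)
        return iter(chunks) if stream else chunks[-1]

    async def async_generate(self, prompt, sampling_params=None, stream=False):
        chunks = self._chunks(prompt)
        if not stream:
            return chunks[-1]

        async def gen():
            for chunk in chunks:
                yield chunk

        return gen()


def sync_non_streaming(engine, prompt):
    # One blocking call, one finished result.
    return engine.generate(prompt)["text"]


def sync_streaming(engine, prompt):
    # Blocking iteration over chunks as they arrive.
    return [chunk["text"] for chunk in engine.generate(prompt, stream=True)]


async def async_non_streaming(engine, prompt):
    # Awaitable call, one finished result.
    return (await engine.async_generate(prompt))["text"]


async def async_streaming(engine, prompt):
    # Awaiting the call yields an async iterator of chunks.
    stream = await engine.async_generate(prompt, stream=True)
    return [chunk["text"] async for chunk in stream]
```

With a real engine, the same four functions would receive the `llm` object created below instead of `StubEngine()`.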

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

INFO 03-24 04:54:11 __init__.py:190] Automatically detected platform cuda.
INFO 03-24 04:54:29 __init__.py:190] Automatically detected platform cuda.
INFO 03-24 04:54:29 __init__.py:190] Automatically detected platform cuda.
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.00it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.62it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.30it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.14it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.19it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Femi and I am a software engineer. I am passionate about creating innovative solutions using technology. My main areas of interest include machine learning, web development, and mobile app development. I love to learn new technologies and share my knowledge with others.
What do you think about the use of machine learning in mobile apps?
I think it has the potential to revolutionize the way we interact with mobile apps. With the ability to analyze user behavior and preferences, machine learning can enable apps to provide more personalized experiences, improve performance, and reduce battery consumption.
How do you approach the design and development of a mobile app?
My approach involves understanding the problem
Prompt: The president of the United States is
Generated text:  the head of the executive branch and is both the head of state and the head of government. He or she is directly elected by the people through the Electoral College system.
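Offline batch jobs usually persist their results. The following is a minimal sketch, assuming only that each output is a dict with a `text` key as in the loop above; the helper name `to_jsonl` and the record layout are illustrative choices, not part of SGLang.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as JSON Lines (one object per line)."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        # ensure_ascii=False keeps non-ASCII characters readable in the file.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"
```

For example, `pathlib.Path("generations.jsonl").write_text(to_jsonl(prompts, outputs), encoding="utf-8")` would write one line per prompt.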


### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and artist living in a small town in the Pacific Northwest. I enjoy hiking, reading, and trying out new recipes in my spare time. I'm a bit of a introvert, but I'm always up for a good conversation when the mood strikes me. I'm currently working on a novel and a collection of short stories, and I'm excited to see where my creative projects take me. What do you think? Is this a good self-introduction for a fictional character?
This is a good start, but it could be improved. Here are a few suggestions:
1.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country, along the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city has a population of over 2.1 million people and is a major hub for international business, finance, and tourism. Paris is also known for its romantic atmosphere, with its charming streets, cafes, and parks. The city has a long history dating back to the 3rd century BC

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Some experts predict that AI will become increasingly integrated into our daily lives, while others warn of the potential risks and challenges associated with its development. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with applications such as predictive analytics, robotic surgery, and personalized medicine.
2. Widespread adoption of AI in industries: AI is already being used in various industries such as
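The "overlap removal" mentioned in the cell above can be sketched as follows. This is an illustration of the idea, assuming that streamed chunks may repeat text that was already emitted; `strip_overlap` and `merge_stream` are hypothetical helpers and not necessarily how `stream_and_merge` is implemented.

```python
def strip_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first so repeated text is removed fully.
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks):
    """Concatenate chunk texts, appending only the part that is new each time."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk["text"])
    return merged


# Cumulative snapshots (assumed for illustration) collapse to a single string.
snapshots = [{"text": " Paris"}, {"text": " Paris."}, {"text": " Paris. It"}]
print(repr(merge_stream(snapshots)))
```

One limitation of this approach: a genuinely new chunk that happens to begin with the text the output already ends with (for example `"ha"` followed by `"ha"`) is treated as overlap and dropped.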



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Dr. Lilith Brightstone, and I'm a geologist studying the paleoecology of the Appalachian Mountains. I specialize in analyzing fossil records to understand the region's climate history and how ancient ecosystems have evolved over time. I'm currently working on a research project to investigate the effects of Pleistocene glaciations on the mountain's flora and fauna. Outside of work, I enjoy hiking, rock climbing, and experimenting with natural dyeing techniques using local plant materials.
Choose a specific time period and location for this character to explore. The Great Dismal Swamp in North Carolina during the Pleistocene era (about 20,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  located in the Île-de-France region. This region is also known as the Paris metropolitan area. The city of Paris is home to f
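Because the asynchronous call is awaited, it can share the event loop with other coroutines. The sketch below shows that pattern with `asyncio.gather`; `fake_async_generate` is a hypothetical stand-in for `llm.async_generate` so the example runs without a model, and `log_progress` is an arbitrary placeholder for unrelated work.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params=None):
    """Hypothetical stand-in for llm.async_generate: one dict per prompt."""
    await asyncio.sleep(0.01)  # pretend the engine is busy
    return [{"text": f" (completion for: {p})"} for p in prompts]


async def log_progress(ticks, interval_s=0.005):
    """Unrelated coroutine that keeps running while generation is awaited."""
    seen = 0
    for _ in range(ticks):
        await asyncio.sleep(interval_s)
        seen += 1
    return seen


async def run_batch(prompts, sampling_params=None):
    # gather() schedules both coroutines on the same event loop and
    # returns their results in argument order.
    outputs, ticks = await asyncio.gather(
        fake_async_generate(prompts, sampling_params),
        log_progress(ticks=3),
    )
    return outputs, ticks
```

Calling `asyncio.run(run_batch(prompts))` returns the list of outputs together with the tick count once both coroutines have finished.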

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling llm.async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Zara "Zee" Thompson. I'm a twenty-year-old, non-binary artist who specializes in mixed media sculpture. I've recently moved to Brooklyn, New York, and I'm excited to immerse myself in the city's diverse artistic community. My work often explores the intersection of technology and nature, and I'm eager to learn from others who share similar passions. I'm looking forward to meeting new people and discovering opportunities to grow as an artist. Zara "Zee" Thompson, nice to meet you. I'll use this as a starting point and adjust as needed.
Zara "Zee" Thompson's self-int



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is located in the north-central part of the country at the point where the Seine River flows into the River Marne. The city has a population of over 2.1 million people. The city is known for its historic buildings, art museums, and fashion.
The major industries in Paris are finance, fashion, and tourism. The city has a high standard of living and a strong economy. It is also known for its rich cultural heritage, including famous landmarks such as the Eiffel Tower and the Louvre Museum.
In addition to its cultural attractions, Paris is also a hub for education and research, with several



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain but there are many possible trends that could shape its development.
Artificial intelligence (AI) is a rapidly evolving field that has been advancing at an incredible pace in recent years. As AI continues to improve and become more pervasive in our lives, several trends are likely to shape its future development. Here are some possible future trends in AI:
1. Increased focus on explainability and transparency: As AI becomes more ubiquitous, there will be a growing need for AI systems to be transparent and explainable. This means that AI systems will need to provide clear explanations for their decisions and actions, which will be essential for building trust and accountability.
2




In [6]:
llm.shutdown()