# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
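
As a rough illustration of the custom-server idea, the sketch below wraps an engine in a small async request handler. It is not the linked `custom_server.py`: `EchoEngine`, `handle_request`, and the JSON payload keys are hypothetical names chosen for this example. The only engine behavior assumed is an awaitable `async_generate(prompts, sampling_params)` that returns one dict with a `"text"` key per prompt, matching the cells later in this document.

```python
import asyncio
import json


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    async def async_generate(self, prompts, sampling_params):
        # Mimic the output shape used in this document:
        # one dict with a "text" key per prompt.
        return [{"text": f" (echo of: {p})"} for p in prompts]


async def handle_request(engine, raw_body: str) -> str:
    """Parse a JSON request, run one generation, and return a JSON response."""
    try:
        payload = json.loads(raw_body)
        prompt = payload["prompt"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return json.dumps({"error": "body must be JSON with a 'prompt' field"})

    # Fall back to the sampling parameters used elsewhere in this document.
    sampling_params = payload.get(
        "sampling_params", {"temperature": 0.8, "top_p": 0.95}
    )
    outputs = await engine.async_generate([prompt], sampling_params)
    return json.dumps({"text": outputs[0]["text"]})


if __name__ == "__main__":
    body = json.dumps({"prompt": "The capital of France is"})
    print(asyncio.run(handle_request(EchoEngine(), body)))
```

In a real deployment, `handle_request` would be called from whatever web framework or RPC layer you choose, with a real engine passed in place of `EchoEngine`.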



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:02<00:07,  2.55s/it]


Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:05<00:05,  2.84s/it]


Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:08<00:02,  2.83s/it]


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:09<00:00,  2.09s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:09<00:00,  2.34s/it]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Sydney and I am a senior at St. Mary's High School. I am super excited to be interning at the San Francisco Chronicle this summer! I am a bit nervous, but I am confident that this internship will be an amazing opportunity to learn and grow as a young journalist. I am a bit of a news junkie, and I love staying up to date on current events. I enjoy writing, photography, and reporting, and I hope to one day become a journalist. I am also super passionate about social justice and advocating for marginalized communities. I am looking forward to working with such an esteemed newspaper and contributing my skills and perspectives
Prompt: The president of the United States is
Generated text:  constitutionally mandated to report to Congress on the state of the union at the start of each year. The report is often seen as an opportunity for the president to present their vision and priorities to the legislative branch, and to make a case for their policie

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor. I live in a small apartment in the city with my cat, Luna. I enjoy reading, hiking, and trying out new recipes in my free time. I'm a bit of a introvert, but I love connecting with others through my writing. That's me in a nutshell. What do you think? Is it too short or too long? Should I add anything else?
Your self-introduction is concise and to the point. It provides a good overview of who you are and what you do. However, it may benefit from a bit more depth and personality

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also known for its romantic atmosphere and is often referred to as the City of Light. Paris is a popular tourist destination and is considered one of the most beautiful and culturally significant cities in the world. Paris is the seat of the French government and is home to many international organizations, including

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Some experts predict that AI will become increasingly integrated into our daily lives, while others warn of the potential risks and challenges associated with its development. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with the potential to revolutionize the way we diagnose and treat diseases.
2. Rise of autonomous vehicles: Autonomous vehicles are already being tested on public roads, and it's likely
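
The cell above relies on `stream_and_merge` from `sglang.utils`, whose implementation is not shown in this document. The sketch below illustrates one way overlap removal could work, under the assumption that each streamed chunk is a dict with a `"text"` key and that a chunk may repeat text that was already received. `strip_overlap` and `merge_chunks` are hypothetical names for this example, not the library's API.

```python
def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks while skipping repeated text."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk["text"])
    return merged


if __name__ == "__main__":
    # Simulated stream in which later chunks repeat part of earlier ones.
    simulated = [
        {"text": " Paris"},
        {"text": " Paris is"},
        {"text": " is the capital"},
    ]
    print(repr(merge_chunks(simulated)))  # ' Paris is the capital'
```

One limitation of this approach is that it cannot tell a repeated chunk from text that legitimately repeats: two consecutive chunks of `"ha"` would merge to `"ha"`, not `"haha"`.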



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kiara Wilder, I’m a 28-year-old freelance writer and artist. I enjoy writing fiction and creating digital art in my free time. I live in a small apartment in a busy city, surrounded by the sounds and smells of urban life. I’m a bit of a night owl, often working late into the evening to meet deadlines and bring my creative projects to life. I’m always looking for new ways to express myself and explore my imagination. What do you think of my self-introduction? Does it sound like a believable character? How can it be improved? The self-introduction is quite short, so it may be

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in the country and is home to the Eiffel Tower and the Louvre Museum, among other famous landmarks. Paris is a major cultural and economic center, known for its 

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Aurora Wynter. I'm a student at the prestigious Silvermist Academy for the Arts, studying theatre. I enjoy writing, singing, and exploring the city of Ashwood where the academy is located.
Here's an updated version of the introduction with a few minor changes:
Hello, my name is Aurora Wynter. I'm a student at the prestigious Silvermist Academy for the Arts, where I'm pursuing a degree in theatre. When I'm not rehearsing for the next production, you can find me scribbling in my journal, belting out tunes, or wandering the cobblestone streets of Ashwood, discovering new



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is located in the northern part of the country and is situated on the Seine River. The city has a population of approximately 2.1 million people. Paris is a major tourist destination and is known for its famous landmarks, including the Eiffel Tower and Notre Dame Cathedral. The city is also home to many museums, art galleries, and other cultural institutions, including the Louvre Museum, which houses some of the world’s most famous artworks, such as the Mona Lisa. Paris is also a major center for education and research, with several universities and research institutions located in the city. The city has a rich history



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a topic of much speculation and debate. Some experts predict a utopian future where AI enhances human life in numerous ways, while others foresee a dystopian future where AI surpasses human intelligence and becomes a threat to humanity. Here are some possible future trends in AI:
1. Increased Adoption of Explainable AI (XAI): As AI becomes more pervasive, there is a growing need to understand how AI makes decisions. XAI aims to provide transparency and interpretability into AI models, making them more trustworthy and accountable.
2. Rise of Edge AI: With the increasing use of IoT devices, there is a growing need for AI to be processed




In [6]:
llm.shutdown()
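
In a standalone script, it can help to guarantee that `shutdown()` runs even when generation raises an exception. The sketch below shows one way to do that with a context manager. `engine_session` and `FakeEngine` are hypothetical names for this example; the only engine method assumed beyond `generate` is the `shutdown()` call shown above.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(make_engine):
    """Create an engine, yield it, and always call shutdown() afterwards."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs anywhere."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        # Return the output shape used in this document, with empty text.
        return [{"text": ""} for _ in prompts]

    def shutdown(self):
        self.closed = True


if __name__ == "__main__":
    # With a real engine, the factory could instead be
    # lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
    with engine_session(FakeEngine) as llm:
        outputs = llm.generate(["Hello, my name is"], {"temperature": 0.8})
    print(llm.closed)  # True
```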