# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Chris and I am the new chaplain at the Indianapolis Motor Speedway. I was hired as the Director of Spiritual Life and the Pastor of the Speedway Chapel.
I have a Bachelor’s degree in Music Ministry from Indiana Wesleyan University, a Master of Divinity degree from Indiana Wesleyan University, and a Doctor of Ministry degree from Wesley Theological Seminary.
I am excited to serve the Speedway and its community through spiritual care and support. I am available to provide guidance and support to anyone who may need it, regardless of their faith background or denomination.
I can be reached at 317-492-6370 or [ccrist@indy
Prompt: The president of the United States is
Generated text:  elected through the Electoral College system. Here's how it works: each state is allocated a certain number of electoral votes based on its population, with a minimum of three votes per state. Candidates from each party compete in the general election by campaigning a
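For larger offline jobs, one option is to split the prompt list into fixed-size batches and store each result next to its prompt. The sketch below assumes each output is a dict with a `text` key, as in the cell above. `fake_generate` is a hypothetical stand-in for `llm.generate` so the snippet runs without a GPU, and `chunked` and `run_in_batches` are illustrative helper names, not part of SGLang.

```python
import json
from typing import Callable, Dict, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start : start + size]


def fake_generate(prompts: List[str], sampling_params: Dict) -> List[Dict]:
    """Hypothetical stand-in for llm.generate: one dict with a 'text' key per prompt."""
    return [{"text": f" <completion for: {p}>"} for p in prompts]


def run_in_batches(
    generate: Callable[[List[str], Dict], List[Dict]],
    prompts: List[str],
    sampling_params: Dict,
    batch_size: int,
) -> List[Dict]:
    """Call `generate` once per batch and pair each prompt with its output text."""
    records = []
    for batch in chunked(prompts, batch_size):
        outputs = generate(batch, sampling_params)
        for prompt, output in zip(batch, outputs):
            records.append({"prompt": prompt, "text": output["text"]})
    return records


records = run_in_batches(
    fake_generate,
    ["Hello, my name is", "The capital of France is", "The future of AI is"],
    {"temperature": 0.8, "top_p": 0.95},
    batch_size=2,
)

# One JSON object per line is a convenient format for offline results.
jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```

With a real engine you would presumably pass `llm.generate` in place of `fake_generate`. Since the engine already schedules a single large batch efficiently, manual batching like this is mainly useful for saving partial results during long runs.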

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Elianore Quasar. I'm a 25-year-old astrophysicist with a passion for understanding the mysteries of the cosmos. I'm currently working on a project to develop a more efficient method for harnessing solar energy. When I'm not in the lab, you can find me reading about the latest breakthroughs in quantum mechanics or practicing my guitar. I'm a bit of a introvert, but I enjoy meeting new people and engaging in thought-provoking conversations. I'm excited to learn more about the world and its many wonders. This introduction is neutral because it doesn't reveal any personal biases or opinions, and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris.  Paris is the most populous city in France and the center of the Île-de-France region. It is one of the world's leading business and cultural centers and is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also a major hub for international diplomacy and is home to many international organizations, including the United Nations Educational, Scientific and Cultural Organization (UNESCO).  The city has a rich history dating back to the 3rd century BC and has

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems may be able to analyze medical images, identify patterns in patient data, and provide personalized treatment recommendations.
2. Widespread adoption of AI in industries: AI is expected to become more prevalent in various industries, including finance, transportation, and education. AI-powered systems may be able to automate tasks, improve efficiency, and enhance
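The "overlap removal" mentioned in the cell above can be pictured as follows: when a streamed chunk begins with text that has already been received, only the new suffix is appended. The sketch below is a minimal illustration of that idea under stated assumptions, not necessarily how `stream_and_merge` is implemented. The names `trim_overlap` and `merge_chunks` and the sample chunks are chosen for illustration.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of `new_chunk` that `existing_text` already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks whose beginnings repeat text that was already received.
chunks = [" Paris is", " is the capital", " capital of France."]
print(merge_chunks(chunks))
```

A purely textual heuristic like this cannot tell a re-sent fragment from a genuine repetition (for example `"ha"` followed by `"ha"`), so it is only appropriate when chunks are known to overlap.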



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Asher Blackwood. I'm a software engineer and a self-taught programmer. I have a bachelor's degree in computer science from the University of California, Berkeley, but most of my experience comes from working on personal projects and freelance gigs. I'm currently based in San Francisco. That's me in a nutshell.
Hi, I'm Elara Vex. I'm a 25-year-old freelance writer and editor, living in New York City. I've been writing for various online publications and have worked on several personal projects, including a novel and a few short stories. I'm a bit of a coffee snob and enjoy

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Briefly describe the significance of the location in relation to the country and the world. As the capital city, Paris is the seat of government and a symbol of French culture and histor
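The asynchronous interface is useful when generation has to be combined with other coroutines. As a hedged sketch, the snippet below issues one request per prompt and limits how many run at once with `asyncio.Semaphore`. `fake_async_generate` is a hypothetical stand-in for an awaitable call such as `llm.async_generate`, so the example runs without a model, and `generate_concurrently` and `max_in_flight` are illustrative names.

```python
import asyncio
from typing import Dict, List


async def fake_async_generate(prompt: str, sampling_params: Dict) -> Dict:
    """Hypothetical stand-in for an awaitable generation call."""
    await asyncio.sleep(0.01)  # simulate inference latency
    return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(
    prompts: List[str], sampling_params: Dict, max_in_flight: int = 2
) -> List[Dict]:
    """Run one request per prompt, with at most `max_in_flight` running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt: str) -> Dict:
        async with semaphore:
            return await fake_async_generate(prompt, sampling_params)

    # asyncio.gather returns results in input order, so outputs line up with prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


prompts = ["Hello, my name is", "The capital of France is"]
outputs = asyncio.run(generate_concurrently(prompts, {"temperature": 0.8}))
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Note that `asyncio.run` raises a `RuntimeError` when an event loop is already running, as is the case in a standard Jupyter kernel. In that setting, `await generate_concurrently(...)` directly in a cell is the usual alternative.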

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaida Saito. I'm a 20-year-old university student majoring in environmental science. I'm from Tokyo, Japan, but I've been living in the United States for the past three years. I'm interested in sustainability and conservation. I'm a bit of a nature lover and enjoy hiking and exploring the outdoors in my free time.
Here are some key points to keep in mind when writing a self-introduction:
1. Be concise: Keep your introduction short and to the point.
2. Be neutral: Avoid expressing strong opinions or biases.
3. Be informative: Provide relevant information about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Next, describe the main features of the city. The city is famous for the Eiffel Tower, a well-known landmark. Paris is also known for its art museums, including the Louvre, which is home to the Mona Lisa. The city’s architecture includes a mix of Gothic, Renaissance, and modern styles. The Seine River runs through the city, and it has many famous bridges and squares.
Finally, provide two key facts about the city. One key fact is that Paris has a significant history of artistic and intellectual movements, such as Impressionism and Surrealism. Another key fact is that Paris is a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a topic of much speculation and debate. While it is difficult to predict with certainty what the future holds, here are some possible trends in artificial intelligence that could shape the field in the coming years:
1. **Increased use of Explainable AI (XAI)**: As AI becomes more pervasive in decision-making processes, there is a growing need for transparency and accountability. XAI aims to make AI models more explainable, which will help build trust in AI systems and identify biases.
2. **Hybrid approaches**: AI systems will increasingly combine machine learning with traditional rule-based systems, human expertise, and other techniques to improve performance and decision-making
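When the streamed text is needed afterwards (for logging or post-processing), the chunks can be collected while they are printed. The sketch below shows the general `async for` pattern with a hypothetical async generator, `fake_stream`, standing in for a streaming call such as `async_stream_and_merge`. The chunk values are made up for illustration.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming call: yields text chunks one at a time."""
    for chunk in [" Paris", ".", "\nNext", ",", " describe"]:
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield chunk


async def collect(prompt: str) -> str:
    """Print chunks as they arrive and return the concatenated text."""
    parts: List[str] = []
    async for chunk in fake_stream(prompt):
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # newline after the stream ends
    return "".join(parts)


text = asyncio.run(collect("The capital of France is"))
```

Appending to a list and joining once at the end avoids repeated string concatenation on long generations.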




In [6]:
llm.shutdown()