# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## SPECIAL WARNING!!!!

**To launch the offline engine in your Python scripts, the** `if __name__ == "__main__":` **guard is required, because SGLang uses** `spawn` **mode to create subprocesses. Please refer to this simple example**:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py
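
The reason for the guard can be illustrated with the standard library alone. Under the `spawn` start method, each child process starts a fresh interpreter and re-imports the main module, so top-level code that is not guarded runs again in the child. The sketch below uses hypothetical `format_greeting` and `child_task` functions in place of the engine's subprocesses; SGLang's own process setup is more involved and is not shown here.

```python
import multiprocessing as mp


def format_greeting(name: str) -> str:
    """Pure helper so the child has something simple to run."""
    return f"hello from {name}"


def child_task(name: str) -> None:
    # Runs in the child process. Under "spawn" the child starts a fresh
    # interpreter and re-imports this module to locate `child_task`.
    print(format_greeting(name))


def main() -> None:
    # Request "spawn" explicitly (an assumption made for this demo; it
    # mirrors the start method named in the warning above).
    ctx = mp.get_context("spawn")
    proc = ctx.Process(target=child_task, args=("a spawned child",))
    proc.start()
    proc.join()
    print("child exit code:", proc.exitcode)


# Without this guard, the re-import in the child would reach the call to
# main() again instead of only defining the functions above.
if __name__ == "__main__":
    main()
```

The same layout applies to a script that creates the engine: keep the engine construction inside a function and call that function only under the guard.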

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

The following error message 'operation scheduled before its operands' can be ignored.





### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Shelly, and I am the owner of Bubbly Bath and Body. I live in the beautiful state of Ohio with my wonderful husband, John. I have been married for over 20 years and we have two beautiful children, Alex and Emily.
As a mom, I have always been very interested in creating products that are not only good for your skin but also safe and healthy for the whole family. After years of using commercial soaps, lotions, and body products, I discovered that many of them contained harsh chemicals and artificial fragrances that could actually be damaging to our skin and health. That's when I decided to take
Prompt: The president of the United States is
Generated text:  pushing for a $1.3 trillion spending bill that includes $25 billion for border security, the largest appropriation for the U.S.-Mexico border in U.S. history.
President Donald Trump is calling on Congress to pass the spending bill, which includes a provision for the border security funding. Th
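
In an offline batch job the generated text is usually written to disk rather than printed. The sketch below pairs each prompt with its output and serializes the pairs as JSON Lines. It assumes, as in the cell above, that the engine returns one dict with a `"text"` key per prompt, in prompt order. `EchoEngine` and `to_jsonl` are hypothetical names used only so the example runs without a model.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for the engine so the sketch runs anywhere."""

    def generate(self, prompts, sampling_params):
        # Assumption: one dict with a "text" key per prompt, in prompt order.
        return [{"text": f" [completion for: {p}]"} for p in prompts]


def to_jsonl(prompts, outputs):
    """Pair each prompt with its output; serialize one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    records = (
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    )
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


demo_prompts = ["Hello, my name is", "The capital of France is"]
demo_outputs = EchoEngine().generate(
    demo_prompts, {"temperature": 0.8, "top_p": 0.95}
)
jsonl_text = to_jsonl(demo_prompts, demo_outputs)
print(jsonl_text)
```

With a real engine, `demo_outputs` would come from `llm.generate(prompts, sampling_params)` and `jsonl_text` could be written to a file.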

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student who enjoys reading and playing the guitar. I'm a bit of a bookworm and like to spend my free time curled up with a good novel. I'm also pretty passionate about music and try to play as much as I can. I'm a bit shy, but I'm working on being more outgoing and confident. I'm a junior in high school and I'm looking forward to the rest of the year. That's me in a nutshell! I'm a bit of a creative and like to express myself through writing and music. I'm still figuring out who I am

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. The city is known for its rich history, art, fashion, and cuisine. It is home to many famous landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. Paris is a major cultural and economic center and is one of the most visited cities in the world.
The best answer is: The capital of France is Paris. Paris is located in the northern part of the country and is situated on the Seine River. The city is known for its rich history, art, fashion, and cuisine

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by several factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in AI:
1. Increased use of Explainable AI (XAI): As AI becomes more pervasive, there is a growing need to understand how AI systems make decisions. XAI aims to provide transparent and interpretable AI models, enabling humans to understand the reasoning behind AI-driven decisions.
2. Rise of Edge AI: With the proliferation of IoT devices, Edge AI will become increasingly important. Edge AI involves processing data closer to the source, reducing latency and improving real-time decision-making.
3. Growing importance of Human
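
The cell above relies on `stream_and_merge` to remove overlap between streamed chunks. The idea can be sketched in plain Python: if a new chunk begins with text that already ends the accumulated output, drop the repeated prefix before appending. The helpers `strip_overlap` and `merge_chunks` below are illustrative names, and this is a sketch of the concept rather than SGLang's implementation.

```python
def strip_overlap(accumulated: str, chunk: str) -> str:
    """Return the part of `chunk` that does not repeat the tail of `accumulated`."""
    max_len = min(len(accumulated), len(chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if accumulated.endswith(chunk[:size]):
            return chunk[size:]
    return chunk  # no overlap found


def merge_chunks(chunks):
    """Concatenate chunks while skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


chunks = ["The capital", " capital of France", " France is Paris."]
print(merge_chunks(chunks))
```

A purely textual check like this cannot tell an accidental overlap from a legitimate repetition (for example, a model that really emits the same word twice at a chunk boundary), so it is best treated as a heuristic.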



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Eve Everwood. I'm a graphic designer and illustrator currently living in Portland, Oregon. I enjoy hiking and trying new craft beers. What do you think? Is there anything you'd change or add? Here are some suggestions to improve your self-introduction: 1. Add a personal touch: While "enjoy hiking and trying new craft beers" is a good start, consider adding a personal anecdote or a quirky interest to make your character more relatable and interesting. For example, "I'm particularly fond of hiking to the top of Mount Hood on full moon nights to watch the sunrise." 2. Provide more context:

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The capital of France is Paris. The city is home to the iconic Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and the Palace of Versailles.
The city of Paris is lo
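
Because the asynchronous call is awaitable, independent requests can be overlapped with other asynchronous work, for example by using `asyncio.gather` with a semaphore that caps the number of requests in flight. The sketch below uses a hypothetical `StubAsyncEngine` whose `async_generate` takes a single prompt; it illustrates the asyncio pattern only and makes no claim about how the real engine schedules requests.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in with an awaitable generate call; not SGLang."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate time spent generating
        return {"text": f" [stub completion for: {prompt}]"}


async def generate_all(engine, prompts, sampling_params, max_concurrency=2):
    """Run one request per prompt, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def generate_one(prompt):
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather() returns results in the order of the awaitables passed in.
    return await asyncio.gather(*(generate_one(p) for p in prompts))


demo_prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
results = asyncio.run(
    generate_all(StubAsyncEngine(), demo_prompts, {"temperature": 0.8, "top_p": 0.95})
)
for prompt, result in zip(demo_prompts, results):
    print(f"Prompt: {prompt}\nGenerated text: {result['text']}")
```

Passing the whole prompt list in one call, as the cell above does, leaves batching to the engine; the per-prompt pattern here is mainly useful when requests arrive at different times.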

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elianore Quasar and I am a 25-year-old astrophysicist. I have spent most of my life studying the mysteries of the universe and I am currently working on a project to develop a new propulsion system for space exploration. I am driven by a desire to advance our understanding of the cosmos and to push the boundaries of human knowledge. I am a bit of a introvert and prefer to focus on my work rather than socializing, but I am always eager to engage with others who share my passion for space and science. I value honesty, integrity, and intellectual curiosity above all else. I am excited to learn



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris. Wiki Page for Paris, France, 2023

This statement can be used for educational purposes, informational briefs, or other contexts where a concise statement about France’s capital is required. It provides a straightforward and accurate answer to the question about the capital of France. Wiki Page for Paris, France, 2023

This response is within the required 200-word limit and provides a concise factual statement about France’s capital city. It adheres to a formal and informative tone suitable for educational or informational contexts. Wiki Page for Paris,



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be shaped by significant advancements in areas such as natural language processing, computer vision, and reinforcement learning. Here are some potential trends and developments that might shape the future of AI:
1.  Increased use of Explainable AI: As AI becomes more pervasive in various industries, there is a growing need for transparency and accountability in AI decision-making processes. Explainable AI (XAI) is a subfield of AI that aims to provide insights into how AI models make decisions, making it easier to understand and trust AI-driven outcomes.
2.  Advancements in Edge AI: Edge AI refers to the processing of AI-related tasks
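
When streaming asynchronously, it is often useful to both display text as it arrives and keep the complete result for later use. The sketch below shows that pattern with a hypothetical `stub_token_stream` async generator standing in for the engine's stream; the piece boundaries are made up for illustration.

```python
import asyncio


async def stub_token_stream(pieces):
    """Hypothetical async stream that yields text pieces one at a time."""
    for piece in pieces:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield piece


async def consume(stream):
    """Print each piece as it arrives and return the full text at the end."""
    parts = []
    async for piece in stream:
        print(piece, end="", flush=True)  # no newline between pieces
        parts.append(piece)
    print()  # finish the line once the stream is exhausted
    return "".join(parts)


demo_pieces = [" Paris", ".", " It", " is", " the", " capital", "."]
final_text = asyncio.run(consume(stub_token_stream(demo_pieces)))
```

Printing with `end=""` keeps the pieces on one line, and joining the collected pieces once at the end avoids repeated string concatenation inside the loop.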




In [6]:
llm.shutdown()
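
In a script, a call to `shutdown()` at the end is skipped if an exception is raised earlier. One way to guard against that is a small context manager that always calls `shutdown()` on exit. The sketch below uses a hypothetical `StubEngine` and a hypothetical `managed` helper; only the `shutdown()` method name is taken from the cell above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in that only tracks whether it has been shut down."""

    def __init__(self):
        self.is_running = True

    def shutdown(self):
        self.is_running = False


@contextmanager
def managed(engine):
    """Yield the engine and call shutdown() even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed(engine) as llm:
        raise RuntimeError("simulated failure during generation")
except RuntimeError as exc:
    print(f"caught: {exc}")

print("engine running:", engine.is_running)
```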