# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
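
To make the idea of a custom server concrete, here is a minimal sketch that uses only the Python standard library. It is not the linked `custom_server` example. The names `build_response`, `make_handler`, and `serve`, the JSON request shape (`prompts`, `sampling_params`), and the default port are all hypothetical. The only engine behavior assumed is what this document shows: `generate(prompts, sampling_params)` returns a list of dicts with a `text` key.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Default sampling parameters, copied from the examples in this document.
DEFAULT_SAMPLING_PARAMS = {"temperature": 0.8, "top_p": 0.95}


def build_response(engine, raw_body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response via engine.generate."""
    request = json.loads(raw_body)
    sampling_params = request.get("sampling_params", DEFAULT_SAMPLING_PARAMS)
    outputs = engine.generate(request["prompts"], sampling_params)
    return json.dumps({"texts": [o["text"] for o in outputs]}).encode("utf-8")


def make_handler(engine):
    """Create a request handler class bound to one engine (hypothetical helper)."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            body = build_response(engine, self.rfile.read(length))
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return GenerateHandler


def serve(engine, host="127.0.0.1", port=8000):
    """Serve requests one at a time; blocks until interrupted."""
    HTTPServer((host, port), make_handler(engine)).serve_forever()
```

In a real script you would create the engine with `sgl.Engine(...)` and pass it to `serve(llm)`. The sketch uses the single-threaded `HTTPServer` on purpose, since whether the engine can safely be called from several threads at once is not covered here.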

## SPECIAL WARNING!!!!

**To launch the offline engine from a Python script, wrap the launch code in an** `if __name__ == "__main__":` **guard, because SGLang uses the** `spawn` **start method to create subprocesses. See this simple example**:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py
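
The guard pattern can be sketched as follows. This is a minimal illustration rather than a copy of the linked script. It assumes only the `sgl.Engine`, `generate`, and `shutdown` calls used later in this document; the prompt and the check for whether `sglang` is installed are illustrative additions.

```python
import importlib.util


def main():
    # Imported here so this file can still be loaded where sglang is absent
    # (an illustrative choice; a top-level import is more common).
    import sglang as sgl

    llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
    try:
        outputs = llm.generate(
            ["The capital of France is"], {"temperature": 0.8, "top_p": 0.95}
        )
        for output in outputs:
            print(output["text"])
    finally:
        # Release the engine's subprocesses even if generation fails.
        llm.shutdown()


# Subprocesses started with `spawn` re-import this file. The guard keeps them
# from running main() again and trying to launch another engine.
if __name__ == "__main__":
    if importlib.util.find_spec("sglang") is None:
        print("sglang is not installed; skipping the demo")
    else:
        main()
```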

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.25it/s]

### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Lorne Lanning and I was the co-founder of Oddworld Inhabitants. Our studio is best known for its critically acclaimed Oddworld games. I founded the studio in 1994 and have been a game designer, writer, and producer for over two decades. I was also the co-creator of Abe's Oddysee, Munch's Oddysee, and Stranger's Wrath. I'm excited to share some of my thoughts on the gaming industry, game design, and the future of gaming. Let's get started!
Lorne Lanning (LL): Welcome to my blog! I'm excited to share my thoughts and
Prompt: The president of the United States is
Generated text:  the head of state and head of government of the federal government of the United States. The president leads the executive branch and is the commander-in-chief of the United States Armed Forces.
What is the role of the president of the United States?
The president is responsible for executing the laws passed by Congress and has the power to veto laws, although Congress ca

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in a small town in the Pacific Northwest. I enjoy hiking and reading in my free time. I'm currently working on a novel and trying to learn more about the local art scene. That's me in a nutshell. What do you think? Is it a good introduction?
This is a good introduction because it is neutral and doesn't reveal too much about Kaida's personality or background. It gives a brief overview of who she is and what she does, and it also mentions some of her interests. However, it could be improved by adding a bit more detail or personality

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris. The city is located in the northern part of the country and is situated on the Seine River. Paris is known for its rich history, cultural landmarks, and romantic atmosphere. The city is home to many famous museums, such as the Louvre and the Orsay, and iconic landmarks like the Eiffel Tower and Notre-Dame Cathedral. Paris is also a major hub for fashion, cuisine, and art, and is considered one of the most beautiful and vibrant cities in the world. The city has a population of over 2.1 million people and is a popular tourist destination, attracting

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems can analyze medical data, identify patterns, and make predictions, leading to more accurate diagnoses and personalized treatment plans.
2. Advancements in natural language processing: NLP is a key area of AI research, and future advancements are expected to enable more sophisticated human-computer interactions. This could include more accurate language translation, better sentiment analysis,
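
The "overlap removal" mentioned above can be illustrated with a small, self-contained sketch. It assumes that a streamed chunk may repeat text that has already been received, and it trims that repeated prefix before appending. The function names `strip_overlap` and `merge_stream_chunks` and the sample chunks are hypothetical; this shows the idea and is not necessarily how `stream_and_merge` is implemented.

```python
def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that repeats the tail of existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream_chunks(chunks) -> str:
    """Concatenate chunks, removing any overlap with the text merged so far."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which the second and third repeat earlier text.
chunks = [" Paris", " Paris.", ".\nThe capital"]
print(repr(merge_stream_chunks(chunks)))  # ' Paris.\nThe capital'
```

One trade-off of this approach is that it cannot tell a repeated chunk from text that legitimately repeats at a chunk boundary, so such text would be trimmed as well.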



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Rilian. I'm a 22-year-old art student from a small town in the Midwest.
This self-introduction is neutral because it simply states the character's name, age, occupation, and hometown without adding any personal details or biases. It also sets the scene for the character's background without revealing too much.
Here are a few more examples of neutral self-introductions for fictional characters:
My name is Jaxon, and I'm a 25-year-old freelance writer from Brooklyn.
Hi, I'm Kaida, a 20-year-old engineering student from a suburb of Tokyo.
Hello, I'm Ryker, a 

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide a concise factual statement about Italy’s capital city. The capital of Italy is Rome.
Provide a concise factual statement about Germany’s capital city. The capital of Germany is Berlin.
Provide 
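
Because `async_generate` is awaited, a generation request can share the event loop with other coroutines. The sketch below shows that pattern with a stand-in engine so it runs without a model. `StubEngine`, `generate_pairs`, and `other_work` are hypothetical names, and the stub only mimics the call shape used above: `async_generate(prompts, sampling_params)` returning a list of dicts with a `text` key.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine so the sketch runs without a model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to spend time on inference
        return [{"text": f" <completion of {prompt!r}>"} for prompt in prompts]


async def generate_pairs(engine, prompts, sampling_params):
    """Await one batched request and pair each prompt with its generated text."""
    outputs = await engine.async_generate(prompts, sampling_params)
    return [(prompt, output["text"]) for prompt, output in zip(prompts, outputs)]


async def other_work():
    """Placeholder for unrelated asynchronous work, such as I/O."""
    await asyncio.sleep(0.005)
    return "other work finished"


async def main():
    engine = StubEngine()
    prompts = ["Hello, my name is", "The capital of France is"]
    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    # gather() runs both coroutines concurrently on the same event loop.
    return await asyncio.gather(
        generate_pairs(engine, prompts, sampling_params), other_work()
    )


pairs, status = asyncio.run(main())
for prompt, text in pairs:
    print(f"Prompt: {prompt}\nGenerated text:{text}")
print(status)
```

With the real engine, `StubEngine()` would be replaced by the `llm` object created earlier.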

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elianore Quasar. I am a 25-year-old intergalactic explorer. I have spent most of my life traveling through space, discovering new worlds and civilizations. My home planet is Xanthea, a small, blue-green planet on the outer rim of the galaxy. I am currently on a solo mission to explore the uncharted regions of the galaxy. I am equipped with a state-of-the-art spacecraft, the Celestial Quest, and have a talent for navigating through uncharted space. I am neutral in my approach to new worlds and civilizations, preferring to observe and learn before making any decisions or judgments. I am


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
This is a very simple statement about Paris. The statement is clear, concise, and factual. This statement is the kind of thing that would be appropriate in an encyclopedia or a dictionary.
Provide a concise factual statement about Paris. Paris is the capital of France.
This statement is very similar to the first one, but it includes more information. This statement is also clear, concise, and factual. This statement is also the kind of thing that would be appropriate in an encyclopedia or a dictionary.
Provide a descriptive statement about Paris. Paris is known for its beautiful architecture, rich history, and romantic atmosphere.
This statement is more descriptive than

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  complex and multifaceted, with various trends and predictions emerging. Here are some possible future trends in AI:
1. Increased focus on explainability and transparency: As AI becomes more pervasive, there will be a growing need for explainability and transparency in AI decision-making. This will involve developing techniques to interpret and understand AI models, ensuring that they are fair, accountable, and transparent.
2. Rise of Edge AI: As the Internet of Things (IoT) expands, edge AI will become increasingly important. Edge AI refers to the processing of AI tasks at the edge of the network, close to where the data is generated. This will


In [6]:
llm.shutdown()