# Offline Engine API

SGLang provides a direct inference engine that does not require an HTTP server. This is useful when a separate HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

**To launch the offline engine in a Python script, you must guard the entry point with `if __name__ == "__main__":`, because SGLang uses the `spawn` start method to create subprocesses. See this [simple example](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py) for more details.**
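
As a minimal sketch of that requirement, a standalone script could be laid out as follows. The calls `sgl.Engine`, `generate`, and `shutdown` are the ones used later in this document. The `run_batch` helper, the single prompt, and the early return when `sglang` is not installed are illustrative assumptions, not part of the SGLang API.

```python
import importlib.util


def run_batch(prompts, sampling_params):
    """Create an engine, generate once for all prompts, then shut it down."""
    # Imported lazily so this module can be imported without loading sglang.
    import sglang as sgl

    llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
    try:
        outputs = llm.generate(prompts, sampling_params)
        return [output["text"] for output in outputs]
    finally:
        llm.shutdown()


def main():
    # Assumption for this sketch: skip quietly when sglang is not installed.
    if importlib.util.find_spec("sglang") is None:
        print("sglang is not installed; nothing to run.")
        return None
    texts = run_batch(
        ["The capital of France is"],
        {"temperature": 0.8, "top_p": 0.95},
    )
    for text in texts:
        print(text)
    return texts


# The guard keeps engine start-up out of module import. With the `spawn`
# start method, child processes re-import this file, and unguarded
# top-level code would run again in every child.
if __name__ == "__main__":
    main()
```

Keeping engine creation inside a function that is only reached from the guarded entry point is what prevents the spawned subprocesses from trying to start engines of their own.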

You can also build a custom server on top of the SGLang offline engine. A detailed example that runs as a Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/hidden_states.py).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

# `patch` is a helper module that is only imported in the SGLang CI environment.
if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:01<00:03,  1.05s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.58it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.30it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.16it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.20it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Sherry.
I'm a 45-year-old woman who loves my family, my friends, and my life. I'm a bit of a goofball, a hopeless romantic, and a lover of all things sweet. I love to laugh, to cry, and to experience all the emotions in between. I'm a bit of a mess, but I'm working on it.
I'm also a recovering addict, a wife, a mom, and a daughter. I've been through some tough times in my life, but I've come out stronger on the other side. I'm proud of myself for facing my demons and for fighting for
Prompt: The president of the United States is
Generated text:  attempting to silence his critics.
It is a desperate attempt, and it is doomed to fail.
Since the day Donald Trump was elected, I have been saying that his presidency would be marked by division, polarization and chaos. The chaos has been constant, with Trump tweeting outrageous, uninformed and sometimes threatening statements on a daily basis.
Many of us have been critical of Trump’s behavior, his pol

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy reading, hiking, and trying new foods. I'm currently working on a novel and experimenting with different writing styles. I'm looking forward to meeting new people and learning more about their experiences.
This is a good start, but it could be improved by adding a bit more depth and personality to the character. Here are some suggestions:
* Instead of saying "I'm a freelance writer," consider adding a bit more context about what kind of writing you do. For example, "I'm a freelance writer specializing in science fiction and fantasy novels."
* You

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris. The city is located in the northern part of the country and is situated on the Seine River. Paris is known for its rich history, cultural landmarks, and romantic atmosphere. The city is home to many famous museums, such as the Louvre and the Orsay, and is famous for its art, fashion, and cuisine. Paris is also known for its iconic landmarks, such as the Eiffel Tower and Notre-Dame Cathedral. The city has a population of over 2.1 million people and is a major hub for business, culture, and tourism. Paris is a popular destination

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by several factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in AI:
1. Increased Adoption of AI in Everyday Life: AI is likely to become increasingly integrated into our daily lives, from virtual assistants like Siri and Alexa to more advanced applications in healthcare, finance, and education.
2. Advancements in Machine Learning: Machine learning, a subset of AI, is expected to continue to improve, enabling AI systems to learn from data and improve their performance over time.
3. Rise of Explainable AI: As AI becomes more pervasive, there will be a growing need to understand
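
The cell above relies on `stream_and_merge` from `sglang.utils` to remove overlap between streamed chunks; its implementation is not shown in this document. The following is a minimal sketch of one way such overlap removal can work, assuming each streamed chunk is a plain string. The names `trim_overlap` and `merge_stream` and the sample chunks are illustrative and should not be read as the library's actual code.

```python
def trim_overlap(existing_text, new_chunk):
    """Return new_chunk without the prefix that existing_text already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks):
    """Concatenate text chunks, dropping text repeated at chunk boundaries."""
    merged = ""
    for chunk_text in chunks:
        merged += trim_overlap(merged, chunk_text)
    return merged


# Cumulative chunks: each chunk repeats everything generated so far.
cumulative = [" Paris", " Paris.", " Paris. The city"]
# Incremental chunks: each chunk carries only the new text.
incremental = [" Paris", ".", " The city"]

print(repr(merge_stream(cumulative)))   # ' Paris. The city'
print(repr(merge_stream(incremental)))  # ' Paris. The city'
```

Both chunk styles produce the same merged text. One caveat of this suffix-matching approach is that an incremental chunk that legitimately begins with the text the output already ends with (for example `"ha"` following `"ha"`) would be trimmed as if it were a repeat.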



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Amaranth. I'm a 23-year-old botanist who has spent the last three years studying the unique flora of a remote, tropical island. I've recently returned to the mainland to continue my research and share my findings with the scientific community.
The introduction is brief and to the point, providing a basic overview of the character's identity, background, and profession. It does not contain any sensational or attention-grabbing language, and avoids revealing any personal biases or opinions. Instead, it presents a neutral, factual account of the character's experiences and goals. This type of introduction is suitable for a formal setting, such as

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is located in the northern part of the country. It is situated along the Seine River. Paris is known for it
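
One reason to prefer the asynchronous call is that the event loop can keep doing other work while generation is awaited. The sketch below illustrates this with plain `asyncio`. `StandInEngine` is a hypothetical stand-in that only mimics the call shape of `llm.async_generate` used above (a list of prompts in, a list of dictionaries with a `"text"` key out); `heartbeat` and `run_concurrently` are likewise illustrative names.

```python
import asyncio


class StandInEngine:
    """Hypothetical stand-in that mimics the shape of `llm.async_generate`."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend the model is busy
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(log):
    """Other work that keeps running while generation is awaited."""
    for tick in range(3):
        log.append(f"tick {tick}")
        await asyncio.sleep(0)  # yield to the event loop


async def run_concurrently(engine, prompts, sampling_params):
    log = []
    # gather() schedules both coroutines and waits for both to finish.
    outputs, _ = await asyncio.gather(
        engine.async_generate(prompts, sampling_params),
        heartbeat(log),
    )
    return outputs, log


outputs, log = asyncio.run(
    run_concurrently(StandInEngine(), ["a", "b"], {"temperature": 0.8})
)
print([output["text"] for output in outputs])
print(log)
```

With a real engine, the same `asyncio.gather` pattern would let request handling, logging, or further generation calls proceed while a batch is in flight.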

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Jaxson Vincent. I am 22 years old and work as a freelance writer.
Jaxson Vincent is a neutral-sounding name with a fairly common first and middle name. The statement of occupation is a fairly common one in today's job market, though it's worth noting that freelance writing can be a challenging profession for many people.
Overall, this self-introduction doesn't reveal much about the character beyond their age and occupation. This can be a useful starting point, as it allows the reader to make their own assumptions about the character based on the information provided. However, it's worth considering adding a bit more depth or personality

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the capital and largest city of France, situated in the north-central part of the country. It is situated along the Seine River and has a population of around 2.1 million people.
What is the primary language spoken in France? The primary language spoken in France is French.
French is the official language of France, spoken by the vast majority of the population. It is the language used in government, education, media, and daily life.
What is the primary religion of France? The primary religion of France is Christianity, specifically Roman Catholicism.
About 60-70% of the French

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be shaped by rapid advancements in machine learning, natural language processing, and computer vision. Some possible future trends in AI include:
Improvements in natural language processing, enabling computers to understand and generate human-like language.
Advances in computer vision, allowing machines to interpret and understand visual data from images and videos.
Increased use of edge AI, where AI algorithms are deployed on devices at the edge of the network, rather than in the cloud.
Growing adoption of explainable AI, which provides transparency into AI decision-making processes.
Rise of multimodal AI, which integrates multiple forms of data, such as text, images, and audio,
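
The asynchronous streaming cell prints each cleaned chunk as soon as it arrives. As a rough sketch of that pattern, the async generator below yields only the text that has not been emitted yet. `fake_stream` is a hypothetical stand-in for a streaming engine call, and `stream_new_text` is an illustrative name; neither is the implementation of `async_stream_and_merge`.

```python
import asyncio


async def fake_stream(chunks):
    """Stand-in for a streaming engine call: yields text chunks asynchronously."""
    for chunk_text in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield chunk_text


async def stream_new_text(chunk_source):
    """Yield only the part of each chunk that has not been emitted yet."""
    emitted = ""
    async for chunk_text in chunk_source:
        # Find the longest prefix of the chunk that the emitted text ends with.
        overlap = 0
        for size in range(min(len(emitted), len(chunk_text)), 0, -1):
            if emitted.endswith(chunk_text[:size]):
                overlap = size
                break
        new_text = chunk_text[overlap:]
        if new_text:
            emitted += new_text
            yield new_text


async def collect(chunks):
    """Gather the yielded pieces into a list so they can be inspected."""
    return [piece async for piece in stream_new_text(fake_stream(chunks))]


pieces = asyncio.run(collect([" Paris", " Paris.", " Paris. The city"]))
print(pieces)  # [' Paris', '.', ' The city']
```

Printing each yielded piece with `end=""` and `flush=True`, as the cell above does, then shows the text growing without repeats.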




In [6]:
llm.shutdown()