# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in the form of a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Special Warning

**To launch the offline engine in your Python scripts, the launch code must sit under an** `if __name__ == "__main__":` **guard, because SGLang uses** `spawn` **mode to create subprocesses. Please refer to this simple example:**

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py
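
As a minimal sketch of the guard described above (the linked script remains the authoritative example), the following launches the engine only when the file is run as a script. The `sgl.Engine`, `generate`, and `shutdown` calls mirror the cells later in this document. The `find_spec` check and the `PROMPTS` and `main` names are additions of this sketch, there only so that it exits quietly on a machine without SGLang.

```python
import importlib.util

PROMPTS = ["Hello, my name is", "The capital of France is"]


def main() -> None:
    # Import inside main() so that importing this module never launches anything.
    import sglang as sgl

    llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
    try:
        outputs = llm.generate(PROMPTS, {"temperature": 0.8, "top_p": 0.95})
        for prompt, output in zip(PROMPTS, outputs):
            print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
    finally:
        # Always release the engine's subprocesses, even if generation fails.
        llm.shutdown()


# With the spawn start method, each subprocess re-imports this file.
# Without the guard, that re-import would try to launch the engine again.
if __name__ == "__main__":
    if importlib.util.find_spec("sglang") is None:
        # Assumption of this sketch: skip quietly when SGLang is unavailable.
        print("sglang is not installed; nothing to launch")
    else:
        main()
```

Keeping all engine work inside `main()` means the module has no side effects at import time, which is what `spawn` requires of the main module.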

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.28it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Bryan and I'm a former Marine and current IT professional. I'm excited to start my new journey as a full-time writer.
I've always been drawn to writing, even as a kid. I would write short stories and poems for my friends and family, and I even self-published a few small books on Amazon back in the day.
However, after leaving the Marine Corps, I struggled to find a career path that fit my skills and interests. I worked in a few different fields, but nothing ever seemed quite right. That was until I discovered IT.
I fell in love with the world of technology and quickly realized that my military
Prompt: The president of the United States is
Generated text:  the head of state and head of government of the United States, indirectly elected to a four-year term by the people through the Electoral College. The officeholder serves as the commander-in-chief of the armed forces, the director of national policy, and the representative of the U.S. internat

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor. I live in a small apartment in the city with my cat, Luna. I enjoy reading, hiking, and trying out new recipes in my free time. I'm a bit of a introvert and prefer to spend time alone, but I'm always up for a good conversation or a quiet evening with friends. I'm a bit of a perfectionist, which can sometimes make it difficult for me to start new projects, but I'm working on it. I'm excited to meet new people and explore new opportunities. That's me in a nutshell! What do you

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris. This is a factual statement that provides a concise piece of information about France’s capital city. It is a simple and direct statement that answers the question about the capital of France. The statement is also accurate and up-to-date, as Paris has been the capital of France for centuries. This type of statement is useful for providing a quick and easy-to-understand answer to a question about a specific topic. It can be used in a variety of contexts, such as in a history textbook, a travel guide, or a trivia game. Overall, the statement is clear, concise, and accurate, making

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, from diagnosing diseases to developing personalized treatment plans.
2. Rise of explainable AI: As AI becomes more pervasive, there will be a growing need to understand how AI systems make decisions. Explainable AI will become increasingly important to build trust in AI systems.
3. Growth of edge AI: Edge AI refers to the use of AI on devices at the edge of the network, such
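
The "overlap removal" mentioned in the cell above can be pictured with a small sketch. It assumes that a streamed chunk may repeat the tail of the text already received, and it drops the repeated prefix before appending. The function names `trim_overlap` and `merge_chunks` and the sample chunks are illustrative; this is not claimed to be the implementation behind `stream_and_merge`.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the tail of existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, removing any prefix that overlaps the merged text."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Illustrative chunks with overlapping boundaries (not real engine output).
sample_chunks = ["The capital of", " of France", " France is Paris."]
print(merge_chunks(sample_chunks))
```

One trade-off of this greedy approach: a chunk that legitimately begins with the same characters the text ends with is shortened as well, so it only suits sources whose chunks are known to overlap.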



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Axel Reed and I'm a 25-year-old freelance journalist from Los Angeles. I cover various topics, including entertainment, technology, and social justice. I'm always looking for new angles and perspectives to share with readers. I'm a bit of a skeptic, but I'm passionate about uncovering the truth and giving voice to marginalized communities. What do you think? Sounds good? I want my character to come across as relatable and down-to-earth. I'd like to convey a sense of curiosity and open-mindedness without being too "nice" or insincere. Any suggestions or feedback? 

Here are a few suggestions to make

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Click here to learn more about France.
Paris, the capital of France, is a city with a rich history and a blend of modern and traditional architecture. The city
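
Because `async_generate` is awaited, other coroutines can make progress while a batch is in flight, which is what makes the engine convenient to embed in a custom server. The sketch below illustrates this with `asyncio.gather`. `StandInEngine` is a hypothetical stand-in that only mimics the call shape used above (a list of prompts and a sampling-params dict in, a list of dicts with a `text` key out) so the example runs without a model; `generate_records`, `heartbeat`, and `demo` are also names invented for this sketch.

```python
import asyncio


class StandInEngine:
    """Hypothetical stand-in for the engine; not part of SGLang."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real awaitable call would
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def generate_records(engine, prompts, sampling_params):
    """Await one batch and pair each prompt with its generated text."""
    outputs = await engine.async_generate(prompts, sampling_params)
    return [{"prompt": p, "text": o["text"]} for p, o in zip(prompts, outputs)]


async def heartbeat(ticks):
    """Some unrelated work that runs while generation is awaited."""
    for _ in range(3):
        ticks.append("tick")
        await asyncio.sleep(0)


async def demo():
    ticks = []
    records, _ = await asyncio.gather(
        generate_records(StandInEngine(), ["a", "b"], {"temperature": 0.8}),
        heartbeat(ticks),
    )
    return records, ticks


records, ticks = asyncio.run(demo())
print(records)
print(ticks)
```

With the real engine, `StandInEngine()` would be replaced by the `llm` object created earlier; the surrounding coroutine structure stays the same.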

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaelin Vex. I'm a 19-year-old student currently attending the University of Ashwood, where I'm studying environmental science. I'm not really sure what my long-term goals are, but for now, I'm just trying to make the most of my college experience and see where life takes me. I enjoy hiking and reading in my free time, and I'm always up for trying new things.
Write a short, neutral self-introduction for a fictional character. Hello, my name is Emilia Grey. I'm a 25-year-old software engineer currently working at a small startup in downtown New Haven. I have



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is located in the northern part of the country and is situated along the Seine River. Paris is known for its beauty, history, and culture, as well as its famous landmarks such as the Eiffel Tower and the Louvre Museum. The city is a major hub for international business, fashion, and tourism, and is considered one of the most visited cities in the world. Paris is also home to many of France’s most famous universities and research institutions, and is a center for the arts and sciences. Overall, Paris is a vibrant and iconic city that is steeped in history and culture.  I was unable



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  difficult to predict, but some trends are emerging that could shape the field in the coming years.
Artificial intelligence (AI) has made tremendous progress in recent years, and its future is expected to be shaped by several trends. Here are some possible future trends in AI:
1. Increased focus on explainability and transparency: As AI becomes more pervasive in our lives, there is a growing need to understand how it makes decisions. This trend is expected to continue, with a focus on developing more transparent and explainable AI models.
2. Advances in natural language processing (NLP): NLP has made significant progress in



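The cell above prints each chunk as soon as it arrives. If the complete response is also needed afterwards, the chunks can be accumulated while iterating. The sketch below shows that pattern with `stand_in_stream`, a hypothetical async generator used in place of `async_stream_and_merge(llm, prompt, sampling_params)` so that it runs without a model; `collect_stream` and `SAMPLE` are likewise names invented here.

```python
import asyncio

SAMPLE = " Paris. It is located in the northern part of the country."


async def stand_in_stream(text, chunk_size=4):
    """Hypothetical stand-in for a streaming source: yields text in small pieces."""
    for start in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # let other tasks run between chunks
        yield text[start:start + chunk_size]


async def collect_stream(chunks) -> str:
    """Consume an async stream of text chunks and return the joined text."""
    parts = []
    async for chunk in chunks:
        parts.append(chunk)  # a real consumer could also print the chunk here
    return "".join(parts)


full_text = asyncio.run(collect_stream(stand_in_stream(SAMPLE)))
print(full_text)
```

Collecting into a list and joining once at the end avoids repeated string concatenation for long responses.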

In [6]:
llm.shutdown()