# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## SPECIAL WARNING!!!!

**To launch the offline engine in your Python scripts, the** `__main__` **guard is necessary, since we use** `spawn` **mode to create subprocesses. Please refer to this simple example**:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/hidden_states.py).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Mary and I am a first-time mom to my beautiful baby boy, Lucas. I just wanted to share my birth story with you all. I had been induced on a Tuesday evening and was excited to meet my little man. I was 37 weeks and 4 days pregnant.
After a few hours of contractions that were getting stronger and closer together, I was hooked up to the fetal monitor to keep an eye on Lucas. The nurses came in to break my water, and I was finally ready to start pushing! I had been thinking about using a birthing tub, but it ended up not working out. I didn't mind,
Prompt: The president of the United States is
Generated text:  the head of the executive branch of the federal government of the United States. The president serves as both the head of state and the head of government of the United States. The president is responsible for enforcing the laws passed by Congress, for negotiating treaties and executive agreements, and for serving as the commander-in-chief o

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student who enjoys reading and playing video games in my free time. I'm a bit of a introvert and prefer to spend time alone, but I'm not antisocial and enjoy talking to people when I feel comfortable. I'm a bit of a perfectionist and can get stressed when things don't go according to plan. I'm a hard worker and try to do my best in everything I do. That's me in a nutshell.
This self-introduction is neutral because it doesn't reveal any personal secrets or biases. It simply presents Kaida's personality, interests, and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and cuisine. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre Dame Cathedral. The city has a population of over 2.1 million people and is a major hub for international business, culture, and tourism. Paris is also known for its romantic atmosphere and is often referred to as the "City of Light." The city has a rich cultural heritage and is home to many museums

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems can analyze medical images, identify patterns in patient data, and provide personalized treatment recommendations.
2. Rise of explainable AI: As AI becomes more pervasive, there is a growing need for transparency and accountability. Explainable AI (XAI) aims to provide insights into how AI models make decisions, enabling users to understand and trust
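
The "overlap removal" mentioned in the cell above refers to merging streamed chunks without repeating text. The sketch below illustrates one way such a merge could work on plain strings. `trim_overlap` and `merge_chunks` are hypothetical helpers written for this illustration; they are not claimed to match how `stream_and_merge` in `sglang.utils` is implemented, and the sample chunks are made up.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that already ends `existing`."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, dropping text repeated at each chunk boundary."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each piece repeats the tail of the previous one.
chunks = ["The capital of", " of France", " France is Paris."]
print(merge_chunks(chunks))
```

Note that a suffix/prefix heuristic like this cannot tell an accidental overlap from a genuine repetition (for example `"ha"` followed by `"ha"`), so it is only suitable when chunks are known to overlap.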



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaida. I'm a 22-year-old student majoring in business at a university in Tokyo. I enjoy hiking and trying new foods, especially those with unusual ingredients. I'm still figuring out my career path, but I'm excited to see where my studies will take me. I'm a pretty laid-back person who values independence and self-reliance, but I also enjoy spending time with friends and family. I'm looking forward to seeing what the future holds.
This is an example of a neutral self-introduction for a fictional character. It provides basic information about the character's background, interests, and personality without revealing too much

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is located in the north-central part of the country. It is situated along the Seine River. Paris has a population of over 2.1 mil
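
Because `async_generate` is awaited, other coroutines can make progress while generation is pending. The sketch below shows that pattern with `asyncio.gather`. `FakeEngine` and `heartbeat` are hypothetical stand-ins: `FakeEngine` only mimics the `async_generate(prompts, sampling_params)` call shape and the `output['text']` field used above, so the example runs without a model or GPU.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine; returns canned text after a delay."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.05)  # simulate generation latency
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(events):
    """Unrelated coroutine that runs while generation is pending."""
    for i in range(3):
        events.append(f"tick {i}")
        await asyncio.sleep(0.01)


async def main():
    llm = FakeEngine()
    events = []
    prompts = ["The capital of France is", "The future of AI is"]
    # Run generation and the heartbeat concurrently on one event loop.
    outputs, _ = await asyncio.gather(
        llm.async_generate(prompts, {"temperature": 0.8, "top_p": 0.95}),
        heartbeat(events),
    )
    return outputs, events


outputs, events = asyncio.run(main())
for output in outputs:
    print(output["text"])
print(events)
```

With a real engine, the `llm` object created earlier would take the place of `FakeEngine()`.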

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Jen, and I work as a freelance writer and editor in San Francisco. I'm a creative problem-solver and enjoy helping others tell their stories through compelling writing. I'm a bit of a coffee snob, and you can usually find me sipping on a cappuccino at a local café.
I think this is a good example of a neutral self-introduction because it:
Includes Jen's name and profession (freelance writer and editor)
Mentions a relevant skill or interest (helping others tell their stories)
Avoids personal or provocative language
Provides a sense of place and culture (San Francisco and coffee culture)

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. The city is known for its cultural and historical landmarks, art museums, fashion, and romantic atmosphere. Paris has been the capital of France since 987. Source: Encyclopedia Britannica.
Find out where you can get a free French dictionary for the iPhone. Dictionary.com offers a free French-English dictionary for the iPhone. You can download the app from the App Store. The dictionary includes over 250,000 definitions and translations. Source: Dictionary.com.
What are the French words for the following words: (1) thank you,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  not only changing the world but also causing a significant change in the way people work, think, and interact with one another. Here are some possible future trends in AI:
1. Hybrid Intelligence: The future of AI may involve the fusion of human and artificial intelligence to create hybrid intelligence. This could enable humans and machines to work together more effectively, resulting in better decision-making and problem-solving.
2. Increased Emphasis on Explainability: As AI becomes more prevalent, there will be an increased focus on explainability. This means that AI systems will need to provide clear and transparent explanations for their decision-making processes, making it easier for humans to

In [6]:
llm.shutdown()
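
The final cell calls `llm.shutdown()` explicitly. In a standalone script, a `try`/`finally` block is one way to make sure that call still happens if generation raises. The sketch below uses `StubEngine`, a hypothetical stand-in that exposes only the `generate` and `shutdown` methods seen in this document, and `run_batch`, a helper name invented for this illustration.

```python
class StubEngine:
    """Hypothetical stand-in exposing only the two methods used here."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


def run_batch(llm, prompts, sampling_params):
    """Generate once and always shut the engine down afterwards."""
    try:
        return llm.generate(prompts, sampling_params)
    finally:
        # Runs on success and on failure alike.
        llm.shutdown()


llm = StubEngine()
try:
    run_batch(llm, [], {"temperature": 0.8})
except ValueError as exc:
    print(f"generation failed: {exc}")
print("shut down:", llm.is_shut_down)
```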