# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

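## Special Warning

**To launch the offline engine from a Python script, you must place the launch code under an `if __name__ == "__main__":` guard, because the engine uses the `spawn` start method to create subprocesses.** With `spawn`, each subprocess re-imports the main module, so unguarded top-level code would run again in every child. Please refer to this simple example:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py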
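
The guard described in this section can be sketched as a small script. This is a minimal illustration, not the linked example: the file name `launch_sketch.py` and the command-line handling are assumptions made here, and only `sgl.Engine`, `generate`, and `shutdown` are taken from the cells later in this document.

```python
import sys


def main(model_path: str) -> None:
    # Imported here only so this sketch stays importable without sglang installed.
    import sglang as sgl

    llm = sgl.Engine(model_path=model_path)
    try:
        outputs = llm.generate(
            ["The capital of France is"],
            {"temperature": 0.8, "top_p": 0.95},
        )
        for output in outputs:
            print(output["text"])
    finally:
        # Release the engine even if generation raises.
        llm.shutdown()


# With the spawn start method, child processes re-import this module.
# Only the process that was started as the script enters this block.
if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: python launch_sketch.py <model_path>")
    else:
        main(sys.argv[1])
```

Running `python launch_sketch.py meta-llama/Meta-Llama-3.1-8B-Instruct` would launch the engine once; without the guard, each spawned subprocess would try to execute the launch code again.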

## Advanced Usage

The engine also supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) and [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

# Only imported when running in the project's CI environment.
if is_in_ci():
    import patch


# Load the model and start the engine; `llm` is reused by every cell below.
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

INFO 03-19 07:13:43 __init__.py:190] Automatically detected platform cuda.
INFO 03-19 07:14:01 __init__.py:190] Automatically detected platform cuda.
INFO 03-19 07:14:01 __init__.py:190] Automatically detected platform cuda.
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.13it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.75it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.46it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.30it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.35it/s]

### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Abdulkadir. I am from Somalia. I am in 7th grade. I like playing football and watching the news. I am a bit shy, but I like making friends and learning new things. My dream is to become a doctor and help people in need. I am proud to be a part of the class of 2033. Hello, my name is Abdulkadir. I am from Somalia. I am in 7th grade. I like playing football and watching the news. I am a bit shy, but I like making friends and learning new things. My dream is to become a doctor and help people in
Prompt: The president of the United States is
Generated text:  the head of the executive branch, which is one of the three branches of the federal government. The president is responsible for enforcing laws, making treaties, appointing federal judges, and acting as commander-in-chief of the armed forces. The president also serves as the head of state, representing the United States at home and abroad.
The president is elected to a four-year term through t

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor living in a small town in the Pacific Northwest. I enjoy hiking and reading in my free time. I'm a bit of a introvert and prefer to keep to myself, but I'm always up for a good conversation. I'm currently working on a novel and trying to get my writing career off the ground. That's me in a nutshell. What do you think? Is there anything you'd like to add or change?
I think your self-introduction is great! It's concise, informative, and gives a good sense of who Kaida is as a person

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris.
This statement is a concise factual statement about France’s capital city. It clearly and directly states the name of the capital city, which is Paris. This statement is a good example of a concise factual statement because it is short, clear, and to the point, providing the reader with the necessary information without any unnecessary details or elaboration. It is also a factual statement because it is based on verifiable evidence and is not subjective or opinion-based. Overall, this statement is a good example of a concise factual statement about France’s capital city.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in artificial intelligence:
1. Increased Adoption of Edge AI: Edge AI refers to the processing of AI algorithms at the edge of the network, closer to the source of the data. This trend is likely to continue as the need for real-time processing and reduced latency increases.
2. Rise of Explainable AI: As AI becomes more pervasive, there is a growing need to understand how AI systems make decisions. Explainable AI (XAI) aims to provide insights into the decision-making process of AI models, which will
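
The cell above mentions "overlap removal" when merging streamed chunks. The idea can be sketched in plain Python: before appending a new chunk, drop any prefix of it that repeats the tail of the text collected so far. This is an illustrative sketch only, not necessarily how `stream_and_merge` in `sglang.utils` is implemented; the names `trim_overlap` and `merge_chunks` and the sample chunks are assumptions made here.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the tail of existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, dropping text repeated across chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks: the second one repeats the tail of the first.
chunks = [" Paris is the", " is the capital", " of France."]
merged = merge_chunks(chunks)
print(merged)
```

One caveat of this approach is that it cannot tell an accidental overlap from text the model genuinely repeated, so a legitimately repeated fragment at a chunk boundary would also be trimmed.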



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Rowan and I'm a 20-year-old student at the local community college. I enjoy hiking and reading in my free time. I'm looking forward to getting to know my classmates and professors better this semester. What's up?
This text is a neutral self-introduction. It does not express the character's personality, emotions, or values, and it does not provide any information that would reveal their background or motivations. It simply states the character's name, age, and current circumstances, as well as a couple of their hobbies and interests. This kind of introduction is suitable for a character who is trying to present themselves in a friendly

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. There is a global consensus that the Eiffel Tower is the icon of Paris. The Eiffel Tower was built in 1889. The Eiffel Tow
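
The cell above submits the whole prompt list in one `async_generate` call. When requests arrive independently, another option is to issue one call per prompt and await them together with `asyncio.gather`. The sketch below uses a hypothetical `StubEngine` so that it runs without a GPU; it assumes that `async_generate` accepts a single prompt string and returns a single dict with a `"text"` key, which should be checked against the real engine.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    """Issue one request per prompt and wait for all of them together."""
    requests = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather() returns results in the same order as the awaitables it was given.
    return await asyncio.gather(*requests)


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
results = asyncio.run(
    generate_concurrently(StubEngine(), prompts, {"temperature": 0.8, "top_p": 0.95})
)
for prompt, result in zip(prompts, results):
    print(f"Prompt: {prompt}\nGenerated text: {result['text']}")
```

Under those assumptions, passing the real `llm` in place of `StubEngine()` would follow the same pattern. Whether per-prompt calls or a single batched call performs better depends on the workload.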

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Eira and I'm a quiet, observant, and introspective person who appreciates simplicity and solitude. I value honesty and authenticity and enjoy reading, drawing, and walking in nature. I'm currently living in a small town surrounded by woods, where I can focus on my hobbies and personal growth.
This self-introduction is neutral because it doesn't reveal too much about Eira's personality, background, or motivations. It simply presents her as a person who values certain things and enjoys certain activities, without making any judgments or assumptions. It's a good starting point for a character, as it allows the writer to build upon this foundation

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is home to many cultural and historical landmarks, including the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. Paris is the country’s largest city and serves as the center of politics, culture, and economy for the country.
The Eiffel Tower, located in the heart of Paris, is one of the most iconic landmarks in the world. It was originally built as the entrance arch for the 1889 World’s Fair and has since become a symbol of French culture and engineering. The tower is made of iron and stands 324 meters tall, making it the tallest structure in Paris.
The Lou

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  becoming increasingly popular to explore. Here are some possible trends that may shape the future of artificial intelligence:
1. Increased Use of AI in Everyday Life
Artificial intelligence (AI) is becoming increasingly ubiquitous and is being used in various aspects of our daily lives. This trend is expected to continue, with AI being used in everything from smart home devices to healthcare.
2. Advancements in Natural Language Processing (NLP)
Natural Language Processing (NLP) is a subfield of AI that deals with the interaction between computers and humans in natural language. NLP is expected to become more sophisticated, enabling computers to understand and respond to human language


In [6]:
llm.shutdown()
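
In a script, it can be useful to guarantee that `shutdown()` runs even when generation raises an exception. One way to sketch this is a small context manager. The helper `engine_session` and the `StubEngine` class below are hypothetical and exist only so that the example runs without a GPU; the real engine is assumed to expose `generate` and `shutdown` as used in the cells above.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(engine):
    """Yield the engine and always call its shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; it only records that shutdown ran."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self):
        self.closed = True


stub = StubEngine()
with engine_session(stub) as llm:
    outputs = llm.generate(["The capital of France is"], {"temperature": 0.8})
print(stub.closed)
```

With the real engine, the same pattern would be `with engine_session(sgl.Engine(model_path=...)) as llm:`. A custom helper is used here because `contextlib.closing` expects a `close()` method rather than `shutdown()`.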