# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Special Warning

**To launch the offline engine from a Python script, you must wrap the launch code in an** `if __name__ == "__main__":` **guard, because the engine uses** `spawn` **mode to create subprocesses. Please refer to this simple example**:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py
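
A minimal sketch of such a script, built only from the calls used later in this document, might look like the following. The lazy import and the `try`/`finally` cleanup are illustrative choices rather than requirements stated by SGLang.

```python
import sys


def main() -> None:
    # Imported lazily so the script reports a clear message when SGLang is missing.
    try:
        import sglang as sgl
    except ImportError:
        print("sglang is not installed; nothing to run.", file=sys.stderr)
        return

    llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
    try:
        outputs = llm.generate(["The capital of France is"], {"temperature": 0.0})
        print(outputs[0]["text"])
    finally:
        # Always release the engine, even if generation fails.
        llm.shutdown()


# Under the `spawn` start method, subprocesses re-import this module.
# The guard keeps them from running main() again.
if __name__ == "__main__":
    main()
```

Without the guard, a spawned subprocess would re-execute the module's top-level code and attempt to launch the engine a second time.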

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # Only imported when this notebook runs in CI.
    import patch


# Create the engine; this loads the model weights shown in the log below.
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

The following error message 'operation scheduled before its operands' can be ignored.





### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Oded and I am a 3D Artist from Israel. I specialize in creating photorealistic 3D visualizations and animations for architectural and interior design projects. My goal is to provide high-quality visualizations that help architects, designers, and builders to communicate their ideas effectively and bring their projects to life.
I have a strong background in 3D modeling, texturing, and lighting, which enables me to create realistic and detailed visualizations that meet the needs of my clients. I am also familiar with various software such as 3ds Max, V-Ray, and Adobe Creative Suite, which allows me to work efficiently and deliver
Prompt: The president of the United States is
Generated text:  the head of state and head of government of the United States, and is the ceremonial and political leader of the country. The president is directly elected by the people through the Electoral College, and serves a four-year term. The president has a wide ran

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor. I live in a small apartment in the city with my cat, Luna. I enjoy reading, hiking, and trying out new coffee shops. I'm currently working on a novel and trying to learn more about the world of publishing. That's me in a nutshell. What do you think? Is there anything you'd like to add or change?
I think your self-introduction is clear and concise. It gives a good sense of who Kaida is and what she's about. However, it's a bit on the bland side. To make it more interesting,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city is also a major hub for business, education, and tourism. Paris is a popular destination for visitors from around the world, attracting over 23 million tourists each year. The city has a population of over 2.1 million people and is a major economic and cultural center in Europe

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with the potential to revolutionize the way we diagnose and treat diseases.
2. Widespread adoption of AI in education: AI has the potential to revolutionize the way we learn, with the ability to personalize education and make it more accessible to people all over the world. AI-powered adaptive learning systems can adjust to the
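
The `stream_and_merge` helper used above hides two steps: iterating over streamed chunks, and dropping text that a chunk repeats from what has already been received. The sketch below illustrates that idea in plain Python. `strip_overlap`, `merge_stream`, and `FakeEngine` are hypothetical names introduced for illustration and are not SGLang's implementation; the `generate(..., stream=True)` call returning an iterator of dicts with a `"text"` key is an assumption about the engine interface.

```python
from typing import Dict, Iterable, Iterator


def strip_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that is already a suffix of `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_stream(chunks: Iterable[Dict[str, str]]) -> str:
    """Concatenate streamed chunks, skipping text that was already received."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk["text"])
    return merged


class FakeEngine:
    """Stand-in engine so the sketch runs without a model (hypothetical)."""

    def generate(
        self, prompt: str, sampling_params: dict, stream: bool = False
    ) -> Iterator[Dict[str, str]]:
        # Emits cumulative text, so every chunk repeats the previous one.
        for text in (" Paris", " Paris is", " Paris is lovely."):
            yield {"text": text}


engine = FakeEngine()
chunks = engine.generate("The capital of France is", {"temperature": 0.2}, stream=True)
print(merge_stream(chunks))
```

One trade-off of suffix/prefix matching is that it cannot tell a repeated chunk from a genuine repetition in the output: if the model really does emit the same text twice in a row, the second copy is dropped.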



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Jaxon Wilder, but my friends call me Wilder. I'm a 22-year-old.
This response has a neutral tone and provides a brief introduction to the character.
This response is missing the requested detail about hobbies or interests. Please provide more information about Wilder's personality, background, or interests to create a more well-rounded self-introduction.
The introduction does not have a neutral tone. It comes across as more of an introduction than an actual self-introduction, and it seems to be written from the perspective of someone else.
This response is still missing some details, such as hobbies or interests. The description is also quite

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
This statement is concise and factual as it only provides the name of the capital city of France, which is widely 
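
Because `async_generate` is awaited, it can also be combined with other `asyncio` tools. The sketch below issues one request per prompt and waits for all of them with `asyncio.gather`. It is an illustration under assumptions: `generate_concurrently` and `FakeAsyncEngine` are hypothetical names, and it assumes that awaiting `async_generate` with a single prompt yields a single dict with a `"text"` key. Whether separate requests are scheduled as efficiently as one batched call is not something this document states.

```python
import asyncio
from typing import Any, Dict, List


async def generate_concurrently(
    engine: Any, prompts: List[str], sampling_params: Dict[str, Any]
) -> List[str]:
    """Send one request per prompt and wait for all of them together."""
    tasks = [engine.async_generate(prompt, sampling_params) for prompt in prompts]
    # gather() preserves input order, so results line up with prompts.
    outputs = await asyncio.gather(*tasks)
    return [output["text"] for output in outputs]


class FakeAsyncEngine:
    """Stand-in engine so the sketch runs without a model (hypothetical)."""

    async def async_generate(
        self, prompt: str, sampling_params: Dict[str, Any]
    ) -> Dict[str, str]:
        await asyncio.sleep(0.01)  # Pretend to do some work.
        return {"text": f" <completion for {prompt!r}>"}


async def main() -> List[str]:
    engine = FakeAsyncEngine()
    prompts = ["Hello, my name is", "The capital of France is"]
    return await generate_concurrently(
        engine, prompts, {"temperature": 0.8, "top_p": 0.95}
    )


if __name__ == "__main__":
    for text in asyncio.run(main()):
        print(text)
```

Note that `asyncio.run()` raises a `RuntimeError` when an event loop is already running in the current thread. In such environments, `await main()` from existing async code instead.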

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elianore Quasar. I'm a skilled programmer and software developer with a passion for artificial intelligence and machine learning. I work in a cutting-edge tech firm, where I lead a team of developers to design and implement innovative solutions for various industries.
This self-introduction seems to be written in a fairly formal and professional tone, which is common in business and professional settings. The language is straightforward, and the structure is simple and easy to follow. However, to make this introduction more engaging, consider adding a personal touch or a unique aspect that sets Elianore apart from others. For example:
"I'm Elianore Quasar,



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The capital of France is Paris.
Note: The statement is concise and factual, providing the essential information about the capital of France. There is no need for additional details or explanations as it is a straightforward statement. However, the following section can provide more information about Paris if required. ## Step 1: Identify the topic
The topic is about the capital of France. ## Step 2: Provide a concise factual statement
The capital of France is Paris. ## Step 3: Review the statement for accuracy
The statement is accurate and concise, providing the essential information about the capital of France. ## Step 4:



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  predicted to be increasingly diverse and sophisticated, with significant impacts on various aspects of society. Possible future trends in AI include:
Advancements in AI research and development: Continued breakthroughs in areas like machine learning, natural language processing, and computer vision will lead to more sophisticated AI systems that can learn, reason, and interact with humans in more natural ways.
Increased deployment of AI in various industries: AI will be increasingly integrated into various sectors, including healthcare, finance, transportation, education, and manufacturing, transforming the way these industries operate and improving efficiency, productivity, and decision-making.
Growing use of Explainable AI (XAI): As AI systems
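
For readers who want to see the shape of an asynchronous streaming loop without the helper, here is a minimal sketch. It assumes that awaiting `async_generate(..., stream=True)` returns an async iterator of dicts with a `"text"` key; that interface, along with the names `stream_deltas` and `FakeStreamingEngine`, is an assumption made for illustration and is not taken from SGLang's source.

```python
import asyncio
from typing import Any, AsyncIterator, Dict, List


async def stream_deltas(
    engine: Any, prompt: str, sampling_params: Dict[str, Any]
) -> AsyncIterator[str]:
    """Yield only the newly generated text from each streamed chunk."""
    received = ""
    # Assumption: awaiting the call returns an async iterator of dict chunks.
    generator = await engine.async_generate(prompt, sampling_params, stream=True)
    async for chunk in generator:
        text = chunk["text"]
        # Cumulative chunks repeat earlier text; keep only the new tail.
        delta = text[len(received):] if text.startswith(received) else text
        received += delta
        yield delta


class FakeStreamingEngine:
    """Stand-in engine so the sketch runs without a model (hypothetical)."""

    async def async_generate(
        self, prompt: str, sampling_params: Dict[str, Any], stream: bool = False
    ) -> AsyncIterator[Dict[str, str]]:
        async def chunks() -> AsyncIterator[Dict[str, str]]:
            for text in (" Paris", " Paris is", " Paris is lovely."):
                await asyncio.sleep(0)  # Yield control, as real I/O would.
                yield {"text": text}

        return chunks()


async def main() -> List[str]:
    deltas = []
    params = {"temperature": 0.8, "top_p": 0.95}
    async for delta in stream_deltas(
        FakeStreamingEngine(), "The capital of France is", params
    ):
        print(delta, end="", flush=True)
        deltas.append(delta)
    print()  # New line after the streamed text.
    return deltas


if __name__ == "__main__":
    asyncio.run(main())
```

Printing with `end=""` and `flush=True` keeps the streamed pieces on a single line as they arrive.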




In [6]:
llm.shutdown()