# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can easily build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Special Warning

**To launch the offline engine in a Python script, the launch code must sit under an `if __name__ == "__main__":` guard, because the engine creates its subprocesses in `spawn` mode.** Please refer to this simple example:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py
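
A minimal sketch of the script layout this warning asks for. `StubEngine` is a hypothetical stand-in so the sketch runs without a GPU or a model download; in a real script you would construct `sgl.Engine(model_path=...)` inside `main()` instead. With `spawn`, each subprocess re-imports the script, so an engine launch left at module level would be executed again in every child.

```python
class StubEngine:
    """Hypothetical stand-in for sgl.Engine; it only echoes the prompts."""

    def generate(self, prompts, sampling_params=None):
        # Mirrors the list-of-{"text": ...} output shape used in this document.
        return [{"text": f" <stub completion for {p!r}>"} for p in prompts]

    def shutdown(self):
        pass


def main():
    # In a real script: llm = sgl.Engine(model_path="...")
    llm = StubEngine()
    try:
        outputs = llm.generate(["The capital of France is"])
        print(outputs[0]["text"])
        return outputs
    finally:
        # Release the engine even if generation raised.
        llm.shutdown()


# Only the process that runs this file directly enters this branch;
# a subprocess that re-imports the module skips it.
if __name__ == "__main__":
    main()
```

Keeping everything that starts the engine inside `main()` also makes the module safe to import from other code.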

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Heather Garcia and I am the Director of the United Way of San Antonio and Bexar County. I am thrilled to be a part of this community and to have the opportunity to serve alongside such a dedicated team of staff, volunteers, and community partners.
As Director, I am committed to advancing our mission of improving lives and creating lasting change in Bexar County by focusing on education, income, health, and basic needs. I believe that by working together, we can create a more equitable and just community where everyone has the opportunity to thrive.
Before joining United Way, I spent over 20 years working in various roles in the non
Prompt: The president of the United States is
Generated text:  not required to have a college degree. While most presidents have had college educations, some, such as George Washington, Abraham Lincoln, and Harry Truman, did not complete college. In fact, Harry Truman was the first president to have a high school ed
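
For offline batch jobs it is often convenient to persist each prompt together with its completion. The sketch below assumes, as in the cell above, that `generate` returns one dict per prompt, in prompt order, with the completion under the `"text"` key. `to_jsonl` is a hypothetical helper, not part of SGLang, and the sample outputs are made-up placeholders.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as JSON Lines (one object per line)."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Placeholder data shaped like the result of llm.generate(prompts, sampling_params).
prompts = ["The capital of France is", "The future of AI is"]
outputs = [{"text": " Paris."}, {"text": " bright."}]

print(to_jsonl(prompts, outputs))
```

The length check guards against silently misaligned pairs, which `zip` alone would truncate without warning.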

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and artist living in a small town in the Pacific Northwest. I enjoy hiking, reading, and trying out new recipes in my tiny kitchen. I'm a bit of a introvert and prefer to spend my free time alone, but I'm always up for a good conversation or a spontaneous adventure. I'm currently working on a novel and a few art projects, and I'm excited to see where my creative endeavors take me. What do you think? Is there anything you'd like to add or change?
Here are a few suggestions to consider:
* You might want to add a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and cuisine. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city has a population of over 2.1 million people and is a major center for business, culture, and tourism. Paris is also known for its romantic atmosphere and is often referred to as the City of Light. The city has a long history dating back to the 3rd century

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with applications such as:
a. Predictive analytics: AI will be used to predict patient outcomes, identify high-risk patients, and prevent hospital readmissions.
b. Personalized medicine: AI will help tailor treatment plans to individual patients based on their genetic profiles, medical histories, and lifestyle factors.
c. Virtual nursing assistants
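
The "overlap removal" mentioned in the cell above refers to merging streamed chunks without repeating text that was already received. Here is a minimal sketch of one way to do that, assuming a chunk may begin with text that repeats the end of what has been accumulated so far. `trim_overlap` and `merge_chunks` are illustrative helpers and are not claimed to match the internals of `stream_and_merge`.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Made-up chunks: each one repeats the tail of the previous one.
chunks = [" Paris is", " is the capital", " capital of France."]
print(merge_chunks(chunks))
```

One limitation of this approach: a chunk that legitimately starts with the same characters the accumulated text ends with would be trimmed as well.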



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Celeste Thompson. I'm a 25-year-old freelance writer living in Portland, Oregon. I'm a fan of old movies, good coffee, and hiking in the Columbia River Gorge. I'm working on a novel and trying to keep up with my blog, where I write about books and life. I'm a bit of a introvert, but I'm always up for a good conversation. I'm currently working on a new project and I'm excited to see where it takes me. That's me in a nutshell. Feel free to ask me anything.
This self-introduction is neutral, which means it doesn’t reveal anything

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The French name for the city is Paris. Paris is located in the northern part of France. The city is situated on the Seine River. The river flows through the center of the city. Paris is a popular tourist destination due to its histo
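
One reason to use an asynchronous API is that several independent requests can be in flight on a single event loop. The sketch below illustrates that pattern with `asyncio.gather`. `StubAsyncEngine` is a hypothetical stand-in that only mimics the `async_generate` call shape shown above (a list of prompts in, a list of `{"text": ...}` dicts out), so the example runs without a model; how the real engine schedules overlapping calls is not covered here.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for the engine; returns canned text after a short delay."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate waiting on the model
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Start every batch at once; gather returns results in submission order.
    pending = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*pending)


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(run_batches(StubAsyncEngine(), batches, {"temperature": 0.8}))

for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']}")
```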

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elianore Quasar and I'm a freelance astro-cartographer. I specialize in charting celestial bodies and their movements in relation to terrestrial locations. I've worked with various clients in the field, including planetariums, space agencies, and private astronomical research institutions. I'm currently based in a small town in the southeastern United States. In my free time, I enjoy studying the works of 19th-century astronomers and practicing the art of cartography.
Now, write a short, personal introduction for the same character. When I'm not charting stars, you can find me stargazing in the quiet hours of the morning,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Here are a few examples of how you might expand on this statement:
Paris is often called the City of Light.
The city is a major center of fashion and art.
Paris is home to some of the world’s most famous landmarks, including the Eiffel Tower and the Louvre.
The city has a rich history dating back to the Middle Ages.
Paris is a popular tourist destination, attracting millions of visitors each year.
Here is the statement with a few supporting details:
The capital of France is Paris, a city known as the City of Light due to its rich cultural and artistic heritage. It is a major center of fashion

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  going to be shaped by several factors, including technological advancements, social, political, and economic developments, as well as the ways in which humans interact with AI. Here are some potential future trends in AI:
Increased use of Explainable AI (XAI) and Transparency: As AI becomes more pervasive, there will be a growing need to explain AI decisions and outcomes, particularly in high-stakes applications such as healthcare, finance, and law enforcement. XAI will help build trust in AI systems and enable humans to understand the reasoning behind AI-driven decisions.
More widespread adoption of Edge AI: As the Internet of Things (IoT) continues to


In [6]:
llm.shutdown()