# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## SPECIAL WARNING!!!!

**To launch the offline engine in your Python scripts, an** `if __name__ == "__main__":` **guard is necessary, since we use** `spawn` **mode to create subprocesses. Please refer to this simple example**:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py
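
The shape of such a script can be sketched as follows. This is a minimal illustration rather than a copy of the linked example: it reuses only the `sgl.Engine`, `generate`, and `shutdown` calls shown elsewhere in this document, and the `MODEL_PATH` constant, the lazy import, and the `importlib.util.find_spec` availability check are assumptions added so that the file can be run even where `sglang` is not installed.

```python
import importlib.util

# Same model as used in the cells of this document.
MODEL_PATH = "meta-llama/Meta-Llama-3.1-8B-Instruct"


def main():
    # Imported lazily (an assumption of this sketch) so the module can be
    # imported by spawned subprocesses without side effects.
    import sglang as sgl

    llm = sgl.Engine(model_path=MODEL_PATH)
    try:
        outputs = llm.generate(
            ["The capital of France is"],
            {"temperature": 0.8, "top_p": 0.95},
        )
        for output in outputs:
            print(output["text"])
    finally:
        # Release the engine even if generation raises.
        llm.shutdown()


# With the `spawn` start method, child processes re-import this file.
# Without the guard, each child would try to launch an engine again.
if __name__ == "__main__":
    if importlib.util.find_spec("sglang") is None:
        print("sglang is not installed; skipping the engine launch")
    else:
        main()
```

The key point is that engine creation happens only inside `main()`, which is reached only when the file is executed directly.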

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Artie. I'm a 6 year old yellow Labrador Retriever. I'm very excited to be starting my training at Hounds Hounds Hounds! My hooman (mom) says I'm a smart boy and I'll do great. I promise to listen and learn. I love treats and belly rubs, so I hope there are plenty of those coming my way. My favorite thing to do is play fetch, so I'm hoping we can work on that some during my training. I'm a little nervous, but I know I'll make lots of new friends and have a blast. See you soon!
This
Prompt: The president of the United States is
Generated text: , arguably, the most powerful person in the world. But how did we get to this point?
The framers of the US Constitution, who wrote the document in 1787, were a group of wise and forward-thinking individuals. They included George Washington, James Madison, and Benjamin Franklin, among others. These men were guided by a vision of a new kind of government that would balance power, protect individual rights, a
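
As the cell above shows, `generate` returns one result per prompt, and each result exposes the generated text under the `"text"` key. A small post-processing helper can make that pairing explicit. The sketch below is illustrative only: `format_results` is a hypothetical helper, and the example outputs are hand-written stand-ins shaped like the engine's results, not real model output.

```python
def format_results(prompts, outputs):
    """Pair each prompt with its generated text, preserving order."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    return [
        {"prompt": prompt, "text": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


# Hand-written stand-ins with the same shape as the engine's results.
example_prompts = ["The capital of France is", "Hello, my name is"]
example_outputs = [{"text": " Paris."}, {"text": " Artie."}]

rows = format_results(example_prompts, example_outputs)
for row in rows:
    print(f"Prompt: {row['prompt']}\nGenerated text: {row['text']}")
```

Checking the lengths before zipping avoids silently dropping results, since `zip` stops at the shorter sequence.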

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student. I'm a bit of a bookworm and enjoy reading about history and science. I'm also a member of the school's debate team and have a passion for public speaking. I'm a bit shy, but I'm working on being more outgoing and confident. I'm a junior, so I'm looking forward to the challenges and opportunities that come with this year. That's me in a nutshell! What do you think? Is there anything you'd like to add or change?
Here are a few suggestions to make your self-introduction more engaging and effective:
1.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre Dame Cathedral. The city is also known for its romantic atmosphere and is a popular tourist destination. Paris is the center of France’s government, economy, and culture, and is considered one of the most beautiful and iconic cities in the world. The city has a population of over 2.1 million people and is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Some experts predict that AI will become increasingly integrated into our daily lives, while others warn of the potential risks and challenges associated with its development. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with the potential to improve patient outcomes and reduce healthcare costs.
2. Rise of autonomous vehicles: Autonomous vehicles are already being tested on public roads, and it's likely that they
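
The "overlap removal" mentioned in the cell above refers to merging streamed chunks so that text repeated between consecutive chunks is not emitted twice. The idea can be sketched in plain Python as below. This is an illustration of the technique under the assumption that a chunk may begin with text that was already received; it is not necessarily how `stream_and_merge` is implemented, and `trim_overlap` and `merge_chunks` are names chosen for this sketch.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that repeats the end of existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first and shrink until one matches.
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hand-written chunks in which each one repeats the tail of the previous text.
merged_text = merge_chunks([" Paris", " Paris is", " is the capital"])
print(merged_text)
```

One limitation of this approach is that a genuine repetition in the model's output (for example, the same word generated twice in a row) is indistinguishable from an overlap and would also be trimmed.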



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaida, and I'm a 17-year-old high school student. I like to draw and listen to music, and I'm trying to figure out my place in the world. I'm not really sure what I want to do with my life yet, but I'm taking classes in art, music, and English in hopes of finding something that sparks my passion.
This self-introduction is neutral because it doesn't reveal any personal biases or opinions, and it focuses on Kaida's basic interests and goals. It doesn't try to impress or manipulate the reader, but rather provides a straightforward introduction to Kaida as a person.
Start

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The Eiffel Tower is a famous landmark located in which city? The Eiffel Tower is a famous landmark located in Paris.
Paris is known for its fashion. What other industry is Paris known for? P
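
Because `async_generate` is awaited, other coroutines can make progress while the batch is being generated. The sketch below illustrates this with `asyncio.gather`. `StubEngine` is a hypothetical stand-in that only mimics the `async_generate(prompts, sampling_params)` call shape used above, so that the example runs without a model; `heartbeat` and `run_batch` are likewise names invented for this illustration.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in mimicking the async_generate call shape."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to wait for the model
        return [{"text": f" <completion for: {prompt}>"} for prompt in prompts]


async def heartbeat(log):
    """Unrelated work that keeps running while generation is pending."""
    for tick in range(3):
        log.append(f"tick {tick}")
        await asyncio.sleep(0.001)


async def run_batch(engine, prompts, sampling_params):
    log = []
    # Both coroutines are scheduled on the same event loop.
    outputs, _ = await asyncio.gather(
        engine.async_generate(prompts, sampling_params),
        heartbeat(log),
    )
    return outputs, log


outputs, log = asyncio.run(
    run_batch(StubEngine(), ["Hello, my name is", "The future of AI is"], {"temperature": 0.8})
)
for output in outputs:
    print(output["text"])
print(log)
```

With a real engine, `StubEngine()` would be replaced by the `llm` object created earlier; the rest of the structure would stay the same under the assumptions stated above.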

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaida Yamato. I'm a 17-year-old high school student living in Tokyo. I have short, spiky black hair and brown eyes. I'm a bit of an introvert, but I'm passionate about playing guitar and listening to music. That's a little about me.
Next, I will write a short, descriptive self-introduction for the same character, with a focus on their personality and interests. Hi, I'm Kaida Yamato, a Tokyo teenager with a passion for music and a quiet, introspective nature. My love for playing the guitar is a source of comfort and creativity for me, and I



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The city is located on the Seine River in northern central France. It is the largest city in France and is known for its historical landmarks, museums, and cultural attractions.
Develop a thesis statement that can be supported with evidence from the city. Paris, the capital of France, is a cultural and historical hub that offers a unique blend of art, architecture, and cuisine, making it an attractive destination for tourists and a source of inspiration for artists and intellectuals.
Discuss three historical landmarks that contribute to Paris’s cultural and historical significance. The Eiffel Tower, constructed for the 1889 World’s Fair, is an iconic symbol



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by rapid advancements in areas such as:
*   **Edge AI**: Edge AI involves processing and analyzing data at the edge of the network, where it is generated, rather than in the cloud or a central data center. This approach reduces latency and improves real-time decision-making. Future AI systems may rely more heavily on edge AI to enable faster and more efficient processing of data.
*   **Explainability and Transparency**: As AI becomes increasingly prevalent in critical applications, there is a growing need for explainability and transparency in AI decision-making processes. Future AI systems may incorporate advanced techniques to provide clear and interpretable explanations for
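
In the streaming cell above, each chunk is printed with `end=""` and `flush=True`, so the pieces join into continuous text as they arrive. The same consumption pattern can be sketched without a model, as below. `stub_stream` is a hypothetical async generator standing in for the engine's stream, and `collect_stream` is a helper name chosen for this illustration.

```python
import asyncio


async def stub_stream(chunks):
    """Hypothetical stand-in for a streaming generator of text chunks."""
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield chunk


async def collect_stream(stream, on_chunk=None):
    """Consume an async stream, optionally reporting each chunk as it arrives."""
    parts = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)


def show(chunk):
    # Print without a trailing newline so chunks form one continuous line.
    print(chunk, end="", flush=True)


full_text = asyncio.run(
    collect_stream(stub_stream([" Paris", ".", " The", " city"]), on_chunk=show)
)
print()  # final newline after the stream ends
```

Collecting the chunks in a list and joining once at the end avoids repeated string concatenation while still allowing incremental display through the callback.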




In [6]:
llm.shutdown()