# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Donnie, and I am a husband, father, and semi-retired Technology Strategist. I am passionate about helping people, especially those who may not know how to use technology to their advantage.
I was born in 1963, grew up in a small town in Pennsylvania, and went on to study computer science at the University of Pittsburgh. After college, I worked for several major corporations, including IBM and AT&T, where I helped design and implement various technology solutions. In my free time, I enjoy reading, hiking, and spending time with my family.
As a technology strategist, I have worked with numerous individuals and organizations to
Prompt: The president of the United States is
Generated text:  taking a bold step in the country’s fight against opioid addiction.
The Washington Post reports that President Trump has ordered the Department of Health and Human Services (HHS) to declare the opioid epidemic a national emergency.
The emergency declaration wou
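
The `sampling_params` dictionary above sets `temperature` and `top_p`. As a rough illustration of what these two knobs usually mean, the sketch below applies temperature scaling and nucleus (top-p) filtering to a toy set of logits. It is a generic, pure-Python illustration and not SGLang's sampler; the function names `nucleus` and `sample_token` and the toy logits are assumptions made up for this example.

```python
import math
import random


def nucleus(logits, temperature=0.8, top_p=0.95):
    """Return the (token, probability) pairs kept by top-p filtering.

    `logits` is a plain dict {token: logit}. Illustrative helper only.
    """
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # Numerically stable softmax.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Keep the smallest high-probability prefix whose mass reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    return kept


def sample_token(logits, temperature=0.8, top_p=0.95, rng=None):
    """Draw one token from the filtered distribution."""
    rng = rng or random.Random()
    tokens, weights = zip(*nucleus(logits, temperature, top_p))
    # random.choices renormalizes the weights internally.
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = {" Paris": 5.0, " Lyon": 1.0, " Nice": 0.0}
    print(nucleus(toy_logits, temperature=1.0, top_p=0.99))
    print(sample_token(toy_logits, rng=random.Random(0)))
```

Lower `temperature` and lower `top_p` both concentrate sampling on the most likely tokens, which is why the streaming examples later in this document use smaller values for more predictable text.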

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor. I live in a small apartment in the city, where I spend most of my time working on various projects and trying to stay organized. I enjoy reading, hiking, and trying out new recipes in my spare time. I'm a bit of a introvert, but I'm always up for a good conversation when I'm feeling energized. I'm currently working on a novel, and I'm excited to see where it takes me. That's me in a nutshell.
This self-introduction is neutral because it doesn't reveal any personal biases or opinions. It simply

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. The city is also known for its romantic atmosphere and is a popular tourist destination. Paris is a global center for business, finance, and culture, and is considered one of the most beautiful and iconic cities in the world. The city has a population of over 2.1 million people and is a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even larger role in healthcare, with AI-powered robots and virtual assistants helping to care for patients and improve health outcomes.
2. Widespread adoption of AI in education: AI is already being used in education to personalize learning, grade assignments, and provide feedback to students. In the future, AI is likely to become even more prevalent in education, with AI
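
The cell above relies on `stream_and_merge` from `sglang.utils` to combine streamed chunks "with overlap removal". The sketch below shows one way such overlap trimming could work on plain strings. It is an assumption about the general technique, not a copy of the library helper, and the names `trim_overlap` and `merge_chunks` are illustrative.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks while skipping repeated text."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


if __name__ == "__main__":
    # Hypothetical chunks in which each one repeats the tail of the text so far.
    print(merge_chunks([" Paris", " Paris is", " is the capital"]))
```

A suffix/prefix match like this cannot distinguish a repeated chunk from genuinely repeated output (for example, two consecutive identical tokens), so it is only appropriate when the stream is known to resend text.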



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elena Thompson. I'm a 22-year-old student at the University of California, Berkeley. I'm a junior majoring in environmental science with a minor in sustainable agriculture. I like hiking and trying new foods. What do you think? Is there anything you would add or change?
I think your introduction is clear and concise. It gives a sense of who Elena is and what she does. However, I would suggest adding a bit more personality to make the introduction more engaging. Here's a revised version:
"Hi, I'm Elena Thompson. I'm a 22-year-old junior at the University of California, Berkeley, where I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
This factual statement is just 2 words long, which makes it concise. The statement is also factual because it accurately states the capital of France, which is Paris.
Here’
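
Because `async_generate` is awaited, the calling event loop is free to run other coroutines while generation is in flight. The sketch below illustrates that pattern with `asyncio.gather` over several batches. `StubEngine` is a hypothetical stand-in that only mimics the call shape and return shape used above (a list of dicts with a `"text"` key), so the example runs without a GPU or a model; whether concurrent calls are appropriate for a real engine is an assumption to verify against the SGLang examples.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for an engine; not part of SGLang."""

    async def async_generate(self, prompts, sampling_params):
        # Simulate generation latency without blocking the event loop.
        await asyncio.sleep(0.01)
        return [{"text": f" <completion for {prompt!r}>"} for prompt in prompts]


async def run_batches(engine, batches, sampling_params):
    """Submit several prompt batches concurrently and collect results in order."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    # gather returns results in the same order as the submitted tasks.
    return await asyncio.gather(*tasks)


async def main():
    batches = [["Hello, my name is", "The capital of France is"], ["The future of AI is"]]
    results = await run_batches(StubEngine(), batches, {"temperature": 0.8, "top_p": 0.95})
    for batch, outputs in zip(batches, results):
        for prompt, output in zip(batch, outputs):
            print(f"Prompt: {prompt}\nGenerated text: {output['text']}")


if __name__ == "__main__":
    asyncio.run(main())
```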

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream via async_stream_and_merge, the overlap-aware helper, instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elianore Quasar. I'm a 25-year-old astrobiologist living in a small research station on the outskirts of a distant galaxy. My work focuses on the discovery and analysis of extraterrestrial life forms, particularly those that exist in extreme environments. When I'm not conducting experiments or analyzing data, I enjoy reading about the history of space exploration and listening to ambient electronic music. Outside of work, I'm a bit of a solitary person, preferring the company of machines and the vastness of space to that of crowds and social events. I'm currently working on a research project that aims to identify patterns in the metabolic processes


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Next Next post: What does the phrase “write to the world” mean? Provide a definition. “Write to the world” means to share your thoughts, ideas, and experiences with a large audience, either through writing, speaking, or other forms of communication, with the intention of inspiring, educating, or influencing others. It can involve writing articles, blog posts, social media content, or even creating art or music that conveys a message or perspective that resonates with a wide range of people. The phrase suggests that the writer or creator is aiming to connect with and impact a global audience, rather than a small, local


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  increasingly inevitable, and it has the potential to impact various aspects of our lives. Here are some possible future trends in AI:
1. Increased automation: AI will continue to automate routine tasks, freeing up human workers to focus on more complex and creative tasks. This could lead to increased productivity and efficiency, but also potentially displace jobs in sectors such as manufacturing and customer service.
2. Advancements in natural language processing: AI will continue to improve its ability to understand and generate human language, enabling more sophisticated chatbots, virtual assistants, and language translation systems.
3. Rise of Explainable AI: As AI becomes more prevalent, there will




In [6]:
llm.shutdown()