# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
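
To make the custom-server idea concrete, here is a minimal sketch of the request-handling core such a server might wrap around an in-process engine. Everything in it is illustrative: `EngineLike`, `EchoEngine`, `handle_generate`, and the JSON request shape are hypothetical names chosen for this sketch, not part of SGLang. The only assumption taken from this document is that `generate` accepts a list of prompts plus a sampling-parameter dict and returns a list of dicts with a `"text"` key.

```python
import json
from typing import Any, Dict, List, Protocol


class EngineLike(Protocol):
    """Hypothetical interface: anything with a batch `generate` method."""

    def generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, Any]]: ...


class EchoEngine:
    """Stand-in for a real engine so the sketch runs without a GPU or model."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" (echo) {p}"} for p in prompts]


def handle_generate(engine: EngineLike, request_body: str) -> str:
    """Turn a JSON request into a JSON response using an in-process engine."""
    request = json.loads(request_body)
    prompts = request["prompts"]
    sampling_params = request.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    return json.dumps({"texts": [output["text"] for output in outputs]})


body = json.dumps({"prompts": ["The capital of France is"]})
response = handle_generate(EchoEngine(), body)
print(response)
```

In a real deployment the handler would be called from whatever web framework or RPC layer you prefer, with the stub replaced by an actual engine instance. The linked `custom_server` script remains the authoritative example.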



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

The following error message 'operation scheduled before its operands' can be ignored.





### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Abby. I am a 6 year old girl who loves to learn and play. I live in a small town with my family. We have a big backyard with a swing set, a sandbox, and a big treehouse. I love to play on the swing set and climb up high in the treehouse. My favorite subject in school is art. I love to draw and paint pictures of my favorite things, like animals and flowers. I also love to read books about adventure and magic.
When I'm not in school, I like to play with my friends and ride my bike. We like to ride our bikes to the park and
Prompt: The president of the United States is
Generated text:  elected by the Electoral College, a system established by the Founding Fathers at the Constitutional Convention in 1787. The system was designed to balance the interests of large and small states, as well as urban and rural areas. Under the Electoral College system, each state is allocated a certain number of electoral votes based on its population, with a minimum 

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student. I'm a bit of a bookworm and enjoy reading about history and science. I'm also a member of the school's debate team and enjoy arguing about current events. I'm a bit of a perfectionist, which can sometimes make me come across as stubborn or overly critical. I'm working on balancing my desire for perfection with being more open-minded and accepting of others' opinions. I'm looking forward to meeting new people and making new friends.
This self-introduction is neutral because it doesn't reveal any personal biases or opinions. It simply states facts about Kaida

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris. 
This statement is a concise factual statement about France’s capital city. It provides a clear and direct answer to the question, without any additional information or context. The statement is also accurate, as Paris is widely recognized as the capital of France. 
Note: This response is a simple and direct answer to the question, as requested. If you would like me to provide more information or context about Paris or France, please let me know! 
Here is a more detailed response:
Paris is the capital and most populous city of France, with

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even larger role in healthcare, with the potential to revolutionize the way we diagnose and treat diseases.
2. Widespread adoption of AI in industries: AI is already being used in various industries such as finance, transportation, and customer service. In the future, AI is likely to become even more widespread, with the potential to automate many tasks and improve
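
The "overlap removal" mentioned above can be pictured as follows. When consecutive streamed chunks repeat text that was already emitted, the repeated prefix of each new chunk is dropped before appending it. The sketch below is one plausible way to do this, written for illustration only. `drop_overlap` and `merge_chunks` are hypothetical helpers, and this is not claimed to be how `stream_and_merge` is implemented internally.

```python
from typing import Iterable


def drop_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of `new_chunk` that already ends `existing_text`."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, skipping text that overlaps with what came before."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


chunks = [" Paris", " Paris is the", " the capital"]
merged_text = merge_chunks(chunks)
print(repr(merged_text))
```

One design caveat: pure suffix/prefix matching cannot tell an accidental repeat from a legitimate one, so text that genuinely repeats across a chunk boundary would also be trimmed.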



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kai. I'm a 22-year-old law student who's trying to figure out what kind of lawyer I want to be. I'm a bit of a loner, but I enjoy hiking and reading. That's me.
This text has a clear, neutral tone and is short, as requested. It gives a brief introduction to the character, including their name, age, profession (or profession in training), and a few personal interests. It also includes a brief character trait, which can be useful for readers who are trying to get a sense of the character's personality. The sentence structure is simple and easy to follow, which is

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
This response provides a brief factual answer to the question about the capital of France. It does not include any additional information or context that might be considered necessary for a more co
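
A practical reason to prefer the asynchronous call is that several requests can be awaited together from one event loop. The sketch below illustrates that pattern with `asyncio.gather`. `StubAsyncEngine` and `run_batches` are hypothetical stand-ins so the code runs without a model. The stub only mirrors the `async_generate(prompts, sampling_params)` call shape used in this section, and whether a real engine benefits from concurrent submission in this way is an assumption.

```python
import asyncio
from typing import Any, Dict, List


class StubAsyncEngine:
    """Hypothetical stand-in exposing an `async_generate` coroutine."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    """Submit every batch at once and wait for all of them together."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(run_batches(StubAsyncEngine(), batches, {"temperature": 0.8}))

for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```

`asyncio.gather` returns results in the same order as the submitted awaitables, so each result list lines up with its batch of prompts.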

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaito Nakamura, and I'm a 20-year-old college student studying computer science at the University of Tokyo. I'm a bit of a bookworm and enjoy playing strategy games in my free time.
What makes this a good self-introduction?
It's short and to the point.
It provides essential information about the character (name, age, college major, interests).
It gives a hint about the character's personality (bookworm).
It's neutral, meaning it doesn't reveal any dramatic secrets or emotions.
How might you revise this self-introduction to make it more engaging?
You could add more details about the character's



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide a statement that uses a neutral tone. Paris is located in the north-central part of France.
Provide a statement that includes a value judgment. Paris is one of the most romantic and culturally rich cities in the world.
The three statements differ in the way they present information about the capital of France, which is Paris. The first statement is a simple fact. The second statement provides some additional context. The third statement includes an opinion. The first two statements are written in a neutral tone. The third statement is written in a more subjective tone, as it expresses a value judgment. Writers may use statements that include value judgments to persuade



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and unpredictable. Here are some potential trends and predictions that could shape the AI landscape in the coming years:
1. **Increased Focus on Explainability and Transparency**: As AI becomes more pervasive, there is a growing need to understand how decisions are made and why certain outcomes occur. This trend will focus on developing AI systems that provide clear explanations for their actions, promoting trust and accountability.
2. **Advancements in Edge AI**: With the proliferation of IoT devices, edge AI will become increasingly important for processing data in real-time, reducing latency, and improving efficiency. This trend will enable more widespread adoption of AI in various industries, such as




In [6]:
llm.shutdown()
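
In a script, as opposed to a notebook, an exception raised during generation could skip the final `shutdown()` call. One way to guard against that is a small context manager that always shuts the engine down. The sketch below is illustrative: `managed_engine` and `StubEngine` are hypothetical names, and the only assumption about the real engine is that it exposes the `shutdown()` method used above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in that records whether `shutdown` was called."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call `shutdown()` afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed_engine(engine) as active_engine:
        active_engine.generate(["Hello"], {})
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.is_shut_down)
```

With a real engine you would construct it once, pass it to `managed_engine`, and run all generation calls inside the `with` block.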