# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
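
To give a rough idea of what the core of such a server might look like, the sketch below wraps an engine's `async_generate` call in a framework-agnostic JSON handler. `StubEngine`, `handle_generate`, and the request fields `prompt` and `sampling_params` are illustrative assumptions, not code from the linked example; the stub stands in for `sgl.Engine` so the sketch runs without a model.

```python
import asyncio
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a model."""

    async def async_generate(self, prompt, sampling_params):
        # Mirrors the output shape used in this document: a dict with a "text" key.
        return {"text": f" (stub completion for: {prompt})"}


async def handle_generate(engine, raw_body: bytes) -> bytes:
    """Decode a JSON request, call the engine, and encode a JSON response."""
    request = json.loads(raw_body)
    prompt = request["prompt"]
    sampling_params = request.get("sampling_params", {})
    result = await engine.async_generate(prompt, sampling_params)
    return json.dumps({"text": result["text"]}).encode("utf-8")


def demo() -> dict:
    """Run one request through the handler and return the decoded response."""
    body = json.dumps({"prompt": "The capital of France is"}).encode("utf-8")
    response = asyncio.run(handle_generate(StubEngine(), body))
    return json.loads(response)


if __name__ == "__main__":
    print(demo())
```

In a real server, a web framework's route handler would call something like `handle_generate` with the request body and a shared engine instance.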



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.20it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.79it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.46it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.30it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.37it/s]




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Kevin and I am a software engineer at a small startup in California. I have been working on a project that requires me to implement a message queue using a combination of RabbitMQ, Celery, and Redis. While this combination is quite powerful, I have encountered an issue with the Celery task queue. Specifically, I am experiencing high memory usage and slow performance when handling large volumes of tasks.
To better understand the issue and provide a solution, let's break down the key components involved:

1.  **RabbitMQ**: It is a message broker that will handle the message queue. It allows producers to send messages and consumers to consume
Prompt: The president of the United States is
Generated text:  often considered the most powerful person in the world. As the head of state and head of government, the president has a wide range of powers and responsibilities. Some of the key powers and responsibilities of the president include:
Executive Po
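
In the cell above, each element of `outputs` is a dict whose `"text"` key holds the completion, and outputs are paired with prompts by position. For offline batch jobs it is common to persist such pairs; the following is a minimal sketch that turns them into JSON Lines. The helper names `to_records` and `to_jsonl` and the record fields are hypothetical, and the `outputs` list here is hand-written rather than produced by the engine.

```python
import json


def to_records(prompts, outputs):
    """Pair each prompt with its generated text, checking that lengths match."""
    if len(prompts) != len(outputs):
        raise ValueError("expected one output per prompt")
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


def to_jsonl(records):
    """Serialize records as JSON Lines, one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


# Hand-written stand-ins for the list returned by llm.generate(...).
prompts = ["Hello, my name is", "The capital of France is"]
outputs = [{"text": " Kevin."}, {"text": " Paris."}]

records = to_records(prompts, outputs)
print(to_jsonl(records))
```

The explicit length check guards against `zip` silently dropping items if the two lists ever get out of step.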

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in a small town in the Pacific Northwest. I enjoy hiking, reading, and trying out new coffee shops. I'm currently working on a novel and trying to get my writing career off the ground. That's me in a nutshell. What do you think? Is it too short? Too long? Too boring?
This is a good start. It's concise and gives a sense of who Kaida is and what she does. However, it's a bit too straightforward and lacks a bit of personality. Consider adding a few more details that reveal Kaida's interests, values

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris.  Paris is the capital and most populous city of France, with an area of 1,080 square kilometers (415 square miles). It is situated in the northern part of the country, along the Seine River. Paris is known for its iconic landmarks, such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum, which houses the Mona Lisa. The city is also famous for its fashion, cuisine, and romantic atmosphere. Paris is a major cultural, economic, and intellectual center, and it attracts millions

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in AI:
1. Increased Adoption of Edge AI: Edge AI refers to the processing of AI algorithms at the edge of the network, closer to the source of the data. This trend is expected to continue as more devices become connected to the internet and the need for real-time processing increases.
2. Rise of Explainable AI: As AI becomes more pervasive, there is a growing need to understand how AI systems make decisions. Explainable AI (XAI) is a subfield of AI that focuses on developing AI
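
The banner printed by the cell above mentions overlap removal. One way such merging can work is to drop, from each new chunk, the longest prefix that already appears at the end of the text accumulated so far. The sketch below illustrates that idea with hand-written chunks; it is an assumption about the general technique, not necessarily how `stream_and_merge` is implemented, and `merge_chunks` is a hypothetical helper.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the largest candidate overlap first so the longest match wins.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hand-written chunks whose boundaries overlap (" is", " capital").
chunks = [" Paris is", " is the capital", " capital of France."]
print(merge_chunks(chunks))
```

A suffix/prefix heuristic like this cannot tell an accidental overlap from a deliberate repetition (for example `"ha"` followed by `"ha"`), so it suits streams that are known to resend trailing text.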



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Faye Everwood. I'm a 22-year-old writer and artist, currently residing in the small town of Willow Creek. I spend most of my free time honing my craft, working on various creative projects, and exploring the beautiful countryside around me. I'm interested in learning more about the world and the people in it, and I'm always looking for new sources of inspiration.

## Step 1: Identify the key elements to include in a self-introduction.
The key elements to include in a self-introduction are the person's name, age, occupation or interests, and any relevant details about their current situation.

## Step

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Provide a concise factual statement about France’s second-largest city. The second-largest city in France is Marseille. Provide a concise factual statement a
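
The cell above awaits a single batched call. Because `async_generate` is awaited, it can also be combined with other coroutines; the sketch below sends one request per prompt and collects the results with `asyncio.gather`. `StubEngine` is a hypothetical stand-in that sleeps instead of running a model, and `generate_each` is an illustrative helper, so this shows only the calling pattern, not real engine behavior.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; sleeps instead of running a model."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate time spent generating
        return {"text": f" [completion of '{prompt}']"}


async def generate_each(engine, prompts, sampling_params):
    """Issue one request per prompt and wait for all of them to finish."""
    requests = (engine.async_generate(p, sampling_params) for p in prompts)
    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*requests)


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
outputs = asyncio.run(
    generate_each(StubEngine(), prompts, {"temperature": 0.8, "top_p": 0.95})
)
for prompt, output in zip(prompts, outputs):
    print(f"{prompt!r} -> {output['text']!r}")
```

Since `gather` preserves input order, the results can still be zipped with `prompts` exactly as in the batched version.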

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through async_stream_and_merge, the overlap-aware helper, instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Ava Wong. I'm a 20-year-old anthropology major at the University of Colorado Boulder. I'm originally from a small town in California, but I've lived in several different places, including Hawaii and Texas. I'm interested in learning about different cultures and how they impact the way people live their lives. I'm also really passionate about music and enjoy playing the guitar and singing in my free time. I'm looking forward to meeting new people and learning from them. How would you describe Ava Wong? What do you think her personality is like? What are her values and interests?
Ava Wong appears to be a young adult in her

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The Eiffel Tower was the tallest building in the world when it was constructed in the late 19th century. What were some of the key features of this iconic structure? The Eiffel Tower features a lattice-like iron framework, four main pillars, and an observation deck at the top, which was 300 meters above ground level at the time of its completion in 1889.
What is the primary source of income for the famous artist Claude Monet? Claude Monet was a painter, and his primary source of income was from selling his paintings. He gained fame and success during his lifetime, which allowed him to

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by the convergence of multiple technological advancements, societal needs, and ethical considerations. Here are some possible future trends in artificial intelligence:
1. Increased Autonomy: AI systems are expected to become more autonomous, making decisions without human intervention. This could lead to increased efficiency and productivity, but also raises concerns about accountability and transparency.
2. Explainability and Transparency: As AI becomes more pervasive, there will be a growing need for explainability and transparency in AI decision-making processes. This could involve developing techniques to interpret and understand AI-driven decisions.
3. Human-AI Collaboration: Future AI systems will likely be designed to collaborate with



In [6]:
llm.shutdown()
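
If an earlier step raises an exception, the final `llm.shutdown()` call is never reached and the engine would presumably be left running. In a standalone script, one way to guard against this is to wrap the engine in a context manager that always calls `shutdown()`. This is a sketch under assumptions: `running_engine` is a hypothetical helper, and `StubEngine` stands in for `sgl.Engine` so the example runs without a model.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def running_engine(engine):
    """Yield the engine and always call shutdown(), even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with running_engine(engine) as llm:
        llm.generate(["Hello, my name is"], {"temperature": 0.8})
        raise RuntimeError("simulated failure mid-run")
except RuntimeError:
    pass

print(engine.is_shut_down)
```

The same effect can be had with a plain `try`/`finally` around the generation code; the context manager simply makes the pairing harder to forget.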