# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl

from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Ixchel, and I will be your guide to the mystical world of Mayan astrology. In this article, we will explore the different types of Mayan astrology readings, and how they can help you navigate your life’s journey.
Mayan astrology, also known as the Tzolkin, is an ancient system of divination that originated in the Yucatan Peninsula in Mexico. It is a complex system that uses a 260-day cycle to predict future events and offer guidance on personal growth and spiritual evolution.
There are several types of Mayan astrology readings that you can choose from, depending on your interests and needs. Here
Prompt: The president of the United States is
Generated text:  the head of the executive branch and the commander-in-chief of the armed forces. He is also the leader of the federal government and the chief representative of the nation.
The president is elected by the Electoral College, which consists of 538 electors who are chosen by each state to cast

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student. I'm a bit of a bookworm and enjoy reading about history and science. I'm also a member of the school's debate team. I like to think I'm pretty laid-back and easy-going, but I can get pretty passionate about the topics I care about. I'm not really sure what I want to do with my life yet, but I'm open to exploring different options. I'm a bit of a introvert, but I'm working on being more outgoing and confident. I'm excited to see what the future holds.
This self-introduction is neutral

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris.
The capital of France is Paris. This is a concise factual statement about France’s capital city. It provides a clear and direct answer to the question, without any additional information or context. It is a simple and straightforward statement that can be used as a starting point for further discussion or research. The statement is also accurate and reliable, as it is a widely accepted fact about France’s capital city. Overall, this statement is a good example of a concise factual statement. The statement is also in the present tense, which is suitable for a factual statement. The use of the present tense implies that the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in artificial intelligence:
1. Increased Adoption of AI in Various Industries: AI is expected to become more pervasive across various industries, including healthcare, finance, education, and transportation. This will lead to increased efficiency, productivity, and innovation.
2. Advancements in Natural Language Processing (NLP): NLP is a key area of AI research, and future advancements in this field will enable more sophisticated human-computer interactions, such as conversational interfaces and language translation.
3. Rise of Explainable AI (XAI
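
The "overlap removal" mentioned in the cell above can be pictured with a small, library-independent sketch. It assumes that consecutive streamed chunks may repeat the tail of the text received so far. The helper names `drop_overlap` and `merge_chunks` are hypothetical and are not part of SGLang; this is an illustration of the idea, not the implementation behind `stream_and_merge`.

```python
def drop_overlap(existing: str, chunk: str) -> str:
    """Return `chunk` without the longest prefix that already ends `existing`."""
    max_len = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that repeats the end of the merged result."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each piece repeats part of the previous one.
print(merge_chunks([" Paris", " Paris is", " is the capital"]))
```

A suffix/prefix match like this can also drop text that the model repeats on purpose, so it is only appropriate when chunks are known to overlap.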



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Jasmine Everett, and I am a 25-year-old administrative assistant. I have been working at the Smith Law Firm for five years and have gained a wealth of knowledge in office management and customer service. In my free time, I enjoy reading, hiking, and practicing yoga. I am a hard worker and am committed to delivering high-quality results in all that I do. I am excited to meet new people and build relationships. 
Here are a few things to note about Jasmine's self-introduction:
It's brief and to the point, making it easy to read and understand.
It provides relevant details about her work experience and skills.
It

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  located in the northeastern part of the country. The city is built on two islands in the Seine River and its banks. The city’s population is over 2.1 milli
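
The cell above pairs each prompt with its output using `zip`, which relies on results coming back in input order. The same pairing pattern can be sketched without a model by using `asyncio.gather`, which returns results in the order the awaitables were passed, regardless of which finishes first. `fake_generate` below is a hypothetical stand-in for a model call and is not an SGLang function.

```python
import asyncio


async def fake_generate(prompt: str, delay: float) -> dict:
    """Hypothetical stand-in for a model call; sleeps, then returns a dict with a 'text' key."""
    await asyncio.sleep(delay)
    return {"text": f" <completion for {prompt!r}>"}


async def generate_batch(prompts, delays):
    # gather preserves the order of its arguments, not the order of completion.
    return await asyncio.gather(
        *(fake_generate(p, d) for p, d in zip(prompts, delays))
    )


prompts = ["first", "second", "third"]
delays = [0.03, 0.01, 0.02]  # the second request finishes first

outputs = asyncio.run(generate_batch(prompts, delays))
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Note that `asyncio.run` raises a `RuntimeError` when called from inside an already running event loop, which is the usual situation in a notebook kernel; in that case `await generate_batch(...)` directly instead.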

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaida. I’m a 23-year-old anthropology student at a university in the Pacific Northwest. I have short, dark hair and hazel eyes. My interests include hiking, music, and exploring different cultures. That’s about it for now.
This introduction is neutral because it doesn’t reveal much about Kaida’s personality, background, or motivations. It simply presents her basic facts and interests. A neutral introduction can be helpful if you’re trying to create a character without any preconceptions or biases. It also leaves room for development and growth, as Kaida can reveal more about herself over time. What do you think? Would



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is the largest city in France and is located in the north-central part of the country.
Paris is a global hub for fashion, art, and culture, attracting millions of tourists each year.
Paris is home to some of the world’s most famous landmarks, including the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral.
The city is known for its picturesque streets, charming neighborhoods, and historic architecture, including the Montmartre and Le Marais districts.
Paris is a center for business and finance, with the Euronext stock exchange and the headquarters of many major corporations located in the city.
Paris



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain, and some trends are likely to shape the technology in the coming years.
Artificial intelligence is changing the world with each passing day. With its rapidly evolving nature, it is becoming increasingly difficult to predict what the future of AI holds. However, here are some possible trends that are likely to shape the technology in the coming years:
1. Increased adoption in various industries: AI is already being used in various industries, including healthcare, finance, and transportation. In the future, we can expect to see increased adoption of AI in other industries such as education, marketing, and manufacturing.
2. Greater emphasis on Explainability: As AI becomes
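
The streaming loop above prints each chunk as it arrives. When the full text is needed afterwards, the chunks can also be accumulated while iterating. The sketch below shows that pattern with a plain async generator; `fake_token_stream` and `collect` are hypothetical names used for illustration and stand in for a real chunk stream such as the one consumed above.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_token_stream(tokens: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming generation call: yields one chunk at a time."""
    for token in tokens:
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield token


async def collect(stream: AsyncIterator[str]) -> str:
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()
    return "".join(parts)


text = asyncio.run(collect(fake_token_stream([" Paris", ".", " Paris", " is"])))
```

Printing with `end=""` and `flush=True` keeps the streamed tokens on one line, which is how the output of the cell above is intended to read.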




In [6]:
llm.shutdown()