# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
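
As a rough illustration of what the core of such a custom server might look like, the sketch below turns a JSON request body into a JSON response. `EchoEngine` and `handle_generate` are hypothetical names introduced here; `EchoEngine` only imitates the `generate(prompts, sampling_params)` call shape and the `output["text"]` result shape used later in this document, so the code runs without a model. The linked `custom_server.py` remains the authoritative example.

```python
import json
from typing import Any, Dict, List


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        # Imitates the output shape used in this document:
        # one dict with a "text" key per prompt.
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


def handle_generate(engine: Any, body: bytes) -> bytes:
    """Parse a JSON request, call the engine, and serialize the completions."""
    request = json.loads(body)
    prompts = request["prompts"]
    sampling_params = request.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    response = {"completions": [out["text"] for out in outputs]}
    return json.dumps(response).encode("utf-8")


request_body = json.dumps(
    {"prompts": ["The capital of France is"], "sampling_params": {"temperature": 0.0}}
).encode("utf-8")
print(handle_generate(EchoEngine(), request_body).decode("utf-8"))
```

Keeping the handler independent of any web framework means the same function could be called from whichever HTTP layer you choose, with a real engine passed in place of the stand-in.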



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.65it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Tammy and I am a non-traditional student. I am going back to college at the age of 40 to get a degree in Human Services. I have 2 beautiful children, a husband, and 2 fur babies. I am excited to share my experiences as a non-traditional student and the challenges I face in my journey to becoming a college student again. I will be sharing my thoughts, feelings, and experiences with you, and I hope that you will follow along and share your thoughts with me as well.
Welcome to my blog, where I will be sharing my journey as a non-traditional college student
Prompt: The president of the United States is
Generated text:  in Hawaii for the Asia-Pacific Economic Cooperation (APEC) summit, a two-day gathering of leaders from 21 countries that account for 60% of the world's economy.
Trump arrived at Joint Base Pearl Harbor-Hickam on Saturday, bringing his daughter Ivanka and son-in-law Jared Kushner, both of whom are senior White House advisors.
Trump's
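
The `sampling_params` dictionary above sets `temperature` and `top_p`. To give a feel for what these two settings do to a next-token distribution, here is a toy sketch in plain Python. It is not SGLang's sampler: the logits are made-up numbers, and `apply_temperature` and `top_p_filter` are hypothetical helper names used only for illustration.

```python
import math
from typing import Dict


def apply_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Softmax over logits divided by the temperature (temperature must be > 0)."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


# Made-up logits for the token following "The capital of France is".
toy_logits = {" Paris": 5.0, " Lyon": 2.0, " a": 1.0, " the": 0.5}

probs = apply_temperature(toy_logits, temperature=0.8)
nucleus = top_p_filter(probs, top_p=0.95)
print(nucleus)

flatter = top_p_filter(apply_temperature(toy_logits, temperature=2.0), top_p=0.95)
print(flatter)
```

With these toy numbers, a temperature of 0.8 concentrates enough probability on the top token that it alone passes the 0.95 cutoff, while a temperature of 2.0 flattens the distribution so that all four candidates stay in play. Lower values of either setting generally make outputs more deterministic.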

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in a small town in the Pacific Northwest. I enjoy hiking, reading, and trying out new coffee shops. I'm a bit of a introvert, but I'm working on being more outgoing. I'm interested in learning more about the world and meeting new people. That's me in a nutshell.
This is a good example of a neutral self-introduction because it doesn't reveal too much about Kaida's personality, background, or motivations. It simply provides a brief overview of who she is and what she's interested in. This type of introduction is useful for a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city. The capital of France is Paris.
The capital of France is Paris. This statement is a concise factual statement about France’s capital city. It provides a clear and direct answer to the question, without any additional information or context. The statement is also accurate, as Paris has been the capital of France since 987. This statement is a good example of a concise factual statement, as it is brief, clear, and accurate. It provides a quick and easy way to answer the question about France’s capital city. The statement is also neutral and objective, without any emotional or biased

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. While it is difficult to predict exactly what the future holds, here are some possible trends that could shape the development and impact of artificial intelligence:
1. Increased Adoption in Industries: AI is already being used in various industries, such as healthcare, finance, and transportation. In the future, we can expect to see increased adoption of AI in more industries, such as education, customer service, and manufacturing.
2. Advancements in Natural Language Processing: Natural language processing (NLP) is a key area of AI research, and we can expect to see significant advancements in this area in the future. This
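
The "overlap removal" mentioned in the code above can be pictured as follows. This is a minimal sketch of one way to merge streamed chunks whose text may repeat what has already been received; it is an assumption about the general idea, not necessarily how `stream_and_merge` is implemented. `trim_overlap`, `merge_stream`, and `fake_stream` are illustrative names, and the chunk shape (`{"text": ...}`) is borrowed from the outputs shown in this document.

```python
from typing import Dict, Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks: Iterable[Dict[str, str]]) -> str:
    """Accumulate chunks into one string, skipping text that was already seen."""
    final_text = ""
    for chunk in chunks:
        final_text += trim_overlap(final_text, chunk["text"])
    return final_text


# Simulated stream in which every chunk repeats the text generated so far.
fake_stream = [{"text": " Paris"}, {"text": " Paris."}, {"text": " Paris. It is"}]
print(repr(merge_stream(fake_stream)))
```

One caveat of a suffix-matching heuristic like this: if the model legitimately emits the same text twice in a row, the second copy would be trimmed as if it were an overlap.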



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Mira. I'm a 25-year-old graphic designer who lives in a small apartment in the city. I spend most of my free time reading and sketching, and I love trying out new foods at local restaurants. I'm a bit of a homebody, but I enjoy meeting new people and exploring new places. I'm still figuring out my place in the world, but I'm excited to see what the future holds. I'm a bit of a introvert, but I'm always up for a good conversation.
Use commas to separate clauses or phrases within the sentence. Use periods to end the sentence.
Use proper grammar and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
This statement is a concise and factual answer to the question about the capital of France, which is one of the basic facts about the country. It does not include any subjective opinions, emotions, or unnecessary
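
Asynchronous generation is mainly useful when requests should overlap with other work or with each other. The sketch below shows the general `asyncio.gather` pattern for issuing several requests concurrently. `FakeAsyncEngine` is a hypothetical stand-in that only imitates an awaitable `async_generate(prompt, sampling_params)` returning a dict with a `"text"` key; with a real engine you would pass `llm` instead, and the exact return shape should be checked against the SGLang API.

```python
import asyncio
from typing import Any, Dict, List


class FakeAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; mimics only the call shape used above."""

    async def async_generate(
        self, prompt: str, sampling_params: Dict[str, Any]
    ) -> Dict[str, str]:
        await asyncio.sleep(0.01)  # pretend to wait for the model
        return {"text": f" <completion of {prompt!r}>"}


async def generate_all(
    engine: Any, prompts: List[str], sampling_params: Dict[str, Any]
) -> List[Dict[str, str]]:
    """Start one request per prompt and wait for all of them together."""
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*tasks)


results = asyncio.run(
    generate_all(FakeAsyncEngine(), ["Hello", "Bonjour"], {"temperature": 0.8})
)
for result in results:
    print(result["text"])
```

Because `asyncio.gather` preserves input order, results can still be paired with their prompts using `zip`, just as in the synchronous examples.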

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Celeste. I'm a 19-year-old student. I'm studying psychology at a university in a big city. I'm interested in human behavior and how people interact with each other.
The introduction is straightforward and factual. It doesn't reveal any personal opinions, emotions, or biases. It simply provides basic information about the character.
Celeste is a 19-year-old student who is studying psychology at a university in a big city. She is interested in human behavior and how people interact with each other. She is likely to be a curious and analytical person who is interested in understanding the complexities of human relationships and behavior.
The introduction doesn


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Next Post: What is the population of France? The population of France is approximately 67.2 million people. Next Post: What is the largest city in France? The largest city in France is Paris. Next Post: What is the official language of France? The official language of France is French. Next Post: What is the currency of France? The currency of France is the Euro. Next Post: What are the borders of France? The borders of France are the countries of Belgium, Luxembourg, Germany, Switzerland, Italy, Spain, Andorra, and Monaco. Next Post: What is the largest river in France?



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  being shaped by various factors including technological advancements, societal needs, and economic drivers. Here are some possible future trends in AI:
**Increased Adoption and Integration**: AI will become increasingly integrated into various aspects of our lives, including healthcare, transportation, education, and finance. AI-powered virtual assistants, chatbots, and smart home devices will become more common, making our lives more convenient and efficient.
**Advancements in Natural Language Processing (NLP)**: NLP will continue to improve, enabling AI systems to understand and generate human-like language. This will lead to more sophisticated chatbots, voice assistants, and language translation systems.
**Growing Importance
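
In the streaming loop above, each iteration receives a small piece of text, and printing with `end=""` is what joins those pieces into continuous output. If you also want the complete text once the stream ends, one option is to collect the pieces while printing them. The sketch below illustrates that pattern with `fake_token_stream`, a hypothetical async generator standing in for `async_stream_and_merge(llm, prompt, sampling_params)`; `print_and_collect` is likewise an illustrative name.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_token_stream(tokens: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for an async stream of already-deduplicated chunks."""
    for token in tokens:
        await asyncio.sleep(0)  # yield control, as a real stream would while waiting
        yield token


async def print_and_collect(stream: AsyncIterator[str]) -> str:
    """Print chunks as they arrive and return the full text at the end."""
    pieces: List[str] = []
    async for chunk in stream:
        print(chunk, end="", flush=True)  # no newline, so chunks join up on screen
        pieces.append(chunk)
    print()  # finish the line once the stream is exhausted
    return "".join(pieces)


full_text = asyncio.run(
    print_and_collect(fake_token_stream([" Paris", ".", "\n", "Next"]))
)
```

Appending to a list and joining once at the end avoids repeatedly rebuilding a growing string, which matters little for short outputs but is a common habit for long streams.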




In [6]:
llm.shutdown()
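
Calling `shutdown()` releases the engine's resources, so it is worth making sure the call happens even if generation raises an exception. One way to do this in a script is a small context manager, sketched below. `managed_engine` is a hypothetical helper, not part of SGLang, and `FakeEngine` is a stand-in that only records whether `shutdown()` was called so the example runs without a model.

```python
from contextlib import contextmanager
from typing import Any, Iterator


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine that records whether shutdown() ran."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def shutdown(self) -> None:
        self.is_shut_down = True


@contextmanager
def managed_engine(engine: Any) -> Iterator[Any]:
    """Yield the engine and always call its shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed_engine(engine) as llm:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.is_shut_down)
```

A plain `try` / `finally` around the generation code achieves the same effect if you prefer not to define a helper.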