# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

# This import only runs in a CI environment; it is skipped otherwise.
if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Mia and I am a young and ambitious artist from the United States. I specialize in creating beautiful and realistic acrylic paintings. I have a passion for capturing the beauty of the natural world and I am always looking for new ways to express my creativity.
My artistic journey began when I was a child, drawing and painting with my mother. As I grew older, my interest in art only deepened and I began to explore different mediums and techniques. I have been inspired by the works of the Old Masters, as well as contemporary artists like Andrew Wyeth and Mark Rothko.
In addition to my love of art, I am also a passionate advocate
Prompt: The president of the United States is
Generated text:  a key position, but it is also a unique and fleeting experience. No matter how long the president serves, it is a temporary position and one that has a profound impact on the lives of the people of the United States and the rest of the world. As a result, it i
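
The `sampling_params` dictionary in the cell above sets `temperature` and `top_p`. As a rough illustration of what these two settings conventionally mean (temperature scaling followed by nucleus filtering), here is a minimal pure-Python sketch. The function name `top_p_filter` and the toy logits are hypothetical, and this is not SGLang's sampling code.

```python
import math


def top_p_filter(logits, temperature=0.8, top_p=0.95):
    """Return {token: probability} after temperature scaling and top-p filtering."""
    # Temperature scaling: values below 1 sharpen the distribution,
    # values above 1 flatten it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}

    # Numerically stable softmax.
    max_logit = max(scaled.values())
    exps = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Keep the most likely tokens until their cumulative probability reaches top_p.
    kept = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    kept_total = sum(kept.values())
    return {tok: p / kept_total for tok, p in kept.items()}


# Toy logits for a made-up next-token distribution.
toy_logits = {"Paris": 3.0, "Lyon": 2.0, "Nice": 1.0, "banana": -3.0}
filtered = top_p_filter(toy_logits, temperature=0.8, top_p=0.95)
print(filtered)
```

With these toy values the very unlikely token is cut off, and a sampler would then draw from the remaining candidates. Lowering `temperature` or `top_p` narrows the candidate set further, which is why the streaming examples later in this document use more conservative values.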

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student who enjoys reading and playing the guitar. I'm a bit of a bookworm and like to spend my free time curled up with a good novel. When I'm not studying or reading, you can find me strumming my guitar or trying out new chord progressions. I'm a bit of a quiet and introspective person, but I'm always up for a good conversation. I'm a junior in high school, and I'm looking forward to the rest of the year. That's me in a nutshell. What do you think? Is there anything you'd like

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country. It is situated on the Seine River. The city is known for its beautiful architecture, art museums, and fashion industry. Paris is a popular tourist destination. It is home to many famous landmarks, including the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. The city has a rich history and culture, and it is considered one of the most romantic cities in the world. Paris is also a major economic and cultural center, and it is home to many international organizations and institutions. The city has a population of over 2.1 million people,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems will be able to analyze large amounts of medical data, identify patterns, and make predictions about patient outcomes.
2. Rise of Explainable AI (XAI): As AI becomes more pervasive, there is a growing need for transparency and explainability in AI decision-making. XAI will enable developers to understand how AI models make decisions
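
The cell above relies on `stream_and_merge` from `sglang.utils` to merge streamed chunks and drop repeated text. The idea of overlap removal can be sketched in plain Python as follows. The names `trim_overlap_sketch` and `merge_chunks` are hypothetical, the chunks are made up, and this is not claimed to be SGLang's actual implementation.

```python
def trim_overlap_sketch(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks while skipping overlapping text."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


# Made-up chunks in which each one repeats the tail of the previous one.
chunks = [" Paris is", " is the capital", " capital of France."]
merged_text = merge_chunks(chunks)
print(merged_text)
```

One caveat of this naive approach is that it cannot tell an accidental overlap from text the model genuinely repeated, so a legitimate repetition at a chunk boundary would also be trimmed.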



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Alice Winston, and I'm a 24-year-old software engineer who recently moved to Seattle. I've been working for a local tech firm for about two years, and I'm still getting used to the Pacific Northwest's rainy climate. Outside of work, I enjoy hiking and trying out new coffee shops.
The tone should be friendly, but not overly enthusiastic. Here's an example of how you might expand on this introduction:
Hello, my name is Alice Winston, and I'm a 24-year-old software engineer who recently moved to Seattle. I've been working for a local tech firm for about two years, and I'm still getting

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.  Paris is the largest city in France, home to more than 2.1 million people within its administrative limits. The city is situated in the north-central part of the country and 
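
One reason to use the asynchronous API is that several requests can be awaited concurrently, for example when they arrive at different times rather than as one batch. The sketch below shows the pattern with `asyncio.gather`. To stay runnable without a GPU, it uses a hypothetical `FakeEngine` stand-in whose `async_generate` only mimics the call shape used above (a prompt and a sampling-parameter dict in, a dict with a `text` key out). Passing the real `llm` in its place is an assumption and has not been verified here.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine so the sketch runs anywhere."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # Simulate inference latency.
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    # Create one coroutine per prompt and await them together.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather returns results in the same order as the input coroutines.
    return await asyncio.gather(*tasks)


prompts = ["Hello, my name is", "The capital of France is"]
outputs = asyncio.run(
    generate_concurrently(FakeEngine(), prompts, {"temperature": 0.8, "top_p": 0.95})
)
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

When all prompts are known up front, passing the whole list in a single call, as in the cell above, is simpler and leaves the batching to the engine.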

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # async_stream_and_merge removes overlaps between chunks and yields
        # each cleaned chunk as soon as it is available
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Lena Grant. I'm a 25-year-old freelance writer living in New York City. I enjoy reading, hiking, and trying out new restaurants. That's me in a nutshell. How can I help you today?
This self-introduction is short, neutral, and gets straight to the point. It provides some basic information about the character, Lena Grant, and her interests, without revealing too much about her personality, motivations, or backstory. The phrase "That's me in a nutshell" is a bit informal, but it suggests that Lena is down-to-earth and relatable. The final sentence, "How can I help you today

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Provide a concise factual statement about a city in Canada. The capital of Canada is Ottawa. Provide a concise factual statement about a city in Australia. The capital of Australia is Canberra. Provide a concise factual statement about a city in India. The capital of India is New Delhi. Provide a concise factual statement about a city in Brazil. The capital of Brazil is Brasília. Provide a concise factual statement about a city in Russia. The capital of Russia is Moscow. Provide a concise factual statement about a city in China. The capital of China is Beijing. Provide a concise factual statement about a city in South Africa. The capital of South

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a topic that generates a lot of excitement and concern. Here are some possible future trends in artificial intelligence:
1. Increased Automation: AI will continue to automate more tasks and processes, leading to increased productivity and efficiency. This may lead to job displacement, but it may also create new job opportunities in areas such as AI development and maintenance.
2. Advancements in Natural Language Processing: AI will become more capable of understanding and generating human language, leading to more effective communication between humans and machines.
3. Integration with the Internet of Things: AI will be integrated with the Internet of Things (IoT) to create a more connected and automated world


In [6]:
llm.shutdown()