# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

# Only needed when this notebook runs in CI.
if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Robert G. Davis, and I am the Mayor of Birmingham, Alabama. I was born and raised in this city and have always been proud of its rich history, vibrant culture, and resilient people. As Mayor, I am committed to working tirelessly to ensure that Birmingham continues to thrive and grow, while preserving its unique character and charm.
I am a graduate of the University of Alabama at Birmingham, where I earned a degree in Business Administration. After college, I worked in the private sector for several years before entering public service. My first elected office was as a member of the Birmingham City Council, where I served for six years before being elected
Prompt: The president of the United States is
Generated text:  under attack, literally and figuratively.
The attack is coming from a combination of factors including a growing economy, a strong stock market, and record low unemployment numbers. The irony is that these are all positive indicat
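
For larger workloads it can be convenient to submit prompts in fixed-size groups and keep each prompt paired with its completion. The sketch below assumes, as in the cell above, that `generate` returns one dict per prompt with a `"text"` key. The names `chunked` and `run_in_batches` and the default batch size of 2 are hypothetical, chosen only for illustration. Manual chunking is optional here and mainly helps with bookkeeping or with bounding memory on the caller's side.

```python
from typing import Any, Callable, Dict, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start : start + size]


def run_in_batches(
    generate: Callable[[List[str], Dict[str, Any]], List[Dict[str, Any]]],
    prompts: List[str],
    sampling_params: Dict[str, Any],
    batch_size: int = 2,  # illustrative value, not a recommendation
) -> List[Dict[str, str]]:
    """Call `generate` once per batch and pair each prompt with its text."""
    records: List[Dict[str, str]] = []
    for batch in chunked(prompts, batch_size):
        outputs = generate(batch, sampling_params)
        for prompt, output in zip(batch, outputs):
            records.append({"prompt": prompt, "text": output["text"]})
    return records
```

With the engine from this notebook, the helper would be called as `run_in_batches(llm.generate, prompts, sampling_params)`.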

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy reading, hiking, and trying new foods. I'm currently working on a novel and experimenting with different writing styles. I'm looking forward to meeting new people and learning about their experiences.
This self-introduction is neutral because it doesn't reveal any personal opinions or biases. It simply states the character's name, age, occupation, and interests. It also mentions a current project and a goal, which gives a glimpse into the character's personality and motivations. The tone is friendly and open, which is suitable for a self-introduction.
Here are a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. The city is also known for its romantic atmosphere and is often referred to as the City of Light. Paris is a popular tourist destination and is home to many international organizations, including the United Nations Educational, Scientific and Cultural Organization (UNESCO) and the Organisation for Economic Co-operation and Development (OECD

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is likely to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems may be able to analyze medical images, identify patterns in patient data, and provide personalized treatment recommendations.
2. Rise of explainable AI: As AI becomes more pervasive, there is a growing need for transparency and accountability in AI decision-making. Explainable AI (XAI) aims to provide insights into how AI models make decisions
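
The internals of `stream_and_merge` are not shown in this document, so the following is only an assumption about what "overlap removal" could look like: when a streamed chunk starts with text that already ends the accumulated output, that repeated prefix is dropped before appending. The function names `trim_overlap` and `merge_chunks` are used here for illustration.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of `new_chunk` that already ends `existing_text`."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing text repeated at chunk boundaries."""
    final_text = ""
    for chunk in chunks:
        final_text += trim_overlap(final_text, chunk)
    return final_text


if __name__ == "__main__":
    print(merge_chunks(["Hello", "lo wor", "world"]))  # prints "Hello world"
```

A heuristic like this cannot tell an accidental overlap from a genuine repetition, so a chunk such as `"ha"` following `"ha"` would be dropped entirely. It is only appropriate when chunks are known to overlap.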



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Aurora E. Wynter. I'm a junior at Oakwood High School. I'm studying English literature and creative writing, which I find fascinating. When I'm not attending classes, you can find me reading a book, watching old movies, or simply observing the world around me. I have a strong interest in poetry and short stories, and I'm working on publishing my own writing pieces. That's a bit about me.
What literary devices does this self-introduction employ?
The self-introduction employs several literary devices to convey the character's personality and interests. Here are some examples:
1. **Imagery**: The introduction

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is located at the heart of the country in the Île-de-France region.
Provide a concise factual statement about France’s capital city.
The capital 
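
Because `async_generate` is awaitable, several prompt groups can be submitted from one coroutine with `asyncio.gather`. The sketch below uses a stand-in class, `EchoEngine`, so that it runs without a GPU. It is hypothetical and only mimics the call shape used above (a list of prompts in, a list of dicts with a `"text"` key out). Whether concurrent calls improve throughput on the real engine is an assumption that should be measured.

```python
import asyncio
from typing import Any, Dict, List


class EchoEngine:
    """Stand-in for the real engine (hypothetical); returns canned text."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        await asyncio.sleep(0)  # yield control, as a real awaitable call would
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def generate_groups(
    engine: Any, groups: List[List[str]], sampling_params: Dict[str, Any]
) -> List[List[Dict[str, str]]]:
    """Submit every prompt group concurrently; results keep the group order."""
    tasks = [engine.async_generate(group, sampling_params) for group in groups]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    groups = [
        ["Hello, my name is"],
        ["The capital of France is", "The future of AI is"],
    ]
    results = asyncio.run(generate_groups(EchoEngine(), groups, {"temperature": 0.8}))
    for group, outputs in zip(groups, results):
        for prompt, output in zip(group, outputs):
            print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

`asyncio.gather` returns results in the order the awaitables were passed, so each output list lines up with its prompt group.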

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Lena Grant. I work as a freelance writer and part-time librarian at a small town library. I'm a quiet and observant person who enjoys reading and research. I'm currently living in the town of Willow Creek, where I've been for a few years. I'm not particularly interested in social activities, but I do enjoy quiet evenings with a good book. I'm looking to connect with like-minded people who share my interests. That's me in a nutshell. I'm looking for someone who is respectful, kind, and open-minded. I value honesty and integrity in relationships. I'm not looking for anything too serious or complicated at

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is situated along the Seine River. Paris is a major cultural and economic hub in Europe. Famous for its art, architecture, cuisine, fashion, and romantic atmosphere, Paris is one of the world's most popular tourist destinations. The city has a rich history dating back to the Gallic era and has been a major center of learning, art, and science throughout its history.
Provide a concise factual statement about France’s population. The population of France is approximately 67 million people, according to the 2020 estimates. France is the third most populous country in the European Union, after Germany and the United Kingdom. The

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a subject of much speculation and debate. However, based on current developments and trends, here are some possible future trends in artificial intelligence:
1. **Increased Adoption in Various Industries**: AI will continue to be adopted in various industries such as healthcare, finance, education, and transportation. This will lead to increased efficiency, productivity, and innovation in these sectors.
2. **Advancements in Deep Learning**: Deep learning, a type of machine learning, will continue to improve and be used in more applications. This will enable AI systems to learn from data and make decisions more accurately.
3. **Growing Importance of Explainability and Transparency
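
When consuming an asynchronous stream, it is often useful to display chunks as they arrive and also keep the complete text. The sketch below shows that pattern with a stand-in async generator, `fake_stream`, which is hypothetical and exists only so the example runs without the engine. `print_and_collect` is likewise an illustrative name.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(tokens: List[str]) -> AsyncIterator[str]:
    """Stand-in async generator that yields text chunks one at a time."""
    for token in tokens:
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield token


async def print_and_collect(chunks: AsyncIterator[str]) -> str:
    """Print each chunk on the same line as it arrives and return the full text."""
    parts: List[str] = []
    async for chunk in chunks:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


if __name__ == "__main__":
    text = asyncio.run(print_and_collect(fake_stream([" Paris", ".", " It", " is"])))
```

Inside `main()` above, the same helper could wrap the real stream: `text = await print_and_collect(async_stream_and_merge(llm, prompt, sampling_params))`.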




In [6]:
llm.shutdown()
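
In a script, an exception raised during generation would skip the final `llm.shutdown()` call. One way to guard against that is a small context manager that always shuts the engine down. `managed_engine` below is a hypothetical helper, not part of SGLang. It assumes only that the object returned by the factory has a `shutdown()` method, as the engine in this notebook does.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def managed_engine(factory: Callable[[], Any]) -> Iterator[Any]:
    """Create an engine via `factory` and call `shutdown()` on exit, even on error."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()
```

With the engine from this notebook, usage would look like `with managed_engine(lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")) as llm:` followed by the generation calls in the `with` body.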