# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
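
As a rough illustration of the custom-server idea, the sketch below puts an engine behind a plain request handler. `FakeEngine` and `handle_request` are hypothetical names invented for this example and are not part of SGLang. `FakeEngine` only mimics the `generate(prompt, sampling_params)` call and the `{"text": ...}` result shape used later in this document, so the block runs without a GPU or a model download.

```python
import json


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs anywhere."""

    def generate(self, prompt, sampling_params=None):
        # Mimics the result shape used in this document: a dict with a "text" key.
        return {"text": f" [completion for: {prompt}]"}


def handle_request(engine, body: bytes) -> tuple[int, bytes]:
    """Turn a JSON request body into a (status code, JSON response body) pair."""
    try:
        payload = json.loads(body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        # Covers malformed JSON, a missing "prompt" key, and non-object payloads.
        error = {"error": "expected a JSON object with a 'prompt' field"}
        return 400, json.dumps(error).encode()

    sampling_params = payload.get("sampling_params", {})
    output = engine.generate(prompt, sampling_params)
    return 200, json.dumps({"text": output["text"]}).encode()


engine = FakeEngine()
status, response = handle_request(engine, b'{"prompt": "The capital of France is"}')
print(status, json.loads(response)["text"])
```

Keeping the handler independent of any web framework makes it easy to test. In a real server you would presumably pass an `sgl.Engine` instead of the stand-in and call the handler from whichever framework you use.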



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Olivia and I am a Junior here at the University of Oregon. I am currently studying Public Policy and Administration. I am excited to be a part of the Eugene community and to be working with this organization. I look forward to learning more about the city and its government and to finding ways to contribute to its growth and improvement.
I have lived in the Eugene area for most of my life and have seen firsthand the ways in which the city is changing. I believe that it is essential to be aware of and engaged with the policies and decisions that affect our community. Through my work with the City Club of Eugene, I hope to gain a deeper
Prompt: The president of the United States is
Generated text:  usually considered to be the highest office in the federal government. In the United States, the president is both the head of state and the head of government, serving as the commander-in-chief of the armed forces and the leader of the executive bran
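
For offline batch jobs it is common to persist results rather than print them. The sketch below pairs each prompt with its output and writes one JSON record per line. `FakeEngine` and `write_jsonl` are hypothetical names used only for illustration. The stand-in assumes, as in the cell above, that `generate` returns one dict with a `"text"` key per prompt.

```python
import io
import json


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine; returns one dict per prompt."""

    def generate(self, prompts, sampling_params=None):
        return [{"text": f" <completion {i}>"} for i, _ in enumerate(prompts)]


def write_jsonl(engine, prompts, sampling_params, out_file):
    """Generate completions for all prompts and write one JSON record per line."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        raise ValueError("expected one output per prompt")
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        out_file.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(outputs)


buffer = io.StringIO()  # swap in open("results.jsonl", "w") to write to disk
count = write_jsonl(
    FakeEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
    buffer,
)
print(count, "records written")
print(buffer.getvalue(), end="")
```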

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor. I live in a small apartment in the city with my cat, Luna. I enjoy reading, hiking, and trying out new recipes in my spare time. I'm a bit of a introvert, but I'm always up for a good conversation when I'm feeling energized. I'm currently working on a novel and trying to build my freelance business. That's me in a nutshell.
This self-introduction is neutral because it doesn't reveal any personal biases or opinions. It simply states facts about Kaida's life and interests. This is a good approach for

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country, along the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks, including the Eiffel Tower, the Louvre Museum, and Notre Dame Cathedral. The city is also known for its romantic atmosphere and is often referred to as the City of Light. Paris is a major tourist destination and is considered one of the most beautiful and culturally significant cities in the world. The city has a population of over 2.1 million people and is a hub for business,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems can analyze medical images, identify patterns in patient data, and provide personalized treatment recommendations.
2. Widespread adoption of AI in education: AI is expected to transform the education sector, including personalized learning, adaptive assessments, and intelligent tutoring systems. AI-powered systems can analyze student performance, identify knowledge gaps, and provide targeted support
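
The banner above mentions overlap removal. The underlying idea is that a streamed chunk may repeat text that has already been received, so the merge step drops the repeated prefix before appending. The sketch below shows that idea in isolation. `trim_overlap` and `merge_chunks` are names chosen for this example, and the code is an illustration of the technique rather than a statement of how `stream_and_merge` is implemented.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first so the largest repeat is removed.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, skipping any text that repeats the end of the result so far."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hand-written chunks that partially repeat earlier text.
chunks = [" Paris", " Paris is", " is the capital", " capital of France."]
print(repr(merge_chunks(chunks)))
```

One limitation of this approach: text that legitimately repeats across a chunk boundary (for example `"ha"` followed by `"ha"`) is indistinguishable from an overlap and would be collapsed.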



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Amelia Waverley, and I'm a 24-year-old botanist with a love for climbing plants and a passion for understanding the intricacies of plant communication. I've spent most of my life studying and working in the fields of botany, and I've developed a unique perspective on the natural world.
Use the following details to create a neutral self-introduction:
Name: Eleanor Blackwood
Age: 22
Occupation: History teacher
Hobbies: Playing the violin, collecting antique maps
Personality: reserved, analytical
Here's a neutral self-introduction for the fictional character:
Hello, my name is Eleanor

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
What is the capital of France?
The capital of France is Paris. This is a very short answer and it would not provide much information about the city.
Here are some additional de
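
Because `async_generate` is awaited, it can be combined with other coroutines in the same event loop. The sketch below issues one request per prompt and waits for all of them with `asyncio.gather`. `FakeAsyncEngine` and `generate_all` are hypothetical names. The stand-in only simulates latency, so the sketch says nothing about how concurrent calls compare with a single batched call on a real engine.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in for an engine with an async_generate coroutine."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for {prompt!r}>"}


async def generate_all(engine, prompts, sampling_params):
    """Issue one request per prompt and wait for all of them concurrently."""
    requests = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather preserves input order, so results line up with prompts.
    return await asyncio.gather(*requests)


prompts = ["Hello, my name is", "The capital of France is"]
results = asyncio.run(generate_all(FakeAsyncEngine(), prompts, {"temperature": 0.8}))
for prompt, result in zip(prompts, results):
    print(f"Prompt: {prompt}\nGenerated text: {result['text']}")
```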

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Aria. I'm a skilled botanist who studies the unique plant species found in various regions of the galaxy. I have a degree in astrobiology from the University of Andromeda and have spent several years conducting research on distant planets. My expertise includes plant identification, taxonomy, and ecosystem analysis. I'm also fluent in several alien languages and have a passion for exploring the unknown.
What is the main event of the story "The Giver"?
The main event of "The Giver" is the transition of Jonas, the protagonist, from his life in a utopian society to a life of freedom and individuality. The

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is situated in the north-central region of the country. Paris is known for its rich history, art, fashion, cuisine, and architecture, making it a popular destination for tourists and a hub for international business and culture. The city has a population of approximately 2.1 million people and is home to some of the world’s most famous landmarks, including the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. Paris is also a major economic and financial center, hosting the headquarters of many multinational companies and international organizations. The city is served by two major airports, Paris Charles de Gaulle Airport and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain but several trends are likely to shape the future.
The development of artificial intelligence (AI) has been rapidly advancing over the past few decades, with significant improvements in machine learning, natural language processing, and computer vision. As AI continues to evolve, several trends are likely to shape its future:
1. Increased Adoption of AI in Daily Life:
AI will become increasingly integrated into various aspects of daily life, from personal assistants to self-driving cars and smart homes. AI-powered devices will become more prevalent, making life easier and more convenient for individuals.
2. Rise of Edge AI:
As IoT devices become more widespread, edge AI will become more




In [6]:
llm.shutdown()
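
In a script, an exception raised before the final `llm.shutdown()` call would skip the cleanup. One way to guard against that is a small context manager that always calls `shutdown()`. In the sketch below, `FakeEngine` and `managed_engine` are hypothetical names, and the code does not assume that the engine provides any context-manager support of its own.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine that records whether shutdown ran."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompt, sampling_params=None):
        return {"text": " <completion>"}

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(factory):
    """Create an engine and guarantee shutdown() runs, even if the body raises."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


with managed_engine(FakeEngine) as llm:
    print(llm.generate("Hello, my name is")["text"])

print("shut down:", llm.is_shut_down)
```

With a real engine, the factory would presumably be something like `lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")`, using the same constructor call as the first cell.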