# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
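
As a rough illustration of the custom server idea, the sketch below wraps an engine in a small asynchronous request handler. `StubEngine`, `handle_request`, and the payload fields are hypothetical names introduced here. The stub only imitates the `async_generate(prompts, sampling_params)` call shape used later in this document, so the example runs without a GPU. The linked script remains the reference.

```python
import asyncio
from typing import Any


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; echoes prompts instead of running a model."""

    async def async_generate(
        self, prompts: list[str], sampling_params: dict[str, Any]
    ) -> list[dict[str, str]]:
        await asyncio.sleep(0)  # yield control, as a real engine call would
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def handle_request(engine: Any, payload: dict[str, Any]) -> dict[str, Any]:
    """Validate a request payload, call the engine, and shape the response."""
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        return {"status": 400, "error": "'prompt' must be a non-empty string"}
    # Fall back to illustrative defaults when the caller sends no sampling parameters.
    sampling_params = payload.get("sampling_params", {"temperature": 0.8, "top_p": 0.95})
    outputs = await engine.async_generate([prompt], sampling_params)
    return {"status": 200, "text": outputs[0]["text"]}


ok_response = asyncio.run(handle_request(StubEngine(), {"prompt": "The capital of France is"}))
bad_response = asyncio.run(handle_request(StubEngine(), {}))
print(ok_response)
print(bad_response)
```

In a real server, `handle_request` would be called from whichever web framework or RPC layer you choose, with the stub replaced by the actual engine instance.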



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Ruth and I am a passionate Homeopath. I have been interested in alternative and holistic health for many years, and I decided to pursue Homeopathy as a career.
I completed a 4-year Bachelor of Science in Homeopathy from the Canadian College of Homeopathic Medicine (CCHM) in 2010. Since then, I have been working with patients of all ages, from newborns to seniors, and have been helping them manage various health issues such as allergies, digestive issues, stress and anxiety, sleep disturbances, and many others.
My approach to Homeopathy is centered around understanding each patient's unique situation and creating a personalized treatment
Prompt: The president of the United States is
Generated text:  not the head of state, but the head of government. The head of state is the vice president, who serves as the ceremonial head of state and the representative of the country in international relations. The president is the head of government and the 
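
For offline batch jobs, results are often written to disk rather than printed. The following is a minimal sketch, assuming each output is a dict with a `text` key and that outputs arrive in the same order as the prompts, as in the loop above. `to_jsonl` is a hypothetical helper, and the sample lists are hand-written data standing in for the result of `llm.generate(...)`.

```python
import json


def to_jsonl(prompts: list[str], outputs: list[dict]) -> str:
    """Pair each prompt with its generated text and serialize one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": prompt, "completion": output["text"]})
        for prompt, output in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Sample data in the same shape as the results above (not real model output).
sample_prompts = ["Hello, my name is", "The capital of France is"]
sample_outputs = [{"text": " Ada."}, {"text": " Paris."}]

jsonl = to_jsonl(sample_prompts, sample_outputs)
print(jsonl)
```

The resulting string can be written to a file with a single `write` call, and each line can later be parsed independently with `json.loads`.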

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy reading, hiking, and trying new foods. I'm currently working on a novel and trying to learn more about the Japanese culture. That's me in a nutshell. I'm a bit of a introvert, but I'm always up for a good conversation. I'm looking forward to getting to know you better.
This is a good start, but it's a bit too casual for a formal introduction. Here's a revised version: Hello, my name is Kaida. I'm a 25-year-old freelance writer based in Tokyo. In my

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. The city is known for its iconic landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. Paris is also famous for its fashion, cuisine, and romantic atmosphere. The city has a population of over 2.1 million people and is a major hub for international business, culture, and tourism. Paris is a UNESCO World Heritage Site and is considered one of the most beautiful and culturally rich cities in the world. The city has a rich history dating back to the 3rd century BC and has

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in artificial intelligence:
1. Increased Adoption in Everyday Life: AI is likely to become more ubiquitous in everyday life, with applications in areas such as healthcare, finance, transportation, and education.
2. Advancements in Natural Language Processing (NLP): NLP will continue to improve, enabling AI systems to better understand and generate human-like language, leading to more effective communication between humans and machines.
3. Rise of Explainable AI (XAI): As AI becomes more pervasive, there will be a growing need to
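
The "overlap removal" mentioned in the cell above can be pictured as follows: if a streamed chunk repeats text that has already been received, only the new suffix is appended. The sketch below illustrates that idea and is not necessarily how `stream_and_merge` is implemented. `trim_overlap` and `merge_chunks` are names chosen here, and the chunk list is hand-written sample data.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the end of existing."""
    max_len = min(len(existing), len(new_chunk))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: list[str]) -> str:
    """Concatenate chunks, dropping any prefix that duplicates what was already merged."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hand-written chunks in which each one repeats the tail of the previous one.
sample_chunks = [" Paris", " Paris is", " is located", " located in France."]
merged_text = merge_chunks(sample_chunks)
print(merged_text)
```

Note that a suffix/prefix heuristic like this cannot distinguish an accidental repeat from text the model legitimately generated twice in a row, so it is best suited to streams that are known to resend overlapping text.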



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elianore Quasar. I'm a 25-year-old astrobiologist with a background in planetary science and a passion for discovering new worlds. I'm currently working at the Helios Research Station on the outskirts of the Mars Colony. When I'm not studying the red planet's geology or analyzing data from the latest robotic probes, you can find me tinkering with my personal Mars rover, the Aurora. I'm excited to meet you!
Elianore Quasar is a 25-year-old astrobiologist at the Helios Research Station on Mars. She is a planetary scientist with a passion for discovering new worlds. Elian

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the northern part of the country. Paris is situated on the Seine River, which runs through the heart of the city. It is known for its iconic landmarks like the Eiffel Tower, Notr
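
One reason to use the asynchronous API is to overlap generation with other work in the same event loop. The sketch below illustrates this with `asyncio.gather`. `StubEngine`, `load_metadata`, and `run_job` are hypothetical; the stub only imitates the `async_generate(prompts, sampling_params)` call used above, and latency is simulated with `asyncio.sleep` rather than real inference.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; sleeps briefly instead of running a model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.05)  # simulated inference latency
        return [{"text": f" <completion {i}>"} for i, _ in enumerate(prompts)]


async def load_metadata():
    """Placeholder for unrelated async work, such as reading from a database."""
    await asyncio.sleep(0.05)
    return {"job_id": 1}


async def run_job(engine, prompts, sampling_params):
    # Both coroutines run concurrently; gather returns results in argument order.
    outputs, metadata = await asyncio.gather(
        engine.async_generate(prompts, sampling_params),
        load_metadata(),
    )
    return [{"prompt": p, "text": o["text"], **metadata} for p, o in zip(prompts, outputs)]


results = asyncio.run(
    run_job(StubEngine(), ["Hello, my name is", "The future of AI is"], {"temperature": 0.8})
)
for row in results:
    print(row)
```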

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Lucas Blackwood. I'm a 25-year-old student at the University of Michigan. I'm currently studying environmental science, with a focus on renewable energy. In my free time, I enjoy hiking and playing guitar. I'm a bit of a bookworm and love getting lost in science fiction novels. I'm not particularly outgoing, but I'm always up for a good conversation. I'm excited to learn more about the world around me and contribute to making it a better place.
Some people might find my intro boring or generic. Others might appreciate the straightforwardness. I tried to write a neutral self-introduction that doesn't reveal too

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is a city with a rich history and cultural significance, located in the north-central part of the country. It is the country’s largest city and is often referred to as the “City of Light.” Paris is home to famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also known for its fashion industry, cuisine, and romantic atmosphere.
The post Provide a concise factual statement about France’s capital city. The capital of France is Paris. Paris is a city with a rich history and cultural significance, located in the north-central part of the country. It is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  vast and expansive, with numerous emerging trends that will shape the landscape of technology and society. Here are some possible future trends in artificial intelligence:
1. Increased Adoption of Edge AI: As the number of IoT devices increases, edge AI will become more prevalent. Edge AI enables devices to process data in real-time, reducing latency and improving performance. This will lead to a surge in applications such as smart homes, smart cities, and autonomous vehicles.
2. Rise of Explainable AI: As AI becomes more pervasive, there will be a growing need for explainable AI. This involves developing AI systems that can provide transparent and interpre
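
When streaming asynchronously, it is often useful to display chunks as they arrive while also keeping the full text. The following is a minimal sketch, assuming the stream yields text chunks that no longer overlap, which is what the `cleaned_chunk` loop above appears to receive from `async_stream_and_merge`. `fake_stream` and `print_and_collect` are hypothetical names, and `fake_stream` stands in for the engine so that the example runs without a model.

```python
import asyncio
from typing import AsyncIterator


async def fake_stream(chunks: list[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming call; yields chunks one at a time."""
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield chunk


async def print_and_collect(stream: AsyncIterator[str]) -> str:
    """Print each chunk on the same line as it arrives and return the full text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


collected = asyncio.run(print_and_collect(fake_stream([" Paris", ".", " It", " is"])))
```

Collecting the chunks in a list and joining once at the end avoids repeated string concatenation on long generations.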




In [6]:
llm.shutdown()
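
To make sure `shutdown()` runs even when generation raises an exception, the call can be wrapped in a context manager. This is a sketch rather than an SGLang feature: `managed_engine` and `StubEngine` are hypothetical, and the only engine method assumed is the `shutdown()` call shown above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call its shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed_engine(engine) as llm:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass  # the error is expected here; shutdown() has already run

print(engine.is_shut_down)
```

With a real engine, the `with` block would contain the `generate` or `async_generate` calls, and the resources held by the engine would be released on both normal and abnormal exit.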