# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Advanced Usage

The engine also supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) and [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # Imported only when the notebook runs in CI
    import patch


# Load the model once; every example below reuses this engine instance
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Sarah and I am a proud participant in the 2019 Conquer Cancer Foundation's Team in Training (TNT) program. I am training to run the Chicago Marathon to support cancer research and to honor my grandmother, who passed away from cancer in 2017.
Throughout my training, I will be logging miles, pushing through challenging workouts, and raising funds for the Conquer Cancer Foundation. This foundation is dedicated to funding cancer research and providing resources to cancer patients and their loved ones. Every donation counts, and I am grateful for any support you can provide.
I have set a personal fundraising goal of $5,000, and I
Prompt: The president of the United States is
Generated text:  responsible for a wide range of duties and powers. He or she serves as the commander-in-chief of the armed forces and is responsible for making key decisions about national security. The president is also the head of state and is responsible for representing th
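The `temperature` and `top_p` entries in `sampling_params` control how each next token is drawn. The sketch below illustrates the general idea of temperature scaling followed by nucleus (top-p) filtering on a toy distribution. It is not SGLang's implementation: the function names and the toy logits are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


# Hypothetical scores for a four-token vocabulary (illustration only)
toy_logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax_with_temperature(toy_logits, temperature=0.8)
candidates = top_p_filter(probs, top_p=0.95)
print(candidates)  # maps surviving token index to renormalized probability
```

Under these assumptions, the least likely token falls outside the nucleus and can never be sampled, while the remaining tokens keep their relative proportions.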

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor living in a small town in the Pacific Northwest. I enjoy hiking and reading in my free time. I'm a bit of a introvert and prefer to keep to myself, but I'm always up for a good conversation when the mood strikes. I'm currently working on a novel and trying to get my writing career off the ground. That's me in a nutshell.
This is a good example of a neutral self-introduction because it:
Provides basic information about the character, such as their name, age, occupation, and location.
Avoids revealing too much about the

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris. The city is located in the northern part of the country, along the Seine River. Paris is known for its beautiful architecture, art museums, and fashion industry. The city has a population of over 2.1 million people and is a major cultural and economic center in Europe. Paris is home to many famous landmarks, including the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city has a rich history and has been a major center of learning and culture for centuries. Paris is also known for its romantic atmosphere and is a popular destination for tourists. The

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is likely to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems may be able to analyze medical images, identify patterns in patient data, and provide personalized treatment recommendations.
2. Widespread adoption of AI in customer service: AI-powered chatbots and virtual assistants are likely to become more common in customer service, helping to answer customer inquiries, resolve issues, and provide personalized support.
3.
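The banner printed by this cell mentions overlap removal: if consecutive streamed chunks repeat text that was already received, the repeated prefix has to be dropped before the chunk is appended. The following is a minimal sketch of that idea, not necessarily how `stream_and_merge` works internally; `trim_overlap`, `merge_chunks`, and the sample chunks are illustrative names and data.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, skipping text that was already received."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats the tail of the previous one
chunks = ["The capital", " capital of France", " France is Paris."]
print(merge_chunks(chunks))
```

Note that a suffix-prefix match can also remove a legitimately repeated token, so this approach only suits streams that are known to resend overlapping text.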



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Alexandra "Alex" Morrell. I'm a 25-year-old freelance writer living in the city. I work from home, which is great because I get to spend a lot of time with my cat, Luna. My interests include reading, hiking, and cooking. I like trying new foods and experimenting with different recipes in the kitchen. I'm a bit of a night owl, so you'll often find me working late or watching a movie with friends. I'm not really into big crowds or loud parties, but I do enjoy meeting new people and making connections. That's me in a nutshell!
Alexandra "Alex" Morrell

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. ## Step 1: Identify the task
The task is to provide a concise factual statement about France’s capital city.

## Step 2: Recall the information
We know that the capital of France is Paris.

## Step 3: Formulate

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # async_stream_and_merge yields each chunk with already-seen overlap removed
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Aurora "Rory" Wynter. I'm a 17-year-old high school student who spends most of my free time reading, practicing guitar, and exploring the outdoors. I'm a bit of a hopeless romantic, and I'm always up for a good adventure. What do you know about this character so that you can ask follow-up questions?
English Language Arts , Writing , English , Literature , Writing Process , Literary Analysis , Character Analysis , Literary Terms , Creative Writing
In this activity, students will analyze a character's self-introduction and ask follow-up questions to deepen their understanding of the character.
The teacher will provide a short



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide a concise list of top 3 attractions in Paris. Some of the top attractions in Paris include:
The Eiffel Tower
The Louvre Museum
The Notre Dame Cathedral
Provide a concise factual statement about the most popular cuisine in France. The most popular cuisine in France is French cuisine, which is known for its rich flavors, intricate preparations, and high-quality ingredients. Some of the most well-known French dishes include escargots, ratatouille, and coq au vin. French cuisine also has a strong emphasis on wine and cheese.
Provide a concise factual statement about the official language of France. The official



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  being shaped by technological advancements, shifting societal values, and growing awareness of the potential risks and benefits of AI. Here are some possible future trends in AI:
1. Increased use of Explainable AI (XAI): As AI becomes more pervasive in decision-making, there is a growing need to understand how AI models make decisions. Explainable AI will become more prevalent to provide transparency and trust in AI-driven systems.
2. Rise of Edge AI: With the proliferation of IoT devices, Edge AI will become more significant as it enables AI processing and decision-making at the edge of the network, reducing latency and improving real-time responses
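The consumption pattern used in this cell, printing each chunk as soon as it arrives and keeping the pieces for later, can be tried without loading a model. The sketch below swaps the engine for a stand-in async generator; `fake_token_stream` and `collect_stream` are hypothetical names, and the delay only simulates generation latency.

```python
import asyncio


async def fake_token_stream(text, delay=0.01):
    """Stand-in for a streaming engine call: yields one word-sized chunk at a time."""
    for index, word in enumerate(text.split(" ")):
        await asyncio.sleep(delay)  # simulate time spent generating the chunk
        yield word if index == 0 else " " + word


async def collect_stream(stream):
    """Print chunks as they arrive and return the assembled text."""
    pieces = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        pieces.append(chunk)
    print()  # finish the line once the stream is exhausted
    return "".join(pieces)


async def demo():
    return await collect_stream(fake_token_stream("Paris is the capital of France."))


result = asyncio.run(demo())
```

Collecting the chunks in a list and joining them once at the end avoids repeated string concatenation while still letting the caller display partial output immediately.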




In [6]:
llm.shutdown()