# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Advanced Usage

The engine also supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) and [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.38it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.27it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.25it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.74it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.54it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Dave, and I've been doing some research on the internet for a long time. I've been thinking about a relationship I've been in for a few months now, and I'm trying to make sense of it all. I'm pretty sure that I'm in love, but I'm also pretty sure that my partner is not ready for a committed relationship. They seem to want the freedom to do their own thing, and while I'm happy to give them that space, I'm struggling with my own feelings and desires.
I've been thinking about the concept of "True Love" and whether it's something that can exist outside of commitment
Prompt: The president of the United States is
Generated text:  the head of state and head of government of the United States of America. The president leads the executive branch of the federal government and is the commander-in-chief of the United States Armed Forces.
The president is elected to a four-year term by the people through the Electoral College system. The president is also 

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy writing about technology and social issues, and I'm currently working on a novel about the intersection of artificial intelligence and human relationships. When I'm not writing, you can find me exploring the city's hidden cafes and trying out new foods. I'm a bit of a introvert, but I'm always up for a good conversation. How would you improve this self-introduction?
Here are a few suggestions to improve the self-introduction:
1.  **Add a unique detail**: To make Kaida's introduction more memorable, consider adding a unique detail

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre Dame Cathedral. The city has a population of over 2.1 million people and is a major hub for business, education, and tourism. Paris is also known for its romantic atmosphere and is often referred to as the City of Light. The city has a diverse range of neighborhoods, each with its own unique character

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems will be able to analyze medical data, identify patterns, and make predictions about patient outcomes.
2. Rise of explainable AI: As AI becomes more pervasive, there will be a growing need for explainable AI, which can provide insights into how AI systems make decisions. This will be particularly important in high-stakes applications such
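The "overlap removal" mentioned in the cell above can be illustrated without an engine. The following is a minimal sketch of one way such merging could work, assuming each streamed chunk may repeat a suffix of the text received so far; the function names and the `fake_stream` data are illustrative, and the actual helper in `sglang.utils` may be implemented differently.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk without any prefix that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first so as much as possible is stripped.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, dropping text that repeats the end of the result so far."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks standing in for a streamed response (not real engine output).
fake_stream = ["The capital", " capital of France", " France is Paris."]
print(merge_chunks(fake_stream))
```

One trade-off of this approach: text that legitimately repeats across a chunk boundary (for example `"ha"` followed by `"ha"`) is also treated as overlap and dropped, so it is only appropriate when chunks are known to overlap.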



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Lily Tran, and I'm a 16-year-old high school student living in a small town in Oregon. I enjoy reading, hiking, and playing the piano.
This self-introduction should not reveal any personal preferences or characteristics that would influence the reader's opinion of Lily. It simply provides a basic introduction to who she is and what she does.
Here are a few key points to consider when writing a neutral self-introduction:
1.  Keep it brief: A short introduction is more effective than a long one.
2.  Focus on facts: Stick to verifiable information about the character, such as their name, age, and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide a concise factual statement about France’s location. France is located in Western Europe.
Provide a concise factual statement about the language spoken in F
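The asynchronous call above awaits a single batched request. When requests arrive independently, a common pattern is to issue one coroutine per prompt and gather the results. The sketch below shows that pattern with a stand-in class so it runs without a GPU; `StubEngine` and `generate_all` are hypothetical names and not part of SGLang, and the stub only imitates the dictionary shape (`{"text": ...}`) seen in the outputs above.

```python
import asyncio
from typing import Dict, List


class StubEngine:
    """Hypothetical stand-in for an inference engine; not part of SGLang."""

    async def async_generate(self, prompt: str, sampling_params: Dict) -> Dict:
        # Simulate the latency of a real generation call.
        await asyncio.sleep(0.01)
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts: List[str], sampling_params: Dict) -> List[Dict]:
    """Run one request per prompt concurrently and return results in prompt order."""
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    # asyncio.gather preserves the order of its arguments in the result list.
    return await asyncio.gather(*tasks)


async def demo() -> None:
    prompts = ["Hello, my name is", "The capital of France is"]
    outputs = await generate_all(StubEngine(), prompts, {"temperature": 0.8})
    for prompt, output in zip(prompts, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")


asyncio.run(demo())
```

Because `asyncio.gather` returns results in argument order, pairing them with the prompts via `zip` stays correct even when individual requests finish out of order.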

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly,
        # so text repeated across chunks is not printed twice
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Ada Llewellyn, and I am a 21-year-old botanist currently pursuing a Ph.D. in plant physiology at the University of Cambridge.
I have a strong interest in the reproductive biology of rare plant species, and I am currently working on a research project investigating the effects of climate change on the flowering patterns of these plants. Outside of academia, I enjoy hiking in the countryside and experimenting with new recipes in the kitchen.
What is a neutral self-introduction?
A neutral self-introduction is a brief statement that presents a person's name, profession or area of study, and any other relevant details without expressing a personal opinion

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The post Provide a concise factual statement about France’s capital city. The capital of France is Paris. appeared first on Nursing writers.
Provide a concise factual statement about France’s capital city. The capital of France is Paris. was first posted on September 15, 2020 at 3:47 pm.
https://nursinghomeworkshelp.com/wp-content/uploads/2021/07/nhs-logo-2021.png 0 0 developer https://nursinghomeworkshelp.com/wp-content/uploads/2021/07/nhs-logo-2021.png developer2021-07-08 03

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  vast and rapidly evolving, with potential breakthroughs in various fields. Here are some possible future trends in artificial intelligence:
1. **Increased Integration with the Internet of Things (IoT)**: AI will become more integrated with IoT devices, enabling real-time processing and analysis of vast amounts of data from various sources, such as sensors, cameras, and other smart devices.
2. **Advancements in Natural Language Processing (NLP)**: AI systems will become more proficient in understanding and generating human language, leading to improved chatbots, voice assistants, and language translation capabilities.
3. **Emphasis on Explainability and Transparency**: As AI becomes


In [6]:
llm.shutdown()
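If an exception is raised before the final cell runs, `shutdown()` is never called and the engine's resources may stay allocated. One way to guard against that is a small context manager. This is a minimal sketch that assumes only that the engine object exposes a `shutdown()` method, as used above; `managed_engine` and `StubEngine` are hypothetical names introduced here, and the stub exists only so the example runs without loading a model.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing only shutdown(); not part of SGLang."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def shutdown(self) -> None:
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown() on exit, even after an error."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed_engine(engine) as llm:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print("Engine shut down:", engine.is_shut_down)
```

In a script, the same wrapper could be applied to a real engine instance so that cleanup does not depend on reaching the last line.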