# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an extra HTTP layer would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## SPECIAL WARNING!!!!

**To launch the offline engine from a Python script, you must place the launch code under an** `if __name__ == "__main__":` **guard, because the engine uses** `spawn` **mode to create subprocesses. Please refer to this simple example**:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Reilly, and I am a licensed therapist with a passion for helping people overcome anxiety and depression. I believe that everyone deserves to live a happy and fulfilling life, free from the grip of mental health struggles.
As a therapist, I use a non-judgmental and empathetic approach to help my clients feel safe and understood. I am trained in a variety of evidence-based therapies, including Cognitive Behavioral Therapy (CBT), Dialectical Behavior Therapy (DBT), and Acceptance and Commitment Therapy (ACT).
My experience has taught me that every person's journey is unique, and what works for one person may not work for
Prompt: The president of the United States is
Generated text:  like a movie star. You're not voting for someone who is going to be a good leader, you're voting for a movie star. I mean, think about it: who do you want to see on the evening news? Who do you want to go to the theater and see? You want to see someone who's charming,
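
The cell above shows that each element of `outputs` is a dictionary with a `text` field. As a minimal sketch of what one might do with such a batch result, the snippet below pairs prompts with completions and serializes them as JSON Lines. The helper name `to_jsonl` and the placeholder outputs are assumptions made for illustration; they are not part of SGLang, and the placeholder strings stand in for real model output.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text, one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = []
    for prompt, output in zip(prompts, outputs):
        # Only the 'text' key is used, matching the access pattern in the cell above.
        record = {"prompt": prompt, "completion": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Placeholder data standing in for the list returned by llm.generate(...).
placeholder_prompts = ["Hello, my name is", "The capital of France is"]
placeholder_outputs = [{"text": " <completion A>"}, {"text": " <completion B>"}]

jsonl = to_jsonl(placeholder_prompts, placeholder_outputs)
print(jsonl)
```

Keeping the prompt next to its completion makes the file self-describing, which tends to help when batch results are inspected or post-processed later.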

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor. I live in a small apartment in the city with my cat, Luna. I enjoy reading, hiking, and trying out new recipes in my free time. I'm a bit of a introvert, but I'm always up for a good conversation.
This self-introduction is neutral because it doesn't reveal any personal biases or opinions. It simply states the character's name, occupation, and interests in a straightforward and factual way. This type of introduction can be useful for a character who is trying to make a good impression or establish a professional relationship with someone. It

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris. 
This statement is a concise factual statement about France’s capital city. It provides a clear and direct answer to the question, without any additional information or context. It is a simple and straightforward statement that can be used as a fact or a piece of trivia. 
Note: This response is a direct answer to the question and does not require any additional information or analysis. It is a simple and concise statement that provides a clear and accurate answer.  The tone is neutral and informative, providing a factual statement without any emotional or persuasive language.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems can analyze medical data, identify patterns, and make predictions about patient outcomes.
2. Widespread adoption of AI in industries: AI is expected to be adopted in various industries, including finance, transportation, and education. AI-powered systems can automate tasks, improve efficiency, and enhance decision-making.
3. Rise of Explainable
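
The cell above mentions "overlap removal" but does not show how `stream_and_merge` works internally. The sketch below illustrates one plausible way to merge streamed chunks whose beginnings repeat the end of the text already received. The function names `strip_overlap` and `merge_chunks` are hypothetical, and this should be read as an illustration of the idea rather than as SGLang's implementation.

```python
def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is also a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first so the largest match wins.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats part of the previous text.
merged = merge_chunks([" Paris", " Paris is", " is the capital"])
print(repr(merged))
```

One caveat of this suffix/prefix matching is that it cannot tell an accidental repeat from a legitimate one, so a chunk that genuinely starts with the same characters the text ends with would be shortened as well.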



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Evelyn Weaver, but most people call me Evie. I'm a 25-year-old freelance writer and part-time yoga instructor living in Portland, Oregon. I've been writing professionally for about five years, and I've had articles published in several local publications. I'm also a certified yoga instructor and have been teaching classes at a few studios in the city. When I'm not working, you can find me exploring the city's coffee shops, trying new breweries, or hiking in the nearby woods.
I like that this introduction doesn't reveal too much about Evie's personality or background, but still gives the reader a sense of who she

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The city is located in the northern part of the country and is situated on the Seine River. The population of Paris is approximately 2.1 million 
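
One reason to prefer an awaitable call is that it can run alongside other coroutines in the same event loop. The sketch below shows that pattern with `asyncio.gather`. It uses a stand-in coroutine, `fake_async_generate`, which only sleeps and returns a placeholder dictionary with a `text` field; it is an assumption for illustration and does not call SGLang.

```python
import asyncio


async def fake_async_generate(prompt: str, delay: float) -> dict:
    """Stand-in for an awaitable generation call; sleeps instead of running a model."""
    await asyncio.sleep(delay)
    return {"text": f" <completion for: {prompt}>"}


async def run_concurrently(prompts):
    """Start one coroutine per prompt and wait for all of them together."""
    tasks = [fake_async_generate(prompt, 0.01) for prompt in prompts]
    # gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*tasks)


results = asyncio.run(run_concurrently(["A", "B", "C"]))
for result in results:
    print(result["text"])
```

Because `asyncio.gather` preserves input order, results can still be zipped with the original prompts, as in the cell above.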

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Fenella Wimple. I'm a 27-year-old data analyst from Manchester. I enjoy reading fantasy novels and playing board games in my free time. I'm currently looking for a new role in my field, possibly in London. That's a bit about me! What do you think? Should I change anything?
Fenella Wimple sounds like a perfectly ordinary, rather pleasant person. The language is clear and easy to understand, and the information provided is relevant and concise. However, the tone is a bit flat and neutral. You might want to consider adding a few personal touches to make the introduction more engaging. For example,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide a list of cultural attractions that can be found in Paris. The list may be arranged alphabetically or in a specific order.
The Eiffel Tower

The Louvre Museum

Palace of Versailles

The Arc de Triomphe

The Palace of Fontainebleau

The Musée d’Orsay

The Sainte-Chapelle

Provide a list of activities that can be done in Paris. The list may be arranged alphabetically or in a specific order.
Visit the Eiffel Tower and enjoy the views from the top.
Explore the city on a bike or on foot.
Go shopping on the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  vast and exciting. Here are some possible trends that might shape the field in the coming years:
1.  Increased focus on explainability and transparency: As AI becomes more pervasive, there is a growing need to understand how AI models make decisions. This trend is driven by the need for accountability, trust, and explainability in AI-driven systems.
2.  Advancements in edge AI: Edge AI refers to the processing of data on devices at the edge of the network, rather than in the cloud. This trend is driven by the need for real-time processing, low latency, and reduced bandwidth usage.
3.  Growing
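
The streamed output above arrives as many small chunks. As a minimal sketch of how such a stream can be consumed, the snippet below prints each chunk as it arrives while also accumulating the full text. The async generator `fake_stream` is a hypothetical stand-in that yields fixed strings; it is an assumption for illustration and does not reproduce `async_stream_and_merge`.

```python
import asyncio


async def fake_stream(chunks):
    """Stand-in async generator that yields pre-defined chunks one at a time."""
    for chunk in chunks:
        await asyncio.sleep(0)  # give control back to the event loop between chunks
        yield chunk


async def collect(stream) -> str:
    """Print chunks incrementally and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream is exhausted
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream([" Paris", ".", " It", " is"])))
```

Collecting the chunks in a list and joining once at the end avoids repeated string concatenation, and the returned text can be logged or stored after the stream completes.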




In [6]:
llm.shutdown()