# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example in a Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## SPECIAL WARNING!!!!

**To launch the offline engine from a Python script, the** `if __name__ == "__main__":` **guard is required, because SGLang uses the** `spawn` **start method to create subprocesses. See this simple example**:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py
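
As a rough illustration of the guard described above, a script might be laid out as in the following sketch. The script layout, the `main` function, and the `--model-path` argument are illustrative assumptions rather than part of SGLang; only `sgl.Engine`, `generate`, and `shutdown` are taken from this document.

```python
import argparse


def main(model_path: str) -> None:
    # Imported inside main() so that merely importing this module stays cheap.
    import sglang as sgl

    llm = sgl.Engine(model_path=model_path)
    try:
        sampling_params = {"temperature": 0.8, "top_p": 0.95}
        outputs = llm.generate(["The capital of France is"], sampling_params)
        for output in outputs:
            print(output["text"])
    finally:
        llm.shutdown()


# With the spawn start method, each subprocess re-imports this file.
# The guard keeps the engine from being launched again during that re-import.
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-path", default=None)  # hypothetical flag name
    args, _unused = parser.parse_known_args()
    if args.model_path is None:
        print("usage: python this_script.py --model-path <model>")
    else:
        main(args.model_path)
```

Without the guard, the module-level engine launch would run again in every spawned subprocess, which Python's `multiprocessing` rejects with a runtime error.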

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

The following error message 'operation scheduled before its operands' can be ignored.





### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  David. I am a PhD student in Mathematics at the University of California, Los Angeles (UCLA). I am writing to ask about opportunities for graduate students like myself to engage with the broader mathematics community. I am particularly interested in collaborating with mathematicians from diverse backgrounds and engaging in outreach and education activities.
I am currently involved in several research projects, including a focus on geometric analysis and partial differential equations. My research has taken me to conferences and workshops, where I have had the opportunity to present my work and meet other mathematicians. However, I am eager to expand my connections and reach beyond the confines of academia.
I have been interested
Prompt: The president of the United States is
Generated text:  not a king, and the country’s constitution does not grant him any special powers. The president is the head of state and head of government of the United S
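
The loop in this cell relies on `generate` returning one result per prompt, in the same order, with each result exposing a `text` key. A small helper can make that pairing explicit. This is only a sketch: `generate_pairs` is a hypothetical name, and the `engine` argument is anything that offers a compatible `generate` method.

```python
from typing import Any, Dict, List, Sequence, Tuple


def generate_pairs(
    engine: Any, prompts: Sequence[str], sampling_params: Dict[str, Any]
) -> List[Tuple[str, str]]:
    """Run one batched generate() call and pair each prompt with its text."""
    outputs = engine.generate(list(prompts), sampling_params)
    # Fail loudly instead of letting zip() silently drop unmatched items.
    if len(outputs) != len(prompts):
        raise ValueError(f"expected {len(prompts)} outputs, got {len(outputs)}")
    return [(prompt, output["text"]) for prompt, output in zip(prompts, outputs)]
```

With the engine from this notebook, `generate_pairs(llm, prompts, sampling_params)` would return a list of `(prompt, generated_text)` tuples.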

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor. I live in a small apartment in the city and spend most of my free time reading and writing. I'm a bit of a introvert and enjoy spending time alone, but I also value my relationships with friends and family. I'm a bit of a perfectionist and can be quite hard on myself when things don't go as planned. I'm working on learning to be more patient and accepting of myself and others. That's me in a nutshell.
What is the main idea of the self-introduction?
The main idea of the self-introduction is to provide

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and cuisine. Paris is home to many famous landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. The city is also a major hub for business, education, and culture. Paris is a popular tourist destination and is known for its romantic atmosphere and beautiful architecture. The city has a population of over 2.1 million people and is a major economic and cultural center in Europe. The official language of

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems can analyze medical images, identify patterns, and make predictions about patient outcomes.
2. Rise of Explainable AI (XAI): As AI becomes more pervasive, there is a growing need to understand how AI systems make decisions. XAI aims to provide transparency and interpretability of AI models, enabling humans to understand the reasoning behind
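
The "overlap removal" mentioned in this cell can be pictured with a small standalone sketch. It is not necessarily how `stream_and_merge` is implemented in `sglang.utils`; `drop_overlap` and `merge_stream` are hypothetical names, and the chunks are assumed to be dicts with a `text` key, as in the outputs shown earlier.

```python
from typing import Any, Dict, Iterable


def drop_overlap(existing: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the end of existing."""
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(min(len(existing), len(new_chunk)), 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks: Iterable[Dict[str, Any]]) -> str:
    """Concatenate streamed chunks while skipping text already received."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk["text"])
    return merged
```

Because the check compares the start of each chunk with the end of the merged text, the same sketch handles both partially overlapping chunks and cumulative chunks that repeat everything generated so far.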



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Abigail Pierce, and I'm a 25-year-old graphic designer working as a freelancer in the city. I enjoy hiking and playing piano in my free time. I live alone in a small studio apartment. That's me in a nutshell. This introduction is neutral because it doesn't reveal any of Abigail's personality traits, values, or motivations. It simply presents some basic facts about her. Here are a few things to note about this introduction: (1) it's concise and easy to read; (2) it provides a clear and specific description of Abigail's profession and work situation; (3) it includes some relevant

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide a concise factual statement about the population of France’s capital city. The population of Paris is approximately 2.1 million people. Note: This is the population of Pari
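
Because `async_generate` is a coroutine, it can be awaited alongside other work in the same event loop. The sketch below issues one request per prompt and awaits them together. It assumes that `async_generate` also accepts a single prompt string and then returns a single result dict; `generate_concurrently` is a hypothetical helper name.

```python
import asyncio
from typing import Any, Dict, List, Sequence


async def generate_concurrently(
    engine: Any, prompts: Sequence[str], sampling_params: Dict[str, Any]
) -> List[str]:
    """Issue one async_generate() call per prompt and await them together."""
    tasks = [engine.async_generate(prompt, sampling_params) for prompt in prompts]
    # gather() returns results in the same order as the awaitables it was given.
    outputs = await asyncio.gather(*tasks)
    return [output["text"] for output in outputs]
```

Passing the whole list to a single `async_generate` call, as the cell above does, is simpler when all prompts are known up front; per-request calls are mainly useful when requests arrive at different times.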

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through async_stream_and_merge, which yields each chunk with overlapping text removed
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elara. I’m a 25-year-old artist living in a small town surrounded by rolling hills and dense forests. I enjoy painting, drawing, and reading. My favorite authors are Tolkien and Rowling, and I love how their worlds transport me to far-off lands. When I’m not creating art, you can find me exploring the outdoors or trying out new recipes in the kitchen. That’s me in a nutshell. How would you rate the self-introduction on the following criteria: Clarity: 8/10 The introduction starts with a straightforward greeting, and the character’s name and age are provided. However, the connection between the

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The capital of France is located on the Seine River in northern France. The city is built on two islands in the river, which were originally marshy land. The city has a long history, dating back to the Roman era, with the first permanent settlement built in the 3rd century AD. Over the centuries, Paris has been an important center of learning, art, fashion, and culture, earning it the nickname "the City of Light."
Geographically, Paris is situated in the Île-de-France region, about 300 kilometers (186 miles) north of the Loire Valley. The city's terrain is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  rapidly evolving, and its impact will be significant in various sectors. Here are some possible future trends in artificial intelligence:
1. Increased Adoption of Edge AI: Edge AI refers to the processing of AI algorithms at the edge of the network, closer to the source of the data. This approach allows for faster processing, reduced latency, and increased security. As the Internet of Things (IoT) continues to grow, edge AI will become more prevalent, enabling real-time processing and decision-making.
2. Advancements in Explainable AI: Explainable AI (XAI) is a subfield of AI that focuses on making AI models more transparent and
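
The asynchronous streaming loop in this section consumes an async generator that yields only text not seen before. A standalone sketch of such a generator is given here. It is an illustration under assumptions, not the actual `async_stream_and_merge` implementation: `merge_async_stream` is a hypothetical name, and the incoming chunks are assumed to be dicts with a `text` key.

```python
from typing import Any, AsyncIterator, Dict


async def merge_async_stream(chunks: AsyncIterator[Dict[str, Any]]) -> AsyncIterator[str]:
    """Yield only the text that has not been emitted yet."""
    emitted = ""
    async for chunk in chunks:
        text = chunk["text"]
        # Find the longest prefix of `text` that already ends `emitted`.
        overlap = 0
        for size in range(min(len(emitted), len(text)), 0, -1):
            if emitted.endswith(text[:size]):
                overlap = size
                break
        new_text = text[overlap:]
        if new_text:
            emitted += new_text
            yield new_text
```

A caller would iterate it with `async for piece in merge_async_stream(stream): print(piece, end="", flush=True)`, mirroring the loop in the cell above.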




In [6]:
llm.shutdown()
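
In a script, it can be convenient to guarantee that `shutdown()` runs even when generation raises an exception. One possible way is a small context manager, sketched here; `engine_session` and `make_engine` are hypothetical names, not part of SGLang.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def engine_session(make_engine: Callable[[], Any]) -> Iterator[Any]:
    """Create an engine and always call shutdown() on exit."""
    engine = make_engine()
    try:
        yield engine
    finally:
        # Runs on normal exit and when the with-block raises.
        engine.shutdown()


# Usage sketch (assumes SGLang is installed and the model is available):
# with engine_session(lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")) as llm:
#     outputs = llm.generate(prompts, sampling_params)
```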