# Offline Engine API

SGLang provides a direct inference engine that does not require an HTTP server. This is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
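
As a rough illustration of the custom-server idea, the sketch below puts the engine behind a plain request-handling function that a web framework route could call. `handle_generate` and `EchoEngine` are hypothetical names introduced here. `EchoEngine` is only a stand-in so the snippet runs without a GPU; it assumes that `generate` accepts a single prompt string and returns a dict with a `"text"` key, matching the output shape used elsewhere in this document. The linked `custom_server.py` remains the reference example.

```python
from typing import Any, Dict, Optional


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs anywhere."""

    def generate(
        self, prompt: str, sampling_params: Optional[dict] = None
    ) -> Dict[str, Any]:
        # Mimics the {"text": ...} output shape used in this document.
        return {"text": f" (completion for: {prompt})"}


def handle_generate(engine: Any, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Validate a JSON-like request body and forward it to the engine."""
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        return {"status": 400, "error": "'prompt' must be a non-empty string"}
    sampling_params = payload.get("sampling_params") or {}
    output = engine.generate(prompt, sampling_params)
    return {"status": 200, "text": output["text"]}


if __name__ == "__main__":
    engine = EchoEngine()  # a real server would create sgl.Engine(...) here
    print(handle_generate(engine, {"prompt": "The capital of France is"}))
    print(handle_generate(engine, {}))
```

Keeping validation in a framework-independent function makes the handler easy to unit test before wiring it into an actual HTTP route.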



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # Only imported in CI; not needed for normal use.
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.50it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Lucas and I am a freshman at ASU. I am a member of the Barrett Honors College and am pursuing a degree in Computer Science. I have a passion for gaming and technology. I am excited to join the ASU community and contribute to the Barrett community through various activities and projects.
This summer, I had the opportunity to work on a research project with a faculty member in the Computer Science department at ASU. The project involved developing a machine learning algorithm to predict the likelihood of a user buying a product based on their browsing history. This experience gave me a deeper understanding of how machine learning and data analysis can be applied in real
Prompt: The president of the United States is
Generated text:  the head of the executive branch of the federal government and is also the commander-in-chief of the armed forces. The president serves a four-year term and is limited to two terms in office. The president is elected 
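
For a large prompt set, it can be convenient to submit prompts in fixed-size chunks and store each result next to its prompt. The sketch below assumes, as the `zip` in the cell above does, that `generate` returns outputs in the same order as the prompts it was given. `chunked`, `generate_in_chunks`, and `StubEngine` are hypothetical names for this illustration; `StubEngine` stands in for `llm` so the snippet runs without a model.

```python
from typing import Any, Dict, Iterator, List, Optional, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start : start + size]


def generate_in_chunks(
    engine: Any,
    prompts: Sequence[str],
    sampling_params: Optional[dict] = None,
    chunk_size: int = 2,
) -> List[Dict[str, str]]:
    """Run generation chunk by chunk, keeping prompt/output pairs aligned."""
    results: List[Dict[str, str]] = []
    for batch in chunked(prompts, chunk_size):
        outputs = engine.generate(list(batch), sampling_params)
        for prompt, output in zip(batch, outputs):
            results.append({"prompt": prompt, "text": output["text"]})
    return results


class StubEngine:
    """Hypothetical stand-in for `llm`; returns one dict per prompt."""

    def generate(self, prompts: List[str], sampling_params: Optional[dict] = None):
        return [{"text": f" <{len(p)} chars>"} for p in prompts]


if __name__ == "__main__":
    demo_prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
    for row in generate_in_chunks(StubEngine(), demo_prompts, {"temperature": 0.8}):
        print(row)
```

The engine already schedules whatever list it receives, so chunking is optional; it mainly helps when results should be written out incrementally.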

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy reading, hiking, and trying new foods. I'm currently working on a novel and trying to learn more about the world around me. That's me in a nutshell. I'm a bit of a introvert, but I'm always up for a good conversation when I'm feeling energized. I'm not really sure what the future holds, but I'm excited to see where life takes me. I'm a bit of a dreamer, and I'm always looking for new inspiration to fuel my writing. I'm not really sure what I want

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country, on the Seine River. It is the most populous city in France and is known for its rich history, art, fashion, and cuisine. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city is also a major center for business, education, and culture. Paris is a popular tourist destination, attracting millions of visitors each year. The city has a diverse population and a strong economy, making it one of the world’s leading cities. Paris is also known for its romantic atmosphere, with

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with the potential to revolutionize the way we diagnose and treat diseases.
2. Widespread adoption of AI in industries: AI is already being used in various industries such as finance, transportation, and customer service. In the future, AI is likely to become even more widespread, with the potential to automate many tasks and
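
The banner printed by this cell mentions overlap removal. The exact behavior of `stream_and_merge` is defined in `sglang.utils`; the sketch below only illustrates the general idea, under the assumption that a streamed chunk may repeat text that was already received. `trim_overlap` and `merge_chunks` are names chosen for this illustration and are not presented as the library's implementation.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


if __name__ == "__main__":
    print(repr(merge_chunks([" Paris", " Paris is", " is the capital"])))
```

This is a heuristic: a chunk that legitimately repeats the end of the previous text (for example `"ha"` followed by `"ha"`) would be trimmed as well.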



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  August Flynn. I'm a historian specializing in medieval art and architecture. I've spent years researching the intricate details of stained glass windows and the symbolism behind them. I'm currently working on a book about the influence of medieval art on modern design.
When writing a self-introduction, aim for brevity, clarity, and a neutral tone. Avoid jargon or overly technical language, especially if you're introducing yourself in a professional or academic setting. Here's a breakdown of the key elements in this example:
1. **Start with a friendly greeting**: "Hello" is a common and approachable way to begin a self-introduction,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Next Next post: What is the second-largest city in India? Mumbai. is the second-largest city in India. (This is a commonly ac
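
Because `async_generate` is awaited, separate requests can also be issued concurrently and collected with `asyncio.gather`. The sketch below assumes that awaiting `async_generate` with a single prompt string yields a single dict with a `"text"` key. `StubAsyncEngine` and `generate_concurrently` are hypothetical names; the stub replaces `llm` so the snippet runs without a model.

```python
import asyncio
from typing import Any, Dict, List, Optional


class StubAsyncEngine:
    """Hypothetical stand-in mimicking `await llm.async_generate(prompt, params)`."""

    async def async_generate(
        self, prompt: str, sampling_params: Optional[dict] = None
    ) -> Dict[str, Any]:
        await asyncio.sleep(0.01)  # pretend the model is working
        return {"text": f" [reply to: {prompt}]"}


async def generate_concurrently(
    engine: Any, prompts: List[str], sampling_params: Optional[dict] = None
) -> List[str]:
    """Start one request per prompt and wait for all of them."""
    requests = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather returns results in the order the awaitables were passed in.
    outputs = await asyncio.gather(*requests)
    return [output["text"] for output in outputs]


if __name__ == "__main__":
    texts = asyncio.run(
        generate_concurrently(StubAsyncEngine(), ["a", "b"], {"temperature": 0.8})
    )
    print(texts)
```

Passing the whole list in one `async_generate` call, as in the cell above, is simpler; separate requests are mainly useful when prompts arrive at different times or need different sampling parameters.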

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Luna and I'm a 25-year-old freelance writer who loves trying new coffee shops and collecting obscure facts about the world's most unlikely creatures. I'm a bit of a night owl, so you might catch me working late into the evening with a cup of black coffee in hand. When I'm not writing or sipping coffee, you can find me reading sci-fi novels or watching documentaries about the latest scientific breakthroughs. I'm always up for a conversation about the weird and wonderful things in life.
How do I make this introduction more engaging?
To make this introduction more engaging, consider the following suggestions:
1.  **Add a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide a concise factual statement about France’s second-largest city. The second-largest city in France is Lyon.
Provide a concise factual statement about the city of Lyon. Lyon is a city in eastern France known for its historic architecture and gastronomy.
Provide a concise factual statement about the city of Bordeaux. Bordeaux is a city in southwestern France known for its wine production and historic architecture.
Provide a concise factual statement about the city of Marseille. Marseille is a city in southern France known for its Mediterranean coastline and cultural diversity.
Provide a concise factual statement about the city of Nice. Nice is a city in southeastern France known for its Mediterranean

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  vast and influenced by multiple factors, including technological advancements, societal needs, and economic pressures. Here are some possible future trends in artificial intelligence:
1. Increased Use of Edge AI: As more devices become connected to the internet, the demand for edge AI will grow. Edge AI refers to the processing of data closer to the source, rather than sending it to the cloud for processing. This trend is driven by the need for real-time processing and reduced latency.
2. Advancements in Natural Language Processing (NLP): NLP is a key area of AI research, and future advancements will enable more sophisticated language understanding and generation. This will lead

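The cell above prints each chunk as it arrives. When the full text is also needed afterwards, the chunks can be collected while they are displayed. The sketch below works with any async iterator of strings; `fake_token_stream` is a hypothetical stand-in for `async_stream_and_merge(llm, prompt, sampling_params)` so the snippet runs without a model, and `collect_stream` is a helper name introduced here.

```python
import asyncio
from typing import AsyncIterator, Callable, List, Optional


async def fake_token_stream(tokens: List[str], delay: float = 0.0) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming generation call."""
    for token in tokens:
        await asyncio.sleep(delay)
        yield token


async def collect_stream(
    stream: AsyncIterator[str],
    on_chunk: Optional[Callable[[str], None]] = None,
) -> str:
    """Consume a stream, optionally reporting each chunk, and return the full text."""
    parts: List[str] = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)  # e.g. print the chunk without a trailing newline
        parts.append(chunk)
    return "".join(parts)


if __name__ == "__main__":
    full_text = asyncio.run(
        collect_stream(
            fake_token_stream([" Paris", ".", "\n"]),
            on_chunk=lambda c: print(c, end="", flush=True),
        )
    )
    print(repr(full_text))
```
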

In [6]:
llm.shutdown()
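
In a standalone script, an exception raised before `llm.shutdown()` would skip the call. One way to guard against that is a small context manager, sketched below. `running_engine` is a hypothetical helper, not part of SGLang, and `StubEngine` stands in for the real engine; in practice the factory might be `lambda: sgl.Engine(model_path=...)`.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def generate(self, prompt: str, sampling_params=None) -> dict:
        return {"text": " ok"}

    def shutdown(self) -> None:
        self.is_shut_down = True


@contextmanager
def running_engine(factory: Callable[[], Any]) -> Iterator[Any]:
    """Create an engine and make sure shutdown() runs, even on errors."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


if __name__ == "__main__":
    with running_engine(StubEngine) as engine:
        print(engine.generate("Hello")["text"])
    print("shut down:", engine.is_shut_down)
```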