# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl

from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

# Imported only when running in CI.
if is_in_ci():
    import patch


# Load the model and start the engine in this process (no HTTP server).
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Andrea, I am a freelance writer, researcher, and photographer with a passion for telling stories through images and words. I have a background in journalism and photography and have worked with various publications and brands across different industries. I am highly skilled in writing engaging and informative content, conducting research, and capturing stunning visuals that bring stories to life.
I am available for freelance assignments, collaborations, and projects that align with my interests and expertise. Some of the areas I specialize in include:
Travel writing and photography
Food and beverage writing and photography
Lifestyle and culture writing and photography
Environmental and conservation writing and photography
I am based in Bali,
Prompt: The president of the United States is
Generated text:  in a tough spot. The economy is tanking, and he's getting heat from Wall Street to do something. But what can he do? He's not a economist, and
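
In an offline batch job, the generated text is usually saved rather than printed. The following is a minimal sketch, assuming each output is a dict with a `text` key as in the cell above. The helper name `write_jsonl` and the demo data are illustrative and not part of SGLang.

```python
import io
import json


def write_jsonl(prompts, outputs, stream):
    """Write one JSON object per prompt/output pair to a text stream."""
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")


# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
demo_prompts = ["The capital of France is"]
demo_outputs = [{"text": " Paris."}]

# An in-memory buffer keeps the sketch self-contained; a real job would
# pass an open file object instead.
buffer = io.StringIO()
write_jsonl(demo_prompts, demo_outputs, buffer)
print(buffer.getvalue(), end="")
```

With a real engine, `demo_prompts` and `demo_outputs` would be replaced by `prompts` and the list returned by `llm.generate`.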

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and artist living in a small town in the Pacific Northwest. I enjoy hiking, reading, and trying out new recipes in my spare time. I'm a bit of a introvert, but I'm always up for a good conversation.
I'm a 25-year-old freelance writer and artist living in a small town in the Pacific Northwest. I enjoy hiking, reading, and trying out new recipes in my spare time. I'm a bit of a introvert, but I'm always up for a good conversation.
I'm a 25-year-old freelance writer and artist living in

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. The city is also known for its romantic atmosphere and is often referred to as the City of Light. Paris is a popular tourist destination and is considered one of the most beautiful cities in the world. The city has a population of over 2.1 million people and is a major hub for business,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in artificial intelligence:
1. Increased Adoption of AI in Everyday Life: AI is likely to become more ubiquitous in our daily lives, with applications in areas such as healthcare, finance, transportation, and education.
2. Advancements in Natural Language Processing (NLP): NLP is a key area of AI research, and future advancements in this field are likely to enable more sophisticated human-computer interactions, such as conversational interfaces and language translation.
3. Rise of Explainable AI (XAI): As AI becomes
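
The "overlap removal" mentioned above suggests that text repeated between consecutive streamed chunks is dropped before merging. The following is a minimal sketch of that idea, not SGLang's actual implementation. The names `trim_overlap` and `merge_chunks` and the sample chunks are assumptions made for illustration.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first so the first match is the largest.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, skipping text that repeats the current tail."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks: the second one repeats the tail of the first.
print(merge_chunks(["The capital", " capital of France", " is Paris."]))
# prints: The capital of France is Paris.
```

Checking the longest overlap first keeps the merge greedy: if a chunk begins with text that already ends the merged string, that text is emitted only once.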



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Zara and I work as a freelance writer. I'm a bit of a homebody, preferring to spend my free time curled up with a good book or experimenting with new recipes in the kitchen.
Zara is a freelance writer who lives a relatively quiet life, enjoying her time alone with hobbies like reading and cooking.
Read the introduction to get a sense of Zara's personality and lifestyle. The tone is neutral, not giving away any dramatic secrets or exciting adventures.
The introduction doesn't tell us too much about Zara's background or life before becoming a freelance writer, leaving room for the story to unfold.
Freelance writer Z

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is the largest city in France and one of the most famous cities in the world. It is a global center for art, fashion, gastronomy, and cul
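
When requests arrive separately, for example in a custom server, several asynchronous calls can be awaited together. The sketch below uses a hypothetical `StubEngine` in place of the real engine so that it runs without a model, and it assumes that `async_generate` accepts a single prompt and returns a dict with a `text` key.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; it only echoes the prompt."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real awaitable call would
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params):
    """Start one request per prompt and wait for all of them together."""
    pending = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*pending)


results = asyncio.run(generate_all(StubEngine(), ["A", "B"], {"temperature": 0.8}))
for result in results:
    print(result["text"])
```

Because `asyncio.gather` preserves input order, the results can still be paired with their prompts using `zip`, as in the cell above.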

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # async_stream_and_merge streams the response and yields each chunk with overlapping text removed
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Lyra Flynn. I'm a 25-year-old freelance writer and editor. I currently reside in Portland, Oregon, where I enjoy hiking and exploring the city's food scene.
Here are a few tips to help you craft a neutral self-introduction:
Use a conversational tone: Write as if you're introducing yourself to a friend or acquaintance.
Keep it brief: Aim for a few sentences at most.
Focus on facts: Share your name, occupation, and location.
Optional details: You can include hobbies or interests to add some personality to your introduction, but keep them neutral and not too revealing.
Here are a few examples of neutral

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the capital city of France. The Eiffel Tower is an iconic landmark in the city, and the Champs-Élysées is one of the city's most famous streets. The Louvre Museum, located in the city, is home to the Mona Lisa and other famous artworks.
France is home to a rich history, art, and culture, and Paris is the heart of it all. The city is known for its stunning architecture, world-class museums, and romantic atmosphere. Visitors can explore the city's historic neighborhoods, such as Montmartre and Le Marais, and enjoy the city's vibrant food and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by the convergence of various technologies and the growing demand for intelligent systems that can learn, reason, and interact with humans in a more natural and intuitive way.
Potential future trends in AI include:
1. Increased use of machine learning and deep learning: As data continues to grow exponentially, AI systems will become more adept at learning from data and making predictions, leading to more accurate and personalized decision-making.
2. Integration of AI with the Internet of Things (IoT): The IoT will continue to expand, and AI will play a key role in managing and analyzing the vast amounts of data



In [6]:
llm.shutdown()
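
If generation raises an exception, the `shutdown()` call above is never reached. One way to guard against that is a small context manager. This is a sketch only: `managed_engine` and `StubEngine` are hypothetical names, and the stub stands in for the real engine so the example runs without a model.

```python
from contextlib import contextmanager


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


stub = StubEngine()
try:
    with managed_engine(stub):
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(stub.closed)  # prints: True
```

With the real engine, the same wrapper would be used as `with managed_engine(sgl.Engine(model_path=...)) as llm:`, assuming only the `shutdown()` method shown above.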