# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
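As a rough illustration of the custom-server idea, the sketch below wraps an engine in a framework-agnostic request handler. `StubEngine`, `handle_generate`, and the JSON field names are hypothetical; the stub only mimics a `generate(prompt, sampling_params)` call and the `{"text": ...}` output shape used later in this document, so the example runs without a model. The linked `custom_server` script remains the reference for a real server.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a model."""

    def generate(self, prompt, sampling_params=None):
        # Mirror the output shape used in this document: a dict with a "text" key.
        return {"text": f" <completion for: {prompt}>"}


def handle_generate(engine, raw_body: str) -> str:
    """Parse a JSON request body, call the engine, and return a JSON response."""
    try:
        payload = json.loads(raw_body)
        prompt = payload["prompt"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return json.dumps({"error": "body must be JSON with a 'prompt' field"})

    sampling_params = payload.get("sampling_params", {})
    output = engine.generate(prompt, sampling_params)
    return json.dumps({"text": output["text"]})


engine = StubEngine()
print(handle_generate(engine, '{"prompt": "The capital of France is"}'))
```

Keeping the handler independent of any web framework means the same function could be called from whichever HTTP library the server ends up using.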



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Sophie, and I’m a creative copywriter. I love crafting compelling stories, highlighting unique brands, and helping them find their voice.
I have over 5 years of experience in writing for various mediums, including advertising, content marketing, and social media. My expertise lies in understanding the nuances of different brand voices and adapting my writing style to match their tone, audience, and message.

Some of my strengths include:

• Crafting engaging headlines and taglines
• Writing persuasive copy for various marketing channels
• Developing brand voice guidelines and tone of voice
• Creating social media content that drives engagement
• Collaborating with designers and art directors to
Prompt: The president of the United States is
Generated text:  the head of the executive branch of the federal government. The president is both the head of state and head of government of the United States. The president directs the executive branch of
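For offline batch jobs it is often convenient to persist results rather than print them. The sketch below pairs each prompt with its output and serializes the records as JSON Lines. `to_jsonl` is a hypothetical helper, and the hard-coded `outputs` list merely stands in for what `llm.generate` returns, assuming (as in the cell above) that each output is a dict with a `text` key.

```python
import json


def to_jsonl(prompts, outputs):
    """Return one JSON object per line, pairing each prompt with its generated text."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": p, "text": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)


prompts = ["Hello, my name is", "The capital of France is"]
# Stand-in for `llm.generate(prompts, sampling_params)`.
outputs = [{"text": " Sophie."}, {"text": " Paris."}]

print(to_jsonl(prompts, outputs))
```

The explicit length check guards against silently dropping records, which `zip` would otherwise do if the two lists ever got out of step.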

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student. I'm a bit of a bookworm and enjoy reading about history and science. I'm also a member of the school's debate team. I like to think I'm a pretty logical and analytical person, but I'm still figuring out who I am and where I fit in the world. I'm not really sure what I want to do with my life yet, but I'm excited to explore my options and see where they take me. I'm a bit of a introvert, but I'm working on being more outgoing and confident. I'm looking forward to seeing

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and cuisine. Paris is home to many famous landmarks, including the Eiffel Tower, the Louvre Museum, and Notre Dame Cathedral. The city has a population of over 2.1 million people and is a major hub for international business, culture, and tourism. Paris is also known for its romantic atmosphere, with its beautiful parks, gardens, and bridges. The city has a long history dating back to the 3rd century

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems will be able to analyze large amounts of medical data, identify patterns, and make predictions about patient outcomes.
2. Rise of Explainable AI (XAI): As AI becomes more pervasive, there is a growing need to understand how AI systems make decisions. XAI will focus on developing AI systems that can provide transparent and interpretable explanations for
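The phrase "overlap removal" in the cell above refers to merging streamed chunks without duplicating text that has already been seen. The sketch below illustrates one way to do that: drop the longest prefix of each new chunk that matches a suffix of the accumulated text. `strip_seen_prefix` and `merge_chunks` are names written for this example, and the chunk list is made up; this is not claimed to be how `stream_and_merge` is implemented in `sglang.utils`.

```python
def strip_seen_prefix(existing_text: str, new_chunk: str) -> str:
    """Remove the longest prefix of new_chunk that is also a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that repeats the tail of what was already merged."""
    merged = ""
    for chunk in chunks:
        merged += strip_seen_prefix(merged, chunk)
    return merged


# Made-up chunks in which each one repeats part of the previous text.
chunks = ["The capital", " capital of France", " France is Paris."]
print(merge_chunks(chunks))  # prints "The capital of France is Paris."
```

Note that this heuristic cannot tell an overlap from a genuine repetition: if a new chunk legitimately starts with the text that ends the previous output, that text is dropped as well.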



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elianore Quasar, and I'm a middle-aged astrophysicist living in a small town on the outskirts of the New Eden Space Colony. I work as a researcher at the colony's central observatory, studying the properties of distant stars and galaxies. When not working, I enjoy spending time in the colony's botanical gardens, where I tend to a small collection of rare, exotic plants. I'm a bit of a quiet, introspective person, content with my simple life in the colony, but I do enjoy sharing my knowledge with others and engaging in lively discussions about the wonders of the cosmos. What are your thoughts

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. I. Introduction
The capital city of France is a significant urban center known for its rich history, cultural landmarks, and romantic ambiance. The city's beautiful a
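One reason to prefer the asynchronous API is that other coroutines can make progress while generation is in flight. The sketch below shows that pattern with `asyncio.gather`. `fake_async_generate` and `load_metadata` are hypothetical stand-ins so the example runs without a model; in real code the first would be replaced by `llm.async_generate(prompts, sampling_params)` as in the cell above.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    """Hypothetical stand-in for llm.async_generate; returns dicts with a "text" key."""
    await asyncio.sleep(0.05)  # Simulate generation latency.
    return [{"text": f" <completion {i}>"} for i, _ in enumerate(prompts)]


async def load_metadata():
    """Hypothetical unrelated I/O task that runs while generation is pending."""
    await asyncio.sleep(0.05)
    return {"job_id": "demo"}


async def run_job(prompts, sampling_params):
    # Both awaitables run concurrently; gather returns results in argument order.
    outputs, metadata = await asyncio.gather(
        fake_async_generate(prompts, sampling_params),
        load_metadata(),
    )
    return [
        {"job_id": metadata["job_id"], "prompt": p, "text": o["text"]}
        for p, o in zip(prompts, outputs)
    ]


prompts = ["Hello, my name is", "The capital of France is"]
records = asyncio.run(run_job(prompts, {"temperature": 0.8, "top_p": 0.95}))
for record in records:
    print(record)
```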

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Aurora Blackwood, and I'm a 23-year-old history student from rural Kentucky.
Aurora is a neutral character introduction because it:
Is straightforward and to the point
Doesn't reveal any interesting or unique details about the character
Uses a common name and a typical occupation
Is set in a fairly generic location
Avoids any emotional or sensational language
Overall, this introduction provides basic information about the character but doesn't give any hints about their personality, skills, or motivations. It's a good example of a neutral self-introduction. ... Read More
The final answer is: There is no final answer, as this

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The city of Paris is a major hub for business, culture, and tourism, attracting millions of visitors each year.
Paris is known for its iconic landmarks, such as the Eiffel Tower and Notre-Dame Cathedral, as well as its museums like the Louvre and the Orsay. The city is home to the French government and international organizations, including the United Nations Educational, Scientific and Cultural Organization (UNESCO).
What are some popular tourist attractions in Paris?
Some popular tourist attractions in Paris include:
The Eiffel Tower: The iconic iron lattice tower built for the 1889 World’s Fair, offering stunning views of

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  vast and rapidly evolving. Here are some possible future trends that could shape the field:
1. Increased focus on explainability and transparency: As AI becomes more pervasive, there will be a growing need for transparency and explainability in AI decision-making. This could involve developing techniques for interpreting and understanding the decision-making processes of AI systems.
2. Rise of hybrid intelligence: Hybrid intelligence combines human and artificial intelligence to create more effective and efficient solutions. This could involve using AI to augment human capabilities, such as decision-making, problem-solving, and creativity.
3. Growing importance of edge AI: Edge AI involves deploying AI
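When the full text is needed in addition to live printing, the chunks can be accumulated as they arrive. The sketch below consumes an async generator, optionally echoes each chunk, and returns the joined text. `fake_chunk_stream` is a hypothetical stand-in for `async_stream_and_merge(llm, prompt, sampling_params)`, assuming, as in the cell above, that it yields plain string chunks.

```python
import asyncio


async def fake_chunk_stream():
    """Hypothetical stand-in for async_stream_and_merge; yields string chunks."""
    for chunk in [" Paris", ".", " The", " city", " of", " Paris"]:
        await asyncio.sleep(0)  # Yield control, as a real stream would between chunks.
        yield chunk


async def collect_stream(chunk_stream, echo=True):
    """Print chunks as they arrive (optional) and return the concatenated text."""
    parts = []
    async for chunk in chunk_stream:
        if echo:
            print(chunk, end="", flush=True)
        parts.append(chunk)
    if echo:
        print()  # Finish the line once the stream is exhausted.
    return "".join(parts)


full_text = asyncio.run(collect_stream(fake_chunk_stream()))
```

Collecting into a list and joining once at the end avoids repeatedly rebuilding a growing string inside the loop.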




In [6]:
llm.shutdown()
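In a standalone script, it can help to guarantee that `shutdown()` runs even if generation raises an exception. The sketch below wraps an engine in a context manager for that purpose. `managed_engine` and `StubEngine` are hypothetical names; the stub only records whether `shutdown()` was called, and in real code the factory would construct the engine the same way the first cell does.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self):
        self.closed = False

    def generate(self, prompt, sampling_params=None):
        return {"text": " <completion>"}

    def shutdown(self):
        self.closed = True


@contextmanager
def managed_engine(factory):
    """Create an engine via factory() and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


with managed_engine(StubEngine) as engine:
    output = engine.generate("Hello, my name is")
    print(output["text"])

print("engine shut down:", engine.closed)
```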