# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
# Only the imports used in this document are kept.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

The following error message 'operation scheduled before its operands' can be ignored.





### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Clementine (Tine) and I am a dietitian and a passionate foodie. I love exploring the world of food and sharing my knowledge with others. On this blog, I will share some of my favorite recipes, nutrition tips and fun foodie experiences.
My specialty is plant-based eating, but I also love working with clients who have gluten intolerance or other dietary restrictions. I believe that everyone deserves to enjoy good food, regardless of their dietary needs.
I am based in Brussels, Belgium, and I have a strong connection to the local food scene. I love trying out new restaurants and cafes, and discovering local ingredients and producers.

Prompt: The president of the United States is
Generated text:  not a dictator
The president of the United States is not a dictator. The U.S. Constitution establishes the structure of the federal government and limits the powers of the president. The president is a representative of the people and is accountable to C
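
For offline batch jobs it is often convenient to keep each prompt together with its completion. The sketch below shows one way to do that. It relies only on what the cell above shows: `generate` takes a list of prompts plus a sampling-parameter dict and returns one dict per prompt with a `"text"` key. `StubEngine`, `collect_records`, and `to_jsonl` are hypothetical names introduced for illustration and are not part of SGLang; the stub exists so that the snippet runs without a GPU.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so this sketch runs without a GPU."""

    def generate(self, prompts, sampling_params):
        # Mimic the output shape used above: one dict with a "text" key per prompt.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def collect_records(engine, prompts, sampling_params):
    """Pair each prompt with its generated text, preserving input order."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        raise ValueError("expected one output per prompt")
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


def to_jsonl(records):
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


records = collect_records(
    StubEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
print(to_jsonl(records))
```

With a real engine, passing `llm` instead of `StubEngine()` should work as long as the output dicts have the shape shown in the cell above.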

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy reading, hiking, and trying out new foods. I'm currently working on a novel and experimenting with different writing styles. That's me in a nutshell. I'm looking forward to meeting you.
This is a good example of a neutral self-introduction because it doesn't reveal too much about Kaida's personality, interests, or motivations. It simply states the facts about her life and what she enjoys doing. This kind of introduction is suitable for a professional or social setting where you want to make a good impression without revealing too much about yourself.
Here

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and cuisine. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city has a population of over 2.1 million people and is a major center for business, culture, and tourism. Paris is also known for its romantic atmosphere and is often referred to as the City of Light. The city has a rich cultural heritage and is home to many museums,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in AI:
1. Increased use of Explainable AI (XAI): As AI becomes more pervasive, there will be a growing need to understand how AI systems make decisions. XAI will become more prevalent to provide transparency and accountability in AI decision-making.
2. Advancements in Natural Language Processing (NLP): NLP will continue to improve, enabling AI systems to better understand and generate human language. This will lead to more sophisticated chatbots, virtual assistants, and language translation systems.
3. Rise of Edge
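
The cell above mentions "overlap removal". Assuming this means dropping the part of a new chunk that repeats the tail of the text already received, the idea can be sketched in plain Python as follows. This is an illustration of the technique only; it is not taken from the `stream_and_merge` source, and `strip_overlap` and `merge_chunks` are hypothetical names.

```python
def strip_overlap(existing_text, new_chunk):
    """Return new_chunk without the longest prefix that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first so the maximal repeat is removed.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that repeats the end of the merged result."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


chunks = [" Paris is", " is the capital", " capital of France."]
print(merge_chunks(chunks))
```

One limitation of this heuristic is that it cannot tell an accidental repeat from an intentional one, so legitimately repeated text at a chunk boundary (for example `"ha"` followed by `"ha"`) would be collapsed.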



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Maeve Winter. I'm a 28-year-old data analyst currently living in San Francisco. I have a cat named Pixel and enjoy hiking and reading in my free time. How would you modify this self-introduction to make it more engaging? Here are a few ideas to get you started:
A. Add a personal anecdote: Include a brief story that showcases your character's personality or interests.
B. Cite a unique skill or hobby: Highlight a distinctive talent or pastime that sets Maeve apart from others.
C. Use vivid language: Enrich your introduction with sensory details to bring Maeve to life.
D. Incorpor

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the largest city and a major cultural and commercial center. Its cityscape is known for its iconic landmarks such as the Eiffel Tower and Notre Dame Cathedral. Paris is a center fo
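
Because `async_generate` is awaitable, several batches can be submitted from the same event loop. The sketch below uses `asyncio.gather`, which returns results in the order the awaitables were passed. `StubAsyncEngine` and `generate_batches` are hypothetical names, and the stub only mimics the call shape shown above; whether the real engine benefits from concurrent submissions is something to verify for your setup.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; only mimics async_generate."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # Pretend to do some work.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def generate_batches(engine, batches, sampling_params):
    """Submit several prompt batches concurrently; results keep batch order."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(
    generate_batches(StubAsyncEngine(), batches, {"temperature": 0.8, "top_p": 0.95})
)
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```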

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elianore Quasar, and I'm a skilled astronomer working at the prestigious Galactic Observatory on planet Zorvath. I've spent the past five years studying the celestial bodies in our galaxy and am eager to explore new worlds. In my free time, I enjoy stargazing and reading about the history of space exploration. What is your area of expertise, and what brings you to this planet? Feel free to introduce yourself! I'd love to hear about your experiences and share some of my own stories. My appearance is quite unassuming, with short, dark hair and expressive brown eyes. I'm a bit on the slender



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Where is the capital city of France located?
The capital city of France is located in the northern part of the country. It is situated on the Seine River. Paris is the capital and the largest city of France, located in the Île-de-France region. It is the center of French politics, economy, and culture.
What is the significance of Paris as the capital of France?
Paris is the seat of the French government and the location of the Élysée Palace, the official residence and workplace of the President of France. It is also the site of the French National Assembly and the Senate. Paris is a global


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  often explored in science fiction but actual developments are also driven by rapidly advancing technologies. In this article, we will explore some possible future trends in AI.
1. Increased Efficiency in Industries: AI has the potential to automate and optimize various industries such as healthcare, finance, and transportation. This could lead to increased efficiency, reduced costs, and improved customer experiences.
2. More Human-AI Collaboration: As AI becomes more integrated into our lives, we can expect to see more human-AI collaboration. This could involve AI assistants helping us make decisions, or AI-powered tools aiding us in creative pursuits.
3. Greater Use of Explainable AI (
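
When streaming asynchronously, you often want to display chunks as they arrive and still keep the full text afterwards. The sketch below shows that pattern with a plain async generator. `fake_stream` is a hypothetical stand-in for a streaming call such as the one in the cell above, and `consume` is an illustrative helper, not an SGLang function.

```python
import asyncio


async def fake_stream(chunks):
    """Hypothetical stand-in for a streaming call: yields text chunks one by one."""
    for chunk in chunks:
        await asyncio.sleep(0)  # Yield control to the event loop between chunks.
        yield chunk


async def consume(stream, on_chunk=None):
    """Forward each chunk to on_chunk (if given) and return the full text."""
    parts = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)


def show(chunk):
    # Print without a trailing newline so chunks appear as continuous text.
    print(chunk, end="", flush=True)


pieces = [" Paris", ".", " It", " is", " on", " the", " Se", "ine", "."]
full_text = asyncio.run(consume(fake_stream(pieces), on_chunk=show))
print()
```

Collecting chunks in a list and joining once at the end avoids repeated string concatenation for long generations.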




In [6]:
llm.shutdown()
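
In a script, an exception raised before `llm.shutdown()` would skip the cleanup. One way to guard against that is a small context manager that always calls `shutdown()`. The sketch below is a minimal illustration: `managed_engine` is a hypothetical helper and `StubEngine` a stand-in that only records whether `shutdown()` ran; neither is part of SGLang.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; records whether shutdown() ran."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(create_engine):
    """Create an engine and always call shutdown() on exit, even after an error."""
    engine = create_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


with managed_engine(StubEngine) as engine:
    outputs = engine.generate(["Hello, my name is"], {"temperature": 0.8})

print(engine.is_shut_down)
```

With the real engine, the factory could be a lambda such as `lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")`, mirroring the constructor call used at the top of this document.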