# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example is available as a Python script in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Jennifer and I am the owner of Boogie Woogie Pie Company. I am a pie lover, a baker, and a dog mom. When I'm not in the kitchen baking up a storm, I love spending time with my rescue pup, Jasper, and enjoying the beautiful outdoors with my family.
Here at Boogie Woogie Pie Company, we are passionate about creating delicious, unique pies that bring joy to our customers. We use only the freshest ingredients and time-honored baking techniques to craft our pies, and we're proud to offer a variety of flavors that are sure to satisfy any sweet tooth.
Whether you're looking for
Prompt: The president of the United States is
Generated text:  a very powerful person, and one of the most recognizable figures in the world. But who was the first president of the United States? Here’s a brief history of George Washington, the first president of the United States.
George Washington was born on February 22, 1732, in Westmoreland County, Virginia. He was the el
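The loop above pairs each prompt with its output by position. For offline jobs it is often convenient to persist those pairs, and the following is a minimal sketch that writes one JSON record per prompt. `StubEngine` and `write_jsonl` are hypothetical names introduced for illustration: the stub only mimics the `generate(prompts, sampling_params)` call shape and the `output['text']` field shown above so that the snippet runs without a GPU. In practice you would pass the real `llm` object instead.

```python
import io
import json


class StubEngine:
    """Hypothetical stand-in for the engine; NOT part of SGLang."""

    def generate(self, prompts, sampling_params):
        # Return one dict per prompt, in input order, each with a "text" key.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def write_jsonl(engine, prompts, sampling_params, fh):
    """Generate for all prompts in one call and write one JSON line per prompt."""
    outputs = engine.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        fh.write(json.dumps({"prompt": prompt, "text": output["text"]}) + "\n")
    return len(outputs)


prompts = ["Hello, my name is", "The capital of France is"]
buffer = io.StringIO()  # swap for open("results.jsonl", "w") to write to disk
count = write_jsonl(
    StubEngine(), prompts, {"temperature": 0.8, "top_p": 0.95}, buffer
)
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(count, records[0]["prompt"])
```

Passing the whole prompt list in a single call leaves batching to the engine's scheduler, so the helper itself does no chunking.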

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy reading, trying new foods, and practicing yoga. I'm currently working on a novel and trying to learn more about the world around me. That's me in a nutshell.
Write a short, neutral self-introduction for a fictional character. Hello, my name is Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy reading, trying new foods, and practicing yoga. I'm currently working on a novel and trying to learn more about the world around me. That's me in a nutshell. This introduction is neutral

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris.  Paris is the largest city in France and is located in the northern part of the country. It is situated on the Seine River and is known for its beautiful architecture, art museums, and fashion industry. Paris is a popular tourist destination and is home to many famous landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. The city has a population of over 2.1 million people and is a major hub for business, culture, and entertainment.  Paris is also known for its romantic atmosphere

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in artificial intelligence:
1. Increased use of Explainable AI (XAI): As AI becomes more pervasive, there will be a growing need to understand how AI systems make decisions. XAI will become more prevalent to provide transparency and accountability in AI decision-making.
2. Rise of Edge AI: With the proliferation of IoT devices, Edge AI will become more important to process data in real-time, reducing latency and improving decision-making.
3. Growing importance of Human-AI Collaboration: As AI becomes more capable, humans
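The cell above mentions overlap removal: when consecutive streamed chunks repeat text that has already been received, the repeated part has to be dropped before the chunks are joined. The sketch below illustrates one way to do this with plain strings. It is an illustration of the idea under that assumption, not necessarily how `stream_and_merge` is implemented, and `trim_overlap` and `merge_chunks` are names chosen here for the example.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found, keep the chunk unchanged


def merge_chunks(chunks) -> str:
    """Join chunks in order, removing any overlap with the text so far."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged = merge_chunks(["Hello", "lo wor", "world"])
print(merged)  # Hello world
```

One trade-off of this approach is that text which legitimately repeats at a chunk boundary is also removed: `merge_chunks(["ha", "ha"])` returns `"ha"`, so it only suits streams whose chunks are known to overlap.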



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Lyra Frost. I'm a 22-year-old student majoring in Environmental Science at the University of Northern California. I'm currently living in the dorms on campus, sharing an apartment with two roommates. When I'm not studying or attending classes, I enjoy spending time outdoors and exploring the nearby trails. I also have a part-time job at a local coffee shop to help make ends meet. What are your interests?
This introduction is neutral because it doesn't reveal any personal opinions, biases, or conflicts. It provides some basic information about the character's background, interests, and personality, but doesn't include any emotional or evalu

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
NextNext post:What is the capital of Greece? Athens. The capital of Greece is Athens. Athens is a city in Greece. It 
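The asynchronous API is mainly useful when generation has to share an event loop with other work. As a rough sketch, the snippet below issues one request per prompt and awaits them together with `asyncio.gather`, which returns results in the order the awaitables were passed. `StubAsyncEngine` is a hypothetical stand-in so the code runs without a model; whether the real `async_generate` behaves identically for single-prompt calls is an assumption here.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in exposing an async_generate coroutine; NOT SGLang."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    """Start one request per prompt and wait for all of them."""
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*tasks)


prompts = ["The capital of France is", "The future of AI is"]
outputs = asyncio.run(
    generate_concurrently(StubAsyncEngine(), prompts, {"temperature": 0.8})
)
for prompt, output in zip(prompts, outputs):
    print(prompt, "->", output["text"])
```

Because the results keep the input order, they can be zipped with `prompts` exactly as in the batch example above.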

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emily Gray and I’m a 25-year-old graphic designer. I work for a small firm in downtown Boston, designing logos and branding materials for various businesses and organizations.
Create a neutral self-introduction by using a formal, objective tone. I’m Julia Lee, a 32-year-old marketing specialist. I have been working in the field for over 5 years and currently hold a Bachelor’s degree in Marketing from the University of California.
Write a short, neutral self-introduction for a fictional character. Hi, I’m Ryan Thompson, a 28-year-old software engineer with a degree in Computer Science from the University of Michigan. I



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The city is located on the Seine River and is home to many famous landmarks, such as the Eiffel Tower and the Louvre Museum. The city also hosts several international organizations and events, such as the United Nations Educational, Scientific and Cultural Organization (UNESCO) and the Tour de France cycling race.
The statement is concise and factual, providing basic information about Paris, the capital of France.
Here are a few ways to expand on the statement:
Provide more detailed information about the city’s history and cultural significance: “The capital of France, Paris, has a rich history dating back to the 3rd century,



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be transformative and widespread, with numerous predictions and forecasts made by experts in the field. Here are some possible future trends in AI:
AI will become increasingly pervasive: AI will be integrated into various aspects of our lives, from home appliances and personal assistants to self-driving cars and healthcare systems. AI will become an indispensable tool for businesses, governments, and individuals, transforming the way we work, live, and interact with each other.
AI will become more autonomous: As AI systems become more advanced, they will be able to operate independently, making decisions without human intervention. This will raise questions about accountability, ethics, and the potential for AI
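The streaming loop above prints each chunk as soon as it arrives. If the full text is also needed afterwards, the chunks can be collected while they are printed. The following sketch shows that pattern with a hypothetical `fake_stream` async generator standing in for the engine's stream, so it runs without a model. Only the `async for` consumption pattern is taken from the cell above.

```python
import asyncio


async def fake_stream(text, chunk_size=4):
    """Hypothetical chunk source standing in for a streaming engine call."""
    for start in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield text[start : start + chunk_size]


async def consume(stream):
    """Print chunks as they arrive and also return the full text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = asyncio.run(consume(fake_stream(" Paris. The city is on the Seine.")))
```

Collecting into a list and joining once at the end avoids rebuilding a growing string on every chunk.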




In [6]:
llm.shutdown()
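In a standalone script, a trailing `shutdown()` call is skipped if generation raises an exception first. One possible way to guard against that is a small context manager that always calls `shutdown()` on exit. The sketch below uses a hypothetical `StubEngine` and a hypothetical helper named `running`; neither is part of SGLang, and the only assumption about the real engine is the `shutdown()` method shown above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in with the same shutdown() method; NOT SGLang."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def running(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with running(engine) as llm:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass
print(engine.closed)  # True
```

A plain `try`/`finally` around the generation code achieves the same effect. The context manager just makes the pattern reusable.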