# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation
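
The four modes differ only in how the caller receives the result: all at once or in chunks, and blocking or awaitable. The sketch below illustrates these four calling shapes in plain Python. `ToyEngine` and its method names are hypothetical stand-ins invented for this illustration; they are not the SGLang API, which is shown in the sections that follow.

```python
import asyncio
from typing import AsyncIterator, Iterator, List


class ToyEngine:
    """Hypothetical stand-in for an inference engine (not the SGLang API)."""

    def _complete(self, prompt: str) -> List[str]:
        # Pretend the "model" continues every prompt with the same three tokens.
        return [" one", " two", " three"]

    def generate(self, prompt: str) -> str:
        # Non-streaming, synchronous: block until the full text is ready.
        return "".join(self._complete(prompt))

    def generate_stream(self, prompt: str) -> Iterator[str]:
        # Streaming, synchronous: a regular generator yields chunks one by one.
        yield from self._complete(prompt)

    async def async_generate(self, prompt: str) -> str:
        # Non-streaming, asynchronous: an awaitable that resolves to the full text.
        await asyncio.sleep(0)  # hand control back to the event loop once
        return "".join(self._complete(prompt))

    async def async_generate_stream(self, prompt: str) -> AsyncIterator[str]:
        # Streaming, asynchronous: an async generator consumed with `async for`.
        for token in self._complete(prompt):
            await asyncio.sleep(0)
            yield token


async def collect_async(engine: ToyEngine, prompt: str) -> List[str]:
    full = await engine.async_generate(prompt)
    chunks = [chunk async for chunk in engine.async_generate_stream(prompt)]
    return [full, "".join(chunks)]


engine = ToyEngine()
sync_full = engine.generate("count:")
sync_streamed = "".join(engine.generate_stream("count:"))
async_full, async_streamed = asyncio.run(collect_async(engine, "count:"))
print(sync_full, sync_streamed, async_full, async_streamed, sep="|")
```

All four paths produce the same text here; what changes is when the caller sees it and whether the calling thread is blocked in the meantime.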

Additionally, you can easily build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Emma and I am a 15 year old girl. I am a vegetarian and I have been wanting to make some changes in my life to live a more sustainable lifestyle. I am excited to start this journey and learn more about the topic. I would love to hear any tips or advice that you might have.
Welcome to the community, Emma! It's great to hear that you're interested in living a more sustainable lifestyle. That's a wonderful goal, and it's awesome that you're taking the first step by educating yourself.

As a vegetarian, you're already doing a great thing for the environment and animal welfare. Here are some tips
Prompt: The president of the United States is
Generated text:  a symbol of the nation’s power and prestige. But, more importantly, the president is a symbol of the values and ideals of the American people.
One of the most important ideals in America is the idea of freedom. Americans cherish their freedom to make choices, to express themselves, and to live 

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor living in Tokyo. I enjoy reading, hiking, and trying out new restaurants. I'm currently working on a novel and experimenting with different writing styles. That's me in a nutshell. I'm looking forward to meeting new people and learning more about their experiences.
This self-introduction is neutral because it doesn't reveal any personal opinions or biases. It simply states the character's name, age, occupation, and interests. It also mentions a current project, which can help to establish the character's personality and goals. The introduction ends with a friendly and open-ended statement,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. The city is also a major hub for international business, finance, and tourism. Paris is a popular destination for visitors from around the world, attracting over 23 million tourists each year. The city is divided into 20 arrondissements, or districts, and has a population of over 2

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems may be able to analyze medical images, identify patterns in patient data, and provide personalized treatment recommendations.
2. Rise of Explainable AI (XAI): As AI becomes more pervasive, there is a growing need to understand how AI systems make decisions. XAI aims to provide transparency and interpretability of AI models, enabling humans to understand
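
The phrase "overlap removal" in the cell above refers to merging streamed chunks so that text repeated at a chunk boundary appears only once. The following is a minimal sketch of that idea, assuming each new chunk may begin with a copy of the tail of the text received so far. `strip_overlap` and `merge_chunks` are hypothetical names for this illustration and are not taken from SGLang's source; `stream_and_merge` may work differently internally.

```python
from typing import Iterable


def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing the overlap at each boundary."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


chunks = ["The capital", " capital of France", " France is Paris."]
merged = merge_chunks(chunks)
print(merged)  # The capital of France is Paris.
```

One limitation of this heuristic: a chunk that legitimately repeats the end of the text so far (for example `"ha"` following `"ha"`) is trimmed as well, so it only suits streams where boundary repetition is known to be an artifact.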



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Jane Thompson. I'm a 20-year-old college student majoring in English literature. I'm a bit of a bookworm and enjoy writing poetry and short stories in my free time. Outside of academics, I'm a bit of a homebody and enjoy spending time with my family and friends, trying out new recipes in the kitchen, and watching old movies. I'm currently working on building up my confidence and exploring my passions.
This self-introduction is neutral because it:
Avoids using overly promotional language or boasting about one's accomplishments.
Does not express any strong opinions or biases.
Does not include any sensitive or potentially divisive topics.


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The city of Paris is known for its beautiful architecture and historic landmarks, including the Eiffel Tower and Notre D
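
The practical benefit of the asynchronous form is that the event loop can run other work while a generation call is pending. The sketch below shows this with `asyncio.gather`. `fake_async_generate` is a hypothetical stand-in that only sleeps and returns a dict with a `text` key, mirroring the output shape used above; it does not call SGLang.

```python
import asyncio


async def fake_async_generate(prompt: str) -> dict:
    # Hypothetical stand-in for an awaitable generation call.
    await asyncio.sleep(0.05)  # simulates time spent waiting on the model
    return {"text": f" <completion for {prompt!r}>"}


async def heartbeat(log: list) -> None:
    # Other application work that keeps running while generation is pending.
    for i in range(3):
        log.append(f"tick {i}")
        await asyncio.sleep(0.01)


async def demo() -> tuple:
    log: list = []
    # gather() runs both coroutines concurrently on the same event loop.
    output, _ = await asyncio.gather(fake_async_generate("Hello"), heartbeat(log))
    return output, log


output, log = asyncio.run(demo())
print(output["text"], log)
```

With a blocking call in place of the awaitable one, the heartbeat could not make progress until generation had finished.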

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling llm.async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Luna Nightshade and I'm a 25-year-old botanist who's been studying the unique plant life of the mystical forest. I'm passionate about understanding the intricacies of nature and discovering new species to help preserve the delicate balance of our ecosystem.
In this example, Luna's self-introduction is neutral because it doesn't reveal any strong opinions or biases. She introduces herself, shares her profession, and explains her interests without taking a stance or expressing enthusiasm. This approach is beneficial when you want to establish a connection with others without appearing too pushy or confrontational. By keeping your introduction concise and neutral, you can set the stage



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Next Next post: Which city is the capital of France? Paris. Is this correct? Yes. Is it a fact? Yes. What makes it a fact? It is a statement that can be verified to be true. Can anyone verify this fact? Yes, by looking at a map or any reliable source of information. Is this fact reliable? Yes. Is this fact interesting? Yes, because it is the capital of a country that is well known and visited by many people. Is this fact new? No. Does this fact need to be verified? No, it is a well-known fact. Is this fact important?



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a topic of ongoing debate and speculation. However, some possible trends that may shape the future of AI include:
1.  Increased adoption of AI in various industries: AI is expected to become increasingly adopted across various industries such as healthcare, finance, transportation, and education. This will lead to more efficient and effective operations, improved decision-making, and enhanced customer experiences.
2.  Advancements in natural language processing (NLP) and machine learning (ML): NLP and ML are expected to continue advancing, enabling AI systems to better understand and interact with humans through voice, text, and other forms of communication. This will lead




In [6]:
llm.shutdown()
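
If an exception is raised before `llm.shutdown()` is reached, the call may never run. One way to guard against that is to wrap the engine in a small context manager that always calls `shutdown()` on exit. The sketch below is an illustration only: `engine_session` and `DummyEngine` are hypothetical names, and the helper assumes nothing about the engine beyond a `shutdown()` method.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(engine):
    """Yield the engine and always call its shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


class DummyEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


dummy = DummyEngine()
try:
    with engine_session(dummy) as llm_like:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(dummy.closed)  # True
```

In a script, the same pattern would read `with engine_session(sgl.Engine(model_path=...)) as llm:` followed by the generation calls.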