# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Kelsey and I am a third-year student at New York University majoring in Environmental Studies. In my free time, I love hiking and exploring the outdoors. During my freshman year, I discovered a passion for environmental science and policy, which has since led me to pursue a career in sustainability. I am excited to share my experiences and knowledge with the community through this blog. Feel free to reach out to me with any questions or topics you would like to discuss!
Hello, my name is Kelsey and I am a third-year student at New York University majoring in Environmental Studies. In my free time, I love hiking and exploring the
Prompt: The president of the United States is
Generated text:  the head of state and head of government of the United States, indirectly elected to a four-year term by the people through the Electoral College system. The officeholder serves as both the commander-in-chief of the Armed Forces and the head of a large bure
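For offline batch jobs it is often convenient to persist the results. The sketch below pairs each prompt with its generated text and writes one JSON object per line. It relies only on what the cell above shows (one output dict per prompt, in order, with a `"text"` key). The helper name `save_generations`, the file name, and the stand-in data are hypothetical and not part of SGLang.

```python
import json
import os
import tempfile


def save_generations(prompts, outputs, path):
    """Write one JSON record per line, pairing each prompt with its generated text."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    with open(path, "w", encoding="utf-8") as f:
        for prompt, output in zip(prompts, outputs):
            record = {"prompt": prompt, "text": output["text"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(prompts)


# Stand-in data shaped like the outputs above (not real model output).
demo_prompts = ["The capital of France is"]
demo_outputs = [{"text": " Paris."}]
demo_path = os.path.join(tempfile.gettempdir(), "generations_demo.jsonl")
written = save_generations(demo_prompts, demo_outputs, demo_path)
```

With a real engine, `outputs` would be the list returned by `llm.generate(prompts, sampling_params)`.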

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor living in a small town in the Pacific Northwest. I enjoy hiking, reading, and trying out new recipes in my free time. I'm a bit of a introvert, but I'm always up for a good conversation when I'm feeling energized. I'm currently working on a novel and trying to get my writing career off the ground. I'm excited to see where life takes me next.
This is a good start, but it's a bit too focused on your writing career. You might want to add a bit more about your personality and interests to make it

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and cuisine. Paris is home to many famous landmarks, including the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city has a population of over 2.1 million people and is a major center for business, culture, and tourism. Paris is also known for its romantic atmosphere and is often referred to as the City of Light. The city has a rich history dating back to the 3rd century

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is likely to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems may be able to analyze medical data, identify patterns, and make predictions about patient outcomes.
2. Rise of Explainable AI (XAI): As AI becomes more pervasive, there is a growing need to understand how AI systems make decisions. XAI aims to provide transparency and explainability into AI decision-making processes, which will be essential for
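The example above merges streamed chunks "with overlap removal". As a rough illustration of that idea, the sketch below drops the longest prefix of each new chunk that already appears at the end of the accumulated text. This is an assumption about how such merging can work, not necessarily the exact logic of `stream_and_merge` in `sglang.utils`. The names `trim_overlap` and `merge_chunks` are chosen here for illustration.

```python
def trim_overlap(existing_text, new_chunk):
    """Return new_chunk without the longest prefix that existing_text already ends with."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first so the largest duplicate is removed.
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that repeats the tail of what was merged so far."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Stand-in chunks with deliberate overlap (not real engine output).
demo = merge_chunks([" Paris is", " is the capital", " capital of France."])
```

One caveat of this sketch: it cannot tell accidental overlap from genuine repetition, so a chunk that legitimately repeats the preceding text (for example `"ha"` followed by `"ha"`) would be trimmed as well.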



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kai Nakamura. I'm a twenty-five-year-old artist and resident of New Tokyo. What do you think of my name? Does it sound Japanese to you? The reason I chose it was because I wanted a name that felt authentic and unique.
Hey Kai, I'm impressed by your self-introduction. I think your name is indeed Japanese, and I appreciate the effort you put into choosing a name that reflects your character's cultural background. The fact that you're an artist adds an interesting layer to your introduction, don't you think? I'd love to learn more about your artistic style and inspirations.
I'm glad you liked

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is known for being a city of art, history, fashion, and romance. The city is located in the northern part of the country, along the Seine River. Paris is home to 
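The asynchronous API is most useful when requests arrive independently, as in a custom server. The sketch below shows the general `asyncio` pattern of awaiting several independent requests concurrently and collecting results in input order. To stay runnable without a GPU it uses a stand-in coroutine, `fake_async_generate`, which is hypothetical and only mimics the `{"text": ...}` result shape shown above. With a real engine, each call would presumably be `await llm.async_generate(prompt, sampling_params)` instead.

```python
import asyncio


async def fake_async_generate(prompt, delay=0.01):
    """Stand-in for an engine call; NOT part of SGLang."""
    await asyncio.sleep(delay)  # simulate inference latency
    return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(prompts):
    """Issue one request per prompt and wait for all of them."""
    tasks = [fake_async_generate(p) for p in prompts]
    # asyncio.gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*tasks)


demo_results = asyncio.run(generate_concurrently(["a", "b", "c"]))
```

Because `asyncio.gather` preserves input order, the results can still be zipped with the prompts, as in the batch example.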

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Leon. I'm 25 years old. I'm a software engineer by profession and I enjoy outdoor activities like hiking and camping in my free time. I'm based in Seattle, Washington. I'm working on a startup project that involves developing a mobile app for tracking personal finance. That's a little about me.
Write a short, neutral self-introduction for a fictional character.
Hello, my name is Leon. I'm 25 years old.
I'm a software engineer by profession and I enjoy outdoor activities like hiking and camping in my free time.
I'm based in Seattle, Washington.
I'm working on a startup project that involves

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
This statement is a fact and can be verified in multiple sources such as maps, encyclopedias, and government websites.
Next, provide a statement that is an opinion. Here's an example: Paris is the most beautiful city in the world. This statement is an opinion because it is subjective and based on personal taste.
The statements can be combined into a single paragraph: The capital of France is Paris, a city that is often considered the most beautiful in the world. However, to make it clear that the second statement is an opinion, it can be rewritten as: The capital of France is Paris, which is widely

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  vast and uncertain, with various trends and developments expected to shape its trajectory. Here are some possible future trends in AI:
1. **Increased use of Edge AI**: With the proliferation of IoT devices, edge AI will become more prevalent, enabling AI processing to occur closer to the source of the data, reducing latency and improving real-time decision-making.
2. **Rise of Explainable AI (XAI)**: As AI becomes more ubiquitous, there will be a growing need to understand how AI models make decisions. XAI will become more prominent, providing insights into AI decision-making processes and ensuring transparency and accountability.
3. **Growing importance
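To illustrate what "no repeats" means when consuming a stream asynchronously, the sketch below wraps an async generator and yields only the text that has not been printed yet. It assumes each streamed chunk carries the cumulative text generated so far under a `"text"` key. That is an assumption for this sketch, and the actual chunk format depends on the SGLang version and settings. `fake_stream` and `stream_deltas` are hypothetical names, and `fake_stream` is a stand-in for the engine.

```python
import asyncio


async def fake_stream(pieces):
    """Stand-in async stream that yields cumulative text; NOT part of SGLang."""
    text = ""
    for piece in pieces:
        await asyncio.sleep(0)  # hand control back to the event loop
        text += piece
        yield {"text": text}


async def stream_deltas(chunks):
    """Yield only the new suffix of each cumulative chunk."""
    seen = 0  # number of characters already emitted
    async for chunk in chunks:
        text = chunk["text"]
        delta = text[seen:]
        seen = len(text)
        if delta:
            yield delta


async def collect():
    return [d async for d in stream_deltas(fake_stream([" Paris", ".", " It"]))]


demo_deltas = asyncio.run(collect())
```

Printing each delta with `end=""` and `flush=True`, as in the cell above, then produces continuous text without duplicated fragments.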




In [6]:
llm.shutdown()