# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
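
To give a rough idea of what sits between a web framework and the engine, the sketch below shows a framework-agnostic request handler. It is an illustration only and is not taken from the linked script: `handle_generate`, the payload field names (`prompts`, `sampling_params`), and the `FakeEngine` stand-in are all hypothetical. The only thing assumed about the real engine is what this document itself shows, namely that `generate(prompts, sampling_params)` returns one dict with a `"text"` key per prompt.

```python
from typing import Any, Dict, List


class FakeEngine:
    """Hypothetical stand-in for the real engine so the sketch runs without a GPU."""

    def generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        # Mirror the output shape used in this document: one {"text": ...} per prompt.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def handle_generate(engine: Any, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Validate a decoded JSON request body and return a JSON-serializable response."""
    prompts = payload.get("prompts")
    if not isinstance(prompts, list) or not all(isinstance(p, str) for p in prompts):
        return {"status": 400, "error": "'prompts' must be a list of strings"}

    sampling_params = payload.get("sampling_params", {})
    if not isinstance(sampling_params, dict):
        return {"status": 400, "error": "'sampling_params' must be an object"}

    outputs = engine.generate(prompts, sampling_params)
    return {"status": 200, "completions": [o["text"] for o in outputs]}


if __name__ == "__main__":
    print(handle_generate(FakeEngine(), {"prompts": ["The capital of France is"]}))
```

Keeping the handler independent of any web framework means the same function could be called from whichever HTTP layer you choose, with the real engine passed in place of `FakeEngine`.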

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Belina, and I am a 24-year-old Christian blogger and freelance writer. I am passionate about helping people find faith in God and living a life of purpose and joy. My writing style is straightforward, relatable, and encouraging.
I believe that faith is not just something we believe with our minds, but it's something we live out in our daily lives. I want to help people discover their purpose and live a life that honors God. My blog is a place where people can come to find inspiration, encouragement, and practical advice on how to live a life that is pleasing to God.

I have been a Christian since I was
Prompt: The president of the United States is
Generated text:  supposed to embody the values of his nation, but when it comes to President Donald Trump, that's not always the case.
Trump's time in office has been marked by controversy, divisiveness, and a general disregard for the norms of civility and decorum that have guided past presidents.
F
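
In offline batch jobs the results are usually stored rather than printed. The following is a minimal sketch of one way to do that, assuming only the output shape shown above (a list of dicts with a `"text"` key, aligned with the prompts). The helper name `to_jsonl` and the record field names are illustrative choices, and the example output dict is a placeholder, not real model output.

```python
import json
from typing import Dict, Iterable, List


def to_jsonl(prompts: List[str], outputs: Iterable[Dict[str, str]]) -> str:
    """Pair each prompt with its generated text and serialize as JSON Lines."""
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        # json.dumps escapes embedded newlines, so each record stays on one line.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


if __name__ == "__main__":
    example_prompts = ["The capital of France is"]
    example_outputs = [{"text": " Paris."}]  # placeholder, not real model output
    print(to_jsonl(example_prompts, example_outputs))
```

With a real engine, `outputs` would be the return value of `llm.generate(prompts, sampling_params)`.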

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 22-year-old student at the University of Tokyo, studying environmental science. I'm originally from a small town in Hokkaido, where I grew up surrounded by nature. I'm interested in sustainable development and conservation, and I'm currently working on a research project about the impact of climate change on Japanese forests. I'm a bit of a bookworm and enjoy reading about science, history, and philosophy in my free time. I'm also an avid hiker and love exploring the great outdoors. That's me in a nutshell! What do you think? Is there anything you'd like to add or

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris. This statement is a concise factual statement about France’s capital city. It provides a clear and direct answer to the question, without any additional information or context. It is a simple and straightforward statement that can be used as a starting point for further discussion or research. The statement is also accurate and reliable, as it is a widely accepted fact about France’s capital city. Overall, the statement is a good example of a concise factual statement. The statement is also neutral and does not contain any bias or opinion. It simply states a fact, without

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by several factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in AI:
1. Increased use of Explainable AI (XAI): As AI becomes more pervasive, there is a growing need to understand how AI systems make decisions. XAI aims to provide transparent and interpretable AI models, enabling humans to understand the reasoning behind AI-driven decisions.
2. Advancements in Edge AI: Edge AI refers to the processing of AI tasks at the edge of the network, closer to the source of the data. This trend is driven by the need for faster and more efficient AI processing
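
The "overlap removal" mentioned in the cell above refers to merging streamed chunks without duplicating text that consecutive chunks share. The sketch below illustrates the idea in isolation. It is a simplified illustration written for this document, not necessarily how `stream_and_merge` is implemented, and the chunk strings are made up.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, skipping any text repeated at the chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


if __name__ == "__main__":
    # Made-up chunks whose boundaries overlap ("lo" and "wor" are repeated).
    print(merge_chunks(["Hello", "lo wor", "world"]))
```

One caveat of this heuristic: if the stream already delivers pure deltas, a chunk that happens to start with the text the previous chunk ended with (for example a genuinely repeated word) would be trimmed as well.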



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Luca. I'm a 22-year-old from a small town in Italy. I'm currently studying engineering at the local university. I enjoy hiking and reading in my free time. I'm not really sure what I want to do with my life yet, but I'm open to any opportunities that come my way. I'm a bit of a introvert, but I do enjoy spending time with my close friends and family. That's me in a nutshell.
The introduction is neutral because it doesn't reveal any unique or exciting qualities about Luca, and it doesn't express any strong personality traits or emotions. It simply provides a brief overview of

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
This question requires a simple factual recall of the capital of France. The correct answer is a one-word response that is widely known. There is no need for additional information or
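
Because `async_generate` is awaitable, several independent batches can be awaited together with `asyncio.gather`. The sketch below shows the pattern with a stand-in engine so that it runs without a GPU. `FakeAsyncEngine` and `run_batches` are hypothetical names, and whether issuing several batches concurrently brings any benefit with the real engine is an assumption you would need to verify.

```python
import asyncio
from typing import Any, Dict, List


class FakeAsyncEngine:
    """Hypothetical stand-in that mimics the awaitable call used in this document."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        await asyncio.sleep(0)  # yield control, as a real awaitable call would
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(
    engine: Any, batches: List[List[str]], sampling_params: Dict[str, Any]
) -> List[List[Dict[str, str]]]:
    """Await all batches together; results come back in the order of `batches`."""
    pending = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*pending)


if __name__ == "__main__":
    batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
    results = asyncio.run(run_batches(FakeAsyncEngine(), batches, {"temperature": 0.8}))
    for batch, outputs in zip(batches, results):
        for prompt, output in zip(batch, outputs):
            print(f"{prompt!r} -> {output['text']!r}")
```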

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaito. I work as a freelance writer and enjoy spending time outdoors. When I'm not working, you can find me hiking or reading a good book. I'm a bit of a introvert, so I prefer to keep to myself, but I'm always up for a quiet conversation. I'm a bit of a daydreamer and sometimes get lost in my own thoughts, but I always try to be present in the moment. I'm a curious person and love to learn new things, especially about history and science. I'm not much of a socialite, but I appreciate the value of human connection and try to be



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The statement is concise and factual. It provides a basic piece of information about France’s capital city. To improve the statement, you could add more detail, such as the population of Paris or a notable landmark in the city. However, the statement is clear and easy to understand as it is. Therefore, it fulfills the requirements of the task. The tone is neutral and informative, which is suitable for a factual statement. Overall, the statement effectively provides a brief overview of France’s capital city. No further action is required. The statement is already accurate and clear. Therefore, it is complete. The statement is a single sentence



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  closely tied to the development of new technologies, societal needs, and ethical considerations. Some possible future trends in AI include:
More emphasis on Explainability and Transparency: As AI becomes more pervasive, there will be a greater need for understanding how decisions are made and why certain outcomes occur. This may lead to the development of more explainable and transparent AI systems.
Increased focus on Edge AI: As the Internet of Things (IoT) continues to grow, there will be a need for AI to be processed at the edge of the network, rather than in the cloud. This will enable faster and more efficient processing of data in real-time.
Adv
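
The asynchronous streaming loop prints each cleaned chunk as soon as it arrives. The sketch below illustrates that consumption pattern with a fake stream: an async generator receives chunks, removes any text already emitted, and yields only the new part. It is an illustration under stated assumptions, not the implementation of `async_stream_and_merge`. The names `fake_stream` and `merged_deltas` are hypothetical, and the chunk contents are made up.

```python
import asyncio
from typing import AsyncIterator, List


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


async def fake_stream(chunks: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming response from the engine."""
    for chunk in chunks:
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield chunk


async def merged_deltas(stream: AsyncIterator[str]) -> AsyncIterator[str]:
    """Yield only the text that has not been emitted yet."""
    seen = ""
    async for chunk in stream:
        delta = trim_overlap(seen, chunk)
        seen += delta
        if delta:
            yield delta


async def main() -> None:
    # Made-up cumulative chunks: each one repeats the text streamed so far.
    chunks = [" Paris", " Paris.", " Paris.\nThe"]
    async for delta in merged_deltas(fake_stream(chunks)):
        print(delta, end="", flush=True)
    print()


if __name__ == "__main__":
    asyncio.run(main())
```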




In [6]:
llm.shutdown()