# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Martin and I am a retired Computer Systems Engineer. I am an avid traveler and have been fortunate enough to have visited over 60 countries around the world. My travels have taken me to many exciting places, from the ancient temples of Cambodia to the vibrant cities of Japan, from the beautiful beaches of Thailand to the majestic mountains of New Zealand.
I have always been fascinated by the diversity of cultures and the way people live their lives in different parts of the world. I have tried to capture some of these experiences and memories in my writing, which I hope will inspire and entertain my readers.
In addition to my travels, I also enjoy learning new
Prompt: The president of the United States is
Generated text:  planning to unveil a new proposal to reform the country's tax system on Wednesday. The plan, which has been under consideration for months, aims to reduce tax rates and simplify the tax code. However, some lawmakers are expre
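For offline batch jobs it is often useful to persist the results. The sketch below pairs each prompt with its output and serializes the pairs as JSON Lines. It relies only on what the example above shows, namely that each output is a dict with a `text` key. The helper name `to_jsonl`, the `completion` field name, and the sample data are hypothetical and not part of SGLang.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair prompts with engine outputs and serialize them as JSON Lines.

    Assumes each output is a dict with a "text" key, as in the example above.
    `to_jsonl` is an illustrative helper, not an SGLang function.
    """
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Stand-in data shaped like the return value of llm.generate(prompts, ...)
sample_prompts = ["The capital of France is"]
sample_outputs = [{"text": " Paris."}]
print(to_jsonl(sample_prompts, sample_outputs))
```

With a running engine, the same helper could presumably be called as `to_jsonl(prompts, outputs)` right after `llm.generate`.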

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student who enjoys playing the guitar and listening to music. I'm a bit of a bookworm and like to spend my free time reading and learning new things. I'm a pretty laid-back person and try to stay out of trouble. I'm not really sure what I want to do with my life yet, but I'm open to exploring different possibilities. I'm a bit shy, but I'm working on being more confident and outgoing. That's me in a nutshell. What do you think? Is it a good self-introduction? Should I add or change anything?
Your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris.  Paris is a city located in the northern part of France, along the Seine River. It is the country's largest city and a major cultural and economic center. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also famous for its fashion, cuisine, and romantic atmosphere.  Paris is a popular tourist destination and a hub for international business and finance.  It is home to many international organizations, including the United Nations Educational, Scientific and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Some experts predict that AI will become increasingly integrated into our daily lives, while others warn of the potential risks and challenges associated with its development. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with the potential to improve patient outcomes and reduce healthcare costs.
2. Widespread adoption of AI in education: AI has the potential to revolutionize the education sector by providing
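The "overlap removal" mentioned above matters when successive streamed chunks repeat text that has already been received. The following is a minimal sketch of that idea, not the actual source of `stream_and_merge`. The function names `trim_overlap` and `merge_chunks` and the sample chunks are assumptions made for illustration.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, skipping text repeated across boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats the tail of the previous one.
chunks = [" Paris is", " is the capital", " capital of France."]
print(merge_chunks(chunks))
```

A suffix/prefix heuristic like this can also remove genuine repetition (for example, two consecutive chunks that are both `"ha"`), so it is best suited to streams that are known to overlap.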



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Arin Vex. I'm a software engineer by trade, and I spend most of my free time reading and coding in my small apartment. I'm not particularly outgoing, but I enjoy the company of a few close friends and family members.
Arin is a bit of a loner, but he's not antisocial. He's willing to engage with others when the situation calls for it, but he prefers to spend his time alone, working on his projects or reading. He's not really interested in the typical social scene, but he's not opposed to having a good time either. He's a bit of a introverted nerd

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The city is home to world-famous landmarks such as the Eiffel Tower and Notre-Dame Cathedral. It is also known for its fashion, art, and cuisine. The city has a population of approximately 2.1 million people. The 
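Because `async_generate` is awaited, it can share an event loop with other work. The sketch below shows the pattern with `asyncio.gather`. To stay runnable without a GPU it uses a hypothetical `FakeAsyncEngine` whose `async_generate` only mimics the call shape used above (a list of prompts in, a list of dicts with a `text` key out). It is not SGLang's implementation, and how a real engine behaves under concurrent calls should be verified separately.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in that mimics the async_generate call shape above."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_jobs(engine, jobs, sampling_params):
    """Submit several independent prompt batches and await them together."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in jobs]
    # gather returns results in the same order as the submitted jobs
    return await asyncio.gather(*tasks)


jobs = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(run_jobs(FakeAsyncEngine(), jobs, {"temperature": 0.8}))
for batch, outputs in zip(jobs, results):
    for prompt, output in zip(batch, outputs):
        print(prompt, "->", output["text"])
```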

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through async_stream_and_merge, the overlap-aware helper, instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Clementine Wembly, and I'm a 35-year-old...: clear and concise. Do not include any overly personal or sensitive information. (e.g., do not mention their spouse or any health conditions.) I'm a software engineer, specializing in machine learning. I reside in a quiet neighborhood near the city center. I enjoy spending my free time reading and taking long walks.
Here's a rewritten version of the introduction with a bit more detail and a neutral tone:
Hi, I'm Clementine Wembly. I'm a software engineer with a focus on machine learning, currently working on several projects that aim to improve data



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is located in the northern part of the country and is home to the Eiffel Tower, the Louvre, and other famous landmarks.
What are some of the top attractions in Paris?
Some of the top attractions in Paris include the Eiffel Tower, the Louvre, Notre-Dame Cathedral, the Arc de Triomphe, and the Montmartre neighborhood.
What is the history of the Eiffel Tower?
The Eiffel Tower was built for the 1889 World’s Fair in Paris, which was held to commemorate the 100th anniversary of the French Revolution. It was designed by Gustave



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain, and its impact on various sectors and industries will be significant. Various trends and developments are expected to shape the future of AI, including advancements in machine learning, the growth of edge AI, and the increasing adoption of explainable AI.
Artificial intelligence (AI) is a rapidly evolving field with the potential to transform various aspects of our lives. The future of AI is uncertain, but several trends and developments are expected to shape its trajectory. Some possible future trends in AI include:
1. Advancements in Machine Learning: Machine learning, a subset of AI, will continue to improve, enabling AI systems to learn from data and make
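When streaming, it is often convenient to show chunks as they arrive and still keep the full text afterwards. The sketch below does this for any async iterator of string chunks, such as the one consumed in the loop above. `fake_stream` is a hypothetical stand-in for the real stream so the example runs without a model, and `collect_stream` is an illustrative helper, not an SGLang function.

```python
import asyncio


async def fake_stream(chunks, delay=0.0):
    """Hypothetical async generator that yields text chunks one at a time."""
    for chunk in chunks:
        await asyncio.sleep(delay)
        yield chunk


async def collect_stream(stream, on_chunk=None):
    """Consume an async iterator of text chunks and return the joined text."""
    parts = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)  # e.g. print the chunk as soon as it arrives
        parts.append(chunk)
    return "".join(parts)


def show(chunk):
    """Print a chunk immediately, without a trailing newline."""
    print(chunk, end="", flush=True)


text = asyncio.run(collect_stream(fake_stream([" Paris", ".", " It", " is"]), show))
print()
print(repr(text))
```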




In [6]:
llm.shutdown()
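
The final cell shuts the engine down explicitly. In a standalone script, one way to make sure this also happens when generation raises an exception is a small context manager. This is a sketch under the assumption that the engine only needs its `shutdown()` method called. `managed_engine` and `FakeEngine` are hypothetical names used so the example runs without loading a model.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in exposing only the methods used in this sketch."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.closed = True


@contextmanager
def managed_engine(factory):
    """Create an engine with factory() and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


with managed_engine(FakeEngine) as llm:
    outputs = llm.generate(["Hello, my name is"], {"temperature": 0.8})

print(llm.closed)
```

With the real engine, the factory could presumably be something like `lambda: sgl.Engine(model_path=...)`, using the same constructor call as at the top of this page.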