# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where a server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Tommy and I'm a salesman from California. I'm in Las Vegas for a conference and I thought I'd take a break and enjoy the sights and sounds of the city. I'm not much of a gambler, but I do love a good show. So, I decided to check out the Cirque du Soleil show at the Bellagio.
As I walked in, I was immediately struck by the grandeur of the theater. The set design was incredible, with towering walls and a massive stage. The audience was seated on either side of the stage, with a large catwalk above. I felt a little intimidated by the
Prompt: The president of the United States is
Generated text:  the head of state and head of government of the United States. The president leads the executive branch of the federal government and is the commander-in-chief of the United States Armed Forces. The president is indirectly elected by the people through the Electoral College. The president serves a four-year term in office. The current president is Joe Bid
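
For offline batch jobs it is often useful to persist results rather than print them. The following is a minimal sketch, assuming each output is a dict with a `text` field as in the loop above. The helper names `save_generations` and `load_generations`, the JSONL layout, and the `prompt`/`text` keys are illustrative choices, not part of SGLang.

```python
import json
from pathlib import Path


def save_generations(prompts, outputs, path):
    """Write one JSON object per line, pairing each prompt with its generated text."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for prompt, output in zip(prompts, outputs):
            # Only the 'text' field is kept here; other fields are ignored.
            record = {"prompt": prompt, "text": output["text"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def load_generations(path):
    """Read the JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

With the variables from the cell above, this could be called as `save_generations(prompts, outputs, "generations.jsonl")`, where the file name is only an example.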

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student who enjoys playing the guitar and listening to music. I'm a bit of a bookworm and like to read about history and science. I'm a pretty laid-back person who tries to stay out of trouble. I'm not really sure what I want to do with my life yet, but I'm open to exploring different possibilities. I'm a bit of a introvert and prefer to spend time alone or with a small group of close friends. I'm not really into sports or any other high-energy activities, but I do enjoy taking long walks and exploring new places. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country. It is situated on the Seine River. Paris is known for its cultural and historical significance, with many famous landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. The city has a population of over 2.1 million people and is a major hub for international business, fashion, and tourism. Paris is also known for its romantic atmosphere and is often referred to as the "City of Love." The city has a rich history dating back to the Roman era, and its architecture reflects a mix of medieval, Renaissance, and modern styles

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Some experts predict that AI will become increasingly integrated into our daily lives, while others warn of the potential risks and challenges associated with its development. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with the potential to improve patient outcomes and reduce healthcare costs.
2. Widespread adoption of AI in industries: AI is already being used in various industries, including finance,
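
To illustrate what "overlap removal" can mean when merging streamed chunks, here is a minimal sketch of one possible approach: before appending a chunk, drop the longest prefix of it that already appears at the end of the accumulated text. The function names `strip_repeated_prefix` and `merge_chunks` are hypothetical, and this document does not confirm that `stream_and_merge` works this way internally.

```python
def strip_repeated_prefix(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, removing text each chunk repeats from the previous ones."""
    merged = ""
    for chunk in chunks:
        merged += strip_repeated_prefix(merged, chunk)
    return merged


if __name__ == "__main__":
    print(merge_chunks(["The capital", " capital of", " of France"]))
```

One caveat of this heuristic: a chunk that legitimately repeats the preceding text (for example `"ha"` followed by `"ha"`) is also trimmed, so it is only appropriate when chunks are known to overlap.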



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text: ... (Here is where you write your fictional character's name, but it's incomplete. You will add the last name after this introduction has been approved.)

I am a 16-year-old high school student living in a small suburban town. I have long, curly brown hair and bright green eyes. I enjoy playing guitar and writing poetry. When I not in school, I spend most of my time with my close friends, exploring the nearby woods and trying new things.

How is the introduction? I'd like to know before I fill in the character's full name.

## Step 1: Assess the overall tone and content of the introduction

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the north-central part of the country. Paris is the largest city in France and is home to the Eiffel Tower, one of the most iconic landmarks in the world. The
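
When requests arrive independently, for example in a custom server, one option is to await one call per prompt and cap how many are in flight. The sketch below shows that pattern with plain `asyncio`. `FakeEngine` is a hypothetical stand-in that only mimics an `async_generate` coroutine returning a dict with a `text` field, and `generate_many` and `max_concurrency` are illustrative names. Whether the real engine benefits from this pattern compared with passing the whole list, as in the cell above, is an assumption to verify.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for an engine; it only mimics an async_generate coroutine."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_many(engine, prompts, sampling_params, max_concurrency=2):
    """Issue one request per prompt with at most max_concurrency requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def generate_one(prompt):
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # asyncio.gather returns results in the same order as the input prompts.
    return await asyncio.gather(*(generate_one(p) for p in prompts))


if __name__ == "__main__":
    demo_prompts = ["Hello, my name is", "The capital of France is"]
    demo_outputs = asyncio.run(
        generate_many(FakeEngine(), demo_prompts, {"temperature": 0.8})
    )
    for prompt, output in zip(demo_prompts, demo_outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

The semaphore keeps the number of simultaneous requests bounded, which can help when each request holds memory or when a downstream consumer is slow.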

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through async_stream_and_merge rather than calling async_generate directly,
        # so that overlapping text is removed from each chunk before it is printed
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Evelyn Winter and I'm a 25-year-old former librarian who has recently moved to the city to pursue a career in writing.
I want to change it to make it more engaging, but still neutral. How can I do this?
To make your self-introduction more engaging without losing its neutrality, consider adding a few personal details that showcase your character's personality, interests, or quirks. Here are some suggestions:
1. **Add a brief hobby or interest**: Mention a hobby or interest that gives a glimpse into your character's personality. For example: "Hello, my name is Evelyn Winter, and I'm a 25-year

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
France’s capital city is known for its rich history and cultural landmarks such as the Eiffel Tower and the Louvre Museum.
Paris has been ranked as the most visited city in the world due to its famous art, architecture, and cuisine.
The city of Paris is home to many famous artists and writers throughout history, including Claude Monet and Victor Hugo.
Paris has a distinct French flair, with a focus on elegance and sophistication.
Overall, the capital of France is a unique and captivating city that has something to offer for everyone.
Provide a concise factual statement about France’s capital city. The capital of France is Paris.
France

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by a convergence of technological advancements, societal needs, and regulatory frameworks. Here are some possible future trends in AI:
1. Increased Adoption in Industries: AI will continue to be adopted in various industries such as healthcare, finance, transportation, and education, leading to increased efficiency, productivity, and innovation.
2. Development of Explainable AI: As AI becomes more prevalent, there will be a growing need to understand how AI systems make decisions. Explainable AI (XAI) will become a key area of research and development to ensure transparency and trust in AI systems.
3. Rise of Edge AI: With the growth of


In [6]:
llm.shutdown()