# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  John, and I'm a software engineer.
I'm here to share some of my experiences and knowledge with you. I've been working in the software industry for over a decade, and I've had the privilege of working on a wide range of projects, from mobile apps to web applications to enterprise software.

My specialty is in building scalable and maintainable software systems, with a focus on web development using technologies like Node.js, Express.js, and React.js. I'm also proficient in a variety of databases, including MongoDB, PostgreSQL, and MySQL.

When I'm not coding, I enjoy hiking, playing guitar, and trying out new restaurants
Prompt: The president of the United States is
Generated text:  not an all-powerful, infallible, or unaccountable individual. While the president does have many powers and responsibilities, there are limits to their authority and mechanisms for checking their actions. The Constitution, laws, and other institutions ensure that th
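
For offline batch jobs it is often useful to persist each prompt alongside its completion. The sketch below assumes only what the cell above shows, namely that each element of `outputs` is a dict with a `"text"` key. The `to_records` and `write_jsonl` helpers and the sample data are hypothetical and are not part of SGLang.

```python
import json
from pathlib import Path


def to_records(prompts, outputs):
    """Pair each prompt with the generated text of the output at the same index."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    return [{"prompt": p, "completion": o["text"]} for p, o in zip(prompts, outputs)]


def write_jsonl(records, path):
    """Write one JSON object per line (JSON Lines format)."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
sample_prompts = ["The capital of France is"]
sample_outputs = [{"text": " Paris."}]
records = to_records(sample_prompts, sample_outputs)
```

With a live engine, `to_records(prompts, outputs)` could be called directly on the variables from the cell above, followed by `write_jsonl(records, "results.jsonl")`.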

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor. I live in a small apartment in the city with my cat, Luna. I enjoy reading, hiking, and trying out new restaurants. I'm a bit of a introvert, but I'm always up for a good conversation. I'm currently working on a novel and trying to learn more about the world of publishing. That's me in a nutshell. What do you think? Is there anything you'd like to add or change?
Your self-introduction is clear and concise, and it gives a good sense of who you are and what you're about. Here

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre Dame Cathedral. The city is also a major hub for international business, finance, and tourism. Paris is a popular destination for visitors from around the world, attracting over 23 million tourists each year. The city has a population of over 2.1 million people and is a major center for education, research,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. While it is difficult to predict exactly what the future will hold, there are several trends that are likely to shape the development and impact of artificial intelligence in the coming years.
1. Increased Adoption of AI in Various Industries:
AI is expected to become increasingly ubiquitous across various industries, including healthcare, finance, transportation, and education. As AI technology improves, more businesses will adopt AI-powered solutions to improve efficiency, reduce costs, and enhance customer experiences.
2. Advancements in Machine Learning and Deep Learning:
Machine learning and deep learning are key areas of AI research, and significant advancements are expected in the coming
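
The cell above relies on `stream_and_merge` to join streamed chunks "with overlap removal". To illustrate what overlap removal can mean, here is a minimal sketch of one possible approach: before appending a chunk, drop the longest prefix of the chunk that already ends the accumulated text. The function names `merge_with_overlap` and `merge_chunks` are hypothetical, and this is not presented as SGLang's actual implementation.

```python
def merge_with_overlap(merged: str, chunk: str) -> str:
    """Append chunk to merged, skipping the longest prefix of chunk
    that already appears as a suffix of merged."""
    max_overlap = min(len(merged), len(chunk))
    # Try the largest possible overlap first so the longest match wins.
    for size in range(max_overlap, 0, -1):
        if merged.endswith(chunk[:size]):
            return merged + chunk[size:]
    return merged + chunk  # no overlap found


def merge_chunks(chunks) -> str:
    """Fold a sequence of possibly overlapping chunks into one string."""
    merged = ""
    for chunk in chunks:
        merged = merge_with_overlap(merged, chunk)
    return merged


# Illustrative chunks where each one repeats the tail of the previous one.
chunks = [" Paris is", " is the capital", " capital of France."]
print(merge_chunks(chunks))
```

One limitation of this heuristic is that it cannot distinguish a true overlap from text that legitimately repeats: merging `"ha"` with `"ha"` yields `"ha"`, not `"haha"`.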



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Ada Blackwood. I'm a 25-year-old data analyst. I've worked at the information technology department of the local government for two years now. I enjoy reading science fiction novels and learning about artificial intelligence in my free time. I'm not really sure where I see myself in five years, but I'm looking forward to exploring different career opportunities.
The tone of the statement is neutral and professional, with no emotional appeal or persuasion. It is concise and to the point, focusing on the character's basic information and some of their hobbies and interests. This tone is useful for many real-life situations, such as filling out job applications,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is located in the northern part of France, in the Île-de-France region. It is situated on t
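
An asynchronous API is most useful when several requests are in flight at once. The sketch below shows the general `asyncio.gather` pattern for submitting multiple prompt groups concurrently. It runs against `FakeAsyncEngine`, a hypothetical stand-in that only mimics the call shape of `llm.async_generate` used above; whether the real engine accepts concurrent `async_generate` calls is an assumption that this document does not confirm.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in that mimics the call shape of llm.async_generate."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def generate_groups(engine, prompt_groups, sampling_params):
    """Submit several prompt batches concurrently and wait for all of them.

    asyncio.gather returns results in the same order as the awaitables passed in.
    """
    tasks = [engine.async_generate(group, sampling_params) for group in prompt_groups]
    return await asyncio.gather(*tasks)


async def demo():
    engine = FakeAsyncEngine()
    groups = [
        ["Hello, my name is"],
        ["The capital of France is", "The future of AI is"],
    ]
    return await generate_groups(engine, groups, {"temperature": 0.8, "top_p": 0.95})


results = asyncio.run(demo())
```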

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Amos Blackwood, and I live in the small town of Raven's Peak. I'm a 25-year-old photographer who's trying to make a name for myself. I work at the local newspaper, taking pictures and writing articles about the town and its people. I enjoy capturing the beauty in everyday moments and telling the stories of those around me. I'm a bit of a introvert, but I'm always eager to meet new people and hear their stories. That's me in a nutshell – or rather, a camera lens.
Amos Blackwood, 25

Photographer at the Raven's Peak newspaper

https://www



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The eiffel tower in Paris. The Eiffel Tower in Paris is an iconic symbol of France and is one of the most famous landmarks in the world.
Provide a concise factual statement about France’s capital city. Paris is the capital of France.
The most famous art museum in France, the Louvre Museum. The Louvre Museum is one of the world’s largest and most visited museums.
Provide a concise factual statement about France’s capital city. Paris, the capital of France, is situated in the north-central part of the country.
The beautiful Seine River runs through the heart of Paris. The Seine River runs



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by various factors, including advancements in computing power, data availability, and the development of new algorithms and techniques. Some possible future trends in AI include:
1. Increased use of Explainable AI (XAI): As AI becomes more prevalent in decision-making processes, there will be a growing need to understand how these decisions are made. XAI aims to provide transparent and interpretable AI systems that can explain their reasoning and decision-making processes.
2. Emergence of Edge AI: With the proliferation of IoT devices and the need for real-time processing, Edge AI is likely to become more prominent. This involves processing data at the
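
The streaming cell prints each chunk as it arrives. When the complete text is also needed afterwards, the chunks can be accumulated while they are displayed. The sketch below shows that pattern with a plain async generator: `fake_stream` is a hypothetical stand-in for `async_stream_and_merge(llm, prompt, sampling_params)`, and `collect_stream` is an illustrative helper, not an SGLang function.

```python
import asyncio


async def fake_stream(text, chunk_size=4):
    """Hypothetical stand-in for a streaming call: yields text in small pieces."""
    for start in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield text[start:start + chunk_size]


async def collect_stream(stream, on_chunk=None):
    """Consume an async iterator of text chunks and return the joined string.

    on_chunk, if given, is called with every chunk as soon as it arrives,
    for example to print it incrementally.
    """
    parts = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)


full_text = asyncio.run(collect_stream(fake_stream(" Paris is the capital of France.")))
```

Passing `on_chunk=lambda c: print(c, end="", flush=True)` would reproduce the incremental printing from the cell above while still returning the full text.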




In [6]:
llm.shutdown()
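
In a script, as opposed to a notebook, an exception raised during generation could skip the final `llm.shutdown()` call. One way to guard against that is a small context manager, sketched below. `shutting_down` is a hypothetical helper, and `FakeEngine` is a stand-in exposing only the `generate` and `shutdown` methods used in this document, so the example runs without loading a model.

```python
from contextlib import contextmanager


@contextmanager
def shutting_down(engine):
    """Yield the engine and call engine.shutdown() on exit, even after an error."""
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Hypothetical stand-in with the two methods used in this document."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


engine = FakeEngine()
with shutting_down(engine) as llm:
    outputs = llm.generate(["Hello, my name is"], {"temperature": 0.8, "top_p": 0.95})
```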