# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
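
As a rough illustration of the custom-server idea, the sketch below wraps an engine in a plain request handler. `StubEngine` and `handle_generate` are hypothetical names introduced here, not part of SGLang. The stub only imitates the `generate(prompts, sampling_params)` call shape used later in this document so that the snippet runs without a GPU or a model. The linked `custom_server.py` remains the reference example.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a model."""

    def generate(self, prompts, sampling_params):
        # Mirrors the call shape used in this document: a list of prompts in,
        # a list of dicts with a "text" field out.
        return [{"text": f" (echo) {p}"} for p in prompts]


def handle_generate(engine, request_body: str) -> str:
    """Parse a JSON request, run the engine, and return a JSON response."""
    payload = json.loads(request_body)
    prompt = payload["prompt"]
    sampling_params = payload.get("sampling_params", {})
    outputs = engine.generate([prompt], sampling_params)
    return json.dumps({"text": outputs[0]["text"]})


engine = StubEngine()
response = handle_generate(
    engine, json.dumps({"prompt": "Hello", "sampling_params": {"temperature": 0.8}})
)
print(response)
```

In a real server, a web framework route would call a function like `handle_generate` with an actual engine instance in place of the stub.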

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Pooja and I'm a web designer and front-end developer. I'm passionate about creating user-friendly and visually appealing websites. I love coding in HTML, CSS, and JavaScript, and experimenting with new design trends and technologies.
Here are some of my strengths:
Design and development of websites, mobile applications, and web applications using HTML, CSS, JavaScript, and other front-end development tools.
Understanding of responsive design principles and ability to create websites that work seamlessly on various devices and screen sizes.
Experience with popular front-end frameworks and libraries, including Bootstrap, React, and Angular.
Knowledge of accessibility guidelines and best practices, ensuring that websites are
Prompt: The president of the United States is
Generated text:  not elected through direct vote by the people, but rather through the electoral college system established by the Founding Fathers. In this system, each state is 

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and artist living in a small town in the Pacific Northwest. I enjoy hiking, reading, and experimenting with new recipes in my spare time. I'm a bit of a introvert, but I'm always up for a good conversation. I'm currently working on a novel and a few art projects, and I'm excited to see where my creative endeavors take me.
This self-introduction is neutral because it doesn't reveal too much about Kaida's personality, background, or motivations. It simply presents a brief overview of who she is and what she does. This can be helpful

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris. This is a concise and factual statement about the capital city of France. It does not include any additional information or opinions, making it a clear and direct statement of fact. This type of statement is often used in encyclopedias, dictionaries, and other reference materials where accuracy and brevity are essential. It is also a good starting point for further research or discussion about the city of Paris and its significance in France. The statement is neutral and does not express any opinion or emotion, which is another characteristic of a concise factual statement. Overall, the statement is clear, direct, and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Some experts predict that AI will become increasingly integrated into our daily lives, while others warn of the potential risks and challenges associated with its development. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with the potential to improve patient outcomes and reduce healthcare costs.
2. Widespread adoption of AI in education: AI has the potential to revolutionize the way we learn, with
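
The streaming example above mentions "overlap removal". The sketch below shows one way such merging could work, assuming that consecutive streamed chunks may repeat text from the end of what has already been received. The names `strip_overlap` and `merge_chunks` and the simulated chunks are illustrative assumptions; this is not necessarily how `stream_and_merge` is implemented.

```python
def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Simulated stream in which each chunk repeats the tail of the previous one.
simulated_chunks = ["The capital", "capital of France", "France is Paris."]
merged = merge_chunks(simulated_chunks)
print(merged)
```

Chunks that share no text with what came before are appended unchanged, so the same helper also works for streams without any overlap.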



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Lydia Singh. I'm 24 years old and work as a freelance writer. I'm originally from New Delhi, India, but I've been living in Tokyo, Japan for the past 5 years. I'm passionate about language and culture, and I enjoy exploring different neighborhoods and trying new foods in the city. That's me in a nutshell. I'm looking forward to getting to know you better.
Hello, my name is Zara Ali. I'm a 29-year-old artist living in Brooklyn, New York. I'm originally from London, England, and I moved to the States for college. I've been living in New

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. From a geographical perspective, the city is located in the northern part of the country, in the Île-de-France region. The city is situated along the Seine River, which plays a significant role in its history and development
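
Because the asynchronous call is awaited, it can be combined with other coroutines on the same event loop. The sketch below illustrates that pattern with `asyncio.gather`. `fake_async_generate` is a hypothetical stand-in that simulates latency with `asyncio.sleep`; it is not an SGLang function and is used only so the example runs without a model.

```python
import asyncio


async def fake_async_generate(prompt: str, delay: float) -> dict:
    """Hypothetical stand-in for an awaitable generation call."""
    await asyncio.sleep(delay)  # simulates time spent generating
    return {"text": f" (generated for) {prompt}"}


async def run_all(prompts):
    # gather() runs all coroutines concurrently and returns results in the
    # same order as the inputs, regardless of which one finishes first.
    tasks = [
        fake_async_generate(p, delay=0.02 * (len(prompts) - i))
        for i, p in enumerate(prompts)
    ]
    return await asyncio.gather(*tasks)


demo_prompts = ["first", "second", "third"]
results = asyncio.run(run_all(demo_prompts))
for prompt, output in zip(demo_prompts, results):
    print(f"{prompt!r} -> {output['text']!r}")
```

Here the first prompt is given the longest simulated delay, yet its result still comes back first in the list, so results can be safely zipped with the prompts.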

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Helen Zara. I'm a 25-year-old historian and writer, currently living in Oxford, England. I have a degree in medieval history and a passion for reimagining the past in my fiction writing. When I'm not researching or writing, you can find me wandering the university's old colleges or exploring the city's hidden corners. I'm excited to meet you and learn more about your interests! This self-introduction is neutral because it doesn't reveal any biases or personal preferences. It presents the character's background and interests in a straightforward and respectful manner. The tone is friendly and inviting, making it suitable for a variety of

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the northern-central part of the country. Paris is situated on the Seine River and is the center of France’s government, economy, and culture. It is known for its rich history, famous landmarks such as the Eiffel Tower, and its artistic and intellectual heritage. Paris is a major tourist destination and a symbol of French culture and elegance. It has been the capital of France since 987 AD. The city has a population of over 2.1 million people, making it the second-most populous city in the European Union after London. Paris is also a major hub for business, finance, and education,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a topic of much speculation and debate. Some potential trends that may emerge in the future of artificial intelligence include:
1. Increased Adoption of AI in Industries
2. Improved Efficiency and Productivity
3. Enhanced Human-AI Collaboration
4. Greater Use of Explainable AI
5. More Advanced Robotics and Autonomous Systems
6. Increased Focus on Ethics and Fairness
7. Growing Importance of Transfer Learning
8. Rise of Edge AI
9. Expansion of Conversational AI
10. Increased Use of AI in Healthcare
Artificial intelligence (AI) is becoming increasingly pervasive in our daily lives, transforming the way we live,

In [6]:
llm.shutdown()