# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
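
As a rough illustration of the custom-server idea, the sketch below puts an engine-like object behind a small JSON endpoint using only Python's standard library. It is not the linked `custom_server.py`. The names `EchoEngine`, `handle_generate`, `make_handler`, `serve`, the `/generate` route, and the request fields are all hypothetical, and `EchoEngine` only stands in for `sgl.Engine` so the example runs without a model.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine; returns a dict with a 'text' key."""

    def generate(self, prompt, sampling_params=None):
        return {"text": f" (echo of: {prompt})"}


def handle_generate(engine, raw_body: bytes) -> bytes:
    """Parse a JSON request, call the engine, and return a JSON response."""
    request = json.loads(raw_body)
    output = engine.generate(request["prompt"], request.get("sampling_params"))
    return json.dumps({"text": output["text"]}).encode("utf-8")


def make_handler(engine):
    """Build a request handler class bound to one engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/generate":  # hypothetical route
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            body = handle_generate(engine, self.rfile.read(length))
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return GenerateHandler


def serve(engine, host="127.0.0.1", port=8000):
    """Serve requests until interrupted (blocking call)."""
    HTTPServer((host, port), make_handler(engine)).serve_forever()
```

Keeping the request logic in `handle_generate` separates it from the transport, so the same function could sit behind a different web framework. Swapping in a real engine assumes its `generate` call returns a dict with a `text` key for a single prompt, as the stand-in does here.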

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Carrie, and I'm a writer, editor, and social media specialist with a passion for storytelling and helping others find their voice.
I help entrepreneurs and small business owners develop a strong online presence through engaging content, strategic social media marketing, and compelling storytelling. Whether you're launching a new product, service, or business, I can help you craft a message that resonates with your target audience and sets you apart from the competition.
As a writer, I specialize in creating content that is informative, entertaining, and engaging. My services include blog posts, articles, social media posts, website copy, and more. I've worked with a variety
Prompt: The president of the United States is
Generated text:  elected by the Electoral College, not directly by the people. This is a fact that has been consistently ignored by the media and the general public for decades, and it's time to change that. In this article, we'
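
For offline batch jobs it is often convenient to persist the results. The sketch below pairs each prompt with its output and writes one JSON object per line. It assumes only that each output is a dict with a `text` key, as in the loop above. `save_generations`, the file name, and the sample data are hypothetical.

```python
import json
import os
import tempfile


def save_generations(prompts, outputs, path):
    """Write one {"prompt", "text"} record per line (JSON Lines)."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    with open(path, "w", encoding="utf-8") as f:
        for prompt, output in zip(prompts, outputs):
            record = {"prompt": prompt, "text": output["text"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(prompts)


# Sample data shaped like the result of llm.generate(prompts, sampling_params).
sample_prompts = ["The capital of France is"]
sample_outputs = [{"text": " Paris."}]

out_path = os.path.join(tempfile.mkdtemp(), "generations.jsonl")
written = save_generations(sample_prompts, sample_outputs, out_path)
print(f"Wrote {written} record(s) to {out_path}")
```

The length check guards against silently dropping records, since `zip` stops at the shorter of its inputs.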

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and artist living in a small town in the Pacific Northwest. I enjoy hiking, reading, and trying out new recipes in my spare time. I'm a bit of a introvert, but I'm always up for a good conversation. I'm currently working on a novel and a few art projects, and I'm excited to see where they take me. That's me in a nutshell! What do you think? Is there anything you'd like to add or change?
Here are a few suggestions to make your self-introduction more engaging and effective:
1.  **Add a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris. This is a concise factual statement about France’s capital city. It is a simple and direct statement that provides the necessary information without any additional details or opinions. It is a good example of a factual statement because it is based on verifiable evidence and is widely accepted as true. The statement is also concise, meaning it is brief and to the point, making it easy to understand and remember. Overall, this statement is a good example of a factual statement about France’s capital city. 
This response meets the requirements of providing a concise factual statement about France’s capital city. It is a simple and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Some experts predict that AI will become increasingly integrated into our daily lives, while others warn of the potential risks and challenges associated with its development. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with the potential to revolutionize the way we diagnose and treat diseases.
2. Widespread adoption of AI in education: AI has the potential to revolutionize the way we
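
The cell above relies on `stream_and_merge` to produce one clean string per prompt. To illustrate what "overlap removal" can mean, the sketch below merges streamed chunks while dropping any prefix of a new chunk that repeats the tail of the text collected so far. It is an illustrative reimplementation under that assumption, not necessarily what `sglang.utils` does internally. `drop_overlap` and `merge_chunks` are names local to this sketch.

```python
def drop_overlap(existing: str, chunk: str) -> str:
    """Remove the longest prefix of `chunk` that is also a suffix of `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, skipping text that repeats the end of the result."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


# Each chunk repeats part of what came before it.
chunks = [" Paris", " Paris is", " is the capital"]
print(repr(merge_chunks(chunks)))  # → ' Paris is the capital'
```

A suffix/prefix heuristic like this cannot tell an accidental repeat from a legitimate one, so genuinely repeated text (for example two consecutive identical chunks) would also be collapsed.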



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Astrid, and I am a 22-year-old freelance writer living in Portland, Oregon. I have a degree in creative writing and have written for several local publications. In my free time, I enjoy hiking, reading, and attempting to cook. That’s a little bit about me.
This introduction is neutral because it doesn’t reveal any personal biases or opinions, and it doesn’t imply any sort of authority or expertise. It simply presents the facts of Astrid’s identity and background in a straightforward way. This kind of introduction is often useful in professional or academic settings, where you want to establish your credentials without revealing too much about your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide an example of a significant historical event that occurred in Paris, the capital of France. The Frenc
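
Asynchronous generation is most useful when several requests should be in flight at once. The sketch below shows that pattern with `asyncio.gather` and a semaphore that caps concurrency. `fake_async_generate` is a hypothetical stand-in for a call such as `llm.async_generate(prompt, sampling_params)`, so the example runs without a model. `generate_all` and `max_concurrency` are likewise names invented for this sketch.

```python
import asyncio


async def fake_async_generate(prompt, sampling_params=None):
    """Hypothetical stand-in for an engine's async generate call."""
    await asyncio.sleep(0.01)  # simulate inference latency
    return {"text": f" <completion for {prompt!r}>"}


async def generate_all(prompts, generate_fn, max_concurrency=2):
    """Run one request per prompt, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(prompt):
        async with semaphore:
            return await generate_fn(prompt)

    # gather preserves input order, so outputs line up with prompts.
    return await asyncio.gather(*(run_one(p) for p in prompts))


demo_prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
demo_outputs = asyncio.run(generate_all(demo_prompts, fake_async_generate))
for prompt, output in zip(demo_prompts, demo_outputs):
    print(f"{prompt!r} ->{output['text']}")
```

Note that `asyncio.run` raises a `RuntimeError` when it is called from a thread that already has a running event loop, which is the case in some notebook environments. There, awaiting the coroutine directly is one alternative.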

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Aria Flynn. I'm a 25-year-old freelance writer and editor living in Los Angeles. I have a cat named Jasper and enjoy hiking and trying out new coffee shops in my free time. That's me in a nutshell.
Aria Flynn is a 25-year-old freelance writer and editor living in Los Angeles. She enjoys hiking and trying out new coffee shops in her free time, and is often accompanied by her cat, Jasper. Aria is a professional wordsmith with a keen eye for detail and a knack for crafting compelling stories. She's always on the lookout for new adventures and is not afraid to take risks in pursuit


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide a brief history of France’s capital city, including the founding date of the city. Paris, France, has a rich history dating back to the 3rd century BC. It was founded by the Celtic people, but it was the Romans who named it Lutetia and made it a major trading center. In the 5th century, the city was renamed Paris by the Frankish king Clovis I and it has remained the capital of France since then.
Describe the geographical features of France’s capital city. Paris is situated on the Seine River and is nestled in the Île-de-France region. The



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased adoption of AI in various industries: AI is expected to become more widespread in various industries, including healthcare, finance, transportation, and education.
2. Advancements in explainability and transparency: As AI becomes more pervasive, there will be a growing need for explainability and transparency in AI decision-making processes.
3. Rise of autonomous systems: Autonomous systems, such as self-driving cars and drones, are expected to become more common in the future.
4. Development of hybrid




In [6]:
llm.shutdown()
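
Calling `shutdown()` by hand is easy to miss if an exception interrupts a batch job. One way to make the cleanup automatic is a small context manager, sketched below. `engine_session` and `DummyEngine` are hypothetical names, and `DummyEngine` only stands in for a real engine so the example runs without a model. The only assumption about the engine is that it exposes a `shutdown()` method, as used above.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(engine):
    """Yield `engine` and always call its shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


class DummyEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params=None):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


dummy = DummyEngine()
try:
    with engine_session(dummy) as llm_like:
        llm_like.generate(["Hello"])
        raise RuntimeError("simulated failure mid-batch")
except RuntimeError:
    pass

print(dummy.is_shut_down)  # → True
```

The `finally` clause runs whether the body completes or raises, so the engine is released in both cases.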