# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Sandy F. and I am a tech-obsessed, crafty, music-loving, anxious introvert. I'm a freelance writer and editor, and I spend most of my days working from home (which is my happy place).
As a freelance writer and editor, I have the luxury of working on a wide range of projects, from articles to blog posts to social media content. I love learning about new topics and figuring out how to make them engaging and accessible to my audience.
When I'm not working, you can find me listening to music (I'm a huge fan of 80s and 90s alt rock and indie
Prompt: The president of the United States is
Generated text:  a top dog. There is no doubt about that. And as such, they hold immense power and authority. With great power comes great responsibility, and the president has a duty to uphold the law and protect the rights of all citizens.
But what happens when a president's personal life and behavior clash with their duties as a public servant? When does a presid

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in a small town in the Pacific Northwest. I enjoy hiking and reading in my free time. I'm currently working on a novel, but I'm not sure if it will ever be finished. I'm a bit of a introvert and prefer to spend time alone, but I do have a few close friends who I value greatly. That's me in a nutshell. What do you think? Is this a good self-introduction for a character?
This is a good self-introduction for a character. It provides a clear sense of who Kaida is, what she does

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris. The city is located in the northern part of the country, along the Seine River. Paris is known for its rich history, cultural landmarks, and romantic atmosphere. It is home to many famous museums, such as the Louvre and the Orsay, as well as iconic landmarks like the Eiffel Tower and Notre-Dame Cathedral. The city is also a major hub for fashion, cuisine, and entertainment. Paris is a popular tourist destination and a significant cultural center in Europe. The city has a population of over 2.1 million people and is a major economic and political center in

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. While it is difficult to predict exactly what the future will hold, here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with the potential to revolutionize the way we diagnose and treat diseases.
2. Widespread adoption of AI in industries: AI is already being used in various industries such as finance, transportation, and customer service. In the future, AI is likely
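
The "overlap removal" mentioned above refers to merging streamed chunks without repeating text that has already been emitted. The following is a minimal sketch of that idea in plain Python. The names `trim_overlap` and `merge_chunks` and the sample chunks are assumptions made for illustration, and the code is not necessarily how `stream_and_merge` is implemented.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that existing_text already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks, dropping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical stream in which later chunks repeat the tail of the earlier text.
example_chunks = [" Paris", " Paris. The", " The city"]
merged_text = merge_chunks(example_chunks)
print(repr(merged_text))
```

One trade-off of this heuristic is that it cannot tell an accidental overlap from a legitimate repetition: two consecutive chunks that are both `"ha"` would merge to `"ha"`, so it only suits streams where boundary repeats are known to be artifacts.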



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Lyra Blackwood, and I’m a freelance journalist with a passion for investigative reporting and storytelling.
Write a short, neutral self-introduction for a fictional character.
Hello, my name is Lyra Blackwood, and I'm a freelance journalist with a passion for investigative reporting and storytelling. ## Step 1: Determine the purpose of the introduction
The purpose of the introduction is to provide a brief overview of Lyra's profession and interests, without revealing any personal biases or opinions.

## Step 2: Use a formal tone
Use a formal tone to convey a sense of professionalism and objectivity.

## Step 3: Keep it

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. France is a country located in Europe, known for its rich history, art, fashion, and cuisine. The capital city of France is also called Pa
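
Because `async_generate` is awaited, it can be combined with other coroutines in the same event loop. The sketch below illustrates one possible pattern, issuing one request per prompt and collecting the results with `asyncio.gather`. `StubEngine` is a hypothetical stand-in that only imitates the awaitable call and the `{"text": ...}` output shape so the snippet runs without a GPU; it is not part of SGLang.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine so this sketch runs without a GPU."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate generation latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params):
    # Create one awaitable per prompt; gather returns results in input order.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


demo_prompts = ["The capital of France is", "The future of AI is"]
demo_outputs = asyncio.run(
    generate_all(StubEngine(), demo_prompts, {"temperature": 0.8})
)
for prompt, output in zip(demo_prompts, demo_outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

With a real engine, passing the whole prompt list in a single call, as in the cell above, is the simpler option; the per-prompt form is mainly useful when requests arrive at different times.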

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling llm.async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elianore Quasar, and I'm a research astrobiologist studying the atmospheric conditions on distant exoplanets. I'm fascinated by the potential for life beyond Earth and enjoy exploring the intersection of science and philosophy. That's me in a nutshell.
Elianore Quasar is a research astrobiologist who studies the atmospheric conditions on distant exoplanets. She's fascinated by the potential for life beyond Earth and enjoys exploring the intersection of science and philosophy.
This response is neutral and gives a basic introduction to the character. It provides some context about their profession and interests, but doesn't reveal too much about their personality or background.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The city is situated in the north-central region of the country. Paris is a global hub for art, fashion, and culture. It hosts many world-famous museums, such as the Louvre, and landmarks like the Eiffel Tower. The city is home to over 2.1 million people and is a major economic center.
The best answers will be concise, factual, and free of bias. They will also provide a clear and accurate statement about the topic.
Here are some examples of good answers:
The capital of France is Paris.
The city of Paris is located in the north-central region of France.
Paris is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a complex and multifaceted topic, with various potential trends and developments on the horizon. Some possible future trends in AI include:

1. **Increased focus on human-centered AI**: As AI becomes more pervasive, there will be a growing emphasis on ensuring that AI systems are designed with human values and ethics in mind. This may involve the development of more transparent, explainable, and accountable AI systems.
2. **Advances in natural language processing (NLP)**: NLP has made significant progress in recent years, and future advancements in this area could enable more sophisticated and nuanced human-AI interactions.
3. **Expansion

In [6]:
llm.shutdown()