# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Allison and I am a 4th grader at a local elementary school. I recently learned that my school is going to be building a new playground. I am super excited about this, but also a little worried because I have heard that the new playground may be really expensive and might require some fundraising to make it happen.
As part of the fundraising effort, the PTA (Parent-Teacher Association) is planning to host a bake sale. I was thinking that I could make some extra money by baking and selling some treats at the bake sale. My mom said that I could use her kitchen and ingredients to make some goodies, but she
Prompt: The president of the United States is
Generated text:  the head of the federal government of the United States. The president serves as both the head of state and head of government for the United States. The president is elected by the citizens of the United States through the Electoral College system, and serves a four-year term. The p
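
For offline batch jobs it is often useful to keep each prompt together with its generated text. The following is a minimal sketch, assuming only what the cell above shows: `generate` takes a list of prompts and returns one dict per prompt with a `"text"` key, in the same order as the prompts. `StandInEngine` and `collect_records` are hypothetical names; the stand-in exists solely so the snippet runs without a GPU, and a real `sgl.Engine` would be passed in its place.

```python
import json


class StandInEngine:
    """Hypothetical stand-in that mimics the output shape of llm.generate."""

    def generate(self, prompts, sampling_params):
        # One dict per prompt, each with a "text" key, as in the cell above.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def collect_records(engine, prompts, sampling_params):
    """Pair each prompt with its generated text, preserving input order."""
    outputs = engine.generate(prompts, sampling_params)
    return [
        {"prompt": prompt, "text": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


records = collect_records(
    StandInEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)

# Serialize as JSON Lines, one record per line.
jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```

Keeping the records as plain dicts makes it straightforward to write them to a file or load them into a dataframe later.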

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 22-year-old student at the University of Tokyo, studying environmental science. I'm originally from a small town in Hokkaido, where I grew up surrounded by nature and developed a strong appreciation for the outdoors. I'm interested in sustainable development and conservation, and I'm excited to learn more about the complex relationships between human societies and the natural world. I'm a bit of a bookworm, but I also enjoy hiking and trying out new foods. I'm looking forward to meeting new people and making connections in my field. How would you like to proceed? Would you like to: A) Ask

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is the largest city in France and is located in the northern part of the country. It is situated on the Seine River and is known for its beautiful architecture, art museums, and fashion industry. Paris is also home to many famous landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. The city has a population of over 2.1 million people and is a major cultural and economic center in Europe. Paris is also known for its romantic atmosphere and is often referred to as the "City of Light." The city has a rich history dating back to the Middle Ages and has been

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. While it is difficult to predict exactly what the future holds, here are some possible trends that may shape the development and impact of artificial intelligence in the coming years:
1. Increased focus on explainability and transparency: As AI becomes more pervasive in our lives, there is a growing need for AI systems to be transparent and explainable in their decision-making processes. This will help build trust in AI and ensure that it is used responsibly.
2. Advancements in natural language processing: Natural language processing (NLP) is a key area of AI research, and we can expect to see significant advancements in this
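
The "overlap removal" mentioned above refers to merging streamed chunks without repeating text that has already been received. The sketch below illustrates one way such a merge could work. It is an illustration only: the actual `stream_and_merge` in `sglang.utils` may be implemented differently. `trim_overlap` and `merge_chunks` are illustrative names, and the chunk list is made-up sample data.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, trimming any overlap with the text merged so far."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Made-up chunks in which each one repeats the tail of the previous text.
chunks = [" Paris", " Paris is the", " the largest city"]
merged = merge_chunks(chunks)
print(merged)
```

One limitation of this heuristic is that it cannot tell an overlapping chunk from a genuine repetition: `trim_overlap("ha", "ha")` returns an empty string, so intentionally repeated text would be dropped.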



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Echo. I'm a 25-year-old woman who is currently residing in a small town in the Pacific Northwest. I have a passion for reading and writing, and I spend most of my free time exploring the woods and taking photographs. I'm an introverted person who values my alone time and likes to keep to myself, but I'm always up for a quiet conversation or a friendly chat. I'm curious about the world around me and enjoy learning new things, but I don't have any particular goals or aspirations at this point in my life. I'm just taking things as they come and enjoying the simple pleasures. What can you tell

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
This fact is useful because it allows individuals to accurately identify the capital of France. The simplicity and clarity of this fact make it a valuable piece of info
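
Because `async_generate` is awaited, it can share an event loop with other coroutines. A minimal sketch of that pattern follows, using a hypothetical `StandInAsyncEngine` so that it runs without a model. Only the call shape `await llm.async_generate(prompts, sampling_params)` comes from the cell above; whether a real engine benefits from several concurrent batches is not something this sketch demonstrates.

```python
import asyncio


class StandInAsyncEngine:
    """Hypothetical stand-in mimicking the awaitable call shown above."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate waiting on the engine
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Schedule one awaitable per batch; gather returns results in input order.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(
    run_batches(StandInAsyncEngine(), batches, {"temperature": 0.8, "top_p": 0.95})
)
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

In an environment where an event loop is already running, `asyncio.run` may raise an error; awaiting `run_batches(...)` directly is the usual alternative there.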

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Ethan. I'm a 28-year-old freelance writer living in Seattle. I enjoy hiking and trying out new restaurants.
I'm a 27-year-old software engineer working in San Francisco. In my free time, I like to play basketball and read sci-fi novels.
Your turn. Write a short, neutral self-introduction for a character you've created. Use 1-2 paragraphs and stick to the basics.
My name is Maya, and I'm a 25-year-old artist living in New York City. I have a degree in fine arts and spend most of my time painting and working on various creative projects.
I'm a



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is located in the northern part of the country and is situated on the Seine River. Paris is known for its iconic landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city is also famous for its fashion, cuisine, and art. It is a popular tourist destination and is home to many international organizations and institutions, including the United Nations Educational, Scientific and Cultural Organization (UNESCO).
To answer this question correctly, you would need to demonstrate knowledge of basic facts about France’s capital city. This would involve recalling information about Paris’s location, notable landmarks, cultural significance,



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by a combination of technological advancements, societal needs, and ethical considerations.
## Step 1: Technological Advancements

The future of artificial intelligence (AI) will be significantly influenced by technological advancements. Several key areas are expected to drive AI development:
- **Advancements in Machine Learning (ML) Algorithms:** More sophisticated ML algorithms will allow for better pattern recognition, decision-making, and problem-solving.
- **Increased Use of Deep Learning:** Deep learning techniques, especially those based on neural networks, will continue to improve the efficiency and accuracy of AI systems.
- **Rise of Edge AI:** As devices become more connected
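
The streaming loop in this section prints each chunk as it arrives. When the full text is also needed afterwards, the chunks can be accumulated while streaming. A small sketch, assuming the stream yields plain strings as the `async for` loop over `async_stream_and_merge` suggests; `stand_in_stream` and `consume` are hypothetical helpers used so the snippet runs without a model.

```python
import asyncio


async def stand_in_stream(chunks):
    """Hypothetical async generator standing in for a streaming engine call."""
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield chunk


async def consume(stream):
    """Print chunks as they arrive and return the concatenated text."""
    pieces = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        pieces.append(chunk)
    print()  # newline once the stream is exhausted
    return "".join(pieces)


full_text = asyncio.run(consume(stand_in_stream([" Paris", ".", " It", " is"])))
```

Collecting the pieces in a list and joining once at the end avoids repeated string concatenation for long generations.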




In [6]:
llm.shutdown()