# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
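
As a rough illustration of the custom-server use case, the core of such a server can be a plain function that maps a request body to an engine call. The sketch below is framework-agnostic and is not taken from the linked `custom_server.py`. `handle_generate`, the payload keys, and `EchoEngine` (a stand-in so the snippet runs without a model) are hypothetical names; the only assumption about the engine is that `generate` takes a list of prompts plus sampling parameters and returns one `{"text": ...}` dict per prompt, as in the examples in this document.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for the engine so the sketch runs without a model."""

    def generate(self, prompts, sampling_params):
        # Mimic the shape used in this document: one {"text": ...} dict per prompt.
        return [{"text": f" (echo) {p}"} for p in prompts]


def handle_generate(engine, raw_body: str) -> str:
    """Turn a JSON request body into a JSON response using the engine."""
    payload = json.loads(raw_body)
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        return json.dumps({"error": "'prompt' must be a non-empty string"})
    sampling_params = payload.get("sampling_params", {})
    outputs = engine.generate([prompt], sampling_params)
    return json.dumps({"text": outputs[0]["text"]})


response = handle_generate(EchoEngine(), json.dumps({"prompt": "Hello"}))
print(response)
```

Keeping the handler independent of any web framework means the same function can be called from whatever HTTP layer, queue consumer, or RPC stub the custom server uses.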

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Caralyn, and I am a junior at Vanguard University. I am majoring in Psychology with a minor in Theology. I have been drawn to counseling for a long time, and I am excited to be a part of this program. I am passionate about creating a safe and non-judgmental space for people to share their thoughts and feelings.
I am excited to be working with Dr. Kim and the other interns on this project. I have learned a lot from Dr. Kim's research and I am grateful for her guidance and support.
I have had experience working with clients in a clinical setting, and I have seen firsthand
Prompt: The president of the United States is
Generated text:  to be questioned by police in a probe into his connection to a sex worker, the BBC has learned.
The US justice department has obtained a search warrant that will allow law enforcement officers to question Joe Biden about a woman who worked as a sex worker.
It is believed the woman is a key figure in a wider investig
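
The loop above zips `prompts` with `outputs`, which relies on the engine returning one result per prompt in input order. Under that assumption, results can be collected into records for later processing. This is a minimal sketch: `to_records` is a hypothetical helper, and `demo_outputs` is hand-written stand-in data in the same `{"text": ...}` shape, not real model output.

```python
def to_records(prompts, outputs):
    """Pair each prompt with its generated text, preserving input order."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    return [
        {"prompt": prompt, "completion": output["text"].strip()}
        for prompt, output in zip(prompts, outputs)
    ]


# Stand-in data shaped like the engine's output (not real generations).
demo_prompts = ["The capital of France is", "Hello, my name is"]
demo_outputs = [{"text": " Paris."}, {"text": " Ada."}]

for record in to_records(demo_prompts, demo_outputs):
    print(record)
```

The explicit length check guards against `zip` silently dropping items if the two lists ever get out of sync.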

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and artist living in a small town in the Pacific Northwest. I enjoy hiking, reading, and trying out new recipes in my free time. I'm a bit of a introvert, but I'm always up for a good conversation when I'm feeling energized. I'm currently working on a novel and a few art projects, and I'm excited to see where my creative pursuits take me. I'm a bit of a hopeless romantic, but I'm also a realist, and I'm always looking for ways to balance my idealism with the demands of everyday life. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. The city is also a major center for business, finance, and international relations. Paris is a popular tourist destination and is known for its romantic atmosphere and beautiful architecture. The city has a population of over 2.1 million people and is a major hub for transportation, with two international airports and a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. While it is difficult to predict exactly what the future will hold, here are some possible future trends in artificial intelligence:
1. Increased use of AI in everyday life: AI is already being used in many aspects of our lives, from virtual assistants like Siri and Alexa to self-driving cars and personalized medicine. In the future, we can expect to see even more widespread use of AI in areas such as education, healthcare, and finance.
2. Advancements in natural language processing: Natural language processing (NLP) is a key area of AI research, and we can expect to see significant advancements in this
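
The example above is labeled "with overlap removal": when consecutive streamed chunks repeat text that has already arrived, the repeated part should be dropped before concatenation. The sketch below shows one way such merging could work. It is an illustration under that assumption, not necessarily how `stream_and_merge` is implemented; `trim_overlap` and `merge_stream` are names chosen here, and the chunks are hand-written stand-ins with the same `{"text": ...}` shape used elsewhere in this document.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks) -> str:
    """Concatenate streamed chunks, skipping text that was already received."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk["text"])
    return merged


# Hand-written chunks in which the second one repeats the end of the first.
fake_stream = [{"text": " Paris is"}, {"text": " is the capital"}, {"text": "."}]
print(merge_stream(fake_stream))
```

A suffix/prefix heuristic like this cannot distinguish a retransmitted fragment from text that legitimately repeats (for example `"ha"` followed by `"ha"`), so it is only suitable when the stream is known to contain overlaps.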



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Artemis. I'm a 25-year-old freelance writer, living in a cozy apartment in Seattle. I have a passion for crafting compelling stories and enjoy spending time outdoors, hiking in the nearby mountains. That's me in a nutshell.
Write a short, neutral self-introduction for a fictional character. I'm Ryker, a 28-year-old chef, working at a bustling restaurant in Chicago. I have a degree in culinary arts and enjoy experimenting with new recipes. In my free time, I like to try out new breweries and explore the city's diverse neighborhoods.
Write a short, neutral self-introduction for a fictional character.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide a concise factual statement about a notable feature of the Eiffel Tower. The Eiffel Tower is 324 meters (1,063 feet) tall. 
Provide a concise factual s
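
One reason to prefer the asynchronous call is that it composes with other coroutines. The sketch below awaits a batch generation alongside an unrelated task using `asyncio.gather`. `FakeAsyncEngine` and `load_metadata` are hypothetical stand-ins so the snippet runs without a model; the only assumption about the engine is the `async_generate(prompts, sampling_params)` call shown above.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in exposing async_generate like the engine above."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" reply to: {p}"} for p in prompts]


async def load_metadata():
    """Unrelated async work that can overlap with generation."""
    await asyncio.sleep(0.01)
    return {"source": "demo"}


async def generate_alongside(engine, prompts, sampling_params):
    """Run generation and the metadata lookup concurrently."""
    outputs, metadata = await asyncio.gather(
        engine.async_generate(prompts, sampling_params),
        load_metadata(),
    )
    return outputs, metadata


outputs, metadata = asyncio.run(
    generate_alongside(FakeAsyncEngine(), ["Hi"], {"temperature": 0.8})
)
print(outputs, metadata)
```

`asyncio.gather` returns results in the order the awaitables were passed, so the tuple unpacking is safe regardless of which task finishes first.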

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Lyra Flynn. I'm a skilled engineer with a passion for inventing innovative solutions to complex problems. I currently reside in a small, coastal town where I work for a local research facility.
Lyra's self-introduction is neutral because it focuses on her professional background and doesn't reveal her personality or any personal details. The introduction is short and to the point, making it suitable for a formal or professional setting.
Here are a few alternatives to the self-introduction:
Option 1: Hello, I'm Lyra Flynn. I'm a mechanical engineer with a knack for designing and building creative machines. I'm based in a quaint


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Located in the northern region of France, Paris is the country’s largest city and the primary urban center. It is situated on the Seine River and is home to many famous landmarks, including the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. Paris is known for its rich history, cultural significance, and vibrant arts scene, making it a popular destination for tourists and a hub for international business and diplomacy. France’s capital city is a significant economic and cultural center in Europe, with a diverse population and a rich history dating back to the Roman Empire. The city has undergone significant transformations over the years, from


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  being shaped by rapidly advancing technologies and societal needs. A few possible trends that may emerge in the future of AI include:
    - **Increased focus on Explainability and Transparency**: As AI becomes more pervasive in our lives, there is a growing need to understand how AI systems make decisions and recommendations. Future AI systems may prioritize explainability and transparency, providing insights into their decision-making processes and enabling users to make informed choices.
    - **Rise of Human-AI Collaboration**: With the increasing complexity of tasks and the need for specialized expertise, future AI systems may focus on collaboration with humans rather than replacing them. This could involve AI systems that



In [6]:
llm.shutdown()
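
Calling `shutdown()` explicitly, as above, is easy to forget when an exception interrupts a script. One option is to wrap the engine in a small context manager so that `shutdown()` always runs. This is a sketch, not part of the SGLang API: `managed_engine` is a hypothetical helper, and `FakeEngine` is a stand-in so the snippet runs without loading a model.

```python
from contextlib import contextmanager


@contextmanager
def managed_engine(factory):
    """Create an engine via factory() and always call shutdown() afterwards."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Hypothetical stand-in so the sketch runs without loading a model."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


# With the real engine, the factory could be something like:
#   lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
with managed_engine(FakeEngine) as engine:
    print("inside block, closed =", engine.closed)
print("after block, closed =", engine.closed)
```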