# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Sarah. I am a 28-year-old Registered Nurse (RN) and I have been working in healthcare for over 6 years. I have always been passionate about helping others and making a difference in people's lives.
I am currently working as a nurse in a busy hospital, where I have gained a wealth of experience in various areas of patient care. I have worked on medical-surgical units, ICU, and even done some travel nursing assignments to different parts of the country.
As a nurse, I have seen firsthand the impact that healthcare can have on people's lives. I have watched patients and their families go through some of the most
Prompt: The president of the United States is
Generated text:  the head of the executive branch of the federal government. The president is elected by the people through the Electoral College. The president serves a four-year term and is limited to two terms.
The president's powers and duties are outlined in Article II of the Constitution.
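
For offline batch jobs it is often convenient to store each prompt together with its completion. The sketch below relies only on what the cell above shows, namely that each output is a dict with a `"text"` key and that outputs come back in prompt order. The `to_records` helper and the sample data are hypothetical and stand in for a real `llm.generate` result, so the snippet runs without a model.

```python
import json


def to_records(prompts, outputs):
    """Pair each prompt with its generated text (assumes outputs[i]["text"] exists)."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    return [{"prompt": p, "completion": o["text"]} for p, o in zip(prompts, outputs)]


# Hypothetical stand-in for the list returned by llm.generate(prompts, sampling_params).
sample_prompts = ["The capital of France is"]
sample_outputs = [{"text": " Paris."}]

records = to_records(sample_prompts, sample_outputs)

# One JSON object per line (JSONL) is easy to append to and to stream back later.
jsonl = "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
print(jsonl)
```

With a real engine, `sample_prompts` and `sample_outputs` would be replaced by the `prompts` and `outputs` variables from the cell above.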

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student. I live in a small town in the Midwest with my family. I enjoy reading and playing guitar in my free time. I'm a bit of a introvert, but I'm working on being more outgoing. That's me in a nutshell. I'm looking forward to getting to know you better.
This is a good example of a neutral self-introduction because it doesn't reveal too much about Kaida's personality, interests, or motivations. It simply provides a brief overview of who she is and what she's like. This can be helpful for a character introduction because

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. The city is known for its iconic landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. Paris is also famous for its fashion, cuisine, and romantic atmosphere. The city has a rich history dating back to the Middle Ages and has been a major cultural and intellectual center for centuries. Today, Paris is a global hub for business, tourism, and culture, attracting millions of visitors each year. The city is also home to many world-renowned universities and research institutions, making it a hub for innovation

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is likely to play a larger role in healthcare, with applications in medical diagnosis, personalized medicine, and patient care.
2. Advancements in natural language processing: AI systems will become more proficient in understanding and generating human language, enabling more effective communication between humans and machines.
3. Expansion of computer vision: AI systems will become more adept at interpreting and understanding visual data, enabling applications such as self-driving cars and smart homes.
4. Increased use of AI
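
The cell above mentions "overlap removal" when merging streamed chunks. As a rough illustration of the idea, the sketch below drops the longest prefix of each new chunk that already appears at the end of the accumulated text. The functions `trim_overlap` and `merge_chunks` and the sample chunks are written for this example only; they are an assumption about what overlap removal can look like, not a statement of how `stream_and_merge` is implemented.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that existing_text already ends with."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first so the maximal repeat is removed.
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks whose boundaries repeat a few characters.
chunks = [" Paris is", " is the capital", " capital of France."]
print(merge_chunks(chunks))
```

One caveat of this approach is that a legitimate repetition at a chunk boundary (for example "ha" followed by "ha") is indistinguishable from an overlap and would also be trimmed.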



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Lena Douglas. I'm a 25-year-old freelance writer and part-time barista. I've been working on a novel in my spare time, and I enjoy hiking and reading in my free moments. That's me in a nutshell. How would you rate this self-introduction? Here are the five criteria to use for your rating:
1. Clarity: Does the introduction clearly communicate the main points about the character?
2. Relevance: Does the self-introduction include information that is relevant to the story or context?
3. Interest: Does the self-introduction engage the reader's interest or spark curiosity?
4. Style:

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is located in the Île-de-France region of northern central France. The city is famous for its beauty, history, art, fashion, cuisine, and landmarks such as the Eiffel Tower, Notre 
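
Asynchronous generation is mainly useful when requests arrive independently or when generation has to share an event loop with other work. The sketch below illustrates that pattern with a hypothetical `fake_async_generate` coroutine standing in for `llm.async_generate`, so it runs without a model. `asyncio.gather` and `asyncio.run` are standard-library calls; everything else is an assumption made for illustration.

```python
import asyncio


async def fake_async_generate(prompt: str) -> dict:
    """Hypothetical stand-in for `await llm.async_generate(prompt, sampling_params)`."""
    await asyncio.sleep(0.01)  # simulate waiting on the engine
    return {"text": f" <completion for {prompt!r}>"}


async def generate_all(prompts):
    """Issue one request per prompt and wait for all of them concurrently."""
    # gather returns results in the order of its arguments,
    # so results[i] belongs to prompts[i].
    return await asyncio.gather(*(fake_async_generate(p) for p in prompts))


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
results = asyncio.run(generate_all(prompts))

for prompt, result in zip(prompts, results):
    print(f"Prompt: {prompt}\nGenerated text: {result['text']}")
```

When all prompts are known up front, passing the whole list in a single call, as the cell above does, is the simpler option.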

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through async_stream_and_merge rather than calling async_generate
        # directly, so text that overlaps between chunks is not printed twice
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Lexa Elwes. I'm a skilled weaver and seamstress, specializing in tapestries and textiles for homes and businesses. I live and work in a small, rural town surrounded by rolling hills and farmland. That's me in a nutshell.
Lexa Elwes is a 30-year-old woman who runs her own business, Elwes Weaving. She has a degree in textile arts and has worked in several different weaving and sewing shops before deciding to strike out on her own. Lexa is known for her attention to detail and her ability to bring customers' visions to life. She is a hard worker

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the northern part of the country in the Île-de-France region. Paris is known for its iconic landmarks such as the Eiffel Tower and Notre-Dame Cathedral, and is home to many museums and cultural institutions.
Provide a concise factual statement about France’s second-largest city. Marseille is France’s second-largest city, located in the Provence-Alpes-Côte d’Azur region in the south of the country. It is a major port city and a cultural center, known for its rich history, beautiful beaches, and vibrant nightlife.
Provide a concise factual statement about France’s largest city. Paris is France

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  predicted to be exciting and potentially transformative. Here are some possible future trends in artificial intelligence:
1. Increased Adoption in Everyday Life: AI will become more ubiquitous and integrated into daily life, from smart homes and personal assistants to AI-powered healthcare and education.
2. Advancements in Natural Language Processing (NLP): NLP will continue to improve, enabling humans to interact with AI systems more naturally and effectively.
3. Rise of Explainable AI (XAI): As AI becomes more pervasive, there will be a growing need to understand how AI systems make decisions and recommendations. XAI will help address this need.
4. Development of Autonomous Systems

In [6]:
llm.shutdown()
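
If generation raises an exception, the final `llm.shutdown()` cell may never run. One way to guard against that is a small context manager that always calls `shutdown()` on exit. The sketch below is an assumption about how this could be wrapped, not part of the SGLang API: `shutting_down` is a hypothetical helper, and `DummyEngine` is a stand-in for `sgl.Engine` so the example runs without a model.

```python
from contextlib import contextmanager


@contextmanager
def shutting_down(engine):
    """Yield `engine`, then call its shutdown() even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


class DummyEngine:
    """Hypothetical stand-in for sgl.Engine; only records whether shutdown ran."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


dummy = DummyEngine()
try:
    with shutting_down(dummy) as engine:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(dummy.is_shut_down)
```

With a real engine, the same helper would wrap the object returned by `sgl.Engine(...)` in place of `DummyEngine()`.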