# Offline Engine API

SGLang provides a direct inference engine that does not require an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Samantha, and I am a passionate artist and writer. I've always been fascinated by the world of art and storytelling, and I've spent years honing my skills in both areas.
As a visual artist, I've explored a range of mediums, from painting and drawing to photography and mixed media. My work is often inspired by the natural world, and I enjoy experimenting with different techniques to capture the beauty and wonder of the world around us.
As a writer, I've always been drawn to fantasy and science fiction. I love getting lost in the imaginative worlds and characters that these genres have to offer, and I've spent years honing
Prompt: The president of the United States is
Generated text:  the head of state and head of government of the United States, and is the highest-ranking official in the federal government. The president is responsible for executing the laws of the land and serving as the commander-in-chief of the armed forces.
The president is
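
The completions above stop mid-sentence, which suggests a cap on generated tokens. The sketch below builds a sampling-parameter dictionary with an explicit length limit. `make_sampling_params` is a hypothetical helper, not part of SGLang, and the `max_new_tokens` key is an assumption about the engine's sampling parameters that you should check against your installed version.

```python
def make_sampling_params(temperature=0.8, top_p=0.95, max_new_tokens=None):
    """Build a sampling-parameter dict and validate the common ranges.

    Hypothetical helper for illustration; it is not provided by SGLang.
    """
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in the interval (0, 1]")

    params = {"temperature": temperature, "top_p": top_p}
    if max_new_tokens is not None:
        if max_new_tokens <= 0:
            raise ValueError("max_new_tokens must be positive")
        # Assumption: the engine reads a `max_new_tokens` key to cap output length.
        params["max_new_tokens"] = max_new_tokens
    return params


sampling_params = make_sampling_params(max_new_tokens=64)
print(sampling_params)
```

The resulting dictionary can be passed as the second argument to `llm.generate(prompts, sampling_params)` in the same way as the literal dictionary used in the cell above.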

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and artist living in a small town in the Pacific Northwest. I enjoy hiking, reading, and trying out new recipes in my spare time. I'm a bit of a introvert, but I'm always up for a good conversation. I'm currently working on a novel and a few art projects that I'm excited to share with the world someday. That's me in a nutshell! What do you think? Is there anything you'd like to add or change?
I think your self-introduction is great! It's concise, informative, and gives a good sense of who K

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is the most populous city in France and is located in the northern part of the country. It is situated on the Seine River and is known for its iconic landmarks such as the Eiffel Tower and Notre Dame Cathedral. Paris is a major cultural and economic center and is home to many museums, art galleries, and historical sites. The city has a rich history dating back to the Roman era and has been a major hub of artistic and intellectual activity throughout the centuries. Today, Paris is a popular tourist destination and a center of fashion, cuisine, and entertainment. The city is also home to many international organizations and institutions,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Some experts predict that AI will become increasingly integrated into our daily lives, while others warn of the potential risks and challenges associated with its development. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with the potential to revolutionize the way we diagnose and treat diseases.
2. Rise of autonomous vehicles: Autonomous vehicles are already being tested on public roads, and it's likely
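
The banner in this cell mentions overlap removal. As a rough illustration of that idea, the sketch below merges streamed chunks while trimming any prefix of a new chunk that repeats the tail of the text already collected. The function names and the sample chunks are illustrative assumptions; the actual `stream_and_merge` helper in `sglang.utils` may be implemented differently.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return `new_chunk` without the prefix that repeats the tail of `existing`."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, dropping text that overlaps with what was already merged."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks whose boundaries overlap.
chunks = ["The capital", " capital of France", " France is Paris."]
print(merge_chunks(chunks))
```

One caveat of this approach: text that legitimately repeats across a chunk boundary would also be trimmed, so it only fits streams whose chunks are known to overlap.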



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Dr. Zara Saeed, and I'm a pediatric cardiologist specializing in congenital heart defects. I've worked at Children's Hospital for the past decade, where I've had the privilege of helping countless young patients and their families navigate life-altering medical diagnoses. I'm excited to share my experiences and perspectives with you. What's your story?
This text is a neutral self-introduction for a fictional character, Dr. Zara Saeed. It highlights her professional background and experience as a pediatric cardiologist, without expressing personal opinions or biases. The text aims to introduce Dr. Saeed in a neutral and professional

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The capital of France is Paris, located in the north-central part of the country, along the Seine River.
The city of Paris, k
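
The benefit of the asynchronous call is that other coroutines can make progress while generation is awaited. The sketch below shows that pattern with `asyncio.gather`. `FakeEngine` is a stand-in written for this example so that it runs without a model; only the `async_generate(prompts, sampling_params)` call shape and the `text` field are taken from the cell above.

```python
import asyncio


class FakeEngine:
    """Stand-in for the real engine so the sketch runs without a GPU or model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks):
    """A second coroutine that keeps running while generation is awaited."""
    for _ in range(3):
        ticks.append("tick")
        await asyncio.sleep(0.001)


async def run_batch(engine, prompts, sampling_params):
    ticks = []
    # Both awaitables run concurrently on the same event loop.
    outputs, _ = await asyncio.gather(
        engine.async_generate(prompts, sampling_params),
        heartbeat(ticks),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(
    run_batch(FakeEngine(), ["Hello", "Bonjour"], {"temperature": 0.8, "top_p": 0.95})
)
for output in outputs:
    print(output["text"])
print(ticks)
```

With a real engine, `FakeEngine()` would be replaced by the `llm` object created earlier; the rest of the structure is unchanged under that assumption.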

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Atlas Blackwood, and I'm a 22-year-old student at a local college. I've been studying computer science for the past two years. I enjoy reading science fiction novels and listening to electronic dance music. What is the best way to write a neutral self-introduction for a character?
The best way to write a neutral self-introduction for a character is to focus on the character's basic information and avoid making judgments or assumptions about them. Here are some tips to help you write a neutral self-introduction for a character:
1. Stick to the facts: Focus on the character's basic information such as their name, age,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide a concise statement about the founding of Paris. The origins of Paris date back to the 3rd century BC, when it was a small Celtic settlement.
Provide a concise statement about the history of Paris. The city of Paris has a rich and complex history, having been ruled by the Romans, the Franks, and the Capetian dynasty, among others, and has undergone numerous periods of growth and decline throughout the centuries.
Provide a concise statement about the cultural significance of Paris. Paris is renowned for its cultural and artistic heritage, being the birthplace of the French Renaissance, the epicenter of Impressionism, and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain, but several trends are likely to emerge in the next decade.
Several possible future trends in artificial intelligence include:
1. Increased adoption in various industries: AI is expected to become a standard tool in many industries, including healthcare, finance, transportation, and education.
2. Advancements in natural language processing: AI systems will become more capable of understanding and generating human-like language, leading to improved customer service and more sophisticated chatbots.
3. Rise of Explainable AI: As AI becomes more pervasive, there will be a growing need to understand how AI systems make decisions, leading to the development of Explainable AI (XAI) techniques
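
When consuming a stream directly, the caller has to decide which part of each chunk is new. The sketch below prints only the new suffix of each chunk, assuming every chunk carries the cumulative text generated so far in a `text` field. That chunk format is an assumption and may not match your SGLang version, and `fake_stream` is a stand-in that replaces a real streaming call.

```python
import asyncio


async def fake_stream(full_text, step=5):
    """Stand-in for a streaming call: yields cumulative text in a `text` field."""
    for end in range(step, len(full_text) + step, step):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield {"text": full_text[:end]}


async def collect_deltas(stream):
    """Print and return only the newly generated part of each chunk."""
    deltas = []
    offset = 0  # number of characters already printed
    async for chunk in stream:
        text = chunk["text"]
        delta = text[offset:]
        offset = len(text)
        if delta:
            deltas.append(delta)
            print(delta, end="", flush=True)
    print()  # newline after the stream ends
    return deltas


deltas = asyncio.run(collect_deltas(fake_stream(" Paris is the capital of France.")))
```

Tracking an offset avoids string comparisons entirely, but it is only correct when chunks are strictly cumulative; if chunks are independent fragments, they can be printed as they arrive.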




In [6]:
llm.shutdown()
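
In a script, an exception raised during generation would skip the explicit `llm.shutdown()` call. A minimal sketch of one way to guard against that is a context manager that always calls `shutdown()` on exit. `managed_engine` is a hypothetical helper, and `FakeEngine` is a stand-in so the example runs without a model.

```python
from contextlib import contextmanager


@contextmanager
def managed_engine(factory):
    """Create an engine via `factory()` and always call its `shutdown()` on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Stand-in exposing the same `generate` and `shutdown` calls used above."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.closed = True


with managed_engine(FakeEngine) as llm:
    outputs = llm.generate(["Hello, my name is"], {"temperature": 0.8})

print(llm.closed)
```

With the real engine, the factory could be `lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")`, reusing the constructor call from the first cell.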