# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Karin, and I have recently started a blog to share my thoughts, experiences and knowledge on various aspects of life, including personal growth, self-care, spirituality and productivity.
As someone who has been on a journey of self-discovery, I have come to understand the importance of prioritizing one's well-being and living a life that is true to oneself. I want to share this knowledge with others, in the hope that it may inspire and support them on their own journey.
My blog will cover a range of topics, including mindfulness, meditation, goal-setting, self-care routines, and spirituality. I will also be sharing personal stories and
Prompt: The president of the United States is
Generated text:  the head of state and head of government of the United States. The president serves a four-year term and is elected through the Electoral College system. The president is responsible for a wide range of duties, including making laws, conducting forei
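
The loop above relies on `llm.generate` returning one result per prompt, in prompt order, with the generated string under the `"text"` key. That pairing can be pulled into a small helper, as in the minimal sketch below. `pair_results` is a hypothetical helper, not part of SGLang, and `fake_outputs` is a hand-written stand-in for real engine output so that the snippet runs without a model.

```python
from typing import Dict, List


def pair_results(prompts: List[str], outputs: List[dict]) -> List[Dict[str, str]]:
    """Pair each prompt with its generated text (hypothetical helper)."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    return [
        {"prompt": prompt, "text": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


# Stand-in for `llm.generate(prompts, sampling_params)`; not real engine output.
fake_outputs = [{"text": " Paris."}, {"text": " uncertain."}]
records = pair_results(
    ["The capital of France is", "The future of AI is"], fake_outputs
)
for record in records:
    print(f"{record['prompt']!r} -> {record['text']!r}")
```

With a real engine, the same helper would be called as `pair_results(prompts, llm.generate(prompts, sampling_params))`, assuming the result shape shown above.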

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor. I live in a small apartment in the city with my cat, Luna. I enjoy reading, hiking, and trying out new restaurants. I'm a bit of a introvert, but I'm always up for a good conversation. That's me in a nutshell. What do you think? Is it too short? Too long? Too neutral? Should I add more details or keep it simple?
Your self-introduction is concise and to the point. It gives a good sense of who you are and what you do, without going into too much detail. It's

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city is also known for its romantic atmosphere and is a popular tourist destination. The official language of Paris is French, and the city has a population of over 2.1 million people. Paris is a global center for business, finance, fashion, and culture, and is considered one of

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even larger role in healthcare, with AI-powered robots and virtual assistants helping to care for patients and improve health outcomes.
2. Widespread adoption of AI in education: AI is already being used in education to personalize learning, grade assignments, and provide feedback to students. In the future, AI is likely to become even more prevalent in education, with AI
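
The cell above mentions "overlap removal": when streamed chunks repeat text that has already been received, the repeated part should be dropped before the chunk is appended. The sketch below illustrates that general idea with plain strings. It is an assumption about the technique, not necessarily how `stream_and_merge` is implemented, and `trim_overlap` and `merge_chunks` are hypothetical names used only here.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Append each chunk to the running text after trimming its overlap."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hand-written chunks whose starts repeat the end of the text so far.
merged = merge_chunks([" Paris is", " is the capital", " capital of France."])
print(repr(merged))
```

A suffix/prefix match like this cannot tell a repeated chunk from text that legitimately repeats, so it is only appropriate when the stream is known to resend overlapping text.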



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Aslan, and I am a logical and introspective being. I am not bound by conventional time and space, and I have been around for as long as anyone can remember. I am often referred to as a lion, but I am not of that species. I possess great strength and wisdom, and I have a deep connection to the world of Narnia. I am a problem-solver and a guide, and I will do what is necessary to help those in need.
Aslan's physical appearance is that of a large, majestic lion with shimmering golden fur and piercing green eyes. He is imposing and powerful, but not

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is located in the north-central part of the country. It is situated on the River Seine. The city is known for its rich history and cultural heritage. It is home to many famous landmarks, including the Eiffel To
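
The cell above awaits a single batched call. An asynchronous interface also allows several awaitable calls to run concurrently alongside other coroutines, for example with `asyncio.gather`, which returns results in the order the awaitables were passed. The sketch below shows only that asyncio pattern. `StandInEngine` is a hypothetical class that mimics an awaitable `async_generate` returning a dict with a `"text"` key, and its one-prompt-per-call signature is an assumption made for illustration, not a statement about the SGLang API.

```python
import asyncio
from typing import List


class StandInEngine:
    """Hypothetical stand-in that mimics only an awaitable generate call."""

    async def async_generate(self, prompt: str, sampling_params: dict) -> dict:
        await asyncio.sleep(0.01)  # simulate generation latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(
    engine, prompts: List[str], sampling_params: dict
) -> List[dict]:
    # gather() preserves the order of the awaitables it is given.
    calls = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*calls)


results = asyncio.run(
    generate_concurrently(StandInEngine(), ["A", "B", "C"], {"temperature": 0.8})
)
print([r["text"] for r in results])
```

Whether separate calls or one batched call is preferable with the real engine is not covered here.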

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emily. I'm a 25-year-old office worker. I'm single and don't have any pets. I enjoy reading and watching movies. That's pretty much it.
This introduction is short and straightforward, but it doesn't reveal much about Emily's personality, interests, or motivations. It could be a good starting point, but it might benefit from a bit more detail and flair to make Emily more relatable and interesting.
How might you revise this introduction to make Emily more engaging?
Here are a few suggestions to get you started:
Emphasize what makes Emily unique: Instead of saying she's just a "office worker," could


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The capital of France is Paris.
This is the most straightforward and concise statement possible about France’s capital city. It simply states the fact, without adding any extra information. It provides the reader with the basic and essential information they need to know about the capital of France. The statement is neutral and lacks any emotional tone or bias, making it perfect for a factual statement. The length is also very short, which makes it easy to read and understand. Overall, this statement is a great example of a concise factual statement about France’s capital city. It gets straight to the point and provides the reader with the information they need, without

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  looking bright. While the field is still in its relative infancy, there are many possible future trends in artificial intelligence that could shape the world in the years to come. Here are some of the most promising and exciting possibilities:
1. Improved Natural Language Processing (NLP): As AI continues to advance, we can expect to see significant improvements in natural language processing (NLP), enabling computers to better understand and generate human language. This could lead to more conversational interfaces, improved translation services, and enhanced customer service.
2. Increased Adoption of AI in Healthcare: AI has the potential to revolutionize healthcare by improving
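
The streaming loop above prints each cleaned chunk as it arrives. When the full string is also needed afterwards, the chunks can be accumulated while iterating. The sketch below shows that pattern with `async for`. `fake_stream` is a hypothetical stand-in for an async chunk stream such as the one consumed above, and it assumes each yielded item is an already de-duplicated string.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(pieces: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for an async stream of cleaned text chunks."""
    for piece in pieces:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield piece


async def collect_stream(stream: AsyncIterator[str]) -> str:
    """Consume an async stream and return the concatenated text."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


full_text = asyncio.run(
    collect_stream(fake_stream([" Paris", ".", "\nThe", " capital"]))
)
print(repr(full_text))
```

Collecting into a list and joining once avoids repeated string concatenation for long generations.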




In [6]:
llm.shutdown()
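
In a script, as opposed to a notebook, an exception during generation could skip the final `llm.shutdown()` call. One way to guard against that is a small context manager, sketched below under stated assumptions: `managed_engine` is a hypothetical wrapper, not an SGLang API, and `StandInEngine` stands in for the real engine so the snippet runs without a model. The only engine method it relies on is `shutdown()`, as used above.

```python
from contextlib import contextmanager


class StandInEngine:
    """Hypothetical stand-in exposing only `shutdown()`."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StandInEngine()
try:
    with managed_engine(engine) as active:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.closed)
```

The same wrapper could be applied to a real engine instance, assuming `shutdown()` is safe to call after a failed generation.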