# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Stephane Fosse, I'm a passionate photographer and digital artist. I'm specialized in creating realistic and detailed CGI artworks. I'm based in Paris, France, and I'm always ready to collaborate with potential clients or partners.
I have a strong background in computer graphics and I'm well-versed in 3D modeling, texturing, and rendering. My expertise includes creating realistic images of buildings, landscapes, and product visualizations. I'm also skilled in post-processing and compositing.
I'm a certified graphic designer and a skilled 3D artist with a strong portfolio that showcases my skills. I've worked on various
Prompt: The president of the United States is
Generated text:  the head of state and government of the United States. The president serves a four-year term and is elected by the American people through the Electoral College. The president is also the commander-in-chief of the United States Armed Forces and has the power to veto l
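
The `sampling_params` dictionary above sets `temperature` and `top_p`. As a rough illustration of what these two settings usually mean, the sketch below applies temperature scaling and nucleus (top-p) filtering to a toy distribution in plain Python. It is not SGLang's sampler: the function names `apply_temperature` and `top_p_filter` and the toy logits are assumptions made only for this example.

```python
import math


def apply_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize over the surviving tokens; all others get probability 0.
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


# Toy logits for a four-token vocabulary (hypothetical values).
toy_logits = [2.0, 1.0, 0.5, -1.0]
probs = apply_temperature(toy_logits, temperature=0.8)
nucleus = top_p_filter(probs, top_p=0.95)
print([round(p, 3) for p in nucleus])
```

In this sketch, a lower temperature concentrates probability on the most likely token, and a lower `top_p` discards more of the unlikely tail before sampling.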

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy reading, hiking, and trying new foods. I'm currently working on a novel and experimenting with different writing styles. I'm looking forward to meeting new people and learning from their experiences. I'm a bit of a introvert, but I'm always up for a good conversation. I'm interested in hearing about your interests and hobbies. What brings you here today?
This is a good example of a neutral self-introduction. It provides some basic information about the character, such as their name, age, and occupation, but it also gives a sense

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city. The capital of France is Paris.
The capital of France is Paris. This is a concise factual statement about France’s capital city. It is a simple and direct statement that provides accurate information. The statement is also neutral and does not include any opinion or bias. It is a good example of a factual statement that can be used in a variety of contexts, such as in a history textbook, a travel guide, or a general knowledge quiz. The statement is also easy to understand and remember, making it a useful piece of information for anyone looking to learn about France’s capital city

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with the potential to revolutionize the way we diagnose and treat diseases.
2. Widespread adoption of AI in industries: AI is already being used in various industries such as finance, transportation, and customer service. In the future, AI is likely to become even more widespread, with many industries adopting AI-powered solutions
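
The banner printed by the cell above mentions overlap removal. One way such merging could work is to drop the longest prefix of each new chunk that repeats the end of the text collected so far. The sketch below shows that idea on hand-written chunks. `trim_overlap`, `merge_chunks`, and the sample chunks are illustrative assumptions and are not claimed to match the internals of `stream_and_merge`.

```python
def trim_overlap(existing_text, new_chunk):
    """Return new_chunk without the prefix that already ends existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, skipping text repeated at chunk borders."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks where each one repeats the tail of the previous one.
sample_chunks = [" Paris is", " is the capital", " capital of France."]
print(merge_chunks(sample_chunks))
```

This kind of trimming is a heuristic: a chunk that legitimately begins with text identical to the end of the previous output would also be shortened.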



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaida. I am a 25-year-old travel writer. I have traveled to over 20 countries and have written for several travel publications. I am currently working on a memoir about my travels. When I am not writing, I can be found trying new foods, practicing yoga, or taking long walks in the city. I enjoy meeting new people and hearing their stories. I am open-minded and curious, always looking for the next adventure. This is a good start, but you might want to consider a few things to make it more engaging. Here are some suggestions: Use more descriptive language to paint a picture of Kaida in the

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Located on the Seine River in northern France, Paris is a major economic, cultural, and tourist centre. The city is famous for its iconic landmarks, such as the Eiffel To
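
A common reason to prefer an asynchronous call is that other coroutines can make progress while generation is pending. The sketch below illustrates this with a stand-in engine so that it runs without a GPU or model weights. `FakeEngine` is a hypothetical stub whose `async_generate` only mimics the call shape used above (a list of prompts in, a list of dicts with a `text` key out), and `heartbeat` is an invented background task.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine; it only mimics the call shape."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.05)  # pretend the model is busy generating
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks, log):
    """Unrelated coroutine that keeps running while generation is awaited."""
    for i in range(ticks):
        log.append(f"tick {i}")
        await asyncio.sleep(0.01)


async def run_batch(engine, prompts, sampling_params):
    log = []
    # Both coroutines share one event loop; gather returns results in order.
    outputs, _ = await asyncio.gather(
        engine.async_generate(prompts, sampling_params),
        heartbeat(3, log),
    )
    return outputs, log


demo_prompts = ["The capital of France is", "The future of AI is"]
outputs, log = asyncio.run(
    run_batch(FakeEngine(), demo_prompts, {"temperature": 0.8, "top_p": 0.95})
)
for prompt, output in zip(demo_prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
print(log)
```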

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # async_stream_and_merge yields each chunk with overlapping text already removed
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Keiko Yamato. I'm a freelance graphic designer living in Tokyo. I'm 25 years old and I'm originally from a small town in the countryside. I enjoy working with clients to create unique visual identities and brand designs that capture their unique personalities. When I'm not working, I like to try out new cafes and restaurants around the city, or visit my family's farm to help with the harvest.
Keiko Yamato is a 25-year-old freelance graphic designer living in Tokyo. Originally from a small town in the countryside, she enjoys working with clients to create unique visual identities and brand designs that capture their unique personalities.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The city is home to over 2.1 million people and is a significant economic, political, and cultural center in the European Union. The city has been the capital of France for over a century and has a rich history, including being a major center of art and culture during the Renaissance and Enlightenment periods. Paris is also known for its beautiful architecture, fashion, and cuisine.
The final answer is: Paris. ...Read more Read less

What is the capital of France?
The capital of France is Paris. ...Read more Read less

What is the population of the capital of France?
The city has over 2.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  being shaped by a combination of technological advancements, societal needs, and regulatory frameworks. Here are some possible future trends in artificial intelligence:
1. Increased use of Explainable AI (XAI): As AI becomes more ubiquitous, there is a growing need to understand how AI systems make decisions. XAI is a subset of AI that focuses on creating AI systems that can explain their reasoning and decision-making processes. This trend is likely to continue as organizations seek to build trust with their stakeholders and ensure that AI systems are fair and transparent.
2. Rise of Edge AI: Edge AI refers to the deployment of AI algorithms and models at the edge of the
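
When streaming asynchronously, it is often useful to print chunks as they arrive and still keep the full text afterwards. The sketch below shows that pattern with `async for`. `fake_stream` is a hypothetical async generator standing in for `async_stream_and_merge(llm, prompt, sampling_params)`, and its chunks are made up, so the example runs without a model.

```python
import asyncio


async def fake_stream(prompt):
    """Hypothetical stand-in that yields already-cleaned text chunks."""
    for chunk in [" Paris", ".", " It", " lies", " on", " the", " Seine", "."]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield chunk


async def stream_and_collect(prompt):
    """Print chunks as they arrive and return the concatenated text."""
    pieces = []
    async for chunk in fake_stream(prompt):
        print(chunk, end="", flush=True)
        pieces.append(chunk)
    print()
    return "".join(pieces)


full_text = asyncio.run(stream_and_collect("The capital of France is"))
```

Collecting the chunks in a list and joining once at the end avoids repeated string concatenation for long generations.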




In [6]:
llm.shutdown()