# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example written as a Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
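As a rough illustration of what building on the engine involves, the sketch below shows a minimal JSON request handler that turns a request body into a batch `generate` call. It is not taken from `custom_server.py`; `handle_generate_request` and `StubEngine` are hypothetical names, and the stub only mimics the `generate(prompts, sampling_params)` call and the `output["text"]` field used in the examples on this page, so the sketch runs without a GPU.

```python
import json
from typing import Any


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts: list[str], sampling_params: dict) -> list[dict]:
        # Mimics the batch examples on this page: one dict with a "text" key per prompt.
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


def handle_generate_request(engine: Any, body: str) -> str:
    """Parse a JSON request body, run batch generation, and return a JSON response body."""
    request = json.loads(body)
    prompts = request.get("prompts")
    if not isinstance(prompts, list) or not all(isinstance(p, str) for p in prompts):
        raise ValueError('"prompts" must be a list of strings')
    # Default to the sampling parameters used in the non-streaming example below.
    sampling_params = request.get("sampling_params", {"temperature": 0.8, "top_p": 0.95})
    outputs = engine.generate(prompts, sampling_params)
    return json.dumps({"texts": [output["text"] for output in outputs]})


if __name__ == "__main__":
    body = json.dumps({"prompts": ["The capital of France is"]})
    print(handle_generate_request(StubEngine(), body))
```

With a real engine you would construct it once at startup and pass it in place of `StubEngine()`; exposing the handler over HTTP is left to whichever web framework you choose.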

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
```

### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Pat and I'm a member of the Mystic Valley Chorale. I've been involved with the chorale for about 20 years, and I've sung with many other choruses in the Boston area. I'm an alto, and I love singing in harmony with the other voices. Music is a big part of my life, and I'm always looking for new and exciting ways to experience it. I'm also a bit of a theater buff, and I love attending musical theater productions and concerts.
When I'm not singing, I work as a librarian at a local elementary school. I love working with kids and helping them develop a
Prompt: The president of the United States is
Generated text:  a position that has been held by many influential individuals throughout history. From George Washington to Joe Biden, each president has brought their unique experiences and perspectives to the office. In this article, we will explore some of the most notable presidents in U.S. history, highlighting their accomplishments, challenges, and
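The `sampling_params` dictionary above sets `temperature` and `top_p`. As a reference for what these two knobs mean, the sketch below applies the textbook definitions to a toy logit table: temperature scaling followed by softmax, then nucleus (top-p) filtering that keeps the smallest set of most likely tokens whose probability mass reaches `top_p`. It illustrates the standard technique only and is not SGLang's sampling kernel; `top_p_filter` and `sample` are hypothetical names.

```python
import math
import random


def top_p_filter(logits: dict[str, float], temperature: float, top_p: float) -> dict[str, float]:
    """Return the renormalized distribution kept by temperature + nucleus sampling (temperature > 0)."""
    # Temperature scaling: divide logits by T, then softmax (subtract the max for stability).
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())
    exp = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exp.values())
    probs = sorted(((tok, e / total) for tok, e in exp.items()), key=lambda kv: kv[1], reverse=True)

    # Nucleus filtering: keep the most likely tokens until their cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    return {tok: p / mass for tok, p in kept}


def sample(logits: dict[str, float], temperature: float = 0.8, top_p: float = 0.95, rng=random) -> str:
    """Draw one token from the filtered distribution."""
    dist = top_p_filter(logits, temperature, top_p)
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]


if __name__ == "__main__":
    toy_logits = {"Paris": 5.0, "Lyon": 2.0, "Nice": 1.0, "Rome": -3.0}
    print(top_p_filter(toy_logits, temperature=0.8, top_p=0.95))
    print(sample(toy_logits))
```

Lower temperatures sharpen the distribution before filtering, so fewer tokens survive a given `top_p`; this is why the streaming example below, with `temperature` 0.2, tends to produce more conservative text.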

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor living in Tokyo. I enjoy reading, hiking, and trying new foods. I'm currently working on a novel and experimenting with different writing styles. I'm a bit of a introvert, but I'm always up for a good conversation. I'm looking forward to meeting new people and learning more about their experiences. What do you think? Is this a good self-introduction?
The introduction is clear and concise, and it provides a good sense of who Kaida is and what she's interested in. However, it's a bit too neutral and doesn't reveal much

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris.
The capital of France is Paris. The city is located in the northern part of the country and is situated on the Seine River. Paris is known for its rich history, cultural landmarks, and romantic atmosphere. The city is home to many famous museums, such as the Louvre and the Orsay, and is also famous for its fashion industry. Paris is a popular tourist destination and is often referred to as the "City of Light." The city has a population of over 2.1 million people and is a major hub for business, education, and culture in Europe. The official language

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is likely to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems may be able to analyze large amounts of medical data, identify patterns, and make predictions about patient outcomes.
2. Widespread adoption of AI in industries: AI is likely to become more prevalent in various industries, including finance, transportation, and education. AI-powered systems may be able to automate tasks, improve efficiency, and enhance decision
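`stream_and_merge` is imported from `sglang.utils`, and its source is not shown here. Assuming it works as the printed banner suggests ("overlap removal"), that is, by dropping any text a new chunk repeats from the end of what has been accumulated so far, the idea can be sketched in plain Python. The names `drop_repeated_prefix` and `merge_chunks` are hypothetical and may not match the helpers in `sglang.utils`.

```python
def drop_repeated_prefix(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that is already a suffix of `existing`."""
    # Try the longest possible overlap first, then shrink.
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Accumulate streamed chunks into one string, skipping repeated overlap."""
    merged = ""
    for chunk in chunks:
        merged += drop_repeated_prefix(merged, chunk)
    return merged


if __name__ == "__main__":
    # Cumulative chunks (each repeats all earlier text) and delta chunks merge the same way.
    print(repr(merge_chunks([" Paris", " Paris.", " Paris. It"])))
    print(repr(merge_chunks([" Paris", ".", " It"])))
```

Because the longest overlap is tried first, the same loop handles both cumulative and delta-style chunks. A heuristic like this can also swallow text that is legitimately repeated right at a chunk boundary, so treat it as a display convenience rather than a guarantee of exact output.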



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Jaelin Frost. I'm a 25-year-old freelance writer and artist living in Portland, Oregon. I enjoy spending time outdoors, practicing yoga, and exploring new coffee shops in the city. I'm a bit of a hopeless romantic, but I've learned to be content with my solo lifestyle. When I'm not working on my latest creative project, you can find me volunteering at a local animal shelter or practicing my photography skills on the city streets.
Use present tense to describe their daily life and personality traits.
Jaelin Frost is a 25-year-old freelance writer and artist living in Portland, Oregon. She is always on

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
How To Write A Thesis Statement For A Research Paper
Writing a thesis statement for a research paper involves several steps that help you to formulate a clea
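Because `async_generate` is awaited, requests can also be issued concurrently from one event loop instead of as a single batched call. The sketch below shows the pattern with `asyncio.gather`. `StubAsyncEngine` and `generate_concurrently` are hypothetical; the stub's `async_generate(prompt, sampling_params)` only sleeps and echoes, and it assumes a single-prompt call returns one dict with a `"text"` key, mirroring the one-dict-per-prompt results of the batched call above.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for an engine exposing `await async_generate(prompt, params)`."""

    async def async_generate(self, prompt: str, sampling_params: dict) -> dict:
        await asyncio.sleep(0.01)  # pretend to wait on the model
        return {"text": f" <completion for {prompt!r}>"}


async def generate_concurrently(engine, prompts: list[str], sampling_params: dict) -> list[dict]:
    # Start one request per prompt and wait for all of them; gather preserves input order.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    prompts = ["Hello, my name is", "The capital of France is"]
    outputs = asyncio.run(generate_concurrently(StubAsyncEngine(), prompts, {"temperature": 0.8}))
    for prompt, output in zip(prompts, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

The batched call in the cell above is simpler; per-prompt tasks are mainly useful when you want to act on each result as soon as it is ready, for example with `asyncio.as_completed`.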

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper so text repeated across chunks is printed only once
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Rina.
I'm a high school student living in a quiet suburban town. I'm currently in my junior year, and I spend most of my free time reading and writing short stories. I'm pretty laid-back and enjoy the simple things in life, like taking long walks in the nearby park or chatting with my friends at the local coffee shop. I'm not really into sports or loud gatherings, but I appreciate the company of good people and the comfort of a cozy night in. That's about it for now.
What do you think? Is this a good self-introduction for Rina? Does it give a good sense of who

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is the largest city in France and the most populous city in the European Union. It is located in the north-central part of the country, along the Seine River. Paris is known for its rich history, art museums, fashion, and cuisine. The city has a population of over 2.1 million people and is a major hub for business, tourism, and culture.
The following is a text describing the characteristics of the Eiffel Tower, a famous landmark in Paris.
The Eiffel Tower is a iconic iron lattice tower located in the heart of Paris, France. It was built for the 1889

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a highly debated and evolving topic. However, based on current developments and trends, here are some possible future trends in artificial intelligence:
1. AI will become ubiquitous: As AI technology improves, it's likely that AI will become ubiquitous, embedded in many aspects of daily life, from smart homes to self-driving cars.
2. Increased focus on Explainability and Transparency: As AI becomes more prevalent, there will be a growing need for AI systems to be explainable and transparent in their decision-making processes.
3. Emergence of Hybrid Intelligence: Hybrid intelligence combines human and artificial intelligence to create more effective and efficient solutions. This trend is likely to
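The same overlap-trimming idea carries over to the asynchronous case as an async generator that yields only the new text of each chunk. This is a sketch under the assumption that `async_stream_and_merge` behaves this way, not its actual source; `drop_repeated_prefix`, `merge_stream`, and `fake_stream` are hypothetical names, and `fake_stream` stands in for the engine's streaming output.

```python
import asyncio


def drop_repeated_prefix(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that is already a suffix of `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


async def merge_stream(chunks):
    """Async generator: yield only the non-overlapping part of each incoming chunk."""
    merged = ""
    async for chunk in chunks:
        cleaned = drop_repeated_prefix(merged, chunk)
        merged += cleaned
        if cleaned:  # skip chunks that added nothing new
            yield cleaned


async def fake_stream(pieces):
    # Hypothetical stand-in for a streaming generator returned by the engine.
    for piece in pieces:
        await asyncio.sleep(0)
        yield piece


async def main():
    async for text in merge_stream(fake_stream([" Rina", " Rina.", "\nI", "'m"])):
        print(text, end="", flush=True)
    print()


if __name__ == "__main__":
    asyncio.run(main())
```

Printing each yielded piece with `end=""` and `flush=True`, as the cell above does, shows text as soon as it arrives without duplicating already printed characters.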




```python
llm.shutdown()
```
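`llm.shutdown()` stops the engine once you are done with it. In a standalone script, it can help to guarantee that `shutdown()` runs even if generation raises. The sketch below does this with `contextlib.contextmanager`; `engine_session` and `StubEngine` are hypothetical names, and the stub only records that `shutdown()` was called.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that only tracks shutdown."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(factory):
    """Create an engine with `factory()` and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


if __name__ == "__main__":
    with engine_session(StubEngine) as engine:
        print(engine.generate(["The capital of France is"], {"temperature": 0.8}))
```

With a real engine, the factory would be something like `lambda: sgl.Engine(model_path=...)`, and the `with` block would contain the generation calls shown on this page.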