# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Fiona Walker and I am a qualified Clinical Hypnotherapist and Hypnotherapist Member of the National Council for Hypnotherapy.
I run my practice in Northwich, Cheshire, which is convenient for those living in and around Northwich, Knutsford, Warrington, Chester, Runcorn and other surrounding areas.
I help my clients to overcome a wide range of issues such as anxiety, stress, insomnia, low self-confidence, weight management, phobias, smoking and many other problems.
Using a combination of gentle, powerful and relaxing hypnotherapy techniques, I help my clients to achieve their goals and improve
Prompt: The president of the United States is
Generated text:  the head of state and head of government of the United States of America. The president is elected through the Electoral College system, with each state allocating a certain number of electoral votes based on its population. The president serves a four-year term and is limited to two terms in 
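The loop above relies on `llm.generate` returning one result per prompt, in the same order, with the generated string under the `text` key. A minimal sketch of collecting those results into records follows. `StubEngine` and `collect_results` are hypothetical names introduced here so the snippet runs without a model; they are not part of SGLang, and only the `text` key seen above is assumed.

```python
from typing import Dict, List


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so this sketch runs without a model."""

    def generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict]:
        # Mimic the shape used above: one dict with a "text" key per prompt.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def collect_results(engine, prompts: List[str], sampling_params: Dict) -> List[Dict]:
    """Run one batched call and pair each prompt with its generated text."""
    outputs = engine.generate(prompts, sampling_params)
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


records = collect_results(
    StubEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
for record in records:
    print(record["prompt"], "->", record["completion"])
```

Passing the real `llm` object in place of `StubEngine()` should work the same way, provided its results expose `text` as shown in the cell above.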

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and artist living in a small town in the Pacific Northwest. I enjoy hiking and exploring the outdoors, and I'm passionate about environmental conservation. I'm a bit of a introvert, but I love meeting new people and hearing their stories. I'm currently working on a novel and a series of illustrations that explore the intersection of nature and human experience. I'm excited to connect with like-minded individuals and share my work with the world.
This self-introduction is neutral because it doesn't reveal any personal biases or opinions, and it doesn't try to persuade the reader to agree with

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city. The capital of France is Paris. The city is located in the northern part of the country, along the Seine River. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. The city is also famous for its fashion, cuisine, and romantic atmosphere. Paris is a popular tourist destination and a major cultural and economic hub in Europe. The city has a rich history dating back to the Middle Ages and has been a center of art, literature, and science for centuries. Today, Paris is a vibrant and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in artificial intelligence:
1. Increased Adoption in Various Industries: AI is expected to become more widespread across various industries, including healthcare, finance, education, and transportation. This will lead to increased efficiency, productivity, and innovation in these sectors.
2. Advancements in Machine Learning: Machine learning is a key component of AI, and it is expected to continue to advance in the future. This will enable AI systems to learn from data and improve their performance over time.
3. Rise of Explainable AI: As AI
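The cell above mentions "overlap removal": when streamed chunks repeat text that has already been received, the repeated prefix has to be dropped before the chunks are concatenated. The following is an illustrative sketch of that idea, not necessarily how `stream_and_merge` is implemented; `trim_overlap` and `merge_chunks` are names chosen for this example, and the chunks are made up.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, skipping text repeated at each chunk boundary."""
    final_text = ""
    for chunk in chunks:
        final_text += trim_overlap(final_text, chunk)
    return final_text


# Made-up chunks whose boundaries overlap.
chunks = ["The capital", " capital of France", " France is Paris."]
merged = merge_chunks(chunks)
print(merged)
```

A purely textual check like this cannot tell a genuine repetition from an overlap: `trim_overlap("ha", "ha")` returns an empty string, so it suits streams where each chunk is known to overlap the previous output.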



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Lena. I live in the small town of Willow Creek and work as a librarian. I like to read, hike, and play the violin in my free time. I'm a bit of a homebody and enjoy spending time alone, but I appreciate the simple pleasures in life and am always up for a quiet conversation or a good book recommendation.
Answer: My name is Lena. I live in the small town of Willow Creek and work as a librarian. I like to read, hike, and play the violin in my free time. I'm a bit of a homebody and enjoy spending time alone, but I appreciate the simple pleasures in

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is located in the north-central region of the country.
Describe the geographical features of Paris. Paris is situated on the River Seine and is surrounded by several major rivers, including the Marne and the Ois
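Awaiting `async_generate` hands control back to the event loop, so other coroutines can make progress while generation is pending. The sketch below illustrates this with `asyncio.gather`. `StubAsyncEngine`, `heartbeat`, and `run_batch` are hypothetical names used only so the example runs without a model; the stub copies just the call shape shown above.

```python
import asyncio
from typing import Dict, List


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; not part of SGLang."""

    async def async_generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict]:
        await asyncio.sleep(0.01)  # Pretend the model is busy generating.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks: List[int]) -> None:
    """Unrelated work that keeps running while generation is awaited."""
    for i in range(3):
        ticks.append(i)
        await asyncio.sleep(0.001)


async def run_batch(engine, prompts: List[str], sampling_params: Dict):
    ticks: List[int] = []
    # Both coroutines share the event loop; neither blocks the other.
    outputs, _ = await asyncio.gather(
        engine.async_generate(prompts, sampling_params),
        heartbeat(ticks),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(
    run_batch(StubAsyncEngine(), ["The future of AI is"], {"temperature": 0.8})
)
print(outputs[0]["text"], ticks)
```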

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly,
        # so text repeated between chunks is not printed twice
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Astra Redding, and I'm a 17-year-old high school student living in the small town of Willow Creek. I like to spend my free time exploring the outdoors and reading about science and history. I'm not really sure what I want to do with my life yet, but I'm trying to figure that out. I'm a bit of a loner, but I enjoy the company of my close friends and family.
Astra Redding is the main character in a young adult novel called "The Quiet Observations of Astra Redding". The story follows Astra as she navigates the complexities of high school, friendships



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is located in the northern part of the country.
Paris is situated on the Seine River.
Paris is home to numerous famous landmarks, including the Eiffel Tower and Notre-Dame Cathedral.
The population of Paris is approximately 2.1 million people.
Paris is a major economic and cultural center in Europe.
Paris is known for its fashion industry, cuisine, and art museums.
The city has a diverse population with people from various ethnic and cultural backgrounds.
Paris is a popular tourist destination, attracting millions of visitors each year.
Provide a concise factual statement about France’s capital city. The capital of France is Paris.
Paris



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be shaped by various factors including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in artificial intelligence:
1. Increased adoption of AI in various industries: AI is expected to become a standard tool in various industries such as healthcare, finance, education, and transportation. It will help automate routine tasks, improve decision-making, and enhance customer experience.
2. Advancements in natural language processing (NLP) and computer vision: AI systems will become more proficient in understanding and generating human language, and in interpreting and processing visual data. This will enable applications such as virtual assistants, language translation, and image recognition
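In the asynchronous streaming cell, each piece of text is printed as soon as it arrives, with repeated text removed first. The sketch below shows one plausible shape for such an overlap-aware async generator. It is an illustration under stated assumptions, not the actual implementation of `async_stream_and_merge`: `fake_stream` is a hypothetical stand-in for a streaming engine response, and the chunks are invented.

```python
import asyncio
from typing import AsyncIterator, List


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    for size in range(min(len(existing_text), len(new_chunk)), 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


async def fake_stream(chunks: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming engine response."""
    for chunk in chunks:
        await asyncio.sleep(0)  # Yield control, as a real stream would.
        yield chunk


async def stream_without_repeats(stream: AsyncIterator[str]) -> AsyncIterator[str]:
    """Yield only the new part of each chunk as it arrives."""
    final_text = ""
    async for chunk in stream:
        cleaned = trim_overlap(final_text, chunk)
        final_text += cleaned
        yield cleaned


async def main() -> str:
    pieces = []
    source = fake_stream(["Paris", "Paris is", " is located"])
    async for piece in stream_without_repeats(source):
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()
    return "".join(pieces)


result = asyncio.run(main())
```

Because the cleaned pieces are yielded one at a time, the caller decides how to display them; printing with `end=""` keeps them on one line.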




In [6]:
llm.shutdown()
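In a longer script, an exception raised during generation would skip a trailing `llm.shutdown()` call. One way to guard against that is a small context manager, sketched below. `managed_engine` is a hypothetical helper and `StubEngine` a stand-in that exposes only the `shutdown()` method used above; neither is part of SGLang.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing only the shutdown() method used above."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def shutdown(self) -> None:
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown(), even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed_engine(engine) as llm:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.is_shut_down)
```

The same pattern should apply to a real engine object, assuming it exposes `shutdown()` as in the cell above.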