# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation
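
The four modes differ only in whether the call blocks and whether results arrive all at once or piece by piece. The sketch below illustrates those four calling patterns with a toy stand-in. `ToyEngine` and all of its method names are hypothetical and exist only for illustration; they are not SGLang's API, and the "generation" is a plain string transform so the example runs without a model.

```python
import asyncio
from typing import AsyncIterator, Iterator, List


class ToyEngine:
    """Hypothetical stand-in for an inference engine (not SGLang's API)."""

    def _complete(self, prompt: str) -> str:
        # Deterministic fake "generation" so the example runs without a model.
        return prompt.upper()

    def generate(self, prompts: List[str]) -> List[str]:
        # Non-streaming, synchronous: block until every result is ready.
        return [self._complete(p) for p in prompts]

    def generate_stream(self, prompt: str) -> Iterator[str]:
        # Streaming, synchronous: a generator that yields pieces one by one.
        for word in self._complete(prompt).split():
            yield word

    async def async_generate(self, prompts: List[str]) -> List[str]:
        # Non-streaming, asynchronous: awaitable that resolves to all results.
        await asyncio.sleep(0)  # hand control back to the event loop once
        return self.generate(prompts)

    async def async_generate_stream(self, prompt: str) -> AsyncIterator[str]:
        # Streaming, asynchronous: async generator consumed with `async for`.
        for word in self._complete(prompt).split():
            await asyncio.sleep(0)
            yield word


async def demo_async(engine: ToyEngine) -> List[str]:
    batch = await engine.async_generate(["a b"])
    streamed = [piece async for piece in engine.async_generate_stream("c d")]
    return batch + streamed


engine = ToyEngine()
sync_batch = engine.generate(["hello world"])
sync_stream = list(engine.generate_stream("hello world"))
async_results = asyncio.run(demo_async(engine))
```

The synchronous variants suit plain scripts, while the asynchronous ones fit code that already runs inside an event loop.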

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Lisa and I am a passionate individual with a strong will to help others. I am a proud mother of 2 wonderful children and a dedicated wife to my loving husband. I have had my fair share of challenges and struggles in life, but I have never lost sight of my dreams and aspirations. As a result, I have been able to overcome obstacles and come out stronger on the other side.
I have a deep desire to connect with like-minded individuals who share my passion for helping others. I believe that everyone deserves a chance to live a fulfilling life, and I am committed to doing my part in making that a reality.
As a dedicated
Prompt: The president of the United States is
Generated text:  the head of state and the head of government of the United States. The president serves a four-year term and is elected through the Electoral College system. The president is responsible for enforcing laws, commanding the armed forces, and conducting foreign policy. The pr

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student who enjoys reading and playing video games in my free time. I'm a bit of a bookworm and often get lost in the world of fantasy novels. I'm also pretty good at strategy games like chess and Starcraft. When I'm not studying or gaming, you can find me listening to music or trying out new recipes in the kitchen. I'm a bit of a homebody, but I'm always up for a good conversation or adventure. I'm a bit of a perfectionist, but I'm working on being more laid-back and embracing my flaws. I'm

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city has a population of over 2.1 million people and is a major hub for international business, tourism, and education. Paris is also known for its romantic atmosphere and is often referred to as the "City of Light." (Word Count: 100) Provide a concise factual statement

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. While it is difficult to predict exactly what the future will hold, there are several trends that are likely to shape the development and impact of artificial intelligence in the coming years.
1. Increased Adoption of AI in Various Industries:
AI is expected to become increasingly adopted across various industries, including healthcare, finance, transportation, and education. This will lead to improved efficiency, productivity, and decision-making in these sectors.
2. Advancements in Machine Learning and Deep Learning:
Machine learning and deep learning are expected to continue to advance, enabling AI systems to learn from data and improve their performance over time. This will
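
The cell above mentions "overlap removal" when merging streamed chunks. As a rough illustration of the idea, the sketch below merges chunks whose beginning may repeat the end of the text received so far. The helper names `merge_with_overlap` and `merge_stream` are hypothetical, and this is not claimed to be how `stream_and_merge` is implemented.

```python
from typing import Iterable


def merge_with_overlap(existing: str, chunk: str) -> str:
    """Append chunk to existing, dropping the longest prefix of chunk
    that already appears as a suffix of existing."""
    max_len = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return existing + chunk[size:]
    # No overlap found: plain concatenation.
    return existing + chunk


def merge_stream(chunks: Iterable[str]) -> str:
    """Fold a sequence of possibly overlapping chunks into one string."""
    merged = ""
    for chunk in chunks:
        merged = merge_with_overlap(merged, chunk)
    return merged


chunks = ["The capital", " capital of", " of France", " is Paris."]
result = merge_stream(chunks)
```

A purely textual check like this cannot tell a repeated chunk from a genuine repetition in the output (for example "ha" followed by "ha"), so it is only appropriate when the stream is known to resend overlapping text.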



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Astrid Erso. I'm a 22-year-old introverted botanist with a passion for studying unusual plant species.
Astrid Erso's introduction is concise and to the fact. It includes the following information:
Age: 22
Occupation: Botanist
Personality: Introverted
Interests: Studying unusual plant species
This introduction provides a neutral, neutral description of Astrid, without any emotional or personal details. It sets the stage for her character and background, leaving room for further development and expansion in the story. The introduction is also concise, making it easy to understand and remember. Overall

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which has a population of approximately 2.1 million people. (Source: Wikipedia)
Based on the provided text, the answer to the question is Paris. The text expl

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elianore Quasar. I'm a 25-year-old astrophysicist with a penchant for science fiction novels and a love for stargazing on clear nights. I currently work at the Celestial Research Institute, where I'm part of a team studying the mysteries of black holes. My interests include exploring the intersection of science and philosophy, and I'm always eager to engage in discussions about the nature of reality.
This text is a good example of a neutral self-introduction because it:
Provides a brief overview of the character's background and profession.
Mentions their interests and hobbies.
Avoids any overtly dramatic or attention



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is located in the northern part of the country and has a population of over 2.1 million people. It is situated on the Seine River and is one of the most visited cities in the world.
Paris, also known as the City of Light, is a global center for fashion, cuisine, and art. It has been a major hub of culture and learning for centuries and is home to many famous landmarks, such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral.
The city has a rich history dating back to the 3rd century BC and has been ruled by various em



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by several factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI will continue to be used in healthcare to improve diagnosis, treatment, and patient outcomes. For example, AI-powered diagnostic tools will become more common, and AI-assisted robots will be used to perform surgeries.
2. More focus on explainability and transparency: As AI becomes more pervasive, there will be a growing need for AI systems to be explainable and transparent. This means that AI systems will need to provide clear explanations for their




In [6]:
llm.shutdown()

### Return Hidden States

In [7]:
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", return_hidden_states=True
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [8]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 10}

outputs = llm.generate(prompts, sampling_params=sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    meta = output["meta_info"]
    print(
        f"Prompt: {prompt}\n"
        f"Generated text: {output['text']}\n"
        f"Prompt_Tokens: {meta['prompt_tokens']}\t"
        f"Completion_tokens: {meta['completion_tokens']}\n"
        f"Hidden states: {[i.shape for i in meta['hidden_states']]}"
    )
    print()

Prompt: Hello, my name is
Generated text:  Jacob, and I’m the president of the Bear
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The president of the United States is
Generated text:  the head of state and the head of government of
Prompt_Tokens: 8	Completion_tokens: 10
Hidden states: [torch.Size([8, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The capital of France is
Generated text:  a city like no other. Paris is the epic
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]),
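
In the sample output above, each result carries one 2-D entry with a row per prompt token, followed by 1-D vectors, for a total that matches the number of completion tokens. The sketch below encodes that observed pattern as a small helper. `expected_hidden_state_shapes` is a hypothetical name, and the rule is inferred only from this output (under the assumption that the first entry comes from processing the whole prompt); it is not a documented guarantee.

```python
from typing import List, Tuple


def expected_hidden_state_shapes(
    prompt_tokens: int, completion_tokens: int, hidden_size: int
) -> List[Tuple[int, ...]]:
    """Shapes matching the pattern seen in the sample output (an assumption)."""
    if completion_tokens < 1:
        raise ValueError("completion_tokens must be at least 1")
    # First entry: one hidden vector per prompt token.
    shapes: List[Tuple[int, ...]] = [(prompt_tokens, hidden_size)]
    # Remaining entries: one hidden vector per later generation step.
    shapes += [(hidden_size,)] * (completion_tokens - 1)
    return shapes


# Values taken from the first prompt in the output above.
shapes = expected_hidden_state_shapes(
    prompt_tokens=6, completion_tokens=10, hidden_size=4096
)
```

A helper like this can serve as a sanity check when post-processing `output['meta_info']['hidden_states']`, for instance before stacking the entries into a single matrix.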

In [9]:
llm.shutdown()