# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Shelly and I am a 30-year-old wife and mom to two beautiful kids. I am a stay-at-home mom and I love every minute of it. I get to spend all day, every day, with my babies and that is truly a blessing. I also have a passion for photography and I love to capture moments and memories for my family.
When I'm not playing with my kids or taking pictures, you can find me trying out new recipes in the kitchen or running errands for my family. I love to stay organized and make lists to help keep my sanity. I'm a bit of a planner, but I'm
Prompt: The president of the United States is
Generated text:  the head of state and the head of government of the United States. The president serves as the commander-in-chief of the United States Armed Forces and is the highest-ranking official in the executive branch of the federal government. The president is also the chief executive of the United States and is responsible for carrying out the laws passed by Congre

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor. I live in a small apartment in the city with my cat, Luna. I enjoy reading, hiking, and trying out new recipes in my free time. I'm a bit of a introvert, but I'm always up for a good conversation.
This self-introduction is neutral because it doesn't reveal any personal opinions or biases. It simply states the character's name, occupation, and some basic facts about their life. This can be a good way to introduce a character in a story, especially if you want to keep the focus on the plot rather than the

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris. The city is known for its beautiful architecture, art museums, and romantic atmosphere. Paris is a popular tourist destination and is home to many famous landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. The city has a rich history and has been a center of culture and learning for centuries. Paris is also a major economic and financial center, and is home to many international organizations and companies. The city has a population of over 2.1 million people and is a hub for transportation, with two major airports and a comprehensive public transportation system. Overall, Paris

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with the potential to revolutionize the way we diagnose and treat diseases.
2. Widespread adoption of AI in industries: AI is already being used in various industries such as finance, transportation, and customer service. In the future, AI is likely to become even more widespread, with many industries adopting AI-powered solutions to
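The cell above mentions "overlap removal". As a rough illustration of what that could mean, the sketch below merges streamed chunks while dropping any prefix of a new chunk that repeats the end of the text collected so far. The names `strip_overlap` and `merge_chunks` are hypothetical and exist only for this example; this is not claimed to be how `stream_and_merge` is implemented in `sglang.utils`.

```python
def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk without the prefix that already ends existing_text.

    Hypothetical helper for illustration; not the sglang implementation.
    """
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    # No overlap found: keep the chunk unchanged.
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Made-up chunks in which each one repeats the tail of the previous text.
chunks = [" Paris", " Paris is", " is the capital"]
print(repr(merge_chunks(chunks)))
```

One trade-off of this simple approach is that a genuinely repeated word at a chunk boundary is indistinguishable from an overlap and would also be dropped, so it only suits streams where chunks may resend earlier text.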



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elara, and I'm a 22-year-old student working on a degree in environmental science at the University of Washington. I'm originally from Seattle, where I grew up surrounded by the city's unique blend of urban and natural landscapes. I enjoy hiking and exploring the outdoors, especially in the Pacific Northwest. I'm currently focusing on learning as much as I can about sustainable ecosystems and conservation efforts, with the goal of making a positive impact in my community. That's me in a nutshell! I'm excited to meet new people and collaborate with others who share my passions. (How to Write a Self-Introduction) - English lessons by

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It has a population of approximately 2.1 million people. Paris is the center of the Île-de-France region, home to a populatio
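The asynchronous interface is most useful when several requests are in flight at once, for example inside a custom server. The sketch below shows that pattern with `asyncio.gather`. To keep it runnable without a GPU it uses `StandInEngine`, a hypothetical class that only mimics the call shape of `async_generate` used above (a list of prompts in, a list of dicts with a `"text"` key out). Whether the real engine handles concurrent calls this way is an assumption to verify against the SGLang documentation.

```python
import asyncio


class StandInEngine:
    """Hypothetical stand-in for the engine; returns placeholder text."""

    async def async_generate(self, prompts, sampling_params):
        # Yield control once to simulate awaiting real work.
        await asyncio.sleep(0)
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Start one coroutine per batch and wait for all of them.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    # gather returns results in the same order as the tasks were passed.
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

results = asyncio.run(run_batches(StandInEngine(), batches, sampling_params))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Replacing `StandInEngine()` with a real engine instance would be the intended use, under the assumption stated above.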

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Astrid Jensen and I'm a junior in high school. I'm 16 years old and I enjoy playing soccer and reading science fiction novels. I'm a bit of a daydreamer and I often find myself lost in thought while staring out the window. I'm working hard to get good grades and hoping to attend a good college in the future. I'm also a bit of a worrier, often thinking about all the things that could go wrong. But I'm trying to focus on the positive and make the most of every day.
I'll give you some more information about Astrid. She's a smart and ambitious student,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
A possible rhetorical question to pose to the class: What is the Eiffel Tower’s significance?
Possible response: The Eiffel Tower is an iron lattice tower located in Paris, France, and it was the tallest man-made structure in the world when it was first built in the late 19th century. It was constructed for the 1889 World’s Fair and has since become a symbol of French culture and engineering ingenuity. The tower stands at a height of 324 meters (1,063 feet) and is one of the most recognizable landmarks in the world.
Possible follow-up question: What are some other famous

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  subject to ongoing research and debate, and experts offer a range of predictions about its future impact. With the rapid growth of machine learning, AI is becoming increasingly prevalent across industries, including healthcare, finance, education, and transportation. Here are some potential future trends in artificial intelligence:
1. Increased Use of Edge AI: Edge AI refers to the processing of data and AI models at the edge of the network, closer to the source of the data, rather than in the cloud. This trend is likely to continue as the need for real-time processing and lower latency increases. Edge AI will enable more efficient and autonomous systems, such as self-driving cars

In [6]:
llm.shutdown()

### Return Hidden States

In [7]:
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", return_hidden_states=True
)


In [8]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 10}

outputs = llm.generate(prompts, sampling_params=sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(
        f"Prompt: {prompt}\nGenerated text: {output['text']}\nPrompt_Tokens: {output['meta_info']['prompt_tokens']}\tCompletion_tokens: {output['meta_info']['completion_tokens']}\nHidden states: {[i.shape for i in output['meta_info']['hidden_states']]}"
    )
    print()

Prompt: Hello, my name is
Generated text:  Belle Gray and I am the founder and Chief Creative
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The president of the United States is
Generated text:  known for his ability to speak eloquently and
Prompt_Tokens: 8	Completion_tokens: 10
Hidden states: [torch.Size([8, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The capital of France is
Generated text:  Paris. Paris is the most populous city in France
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), to

In [9]:
llm.shutdown()
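The hidden-state shapes printed above follow a visible pattern: one 2-D entry covering the prompt, followed by 1-D entries for generated tokens. The sketch below checks that arithmetic for the first example (6 prompt tokens, 10 completion tokens). It works on plain shape tuples copied from the output, so it needs neither PyTorch nor a running engine. The helper `count_positions` is hypothetical, and the relationship it checks is an observation from this one run, not a documented guarantee.

```python
def count_positions(shapes):
    """Count token positions covered by a list of hidden-state shapes.

    A 2-D shape (n, hidden) contributes n positions; a 1-D shape
    (hidden,) contributes one. Hypothetical helper for illustration.
    """
    total = 0
    for shape in shapes:
        total += shape[0] if len(shape) == 2 else 1
    return total


# Shapes copied from the first example's output above.
shapes = [(6, 4096)] + [(4096,)] * 9

prompt_tokens = 6
completion_tokens = 10

covered = count_positions(shapes)
# In this run, the count matches prompt + completion - 1.
expected = prompt_tokens + completion_tokens - 1
print(covered, expected)
```

A plausible reading is that the last generated token has no hidden state of its own because it was never fed back into the model, but treat that as an interpretation of the output rather than a statement from the source.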