# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
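
As a rough illustration of the custom-server use case, the sketch below shows a framework-agnostic handler that a web route could call. `handle_generate` and the payload keys `prompt` and `sampling_params` are hypothetical names chosen for this example and are not part of SGLang. The only engine behaviour assumed is the one used elsewhere in this document: `generate(prompts, sampling_params)` returns a list of dicts with a `text` key.

```python
from typing import Any, Dict


def handle_generate(engine: Any, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Validate a JSON-like payload and run one prompt through the engine.

    Hypothetical helper: `engine` only needs a
    `generate(prompts, sampling_params)` method that returns a list of
    dicts with a "text" key.
    """
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        return {"error": "'prompt' must be a non-empty string"}

    # A missing or empty value falls back to an empty dict.
    sampling_params = payload.get("sampling_params") or {}
    if not isinstance(sampling_params, dict):
        return {"error": "'sampling_params' must be an object"}

    # Send a one-element batch so the result is always a list.
    outputs = engine.generate([prompt], sampling_params)
    return {"text": outputs[0]["text"]}
```

Keeping validation and the engine call in a plain function like this makes the logic easy to test with a stub engine before wiring it into any particular HTTP framework.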

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Amanda and I am a daughter, sister, friend, partner and mother. I am also a massage therapist, reflexologist, energy healer and hypnotherapist. I am passionate about helping others to achieve their full potential and live a happy, healthy and balanced life.
I have been working in the wellness industry for over 10 years, and during that time I have had the privilege of working with clients from all walks of life. I have helped people to manage stress and anxiety, improve their physical health, overcome addictions, and even cope with the loss of a loved one.
My approach is holistic, meaning I consider the whole person
Prompt: The president of the United States is
Generated text:  set to unveil a massive new infrastructure plan, but the details are shrouded in secrecy. White House officials have refused to provide even the most basic information about the plan, including its cost and the projects it will fund.
According to Axios, the plan is a "h
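
Offline batch jobs often need to persist their results. The sketch below pairs each prompt with its generated text and writes the pairs as JSON Lines. `pair_outputs` and `write_jsonl` are hypothetical helpers introduced for this example; they assume only that each output is a dict with a `text` key, as in the cell above.

```python
import json
from typing import Any, Dict, Iterable, List, TextIO


def pair_outputs(
    prompts: List[str], outputs: List[Dict[str, Any]]
) -> List[Dict[str, str]]:
    """Pair each prompt with the generated text at the same index."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    return [{"prompt": p, "text": o["text"]} for p, o in zip(prompts, outputs)]


def write_jsonl(records: Iterable[Dict[str, str]], fp: TextIO) -> None:
    """Write one JSON object per line to an open text file."""
    for record in records:
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")
```

With the variables from the cell above, usage might look like `write_jsonl(pair_outputs(prompts, outputs), open("results.jsonl", "w"))`, where the file name is only a placeholder.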

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor. I live in a small apartment in the city with my cat, Luna. I enjoy reading, hiking, and trying out new coffee shops. I'm a bit of a introvert, but I'm working on being more outgoing. I'm interested in learning more about the world and meeting new people. That's me in a nutshell. What do you think? Is it a good self-introduction?
This is a good self-introduction because it's concise, neutral, and provides a clear picture of who you are. It doesn't reveal too much about your personality

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris.  Paris is the largest city in France and is located in the northern part of the country. It is situated on the Seine River and is known for its beautiful architecture, art museums, and fashion industry. Paris is home to many famous landmarks, including the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city has a population of over 2.1 million people and is a major center for business, culture, and tourism.  Paris is also known for its romantic atmosphere and is often referred to

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, from diagnosis and treatment to personalized medicine and patient care. AI-powered systems will be able to analyze vast amounts of medical data, identify patterns, and make predictions about patient outcomes.
2. Rise of Explainable AI (XAI): As AI becomes more pervasive, there is a growing need to understand how AI systems make decisions. XAI will focus on developing AI systems that can provide transparent and interpre
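
The "overlap removal" mentioned in the cell above can be pictured as follows. This is a minimal sketch of one possible approach and not necessarily how `stream_and_merge` is implemented; `trim_overlap` and `merge_chunks` are names chosen for this example. It assumes each new chunk may begin with text that already sits at the end of the accumulated output.

```python
from typing import Iterable


def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return the part of new_chunk that is not already at the end of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while dropping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(merge_chunks(["Hello", "lo wor", "world!"]))  # prints "Hello world!"
```

One caveat of this naive strategy is that it cannot tell an overlapping chunk from a genuine repetition: merging `"ha"` and `"ha"` yields `"ha"`, so it only suits streams where overlaps are expected.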



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Julian Blackwood. I'm a 25-year-old who lives in the city. I'm a bit of a loner, but I enjoy taking walks in the park and reading about history.
This self-introduction is neutral because it doesn't reveal any emotional or personal details about Julian. It simply states his name, age, location, and a few of his habits. This is a good way to introduce a character to the reader because it sets a baseline for their personality and background without giving too much away. The reader can then interpret the details of the introduction as they see fit, depending on the context of the story. For example

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 
This is a factual statement about the capital city of France. 
The city of Paris is located in northern France and is the most populous city in the country. 
The 
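
One reason to prefer the asynchronous API is that several requests can be awaited together. The sketch below is an illustration under assumptions: `generate_groups` is a hypothetical helper, and it presumes the engine accepts overlapping `async_generate` calls, which this document does not state explicitly. It relies only on `async_generate(prompts, sampling_params)` returning a list of dicts with a `text` key.

```python
import asyncio
from typing import Any, Dict, List


async def generate_groups(
    engine: Any, groups: List[List[str]], sampling_params: Dict[str, Any]
) -> List[List[str]]:
    """Run several prompt groups concurrently and return their texts in order."""
    # Create one coroutine per group; nothing runs until they are awaited.
    tasks = [engine.async_generate(group, sampling_params) for group in groups]
    # gather() preserves the order of its arguments in the result list.
    results = await asyncio.gather(*tasks)
    return [[output["text"] for output in outputs] for outputs in results]
```

With the engine from this section, a call might look like `await generate_groups(llm, [prompts[:2], prompts[2:]], sampling_params)`.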

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elijah. I'm a 25-year-old artist living in Brooklyn. I like to draw and paint in my free time, and I'm currently working on a series of mixed media pieces exploring themes of identity and place. I'm also an avid reader and love getting lost in sci-fi and fantasy novels. I'm still figuring out what I want to do with my life, but for now, I'm just taking things one day at a time.
Describe a time when Elijah's neutral demeanor was challenged. It was a week before the art show where his mixed media pieces were going to be showcased. He had spent countless hours perfecting his



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The city is located in the northern part of the country along the Seine River. Paris is a major cultural and economic center. The city is also famous for its iconic landmarks, such as the Eiffel Tower and the Louvre Museum.
Provide a concise factual statement about the population of France. The population of France is approximately 67 million people. The population is densely concentrated in the north and west of the country, with the majority living in urban areas. The population is projected to continue growing due to a high birth rate and an influx of immigrants.
Provide a concise factual statement about the climate of France. The climate of



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by a combination of technological advancements, societal needs, and ethical considerations. Here are some possible future trends in AI:
1. Increased use of deep learning and neural networks: Deep learning and neural networks have already shown significant promise in AI applications such as image and speech recognition. We can expect to see further advancements in these areas, leading to more sophisticated AI systems.
2. Rise of Explainable AI (XAI): As AI becomes more ubiquitous, there is a growing need to understand how AI decisions are made. Explainable AI aims to provide transparency and interpretability into AI decision-making processes, which will be crucial for building
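
When the streamed text is needed both incrementally and as a whole, the chunks can be forwarded to a callback and accumulated at the same time. The following is a small sketch; `collect_stream` and its `on_chunk` parameter are hypothetical names introduced here, and the helper works with any async iterator of strings.

```python
import asyncio
from typing import AsyncIterator, Callable, List, Optional


async def collect_stream(
    chunks: AsyncIterator[str],
    on_chunk: Optional[Callable[[str], None]] = None,
) -> str:
    """Consume an async stream of text chunks and return the joined text."""
    parts: List[str] = []
    async for chunk in chunks:
        if on_chunk is not None:
            on_chunk(chunk)  # e.g. print the chunk as soon as it arrives
        parts.append(chunk)
    return "".join(parts)
```

Inside `main()` above, this could replace the explicit loop, for example `text = await collect_stream(async_stream_and_merge(llm, prompt, sampling_params), on_chunk=lambda c: print(c, end="", flush=True))`.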




In [6]:
llm.shutdown()

### Return Hidden States

In [7]:
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", return_hidden_states=True
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [8]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 10}

outputs = llm.generate(prompts, sampling_params=sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    meta = output["meta_info"]
    print(
        f"Prompt: {prompt}\n"
        f"Generated text: {output['text']}\n"
        f"Prompt_Tokens: {meta['prompt_tokens']}\t"
        f"Completion_tokens: {meta['completion_tokens']}\n"
        f"Hidden states: {[i.shape for i in meta['hidden_states']]}"
    )
    print()

Prompt: Hello, my name is
Generated text:  Dr. April Gardner. I am a board certified
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The president of the United States is
Generated text:  an elected official who serves as the head of the
Prompt_Tokens: 8	Completion_tokens: 10
Hidden states: [torch.Size([8, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The capital of France is
Generated text:  Paris.
The capital of New Zealand is Wellington.

Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.
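
In the sample output above, the first entry of `hidden_states` has shape `[prompt_tokens, 4096]` and each later entry has shape `[4096]`. For the first prompt, that gives 6 + 9 = 15 token positions for 6 prompt tokens and 10 completion tokens. The sketch below checks that arithmetic on plain shape tuples; `count_positions` is a hypothetical helper, and the layout it assumes is inferred from this output only, so it may differ across versions.

```python
from typing import Sequence, Tuple


def count_positions(shapes: Sequence[Tuple[int, ...]]) -> int:
    """Count token positions covered by a list of hidden-state shapes."""
    total = 0
    for shape in shapes:
        if len(shape) == 2:
            # Prefill entry: (num_prompt_tokens, hidden_size)
            total += shape[0]
        elif len(shape) == 1:
            # Decode entry: (hidden_size,) for a single token
            total += 1
        else:
            raise ValueError(f"unexpected hidden-state shape: {shape}")
    return total


# Shapes copied from the first prompt in the output above.
shapes = [(6, 4096)] + [(4096,)] * 9
print(count_positions(shapes))  # prints 15
```

With real outputs, the shapes could be obtained as `[tuple(h.shape) for h in output["meta_info"]["hidden_states"]]`.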

In [9]:
llm.shutdown()