# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
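
To make the "custom server" idea concrete, here is a minimal, framework-free sketch of the request-handling core such a server might wrap around the engine. `StubEngine` and `handle_request` are hypothetical names introduced for illustration; the stub only mimics the list-of-dicts return shape used later in this document so the sketch runs without a GPU. It is not taken from the linked `custom_server.py` script.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params=None):
        # Mimic the list-of-dicts shape returned for a list of prompts.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def handle_request(engine, body: str) -> str:
    """Parse a JSON request body, call the engine, and return a JSON response."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return json.dumps({"error": "invalid JSON"})
    if not isinstance(payload, dict):
        return json.dumps({"error": "request body must be a JSON object"})

    prompts = payload.get("prompts")
    if not isinstance(prompts, list) or not all(isinstance(p, str) for p in prompts):
        return json.dumps({"error": "'prompts' must be a list of strings"})

    sampling_params = payload.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    return json.dumps({"completions": [o["text"] for o in outputs]})


request_body = json.dumps({"prompts": ["The capital of France is"]})
response = handle_request(StubEngine(), request_body)
print(response)
```

In a real server, the same function could be called from whatever transport layer you prefer, with the stub replaced by an actual engine instance.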

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.32it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Glenny and I am a 4th year animal science student at the University of Guelph. I am excited to be a part of the Equine Guelph program and I am looking forward to sharing my experiences with you.
My passion for horses started at a young age and I have been involved in the equine industry for most of my life. I have competed in horse shows, worked at a horse stable, and volunteered at a therapeutic riding center. My experience with horses has not only taught me valuable skills, but has also given me a sense of purpose and fulfillment.
Through the Equine Guelph program, I
Prompt: The president of the United States is
Generated text:  usually the most powerful figure in American politics, but sometimes events can limit his or her power. During times of crisis, such as war or economic downturn, the president may find that Congress, the courts, or the public has significant influence over their decisions.
Historically, the president has had signific
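
For offline batch jobs it is often convenient to persist each prompt together with its completion. The sketch below writes prompt/completion pairs as JSON Lines. `write_jsonl` is a hypothetical helper, and the `outputs` list is stand-in data shaped like the result of `llm.generate` above (a list of dicts with a `text` key), so the block runs without the engine.

```python
import io
import json


def write_jsonl(prompts, outputs, fh):
    """Write one JSON object per prompt/completion pair."""
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        fh.write(json.dumps(record) + "\n")


# Stand-in data shaped like the `outputs` list above (text shortened).
prompts = ["The capital of France is"]
outputs = [{"text": " Paris."}]

buffer = io.StringIO()  # swap for open("results.jsonl", "w") in a real run
write_jsonl(prompts, outputs, buffer)
print(buffer.getvalue(), end="")
```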

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy reading, hiking, and trying new foods. I'm currently working on a novel and trying to learn more about the city's hidden gems. That's me in a nutshell. What do you think? Is it too short or too long? Should I add anything else?
Your self-introduction is concise and to the point. It provides a good balance of personal and professional information. However, it might be a bit too short for a self-introduction, especially if you're trying to make a good impression or establish a connection with others. Consider adding

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris. 
This statement is a concise factual statement about France’s capital city. It directly answers the question and provides a clear and accurate piece of information. The statement is also very short and to the point, making it easy to understand and remember. 
Here are a few more examples of concise factual statements about France’s capital city:
- Paris is the capital of France.
- The capital of France is located in the Île-de-France region.
- Paris is situated on the Seine River.
- The city of Paris is known for its

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Some experts predict that AI will become increasingly integrated into our daily lives, while others warn of the potential risks and challenges associated with its development. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, potentially leading to improved patient outcomes and more efficient healthcare systems.
2. Rise of autonomous vehicles: Autonomous vehicles are already being tested on public roads, and it's likely that they
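
The "overlap removal" mentioned in the cell above can be pictured as follows: when a new chunk begins with text that already ends the accumulated output, only the unseen remainder is appended. The sketch below is one plausible way to do this and is not claimed to be how `sglang.utils.stream_and_merge` is implemented; `trim_overlap` and `merge_chunks` are hypothetical names, and the chunks are made-up strings.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, skipping text repeated at chunk boundaries."""
    final_text = ""
    for chunk in chunks:
        final_text += trim_overlap(final_text, chunk)
    return final_text


# Hypothetical chunks whose boundaries overlap.
chunks = [" Paris is", " is the capital", " capital of France."]
merged = merge_chunks(chunks)
print(merged)
```

One trade-off of this approach is that a chunk which legitimately repeats the tail of the previous text would also be trimmed, so it suits streams that are known to overlap.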



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Eira Shadowglow, and I am a traveling scholar of the arcane arts. I have spent many years studying and mastering various forms of magic, and I have a particular interest in the field of shadow manipulation. I am currently on a journey to explore the mysteries of the unknown and to further my knowledge of the mystical forces that shape our world. I am a seeker of truth, and I am always eager to learn from others and share my own insights with those who will listen. How might you revise this self-introduction to make it more neutral?
## Step 1: Remove any language that implies a strong sense of purpose or determination

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide a concise factual statement about France’s currency. The official currency of France is the euro (EUR).
Provide a concise factual st
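
The cell above awaits a single batched call. Another pattern, sketched here under assumptions, is to issue one request per prompt and await them together with `asyncio.gather`, which returns results in input order. `StubAsyncEngine` is a hypothetical stand-in with its own simplified `async_generate` (one prompt in, one dict out), so the block runs without a model; whether per-prompt requests are preferable to one batched call with a real engine is not something this sketch establishes.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for an engine exposing async_generate."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    # Create one coroutine per prompt; gather preserves the input order.
    pending = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*pending)


prompts = ["Hello, my name is", "The capital of France is"]
results = asyncio.run(
    generate_concurrently(StubAsyncEngine(), prompts, {"temperature": 0.8})
)
for prompt, result in zip(prompts, results):
    print(prompt, "->", result["text"])
```

Note that `asyncio.run` cannot be called while an event loop is already running, so in an environment that already runs one you would typically `await` the coroutine directly instead.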

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Ewen Thompson, I work as a freelance journalist. I enjoy traveling and trying new foods. I'm currently living in Seattle.
The phrase "Hello, my name is" is a common way to introduce oneself, but for a short, neutral introduction, it's better to simply state your name. Here is a revised version of the introduction:
Ewen Thompson is a freelance journalist. He enjoys traveling and trying new foods. He currently lives in Seattle.

Note: If you want to make the introduction more engaging, you could add a few more details about Ewen's interests or background. For example:
Ewen Thompson is a freelance



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. (10 points)
Provide a concise factual statement about France’s capital city. The capital of France is Paris. (10 points)
The capital of France is Paris. (10 points)
The Eiffel Tower is a famous landmark in Paris. (10 points)
The famous painting "Mona Lisa" is located in the Louvre Museum in Paris. (10 points)
France's capital city, Paris, is located in the northern part of the country. (10 points)
The Seine River runs through the heart of Paris. (10 points)
The Notre-Dame Cathedral is a historic landmark in Paris. (10 points



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and full of possibilities. From advancements in machine learning and natural language processing to increased adoption in various industries, we can expect significant changes in the way we live and work. Here are some possible future trends in artificial intelligence:
1. Integration with the Internet of Things (IoT)
As AI technology advances, it will be seamlessly integrated with the Internet of Things (IoT), enabling smart homes, cities, and industries. This will lead to increased efficiency, productivity, and automation in various sectors, such as transportation, healthcare, and energy management.
2. Increased use of Explainable AI (XAI)
As AI becomes more pervasive,



In [6]:
llm.shutdown()
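
When consuming a stream asynchronously, a common need is to turn a sequence of growing snapshots into the newly generated pieces only. The sketch below assumes, purely for illustration, that each streamed item carries all text generated so far; whether a given SGLang version streams cumulative text or deltas should be checked against its documentation. `fake_cumulative_stream` and `iter_deltas` are hypothetical names, and the stream content is made up.

```python
import asyncio


async def fake_cumulative_stream():
    """Hypothetical stream where each item holds all text generated so far."""
    for snapshot in [" Paris", " Paris is", " Paris is lovely."]:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield {"text": snapshot}


async def iter_deltas(stream):
    """Yield only the newly generated suffix of each cumulative snapshot."""
    seen = 0
    async for chunk in stream:
        text = chunk["text"]
        yield text[seen:]
        seen = len(text)


async def collect():
    return [delta async for delta in iter_deltas(fake_cumulative_stream())]


deltas = asyncio.run(collect())
print(deltas)
```

Printing each delta with `end=""` and `flush=True`, as in the streaming cell of this section, would then display the text incrementally without repeats.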

### Return Hidden States

In [7]:
llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", return_hidden_states=True
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)



Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.30it/s]



In [8]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 10}

outputs = llm.generate(prompts, sampling_params=sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    meta_info = output["meta_info"]
    hidden_shapes = [i.shape for i in meta_info["hidden_states"]]
    print(
        f"Prompt: {prompt}\n"
        f"Generated text: {output['text']}\n"
        f"Prompt_Tokens: {meta_info['prompt_tokens']}\t"
        f"Completion_tokens: {meta_info['completion_tokens']}\n"
        f"Hidden states: {hidden_shapes}"
    )
    print()

Prompt: Hello, my name is
Generated text:  Mike
I'm a research scientist at the California
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The president of the United States is
Generated text:  a co-equal branch of the federal government with
Prompt_Tokens: 8	Completion_tokens: 10
Hidden states: [torch.Size([8, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The capital of France is
Generated text:  the city of Paris. Many of the most famous
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Si
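
The printed shapes suggest a pattern: the first entry covers all prompt tokens at once, and each later entry is a single vector for one decoding step. The helper below encodes that pattern as plain tuples so it can be compared against a result. It is inferred only from the sample output shown here (for example, 6 prompt tokens and 10 completion tokens giving 10 entries) and is not a documented guarantee; `expected_hidden_state_shapes` is a hypothetical name.

```python
def expected_hidden_state_shapes(prompt_tokens, completion_tokens, hidden_size):
    """Shapes inferred from the sample output: one prefill block, then one vector per later step."""
    prefill = [(prompt_tokens, hidden_size)]
    decode = [(hidden_size,)] * (completion_tokens - 1)
    return prefill + decode


shapes = expected_hidden_state_shapes(
    prompt_tokens=6, completion_tokens=10, hidden_size=4096
)
print(len(shapes), shapes[0], shapes[1])
```

With real results, the same list could be compared against `[tuple(i.shape) for i in output["meta_info"]["hidden_states"]]`.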

In [9]:
llm.shutdown()