# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
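To make the custom-server idea concrete, here is a minimal sketch built only on the Python standard library. It is not the linked `custom_server.py`, which may use a different web framework. `EchoEngine`, `handle_generate`, `make_handler`, `serve`, and the `/generate` route are hypothetical names introduced for illustration; the only assumption about the real engine is that `generate` accepts a list of prompts and returns a list of dicts with a `"text"` field, as the examples in this document show.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params=None):
        # Mimic the list-of-dicts shape used by the examples in this document.
        return [{"text": f" [echo] {p}"} for p in prompts]


def handle_generate(engine, payload):
    """Map a decoded JSON request body to a JSON-serializable response."""
    prompt = payload["prompt"]
    sampling_params = payload.get("sampling_params", {})
    output = engine.generate([prompt], sampling_params)[0]
    return {"prompt": prompt, "text": output["text"]}


def make_handler(engine):
    """Build a request handler class bound to one engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/generate":
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length))
            body = json.dumps(handle_generate(engine, payload)).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return GenerateHandler


def serve(engine, host="127.0.0.1", port=8000):
    """Block forever, answering POST /generate requests with the given engine."""
    HTTPServer((host, port), make_handler(engine)).serve_forever()
```

Calling `serve(EchoEngine())` starts the stub server. Passing an `sgl.Engine` instance in place of `EchoEngine()` would be the natural next step, under the assumption stated above about the shape of its return value.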

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")








### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Sarah and I am a Licensed Clinical Social Worker (LCSW). I am passionate about helping people navigate life's challenges and live a more fulfilling life. I believe that everyone deserves to live a life that is meaningful and enjoyable. I provide a safe, non-judgmental, and compassionate space for individuals to explore their thoughts, feelings, and experiences. I use evidence-based practices such as Cognitive Behavioral Therapy (CBT), Dialectical Behavior Therapy (DBT), and Trauma-Focused Cognitive Behavioral Therapy (TF-CBT) to help individuals achieve their goals and improve their overall well-being.
My areas of specialty include:
Tra
Prompt: The president of the United States is
Generated text:  not a king. The president is a representative of the people, elected to lead and serve the country. The president is not above the law, and is accountable to the Constitution and the people.
The president has many powers, but some of the most import

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy reading, hiking, and trying new foods. I'm currently working on a novel and experimenting with different writing styles. I'm a bit of a introvert, but I love meeting new people and hearing their stories. I'm always looking for new experiences and learning opportunities. That's me in a nutshell. How can I improve this self-introduction?
The self-introduction is clear and concise, but it could be more engaging and memorable. Here are some suggestions to improve it:
1.  **Add a unique detail**: Consider adding a personal anecd

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is located in the northern part of the country and is situated on the Seine River. Paris is known for its rich history, art, fashion, and cuisine. It is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre Dame Cathedral. Paris is a popular tourist destination and is considered one of the most romantic cities in the world. The city has a population of over 2.1 million people and is a major hub for business, culture, and entertainment. Paris is also known for its vibrant neighborhoods, such as Montmartre and Le Marais, which offer a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even larger role in healthcare, with the potential to revolutionize the way we diagnose and treat diseases.
2. Widespread adoption of AI in industries: AI is already being used in various industries such as finance, transportation, and customer service. In the future, AI is likely to become even more widespread, with the potential to automate many tasks and
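The "overlap removal" mentioned in the cell above can be pictured with a small, self-contained sketch. This is an assumption about what such merging might look like, not the actual implementation of `stream_and_merge` in `sglang.utils`. The names `trim_overlap` and `merge_stream` are illustrative, and the chunks are plain dicts with a `"text"` field.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks):
    """Concatenate streamed chunks, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk["text"])
    return merged


chunks = [{"text": " Paris"}, {"text": " Paris is"}, {"text": " is the capital"}]
print(repr(merge_stream(chunks)))  # ' Paris is the capital'
```

One caveat of this approach: a purely textual overlap check cannot tell a repeated chunk from a legitimate repetition, so `trim_overlap("ha", "ha")` returns an empty string. It is only appropriate when chunks are known to resend trailing text.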



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Jamie Elliot. I'm a 25-year-old freelance writer and a bit of a bookworm. I enjoy long walks and good conversations. That's me in a nutshell. I hope you'll get to know me better.
The self-introduction is short and to the point, with no strong personality traits or strong opinions that could give away biases. It’s a neutral introduction that will not lead the reader to expect too much or too little from the character. The introduction also leaves room for the character to develop and for the reader to learn more about him. Jamie Elliot seems like a friendly and approachable person. His love for reading suggests

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The city of Paris is a large city located in the northern part of France, along the Seine River. Paris is known for its historical and cultural sig
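The asynchronous API is mainly useful when generation calls are awaited alongside other coroutines, for example inside a custom server. The sketch below shows the pattern with `asyncio.gather`. `StubEngine` and `generate_concurrently` are hypothetical, and the stub only assumes that `async_generate` is awaitable and yields a dict with a `"text"` field for a single prompt. Whether the real engine benefits from issuing requests this way is not something this sketch demonstrates.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; it only mimics the awaitable call."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # yield control, as a real request would
        return {"text": f" <completion for {prompt!r}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    """Start one request per prompt and wait for all of them together."""
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather returns results in the same order as the awaitables passed in.
    outputs = await asyncio.gather(*tasks)
    return [out["text"] for out in outputs]


texts = asyncio.run(
    generate_concurrently(StubEngine(), ["a", "b"], {"temperature": 0.8})
)
for text in texts:
    print(text)
```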

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elianore Quasar, and I'm a time-traveling astronomer from the year 2287. I have a passion for studying celestial events and uncovering the secrets of the universe. I'm currently on a mission to observe a rare astronomical phenomenon, which has brought me to this particular time period. I'm excited to learn and adapt to this new environment, and I'm looking forward to meeting new people and experiencing new cultures.
The introduction should be brief, no more than 50-60 words, and provide a neutral tone, not revealing too much about the character's personality or motivations. Here's an example of a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide a brief description of the Paris Opera House. The Paris Opera House, also known as Palais Garnier, is a magnificent opera house located in the heart of Paris, France. It is a masterpiece of 19th-century architecture and is one of the most famous opera houses in the world. The opera house features a grand auditorium with a capacity of over 1,900 seats, and its exterior is adorned with intricate carvings, gilded details, and stunning stained-glass windows.
Describe the Eiffel Tower’s significance and its impact on the city of Paris. The Eiffel Tower is an iconic

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  often depicted in science fiction as a utopian or dystopian world. Discuss the potential consequences of these two futures, and explain why you think the actual future of AI will fall somewhere in between.
The future of artificial intelligence (AI) is a topic of much debate and speculation. While some envision a utopian world where AI enhances human life and capabilities, others predict a dystopian future where AI surpasses human intelligence and poses an existential threat. In this response, I will discuss the potential consequences of these two futures and explain why I think the actual future of AI will fall somewhere in between.

**Utopian Future:**

In a

In [6]:
llm.shutdown()

### Return Hidden States

In [7]:
llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", return_hidden_states=True
)




In [8]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 10}

outputs = llm.generate(prompts, sampling_params=sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    meta = output["meta_info"]
    # Collect only the shapes so the printed output stays readable.
    hidden_shapes = [h.shape for h in meta["hidden_states"]]
    print(
        f"Prompt: {prompt}\n"
        f"Generated text: {output['text']}\n"
        f"Prompt_Tokens: {meta['prompt_tokens']}\t"
        f"Completion_tokens: {meta['completion_tokens']}\n"
        f"Hidden states: {hidden_shapes}"
    )
    print()

Prompt: Hello, my name is
Generated text:  Yulian! I am a freelance writer and
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The president of the United States is
Generated text:  the head of state and head of government of the
Prompt_Tokens: 8	Completion_tokens: 10
Hidden states: [torch.Size([8, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The capital of France is
Generated text:  not Paris, it is actually the city of Re
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), tor
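The shapes printed above follow a pattern: the first entry has one row per prompt token, and each later entry is a single vector of size 4096. The sketch below counts how many token positions such a list covers, using plain tuples copied from the first prompt's output so that it runs without `torch`. The helper name `count_positions` is illustrative. The likely explanation, that the last generated token has no hidden state because it is never fed back into the model, is an assumption and is not stated in the output itself.

```python
def count_positions(shapes):
    """Count token positions covered by a list of hidden-state shapes."""
    total = 0
    for shape in shapes:
        # A 2-D entry (num_tokens, hidden_size) covers several positions;
        # a 1-D entry (hidden_size,) covers exactly one.
        total += shape[0] if len(shape) == 2 else 1
    return total


# Shapes copied from the first prompt's output: 6 prompt tokens, 10 completion tokens.
shapes = [(6, 4096)] + [(4096,)] * 9
print(count_positions(shapes))  # 15
```

For that prompt the list covers 15 positions, one fewer than the 16 prompt and completion tokens reported. Because `torch.Size` behaves like a tuple, the same helper should also accept `[h.shape for h in output["meta_info"]["hidden_states"]]` directly.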

In [9]:
llm.shutdown()