# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

# Only relevant when this notebook runs in the CI environment
if is_in_ci():
    import patch


# Load the model once; the same engine is reused by all examples below
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")








### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Rachel and I'm a Librarian at the West Vancouver Memorial Library. I have been working here for over 6 years and I have to say, I love my job! I get to help people find the information they need, help kids discover the magic of reading, and share my passion for books with the community. When I'm not working, you can find me hiking in the mountains, trying out new recipes in the kitchen, or practicing yoga. I'm a bit of a bookworm, and I love discovering new authors and genres, especially science fiction and fantasy. I'm always excited to chat about books and share my recommendations with
Prompt: The president of the United States is
Generated text:  the head of the federal government of the United States and is the commander-in-chief of the armed forces. The president serves a four-year term and is elected through the Electoral College. The president has significant powers and responsibilities, including the ability to propose legislation, neg
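
For offline batch jobs it is often convenient to persist the results. The sketch below pairs each prompt with its output and serializes the pairs as JSON Lines. It assumes only what the example above shows, namely that `generate` returns one dict per prompt, in order, with a `'text'` key. The helper name `to_jsonl` and the `fake_outputs` data are hypothetical and stand in for real engine results.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair prompts with outputs and serialize them as JSON Lines.

    `outputs` is assumed to be a list of dicts with a 'text' key,
    in the same order as `prompts` (as in the example above).
    """
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        # ensure_ascii=False keeps non-ASCII characters readable in the file
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Hypothetical stand-in for the result of llm.generate(prompts, sampling_params)
fake_outputs = [{"text": " Paris."}, {"text": " bright."}]
jsonl = to_jsonl(["The capital of France is", "The future of AI is"], fake_outputs)
print(jsonl)
```

Each line of the result is an independent JSON object, so the string can be appended to a file and read back one record at a time.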

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy writing about various topics, from technology to social issues. I'm currently working on a few projects, including a novel and a series of articles about sustainable living. When I'm not writing, I like to explore the city, try new foods, and practice yoga. I'm a bit of a introvert, but I'm always up for a good conversation.
This self-introduction is neutral because it doesn't reveal any personal opinions or biases. It simply states the character's name, age, occupation, and interests. It also doesn't mention any

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city. The capital of France is Paris. This statement is a concise factual statement about France’s capital city. It provides a clear and direct answer to the question, without any additional information or elaboration. The statement is also accurate, as Paris is indeed the capital of France. This type of statement is useful for providing a quick and easy-to-understand answer to a question, and can be used in a variety of contexts, such as in a trivia game or in a educational setting. The statement is also neutral and objective, without any emotional or biased language. Overall, this statement

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems may be able to analyze medical images, identify patterns in patient data, and provide personalized treatment recommendations.
2. Widespread adoption of AI in customer service: AI-powered chatbots and virtual assistants are becoming increasingly common in customer service, and this trend is expected to continue. AI-powered systems may be able to handle complex customer
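
The "overlap removal" mentioned above can be illustrated without an engine. The sketch below shows one way to merge streamed chunks when a new chunk may repeat text that has already been received: it drops the longest prefix of the new chunk that is already a suffix of the merged text. The function names `strip_overlap` and `merge_chunks` are hypothetical, and this is an illustration of the idea rather than a statement of how `stream_and_merge` is implemented.

```python
def strip_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first and stop at the first match
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Toy chunks that repeat part of the previous chunk
chunks = ["The capital", " capital of", " of France"]
print(merge_chunks(chunks))  # The capital of France
```

One limitation of this approach is that it cannot tell a repeated boundary from text that legitimately repeats: merging `"ha"` with a second `"ha"` yields `"ha"`, not `"haha"`.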



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Zara Blackwood. I'm a 25-year-old writer and photographer who has lived in New York City my whole life. I work freelance, capturing the beauty of the city through my lens, and writing short stories in my spare time. That's me in a nutshell.
This is a neutral self-introduction because it simply states the facts about the character without adding any emotional or personal details. It gives a clear and concise overview of who the character is and what they do, but doesn't reveal much about their personality, motivations, or values. This type of introduction can be useful for a character who is still being developed or for

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
This response is a direct and factual answer to the question about the capital of France. It does not contain any additional information o
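
The example above awaits a single `async_generate` call for the whole batch. Another common asyncio pattern is to issue one coroutine per prompt and await them together with `asyncio.gather`, which is handy when requests need to be interleaved with other asynchronous work. The sketch below demonstrates the pattern with `StubEngine`, a hypothetical stand-in that echoes the prompt so the code runs without a model; it assumes an engine whose `async_generate` accepts a single prompt and returns a dict with a `'text'` key.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for an engine; echoes the prompt instead of running a model."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    """Start one request per prompt and wait for all of them."""
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather returns results in the order the awaitables were passed in
    return await asyncio.gather(*tasks)


results = asyncio.run(
    generate_concurrently(StubEngine(), ["A", "B", "C"], {"temperature": 0.8})
)
for result in results:
    print(result["text"])
```

Because `asyncio.gather` preserves input order, the results can still be zipped with the prompts, as in the batch example.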

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Lila Mae Foss and I’m a 22-year-old communications major at the University of Wisconsin–Madison. I grew up in a small town in rural Wisconsin, where I developed a love for storytelling through writing and photography. I’ve been interning at a local newspaper and am excited to explore more opportunities in the field of journalism. When I’m not studying or working, you can find me hiking with my friends or practicing yoga. I enjoy listening to indie folk music and trying out new coffee shops in the city.
This introduction has the following characteristics of a neutral self-introduction:
It includes your full name (or a nickname



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The capital is also known as the City of Light.
Paris, the capital of France, is often referred to as the City of Light. This nickname was originally given to the city during the 17th and 18th centuries due to its abundance of streetlights, which made it a beacon of enlightenment and intellectualism during the Age of Reason. Today, the name still evokes the city's rich cultural and artistic heritage, as well as its status as a global center for fashion, cuisine, and romance.
Key information:
Capital city: Paris

Alternative name: The City of Light

Origin of the nickname: Abundance



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a subject of much speculation and debate. While it’s difficult to predict exactly what will happen, here are some possible future trends in AI:
1. **Increased Adoption in Everyday Life**: AI will become even more ubiquitous and integrated into daily life, from virtual assistants like Siri and Alexa to more advanced chatbots and autonomous vehicles.
2. **Advancements in Machine Learning**: Machine learning, a type of AI that enables systems to learn from data without being explicitly programmed, will continue to improve, leading to more sophisticated AI applications.
3. **Rise of Edge AI**: As devices become increasingly connected, AI will be deployed closer to the source




In [6]:
llm.shutdown()
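
The asynchronous streaming loop above prints each chunk as it arrives. When the complete text is also needed afterwards, the chunks can be collected while they are displayed. The sketch below shows this with `fake_stream`, a hypothetical async generator that stands in for a real chunk stream so the code runs without a model; `collect_stream` is likewise an illustrative helper, not part of SGLang.

```python
import asyncio


async def fake_stream(text, chunk_size=4):
    """Hypothetical async generator yielding already-cleaned text chunks."""
    for start in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # yield control to the event loop between chunks
        yield text[start : start + chunk_size]


async def collect_stream(stream, on_chunk=None):
    """Consume an async iterator of text chunks and return the full text."""
    parts = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)  # e.g. print the chunk as soon as it arrives
        parts.append(chunk)
    return "".join(parts)


seen = []
full_text = asyncio.run(
    collect_stream(fake_stream("Paris is the capital."), seen.append)
)
print(full_text)  # Paris is the capital.
```

Joining the parts once at the end avoids repeated string concatenation for long generations.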

### Return Hidden States

In [7]:
llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", return_hidden_states=True
)




In [8]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 10}

outputs = llm.generate(prompts, sampling_params=sampling_params)
for prompt, output in zip(prompts, outputs):
    meta = output["meta_info"]
    print("===============================")
    print(f"Prompt: {prompt}")
    print(f"Generated text: {output['text']}")
    print(
        f"Prompt_Tokens: {meta['prompt_tokens']}\t"
        f"Completion_tokens: {meta['completion_tokens']}"
    )
    # One entry per forward pass: the prompt first, then each decode step
    print(f"Hidden states: {[i.shape for i in meta['hidden_states']]}")
    print()

Prompt: Hello, my name is
Generated text:  Jason. I’m a freelance writer and editor.
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The president of the United States is
Generated text:  not only the head of the executive branch, but
Prompt_Tokens: 8	Completion_tokens: 10
Hidden states: [torch.Size([8, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The capital of France is
Generated text:  Paris, the city of love and romance. However
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([40
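
In the output above, each sample's hidden states form a list whose first entry is two-dimensional (one vector per prompt token) and whose remaining entries are single vectors, one per decode step. For downstream analysis it can help to flatten this into one vector per token. The sketch below does so on plain nested lists; `stack_hidden_states` is a hypothetical helper, and converting tensors to lists first (for example with `Tensor.tolist()`) is an assumption about how the data would be prepared.

```python
def stack_hidden_states(hidden_states):
    """Flatten per-step hidden states into one list of per-token vectors.

    Assumes the layout shown in the output above: the first entry is 2-D
    (one vector per prompt token) and each later entry is a single vector.
    Inputs are plain nested lists, not tensors.
    """
    rows = []
    for entry in hidden_states:
        if entry and isinstance(entry[0], (list, tuple)):
            rows.extend(list(vec) for vec in entry)  # 2-D: several token vectors
        else:
            rows.append(list(entry))  # 1-D: a single token vector
    return rows


# Toy data: 3 prompt tokens, then 2 decode steps, hidden size 2
toy = [[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], [0.7, 0.8], [0.9, 1.0]]
stacked = stack_hidden_states(toy)
print(len(stacked), len(stacked[0]))  # 5 2
```

Applied to the first sample above, the same logic would give 6 + 9 = 15 vectors of size 4096.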

In [9]:
llm.shutdown()