# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when a server would only add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
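To make the custom-server use case concrete, the core of such a server is usually a thin adapter that turns a request payload into an engine call. The sketch below is not the linked `custom_server.py`. The function `handle_generate` and the JSON field names (`prompts`, `sampling_params`, `texts`) are hypothetical, and `EchoEngine` is a stand-in so the snippet runs without a model. It assumes only that the engine exposes `generate(prompts, sampling_params)` and returns one dict with a `text` key per prompt, as in the batch examples in this document.

```python
import json


class EchoEngine:
    """Stand-in for sgl.Engine so the sketch runs without a model (hypothetical)."""

    def generate(self, prompts, sampling_params):
        # Mimic the return shape used in this document: one dict per prompt.
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


def handle_generate(engine, body: bytes) -> bytes:
    """Decode a JSON request, call the engine, and encode a JSON response.

    The field names "prompts", "sampling_params", and "texts" are assumptions
    of this sketch, not part of any SGLang protocol.
    """
    request = json.loads(body)
    prompts = request["prompts"]
    sampling_params = request.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    return json.dumps({"texts": [o["text"] for o in outputs]}).encode("utf-8")


if __name__ == "__main__":
    payload = json.dumps({"prompts": ["The capital of France is"]}).encode("utf-8")
    print(handle_generate(EchoEngine(), payload).decode("utf-8"))
```

Keeping the adapter independent of any web framework means the same function can be called from whichever server library you prefer, with the stand-in swapped for a real engine instance.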

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Helen and I am the new Editor of the BPCA magazine. I am thrilled to be a part of this team and to have the opportunity to share the stories of our members and their work with you.
I come from a background in social work and community development, and I am passionate about the positive impact that pest control professionals can have on people’s lives. Whether it’s protecting public health, preventing pest-related stress, or supporting vulnerable communities, I believe that our industry is truly making a difference.
As Editor, my goal is to showcase the best of the BPCA, highlighting the expertise, innovation, and commitment of our members. I
Prompt: The president of the United States is
Generated text:  not just a figurehead, but a leader who plays a critical role in shaping the country's policies and laws. The president is the head of state and the head of government, and is responsible for appointing federal judges, signing or vetoing legisl

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student living in a small town in Japan. I'm a bit of a bookworm and enjoy reading about history and science. I'm also a member of the school's debate team and enjoy arguing about current events. When I'm not studying or participating in extracurricular activities, I like to spend time with my friends and family, trying out new foods and exploring the local area. I'm a bit of a perfectionist, but I'm working on being more relaxed and enjoying the little things in life. I'm looking forward to seeing what the future holds.
Kaida

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country, on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks, including the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. The city is also a major center for business, finance, and international relations. Paris is a popular tourist destination, attracting millions of visitors each year. The city has a population of over 2.1 million people, but the metropolitan area has a population of over 12 million people. Paris is a global

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. While it is difficult to predict exactly what the future will hold, there are several trends that are likely to shape the development and impact of artificial intelligence in the coming years. Here are some possible future trends in AI:
1. Increased Adoption of AI in Various Industries:
Artificial intelligence is expected to become increasingly adopted in various industries, including healthcare, finance, transportation, and education. This will lead to improved efficiency, productivity, and decision-making in these sectors.
2. Advancements in Machine Learning:
Machine learning is a subset of AI that enables systems to learn from data and improve their performance over time
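The `stream_and_merge` helper hides how streamed chunks are stitched together. One way to picture the "overlap removal" mentioned in the cell above is the sketch below. `strip_overlap` and `merge_chunks` are hypothetical names written for illustration, and they may differ from what `sglang.utils` actually does. The idea is that if a new chunk begins with text that already ends the accumulated output, only the unseen suffix is appended.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Return the part of `chunk` that does not repeat the tail of `existing`."""
    max_len = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, dropping any prefix that repeats the merged tail."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


if __name__ == "__main__":
    # Simulated stream in which each chunk repeats part of the previous text.
    stream = [" Paris", " Paris is the", " the capital"]
    print(repr(merge_chunks(stream)))
```

A heuristic like this can also swallow genuine repetition, for example when a chunk legitimately starts with the word that ended the previous one, so it is a trade-off rather than a guarantee.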



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Echo. I work as a librarian at the local public library, and I'm really passionate about helping people find the books they're looking for. I love learning about new authors and genres, and I enjoy sharing that knowledge with others. When I'm not working, you can find me curled up with a good book or spending time with my cat, Luna. I'm a bit of a quiet person, but I'm always happy to chat with someone who shares my love of reading. This self-introduction is friendly and inviting, while also highlighting Echo's profession and interests. It's a great way for her to introduce herself to others and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is a major city in Western Europe. It is situated in the northern part of the country. Paris is known for its cultural and historical landmarks, such as the
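Because `async_generate` is awaited like any other coroutine, it can be combined with the usual `asyncio` tools. The sketch below issues one request per prompt and caps how many are in flight at once. `FakeEngine`, `generate_all`, and `max_concurrency` are hypothetical names. The stand-in engine only sleeps briefly so the example runs without a model, and it assumes that a single-prompt call returns one dict with a `text` key and that the real engine tolerates concurrent calls.

```python
import asyncio


class FakeEngine:
    """Stand-in for sgl.Engine; only mimics an awaitable generate call."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do some work
        return {"text": f" <completion of {prompt!r}>"}


async def generate_all(engine, prompts, sampling_params, max_concurrency=2):
    """Run one request per prompt, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(prompt):
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather preserves input order, so results line up with prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


if __name__ == "__main__":
    prompts = ["Hello, my name is", "The capital of France is"]
    results = asyncio.run(generate_all(FakeEngine(), prompts, {"temperature": 0.8}))
    for prompt, result in zip(prompts, results):
        print(f"Prompt: {prompt}\nGenerated text: {result['text']}")
```

Passing the whole list to a single call, as in the cell above, already lets the engine batch the prompts. A per-prompt pattern like this one is mainly useful when requests arrive at different times.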

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Alistair Thompson. I'm a 25-year-old photographer from Portland, Oregon. I spend most of my free time working on new projects and exploring the city with my camera.
Alistair Thompson is a 25-year-old photographer from Portland, Oregon. He spends most of his free time working on new projects and exploring the city with his camera.
The name is Alistair Thompson, a 25-year-old photographer from Portland, Oregon. In his free time, he works on various projects and explores the city with his camera. Alistair Thompson, a 25-year-old photographer, hails from Portland, Oregon.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is located in the north-central part of France and is situated along the Seine River. The city is home to more than 2.1 million people. Paris is known for its stunning architecture, art museums, and fashion industry. The city is a major economic and cultural center and is often referred to as the “City of Light.” Paris has a rich history dating back to the Roman era, and its many landmarks, such as the Eiffel Tower and the Louvre Museum, attract millions of tourists each year. With its unique blend of history, culture, and beauty, Paris is a must-visit destination for

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain, but there are several possible trends that could shape its development and application.
Future Trends in Artificial Intelligence

Artificial intelligence (AI) has made significant progress in recent years, and its potential applications are vast and varied. As AI continues to evolve, it's likely to impact various aspects of our lives, from work and education to healthcare and entertainment. Here are some possible future trends in AI:
1. Increased focus on Explainability and Transparency

As AI becomes more pervasive, there's a growing need to understand how it makes decisions and recommendations. Explainability and transparency will become essential in AI development, ensuring that users can trust AI systems

In [6]:
llm.shutdown()

### Return Hidden States

In [7]:
llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", return_hidden_states=True
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)





In [8]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 10}

outputs = llm.generate(prompts, sampling_params=sampling_params)
for prompt, output in zip(prompts, outputs):
    meta = output["meta_info"]
    # Collect only the shapes; the tensors themselves are large.
    hidden_state_shapes = [h.shape for h in meta["hidden_states"]]
    print("===============================")
    print(
        f"Prompt: {prompt}\n"
        f"Generated text: {output['text']}\n"
        f"Prompt_Tokens: {meta['prompt_tokens']}\t"
        f"Completion_tokens: {meta['completion_tokens']}\n"
        f"Hidden states: {hidden_state_shapes}"
    )
    print()

Prompt: Hello, my name is
Generated text:  Barry Finkel, and I’m a creator and
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The president of the United States is
Generated text:  to be held accountable by the people, as per
Prompt_Tokens: 8	Completion_tokens: 10
Hidden states: [torch.Size([8, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The capital of France is
Generated text:  a bustling city filled with rich history, stunning architecture
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), tor
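Judging from the shapes printed above, each request yields one 2-D entry covering the prompt tokens, followed by one 1-D entry per later decoding step. The sketch below does that bookkeeping in plain Python so it runs without a GPU. `describe_hidden_states` is a hypothetical helper, and the layout is inferred from this output only, so it may differ in other SGLang versions.

```python
def describe_hidden_states(shapes):
    """Summarise a list of hidden-state shapes such as [(6, 4096), (4096,), ...].

    Assumes the layout seen in the output above: the first entry is
    (prompt_tokens, hidden_size) and every later entry is (hidden_size,).
    """
    first = tuple(shapes[0])
    rest = [tuple(s) for s in shapes[1:]]
    prompt_tokens, hidden_size = first
    if any(s != (hidden_size,) for s in rest):
        raise ValueError("unexpected hidden-state layout")
    return {
        "prompt_tokens": prompt_tokens,
        "decode_steps": len(rest),
        "hidden_size": hidden_size,
        # Number of rows if all entries were stacked into one 2-D array.
        "total_vectors": prompt_tokens + len(rest),
    }


if __name__ == "__main__":
    # Shapes copied from the first prompt's output above.
    shapes = [(6, 4096)] + [(4096,)] * 9
    print(describe_hidden_states(shapes))
```

With real engine output, the input could be built as `[tuple(h.shape) for h in output["meta_info"]["hidden_states"]]`, mirroring the list comprehension in the cell above.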

In [9]:
llm.shutdown()