# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
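
As a rough illustration of the "custom server on top of the engine" idea, the sketch below wraps an engine-like object in a plain request handler. `StubEngine` and `handle_request` are hypothetical names invented for this example, and the stub only mimics the `generate(prompts, sampling_params)` call and the `output['text']` field used later in this document. The linked `custom_server.py` remains the reference example.

```python
from typing import Any, Dict, List


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        # Mimic the output shape used in this document: one dict with a "text" key per prompt.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def handle_request(engine: Any, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Validate a JSON-like payload, call the engine, and build a response dict."""
    prompts = payload.get("prompts")
    if not isinstance(prompts, list) or not all(isinstance(p, str) for p in prompts):
        return {"status": 400, "error": "'prompts' must be a list of strings"}

    # Fall back to the sampling parameters used in the examples of this document.
    sampling_params = payload.get("sampling_params", {"temperature": 0.8, "top_p": 0.95})
    outputs = engine.generate(prompts, sampling_params)
    return {"status": 200, "completions": [o["text"] for o in outputs]}


response = handle_request(StubEngine(), {"prompts": ["The capital of France is"]})
print(response)
```

In a real server, a handler like this would presumably be called from whatever web framework receives the request, with the stub replaced by an actual engine instance.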

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
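# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

# Only relevant when this notebook runs in the SGLang CI environment
if is_in_ci():
    import patch


# Load the model and start the engine in-process (no HTTP server)
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")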

### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Emily and I'm a yoga instructor and photographer. I'm excited to be a part of the Ocala Yoga Alliance and contribute to the yoga community in our beautiful city. I teach a variety of yoga classes, including Hatha, Vinyasa, Restorative, and Yin, and I also offer workshops and private lessons. My approach to yoga is holistic, focusing on the connection between body, mind, and spirit. I believe that yoga is a journey, not a destination, and I aim to create a safe and supportive environment for all students to explore and deepen their practice.
I'm passionate about photography and love capturing the beauty and ser
Prompt: The president of the United States is
Generated text:  not just a head of state, but also the commander-in-chief of the armed forces. Therefore, the president has the authority to deploy troops in various situations, including humanitarian crises, natural disasters, and conflicts.
Why is the President of the United States able to

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy reading, hiking, and trying new foods. I'm currently working on a novel and experimenting with different writing styles. I'm looking forward to meeting new people and learning from their experiences. I'm a bit of a introvert, but I'm always up for a good conversation. I'm interested in hearing about your interests and hobbies. What brings you here today? This self-introduction is neutral because it doesn't reveal too much about Kaida's personality, background, or motivations. It provides a brief overview of her interests and goals, and invites

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and cuisine. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre Dame Cathedral. The city has a population of over 2.1 million people and is a major hub for international business, culture, and tourism. Paris is also known for its romantic atmosphere and is often referred to as the City of Light. The city has a long history dating back to the 3rd century

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Some experts predict that AI will become increasingly integrated into our daily lives, while others warn of the potential risks and challenges associated with its development. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with the potential to revolutionize the way we diagnose and treat diseases.
2. Widespread adoption of AI in the workplace: AI is already being used in many industries to
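
The "overlap removal" mentioned in the cell above can be pictured as follows: if a streamed chunk repeats text that has already been received, only the new part is appended. The sketch below is a minimal illustration using a fake chunk list. `strip_overlap` and `merge_chunks` are hypothetical helpers written for this example and are not claimed to match the implementation inside `sglang.utils`.

```python
from typing import Iterable


def strip_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that is already a suffix of `existing`."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks while skipping repeated text."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Fake stream: the second and third chunks repeat the tail of what came before.
fake_chunks = [" Paris is", " is the capital", " capital of France."]
merged_text = merge_chunks(fake_chunks)
print(merged_text)
```

One limitation of this heuristic is that it cannot tell an accidental repeat from an intended one: a chunk that legitimately repeats the end of the existing text would also be trimmed.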



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaida. I'm a 19-year-old graphic design major at a local university. I work part-time at a small print shop, where I spend most of my free time. When not studying or working, you can find me attempting to cook new recipes or playing video games.
Who is Kaida?
Kaida is a 19-year-old student.
Kaida is a graphic design major.
Kaida works part-time at a small print shop.
Kaida is interested in cooking and video games.
Answer key:
Kaida is a 19-year-old student.
Kaida is a graphic design major.
Kaida works part-time at

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
To write a concise factual statement about France's capital city, you need to provide a straightforward and accurate piece of information. Here is an example of how you can do that:

The capital of France is Paris.

This statement is brief, fact
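
The asynchronous call pattern is mainly useful when generation should run alongside other coroutines. The sketch below shows that pattern with a hypothetical `StubAsyncEngine`, whose `async_generate` mimics only the call shape used above, so it runs without a model. The other names (`other_work`, `run_batch`) are likewise invented for illustration.

```python
import asyncio
from typing import Any, Dict, List, Tuple


class StubAsyncEngine:
    """Hypothetical stand-in that mimics only the call shape of llm.async_generate."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        await asyncio.sleep(0.01)  # Pretend the model is busy.
        return [{"text": f" <completion {i}>"} for i, _ in enumerate(prompts)]


async def other_work() -> str:
    """Any unrelated coroutine that should not be blocked by generation."""
    await asyncio.sleep(0.01)
    return "done"


async def run_batch() -> Tuple[List[str], str]:
    engine = StubAsyncEngine()
    prompts = ["Hello, my name is", "The capital of France is"]
    # gather() runs both coroutines concurrently and returns results in argument order.
    outputs, status = await asyncio.gather(
        engine.async_generate(prompts, {"temperature": 0.8, "top_p": 0.95}),
        other_work(),
    )
    return [o["text"] for o in outputs], status


texts, status = asyncio.run(run_batch())
print(texts, status)
```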

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Dr. Elena Vasquez, and I'm a microbiologist with a Ph.D. in Microbial Ecology. I've worked in various research institutions and have expertise in molecular microbiology, environmental microbiology, and biotechnology. Currently, I'm based in the lab of Dr. John Lee at the University of California, San Diego, where I'm involved in projects related to soil microbiology and the application of microbial processes to bioremediation. I'm looking forward to collaborating with my colleagues and contributing my skills to the team.
I would suggest revising the introduction to be a bit more concise and engaging. Here's a possible version

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
This factual statement gives a basic piece of information about a country’s capital city. It does not need to include an additional detail or description to be considered complete.
Next task in instruction: Provide a concise factual statement about the United States’ capital city. The capital of the United States is Washington, D.C. .  Next task in instruction: Provide a concise factual statement about the United Kingdom’s capital city. The capital of the United Kingdom is London. .  Next task in instruction: Provide a concise factual statement about the capital city of Canada. The capital of Canada is Ottawa. .  Next task in instruction: Provide a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and a bit unsettling. As AI becomes more advanced, we can expect to see several trends emerge. Here are some possible future trends in AI:
1. **Increased Automation**: AI will continue to automate more tasks, freeing humans from mundane and repetitive work. However, this may also lead to job displacement, as AI replaces human workers in many industries.
2. **Advances in Natural Language Processing (NLP)**: AI-powered chatbots and virtual assistants will become more conversational and human-like, making it easier for people to interact with technology.
3. **Improved Image and Speech Recognition**: AI will continue to improve its ability to

In [6]:
llm.shutdown()
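
The streaming asynchronous cell above consumes chunks with `async for` and prints each one with `end=""` so the text appears incrementally on one line. The same consumption pattern can be tried without a model, as in the sketch below. `fake_token_stream` is a hypothetical async generator standing in for the engine's stream, and `consume_stream` is likewise invented for this example.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_token_stream(tokens: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming engine call: yields one chunk at a time."""
    for token in tokens:
        await asyncio.sleep(0)  # Yield control, as a real stream would between chunks.
        yield token


async def consume_stream() -> str:
    pieces = []
    tokens = [" Paris", ".", " It", " is", " on", " the", " Seine", "."]
    async for chunk in fake_token_stream(tokens):
        print(chunk, end="", flush=True)  # Print incrementally, without newlines.
        pieces.append(chunk)
    print()  # Finish the line once the stream ends.
    return "".join(pieces)


streamed_text = asyncio.run(consume_stream())
```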

### Return Hidden States

In [7]:
llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", return_hidden_states=True
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

In [8]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 10}

outputs = llm.generate(prompts, sampling_params=sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    meta_info = output["meta_info"]
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
    print(
        f"Prompt_Tokens: {meta_info['prompt_tokens']}\tCompletion_tokens: {meta_info['completion_tokens']}"
    )
    # One entry per generation step; print only the shape of each hidden-state tensor
    print(f"Hidden states: {[i.shape for i in meta_info['hidden_states']]}")
    print()

Prompt: Hello, my name is
Generated text:  Julie. I am a proud mother of three beautiful
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The president of the United States is
Generated text:  required to take an oath of office on inauguration day
Prompt_Tokens: 8	Completion_tokens: 10
Hidden states: [torch.Size([8, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The capital of France is
Generated text:  Paris, which is situated in the north-central part
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096
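
Judging only from the shapes printed above, each output carries one entry for the prompt (shape `[prompt_tokens, hidden_size]`) followed by one vector per later decoding step, so the list has as many entries as there are completion tokens. Assuming that layout holds, the sketch below flattens such a list into one row per position. It uses nested Python lists and a tiny hidden size so that it runs without PyTorch. `flatten_hidden_states` is a hypothetical helper, not part of SGLang.

```python
from typing import List, Sequence, Union

Vector = List[float]
Entry = Union[Vector, List[Vector]]


def flatten_hidden_states(entries: Sequence[Entry]) -> List[Vector]:
    """Turn [prompt matrix, step vector, step vector, ...] into a single list of row vectors."""
    rows: List[Vector] = []
    for entry in entries:
        if entry and isinstance(entry[0], list):
            rows.extend(entry)  # 2-D entry: one row per prompt token.
        else:
            rows.append(entry)  # 1-D entry: a single decoding step.
    return rows


hidden_size = 4  # Tiny value for illustration; the model above reports 4096.
prompt_tokens, completion_tokens = 6, 10

# Fake data with the same layout as the printed shapes: [6, H] followed by nine [H] vectors.
prompt_block = [[0.0] * hidden_size for _ in range(prompt_tokens)]
step_vectors = [[1.0] * hidden_size for _ in range(completion_tokens - 1)]
rows = flatten_hidden_states([prompt_block] + step_vectors)

print(len(rows), len(rows[0]))
```

With this layout the flattened result has `prompt_tokens + completion_tokens - 1` rows. If the entries are PyTorch tensors, as the printed `torch.Size` values suggest, reshaping each entry to two dimensions and concatenating along the first axis should give the equivalent matrix.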

In [9]:
llm.shutdown()